Effectiveness Analysis of String Comparators for Distinguishing True from False Matches in Record Linkage

  • Sergio Miranda Freire UERJ
  • Rita de Cássia Braga Gonçalves Agência Nacional de Saúde Suplementar
  • André Cipriani Bandarra Agência Nacional de Saúde Suplementar
  • Miguel Gustavo Taranto Villela Agência Nacional de Saúde Suplementar
  • Alexandre Meire Agência Nacional de Saúde Suplementar
  • Maria Deolinda Borges Cabral UFRJ
  • Rosimary Terezinha de Almeida UFRJ

Abstract


The objective of this paper is to analyze the effectiveness of eight string comparators used in record linkage processes. A set of true pairs of names was identified in two Brazilian databases, Hospitalisation Claims and Prepaid Health Plan Users. From this set, a set of false pairs was generated, and each string comparator was then applied to every pair of names. For each comparator, a ROC curve was plotted, and its area and the mean time to perform a comparison were calculated. The algorithms showed similar overall performance, but more conclusive results will require a larger and more representative sample of Brazilian names.
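
The evaluation procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration only: a normalized Levenshtein similarity stands in for the eight comparators (which the abstract does not name), false pairs are built by re-pairing each name with the partner of a different true pair (the abstract does not say how the false pairs were generated), and the example names are invented. The area under the ROC curve is computed as the probability that a true pair scores higher than a false pair, with ties counted as one half.

```python
import random
import time


def levenshtein_similarity(a: str, b: str) -> float:
    """Stand-in comparator (assumption): 1 - edit distance / longer length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def make_false_pairs(true_pairs, seed=0):
    """Assumed scheme: pair each left name with the right name of another pair."""
    rng = random.Random(seed)
    n = len(true_pairs)
    false_pairs = []
    for i, (left, _) in enumerate(true_pairs):
        j = rng.choice([k for k in range(n) if k != i])
        false_pairs.append((left, true_pairs[j][1]))
    return false_pairs


def roc_auc(true_scores, false_scores):
    """Area under the ROC curve by pairwise comparison; ties count 0.5."""
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_scores) * len(false_scores))


def evaluate(comparator, true_pairs, false_pairs):
    """Return (AUC, mean seconds per comparison) for one comparator."""
    pairs = true_pairs + false_pairs
    start = time.perf_counter()
    scores = [comparator(a, b) for a, b in pairs]
    mean_time = (time.perf_counter() - start) / len(pairs)
    n_true = len(true_pairs)
    return roc_auc(scores[:n_true], scores[n_true:]), mean_time


# Invented example names, not taken from the study's databases.
true_pairs = [
    ("MARIA DA SILVA", "MARIA SILVA"),
    ("JOSE CARLOS SOUZA", "JOSE C SOUZA"),
    ("ANA PAULA LIMA", "ANA PAULA DE LIMA"),
]
false_pairs = make_false_pairs(true_pairs)
auc, mean_time = evaluate(levenshtein_similarity, true_pairs, false_pairs)
print(f"AUC = {auc:.3f}, mean time per comparison = {mean_time:.2e} s")
```

Running `evaluate` once per comparator on the same pairs gives one AUC and one mean time per algorithm, which is the kind of side-by-side comparison the abstract reports. The pairwise AUC is quadratic in the number of pairs, so a rank-based formulation would be preferable for large samples.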

Published
2009-07-20
FREIRE, Sergio Miranda; GONÇALVES, Rita de Cássia Braga; BANDARRA, André Cipriani; VILLELA, Miguel Gustavo Taranto; MEIRE, Alexandre; CABRAL, Maria Deolinda Borges; ALMEIDA, Rosimary Terezinha de. Effectiveness Analysis of String Comparators for Distinguishing True from False Matches in Record Linkage. In: BRAZILIAN SYMPOSIUM ON COMPUTING APPLIED TO HEALTH (SBCAS), 9., 2009, Bento Gonçalves/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2009. p. 2119-2128. ISSN 2763-8952.