Effectiveness Analysis of String Comparators for Distinguishing True from False Matches in Record Linkage
Abstract
The objective of this paper is to analyze the effectiveness of eight string comparators used in record linkage processes. A set of true pairs of names was identified in the Hospitalisation Claims and Prepaid Health Plan Users databases in Brazil. From this set, a set of false pairs was generated, and the string comparators were then applied to each pair of names. For each comparator, a ROC curve was plotted, and the area under the curve and the mean time to perform a comparison were calculated. The algorithms showed similar overall performance, but more conclusive results will require a larger and more representative sample of Brazilian names.
References
Apache Commons Project (2008), “Implementations of common encoders and decoders”. [link]. Last accessed 10/03/2009.
Bell, G.B. and Sethi, A. (2001), “Matching Records in a National Medical Patient Index”, Communications of the ACM, 44(9):83–88.
Camargo, K.R. and Coeli, C.M. (2000), “Reclink: aplicativo para o relacionamento de bases de dados, implementando o método probabilistic record linkage”, Cad. Saúde Pública, 16(2):439–447.
Camargo, K.R. and Coeli, C.M. (2007), “Reclink III: Relacionamento Probabilístico de Registros”, Version 3.1.6.3160. Manual, Rio de Janeiro.
CCS-SIS (1998), “Tfonetizar. Consórcio de Componentes de Software para Sistemas de Informação em Saúde”. [link]. Last accessed 10/03/2009.
Cohen, W.W., Ravikumar, P. and Fienberg, S.E. (2003), “A Comparison of String Distance Metrics for Name-Matching Tasks”. [link]. Last accessed 11/01/2009.
Elfeky, M.G., Verykios, V.S. and Elmagarmid, A.K. (2002), “TAILOR: A Record Linkage Toolbox”. In Proc. of the 18th Int. Conf. on Data Engineering. IEEE, 2002.
Fellegi, I. and Sunter, A. (1969), “A Theory for Record Linkage”. Journal of the American Statistical Association, 64:1183–1210.
Jaro, M. A. (1978), "UNIMATCH: A Record Linkage System, User's Manual," Washington, DC: U.S. Bureau of the Census.
Jaro, M. A. (1995), “Probabilistic linkage of large public health data files” (disc: P687-689). Statistics in Medicine, 14:491–498.
Kelman, C.W., Bass, A.J. and Holman, C.D. (2002), “Research use of linked health data - a best practice protocol”. Aust N Z J Public Health, 26:251–255.
McCallum, A., Nigam, K. and Ungar, L. (2000), “Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration”. In Proc. of the Sixth Int. Conf. on KDD, p. 169–170.
Monge, A. and Elkan, C. (1996), “The field-matching problem: algorithm and applications”. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
NLP Group (2009), “SimMetrics, an open source extensible library of Similarity or Distance Metrics”. [link]. Last accessed 10/03/2009.
Stephen, G. A. (1994), String Searching Algorithms. World Scientific Publishing Co. Pte. Ltd.
Winkler, W. E. (1994), “Advanced Methods for Record Linkage”, US Census Bureau Report RR94/05. [link]. Last accessed 10/03/2009.
Winkler, W.E., McLaughlin, G., Jaro, M. and Lynch, M. (1994), “strcmp95.c Version 2”. [link]. Last accessed 10/01/2009.
Yancey, W. (2005), “Evaluating String Comparator Performance for Record Linkage”, Report RRS2005/05, US Bureau of the Census. [link]. Last accessed 10/01/2009.
Published
2009-07-20
How to Cite
FREIRE, Sergio Miranda; GONÇALVES, Rita de Cássia Braga; BANDARRA, André Cipriani; VILLELA, Miguel Gustavo Taranto; MEIRE, Alexandre; CABRAL, Maria Deolinda Borges; ALMEIDA, Rosimary Terezinha de. Effectiveness Analysis of String Comparators for Distinguishing True from False Matches in Record Linkage. In: BRAZILIAN SYMPOSIUM ON COMPUTING APPLIED TO HEALTH (SBCAS), 9., 2009, Bento Gonçalves/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2009. p. 2119-2128. ISSN 2763-8952.
