Exploring a New Metric for Accurately Measuring Blocking Precision in Entity Resolution Tasks

Abstract


Entity resolution is a crucial task in data integration, aiming to identify records that refer to the same real-world entity. Blocking techniques are widely used to improve efficiency by reducing the number of record comparisons. However, traditional metrics, such as Pair Quality (PQ), fail to account for redundant comparisons, potentially distorting the assessment of blocking effectiveness. This paper introduces the PQ∗ metric, designed to provide a more accurate precision measure by eliminating the impact of redundant comparisons. We also propose the PQ∗C and P̂Q∗E algorithms, which efficiently calculate and estimate PQ∗, respectively. Experimental results show that, for all evaluated datasets, PQ and PQ∗ yield different results in every scenario where more than one blocking key is used. Furthermore, the difference between the two metrics increases as more blocking keys are employed for indexing.
Keywords: Entity Resolution, Blocking, Indexing, Blocking Precision
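
The effect of redundant comparisons on blocking precision can be illustrated with a small sketch. The abstract does not state the definitions, so the code assumes the common formulation of PQ as detected matches divided by all within-block comparisons (repeats included), and takes one plausible reading of PQ∗ in which each distinct pair is counted once. The function names, block labels, and toy records are hypothetical. The code is not the PQ∗C or P̂Q∗E algorithm from the paper.

```python
from itertools import combinations


def candidate_pairs(blocks):
    """List every within-block comparison, keeping redundant repeats."""
    pairs = []
    for records in blocks.values():
        # Sort so each pair has one canonical (a, b) orientation.
        pairs.extend(combinations(sorted(records), 2))
    return pairs


def pair_quality(blocks, true_matches):
    """Assumed PQ: distinct detected matches over all comparisons."""
    pairs = candidate_pairs(blocks)
    detected = set(pairs) & true_matches
    return len(detected) / len(pairs) if pairs else 0.0


def pair_quality_star(blocks, true_matches):
    """Assumed PQ*: distinct detected matches over distinct comparisons."""
    distinct = set(candidate_pairs(blocks))
    detected = distinct & true_matches
    return len(detected) / len(distinct) if distinct else 0.0


# Hypothetical toy data: two blocking keys (name and zip) index four records.
blocks = {
    "name:smith": {"r1", "r2", "r3"},
    "zip:10001": {"r1", "r2"},
    "zip:94110": {"r3", "r4"},
}
true_matches = {("r1", "r2")}

pq = pair_quality(blocks, true_matches)            # 1 match / 5 comparisons
pq_star = pair_quality_star(blocks, true_matches)  # 1 match / 4 distinct pairs
print(pq, pq_star)
```

In this toy input the pair (r1, r2) is compared under both keys, so the redundant comparison lowers the first value relative to the second. With a single blocking key that assigns each record to one block, no pair repeats and the two values coincide under these assumed definitions.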

Published
2025-09-29
NASCIMENTO, Dimas Cassimiro; SILVA, Vítor Alan Bezerra. Exploring a New Metric for Accurately Measuring Blocking Precision in Entity Resolution Tasks. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 15-27. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.246994.