Identifying and Fusing Duplicate Features for Data Mining

Hortênsia Costa Barcelos; Mariana Recamonde Mendoza; Viviane Pereira Moreira

doi:10.5753/sbbd.2020.13631

Hortênsia Costa Barcelos, Universidade Federal do Rio Grande do Sul
Mariana Recamonde Mendoza, Universidade Federal do Rio Grande do Sul
Viviane Pereira Moreira, Universidade Federal do Rio Grande do Sul

Abstract

This work addresses the problem of identifying and fusing duplicate features in machine learning datasets. Our goal is to evaluate the hypothesis that fusing duplicate features can improve the predictive power of the data while reducing training time. We propose a simple method for duplicate detection and fusion based on a small set of features. In an evaluation against a manually generated ground truth, the duplicate detection achieved an F1 of 0.91. We then measured the effects of fusion on a mortality prediction task. The results were inferior to those obtained with the original dataset, so we conclude that the investigated hypothesis does not hold.

Keywords: Feature Fusion, Deduplication, Data Mining
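
The F1 figure reported in the abstract combines precision and recall of the detected duplicate pairs against the ground truth. The following is a minimal sketch of that kind of computation, not the authors' evaluation code: the function name `pair_f1` and all feature names are hypothetical, and each duplicate is assumed to be represented as an unordered pair of feature names.

```python
def pair_f1(predicted, truth):
    """Return (precision, recall, F1) of predicted duplicate pairs vs. a ground truth."""
    # Treat each pair as unordered, so ("hr", "heart_rate") equals ("heart_rate", "hr").
    predicted = {frozenset(p) for p in predicted}
    truth = {frozenset(p) for p in truth}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical feature names, invented for illustration only.
truth = [("hr", "heart_rate"), ("temp_c", "temperature"), ("sbp", "systolic_bp")]
predicted = [("heart_rate", "hr"), ("temp_c", "temperature"), ("sbp", "dbp")]

precision, recall, f1 = pair_f1(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy input two of the three predicted pairs are correct and two of the three true pairs are found, so precision, recall, and F1 all equal 2/3.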
