Identification and Characterization of Duplicate Complaints by Consumers on Multiple Platforms
Abstract
The growing volume of data in consumer complaint repositories poses significant challenges for managing this information effectively. One notable challenge is that many consumers register the same complaint more than once to put pressure on companies, which complicates the management of these records and can distort analyses based on the data. This study proposes an approach for identifying duplicates among consumer records from different platforms, using temporal analysis together with attributes such as the consumer, the supplier, and the object of the complaint. To this end, natural language processing techniques, specifically the BERTimbau model, are explored to detect semantic similarity between complaints. The results show that 95% of duplicates are posted within 30 days of the original complaint. The proposed approach helps improve the accuracy of duplicate detection and the efficiency of managing this type of unstructured data, benefiting conflict resolution and complaint administration by the competent entities.
Keywords:
Complaints, Duplicates, Consumers, Consumidorgov, Procon, Sindec
Published
2024-10-14
How to Cite
RABBI, Gestefane; ARAÚJO, Marcelo M. R.; KAKIZAKI, Gabriel; VITERBO, Julia; C. S. REIS, Julio; O. PRATES, Raquel; GONÇALVES, Marcos André. Identification and Characterization of Duplicate Complaints by Consumers on Multiple Platforms. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 313-326. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240210.
