Privacy Preservation in Textual Data: A Systematic Mapping Study on Differential Privacy and Semantic Similarity
Abstract
Background: AI and machine learning increasingly depend on large-scale textual data, which often embeds personally identifiable and sensitive information. In this context, privacy-preserving processing of unstructured text has become essential to mitigate disclosure and re-identification risks, especially in pipelines that rely on semantic representations and similarity measures. Goal: This study aims to map state-of-the-art techniques for privacy-preserving textual data analysis and to characterize how privacy mechanisms relate to semantic similarity and the identification of infrequent (rare) textual patterns. Method: We conducted a Systematic Mapping Study (SMS) following established guidelines, retrieving peer-reviewed publications from four major digital libraries (2010–2025). Candidate studies were screened using inclusion/exclusion criteria and a quality checklist, and data extracted from the selected studies were synthesized through frequency-based and thematic mapping aligned with three research questions: (i) privacy-preserving techniques used in textual data analysis, (ii) computational approaches (data science and language-model-based methods) supporting such mechanisms, and (iii) techniques adopted for semantic similarity and rare-event-oriented text analysis under privacy constraints. Results: The mapping shows that anonymization/de-identification, differential privacy, and federated learning are among the most recurrent privacy-preserving approaches reported for text. It also highlights the prevalence of NLP pipelines and transformer-based models (e.g., BERT variants and large language models) as supporting components, typically combined with classic semantic similarity techniques such as vector-space representations, embeddings, topic modeling, and cosine similarity. Rare-event detection appears less frequently, suggesting an emerging gap and opportunities for future research on privacy-aware pipelines for low-frequency phenomena in text.
The results provide an evidence-based overview to support the design of guidelines and best practices for privacy-preserving text analytics in software-intensive settings.
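Two of the recurrent building blocks identified above, differential privacy (Dwork, 2006) and cosine similarity over vector-space representations, can be illustrated with a minimal sketch. The snippet below is not drawn from any of the mapped studies: the embedding values, the similarity threshold, and the setting ε = 1.0 are placeholder assumptions. It computes a plain cosine similarity over toy document embeddings and releases a similarity-based counting query via the Laplace mechanism, which satisfies ε-differential privacy for sensitivity-1 queries.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def laplace_mechanism(true_value, sensitivity, epsilon):
    """Add Laplace noise of scale sensitivity/epsilon to true_value,
    satisfying epsilon-differential privacy (Dwork, 2006)."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5  # uniform in [-0.5, 0.5)
    # Inverse-CDF sampling of the Laplace(0, scale) distribution.
    return true_value - scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)


# Toy document embeddings (placeholders for real model outputs).
doc_a = [0.2, 0.7, 0.1]
doc_b = [0.25, 0.6, 0.15]
similarity = cosine_similarity(doc_a, doc_b)

# A counting query ("how many documents exceed a similarity threshold?")
# has sensitivity 1, so its result can be released under epsilon-DP.
noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=1.0)
```

In a realistic pipeline the embeddings would come from a transformer encoder and ε would be chosen to balance the privacy budget against the utility of the released statistics; the sketch only fixes the mechanics of the two primitives.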
References
Acharya, D. B., Kuppan, K., and Divya, B. (2025). Agentic AI: Autonomous Intelligence for Complex Goals – A Comprehensive Survey. IEEE Access, 13:18912–18936.
Aghasian, E., Garg, S., and Montgomery, J. (2020). An automated model to score the privacy of unstructured information – Social media case. Computers & Security, 92:101778.
Anthropic (2024). Model Context Protocol. [link]. Accessed: April 9, 2025.
Asif, H., Min, S., Wang, X., and Vaidya, J. (2024). U.S.-U.K. PETs Prize Challenge: Anomaly Detection via Privacy-Enhanced Federated Learning. IEEE Transactions on Privacy, 1:3–18.
Asimopoulos, D., Siniosoglou, I., Argyriou, V., Karamitsou, T., Fountoukidis, E., Goudos, S. K., Moscholios, I. D., Psannis, K. E., and Sarigiannidis, P. (2024). Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches. In 2024 13th International Conference on Modern Circuits and Systems Technologies (MOCAST), pages 1–6. ISSN: 2993-4443.
Kitchenham, B. and Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering. Keele University, UK, 9:1–65.
Brereton, P., Kitchenham, B. A., Budgen, D., Turner, M., and Khalil, M. (2007). Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, 80(4):571–583.
Cui, J., Shen, H., and Cao, Y. (2024). Survey on the Applications of Differential Privacy. In 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC), pages 43–47.
Duan, Z. and Wang, J. (2024). Exploration of LLM Multi-Agent Application Implementation Based on LangGraph+CrewAI. arXiv:2411.18241 [cs].
Dwork, C. (2006). Differential Privacy. In Automata, Languages and Programming, pages 1–12, Berlin, Heidelberg. Springer.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3):37–54.
Federative Republic of Brazil (2025). Lei Geral de Proteção de Dados Pessoais (LGPD) – Lei nº 13.709, de 14 de agosto de 2018. [link]. Accessed: 2025-03-27.
Giampaolo, F., Izzo, S., Prezioso, E., Chiaro, D., Cuomo, S., Bellandi, V., and Piccialli, F. (2023). A Privacy Preserving Service-Oriented Approach for Data Anonymization Through Deep Learning. In 2023 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), pages 0738–0746. ISSN: 2837-0740.
Gupta, B. B., Gaurav, A., Arya, V., Alhalabi, W., Alsalman, D., and Vijayakumar, P. (2024). Enhancing user prompt confidentiality in Large Language Models through advanced differential encryption. Computers and Electrical Engineering, 116:109215.
Hassan, F., Sánchez, D., Soria-Comas, J., and Domingo-Ferrer, J. (2019). Automatic Anonymization of Textual Documents: Detecting Sensitive Information via Word Embeddings. In 2019 18th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pages 358–365. ISSN: 2324-9013.
Khan, Y., Sánchez, D., and Domingo-Ferrer, J. (2024). Federated learning-based natural language processing: a systematic literature review. Artificial Intelligence Review, 57(12):320.
Khoei, T. T., Ehtesham, A., Kumar, S., and Khoei, T. T. (2025). A Survey of the Model Context Protocol (MCP): Standardizing Context to Enhance Large Language Models (LLMs).
Kluge Corrêa, N. (2024). Dynamic Normativity. Thesis, Universitäts- und Landesbibliothek Bonn.
Languré, A. d. L. and Zareei, M. (2025). Privacy-Preserving Emotion Detection: Evaluating the Trade-Off Between K-Anonymity and Model Performance. IEEE Access, 13:105901–105910.
Lee, S., Kim, Y., Kwon, Y., and Cho, S. (2025). Secure privacy-preserving record linkage system from re-identification attack. PLOS ONE, 20(1):e0314486.
Lim, W. Y. B., Luong, N. C., Hoang, D. T., Jiao, Y., Liang, Y.-C., Yang, Q., Niyato, D., and Miao, C. (2020). Federated Learning in Mobile Edge Networks: A Comprehensive Survey. IEEE Communications Surveys & Tutorials, 22(3):2031–2063.
Lison, P., Pilán, I., Sanchez, D., Batet, M., and Øvrelid, L. (2021). Anonymisation Models for Text Data: State of the art, Challenges and Future Directions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4188–4203, Online. Association for Computational Linguistics.
Liu, Y., Yu, J. J. Q., Kang, J., Niyato, D., and Zhang, S. (2020). Privacy-Preserving Traffic Flow Prediction: A Federated Learning Approach. IEEE Internet of Things Journal, 7(8):7751–7763.
Mendes, R. and Vilela, J. P. (2017). Privacy-Preserving Data Mining: Methods, Metrics, and Applications. IEEE Access, 5:10562–10582.
Merrouni, Z. A., Frikh, B., and Ouhbi, B. (2016). Automatic keyphrase extraction: An overview of the state of the art. In 2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), pages 306–313. ISSN: 2327-1884.
Murin, M., Molan, S., Michalkc, M., Kainz, O., and Cymbalák, D. (2024). Technical Solutions for the Processing, Management and Anonymisation of Personal Data in Databases According to EU Data Protection Regulations. In 2024 International Conference on Emerging eLearning Technologies and Applications (ICETA), pages 490–503.
Parsifal Developers (2025). Parsifal: A tool to support researchers performing systematic literature reviews. [link]. Accessed: 2025-09-21.
Petticrew, M. and Roberts, H. (2008). Systematic reviews in the social sciences: A practical guide. John Wiley & Sons.
Shyalika, C., Wickramarachchi, R., and Sheth, A. (2024). A Comprehensive Survey on Rare Event Prediction. arXiv:2309.11356 [cs].
Souza, F. C., Nogueira, R. F., and Lotufo, R. A. (2023). BERT models for Brazilian Portuguese: Pretraining, evaluation and tokenization analysis. Applied Soft Computing, 149:110901.
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570.
Tan, A. Z., Yu, H., Cui, L., and Yang, Q. (2023). Towards Personalized Federated Learning. IEEE Transactions on Neural Networks and Learning Systems, 34(12):9587–9603.
European Union (2025). General Data Protection Regulation (GDPR). [link]. Accessed: 2025-03-27.
Volodina, E., Dobnik, S., Lindström Tiedemann, T., and Vu, X.-S. (2023). Grandma Karl is 27 years old – research agenda for pseudonymization of research data. In 2023 IEEE Ninth International Conference on Big Data Computing Service and Applications (BigDataService), pages 229–233.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J.-Y., and Wen, J.-R. (2025). A Survey of Large Language Models. arXiv:2303.18223 [cs].
Zhao, Y. and Chen, J. (2022). A survey on differential privacy for unstructured data content. ACM Comput. Surv., 54(10s):207:1–207:28.
Zhou, S., Ligett, K., and Wasserman, L. (2009). Differential privacy with compression. In 2009 IEEE International Symposium on Information Theory, pages 2718–2722. ISSN: 2157-8117.
