Advanced Chunking Techniques: a Novel Approach for Semantic Splitters
Abstract
Chunking, the process of splitting large bodies of text into processable parts, is an essential but often overlooked step in many Information Retrieval and vector-database tasks. Traditional chunking techniques rely on fixed lengths or syntactic structure, leaving room for more meaningful approaches. Semantic chunking divides text according to meaning and context, ensuring that each chunk represents a logical unit of information. This work proposes the Dual Semantic Chunker, which advances existing chunking methods by taking a closer look at semantic representation. We compared multiple chunking methods, both semantic and traditional, and achieved improved retrieval performance.
