Efficient Compression-Based Low-resource Text Classification

Bruno Vargas de Souza; Pedro Garcia Freitas

doi:10.5753/eniac.2025.13966

Bruno Vargas de Souza UnB
Pedro Garcia Freitas UnB

DOI: https://doi.org/10.5753/eniac.2025.13966

Resumo

In recent years, the machine learning community has engaged on training increasingly powerful large language models (LLMs) for text classification tasks. These LLMs demand huge computational capabilities, requiring billions of parameters and enormous amounts of labeled data, which can make them expensive to use, optimize, and deploy in practice. However, recent studies on compressor-based distance metrics for text classification have been proposed, aiming to achieve competitive accuracy when compared with these LLM models. Compression-based models are usually based on k-nearest neighbors (KNN) and demand very few training parameters. In this paper, we propose a compression-based method that uses a Burkhard-Keller tree (BKTree) for similarity search, accelerating the KNN on compressed data. When compared with the state-of-the-art, the proposed method achieved speedups of 20x, 25x, 7x, 6.6x, 8x, 10x, 1.4x, 1.9x, 11x, and 12x for the Brotli, FSST, LZ4, LZAV, LZF, QuickLZ, Shoco, Smaz, Snappy, and ZLib compression algorithms, respectively. The proposed method is able to achieve similar prediction performance results while reducing the average runtime and computational resources, demonstrating its advantage in saving computational resources.

Referências

Alakuijala, J., Farruggia, A., Ferragina, P., Kliuchnikov, E., Obryk, R., Szabadka, Z., and Vandevenne, L. (2018). Brotli: A general-purpose data compressor. ACM Transactions on Information Systems, 37:1–30.

Avaneev, D. (2023). LZAV: Fast in-memory lz77-based data compressor (header-only c/c++). GitHub repository. Available at [link], with performance around 480 MB/s compression and 2800 MB/s decompression :contentReference[oaicite:1]index=1.

Boncz, P., Neumann, T., and Leis, V. (2019). FSST: Fast static symbol table string compression. CWI / research publication and GitHub. Lightweight random-access string compression; [link].

Burkhard, W. A. and Keller, R. M. (1973). Some approaches to best-match file searching. Communications of the ACM, 16(4):230–236.

Collet, Y. (2016). Zstandard (zstd): A fast lossless compression algorithm. Reference C implementation and specification. First released August 31, 2016; format standardized in IETF RFC 8878 (February 2021); official site: [link].

Cruz, J. C. B., Tan, J. A., and Cheng, C. (2020). Localization of fake news detection via multitask transfer learning. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2596–2604.

Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176.

Dean, J., Ghemawat, S., and Gunderson, S. H. (2011). Snappy: A fast compressor/decompressor. Software library. Open-sourced by Google; C++ implementation with 250MB/s compression, 500MB/s decompression; [link].

Frank, E., Chui, C., and Witten, I. H. (2000). Text categorization using compression models. In Proceedings of the Conference on Data Compression, page 555.

Gailly, J. and Adler, M. (1992). gzip: Gnu file compression utility. Software and format specification. Initial release 31 October 1992; official site: [link].

Gailly, J. and Adler, M. (1995). zlib: A lossless data compression library. Software library and documentation. First released May 1, 1995; official site: [link].

Huggingface (2020). Fake news filipino (jcblaise/fake news filipino). [link]. Accessed: 2025-05-20.

Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. In JSM Proceedings, Social Statistics Section.

Jiang, Z., Yang, M. Y. R., Tsirlin, M., Tang, R., Dai, Y., and Lin, J. (2023). ”low-resource” text classification: A parameter-free classification method with compressors. In Rogers, A., Boyd-Graber, J. L., and Okazaki, N., editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 6810–6828. Association for Computational Linguistics.

Lehmann, M. A. (2008). LibLZF: A very small and fast lz77-based compression library. Project homepage. Last updated August 25, 2008; BSD-style license; official site: [link].

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.

Reinhold, L. M. (2011). QuickLZ: A very fast lz compression library. Official website and source code. Described as the world’s fastest compression library ( 308 MB/s); original C version v1.5.0 available at [link], and ports in Go and Rust exist :contentReference[oaicite:1]index=1.

Sanfilippo, S. (2012). Smaz: Small string compression library. GitHub repository. Compresses very short strings (e.g. the → 1 byte); BSD-3; [link].

Schramm, C. (2015). shoco: A fast compressor for short strings. GitHub repository. C library optimized for very short strings; MIT license; [link].

Seward, J. (1996). bzip2: A block-sorting file compressor. Source code and documentation. Released July 1996; official site: [link].

Teahan, W. J. and Harper, D. J. (2003). Using compression-based language models for text categorization. Language modeling for information retrieval, pages 141–165.

White, S. (2000). How to strike a match: A simple algorithm for string similarity. [link]. Accessed 2025-05-20.

Winkler, W. E. (1990). String comparator metrics and enhanced decision rules in the fellegi–sunter model of record linkage. Technical report, U.S. Bureau of Census.

Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In Soda, volume 93, pages 311–21.

Efficient Compression-Based Low-resource Text Classification

Resumo

Referências

Artigos mais lidos do(s) mesmo(s) autor(es)