One-Class Lightweight Interpretable Filtering For Academic Profiles and Strategic Themes Affinity

Marcos Paulo Silva Gôlo; Matheus Yasuo Ribeiro Utino

Marcos Paulo Silva Gôlo USP
Matheus Yasuo Ribeiro Utino USP

Resumo

Filtering affinity between strategic themes and academic profiles plays a key role in public policies, mainly in Brazil. Although large language models (LLMs) currently represent the state-of-the-art in text mining, they suffer from limitations such as high computational and financial costs, as well as substantial electricity consumption, particularly in High-Performance Computing environments, which leads to significant carbon emissions. These factors highlight the need for lightweight filtering mechanisms to avoid unnecessary LLM calls. Unsupervised similarity-based filters and binary classifiers also face challenges, including a lack of theme specificity and the requirement for large amounts of labeled data. To address these issues, we propose the use of theme-specific one-class learning, where classifiers are trained using only positive high-affinity examples, combined with the theme description and its keywords. This approach aims to reduce carbon emissions while maintaining filter performance. It requires only a small amount of labeled data, making it suitable even for themes with few known profiles, and provides theme-specific adaptability. We leverage embeddings derived from language models to ensure computational efficiency, and project these embeddings into two dimensions to guarantee interpretability, enabling visualization of the decision function and supporting parameter tuning. Our experiments demonstrate that the proposed approach outperforms a textual similarity baseline with similar carbon emissions, achieving higher f1-scores in identifying relevant pairs. Additionally, the method’s interpretability proved valuable for understanding model decisions and generating insights into the data.

Palavras-chave: Text mining, Filters, Computational modeling, High performance computing, Large language models, Carbon dioxide, Transformers, Public policy, Tuning, Faces, One-Class Classification, Lightweight Filtering, Language Models, Bidirectional Encoder Representations for Transformers, Text Pair Classification, Lightweight Methods