Efficient Filtering with BERT Embeddings for Researcher–Topic Affinity Prediction in HPC Pipelines

  • Matheus Yasuo Ribeiro Utino USP
  • Marcos Paulo Silva Gôlo USP

Abstract


Matching researchers to strategic topics is crucial for funding allocation and science policy, yet current solutions often depend on Large Language Models (LLMs), whose carbon emissions are prohibitively high, especially in High-Performance Computing (HPC) environments. In this work, we formulate researcher-topic affinity prediction as a binary classification task (high vs. non-high) using a real-world dataset of 42,046 Portuguese theses and dissertations defended in Brazil in 2023. We evaluate three families of lightweight filtering mechanisms: (i) lexical baselines, (ii) cosine-similarity thresholding on BERT embeddings, and (iii) lightweight classifiers applied to pooled embeddings. Beyond accuracy, we explicitly account for environmental impact by measuring emissions (kg CO2) with CodeCarbon, which enables a carbon-aware evaluation of model efficiency. Results show that lexical filters are nearly carbon-free but underperform, similarity thresholding provides modest improvements, and lightweight classifiers deliver the best trade-off between performance and emissions. Our best model, Granite embeddings with K-Nearest Neighbors (KNN) and max pooling, achieves an F1-macro of 0.779 with only 0.00227 kg CO2, significantly outperforming the baselines while keeping emissions negligible. These findings demonstrate that lightweight classifiers can act as effective pre-filters, routing only high-confidence cases to expensive LLMs, thereby reducing latency, cost, and emissions in HPC pipelines.
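The second and third filter families described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the threshold `tau`, the neighbor count `k`, the use of cosine similarity inside the KNN step, and the toy two-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a BERT-style encoder, and the abstract does not say how a researcher-topic pair is encoded into a single vector.

```python
import numpy as np


def max_pool(token_embeddings):
    """Collapse a (tokens, dim) matrix into one (dim,) vector via per-dimension max."""
    return np.asarray(token_embeddings, dtype=float).max(axis=0)


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def similarity_filter(researcher_vec, topic_vec, tau=0.5):
    """Family (ii): keep the pair when cosine similarity reaches tau (hypothetical value)."""
    return cosine_similarity(researcher_vec, topic_vec) >= tau


def knn_predict(train_X, train_y, x, k=3):
    """Family (iii): majority vote (1 = high, 0 = non-high) over the k most
    cosine-similar training vectors. The similarity measure is an assumption."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(train_X, axis=1) * np.linalg.norm(x)
    sims = (train_X @ x) / np.where(norms > 0, norms, 1.0)
    nearest = np.argsort(-sims)[:k]
    return int(2 * train_y[nearest].sum() > k)


def route_to_llm(train_X, train_y, candidates, k=3):
    """Return indices of candidates labeled high, i.e. those forwarded to the LLM."""
    return [i for i, x in enumerate(candidates)
            if knn_predict(train_X, train_y, x, k) == 1]


# Toy labeled vectors standing in for pooled pair embeddings.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]
forwarded = route_to_llm(X, y, [[1.0, 0.05], [0.0, 1.0], [0.8, 0.3]])
```

Only the candidates whose indices appear in `forwarded` would reach the LLM stage. The rest are discarded by the cheap filter, which is where the savings in latency, cost, and emissions would come from.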
Keywords: Filters, Costs, High performance computing, Large language models, Conferences, Pipelines, Computer architecture, Nearest neighbor methods, Routing, Resource management, lightweight filtering, researcher-topic affinity, sustainable AI, BERT embeddings, text pair classification
Published
28/10/2025
UTINO, Matheus Yasuo Ribeiro; GÔLO, Marcos Paulo Silva. Efficient Filtering with BERT Embeddings for Researcher–Topic Affinity Prediction in HPC Pipelines. In: WORKSHOP ON LIGHTWEIGHT EFFICIENT DEEP LEARNING IN HPC ENVIRONMENTS (LEANDL-HPC) - INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 37., 2025, Bonito/MS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 155-162.