Explainability of LLMs Using Neural Region Activation Graphs (NRAGs)
Abstract
LLMs are currently central technologies for the development of AI applications, attracting academic interest and significant industry investment. Despite their success and widespread use, LLMs pose major challenges in terms of interpretability: the complexity of the models and the black-box nature of neural networks make it difficult to understand the mechanisms behind output generation. This paper introduces Neural Region Activation Graphs (NRAGs), a novel approach to explainability in LLMs. NRAGs are graph-based representations of the activations of an LLM when stimulated by a corpus. The generated graphs can be used for tasks such as (i) understanding the interconnections between different regions of the multidimensional space of the network layers, (ii) comparing activation subgraphs induced by texts of different categories, and (iii) comparing properties of graphs induced by different LLMs for the same corpus. NRAGs are implemented in the LLM-MRI library, which provides a variety of tools for studying LLM activations. This paper presents NRAGs as an alternative for the scientific investigation of complex phenomena resulting from LLM inference, covering the process of generating graphs with the LLM-MRI library and examples of ongoing applications.
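As a rough illustration of the idea, the sketch below builds a toy region-activation graph for a small corpus. It is not the LLM-MRI API: the model name (distilbert-base-uncased), the PCA-based 2-D reduction, the grid discretization, and the mean-pooled document vectors are all assumptions made only for this example. The sketch maps each document to one "region" per layer and links the regions it activates in consecutive layers into a weighted undirected graph (networkx).

```python
# Illustrative sketch only (not the LLM-MRI API): discretize each layer's
# hidden states into 2-D grid "regions" and connect regions in consecutive
# layers that are activated by the same document.
import networkx as nx
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # assumed small encoder, chosen only for this sketch
GRID = 4                                # grid cells per axis in each layer's 2-D map

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Tiny corpus with two informal categories (finance vs. sports).
corpus = [
    "The central bank raised interest rates again.",
    "The striker scored twice in the final match.",
    "Quarterly profits exceeded analyst expectations.",
    "The goalkeeper saved a late penalty kick.",
]

with torch.no_grad():
    inputs = tokenizer(corpus, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)

# One (batch, seq_len, dim) tensor per layer; crude mean pooling gives a document vector per layer.
doc_vectors = [h.mean(dim=1).numpy() for h in outputs.hidden_states]

# Reduce each layer to 2-D so documents can be binned into grid cells ("regions").
reduced = [PCA(n_components=2).fit_transform(v) for v in doc_vectors]

def region(point, lo, hi):
    """Map a 2-D point to a single grid-cell index within a layer."""
    cell = np.floor((point - lo) / (hi - lo + 1e-9) * GRID).astype(int)
    cell = np.clip(cell, 0, GRID - 1)
    return int(cell[0] * GRID + cell[1])

graph = nx.Graph()
for doc_idx, _ in enumerate(corpus):
    prev = None
    for layer_idx, pts in enumerate(reduced):
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        node = (layer_idx, region(pts[doc_idx], lo, hi))
        graph.add_node(node, layer=layer_idx)
        if prev is not None:
            # Edge weights count how many documents co-activate the two regions.
            weight = graph.get_edge_data(prev, node, default={"weight": 0})["weight"]
            graph.add_edge(prev, node, weight=weight + 1)
        prev = node

print(f"{graph.number_of_nodes()} regions, {graph.number_of_edges()} edges")
```

In a graph of this kind, comparing the subgraphs traced by the finance sentences with those traced by the sports sentences gives a concrete handle on comparisons between activation subgraphs of different text categories, one of the uses outlined above.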
