A Multimodal Approach for Music Genre Classification Using Audio and Lyrics Embeddings
Abstract
Music genre classification is an important task for music recommendation and organization. This work explores deep learning methods for automatic classification, combining convolutional neural networks (CNNs) for audio spectral analysis with large language model (LLM) embeddings derived from lyrics. Experiments on the GTZAN dataset show that audio-only methods achieved higher overall accuracy, while the multimodal approach redistributed performance across classes, improving some genres (e.g., Country, Pop).

References
Bahuleyan, H. (2018). Music genre classification using machine learning techniques. CoRR, abs/1804.01149.
Bhalke, D. G., Rajesh, B., and Bormane, D. S. (2017). Automatic genre classification using fractional fourier transform based mel frequency cepstral coefficient and timbral features. Archives of Acoustics, 42(2).
Campana, M. G., Delmastro, F., and Pagani, E. (2023). Transfer learning for the efficient detection of covid-19 from smartphone audio data. Pervasive and Mobile Computing, 89:101754.
Cinyol, F., Baysal, U., Köksal, D., Babaoğlu, E., and Ulaşlı, S. S. (2023). Incorporating support vector machine to the classification of respiratory sounds by convolutional neural network. Biomedical Signal Processing and Control, 79:104093.
de Araújo Lima, R., de Sousa, R. C. C., Barbosa, S. D. J., and Lopes, H. C. V. (2020). Brazilian lyrics-based music genre classification using a BLSTM network. CoRR, abs/2003.05377.
Dogo, E. M., Afolabi, O. J., and Twala, B. (2022). On the relative impact of optimizers on convolutional neural networks with varying depth and width for image classification. Applied Sciences, 12(23).
Gan, J. (2021). Music feature classification based on recurrent neural networks with channel attention mechanism. Mobile Information Systems, 2021(1):7629994.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J., and Wilson, K. (2017). CNN architectures for large-scale audio classification.
Jia, X. (2022). Music emotion classification method based on deep learning and improved attention mechanism. Computational Intelligence and Neuroscience, 2022(1):5181899.
Jing, H., Liu, Y., Ma, Y., and Zheng, N. (2024). Hidden states in LLMs improve EEG representation learning and visual decoding. In ECAI, pages 2130–2137.
Koh, E. and Dubnov, S. (2021). Comparison and analysis of deep audio embeddings for music emotion recognition.
Lau, D. and Ajoodha, R. (2022). Music Genre Classification: A Comparative Study Between Deep Learning and Traditional Machine Learning Approaches, pages 239–247.
Liu, H., Liu, X., Kong, Q., Wang, W., and Plumbley, M. D. (2024). Learning temporal resolution in spectrogram for audio classification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12):13873–13881.
Liu, J., Wang, C., and Zha, L. (2021). A middle-level learning feature interaction method with deep learning for multi-feature music genre classification. Electronics, 10(18).
Marijić, A. and Bagić Babac, M. (2025). Predicting song genre with deep learning. Global Knowledge, Memory and Communication, 74(1/2):93–110.
Turab, M., Kumar, T. A., Bendechache, M., and Saber, T. (2022). Investigating multi-feature selection and ensembling for audio classification.
Seo, W., Cho, S.-H., Teisseyre, P., and Lee, J. (2024). A short survey and comparison of cnn-based music genre classification using multiple spectral features. IEEE Access, 12:245–257.
Sharma, G., Umapathy, K., and Krishnan, S. (2020). Trends in audio signal feature extraction methods. Applied Acoustics, 158:107020.
Silverman, M. J. (2009). The use of lyric analysis interventions in contemporary psychiatric music therapy: Descriptive results of songs and objectives for clinical practice. Music Therapy Perspectives, 27(1):55–61.
Sturm, B. L. (2014). The state of the art ten years after a state of the art: Future research in music information retrieval. Journal of New Music Research, 43(2):147–172.
Tzanetakis, G. and Cook, P. (2002). Musical genre classification of audio signals. IEEE Trans. Speech Audio Process., 10(5):293–302.
Wang, C., Nulty, P., and Lillis, D. (2021). A comparative study on word embeddings in deep learning for text classification. In Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, NLPIR ’20, pages 37–46, New York, NY, USA. Association for Computing Machinery.
Yin, X., Liu, Q., Huang, X., and Pan, Y. (2021). Real-time prediction of rockburst intensity using an integrated cnn-adam-bo algorithm based on microseismic data and its engineering application. Tunnelling and Underground Space Technology, 117:104133.
Zhang, Y. and Zhang, K. (2021). Music style classification algorithm based on music feature extraction and deep neural network. Wireless Communications and Mobile Computing, 2021:9298654.
Published
2025-09-29
How to Cite
SILVA, J. M. L.; FIGUEIREDO, C. M. S.; GUEDES, E. B.; MELO, T. de. A Multimodal Approach for Music Genre Classification Using Audio and Lyrics Embeddings. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1293-1304. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.11742.
