A Multimodal Approach for Music Genre Classification Using Audio and Lyrics Embeddings
Abstract
Music genre classification is an important task for music recommendation and organization. This work explores deep learning methods for automatic classification, combining convolutional neural networks (CNNs) for audio spectral analysis with large language model (LLM) embeddings derived from lyrics. Experiments on the GTZAN dataset show that audio-only methods achieved higher overall accuracy, while the multimodal approach redistributed performance across classes, improving some genres (e.g., Country, Pop).

References
Bahuleyan, H. (2018). Music genre classification using machine learning techniques. CoRR, abs/1804.01149.
Bhalke, D. G., Rajesh, B., and Bormane, D. S. (2017). Automatic genre classification using fractional fourier transform based mel frequency cepstral coefficient and timbral features. Archives of Acoustics, 42(2).
Campana, M. G., Delmastro, F., and Pagani, E. (2023). Transfer learning for the efficient detection of covid-19 from smartphone audio data. Pervasive and Mobile Computing, 89:101754.
Cinyol, F., Baysal, U., Köksal, D., Babaoğlu, E., and Ulaşlı, S. S. (2023). Incorporating support vector machine to the classification of respiratory sounds by convolutional neural network. Biomedical Signal Processing and Control, 79:104093.
de Araújo Lima, R., de Sousa, R. C. C., Barbosa, S. D. J., and Lopes, H. C. V. (2020). Brazilian lyrics-based music genre classification using a BLSTM network. CoRR, abs/2003.05377.
Dogo, E. M., Afolabi, O. J., and Twala, B. (2022). On the relative impact of optimizers on convolutional neural networks with varying depth and width for image classification. Applied Sciences, 12(23).
Gan, J. (2021). Music feature classification based on recurrent neural networks with channel attention mechanism. Mobile Information Systems, 2021(1):7629994.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J., and Wilson, K. (2017). CNN architectures for large-scale audio classification.
Jia, X. (2022). Music emotion classification method based on deep learning and improved attention mechanism. Computational Intelligence and Neuroscience, 2022(1):5181899.
Jing, H., Liu, Y., Ma, Y., and Zheng, N. (2024). Hidden states in LLMs improve EEG representation learning and visual decoding. In ECAI, pages 2130–2137.
Koh, E. and Dubnov, S. (2021). Comparison and analysis of deep audio embeddings for music emotion recognition.
Lau, D. and Ajoodha, R. (2022). Music Genre Classification: A Comparative Study Between Deep Learning and Traditional Machine Learning Approaches, pages 239–247.
Liu, H., Liu, X., Kong, Q., Wang, W., and Plumbley, M. D. (2024). Learning temporal resolution in spectrogram for audio classification. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12):13873–13881.
Liu, J., Wang, C., and Zha, L. (2021). A middle-level learning feature interaction method with deep learning for multi-feature music genre classification. Electronics, 10(18).
Marijić, A. and Bagić Babac, M. (2025). Predicting song genre with deep learning. Global Knowledge, Memory and Communication, 74(1/2):93–110.
Turab, M., Kumar, T. A., Bendechache, M., and Saber, T. (2022). Investigating multi-feature selection and ensembling for audio classification.
Seo, W., Cho, S.-H., Teisseyre, P., and Lee, J. (2024). A short survey and comparison of cnn-based music genre classification using multiple spectral features. IEEE Access, 12:245–257.
Sharma, G., Umapathy, K., and Krishnan, S. (2020). Trends in audio signal feature extraction methods. Applied Acoustics, 158:107020.
Silverman, M. J. (2009). The use of lyric analysis interventions in contemporary psychiatric music therapy: Descriptive results of songs and objectives for clinical practice. Music Therapy Perspectives, 27(1):55–61.
Sturm, B. L. (2014). The state of the art ten years after a state of the art: Future research in music information retrieval. Journal of New Music Research, 43(2):147–172.
Tzanetakis, G. and Cook, P. (2002). Musical genre classification of audio signals. IEEE Trans. Speech Audio Process., 10(5):293–302.
Wang, C., Nulty, P., and Lillis, D. (2021). A comparative study on word embeddings in deep learning for text classification. In Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, NLPIR ’20, pages 37–46, New York, NY, USA. Association for Computing Machinery.
Yin, X., Liu, Q., Huang, X., and Pan, Y. (2021). Real-time prediction of rockburst intensity using an integrated cnn-adam-bo algorithm based on microseismic data and its engineering application. Tunnelling and Underground Space Technology, 117:104133.
Zhang, Y. and Zhang, K. (2021). Music style classification algorithm based on music feature extraction and deep neural network. Wireless Communications and Mobile Computing, 2021:9298654.
Published
2025-09-29
How to Cite
SILVA, J. M. L.; FIGUEIREDO, C. M. S.; GUEDES, E. B.; MELO, T. de. A Multimodal Approach for Music Genre Classification Using Audio and Lyrics Embeddings. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1293-1304. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.11742.
