Speech Quality Assessment in Communication Services Using Deep Learning
Abstract
Telephone services based on IP networks are widely used around the world. However, IP networks are subject to packet loss, quantified by the packet loss rate (PLR), which degrades users' quality of experience (QoE) and makes speech quality assessment necessary. Establishing a methodology for predicting speech quality is therefore both relevant and necessary. Accordingly, this paper presents a non-intrusive speech quality classification model based on deep learning that uses five quality classes. A database was built in which different PLRs are applied to the speech files, and a quality index was computed for each file. Experimental results show that the proposed model outperforms ITU-T Recommendation P.563.