Speech Quality Assessment in Communication Services Using Deep Learning
Abstract
Telephone services based on IP networks are widely used around the world. However, packet loss can occur on IP networks and degrade the users' Quality of Experience (QoE), which makes speech quality assessment necessary. A methodology for predicting speech quality is therefore relevant. This paper introduces a novel non-intrusive speech quality model based on deep learning that classifies speech into five quality classes. A speech database was built by applying different Packet Loss Rates (PLRs) to speech files and calculating a quality index for each file. Experimental results show that the proposed model outperforms ITU-T Recommendation P.563.
