ABSTRACT
Audio Description (AD) or Video Description is a vital accessibility concept in blind and visually impaired people's life. Automating this task is not easy and involves many problems, such as describing the scenario, actions, emotions, and characters. This paper presents an approach to automatically describe characters --- in a video or image --- combining Deep Learning (DL), Face detection, Facial Expression detection techniques, and audio synthesizers. Our proposal uses the detection tools, applies some DL models to the analyzed data, and generates an audio description. To evaluate the feasibility of our proposal, we have developed a proof of concept of the solution and performed some computational experiments to evaluate it.
- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.Google Scholar
- Inc. Amazon Web Services. 2021. Amazon Polly. https://aws.amazon.com/polly/Google Scholar
- Langis Gagnon, Samuel Foucher, Maguelonne Heritier, Marc Lalonde, David Byrns, Claude Chapdelaine, James Turner, Suzanne Mathieu, Denis Laurendeau, Nath Nguyen, and Denis Ouellet. 2009. Towards computer-vision software tools to increase production and accessibility of video description for people with vision loss. Universal Access in the Information Society 8 (08 2009), 199--218. https://doi.org/10.1007/s10209-008-0141-0Google Scholar
- Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357--362. https://doi.org/10.1038/s41586-020-2649-2Google ScholarCross Ref
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs.CV]Google Scholar
- C. Jiménez, C.J. Hurtado, A. Rodríguez, and C. Seibel. 2012. Un corpus de cine: fundamentos teóricos y aplicados de la audiodescripción. Tragacanto. https://books.google.com.br/books?id=OOSXYgEACAAJGoogle Scholar
- Kimmo Karkkainen and Jungseock Joo. 2021. FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1548--1558.Google ScholarCross Ref
- Davis E. King. 2009. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10 (2009), 1755--1758.Google ScholarDigital Library
- Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG]Google Scholar
- Masatomo Kobayashi, Kentarou Fukuda, Hironobu Takagi, and Chieko Asakawa. 2009. Providing Synthesized Audio Description for Online Videos. In Proceedings of the 11th International ACM SIGACCESS Conference on Computers and Accessibility (Pittsburgh, Pennsylvania, USA) (Assets '09). Association for Computing Machinery, New York, NY, USA, 249--250. https://doi.org/10.1145/1639642.1639699Google ScholarDigital Library
- Masatomo Kobayashi, Trisha O'Connell, Bryan Gould, Hironobu Takagi, and Chieko Asakawa. 2010. Are synthesized video descriptions acceptable? ASSETS'10 - Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility. https://doi.org/10.1145/1878803.1878833Google ScholarDigital Library
- James Lakritz and Andrew Salway. 2006. The Semi-Automatic Generation of Audio Description from Screenplays. Dept. of Computing Technical Report CS-06-05, University of Surrey (2006).Google Scholar
- Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV).Google ScholarDigital Library
- Adrian Rosebrock. 2021. imutils. https://github.com/jrosebr1/imutilsGoogle Scholar
- Rasmus Rothe, Radu Timofte, and Luc Van Gool. 2018. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision 126, 2-4 (2018), 144--157.Google ScholarDigital Library
- Barbara Cristina A. Silveira, Thiago Silva-de Souza, and Ana Regina C. da Rocha. 2018. Software Accessibility for Visually Impaired People: A Systematic Mapping Study. In Proceedings of the 17th Brazilian Symposium on Software Quality (Curitiba, Brazil) (SBQS). Association for Computing Machinery, New York, NY, USA, 190--199. https://doi.org/10.1145/3275245.3275266Google ScholarDigital Library
- Yujia Wang, Wei Liang, Haikun Huang, Yongqi Zhang, Dingzeyu Li, and Lap-Fai Yu. 2021. Toward Automatic Audio Description Generation for Accessible Videos. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI '21). Association for Computing Machinery, New York, NY, USA, Article 277, 12 pages. https://doi.org/10.1145/3411764.3445347Google ScholarDigital Library
- Zhifei Zhang, Yang Song, and Hairong Qi. 2017. Age Progression/Regression by Conditional Adversarial Autoencoder. arXiv:1702.08423 [cs.CV]Google Scholar
Index Terms
- An Approach for Automatic Description of Characters for Blind People
Recommendations
What Makes Videos Accessible to Blind and Visually Impaired People?
CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing SystemsUser-generated videos are an increasingly important source of information online, yet most online videos are inaccessible to blind and visually impaired (BVI) people. To find videos that are accessible, or understandable without additional description ...
Toward Automatic Audio Description Generation for Accessible Videos
CHI '21: Proceedings of the 2021 CHI Conference on Human Factors in Computing SystemsVideo accessibility is essential for people with visual impairments. Audio descriptions describe what is happening on-screen, e.g., physical actions, facial expressions, and scene changes. Generating high-quality audio descriptions requires a lot of ...
Comments