Imagery contents descriptions for People with visual impairments

Alessandra Helena Jandrey; Duncan Dubugras Alcoba Ruiz; Milene Selbach Silveira

Alessandra Helena Jandrey Pontifícia Universidade Católica do Rio de Janeiro https://orcid.org/0009-0000-0886-1241
Duncan Dubugras Alcoba Ruiz Pontifícia Universidade Católica do Rio de Janeiro http://orcid.org/0000-0002-4071-3246
Milene Selbach Silveira Pontifícia Universidade Católica do Rio de Janeiro https://orcid.org/0000-0003-2159-551X

Abstract

Image descriptions are crucial for people without eyesight because they provide verbal representations of visual content. Although both manual and Artificial Intelligence (AI)-generated descriptions exist, automatic description generators have not fully met the needs of visually impaired people. In this study, we used the snowballing technique to examine the problems with image descriptions reported in the existing literature. Through this method, we identified thirteen issues, including ethical concerns surrounding physical appearance, gender and identity, race, and disability. We also identified five reasons why sighted individuals often fail to provide descriptions for visual content, which highlights the need for accessibility campaigns that raise awareness of the social significance of descriptive sentences. In interviews with eight low-vision volunteers, we analyzed the characteristics of descriptive sentences for 25 indoor images and gathered participants’ expectations regarding image descriptions. As a result, we propose a set of good practices for writing descriptive sentences, aimed at helping automatic tools and sighted individuals generate more satisfactory, higher-quality image descriptions. We hope our results will emphasize the societal importance of image descriptions and inspire the community to pursue further interdisciplinary research on the issues identified in our study.

Keywords: Good practices, Image descriptions, Visually impaired people
