ABSTRACT
Image descriptions are crucial in assisting individuals without eyesight by providing verbal representations of visual content. While both manual and Artificial Intelligence (AI)-generated descriptions exist, automatic description generators have not fully met the needs of visually impaired people. In this study, we examined the problems with image descriptions reported in the existing literature using the snowballing technique. Through this method, we identified thirteen issues, including ethical concerns surrounding physical appearance, gender and identity, race, and disability. Furthermore, we identified five reasons why sighted individuals often fail to provide descriptions for visual content, highlighting the need for accessibility campaigns that raise awareness of the social significance of descriptive sentences. We conducted interviews with eight low-vision volunteers, in which we analyzed the characteristics of descriptive sentences for 25 indoor images and gathered participants' expectations regarding image descriptions. As a result, we propose a set of Good Practices for writing descriptive sentences, aimed at helping automatic tools and sighted individuals generate more satisfactory, higher-quality image descriptions. We hope our results will emphasize the societal importance of image descriptions and inspire the community to pursue further interdisciplinary research addressing the issues identified in our study.
Index Terms
- Image content descriptions for people with visual impairments