Deep Learning for Automatic Image Captioning
Abstract
Automatic image caption generation is the task of interpreting an image and describing its content in natural language sentences. It combines Natural Language Processing and Computer Vision to generate captions. Recently, Deep Learning methods have achieved very promising results on the caption generation problem. Based on the NIC (Neural Image Caption) model, this work proposes a combination of a convolutional neural network over images and a recurrent neural network over sentences, aligned by a structured objective to produce textual descriptions of images. The results show that the proposed neural model was able to learn a language model conditioned on image content, producing accurate descriptions for most images.
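To make the architecture concrete, below is a minimal sketch of a NIC-style encoder-decoder in PyTorch: a CNN encodes the image into a fixed-length vector that is fed to an LSTM as the first element of the caption sequence, so the language model is conditioned on visual content. The tiny convolutional encoder, vocabulary size, and dimensions here are illustrative assumptions, not the paper's exact configuration; in practice the encoder would be a large pretrained CNN such as VGG or Inception (both cited below) with its classifier head removed.

```python
# Minimal sketch of a NIC-style captioner (assumed hyperparameters throughout).
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Stand-in CNN encoder; a real model would use a pretrained
        # network (e.g., VGG or Inception) with its classifier removed.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),  # project image features to embed_dim
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image embedding acts as the first "word" of the sequence,
        # conditioning the LSTM language model on visual content.
        img_feat = self.encoder(images).unsqueeze(1)     # (B, 1, E)
        word_emb = self.embed(captions)                  # (B, T, E)
        inputs = torch.cat([img_feat, word_emb], dim=1)  # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                           # per-step word logits

# Toy usage: random image batch and integer-coded captions.
model = CaptionModel(vocab_size=1000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 12))
logits = model(images, captions)
print(logits.shape)  # torch.Size([2, 13, 1000])
```

Training would minimize cross-entropy between these logits and the shifted caption tokens; at inference the decoder is instead unrolled step by step, feeding each sampled word back as the next input (or using beam search, as in the original NIC work).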
References
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. In IEEE international conference on neural networks, pages 1183–1188. IEEE.
Denkowski, M. and Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380.
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634.
Goodfellow, I. J., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Cambridge, MA, USA.
Hendricks, L. A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., and Darrell, T. (2016). Deep compositional captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–10.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
Hodosh, M., Young, P., and Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.
Hossain, M., Sohel, F., Shiratuddin, M. F., and Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):118.
Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Mao, J., Wei, X., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In Proceedings of the IEEE international conference on computer vision, pages 2533–2541.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.
Tanti, M., Gatt, A., and Camilleri, K. P. (2017). What is the role of recurrent neural networks (RNNs) in an image caption generator? arXiv preprint arXiv:1708.02043.
Vedantam, R., Zitnick, C. L., and Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057.
