Towards Neural-Symbolic AI for Media Understanding

  • Polyana B. Costa PUC-Rio
  • Guilherme Marques PUC-Rio
  • Arthur C. Serra PUC-Rio
  • Daniel de S. Moraes PUC-Rio
  • Antonio J. G. Busson PUC-Rio
  • Álan L. V. Guedes PUC-Rio
  • Guilherme Lima IBM Research
  • Sérgio Colcher PUC-Rio


Methods based on Machine Learning have become state-of-the-art in various segments of computing, especially in the fields of computer vision, speech recognition, and natural language processing. Such methods, however, generally work best when applied to specific tasks in specific domains where large training datasets are available. This paper presents an overview of the state-of-the-art in Deep Learning for Multimedia Content Analysis (image, audio, and video) and describes recent works that propose the integration of deep learning with symbolic AI reasoning. We draw a picture of the future by discussing envisaged use cases in which media understanding gaps can be closed by combining machine learning with symbolic AI, the so-called Neuro-Symbolic integration.
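As a rough illustration of the Neuro-Symbolic integration the abstract refers to, the sketch below (our own minimal example, not taken from the paper) splits media understanding into a neural "perception" stage that outputs soft labels and a symbolic stage that applies hand-written rules over those labels; the `neural_perception` stub, rule set, and event names are all hypothetical stand-ins.

```python
# Minimal Neuro-Symbolic sketch (assumed, not from the paper): a neural
# perception stage yields class probabilities per detected object in a
# frame; a symbolic stage reasons over confident labels with simple rules.

def neural_perception(frame):
    # Stand-in for a trained CNN detector: returns class probabilities
    # per detected object. Hard-coded here purely for illustration.
    return [
        {"person": 0.92, "dog": 0.03},
        {"ball": 0.88, "cup": 0.07},
    ]

def symbolic_reasoning(detections, threshold=0.5):
    # Keep only confident symbols, then fire simple rule-based inferences
    # mapping sets of co-occurring symbols to a higher-level event.
    symbols = {max(d, key=d.get) for d in detections
               if max(d.values()) >= threshold}
    rules = {
        frozenset({"person", "ball"}): "playing",
        frozenset({"person", "dog"}): "walking_dog",
    }
    for premise, conclusion in rules.items():
        if premise <= symbols:  # all premise symbols were perceived
            return conclusion
    return "unknown"

event = symbolic_reasoning(neural_perception(frame=None))
print(event)  # -> playing
```

The point of the split is that the rules remain inspectable and editable without retraining the perception model, which is one of the motivations for Neuro-Symbolic approaches discussed in the paper.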


How to Cite

COSTA, Polyana B.; MARQUES, Guilherme; SERRA, Arthur C.; MORAES, Daniel de S.; BUSSON, Antonio J. G.; GUEDES, Álan L. V.; LIMA, Guilherme; COLCHER, Sérgio. Towards Neural-Symbolic AI for Media Understanding. In: WORKSHOP "O FUTURO DA VIDEOCOLABORAÇÃO" - SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 26., 2020, São Luís. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 169-172. ISSN 2596-1683.