Shaping the Video Conferences of Tomorrow With AI

  • Paulo Renato C. Mendes PUC-Rio
  • Eduardo S. Vieira PUC-Rio
  • Pedro Vinicius A. de Freitas PUC-Rio
  • Antonio José G. Busson PUC-Rio
  • Álan Lívio V. Guedes PUC-Rio
  • Carlos de Salles Soares Neto UFMA
  • Sérgio Colcher PUC-Rio


Before the COVID-19 pandemic, video was already one of the main media used on the internet. During the pandemic, video conferencing services became even more important, emerging as one of the main instruments enabling most social and professional human activities. Given social distancing policies, people are spending more time using these online services for working, learning, and leisure activities. Videoconferencing software became the standard means of communication for home-office work and remote learning. Nevertheless, many issues on these platforms remain to be addressed, and many aspects remain to be reexamined or investigated, such as ethical and user-experience issues, to name a few. We argue that many current state-of-the-art Artificial Intelligence (AI) techniques may help enhance video collaboration services, particularly methods based on Deep Learning, such as face and sentiment analysis and video classification. In this paper, we present a vision of how AI techniques may contribute to this upcoming videoconferencing age.


MENDES, Paulo Renato C.; VIEIRA, Eduardo S.; FREITAS, Pedro Vinicius A. de; BUSSON, Antonio José G.; GUEDES, Álan Lívio V.; SOARES NETO, Carlos de Salles; COLCHER, Sérgio. Shaping the Video Conferences of Tomorrow With AI. In: WORKSHOP “O FUTURO DA VIDEOCOLABORAÇÃO” - SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 26., 2020, São Luís. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 165-168. ISSN 2596-1683. DOI: