# Semantic graph attention networks and tensor decompositions for computer vision and computer graphics

### Abstract

This thesis proposes new deep neural network architectures that use attention enhancement and multilinear algebra methods to improve performance. We also explore graph convolutions and their particularities. Our focus is on problems related to real-time human pose estimation. We explore different architectures to reduce computational complexity and, as a result, propose two novel neural network models for 2D and 3D pose estimation. We also introduce a new graph attention network architecture called Semantic Graph Attention.

### References

R. Ranjan, V. M. Patel, and R. Chellappa, “Hyperface: A deep multitask learning framework for face detection, landmark localization, pose estimation, and gender recognition,” IEEE TPAMI, vol. 41, no. 1, pp. 121–135, 2019.

V. A. Sindagi and V. M. Patel, “A survey of recent advances in CNN-based single image crowd counting and density estimation,” Pattern Recognition Letters, vol. 107, pp. 3–16, 2018.

L. Ge, “Real-time 3d hand pose estimation from depth images,” Ph.D. dissertation, 2018.

S. Schwarcz and T. Pollard, “3d human pose estimation from deep multi-view 2d pose,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 2326–2331.

M. Lin, L. Lin, X. Liang, K. Wang, and H. Cheng, “Recurrent 3d pose sequence machines,” in Proceedings of the IEEE CVPR, 2017, pp. 810–819.

Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.

Y. Luo, J. Ren, Z. Wang, W. Sun, J. Pan, J. Liu, J. Pang, and L. Lin, “Lstm pose machines,” in Proceedings of the IEEE CVPR, 2018, pp. 5207–5215.

J. Kossaifi, A. Bulat, G. Tzimiropoulos, and M. Pantic, “T-net: Parametrizing fully convolutional nets with a single high-order tensor,” in Proceedings of the IEEE CVPR, 2019, pp. 7822–7831.

I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention augmented convolutional networks,” arXiv preprint arXiv:1904.09925, 2019.

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE CVPR, 2018, pp. 7132–7141.

J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A simple yet effective baseline for 3d human pose estimation,” in Proceedings of the IEEE ICCV, 2017, pp. 2640–2649.

L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas, “Semantic graph convolutional networks for 3d human pose regression,” in Proceedings of the IEEE CVPR, 2019, pp. 3425–3435.

D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt, “Vnect: Real-time 3d human pose estimation with a single rgb camera,” ACM Transactions on Graphics, 2017.

D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, M. Elgharib, P. Fua, H.-P. Seidel, H. Rhodin, G. Pons-Moll, and C. Theobalt, “Xnect: Realtime multi-person 3d motion capture with a single rgb camera,” ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 82–1, 2020.

B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan, “Diversified visual attention networks for fine-grained object classification,” IEEE Transactions on Multimedia, vol. 19, no. 6, pp. 1245–1256, 2017.

Q. Huang, F. Zhou, J. He, Y. Zhao, and R. Qin, “Spatial–temporal graph attention networks for skeleton-based action recognition,” Journal of Electronic Imaging, vol. 29, no. 5, p. 053003, 2020.

C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments,” IEEE TPAMI, vol. 36, no. 7, pp. 1325–1339, 2013.

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.

D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt, “Monocular 3d human pose estimation in the wild using improved cnn supervision,” in 2017 Proceedings of 3DV. IEEE, 2017, pp. 506–516.

S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE CVPR, 2016, pp. 4724–4732.

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE CVPR, 2018, pp. 4510–4520.

F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE CVPR, 2017, pp. 1251–1258.

Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” arXiv preprint arXiv:1511.06530, 2015.

T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.

S. Smith and G. Karypis, “Accelerating the tucker decomposition with compressed sparse tensors,” in European Conference on Parallel Processing. Springer, 2017, pp. 653–668.

A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari, Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009.

L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.

L. De Lathauwer, B. De Moor, and J. Vandewalle, “A multilinear singular value decomposition,” SIAM journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000.

P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos, “A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 2, pp. 179–192, 2010.

L. J. S. Silva, D. L. S. da Silva, A. B. Raposo, L. Velho, and H. C. V. Lopes, “Tensorpose: Real-time pose estimation for interactive applications,” Computers & Graphics, 2019.

L. Schirmer, D. Lúcio, A. Raposo, L. Velho, and H. Lopes, “A lightweight 2d pose machine with attention enhancement,” in 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 2020, pp. 324–331.

J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.

M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in neural information processing systems, 2016.

T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.

Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “Gcnet: Non-local networks meet squeeze-excitation networks and beyond,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.

W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang, “3d human pose estimation in the wild by adversarial learning,” in Proceedings of the IEEE CVPR, 2018, pp. 5255–5264.

M. Rayat Imtiaz Hossain and J. J. Little, “Exploiting temporal information for 3d human pose estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 68–84.

D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli, “3d human pose estimation in video with temporal convolutions and semi-supervised training,” in Proceedings of the IEEE CVPR, 2019, pp. 7753–7762.

R. Dabral, A. Mundhada, U. Kusupati, S. Afaque, A. Sharma, and A. Jain, “Learning 3d human pose from structure and motion,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 668–683.

D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt, “Single-shot multi-person 3d pose estimation from monocular rgb,” in 2018 Proceedings of 3DV. IEEE, 2018, pp. 120–130.

A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, “End-to-end recovery of human shape and pose,” in Proceedings of the IEEE CVPR, 2018, pp. 7122–7131.

J. N. Kundu, S. Seth, V. Jampani, M. Rakesh, R. V. Babu, and A. Chakraborty, “Self-supervised 3d human pose estimation via part guided novel image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6152–6162.

M. Kocabas, N. Athanasiou, and M. J. Black, “Vibe: Video inference for human body pose and shape estimation,” in Proceedings of the IEEE/CVF CVPR, 2020, pp. 5253–5263.

L. J. S. Silva, D. L. S. da Silva, L. Velho, and H. Lopes, “An end-to-end framework for 3d capture and human digitization with a single rgb camera,” in Eurographics, 2020, pp. 1–2.

Published

18/10/2021

How to Cite

SCHIRMER, Luiz; LOPES, Hélio; VELHO, Luiz.
Semantic graph attention networks and tensor decompositions for computer vision and computer graphics.

*In*: WORKSHOP DE TESES E DISSERTAÇÕES - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 34., 2021, Online. **Anais** [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 126-132. DOI: https://doi.org/10.5753/sibgrapi.est.2021.20024.