Multimodal Graph Attention Networks for Real-Time Action Prediction and Accident Prevention in Industrial Environments

  • Guilherme Nunes, Centro Universitário FEI
  • Paulo Sérgio Rodrigues, Centro Universitário FEI

Abstract

One of the major challenges faced by the manufacturing industry is the prevention of workplace accidents. In this context, ensuring compliance with safety regulations, such as the use of personal protective equipment (PPE) and the proper execution of specific tasks according to safety protocols, is essential, especially when supervisors or safety personnel are not present on site. To address this issue, we propose a multimodal method for real-time human action prediction in industrial environments, aimed at supporting accident prevention systems. Our approach integrates two parallel Graph Attention Networks (GATs): one operating on graphs built from human skeleton pose estimation, and another on graphs built from scene object detections. By combining these two complementary modalities, the model captures both human motion dynamics and contextual environmental information. To the best of our knowledge, this combination has not yet been explored in the literature. The proposed method will be evaluated on two benchmark datasets: Kinetics-400 (a large-scale video dataset with diverse real-world actions) and UnsafeNet (a dataset of factory-recorded videos annotated with safe and unsafe behaviors). We expect the results to demonstrate the feasibility of applying multimodal GAT-based architectures to enhance occupational safety through intelligent action recognition systems.
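The two-stream design described in the abstract can be sketched in plain NumPy: one graph-attention layer per modality, followed by a simple fusion. The graph sizes, feature dimensions, and the mean-pool-plus-concatenation fusion below are illustrative assumptions for exposition, not the authors' exact architecture.

```python
import numpy as np

def gat_layer(X, A, W, a, leaky=0.2):
    """One graph-attention layer (Velickovic et al. style), NumPy sketch.
    X: (N, F) node features; A: (N, N) adjacency with self-loops;
    W: (F, F') linear projection; a: (2*F',) attention vector."""
    H = X @ W                                     # project node features
    N = H.shape[0]
    e = np.zeros((N, N))                          # attention logits
    for i in range(N):
        for j in range(N):
            z = np.concatenate([H[i], H[j]]) @ a  # a^T [h_i || h_j]
            e[i, j] = z if z > 0 else leaky * z   # LeakyReLU
    e = np.where(A > 0, e, -1e9)                  # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # row-wise softmax
    return alpha @ H                              # attention-weighted aggregation

rng = np.random.default_rng(0)
# Stream 1: a toy 5-joint skeleton chain; Stream 2: a toy 3-object scene graph.
# All counts and dimensions here are hypothetical placeholders.
Xs = rng.normal(size=(5, 8))
As = np.eye(5) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
Xo = rng.normal(size=(3, 8))
Ao = np.ones((3, 3))                              # fully connected object graph
Ws, Wo = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
a_s, a_o = rng.normal(size=32), rng.normal(size=32)

hs = gat_layer(Xs, As, Ws, a_s).mean(axis=0)      # pooled skeleton embedding
ho = gat_layer(Xo, Ao, Wo, a_o).mean(axis=0)      # pooled object embedding
fused = np.concatenate([hs, ho])                  # joint representation for a classifier head
```

In a full system, each stream would stack several such layers (with temporal modeling across frames) and the fused vector would feed an action-prediction head; the sketch only shows how the two graph modalities can be processed in parallel and merged.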

References

N. Pianegonda, “Acidentes de trabalho matam ao menos uma pessoa a cada 3h47min no Brasil,” 2023. [Online]. Available: [link]

S. Adhikesaven, “An industrial workplace alerting and monitoring platform to prevent workplace injury and accidents,” ArXiv, vol. abs/2210.17414, 2022. [Online]. Available: [link]

N. D. Nath, A. H. Behzadan, and S. G. Paal, “Deep learning for site safety: Real-time detection of personal protective equipment,” Automation in Construction, vol. 112, 2020. [Online]. Available: [link]

M. M. Saudi, A. Hakim, A. Ahmad, A. Shakir, M. Hanafi, A. Narzullaev, and M. Ifwat, “Image detection model for construction worker safety conditions using faster r-cnn,” International Journal of Advanced Computer Science and Applications, vol. 11, 2020. [Online]. Available: [link]

C.-L. Wang and J. Yan, “A comprehensive survey of rgb-based and skeleton-based human action recognition,” IEEE Access, vol. 11, pp. 53880–53898, 2023. [Online]. Available: [link]

S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in AAAI Conference on Artificial Intelligence, 2018. [Online]. Available: [link]

Z. Chen, S. Li, B. Yang, Q. Li, and H. Liu, “Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition,” in AAAI Conference on Artificial Intelligence, 2021. [Online]. Available: [link]

Y. Zhou, C. Li, Z.-Q. Cheng, Y. Geng, X. Xie, and M. Keuper, “Hypergraph transformer for skeleton-based action recognition,” ArXiv, vol. abs/2211.09590, 2022. [Online]. Available: [link]

H.-G. Chi, M. H. Ha, S.-G. Chi, S. W. Lee, Q.-X. Huang, and K. Ramani, “Infogcn: Representation learning for human skeleton-based action recognition,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20154–20164, 2022. [Online]. Available: [link]

L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12018–12027, 2019. [Online]. Available: [link]

——, “Skeleton-based action recognition with directed graph neural networks,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7904–7913, 2019. [Online]. Available: [link]

F. Ye, S. Pu, Q. Zhong, C. Li, D. Xie, and H. Tang, “Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition,” Proceedings of the 28th ACM International Conference on Multimedia, 2020. [Online]. Available: [link]

Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, “Channel-wise topology refinement graph convolution for skeleton-based action recognition,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13339–13348, 2021. [Online]. Available: [link]

L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition,” in Asian Conference on Computer Vision, 2020. [Online]. Available: [link]

H. Liu, J. Tu, and M. Liu, “Two-stream 3d convolutional neural network for skeleton-based action recognition,” ArXiv, vol. abs/1705.08106, 2017. [Online]. Available: [link]

M. Korban and X. Li, “Ddgcn: A dynamic directed graph convolutional network for action recognition,” in European Conference on Computer Vision, 2020. [Online]. Available: [link]

L. Wang, Z. Tong, B. Ji, and G. Wu, “Tdn: Temporal difference networks for efficient action recognition,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1895–1904, 2021. [Online]. Available: [link]

S. Gao, M.-M. Cheng, K. Zhao, X. Zhang, M.-H. Yang, and P. H. S. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, pp. 652–662, 2021. [Online]. Available: [link]

H. Zhou, Q. Liu, and Y. Wang, “Learning discriminative representations for skeleton based action recognition,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10608–10617, 2023. [Online]. Available: [link]

W. Xiang, C. Li, Y. Zhou, B. Wang, and L. Zhang, “Generative action description prompts for skeleton-based action recognition,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10242–10251, 2023. [Online]. Available: [link]

L. G. Foo, T. Li, H. Rahmani, Q. Ke, and J. Liu, “Era: Expert retrieval and assembly for early action prediction,” ArXiv, vol. abs/2207.09675, 2022. [Online]. Available: [link]

X. Wang, J. Hu, J. Lai, J. Zhang, and W. Zheng, “Progressive teacher-student learning for early action prediction,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3551–3560, 2019. [Online]. Available: [link]

J. Weng, X. Jiang, W.-L. Zheng, and J. Yuan, “Early action recognition with category exclusion using policy-based reinforcement learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, pp. 4626–4638, 2020. [Online]. Available: [link]

G. Pang, X. Wang, J. Hu, Q. Zhang, and W. Zheng, “Dbdnet: Learning bi-directional dynamics for early action prediction,” in International Joint Conference on Artificial Intelligence, 2019. [Online]. Available: [link]

S.-G. Chi, H.-G. Chi, Q. Huang, and K. Ramani, “Infogcn++: Learning representation by predicting the future for online human skeleton-based action recognition,” ArXiv, vol. abs/2310.10547, 2023. [Online]. Available: [link]

O. Elharrouss, N. Almaadeed, S. A. Al-Maadeed, A. Bouridane, and A. Beghdadi, “A combined multiple action recognition and summarization for surveillance video sequences,” Applied Intelligence, vol. 51, pp. 690–712, 2020. [Online]. Available: [link]

O. Önal and E. Dandıl, “Video dataset for the detection of safe and unsafe behaviours in workplaces,” Data in Brief, vol. 56, 2024. [Online]. Available: [link]

L. Gómez-Chova, D. Tuia, G. Moser, and G. Camps-Valls, “Multimodal classification of remote sensing images: A review and future directions,” Proceedings of the IEEE, vol. 103, no. 9, pp. 1560–1584, 2015.

J. Liang, Y. Deng, and D. Zeng, “A deep neural network combined cnn and gcn for remote sensing scene classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 4325–4338, 2020.

S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–1149, 2015. [Online]. Available: [link]

J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, 2016. [Online]. Available: [link]

J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, “Deep high-resolution representation learning for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, pp. 3349–3364, 2021. [Online]. Available: [link]

H. Duan, Y. Zhao, K. Chen, D. Shao, D. Lin, and B. Dai, “Revisiting skeleton-based action recognition,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2959–2968, 2022. [Online]. Available: [link]

T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, 2014. [Online]. Available: [link]

O. Önal and E. Dandıl, “Unsafe-net: Yolo v4 and convlstm based computer vision system for real-time detection of unsafe behaviours in workplace,” Multimedia Tools and Applications, 2024. [Online]. Available: [link]
Published
30/09/2025
NUNES, Guilherme; RODRIGUES, Paulo Sérgio. Multimodal Graph Attention Networks for Real-Time Action Prediction and Accident Prevention in Industrial Environments. In: WORKSHOP DE TRABALHOS EM ANDAMENTO - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 73-78.