DOI: 10.1145/3470482.3479632

A Cluster-Based Method for Action Segmentation Using Spatio-Temporal and Positional Encoded Embeddings

Published: 05 November 2021

ABSTRACT

A crucial task for overall video understanding is the recognition and temporal localisation of the different actions or events present throughout a video. Addressing this problem requires action segmentation, which consists of temporally segmenting a video by labeling each frame with a specific action. In this work, we propose a novel action segmentation method that requires no prior video analysis and no annotated data. Our method extracts spatio-temporal features from 0.5 s video samples using a pre-trained deep network. The features are then transformed with a positional encoder, and a clustering algorithm is applied, using the silhouette score to find the optimal number of clusters, where each cluster presumably corresponds to a single, distinguishable action. In experiments, we show that our method produces competitive results on the Breakfast and Inria Instructional Videos benchmarks.
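A minimal sketch of the clustering stage described above is shown below. It assumes the spatio-temporal embeddings (one per 0.5 s sample) have already been extracted by a pre-trained backbone, applies a standard sinusoidal positional encoding, and runs k-means with the silhouette score to select the number of clusters. The function names, the choice of k-means, and the candidate range of k are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def positional_encoding(num_positions, dim):
    """Standard sinusoidal positional encoding (as in the Transformer)."""
    positions = np.arange(num_positions)[:, None]                  # (T, 1)
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))  # (ceil(D/2),)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div)[:, : dim // 2]
    return pe


def segment_actions(features, k_range=range(2, 11), seed=0):
    """Cluster positionally encoded embeddings; pick k by silhouette score.

    features: (T, D) array with one spatio-temporal embedding per 0.5 s
    sample, assumed to come from a pre-trained video backbone (the
    extraction step is not shown). Returns one action label per sample.
    """
    x = features + positional_encoding(len(features), features.shape[1])
    best_labels, best_score = None, -1.0
    for k in k_range:                                    # candidate cluster counts
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(x)
        score = silhouette_score(x, labels)              # higher = better separated clusters
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```

Each resulting cluster is treated as one action; the per-sample labels can then be expanded back to frame-level segment boundaries.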


Published in

WebMedia '21: Proceedings of the Brazilian Symposium on Multimedia and the Web, November 2021, 271 pages. ISBN: 9781450386098. DOI: 10.1145/3470482.

Publisher: Association for Computing Machinery, New York, NY, United States.

Copyright © 2021 ACM.
