DOI: 10.1145/3470482.3479632

A Cluster-Based Method for Action Segmentation Using Spatio-Temporal and Positional Encoded Embeddings

Published: 05 November 2021

ABSTRACT

A crucial task for overall video understanding is the recognition and temporal localisation of the different actions or events present throughout a video. Addressing this problem requires action segmentation, which consists of temporally segmenting a video by labeling each frame with a specific action. In this work, we propose a novel action segmentation method that requires no prior video analysis and no annotated data. Our method extracts spatio-temporal features from 0.5 s video samples using a pre-trained deep network. The features are then transformed with a positional encoder, and a clustering algorithm is applied, using the silhouette score to find the optimal number of clusters, where each cluster presumably corresponds to a single, distinguishable action. In experiments, we show that our method produces competitive results on the Breakfast and Inria Instructional Videos benchmarks.
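A minimal sketch of the clustering stage described above is shown below. It assumes the spatio-temporal embeddings (one per 0.5 s sample) have already been extracted by a pre-trained backbone, applies a standard sinusoidal positional encoding, and runs k-means with the silhouette score to select the number of clusters. The function names, the choice of k-means, and the candidate range of k are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def positional_encoding(num_positions, dim):
    """Standard sinusoidal positional encoding (as in the Transformer)."""
    positions = np.arange(num_positions)[:, None]                  # (T, 1)
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))  # (ceil(D/2),)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div)[:, : dim // 2]
    return pe


def segment_actions(features, k_range=range(2, 11), seed=0):
    """Cluster positionally encoded embeddings; pick k by silhouette score.

    features: (T, D) array with one spatio-temporal embedding per 0.5 s
    sample, assumed to come from a pre-trained video backbone (the
    extraction step is not shown). Returns one action label per sample.
    """
    x = features + positional_encoding(len(features), features.shape[1])
    best_labels, best_score = None, -1.0
    for k in k_range:                                    # candidate cluster counts
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(x)
        score = silhouette_score(x, labels)              # higher = better separated clusters
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```

Each resulting cluster is treated as one action; the per-sample labels can then be expanded back to frame-level segment boundaries.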


Published in

WebMedia '21: Proceedings of the Brazilian Symposium on Multimedia and the Web, November 2021, 271 pages. ISBN: 9781450386098. DOI: 10.1145/3470482.

Publisher: Association for Computing Machinery, New York, NY, United States.

Copyright © 2021 ACM.
