Memory-Augmented Long Short-Term Memory for Dynamic Video Summarization
Abstract
Capturing relevant content from videos while preserving temporal coherence remains a central challenge in video skimming. The prevalence of redundant information often hinders the extraction of meaningful content, especially when the goal is to retain the central narrative of the video. While scene change detection can aid in segmenting video content, conventional methods often struggle with highly diverse and repetitive scenes due to their limited ability to model temporal dependencies and detect transitions effectively. To address these limitations, we propose the Memory-Augmented LSTM for Dynamic Video Summarization (MALSumm), a supervised architecture based on Extended Long Short-Term Memory (xLSTM) networks that enhances memory capacity through a dual-path design. This design integrates weighted memory to evaluate local and global information, allowing the model to preserve fine-grained details while maintaining overall temporal consistency within a low-complexity framework. Experimental results validate the effectiveness of our approach, achieving an average F-score of 49.7 on the SumMe dataset and 62.1 on TVSum, outperforming recent supervised baselines. Additionally, when measuring alignment with human annotations, the model attains a Kendall's τ of 0.180 and a Spearman's ρ of 0.242, exceeding the scores reported for human agreement. These findings demonstrate that our method provides a competitive and lightweight solution for dynamic video summarization, effectively balancing accuracy and efficiency.
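To make the dual-path idea concrete, the snippet below is a minimal, hypothetical PyTorch sketch of a frame-scoring head that blends a local (fine-grained) memory path with a global (long-range) path through a learned per-frame gate. It is not the MALSumm implementation: it uses standard LSTMs in place of xLSTM blocks, and all module names, dimensions, and the gating scheme are illustrative assumptions.

```python
# Hypothetical sketch of a dual-path, memory-weighted frame scorer.
# NOT the authors' MALSumm code: the local/global split, the gate, and
# all dimensions are assumptions used only to illustrate weighting
# fine-grained detail against global temporal context.
import torch
import torch.nn as nn

class DualPathMemorySummarizer(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        # Local path: unidirectional recurrence over frame features.
        self.local_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Global path: bidirectional recurrence approximating long-range context.
        self.global_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                                  bidirectional=True)
        self.global_proj = nn.Linear(2 * hidden_dim, hidden_dim)
        # Learned gate weighting local vs. global memory for each frame.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        # Frame-level importance score in [0, 1].
        self.scorer = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, frames):                        # frames: (B, T, feat_dim)
        local, _ = self.local_rnn(frames)             # (B, T, H)
        global_, _ = self.global_rnn(frames)          # (B, T, 2H)
        global_ = self.global_proj(global_)           # (B, T, H)
        # Sigmoid gate blends fine-grained detail with global consistency.
        g = torch.sigmoid(self.gate(torch.cat([local, global_], dim=-1)))
        fused = g * local + (1 - g) * global_
        return self.scorer(fused).squeeze(-1)         # (B, T) importance scores

# Usage: score 320 frames of 1024-D features for two videos.
model = DualPathMemorySummarizer()
scores = model(torch.randn(2, 320, 1024))             # shape (2, 320)
```

The per-frame scores from such a head would then feed a standard keyshot selection step (e.g., knapsack over shot segments) to produce the dynamic summary.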
