Language-Driven Graphs for Short Video Similarity

  • Juliano Koji Yugoshi (USP / UFMS)
  • Ricardo Marcondes Marcacini (USP)

Abstract

Comparing videos by similarity is a central task in modern video analysis, but its effectiveness depends critically on the chosen representation. While visual embeddings effectively capture appearance, they often lack semantic abstraction. Recently, Large Language Models (LLMs) have emerged as a promising way to generate rich textual descriptions, yet the structural properties they induce in video similarity spaces remain underexplored. We introduce a graph-based methodology to investigate these properties, systematically comparing the structure of similarity graphs derived from visual features, human-written text, and LLM-generated text. Our framework evaluates how well each graph preserves semantic consistency, both in immediate neighborhoods (local cohesion) and across longer paths (global organization). Our analysis reveals a fundamental trade-off: visual graphs exhibit high local purity that decays rapidly with path length, whereas LLM-based graphs preserve superior global semantic coherence. We demonstrate that LLMs build an abstract space that prioritizes deep thematic links, such as grouping videos under concepts like 'stage performance' across formal categories, over superficial purity. This structure offers a powerful, semantically organized alternative to the local cohesion of visual models.
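
To make the evaluation concrete, the following is a minimal Python sketch of the kind of graph analysis the abstract describes. It assumes precomputed per-video embeddings (visual or textual) and ground-truth category labels; the function names, the choice of k, and the use of NumPy and NetworkX are illustrative assumptions, not the authors' implementation.

import numpy as np
import networkx as nx

def build_knn_graph(embeddings: np.ndarray, k: int = 10) -> nx.Graph:
    """Connect each video to its k nearest neighbors by cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarity
    graph = nx.Graph()
    graph.add_nodes_from(range(len(embeddings)))
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:  # the k most similar videos to i
            graph.add_edge(i, int(j), weight=float(row[j]))
    return graph

def purity_at_hop(graph: nx.Graph, labels: list, hop: int) -> float:
    """Fraction of nodes exactly `hop` steps away that share the source's label.
    hop=1 probes local cohesion; larger hops probe global organization."""
    agree, total = 0, 0
    for node in graph:
        dists = nx.single_source_shortest_path_length(graph, node, cutoff=hop)
        ring = [m for m, d in dists.items() if d == hop]
        agree += sum(labels[m] == labels[node] for m in ring)
        total += len(ring)
    return agree / total if total else 0.0

Running purity_at_hop for hops 1, 2, 3, ... on graphs built from each of the three embedding sources would expose the trade-off reported above: a visual graph starts with high purity at hop 1 but falls off quickly, while an LLM-text graph degrades more gradually.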

Published
2025-09-29

YUGOSHI, Juliano Koji; MARCACINI, Ricardo Marcondes. Language-Driven Graphs for Short Video Similarity. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 927-937. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14276.
