Similarity Detection in Android Screen: a comparative analysis using Information Retrieval, Embeddings and Vision Language Models
Abstract
The rapid growth of the mobile application market creates a pressing need for tools that can evaluate and compare similarity between screens of different applications. This article presents a comparative analysis of methods for similarity detection on Android screens, integrating Information Retrieval (IR), embedding models (BGE-M3, Snowflake-Arctic-Embed2, Nomic-Embed-Text) and Vision Language Models (VLMs) such as qwen2.5vl:7b and gemma-3:27b. The study addresses challenges such as design redundancies and UI test inconsistencies by comparing three pipelines for screen identification: IR, embeddings, and VLMs. The results highlight the strengths and limitations of each approach, providing insights into their applicability to mobile UI similarity detection.
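The embedding-based pipeline summarized above typically ranks candidate screens by vector similarity. A minimal sketch of this standard step, assuming each screen's extracted text has already been encoded into a vector (the vectors below are illustrative placeholders, not outputs of BGE-M3 or the other models studied):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical screen embeddings (in practice produced by an embedding
# model such as BGE-M3 over each screen's text/layout description).
login_screen_a = [0.9, 0.1, 0.3]
login_screen_b = [0.8, 0.2, 0.4]
settings_screen = [0.1, 0.9, 0.2]

# Similar screens should score higher than dissimilar ones.
print(cosine_similarity(login_screen_a, login_screen_b))
print(cosine_similarity(login_screen_a, settings_screen))
```

In a retrieval setting, each query screen would be compared against all indexed screens and the top-scoring matches returned; the actual models and thresholds used are those evaluated in the paper.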
Keywords:
Android UI similarity, IR, embedding models, VLM
References
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025).
Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Shuai Ren, and Hongsheng Li. 2025. AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents. arXiv:2407.17490 [cs.HC] [link]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216 (2024).
Sabrina Haque and Christoph Csallner. 2024. Inferring Alt-text For UI Icons With Large Language Models During App Development. arXiv:2409.18060 [cs.HC] [link]
Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos. 2024. Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models. arXiv preprint arXiv:2405.05374 (2024).
Aayush Modi, Vrajkumar Patel, Harsh Mistry, Abhishesh Mishra, Rocky Upadhyay, and Apoorva Shah. 2025. From Alt-text to Real Context: Revolutionizing image captioning using the potential of LLM. International Journal of Scientific Research in Computer Science Engineering and Information Technology 11 (01 2025), 379–387.
Trong-Hieu Nguyen-Mau, Nhu-Binh Truc, Nhu-Vinh Hoang, Minh-Triet Tran, and Hai-Dang Nguyen. 2025. Enhancing Visual Question Answering with Pretrained Vision-Language Models: An Ensemble Approach at the LAVA Challenge 2024. 281–292. DOI: 10.1007/978-981-96-2641-0_19
Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic Embed: Training a Reproducible Long Context Text Embedder. arXiv preprint arXiv:2402.01613 (2024).
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025).
Parul Verma and Brijesh Khandelwal. 2019. Word embeddings and its application in deep learning. Int J Innov Technol Explor Eng 8, 11 (2019), 337–341.
Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023).
Published
10/11/2025
How to Cite
SOUZA, Daniel Augusto R. Lima de; RAMOS, Fabio C.; OLIVEIRA, Cainã S. de; VERAS, Edluce L.; LOPES, Paulo Fabricio da F.; SANTOS, Barbara L.; SOUZA, Jose Diogo B. de; OLIVEIRA, Adriano de O. Similarity Detection in Android Screen: a comparative analysis using Information Retrieval, Embeddings and Vision Language Models. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 570-574. DOI: https://doi.org/10.5753/webmedia.2025.15052.
