Comparative Analysis of Visual Models for Indoor Environments with Temporal and Lighting Variation

  • William Azevedo Pessoa de Melo (UFAM)
  • Alícia Caldeira da Silva (UFAM)
  • Carlos Victor de Araújo Lima (UFAM)
  • Alternei de Souza Brito (UFAM)
  • Felipe Gomes de Oliveira (UFAM)

Abstract

Visual classification models are essential in applications such as autonomous navigation and mobile robotics, but they still face challenges in indoor environments with lighting and temporal variations. This work compares the performance of the DINOv2 feature extractor, a state-of-the-art self-supervised model, against supervised architectures such as ConvNeXt, EfficientNet, ResNet, and ViT. Using the KTH-IDOL2 dataset, we evaluated the models under varying environmental conditions. Results show that DINOv2 consistently outperformed the supervised baselines, achieving up to 98.02% accuracy. These findings highlight the robustness of self-supervised representations under visual variability, positioning DINOv2 as a promising alternative for realistic indoor scene classification.
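
For concreteness, the evaluation pipeline can be pictured as a frozen feature extractor followed by a lightweight classifier. The sketch below (Python) is an assumption-laden illustration, not the paper's actual implementation: it uses the public torch.hub release of DINOv2 (ViT-S/14) and a scikit-learn logistic-regression probe, and the file paths and room labels are hypothetical placeholders for the KTH-IDOL2 splits.

    # Illustrative sketch only: frozen DINOv2 features + linear probe.
    # The probe, preprocessing, and data paths are assumptions; the paper
    # does not specify its classification head or hyperparameters.
    import torch
    import torchvision.transforms as T
    from PIL import Image
    from sklearn.linear_model import LogisticRegression

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # DINOv2 ViT-S/14 from torch.hub; its forward pass returns the CLS embedding.
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
    model = model.to(device).eval()

    # 224 px is divisible by the 14-px patch size; ImageNet normalization.
    preprocess = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def extract_features(paths):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
        return model(batch.to(device)).cpu().numpy()  # (N, 384) CLS features

    # Hypothetical placeholders; replace with the real KTH-IDOL2 splits.
    train_paths = ["idol2/train/kitchen_0.png", "idol2/train/corridor_0.png"]
    train_labels = ["kitchen", "corridor"]
    test_paths, test_labels = ["idol2/test/kitchen_1.png"], ["kitchen"]

    clf = LogisticRegression(max_iter=1000).fit(extract_features(train_paths), train_labels)
    print("accuracy:", clf.score(extract_features(test_paths), test_labels))

Keeping the backbone frozen and training only the probe is one simple way to attribute accuracy differences to representation quality rather than classifier capacity; the supervised baselines (ConvNeXt, EfficientNet, ResNet, ViT) can be compared under the same protocol.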

References

Anwer, R. M., Khan, F. S., Laaksonen, J., and Zaki, N. (2019). Multi-stream convolutional networks for indoor scene recognition. In Computer Analysis of Images and Patterns: 18th International Conference, CAIP 2019, Salerno, Italy, September 3–5, 2019, Proceedings, Part I, pages 196–208. Springer.

Barros, T., Pereira, R., Garrote, L., Premebida, C., and Nunes, U. J. (2021). Place recognition survey: An update on deep learning approaches. arXiv preprint arXiv:2106.10458.

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021). Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR).

Garg, S., Fischer, T., and Milford, M. (2021). Where is your place, visual place recognition? arXiv preprint arXiv:2103.06443.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022). A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Luo, J., Pronobis, A., Caputo, B., and Jensfelt, P. (2006). The KTH-IDOL2 Database. Technical Report CVAP304, KTH Royal Institute of Technology, CVAP/CAS, Stockholm, Sweden.

Masone, C. and Caputo, B. (2021). A survey on deep visual place recognition. IEEE Access, 9:19516–19547.

Oquab, M., Darcet, T., Moutakanni, T., Ramé, A., Taylor, L., Misra, I., and Caron, M. (2024). DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research.

Pronobis, A., Luo, J., and Caputo, B. (2010). The more you learn, the less you store: Memory-controlled incremental SVM for visual place recognition. Image and Vision Computing, 28(7):1080–1097.

Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6105–6114. PMLR.

Wang, R., Shen, Y., Zuo, W., Zhou, S., and Zheng, N. (2022). TransVPR: Transformer-based place recognition with multi-level attention aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13648–13657.

Zaffar, M., Ehsan, S., Milford, M., Flynn, D., and McDonald-Maier, K. D. (2020). VPR-Bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change. arXiv preprint arXiv:2005.08135.

Zhang, X., Wang, L., and Su, Y. (2021). Visual place recognition: A survey from deep learning perspective. Pattern Recognition, 113:107760.

Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. (2014). Learning deep features for scene recognition using places database. Advances in Neural Information Processing Systems, 27.
Published
2025-07-01
MELO, William Azevedo Pessoa de; SILVA, Alícia Caldeira da; LIMA, Carlos Victor de Araújo; BRITO, Alternei de Souza; OLIVEIRA, Felipe Gomes de. Comparative Analysis of Visual Models for Indoor Environments with Temporal and Lighting Variation. In: ICET TECHNOLOGY CONFERENCE (CONNECTECH), 2., 2025, Itacoatiara/AM. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 202-214. DOI: https://doi.org/10.5753/connect.2025.12317.