Vision-Language Models for Automated Property Feature Extraction in Tax Assessment: A Case Study

  • Gustavo R. Ribeiro UFG
  • Pedro A. M. Saraiva UFG
  • Luis H. A. Rosa UFG
  • Enzo L. Marques UFG
  • Gustavo L. B. Pereira UFG
  • Pedro M. L. Campos UFG
  • Luiz M. L. Pascoal UFG
  • Sávio S. T. de Oliveira UFG

Abstract


Property taxes such as Brazil's Urban Building and Land Tax (IPTU) are a critical source of municipal revenue. Calculating them traditionally relies on manual, on-site inspections to assess property characteristics, a process that is costly, time-consuming, and prone to subjectivity. This paper explores automating that assessment with state-of-the-art Vision-Language Models (VLMs). We present a novel benchmark that evaluates twelve VLMs on identifying and classifying specific building features as defined by the municipal legislation of Goiânia, Brazil. Using a dataset of images from public real estate listings and a zero-shot prompting strategy, we tasked the models with extracting 11 distinct construction categories, such as flooring, structure, and finishes. Our results indicate that proprietary models, particularly Google's Gemini 1.5 Pro and Gemini 1.5 Flash, achieve the highest performance, with macro F1-scores of 0.77 and 0.76, respectively. A detailed per-category analysis reveals that some features, such as 'Structure' and 'Electrical Installation', are identified with high accuracy, whereas others, such as 'Sanitary Installation' and 'External Finishes', remain challenging because they are visually subtle or absent from typical photographs. Our findings demonstrate the significant potential of VLMs to streamline public administration tasks, while also highlighting current limitations and avenues for future research.
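For readers unfamiliar with the reported metric, the sketch below shows one common way to compute a macro F1-score: the F1 of each class is computed separately and the results are averaged with equal weight, so rare classes count as much as frequent ones. It is an illustration only, not the authors' evaluation code; the function name `macro_f1` and the flooring labels in the usage example are hypothetical, and how the paper aggregates scores across its 11 categories is not specified in the abstract.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for a single category (e.g. flooring); not from the paper's data.
truth = ["ceramic", "ceramic", "cement", "wood"]
preds = ["ceramic", "cement", "cement", "wood"]
score = macro_f1(truth, preds)
print(round(score, 4))
```

Because every class contributes equally, a model that ignores an infrequent class is penalized more under macro averaging than it would be under plain accuracy.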

Published
29/09/2025
RIBEIRO, Gustavo R.; SARAIVA, Pedro A. M.; ROSA, Luis H. A.; MARQUES, Enzo L.; PEREIRA, Gustavo L. B.; CAMPOS, Pedro M. L.; PASCOAL, Luiz M. L.; OLIVEIRA, Sávio S. T. de. Vision-Language Models for Automated Property Feature Extraction in Tax Assessment: A Case Study. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 676-687. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.13998.