Multimodal Vision-Language Models for Automated Property Feature Extraction: A Comparative Analysis of Image, Text, and Combined Inputs

  • Pedro M. L. Campos (UFG)
  • Gustavo R. Ribeiro (UFG)
  • Enzo L. Marques (UFG)
  • Gustavo L. B. Pereira (UFG)
  • Luiz M. L. Pascoal (UFG)
  • Fernando M. Federson (UFG)
  • Sávio S. T. de Oliveira (UFG)

Abstract

Manual property assessment for taxation is costly, time-consuming, and subjective. This paper investigates Vision-Language Models (VLMs) as a way to automate this task through a comprehensive evaluation across image-only, text-only, and combined inputs. We evaluated six models from the Gemini and Gemma families on 200 properties from Goiânia, Brazil, classified across 11 legally defined construction categories using specialized prompting strategies. Our analysis reveals a counterintuitive finding: text-only inputs achieve the highest accuracy, outperforming image-only inputs and matching combined multimodal inputs. This demonstrates that structured textual descriptions contain exceptionally high signal value for legally defined tasks.

Keywords: Vision-Language Models, Property Assessment, Multimodal AI, Text-Only Classification, Feature Extraction

Published
04/12/2025
CAMPOS, Pedro M. L.; RIBEIRO, Gustavo R.; MARQUES, Enzo L.; PEREIRA, Gustavo L. B.; PASCOAL, Luiz M. L.; FEDERSON, Fernando M.; OLIVEIRA, Sávio S. T. de. Multimodal Vision-Language Models for Automated Property Feature Extraction: A Comparative Analysis of Image, Text, and Combined Inputs. In: ESCOLA REGIONAL DE INFORMÁTICA DE GOIÁS (ERI-GO), 13., 2025, Luziânia/GO. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 195-204. DOI: https://doi.org/10.5753/erigo.2025.17120.