Text-Guided 3D Object Extraction from 360° Tours: A Single-Image Pipeline for Museum Asset Generation
Abstract
Many museums are now integrating virtual and mixed reality experiences into their tours. This process is still in its early stages, often requiring skilled professionals and time-intensive computational techniques such as photogrammetry to produce high-quality 3D assets. In this context, this work presents a streamlined approach that starts from a single 360° panorama and a plain-language prompt and produces a textured 3D model suitable for VR/AR applications. The workflow couples promptable segmentation with recent single-image reconstruction models, eliminating photogrammetry and manual mesh clean-up. Qualitative results on a variety of scenes show that the proposed method delivers visually convincing geometry in well under a minute on a commodity GPU, making large-scale, low-cost digitization practical for institutions of any size. By reducing both human effort and computational overhead, the approach yields XR-ready assets that can enable more interactive, immersive, and rapidly updatable museum experiences.
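For illustration, the sketch below shows one plausible shape of the two-stage workflow the abstract describes, assuming the Hugging Face transformers ports of an open-vocabulary detector (Grounding DINO) and a promptable segmenter (SAM). The checkpoints, thresholds, prompt string, and hand-off to the reconstruction model are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: text-prompted object extraction from a perspective view of the panorama.
# Assumes the Hugging Face `transformers` ports of Grounding DINO and SAM;
# checkpoints and thresholds are illustrative, not the paper's configuration.
import torch
from PIL import Image
from transformers import (
    AutoModelForZeroShotObjectDetection,
    AutoProcessor,
    SamModel,
    SamProcessor,
)

image = Image.open("panorama_view.png").convert("RGB")  # perspective crop of the 360° panorama
prompt = "a greek amphora."  # plain-language prompt (lowercase, period-terminated)

# Stage 1: open-vocabulary detection (Grounding DINO) localizes the prompted object.
det_proc = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
det_model = AutoModelForZeroShotObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")
det_inputs = det_proc(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    det_out = det_model(**det_inputs)
boxes = det_proc.post_process_grounded_object_detection(
    det_out,
    det_inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]["boxes"]  # (N, 4) boxes in xyxy pixel coordinates

# Stage 2: promptable segmentation (SAM) seeded with the highest-scoring box.
sam_proc = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam_model = SamModel.from_pretrained("facebook/sam-vit-base")
sam_inputs = sam_proc(image, input_boxes=[[boxes[0].tolist()]], return_tensors="pt")
with torch.no_grad():
    sam_out = sam_model(**sam_inputs)
mask = sam_proc.image_processor.post_process_masks(
    sam_out.pred_masks.cpu(),
    sam_inputs["original_sizes"].cpu(),
    sam_inputs["reshaped_input_sizes"].cpu(),
)[0][0, 0]  # boolean mask for the prompted object

# Stage 3 (not shown): the masked, background-free crop is handed to a
# single-image reconstruction model (e.g. TRELLIS, Hunyuan3D, or InstantMesh)
# to produce the textured mesh; the exact call depends on the chosen model.
```

Chaining an open-vocabulary detector into a box-prompted segmenter is the standard Grounded-SAM pattern; the resulting mask isolates the object so the downstream single-image reconstruction model receives a clean, background-free input.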
