Language-Agnostic Visual-Semantic Embeddings

Jônatas Wehrmann; Rodrigo C. Barros

doi:10.5753/ctd.2021.15751

Jônatas Wehrmann PUC-RS
Rodrigo C. Barros PUC-RS

Abstract

We propose a framework for training language-invariant cross-modal retrieval models. We introduce four novel text encoding approaches, as well as a character-based word-embedding approach that allows the model to project similar words across languages into the same word-embedding space. In addition, performing cross-modal retrieval at the character level substantially reduces the storage requirements of the text encoder, allowing for lighter and more scalable retrieval architectures. The storage requirements of the proposed character-based, language-invariant textual encoder are virtually unaffected when new languages are added to the system. Contributions include new methods for building character-level word embeddings, an improved loss function, and a novel cross-language alignment module that not only makes the architecture language-invariant but also improves predictive performance. Moreover, we introduce a module called ADAPT, which provides query-aware visual representations that yield large improvements in recall on four widely used large-scale image-text datasets. We show that our models outperform the current state of the art in all scenarios. This thesis opens a new path for retrieval research by allowing for the effective use of captions in multilingual scenarios.
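
The storage argument can be made concrete with a back-of-the-envelope sketch that compares the number of embedding parameters in a word-level lookup table with that of a character-level one. The embedding dimension, the vocabulary sizes, the assumption that word vocabularies do not overlap across languages, and the helper names below are all illustrative assumptions, not figures or code from the thesis.

```python
# Illustrative parameter count only; all constants are assumed values.
DIM = 300                    # assumed embedding dimension
WORDS_PER_LANGUAGE = 10_000  # assumed word vocabulary size per language
CHARSET_SIZE = 200           # assumed character inventory shared by all languages


def embedding_params(num_symbols: int, dim: int) -> int:
    """Parameters in a lookup table holding one dim-sized vector per symbol."""
    return num_symbols * dim


def word_level_params(num_languages: int) -> int:
    # Assumes disjoint vocabularies, so the table grows with every language.
    return embedding_params(WORDS_PER_LANGUAGE * num_languages, DIM)


def char_level_params(num_languages: int) -> int:
    # Assumes the shared character set already covers every language,
    # so the table size does not depend on num_languages.
    return embedding_params(CHARSET_SIZE, DIM)


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, word_level_params(n), char_level_params(n))
```

Under these assumptions, the word-level table grows linearly with the number of languages while the character-level table stays constant, which is the effect the abstract describes.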

Keywords: multimodal retrieval, language-agnostic models, neural networks, computer vision
