TransAlign: Translation and Alignment of Corpora for the Portuguese Language.

Abstract


In this paper, we introduce TransAlign, an innovative framework to enhance Open Information Extraction (OpenIE) in underrepresented languages, such as Portuguese, by leveraging data from resource-rich languages. Using specific grammatical rules and high-quality translation models, we adapted LSOIE, a large-scale dataset, for Portuguese. This approach generated 21,161 high-quality triples for OpenIE in Portuguese. The resulting dataset enabled the training of a new model that improved F1 scores by 50% over existing systems for Portuguese.
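
To make the idea of aligning translated triples concrete, the following is a minimal sketch, not the procedure published in the paper. It assumes that a triple and its sentence have already been translated into Portuguese, and it keeps the triple only when each of its parts appears as a contiguous token span in the translated sentence. The names `Triple`, `find_span`, and `align_triple` are hypothetical, the Portuguese strings are hand-written examples, and no translation model is called.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """Hypothetical container for an OpenIE triple (not from the paper)."""
    arg1: str
    rel: str
    arg2: str


def find_span(tokens: list[str], phrase: str) -> Optional[tuple[int, int]]:
    """Return (start, end) token indices of `phrase` in `tokens`, or None."""
    target = phrase.lower().split()
    if not target:
        return None
    lowered = [t.lower() for t in tokens]
    for start in range(len(lowered) - len(target) + 1):
        if lowered[start:start + len(target)] == target:
            return start, start + len(target)
    return None


def align_triple(sentence_pt: str, triple_pt: Triple) -> Optional[dict[str, tuple[int, int]]]:
    """Assumed filter: accept a translated triple only if every part is a
    contiguous span of the translated sentence; otherwise discard it."""
    tokens = sentence_pt.split()
    spans: dict[str, tuple[int, int]] = {}
    for name, phrase in triple_pt._asdict().items():
        span = find_span(tokens, phrase)
        if span is None:
            return None  # translation drifted from the sentence; drop the triple
        spans[name] = span
    return spans


sentence = "Maria comprou um livro novo ontem"
kept = align_triple(sentence, Triple("Maria", "comprou", "um livro novo"))
dropped = align_triple(sentence, Triple("Maria", "adquiriu", "um livro"))
print(kept)     # spans for arg1, rel, arg2
print(dropped)  # None, because "adquiriu" is not in the sentence
```

A strict exact-match filter like this trades recall for precision: triples whose translation diverges from the translated sentence are discarded rather than repaired. Whether TransAlign filters, repairs, or does something else at this step is not stated in the abstract.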

Keywords: Open Information Extraction, Corpora, Dataset, Data, Artificial Intelligence, Data Translation, Data Alignment, OIE, NLP, Natural Language Processing

References

Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., and Vollgraf, R. (2019a). Flair: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.

Akbik, A., Bergmann, T., and Vollgraf, R. (2019b). Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728. Association for Computational Linguistics.

Akbik, A., Blythe, D., and Vollgraf, R. (2018). Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649.

Angeli, G., Premkumar, M. J. J., and Manning, C. D. (2015). Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354. Association for Computational Linguistics.

Banko, M., Cafarella, M., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2670–2676. University of Washington.

Bhardwaj, S., Aggarwal, S., and Mausam, M. (2019). CaRB: A crowdsourced benchmark for open IE. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6262–6267, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1651 https://aclanthology.org/D19-1651

Cabral, B., Souza, M., and Claro, D. B. (2022). PortNOIE: A neural framework for open information extraction for the Portuguese language. In Pinheiro, V., Gamallo, P., Amaro, R., Scarton, C., Batista, F., Silva, D., Magro, C., and Pinto, H., editors, Computational Processing of the Portuguese Language, pages 243–255, Cham. Springer International Publishing.

Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. (2008). Open information extraction from the web. Communications of the ACM, 51(12):68–74.

Fader, A., Soderland, S., and Etzioni, O. (2011). Identifying relations for open information extraction. https://www.aclweb.org/anthology/D11-1142

Honnibal, M. and Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Kolluru, K., Mohammed, M., Mittal, S., Chakrabarti, S., and Mausam (2022). Alignment-augmented consistent translation for multilingual open information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2502–2517, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.179 https://aclanthology.org/2022.acl-long.179

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv e-prints, page arXiv:1606.05250.

Solawetz, J. and Larson, S. (2021). LSOIE: A large-scale dataset for supervised open information extraction. arXiv preprint arXiv:2101.11177. https://arxiv.org/pdf/2101.11177.pdf

Stanovsky, G., Michael, J., Zettlemoyer, L., and Dagan, I. (2018). Supervised open information extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 885–895. Association for Computational Linguistics.
Published
2023-09-25
MELO, Alan Rios; CLARO, Daniela Barreiro. TransAlign: Translation and Alignment of Corpora for the Portuguese Language. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 382-387. DOI: https://doi.org/10.5753/stil.2023.234605.