TransAlign: Translation and Alignment of Corpora for the Portuguese Language

Abstract

In this paper, we present TransAlign, a novel framework for extending Open Information Extraction (OpenIE) to under-represented languages, such as Portuguese, using data from resource-rich languages. Using specific grammatical rules and high-quality translation models, we adapted LSOIE, a large-scale dataset, to Portuguese. This approach produced 21,161 high-quality triples for Portuguese OpenIE. The resulting dataset made it possible to train a new model that improved on the F1 scores of existing Portuguese systems by 50%.

Keywords: Open Information Extraction, Corpus, Dataset, Data, Artificial Intelligence, Data Translation, Data Alignment, OpenIE, NLP, Natural Language Processing

Published
25/09/2023
MELO, Alan Rios; CLARO, Daniela Barreiro. TransAlign: tradução e alinhamento de corpora para a língua portuguesa. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 382-387. DOI: https://doi.org/10.5753/stil.2023.234605.