Automatic induction of translation lexicons from aligned parallel corpus

  • Helena de M. Caseli USP
  • Maria das Graças V. Nunes USP

Abstract


Translation lexicons are among the most important linguistic resources for machine translation. However, building these bilingual sets of word and multiword correspondences takes considerable manual work. This paper describes a method that builds translation lexicons automatically by extracting knowledge from PoS-tagged, lexically aligned parallel corpora. Preliminary experiments were carried out on Brazilian Portuguese, Spanish and English parallel texts. Considering only the classes of entries that achieved the best results, 85% of pt–es and 89% of pt–en entries were plausible correspondences.
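
To make the general idea concrete, the following is a minimal Python sketch of how lexicon entries might be collected from a PoS-tagged, lexically aligned corpus. It is not the authors' implementation: the function `induce_lexicon`, the `min_count` threshold, the input layout (token lists plus index links) and the toy pt–es sentences are all assumptions made for illustration, and the sketch handles only single-word correspondences, not the multiword entries the paper also covers.

```python
from collections import Counter, defaultdict


def induce_lexicon(aligned_sentences, min_count=1):
    """Collect source -> target correspondences from aligned sentences.

    aligned_sentences: iterable of (src_tokens, tgt_tokens, links), where
    each token is a (word, pos_tag) tuple and links is a list of
    (src_index, tgt_index) pairs produced by some lexical aligner.
    This input layout is hypothetical, chosen only for the example.
    """
    counts = defaultdict(Counter)
    for src_tokens, tgt_tokens, links in aligned_sentences:
        for i, j in links:
            # Count how often each tagged source word links to each target word.
            counts[src_tokens[i]][tgt_tokens[j]] += 1

    lexicon = {}
    for src_entry, tgt_counter in counts.items():
        # Keep candidates seen at least min_count times, most frequent first.
        kept = [(tgt, n) for tgt, n in tgt_counter.most_common() if n >= min_count]
        if kept:
            lexicon[src_entry] = kept
    return lexicon


# Toy pt-es corpus (invented data, for illustration only).
corpus = [
    ([("a", "DET"), ("casa", "N"), ("azul", "ADJ")],
     [("la", "DET"), ("casa", "N"), ("azul", "ADJ")],
     [(0, 0), (1, 1), (2, 2)]),
    ([("a", "DET"), ("casa", "N")],
     [("la", "DET"), ("vivienda", "N")],
     [(0, 0), (1, 1)]),
    ([("casa", "N"), ("grande", "ADJ")],
     [("casa", "N"), ("grande", "ADJ")],
     [(0, 0), (1, 1)]),
]

lexicon = induce_lexicon(corpus)
strict_lexicon = induce_lexicon(corpus, min_count=2)
```

In this toy run, `lexicon[("casa", "N")]` lists both "casa" and "vivienda" as candidate translations, while `strict_lexicon` keeps only the candidate seen twice. A frequency threshold is just one possible filter; the abstract does not say which criteria the paper uses.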

Published
30/06/2007
CASELI, Helena de M.; NUNES, Maria das Graças V. Automatic induction of translation lexicons from aligned parallel corpus. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 5., 2007, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2007. p. 1669-1678.