A Language Modelling Tool for Statistical NLP

  • Daniel Bastos Pereira USP
  • Ivandré Paraboni USP

Abstract

In recent years the use of statistical language models (SLMs) has become widespread in most NLP fields. In this work we introduce jNina, a basic language modelling tool to aid the development of Machine Translation systems and many other text-generating applications. The tool allows for the quick comparison of multiple text outputs (e.g., alternative translations of a single source) based on a given SLM, and enables the user to build and evaluate her own SLMs from any corpora provided.
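
To illustrate the kind of comparison described above, the following is a minimal sketch of ranking alternative outputs with a bigram model. It is not jNina's implementation, whose interface is not described here: the names `BigramModel`, `perplexity` and `rank_candidates`, the whitespace tokenizer and the choice of add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence boundary markers


def tokenize(sentence):
    """Lowercase, split on whitespace, and add boundary markers."""
    return [BOS] + sentence.lower().split() + [EOS]


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing.

    Illustrative only; a practical SLM would use a stronger smoothing method.
    """

    def __init__(self, corpus):
        self.history = Counter()   # counts of each token used as a history
        self.bigrams = Counter()   # counts of (previous, current) pairs
        self.vocab = set()
        for sentence in corpus:
            tokens = tokenize(sentence)
            self.vocab.update(tokens)
            self.history.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def log_prob(self, prev, word):
        """Smoothed natural-log probability of `word` given `prev`."""
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.history[prev] + len(self.vocab)
        return math.log(numerator / denominator)

    def perplexity(self, sentence):
        """Per-token perplexity of a sentence; lower means a better fit."""
        tokens = tokenize(sentence)
        total = sum(self.log_prob(p, w) for p, w in zip(tokens, tokens[1:]))
        return math.exp(-total / (len(tokens) - 1))


def rank_candidates(model, candidates):
    """Return the candidates ordered from best to worst model fit."""
    return sorted(candidates, key=model.perplexity)


if __name__ == "__main__":
    # Toy training corpus and two alternative outputs for the same source.
    model = BigramModel(["the cat sat on the mat", "the dog sat on the rug"])
    for text in rank_candidates(model, ["rug the on sat cat the",
                                        "the cat sat on the rug"]):
        print(f"{model.perplexity(text):8.3f}  {text}")
```

Perplexity is used as the ranking key because it normalizes by length, so candidates with different numbers of tokens remain comparable.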

Published
30 June 2007
PEREIRA, Daniel Bastos; PARABONI, Ivandré. A Language Modelling Tool for Statistical NLP. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 5., 2007, Rio de Janeiro/RJ. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2007. p. 1679-1688.