Grammar Induction for Portuguese: the Contribution of Mutual Information to the Discovery of Dependency Relations

Diego Pedro Gonçalves da Silva; Thiago Alexandre Salgueiro Pardo

doi:10.5753/stil.2023.234178

Diego Pedro Gonçalves da Silva USP http://orcid.org/0009-0005-9642-9972
Thiago Alexandre Salgueiro Pardo USP http://orcid.org/0000-0003-2111-1319

Abstract

Grammar induction is the task of automatically learning syntactic structures from text. Few grammar induction studies have targeted the Portuguese language. In this paper, we reproduce the work of [Futrell et al. 2019] for Portuguese and extend it with an analysis of mutual information for specific syntactic relations. We use two annotated treebanks and run experiments with embeddings of varying dimensionality, and the results support the hypothesis that words in dependency relations have high mutual information.

Keywords: grammar induction, dependency grammar, mutual information
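
To make the notion of mutual information used in the abstract concrete, the following is a minimal sketch of a plug-in estimate computed over observed pairs. It is only an illustration under stated assumptions: the function name `mutual_information` and the toy tag pairs are hypothetical, and this is not necessarily the estimator used in the paper or in [Futrell et al. 2019].

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X;Y), in bits, from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)                   # counts of each (x, y) pair
    left = Counter(x for x, _ in pairs)      # marginal counts of x
    right = Counter(y for _, y in pairs)     # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count(x) * count(y))
        mi += (c / n) * math.log2(c * n / (left[x] * right[y]))
    return mi


# Toy, hypothetical (head tag, dependent tag) observations.
linked = [("VERB", "NOUN"), ("NOUN", "DET")] * 2    # second element determined by the first
unlinked = [("VERB", "NOUN"), ("VERB", "DET"),
            ("NOUN", "NOUN"), ("NOUN", "DET")]      # second element independent of the first

print(mutual_information(linked))
print(mutual_information(unlinked))
```

Under the hypothesis examined in the paper, pairs drawn from dependency arcs would be expected to give a higher estimate than pairs of syntactically unrelated words. On real data, plug-in estimates are biased upward for small samples, so any such comparison should use samples of similar size.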

References

Afonso, S., Bick, E., Haber, R., and Santos, D. (2002). Floresta sintá(c)tica: A treebank for Portuguese. In the Proceedings of the Conference on Language Resources and Evaluation (LREC), 1698–1703.

Baker, J. K. (1979). Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 132–132.

Bannard, C., Lieven, E., and Tomasello, M. (2009). Modeling children’s early grammatical knowledge. In the Proceedings of the National Academy of Sciences (PNAS), 17284–17289.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In the Proceedings of the 26th Annual International Conference on Machine Learning (ICML), 41–48.

Blei, D. M. and Lafferty, J. D. (2005). Correlated topic models. In the Proceedings of Advances in Neural Information Processing Systems (NIPS), 147–154.

Bod, R. (2007). Is the end of supervised parsing in sight? In the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), 400–407.

Bresnan, J., Asudeh, A., Toivonen, I., and Wechsler, S. (2015). Lexical-functional syntax. John Wiley & Sons.

Chomsky, N. (2014). Aspects of the Theory of Syntax, volume 11. MIT Press.

Cohen, S. B. and Smith, N. A. (2009). Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. In the Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 74–82.

da Costa, P. B. and Kepler, F. N. (2014). Semi-supervised parsing of Portuguese. In the Proceedings of the Computational Processing of the Portuguese Language - 11th International Conference (PROPOR), 102–107.

Dahl, V., Bel-Enguix, G., Tirado, V., and Miralles, J. E. (2023). Grammar induction for under-resourced languages: The case of Ch’ol. In the Proceedings of the Analysis, Verification and Transformation for Declarative Programming and Intelligent Systems - Essays Dedicated to Manuel Hermenegildo on the Occasion of His 60th Birthday, 113–132.

de Marneffe, M.-C., Manning, C. D., Nivre, J., and Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 255–308.

de Souza, E., Silveira, A., Cavalcanti, T., Castro, M. C., and Freitas, C. (2021). Petrogold corpus padrão ouro para o domínio do petróleo. In Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (STIL), 29–38.

Drozdov, A., Verga, P., Yadav, M., Iyyer, M., and McCallum, A. (2019). Unsupervised latent tree induction with deep inside-outside recursive autoencoders. In the Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 1129–1141.

Klein, D. and Manning, C. D. (2002). A generative constituent-context model for improved grammar induction. In the Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 128–135.

Futrell, R., Qian, P., Gibson, E., Fedorenko, E., and Blank, I. (2019). Syntactic dependencies correspond to word pairs with high mutual information. In the Proceedings of the Fifth International Conference on Dependency Linguistics (Depling), 3–13.

Han, W., Jiang, Y., and Tu, K. (2017). Dependency grammar induction with neural lexicalization and big training data. In the Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1683–1688.

Han, W., Jiang, Y., and Tu, K. (2019a). Enhancing unsupervised generative dependency parser with contextual information. In the Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), 5315–5325.

Han, W., Jiang, Y., and Tu, K. (2019b). Lexicalized neural unsupervised dependency parsing. Neurocomputing, 105–115.

Hartmann, N., Fonseca, E. R., Shulby, C., Treviso, M. V., Rodrigues, J. S., and Aluísio, S. M. (2017). Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In the Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology (STIL), 122–131.

Hoover, J. L., Du, W., Sordoni, A., and O’Donnell, T. J. (2021). Linguistic dependencies and statistical dependence. In the Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2941–2963.

Headden III, W. P., Johnson, M., and McClosky, D. (2009). Improving unsupervised dependency parsing with richer contexts and smoothing. In the Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 101–109.

Jiang, Y., Han, W., and Tu, K. (2016). Unsupervised neural dependency parsing. In the Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 763–771.

Klein, D. and Manning, C. D. (2004). Corpus-based induction of syntactic structure: Models of dependency and constituency. In the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), 478–485.

Lin, B., Yao, Z., Shi, J., Cao, S., Tang, B., Li, S., Luo, Y., Li, J., and Hou, L. (2022). Dependency parsing via sequence generation. Findings of the Association for Computational Linguistics, 7339–7353.

Linguateca (2023). Cetem publico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. Linguateca, [link], last accessed: June 2023.

Magerman, D. M. and Marcus, M. P. (1990). Parsing a natural language using mutual information statistics. In the Proceedings of the 8th National Conference on Artificial Intelligence (AAAI), 984–989.

Pardo, T. A. S., Duran, M. S., Lopes, L., Felippo, A. d., Roman, N. T., and Nunes, M. d. G. V. (2021). Porttinari: a large multi-genre treebank for Brazilian Portuguese. In the Proceedings of the XIII Symposium in Information and Human Language (STIL), 1–10. http://dx.doi.org/10.5753/stil.2021.17778

Pate, J. K. and Johnson, M. (2016). Grammar induction from (lots of) words alone. In the Proceedings of 26th International Conference on Computational Linguistics (COLING), 23–32.

Seginer, Y. (2007). Fast unsupervised incremental parsing. In the Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), 384–391.

Shen, Y., Tay, Y., Zheng, C., Bahri, D., Metzler, D., and Courville, A. C. (2021). Structformer: Joint unsupervised induction of dependency and constituency structure from masked language modeling. In the Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP), 7196–7209.

Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog: How “less is more” in unsupervised dependency parsing. In the Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 751–759.

Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2013). Breaking out of local optima with count transforms and model recombination: A study in grammar induction. In the Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1983–1995.

Stevenson, A. and Cordy, J. R. (2014). A survey of grammatical inference in software engineering. Science of Computer Programming, 444–459. http://dx.doi.org/10.1016/j.scico.2014.05.008

Theodor, C. C. and Siebert-Cole, E. (2020). Family tree of languages. [link], last accessed: June 2023.

Unold, O., Gabor, M., and Dyrka, W. (2020). Unsupervised grammar induction for revealing the internal structure of protein sequence motifs. In the Proceedings of Artificial Intelligence in Medicine: 18th International Conference on Artificial Intelligence in Medicine (AIME), 299–309.

Yang, S., Jiang, Y., Han, W., and Tu, K. (2020). Second-order unsupervised neural dependency parsing. In the Proceedings of the 28th International Conference on Computational Linguistics (COLING), 3911–3924. http://dx.doi.org/10.18653/v1/2020.coling-main.347