New Paths for Document Augmentation with Templates and Language Models

  • Lucas Wojcik UFPR
  • Luiz Coelho UFPR
  • Roger Granada UFPR
  • David Menotti UFPR

Abstract


Recent advances in natural language processing have percolated into the field of document recognition through new models and tasks, but the topic of data augmentation is rarely discussed. This is especially relevant in the document domain, where tasks with few training instances are of great importance to many fields, since well-annotated data is scarce and these models can even be used for the annotation task itself. To improve these scenarios, we present two new data augmentation techniques focused on maximizing the knowledge extracted from few instances. One targets documents with simple structure, using templates that encode the layout information. The other uses Large Language Models (LLMs) to rewrite document texts. These methods operate on two modalities: text and layout. We validate our techniques on the EPHOIE and FUNSD datasets, respectively. We show that our techniques improve on the baseline, according to the metrics for both single and combined training.
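As a minimal sketch of the template-based idea described above (not the authors' implementation — the field names, bounding boxes, and value pools below are hypothetical), a layout template can be treated as a set of fixed, labeled bounding boxes whose text content is resampled to produce new annotated instances:

```python
import random

# Hypothetical template: each field has a fixed bounding box (the layout
# information) and a label; only the text varies between generated samples.
TEMPLATE = [
    {"label": "name",  "bbox": (40, 20, 260, 45)},
    {"label": "date",  "bbox": (40, 60, 180, 85)},
    {"label": "score", "bbox": (40, 100, 140, 125)},
]

# Pools of plausible values per field (in practice these would be
# harvested from the few available training instances).
VALUE_POOLS = {
    "name":  ["Ana Silva", "Bruno Costa", "Carla Souza"],
    "date":  ["2021-03-04", "2022-11-30", "2020-07-15"],
    "score": ["87", "92", "75"],
}

def augment(template, pools, n, seed=0):
    """Generate n synthetic annotated documents from one template."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n):
        # Keep labels and boxes fixed; sample only the text content.
        doc = [
            {"label": f["label"], "bbox": f["bbox"],
             "text": rng.choice(pools[f["label"]])}
            for f in template
        ]
        docs.append(doc)
    return docs

samples = augment(TEMPLATE, VALUE_POOLS, n=5)
```

Each generated sample keeps the template's layout annotation intact, so it can feed a layout-aware model directly; the LLM-based variant would instead replace the sampled pools with rewritten text.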

Published
30/09/2024
WOJCIK, Lucas; COELHO, Luiz; GRANADA, Roger; MENOTTI, David. Novos Caminhos para Aumento de Documentos com templates e Modelos de Linguagem. In: WORKSHOP DE TRABALHOS EM ANDAMENTO - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 37., 2024, Manaus/AM. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 99-104. DOI: https://doi.org/10.5753/sibgrapi.est.2024.31652.
