New Approaches to Document Augmentation with Templates and Language Models
Abstract
Recent advances in Natural Language Processing are percolating into Document Understanding in the form of new models and tasks, yet the topic of data augmentation is often left untouched. This is especially relevant in the document domain, where few-shot fine-tuning is of great practical importance: properly annotated data is very scarce, and these systems are often used for the annotation task itself. To thrive in these scenarios, we present two new data augmentation techniques that aim to maximize the knowledge extracted from very few instances. The first targets simple structured documents, using templates that encode the layout information. The second uses Large Language Models (LLMs) to rewrite document texts. Each method thus operates on a single modality, layout or text. We validate our approaches on the EPHOIE and FUNSD datasets, respectively. Our techniques improve over the baseline methods according to the metrics under both simple and joint training.
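The abstract only names the two techniques, so as a rough illustration of the first (template-based augmentation), here is a minimal sketch. It assumes a template is a list of labeled bounding boxes and that new text values are sampled from a small pool; all names, fields, and values below are hypothetical, not the paper's actual format:

```python
import random

# Hypothetical template: the layout of a simple structured document,
# i.e. which field goes in which bounding box (x0, y0, x1, y1).
TEMPLATE = [
    {"label": "name", "box": (40, 60, 220, 80)},
    {"label": "date", "box": (40, 100, 160, 120)},
]

# Hypothetical value pool to fill each field with.
VALUE_POOL = {
    "name": ["Alice Zhang", "Bob Lee", "Carol Wu"],
    "date": ["2021-03-14", "2022-11-02"],
}

def augment(template, pool, rng=random):
    """Instantiate one synthetic annotated document from a layout template.

    The layout (boxes and labels) is preserved; only the text varies,
    so every generated sample comes with correct annotations for free.
    """
    return [
        {"label": slot["label"], "box": slot["box"],
         "text": rng.choice(pool[slot["label"]])}
        for slot in template
    ]

doc = augment(TEMPLATE, VALUE_POOL)
```

Each call to `augment` yields one new training instance whose entity labels and positions are known by construction, which is the appeal of template-driven augmentation in annotation-scarce settings.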