A Form Understanding Approach to Printed and Structured Engineering Documentation

  • Gabriel L. Santos FURG
  • Vanessa T. Silva FURG
  • Laura A. Dalmolin FURG
  • Ricardo N. Rodrigues FURG
  • Paulo L. J. Drews FURG
  • Nelson L. Duarte Filho FURG


A significant number of companies still depend on printed documents, such as healthcare reports, engineering specifications, or historical documents. These documents vary widely in layout and content, so each document structure requires a different approach, which makes information extraction costly and inefficient. We classify documents into three categories: non-structured, semi-structured, and structured, the last of which is the focus of the present work. We propose a pattern recognition method for structured documents that anchors question-answer objects to one another and uses a system of hypotheses and a probability distribution to identify which predefined model a document belongs to. The method therefore acts as a system for both identifying structured documents and extracting their content. It shows promising results for pattern recognition across all document models, with 78% to 97% of objects extracted correctly.
Keywords: Layout, Medical services, Documentation, Companies, Information retrieval, Probability distribution, Pattern recognition, Form Understanding, Text Detection, Spatial Layout Analysis
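
The abstract gives no implementation details, so the following is only a minimal sketch of how question-answer anchoring and model identification could fit together. Every name here (`Box`, `Field`, `find_answer`, `model_posterior`, `extract`), the distance tolerance, and the scoring rule (fraction of supported fields, normalised over models) are assumptions made for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A detected text region with its top-left position (hypothetical type)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Field:
    """One hypothesis: an answer lies at offset (dx, dy) from a question label."""
    question: str
    dx: float
    dy: float


def find_answer(boxes, field, tol=20.0):
    """Return the box nearest the expected answer position, or None if unsupported."""
    anchor = next((b for b in boxes if b.text == field.question), None)
    if anchor is None:
        return None
    ex, ey = anchor.x + field.dx, anchor.y + field.dy
    candidates = [b for b in boxes if b is not anchor]
    if not candidates:
        return None
    best = min(candidates, key=lambda b: math.hypot(b.x - ex, b.y - ey))
    if math.hypot(best.x - ex, best.y - ey) > tol:
        return None
    return best


def model_posterior(boxes, models):
    """Assumed scoring: share of supported fields per model, normalised to sum to 1."""
    scores = {
        name: sum(find_answer(boxes, f) is not None for f in fields) / len(fields)
        for name, fields in models.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No hypothesis is supported: fall back to a uniform distribution.
        return {name: 1.0 / len(models) for name in models}
    return {name: s / total for name, s in scores.items()}


def extract(boxes, models):
    """Pick the most probable model, then read the answers its fields point to."""
    posterior = model_posterior(boxes, models)
    best = max(posterior, key=posterior.get)
    answers = {}
    for f in models[best]:
        box = find_answer(boxes, f)
        if box is not None:
            answers[f.question] = box.text
    return best, posterior, answers


# Two hypothetical document models and the text boxes of one scanned page.
models = {
    "pump_sheet": [Field("Tag:", 100, 0), Field("Pressure:", 100, 0)],
    "valve_sheet": [Field("Tag:", 0, 80), Field("Size:", 0, 80)],
}
boxes = [
    Box("Tag:", 10, 10), Box("P-101", 110, 10),
    Box("Pressure:", 10, 50), Box("12 bar", 112, 51),
]
best, posterior, answers = extract(boxes, models)
print(best, posterior, answers)
```

In this sketch a single pass serves both purposes named in the abstract: the same anchored hypotheses that vote for a document model are reused to read out the answer objects once the model is chosen.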
