Unsupervised Information Extraction by Text Segmentation
Abstract
In this work we propose, implement, and evaluate a new unsupervised approach to the problem of Information Extraction by Text Segmentation (IETS). Our approach relies on information available in pre-existing data to learn how to associate segments of the input with attributes of a given domain, using a very effective set of content-based features. The effectiveness of these content-based features is also exploited to learn structure-based features directly from the test data, with no prior human-driven training, a capability unique to our approach. Based on this approach, we have produced a number of results that address the IETS problem. Our experiments indicate that the approach yields high-quality results when compared to state-of-the-art approaches and that it can properly support IETS methods in a number of real applications.
References
Borkar, Deshmukh, and Sarawagi (2001). Automatic Segmentation of Text into Structured Records. In Int. Conf. on Manag. of Data (SIGMOD), pages 175–186.
Cortez (2012). Unsupervised Approach for Information Extraction by Text Segmentation. PhD thesis, Universidade Federal do Amazonas.
Cortez, et al. (2010a). ONDUX: On-Demand Unsupervised Learning for Information Extraction. In Int. Conf. on Manag. of Data (SIGMOD), pages 807–818.
Cortez, et al. (2011a). Joint unsupervised structure discovery and information extraction. In Int. Conf. on Manag. of Data (SIGMOD), pages 541–552.
Cortez, et al. (2009). A flexible approach for extracting metadata from bibliographic citations. J. American Soc. for Inf. Science and Tech. (JASIST), 60(6):1144–1158.
Cortez, et al. (2010b). Unsupervised strategies for information extraction by text segmentation. In SIGMOD PhD Workshop on Innov. Database Res., pages 49–54.
Cortez, et al. (2011b). Lightweight methods for large-scale product categorization. J. American Soc. for Inf. Science and Tech. (JASIST), 62(9):1839–1848.
Evangelista, et al. (2010). Adaptive and flexible blocking for record linkage tasks. J. Inf. and Data Manag. (JIDM), 1(2):167.
Evangelista, et al. (2009). Blocagem adaptativa e flexível para o pareamento aproximado de registros. In Simp. Bras. de Banco de Dados (SBBD), pages 61–75.
Laender, et al. (2011). Building a research social network from an individual perspective. In Joint Conf. on Dig. Libraries (JCDL), pages 427–428.
Mansuri and Sarawagi (2006). Integrating Unstructured Data into Relational Databases. In Int. Conf. on Data Engineering (ICDE), pages 29–41.
Porto, et al. (2011). Unsupervised information extraction with the ONDUX tool. In Simp. Bras. de Banco de Dados (SBBD).
Sarawagi (2008). Information extraction. Found. Trends in Databases, 1(3):261–377.
Serra, et al. (2011). On using Wikipedia to build knowledge bases for information extraction by text segmentation. J. Inf. and Data Manag. (JIDM), 2(3):259.
Toda, et al. (2010). A probabilistic approach for automatically filling form-based web interfaces. Proceedings of the VLDB Endowment, 4(3):151–160.
Toda, et al. (2009). Automatically filling form-based web interfaces with free text inputs. In Int. World Wide Web Conf. (WWW), pages 1163–1164.
Zhao, et al. (2008). Exploiting structured reference data for unsupervised text segmentation with conditional random fields. In SIAM Int. Conf. on Data Min., pages 420–431.
Published
2013-07-23
How to Cite
CORTEZ, Eli; SILVA, Altigran Soares da. Unsupervised Information Extraction by Text Segmentation. In: THESIS AND DISSERTATION CONTEST (CTD), 26., 2013, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2013. p. 95-100. ISSN 2763-8820.
