Data extraction from textual sources: an approach for enriching interlinked open data
Abstract
In the Web of Data, data items are interconnected and associated with descriptive annotations, taking advantage of a representation in the form of triples. In this context, documents and other textual sources can be annotated so that they are incorporated into this universe as resources, or they can serve as sources from which new triples are extracted. This article presents an approach for data extraction and triple generation from texts with specific styles, aiming to associate and connect the extracted data with existing databases. The approach was applied and evaluated in the context of a portal with information on the consumption of pesticides in Brazil.
