Data extraction from textual sources: an approach for enriching linked open data

Karen Torres Teixeira (UFRJ); Maria Luiza Machado Campos (UFRJ); João C. P. da Silva (UFRJ)

DOI: https://doi.org/10.5753/semish.2018.3435

Abstract

In the Web of Data, data items are interconnected and associated with descriptive annotations in the form of vocabularies, taking advantage of a triple-based representation. In this context, documents and other textual sources can be annotated so that they are incorporated into this universe as resources, or serve as a basis for extracting new triples. This article presents an approach for extracting data and generating triples from texts with specific styles, aiming to enrich linked open data by associating and linking it to existing datasets. The approach was applied and evaluated in the context of a portal with information on pesticide consumption in Brazil.
