NAEWI - Non-rendering Approach to Extract Web Information

  • Marcelo C. Nunes Universidade Federal de Santa Catarina (UFSC)
  • Carina F. Dorneles Universidade Federal de Santa Catarina (UFSC)

Abstract


Extracting information from Web pages is an important task that facilitates the creation of knowledge bases. Because a Web page is designed to be pleasant for the user, yet is rendered from an HTML DOM tree, identifying and extracting its information remains a major challenge. To address this challenge, this work proposes an approach that combines the information in the DOM tree with visual information, extracted as metadata from the page's HTML elements, to classify and extract the relevant content of a Web page. To that end, a textual model will be created to represent the visual identity of each page element, emulating the visual context of the elements and their hierarchy in the page so that information can be extracted without a browser having to render the page. To classify the elements, the bidirectional language model ELMo will be used to contextualize and identify the individual characteristics of each element type.
Keywords: semi-structured web extraction, web information extraction
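
As a rough illustration of a textual model built without rendering, the sketch below walks the HTML with Python's standard `html.parser` and emits one text descriptor per element, combining its position in the DOM hierarchy with visual metadata read from its attributes. The descriptor format, the `VisualTextModel` class, and the `describe` helper are assumptions made for this example and are not taken from the paper; the ELMo classification step is not covered.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisualTextModel(HTMLParser):
    """Collects one textual descriptor per element, without rendering.

    Hypothetical descriptor format: "path=... depth=... [class=...] [style=...]".
    """

    def __init__(self):
        super().__init__()
        self.path = []         # tags of the currently open ancestors
        self.descriptors = []  # one descriptor string per element seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parts = ["path=" + "/".join(self.path + [tag]),
                 "depth=" + str(len(self.path))]
        if attrs.get("class"):
            # Join multiple class names with "." so the token has no spaces.
            parts.append("class=" + ".".join(attrs["class"].split()))
        if attrs.get("style"):
            # Normalise inline CSS declarations to "prop:value" tokens.
            decls = [d.strip().replace(" ", "")
                     for d in attrs["style"].split(";") if d.strip()]
            parts.append("style=" + ",".join(decls))
        self.descriptors.append(" ".join(parts))
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.path:
            while self.path.pop() != tag:
                pass


def describe(html):
    """Return the list of element descriptors for an HTML string."""
    model = VisualTextModel()
    model.feed(html)
    return model.descriptors
```

Calling `describe` on a page's HTML yields a sequence of strings that a language model could consume in place of a rendered layout; a real system would likely also need styles from stylesheets, which this sketch ignores.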

Published
19/09/2022
NUNES, Marcelo C.; DORNELES, Carina F. NAEWI - Non-rendering Approach to Extract Web Information. In: WORKSHOP DE TESES E DISSERTAÇÕES (WTDBD) - SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 37., 2022, Búzios. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 161-167. DOI: https://doi.org/10.5753/sbbd_estendido.2022.21859.