EDREW - Enhanced Data Representation for Extraction in Web
Abstract
Extracting data from Web sites remains a challenge because pages have complex and changeable structures. The reason is simple: Web pages are designed to be visually friendly to human readers, not to support data extraction. In addition, each page organizes its content in its own way on top of the HTML DOM, and because designers follow their own conventions, page structures diverge widely. Identifying and extracting information therefore remains a significant barrier. To overcome this challenge, we propose a new approach called EDREW, which uses information from the HTML DOM structure, together with information derived from the HTML elements themselves, to represent the context of each element on the page without rendering it. We use the ELMo model to extract information and classify it as noise or useful content. Experiments were performed on the public Structured Web Data Extraction (SWDE) dataset and on a new dataset created for this work from the most recent versions of the SWDE pages. EDREW outperformed the baselines on the original SWDE dataset and extracted twice as much page content on our updated version of SWDE.
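To make the idea of a render-free element context concrete, the following is a minimal sketch, not the EDREW implementation. It assumes that an element's context can be approximated by its DOM tag path plus the attributes of its enclosing element. The names `TextNodeCollector`, `extract_text_nodes`, and `to_context_string` are hypothetical, and the serialization format is an illustrative choice. The ELMo-based noise/content classification step is not shown.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is never shown to a reader.
SKIP_TAGS = {"script", "style"}


class TextNodeCollector(HTMLParser):
    """Collects each text node with its DOM tag path and parent attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # (tag, attrs dict) for each currently open element
        self.nodes = []  # collected text-node records

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or any(t in SKIP_TAGS for t, _ in self.stack):
            return
        path = "/".join(t for t, _ in self.stack)
        attrs = self.stack[-1][1] if self.stack else {}
        self.nodes.append({"path": path, "attrs": attrs, "text": text})


def extract_text_nodes(html):
    """Parse raw HTML (no browser, no rendering) and return text-node records."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def to_context_string(node):
    """Hypothetical serialization: tag path, class attribute, then the text."""
    css_class = node["attrs"].get("class") or ""
    return f"{node['path']} [{css_class}] {node['text']}"
```

Each context string could then be passed to an embedding model and a classifier that labels the node as noise or useful content. Which features EDREW actually encodes is not specified in this abstract.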
Keywords:
semi-structured web extraction, web information extraction
References
Neil Anderson and Jun Hong. 2013. Visually Extracting Data Records from the Deep Web. In Proceedings of the 22nd International Conference on World Wide Web (Rio de Janeiro, Brazil) (WWW ’13 Companion). ACM, New York, NY, USA, 1233–1238. https://doi.org/10.1145/2487788.2488156
Chia-Hui Chang and Shao-Chen Lui. 2001. IEPAD: Information Extraction Based on Pattern Discovery. In Proceedings of the 10th International Conference on World Wide Web (Hong Kong, Hong Kong) (WWW ’01). ACM, New York, NY, USA, 681–688. https://doi.org/10.1145/371920.372182
Eric Crestan and Patrick Pantel. 2011. Web-Scale Table Census and Classification. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (Hong Kong, China) (WSDM ’11). ACM, New York, NY, USA, 545–554. https://doi.org/10.1145/1935826.1935904
Doug Downey, Oren Etzioni, Stephen Soderland, and Daniel S Weld. 2004. Learning text patterns for web information extraction and assessment. In AAAI-04 workshop on adaptive text extraction and mining. 50–55
Ruslan R. Fayzrakhmanov, Emanuel Sallinger, Ben Spencer, Tim Furche, and Georg Gottlob. 2018. Browserless Web Data Extraction: Challenges and Opportunities. In Proceedings of the 2018 World Wide Web Conference (Lyon, France) (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1095–1104. https://doi.org/10.1145/3178876.3186008
Jinsong Guo, Valter Crescenzi, Tim Furche, Giovanni Grasso, and Georg Gottlob. 2019. RED: Redundancy-Driven Data Extraction from Result Pages?. In The World Wide Web Conference (San Francisco, CA, USA) (WWW ’19). ACM, New York, NY, USA, 605–615. https://doi.org/10.1145/3308558.3313529
Qiang Hao, Rui Cai, Yanwei Pang, and Lei Zhang. 2011. From one tree to a forest: a unified solution for structured web data extraction. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 775–784.
Hao He, Lulu Chen, and Wenpu Guo. 2017. Research on web application vulnerability scanning system based on fingerprint feature. In 2017 International Conference on Mechanical, Electronic, Control and Automation Engineering (MECAE 2017). Atlantis Press, 150–155
Bing Liu, Robert Grossman, and Yanhong Zhai. 2003. Mining Data Records in Web Pages. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, D.C.) (KDD ’03). ACM, New York, NY, USA, 601–606. https://doi.org/10.1145/956750.956826
Wei Liu, Xiaofeng Meng, and Weiyi Meng. 2010. ViDE: A Vision-Based Approach for Deep Web Data Extraction. IEEE Transactions on Knowledge and Data Engineering 22, 3 (2010), 447–460. https://doi.org/10.1109/TKDE.2009.109
Bhavdeep Mehta and Meera Narvekar. 2015. DOM tree based approach for Web content extraction. In 2015 International Conference on Communication, Information & Computing Technology (ICCICT). 1–6. https://doi.org/10.1109/ICCICT.2015.7045706
Gengxin Miao, Junichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, and Louise E. Moser. 2009. Extracting Data Records from the Web Using Tag Path Clustering. In Proceedings of the 18th International Conference on World Wide Web (Madrid, Spain) (WWW ’09). ACM, New York, NY, USA, 981–990. https://doi.org/10.1145/1526709.1526841
Sangmesh S. Pandarge and V. A. Chakkarwar. 2017. Automatic web information extraction and alignment using CTVS technique. In 2017 International conference of Electronics, Communication and Aerospace Technology (ICECA), Vol. 2. 94–99. https://doi.org/10.1109/ICECA.2017.8212771
Kyounghyun Park, Minh Chau Nguyen, and Heesun Won. 2015. Web-based collaborative big data analytics on big data as a service platform. In 2015 17th International Conference on Advanced Communication Technology (ICACT). 564–567. https://doi.org/10.1109/ICACT.2015.7224859
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 2227–2237. https://doi.org/10.18653/v1/N18-1202
Nur A Rakhmawati, Sayekti Harits, Deny Hermansyah, and Muhammad Ariful Furqon. 2018. A survey of web technologies used in Indonesia local governments. Sisfo 7, 3 (2018), 213–222
Kai Simon and Georg Lausen. 2005. ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (Bremen, Germany) (CIKM ’05). ACM, New York, NY, USA, 381–388. https://doi.org/10.1145/1099554.1099672
Stephen Soderland. 1999. Learning information extraction rules for semi-structured and free text. Machine learning 34, 1 (1999), 233–272
Fauzia Yasmeen Tani, Dewan Md Farid, and Mohammad Zahidur Rahman. 2012. Ensemble of decision tree classifiers for mining web data streams. International Journal of Applied Information Systems 1, 2 (2012), 30–36
Chun-Hsiung Tseng. 2014. Crowd aided Web search. In 2014 6th International Conference on Knowledge and Smart Technology (KST). 1–6. https://doi.org/10.1109/KST.2014.6775384
Roberto Panerai Velloso. 2014. Algoritmo não supervisionado para segmentação e remoção de ruído de páginas web utilizando tag paths. (2014)
Roberto Panerai Velloso and Carina F Dorneles. 2020. Optimized Extraction of Records from the Web Using Signal Processing and Machine Learning. In Anais do XXXV Simpósio Brasileiro de Bancos de Dados. SBC, 109–120
Fok Kar Wai, Lim Wee Yong, Vrizlynn L. L. Thing, and Victor Pomponiu. 2017. CMDR: Classifying nodes for mining data records with different HTML structures. In TENCON 2017 - 2017 IEEE Region 10 Conference. 1862–1862. https://doi.org/10.1109/TENCON.2017.8228162
Tim Weninger and William H. Hsu. 2008. Text Extraction from the Web via Text-to-Tag Ratio. In 2008 19th International Workshop on Database and Expert Systems Applications. 23–28. https://doi.org/10.1109/DEXA.2008.12
Chenhao Xie, Wenhao Huang, Jiaqing Liang, Chengsong Huang, and Yanghua Xiao. 2021. WebKE: Knowledge Extraction from Semi-Structured Web with Pre-Trained Markup Language Model. ACM, New York, NY, USA, 2211–2220. https://doi.org/10.1145/3459637.3482491
Yuqiang Xie, Luxi Xing, Wei Peng, and Yue Hu. 2021. IIE-NLP-Eyas at SemEval-2021 Task 4: Enhancing plm for recam with special tokens, re-ranking, siamese encoders and back translation. arXiv preprint arXiv:2102.12777 (2021)
Yanhong Zhai and Bing Liu. 2005. Web Data Extraction Based on Partial Tree Alignment. In Proceedings of the 14th International Conference on World Wide Web (Chiba, Japan) (WWW ’05). ACM, New York, NY, USA, 76–85. https://doi.org/10.1145/1060745.1060761
Yichao Zhou, Ying Sheng, Nguyen Vo, Nick Edmonds, and Sandeep Tata. 2022. Learning Transferable Node Representations for Attribute Extraction from Web Documents. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1479–1487
Published
23/10/2023
How to Cite
NUNES, Marcelo C.; DORNELES, Carina F. EDREW - Enhanced Data Representation for Extraction in Web. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 29., 2023, Ribeirão Preto/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 230–237.