Extraction of conference data from the web
Abstract
Choosing the most suitable conference for submitting a paper depends on several factors, including: (i) the topic of the paper must be among the conference's topics of interest; (ii) the submission deadline must allow enough time to write the paper; and (iii) the quality or impact of the conference. These factors, combined with the existence of thousands of conferences, make the search for the right event very time-consuming, especially when researching a new area. To help researchers find conferences, this paper presents a method for retrieving and extracting data from conference websites. Our method combines conference URL identification with deadline extraction. The retrieved data are stored in a database that can be searched with an online tool. The paper also reports on experiments that evaluate the quality of the extracted data, focusing on the deadlines.
Keywords:
Data extraction, URL identification, Qualis Table
References
Fábio L Correia, Rui FS Amaro, Luís Sarmento, and Rosaldo JF Rossetti. Allcall: An automated call for paper information extractor. In Information Systems and Technologies (CISTI), 2010 5th Iberian Conference on, pages 1–4, 2010.
Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S Weld. Open information extraction from the web. Communications of the ACM, 51(12):68–74, 2008.
Lei Fu, Yingju Xia, Yao Meng, and Hao Yu. Conditional random fields model for web content extraction. In Computing in the Global Information Technology (ICCGI), pages 30–34, 2010.
Tomas Gogar, Ondrej Hubacek, and Jan Sedivy. Deep Neural Networks for Web Page Information Extraction, pages 154–163. 2016.
Yunfei Gong and Qiang Liu. Automatic web page segmentation and information extraction using conditional random fields. In Computer Supported Cooperative Work in Design (CSCWD), pages 334–340, 2012.
John Lafferty, Andrew McCallum, Fernando Pereira, et al. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the international conference on machine learning, ICML, volume 1, pages 282–289, 2001.
Xinyu Li, Roya Rastan, John Shepherd, and Hye Young Paik. Automatic affiliation extraction from calls-for-papers. In Proceedings of the Workshop on Automated Knowledge Base Construction, AKBC ’13, pages 97–102, 2013. ISBN 978-1-4503-2411-3.
Jochen Mattes. Automated meta-data extraction for confsearch. Technical report, 2011.
Hoa Nguyen, Thanh Nguyen, and Juliana Freire. Learning to extract form labels. Proceedings of the VLDB Endowment, 1(1):684–694, 2008.
David Pinto, Andrew McCallum, Xing Wei, and W Bruce Croft. Table extraction using conditional random fields. In Proceedings of the annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 235–242, 2003.
Elaine Pereira de Souza and Maria Carlota de Souza Paula. Qualis: a base de qualificação dos periódicos científicos utilizada na avaliação capes. InfoCAPES Boletim Informativo, 10(2), 2002.
Henry S Vieira, Altigran S da Silva, Marco Cristo, and Edleno S de Moura. A self-training crf method for recognizing product model mentions in web forums. In European Conference on Information Retrieval, pages 257–264, 2015.
Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, and Wei-Ying Ma. 2d conditional random fields for web information extraction. In Proceedings of the International Conference on Machine Learning, pages 1044–1051, 2005.
Published
2017-10-02
How to Cite
GARCIA, Cássio Alan; P. MOREIRA, Viviane. Extraction of conference data from the web. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 32., 2017, Uberlândia/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2017. p. 64-75. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2017.171356.
