Collecting, extracting and storing web research survey questionnaires data

Authors

  • Carina F. Dorneles, Universidade Federal de Santa Catarina
  • Gilney N. Mathias, Universidade Federal de Santa Catarina

DOI:

https://doi.org/10.5753/jidm.2022.2318

Keywords:

HTML research questionnaires, dataset, crawler, data extraction

Abstract

Companies and institutions use survey questionnaires to evaluate items or products, to gauge employee or customer satisfaction, or to collect any other data they consider helpful. Questionnaires can also supply data for research studies. Creating one raises several problems: deciding which questions to ask, how to phrase them, and how to organize them. Many research communities, especially in healthcare, maintain publicly accessible repositories of questionnaires that help professionals and researchers analyze the results of questions, add new questions, or even point out nonsensical questions. In this paper, we describe: (i) a web crawler, which scans the Web for sites that may contain questionnaires; (ii) an extractor, which extracts the questionnaires from the pages collected by the crawler and saves them into a relational database; and (iii) the public dataset we have created to persist the questionnaires. The resulting database can be used to analyze these data and/or as a centralized base of examples for preparing new questionnaires or reusing existing questions. Our experiments show that the crawler achieved 94.47% and the extractor achieved a precision between 90% and 92%.
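
To make the extractor-to-database step more concrete, the following is a minimal sketch only, not the authors' implementation. The abstract does not describe the extraction rules or the database schema, so everything here is an assumption made for illustration: the idea that question text sits in `<label>` elements inside a `<form>`, the hypothetical helpers `extract_questions` and `store_questionnaire`, the two-table layout (`questionnaire`, `question`), and the example URL. It uses only the Python standard library and SQLite as a stand-in relational database.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample page; real questionnaire pages are far less regular.
SAMPLE_PAGE = """
<html><body>
<p>Not part of the form.</p>
<form action="/submit">
  <label for="q1">How satisfied are you with the service?</label>
  <select id="q1"><option>Very</option><option>Not at all</option></select>
  <label for="q2">Would you recommend us
     to a friend?</label>
  <input type="radio" name="q2" value="yes">
  <input type="radio" name="q2" value="no">
</form>
</body></html>
"""


class LabelExtractor(HTMLParser):
    """Collects the text of every <label> found inside a <form> (assumed heuristic)."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.in_label = False
        self.buffer = []
        self.questions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "label" and self.in_form:
            self.in_label = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        elif tag == "label" and self.in_label:
            # Collapse line breaks and repeated spaces inside the label text.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.questions.append(text)
            self.in_label = False

    def handle_data(self, data):
        if self.in_label:
            self.buffer.append(data)


def extract_questions(html):
    """Return the label texts of a page, in document order."""
    parser = LabelExtractor()
    parser.feed(html)
    parser.close()
    return parser.questions


def store_questionnaire(conn, url, questions):
    """Persist one questionnaire and its ordered questions; returns its row id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS questionnaire ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS question ("
        "id INTEGER PRIMARY KEY, "
        "questionnaire_id INTEGER NOT NULL REFERENCES questionnaire(id), "
        "position INTEGER NOT NULL, text TEXT NOT NULL)"
    )
    cur = conn.execute("INSERT INTO questionnaire (url) VALUES (?)", (url,))
    questionnaire_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO question (questionnaire_id, position, text) VALUES (?, ?, ?)",
        [(questionnaire_id, i, q) for i, q in enumerate(questions, start=1)],
    )
    conn.commit()
    return questionnaire_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    found = extract_questions(SAMPLE_PAGE)
    store_questionnaire(conn, "http://example.org/survey", found)
    for row in conn.execute("SELECT position, text FROM question ORDER BY position"):
        print(row)
```

Keeping questions in their own table with a `position` column is one simple way to preserve question order while still allowing individual questions to be queried and reused; the schema of the published dataset may differ.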

Published

2022-08-15

How to Cite

Dorneles, C. F., & Mathias, G. N. (2022). Collecting, extracting and storing web research survey questionnaires data. Journal of Information and Data Management, 13(1). https://doi.org/10.5753/jidm.2022.2318

Section

Dataset Showcase Workshop 2021 - Extended Papers