Optimized Extraction of Records from the Web Using Signal Processing and Machine Learning

  • Roberto Panerai Velloso, Federal University of Santa Catarina
  • Carina F. Dorneles, Federal University of Santa Catarina

Abstract


In this paper, we present an optimization of our previous approach to extracting records from web pages. The optimization lowers the asymptotic upper bound from O(n log n) to O(n) with no loss in efficacy: the qualitative results are unchanged. Compared with our previous work, we achieve a 47% improvement in runtime efficiency and the same 95% F-score.
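The abstract does not say which F-score variant is reported. As a minimal sketch, assuming the balanced F-score (F1, the harmonic mean of precision and recall) computed over extracted records, the metric could be calculated as follows. The function name and the counts are hypothetical and are not taken from the paper.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct extractions: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not the paper's data):
# 95 records extracted correctly, 5 spurious, 5 missed.
print(round(f_score(95, 5, 5), 2))
```

With equal precision and recall, the F-score equals both, so any imbalance between spurious and missed records pulls the score below their arithmetic mean.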

Keywords: web mining, record extraction, structure detection, information retrieval, record alignment, content detection, noise removal

Published
2020-09-28
VELLOSO, Roberto Panerai; DORNELES, Carina F. Optimized Extraction of Records from the Web Using Signal Processing and Machine Learning. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 35., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 109-120. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2020.13629.