Enhancing Classification Models with Enriched Data via Web Scraping: A Case Study of the Dog Breed Identification Competition

  • Marcos V. M. Faria (IFES)
  • Ludmila Dias (IFES)
  • Eduardo O. P. Ferreira (IFES)
  • Thiago M. Paixão (IFES)
  • Francisco A. Boldt (IFES)

Abstract


This article studies the use of web scraping for automated data extraction from the web, with the aim of improving classification models by enriching the training dataset. Our experiments used two datasets: the one from the Kaggle “Dog Breed Identification” competition, which served as the case study, and a second one obtained by merging it with a dataset extracted via scraping. For extraction we used the Puppeteer library, with auxiliary tools at specific stages of the process. The classification model was Xception. Results were compared using Accuracy, Recall, Precision, and F1 Score. We conclude that adding data via web scraping can improve classification performance, provided that the data is properly cleaned.
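
As an illustration of the four comparison metrics named above, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `macro_metrics`, the toy breed labels, and the choice of macro averaging across classes are all assumptions, since the abstract does not state how the per-class scores were aggregated.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Hypothetical helper for illustration; macro averaging is an assumption.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted wrongly
            fn[t] += 1      # class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with made-up labels (not data from the study).
y_true = ["beagle", "beagle", "pug", "pug"]
y_pred = ["beagle", "pug", "pug", "pug"]
scores = macro_metrics(y_true, y_pred)
```

Running the same function on predictions from a model trained on the original dataset and on the enriched dataset would give the kind of side-by-side comparison the abstract describes.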

Published
2024-10-17
FARIA, Marcos V. M.; DIAS, Ludmila; FERREIRA, Eduardo O. P.; PAIXÃO, Thiago M.; BOLDT, Francisco A. Enhancing Classification Models with Enriched Data via Web Scraping: A Case Study of the Dog Breed Identification Competition. In: REGIONAL SCHOOL OF INFORMATICS OF ESPÍRITO SANTO (ERI-ES), 9., 2024, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 127-136. DOI: https://doi.org/10.5753/eries.2024.244695.