A Deep Learning–based Data Lake architecture for searching parasitic images of socially determined diseases

  • João Gabriel Marques de Lima UFAL
  • Danilo Fernandes UFAL
  • Fabiane da Silva Queiroz UFAL
  • André L. L. Aquino UFAL

Abstract


This work compares two data pipelines for medical images from the SHdataset, one based on image files in PNG format and one based on Deep Lake, using performance benchmarks and a Deep Metric Learning (DML) case study to analyze the trade-offs between performance and effectiveness. Results indicate that Deep Lake iterated over the data faster but required 59.2% more storage. Both pipelines produced models of comparable quality, although the image file-based approach yielded a feature space with marginally better quantitative separability. We conclude that modern formats trade management benefits against storage and optimization costs while largely preserving model effectiveness.

Published
2025-08-12
LIMA, João Gabriel Marques de; FERNANDES, Danilo; QUEIROZ, Fabiane da Silva; AQUINO, André L. L. A Deep Learning–based Data Lake architecture for searching parasitic images of socially determined diseases. In: REGIONAL SCHOOL ON COMPUTING OF BAHIA, ALAGOAS, AND SERGIPE (ERBASE), 25., 2025, Lagarto/SE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 62-71. DOI: https://doi.org/10.5753/erbase.2025.13015.