Automated Data Integration and Labeling on the Cnidarian Physalia physalis, using Geolocation as a Reference

  • Lisiane Reips, Federal University of Paraná
  • Carmem Satie Hara, Federal University of Paraná

Abstract


Classification techniques in machine learning have been applied effectively to text and image recognition. For any application, however, models must be trained and tested on data. To achieve good classification performance, these data must be reliably labeled, which makes the process expensive and time-consuming. In this paper, we propose an approach to reduce the cost of manually labeling a database of Portuguese man-of-war (Physalia physalis) sightings on Brazilian beaches. The technique integrates Instagram posts with newspaper articles based on their temporal and spatial proximity. The ultimate goal is to use the labeled data to train a machine learning classifier.
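The integration step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Record` type, the `label_posts` function, the 20 km radius, the 3-day window, and the reading of a positive label as "corroborated by at least one news article" are all assumptions made for illustration. The sketch marks a post as positive when some article lies within both the distance and the time threshold, using the haversine great-circle distance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Record:
    """A geolocated, timestamped item (hypothetical schema for posts and articles)."""
    source_id: str
    when: datetime
    lat: float
    lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def label_posts(posts, articles, max_km=20.0, max_gap=timedelta(days=3)):
    """Map each post id to True if any article is close in both space and time.

    max_km and max_gap are illustrative thresholds, not values from the paper.
    """
    labels = {}
    for post in posts:
        labels[post.source_id] = any(
            abs(post.when - art.when) <= max_gap
            and haversine_km(post.lat, post.lon, art.lat, art.lon) <= max_km
            for art in articles
        )
    return labels


# Made-up coordinates and dates, for demonstration only.
articles = [Record("news-1", datetime(2022, 1, 10), -25.50, -48.50)]
posts = [
    Record("post-near", datetime(2022, 1, 11), -25.55, -48.50),   # close in space and time
    Record("post-far", datetime(2022, 1, 11), -23.00, -48.50),    # too far away
    Record("post-late", datetime(2022, 1, 25), -25.55, -48.50),   # too late
]
labels = label_posts(posts, articles)
```

The nested scan is quadratic in the number of records; for larger collections, a spatial index or pre-bucketing by date would be a natural replacement, which is again a design suggestion rather than something the source states.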

Keywords: Portuguese man-of-war, data integration, data labeling, social networks, geolocation

Published
2022-09-19
REIPS, Lisiane; HARA, Carmem Satie. Automated Data Integration and Labeling on the Cnidarian Physalia physalis, using Geolocation as a Reference. In: WORKSHOP ON THESIS AND DISSERTATION (WTDBD) - BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 37., 2022, Búzios. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 105-111. DOI: https://doi.org/10.5753/sbbd_estendido.2022.21851.