Optimizing Botanical Data Integrity: A Comparative Study of Text Similarity Methods

  • Luma G. R. Cerqueira, Federal University of Santa Catarina (UFSC)
  • Carina F. Dorneles, Federal University of Santa Catarina (UFSC)
  • Simone S. Werner, Federal University of Santa Catarina (UFSC)

Abstract


This study addresses the challenges of managing authorship nomenclature, as dictated by the International Code of Nomenclature for algae, fungi, and plants (ICN), in databases of the Begoniaceae and Bignoniaceae families. We evaluate several text similarity algorithms for their effectiveness in deduplicating botanical data while keeping authorship and synonymy accurate. Among the algorithms compared, Smith-Waterman offered the best balance of precision, recall, and F1 score, which suggests it is a robust option for improving database integrity. The study also shows that these algorithms must be fine-tuned to handle the particular challenges of botanical data management, underscoring the need for specialized approaches in this field.
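To make the comparison concrete, the following is a minimal sketch of how a Smith-Waterman local alignment score could be turned into a similarity between two short author strings, and how precision, recall, and F1 could be computed for a set of predicted duplicate pairs. It is not the implementation used in the study: the scoring parameters (`match`, `mismatch`, `gap`), the normalization by the shorter string, and all function names are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    The parameter values are illustrative assumptions, not the paper's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def smith_waterman_similarity(a: str, b: str, match: int = 2) -> float:
    """Scale the local score to [0, 1] by the best score the shorter string can reach."""
    a, b = a.casefold(), b.casefold()
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * shortest)


def precision_recall_f1(predicted: set, actual: set) -> tuple:
    """Precision, recall, and F1 for predicted duplicate pairs against labeled pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under this normalization, a string that is fully contained in another (for example an abbreviation that is a prefix of a longer author string) scores 1.0, so the decision threshold and the normalization itself would likely need tuning on labeled data before use on real records.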

Keywords: Short Text Similarity, Botanical Databases, Similarity Function

Published
2024-10-14
CERQUEIRA, Luma G. R.; DORNELES, Carina F.; WERNER, Simone S. Optimizing Botanical Data Integrity: A Comparative Study of Text Similarity Methods. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 406-417. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240254.