Big Data Architectures for FAIR-compliant Repositories: A Systematic Review

Abstract


The FAIR Principles state that scientific data should be Findable, Accessible, Interoperable, and Reusable, in line with the Open Science movement. However, designing a FAIR-compliant repository is challenging because it must manage a large volume and variety of research data and metadata, which may also be generated at high velocity. This complexity calls for a Software Reference Architecture (SRA) to guide data engineers during implementation. In this paper, we conduct a systematic review of research efforts on architectural solutions for implementing FAIR-compliant repositories. We analyze 323 references from Scopus, ACM, IEEE Xplore, and specialist recommendations. From this analysis, we identify 7 studies that describe general-purpose big data SRAs, 13 pipelines that implement the FAIR Principles in specific contexts, and 3 FAIR-compliant big data SRAs. We describe their key characteristics and discuss their limitations, highlighting trends and research opportunities.

Keywords: Open Science, FAIR Principles, Big Data, Software Reference Architecture, SRA

Published
25/09/2023
CASTRO, João P. C.; AGUIAR, Cristina D. Big Data Architectures for FAIR-compliant Repositories: A Systematic Review. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 76-88. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.232494.