Towards Analyzing Computational Costs of Spark for SARS-CoV-2 Sequences Comparisons on a Commercial Cloud

Alan L. Nunes UFF
Alba Cristina Magalhaes Alves de Melo UnB
Cristina Boeres UFF
Daniel de Oliveira UFF
Lúcia Maria de Assumpção Drummond UFF

DOI: https://doi.org/10.5753/wscad.2021.18523

Abstract

In this paper, we present Diff Sequences Spark, a Spark application that compares 540 SARS-CoV-2 sequences from South America on the Amazon EC2 cloud and outputs the positions at which the sequences differ. We analyzed the performance of the application on selected memory-optimized and storage-optimized virtual machines (VMs) in both the on-demand and spot markets. The memory-optimized VMs achieved lower execution times and lower financial costs than the storage-optimized ones. Regarding the markets, Diff Sequences Spark achieved lower average execution times and monetary costs on spot VMs than on the corresponding on-demand VMs, even in scenarios with several spot revocations, because it benefits from the low-overhead fault tolerance of the Spark framework.
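To make the application's output concrete, the following is a minimal sketch of a position-wise comparison that reports where two sequences differ. It is not the authors' implementation and does not use Spark: the function names `diff_positions` and `compare_all`, the 0-based indexing, the treatment of unequal lengths, and the all-pairs strategy are assumptions made only for illustration.

```python
from itertools import combinations


def diff_positions(a: str, b: str) -> list[int]:
    """Return the 0-based positions at which sequences a and b differ.

    Assumption: sequences are compared position by position, and every
    position beyond the end of the shorter sequence counts as a difference.
    """
    shared = min(len(a), len(b))
    positions = [i for i in range(shared) if a[i] != b[i]]
    positions.extend(range(shared, max(len(a), len(b))))
    return positions


def compare_all(seqs: dict[str, str]) -> dict[tuple[str, str], list[int]]:
    """Compare every unordered pair of named sequences (hypothetical strategy)."""
    return {
        (x, y): diff_positions(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


if __name__ == "__main__":
    # Toy data, not real SARS-CoV-2 sequences.
    toy = {"s1": "ACGT", "s2": "ACCT", "s3": "ACG"}
    for pair, positions in compare_all(toy).items():
        print(pair, positions)
```

In a Spark setting, the per-pair comparisons would presumably be distributed across executors, so that work lost to a spot revocation can be recomputed by the framework; the sketch above shows only the per-pair logic.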
