Investigating the Impact of Congestion Control Algorithms on Apache Spark Execution
Abstract
Apache Spark is a distributed processing framework that overcomes the limitations of models such as Hadoop MapReduce through optimizations including in-memory execution and support for iterative operations. Its ability to process data, transmit it, and apply machine learning algorithms makes it essential for current computational demands. However, when handling large volumes of data, Apache Spark faces network congestion at critical moments, such as during data redistribution between nodes, which can saturate the available bandwidth. This study explores different scenarios, replicating network traffic with the iPerf tool, to compare the effectiveness of three TCP congestion control algorithms (Cubic, Reno, and DCTCP) when executing a Spark application with key characteristics for comparison.
References
Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. (2010). Data center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference, pages 63–74.
The Apache Software Foundation (2024). Apache Spark documentation. Available at: [link]. Accessed: 15 Jan. 2025.
Ha, S., Rhee, I., and Xu, L. (2008). CUBIC: a new TCP-friendly high-speed TCP variant. ACM SIGOPS Operating Systems Review, 42(5):64–74.
IBM (2021). What is Apache Spark? Available at: [link]. Accessed: 12 Jan. 2025.
iPerf (2024). iPerf - iPerf3 and iPerf2 user documentation. Available at: [link]. Accessed: 15 Jan. 2025.
Jacobson, V. (1988). Congestion avoidance and control. ACM SIGCOMM Computer Communication Review, 18(4):314–329.
Salloum, S., Dautov, R., Chen, X., Peng, P. X., and Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1:145–164. DOI: 10.1007/s41060-016-0027-9.
Published
2025-04-23
How to Cite
BOSCATTO, Enzo B.; MARCONDES, Anderson H. da S.; KOSLOVSKI, Guilherme P. Investigating the Impact of Congestion Control Algorithms on Apache Spark Execution. In: REGIONAL SCHOOL OF HIGH PERFORMANCE COMPUTING FROM SOUTHERN BRAZIL (ERAD-RS), 25., 2025, Foz do Iguaçu/PR. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 89-92. ISSN 2595-4164. DOI: https://doi.org/10.5753/eradrs.2025.6820.
