Enhancing Scalability and Performance in Distributed Systems: Analyzing Locking Mechanisms in HA Key-Value Stores

Nivardo A. L. Castro; Cidcley T. Souza; A. Wendell O. Rodrigues

doi:10.5753/ercemapi.2024.243712

Nivardo A. L. Castro IFCE
Cidcley T. Souza IFCE
A. Wendell O. Rodrigues IFCE

Abstract

Distributed systems are fundamental to modern computing, serving as the foundation for applications ranging from large-scale cloud infrastructures to distributed databases. These systems face intricate challenges in maintaining data integrity and managing concurrent processes across decentralized environments. Among the mechanisms devised to address these challenges, distributed locking stands out as a pivotal strategy for coordinating resource access among multiple nodes, ensuring operational coherence and data consistency. This paper examines distributed locking mechanisms in the context of key-value storage systems, which are valued for their simplicity and high scalability. We analyze both blocking and non-blocking strategies for resource acquisition, highlighting the trade-off between securing exclusive access to resources and minimizing latency. We also survey the scalability challenges that emerge as the system grows, evaluating how these mechanisms behave across an increasing number of nodes and operations. The study examines performance bottlenecks that often arise in distributed environments and identifies strategies to mitigate them while maintaining high throughput and responsiveness. We further focus on consistency and latency, exploring architectural and algorithmic solutions designed to balance the two. Benchmarking evaluations are presented using metrics such as throughput, latency, and scalability; the findings contribute to the broader understanding of coordination in distributed systems and offer guidance for system designers and developers.
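To make the distinction between blocking and non-blocking acquisition concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a store that offers an atomic "put if absent" with a time-to-live (a lease); `InMemoryKV`, `try_lock`, `lock`, and `unlock` are hypothetical names, and the store is a single-process stand-in rather than a replicated system such as etcd, ZooKeeper, or Redis.

```python
import threading
import time


class InMemoryKV:
    """Toy, single-process stand-in for a key-value store (hypothetical)."""

    def __init__(self):
        self._data = {}  # key -> (owner, expiry timestamp)
        self._mutex = threading.Lock()

    def put_if_absent(self, key, owner, ttl):
        """Atomically claim `key` unless a non-expired holder exists."""
        now = time.monotonic()
        with self._mutex:
            holder = self._data.get(key)
            if holder is not None and holder[1] > now:
                return False
            self._data[key] = (owner, now + ttl)
            return True

    def delete_if_owner(self, key, owner):
        """Remove `key` only if `owner` is the recorded holder."""
        with self._mutex:
            holder = self._data.get(key)
            if holder is not None and holder[0] == owner:
                del self._data[key]
                return True
            return False


def try_lock(kv, key, owner, ttl=5.0):
    """Non-blocking acquisition: a single attempt that returns immediately."""
    return kv.put_if_absent(key, owner, ttl)


def lock(kv, key, owner, ttl=5.0, timeout=1.0, poll=0.01):
    """Blocking acquisition: retry until acquired or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if kv.put_if_absent(key, owner, ttl):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def unlock(kv, key, owner):
    """Release the lock; only the current holder can release it."""
    return kv.delete_if_owner(key, owner)
```

In this sketch the non-blocking caller learns at once that the resource is busy and can do other work, while the blocking caller trades added latency for eventually obtaining exclusive access. The polling loop is only one way to block; real stores may offer watch or notification primitives instead.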
