Enabling Greater Reproducibility of Experiments in Distributed Environments with Low-Reliability Nodes
Abstract
Experiment reproducibility, essential for verifying the effectiveness and efficiency of scientific contributions, is particularly challenging in the context of large-scale distributed systems. Unplanned failures (either of the nodes that compose the system or of the communication between them) may prevent results from reaching statistical significance, or make their validity hard to verify. To address this problem, we propose EASYEXP, a fault-tolerant architecture that ensures the reproducibility of experiments in unreliable distributed testbeds. In EASYEXP, nodes in the experiment environment “interpret” workers and execute the actions expected of them, following a predefined schedule. When a node fails, it is replaced by a functional one that inherits the execution context of the worker it was interpreting. Our results show that, across multiple runs of the same experiment, EASYEXP maintains lower variation (a standard deviation of 1.6%) and higher precision (95.7%) than experiments performed in the traditional way (a 25% deviation and only 72% precision).
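The abstract describes the core fail-over mechanism only in prose: logical workers follow a predefined schedule, and when the node interpreting a worker fails, a functional node takes over while preserving the worker's execution context. The Python sketch below illustrates that idea under stated assumptions; the Worker, Node, and Coordinator classes and their methods are hypothetical names invented for illustration, not EASYEXP's actual API.

    # Hypothetical sketch of the fail-over idea described in the abstract:
    # each worker (a scripted role in the experiment) is "interpreted" by a
    # physical node; when that node fails, the coordinator reassigns the
    # worker, together with its saved execution context, to a spare node.
    # All names here are illustrative assumptions, not EASYEXP's actual API.

    from dataclasses import dataclass, field


    @dataclass
    class Worker:
        """A logical participant of the experiment, decoupled from hardware."""
        name: str
        schedule: list[str]        # predefined actions, in order
        next_action: int = 0       # execution context: where execution stopped
        state: dict = field(default_factory=dict)


    @dataclass
    class Node:
        """A physical (possibly unreliable) machine in the testbed."""
        hostname: str
        alive: bool = True


    class Coordinator:
        """Assigns workers to nodes and migrates them on node failure."""

        def __init__(self, nodes: list[Node]):
            self.spares = list(nodes)
            self.assignment: dict[str, Node] = {}  # worker name -> node

        def assign(self, worker: Worker) -> Node:
            node = self.spares.pop(0)
            self.assignment[worker.name] = node
            return node

        def handle_failure(self, worker: Worker) -> Node:
            """Replace a failed node, preserving the worker's context."""
            failed = self.assignment[worker.name]
            failed.alive = False
            replacement = self.spares.pop(0)  # pick a functional spare
            self.assignment[worker.name] = replacement
            # worker.next_action / worker.state travel with the Worker object,
            # so the replacement resumes exactly where the failed node stopped.
            return replacement


    if __name__ == "__main__":
        nodes = [Node(f"node-{i}") for i in range(3)]
        coord = Coordinator(nodes)
        w = Worker("client-1", schedule=["connect", "send", "measure", "report"])

        node = coord.assign(w)
        w.next_action = 2  # suppose the node fails right after "send"
        node = coord.handle_failure(w)
        print(f"{w.name} resumes '{w.schedule[w.next_action]}' on {node.hostname}")

The key design choice this sketch highlights is that the execution context (the next scheduled action and any accumulated state) lives with the worker rather than with the node, which is what allows a replacement node to resume mid-schedule.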
References
Albrecht, J. R., Braud, R., Dao, D., Topilski, N., Tuttle, C., Snoeren, A. C., and Vahdat, A. (2007). Remote control: Distributed application configuration, management, and visualization with Plush. In Large Installation System Administration Conference (LISA), volume 7, pages 1–19.
Bajpai, V., Kühlewind, M., Ott, J., Schönwälder, J., Sperotto, A., and Trammell, B. (2017). Challenges with reproducibility. In SIGCOMM Reproducibility Workshop, Reproducibility ’17, pages 1–4, New York, NY, USA. ACM.
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604):452–454.
Bonaventure, O., Iannone, L., and Saucez, D. (2017). Proceedings of the ACM SIGCOMM Reproducibility Workshop. ACM, New York, NY, USA.
Costa, L. L., Bona, L. C., and Duarte Jr, E. P. (2015). Melhorando a precisão e repetibilidade de experimentos no PlanetLab. In Simpósio Brasileiro de Redes de Computadores e de Sistemas Distribuídos (SBRC 2015). SBC.
Garrett, T., Bona, L. C., and Duarte Jr, E. P. (2017). Improving the performance and reproducibility of experiments on large-scale testbeds with k-cores. Computer Communications.
Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. (2010). ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference, volume 8, page 9, Boston, MA, USA.
Imbert, M., Pouilloux, L., Rouzaud-Cornabas, J., Lèbre, A., and Hirofuchi, T. (2013). Using the execo toolkit to perform automatic and reproducible cloud experiments. In Int’l Conference on Cloud Computing Technology and Science (CloudCom 2013), volume 2, pages 158–163. IEEE.
Leonini, L., Rivière, É., and Felber, P. (2009). Splay: Distributed systems evaluation made simple (or how to turn ideas into live systems in a breeze). In Networked Systems Design and Implementation (NSDI), volume 9, pages 185–198.
Nussbaum, L. (2017). Testbeds support for reproducible research. In SIGCOMM Reproducibility Workshop, Reproducibility ’17, pages 24–26, New York, NY, USA. ACM.
Ruiz, C. C., Richard, O. A., Iegorov, O., and Videau, B. (2013). Managing large scale experiments in distributed testbeds. In Int’l Association of Science and Technology for Development (IASTED), pages 628–636.
Santos, M., Fernandes, S., and Kamienski, C. (2014). Conducting network research in large-scale platforms: Avoiding pitfalls in PlanetLab. In Advanced Information Networking and Applications (AINA), pages 525–532. IEEE.
