Replication of Xen Virtual Machines with Adaptive Checkpointing

Marcelo P. da Silva; Rafael R. Obelheiro; Guilherme P. Koslovski

doi:10.5753/wtf.2015.22935

Marcelo P. da Silva UDESC
Rafael R. Obelheiro UDESC
Guilherme P. Koslovski UDESC

Abstract

Remus is a virtual machine (VM) replication mechanism that provides high availability in the presence of crash faults. Replication is performed through checkpointing at a fixed, predetermined time interval. However, processing and communication pull the ideal checkpoint interval in opposite directions: longer intervals benefit compute-intensive applications, whereas shorter intervals favor applications whose performance is dominated by the network. As a result, the configured interval is not always suited to the resource usage profile of the application running in the VM, which limits the applicability of Remus in some scenarios. This work proposes adaptive checkpointing for Remus, dynamically adjusting the replication frequency according to the characteristics of the running applications. The results indicate that the proposal improves the performance of applications that use both processing and communication resources, without penalizing applications that use only one of these resource types.
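The trade-off described above can be illustrated with a minimal sketch of an interval controller. This is not the policy proposed in the paper, whose details are not given in the abstract; it only assumes that the controller can observe how many outgoing packets the VM produced during the last checkpoint epoch. The function name `next_interval_ms`, its parameters, and all default values are hypothetical.

```python
def next_interval_ms(current_ms, tx_packets, *, min_ms=25, max_ms=500,
                     step_ms=25, tx_threshold=10):
    """Pick the next checkpoint interval from the last epoch's network activity.

    Hypothetical policy (an assumption, not the paper's algorithm):
    - heavy outgoing traffic  -> halve the interval, so the network-bound
      workload sees more frequent checkpoints;
    - little or no traffic    -> lengthen the interval by a fixed step, so the
      compute-bound workload is interrupted less often.
    The result is always clamped to [min_ms, max_ms].
    """
    if tx_packets > tx_threshold:
        candidate = current_ms // 2        # react quickly to network activity
    else:
        candidate = current_ms + step_ms   # back off gradually when idle
    return max(min_ms, min(max_ms, candidate))


if __name__ == "__main__":
    # Simulated per-epoch outgoing packet counts: a compute phase,
    # a communication burst, then compute again.
    interval = 200
    for tx in [0, 0, 120, 300, 80, 0, 0]:
        interval = next_interval_ms(interval, tx)
        print(f"tx_packets={tx:3d} -> next interval {interval} ms")
```

Halving on traffic and growing by a fixed step is only one possible design choice; it makes the controller shorten the interval quickly when communication starts and lengthen it slowly afterwards. Any real policy would need to be tuned against measured workloads.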
