Scalable, Efficient, and Policy-Aware Deduplication for Primary Distributed Storage Systems

  • Henrique Fingler University of Texas at San Antonio
  • Moo-Ryong Ra AT&T Labs - Research
  • Rajesh Panta AT&T Labs - Research

Abstract


Data deduplication has become a crucial technique for reducing the volume of data kept in modern storage systems. We present SEP-D, a practical scale-out distributed storage system that incorporates data deduplication for primary storage. SEP-D introduces a novel metadata handling mechanism that combines content-based hashing with built-in distributed data placement strategies such as CRUSH. This enables SEP-D to eliminate the need for remote metadata lookups, and thus to incorporate deduplication without affecting scalability. SEP-D integrates smoothly with the existing storage system, allowing storage policies to be reused across different pools of storage. We implemented SEP-D in Ceph, a popular distributed storage system widely adopted in industry, and demonstrated that SEP-D has minimal impact on data I/O performance while maintaining the existing storage policies implemented in the underlying distributed storage system.
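
The idea of combining content-based hashing with deterministic placement can be illustrated with a minimal sketch. This is not SEP-D's implementation and not Ceph's API: rendezvous hashing stands in for CRUSH, and the names `fingerprint`, `place`, and `Cluster` are hypothetical. It only shows why no remote metadata lookup is needed: the chunk's content hash is its object name, and any client can compute the owning nodes from that name alone.

```python
import hashlib


def fingerprint(chunk):
    """Content-based object name: the SHA-256 digest of the chunk bytes."""
    return hashlib.sha256(chunk).hexdigest()


def place(object_id, nodes, replicas=2):
    """Deterministic placement (rendezvous hashing, a stand-in for CRUSH).

    Every client ranks the nodes the same way for a given object name,
    so the location is computed locally instead of being looked up.
    """
    def score(node):
        return hashlib.sha256(f"{node}:{object_id}".encode()).digest()

    return sorted(nodes, key=score, reverse=True)[:replicas]


class Cluster:
    """Toy in-memory cluster: each node maps fingerprint -> [data, refcount]."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.stores = {node: {} for node in self.nodes}

    def write(self, chunk):
        fp = fingerprint(chunk)
        for node in place(fp, self.nodes):
            entry = self.stores[node].get(fp)
            if entry is None:
                # First copy of this content on this node: store it.
                self.stores[node][fp] = [chunk, 1]
            else:
                # Duplicate content: only bump the reference count.
                entry[1] += 1
        return fp

    def read(self, fp):
        # The primary node is recomputed from the fingerprint alone.
        primary = place(fp, self.nodes)[0]
        return self.stores[primary][fp][0]


cluster = Cluster(["osd.0", "osd.1", "osd.2", "osd.3"])
first = cluster.write(b"hello world")
second = cluster.write(b"hello world")  # duplicate write, no new data stored
```

Because identical chunks hash to the same name, they always land on the same nodes, where the duplicate is detected locally. A real system would also need chunking, reference-count consistency, and failure handling, none of which this sketch attempts.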
Keywords: Metadata, Semantics, Distributed databases, Servers, Scalability, Throughput, Media
Published
15/10/2019
FINGLER, Henrique; RA, Moo-Ryong; PANTA, Rajesh. Scalable, Efficient, and Policy-Aware Deduplication for Primary Distributed Storage Systems. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 31., 2019, Campo Grande/MS. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 180-187.