Taming the Big Data Monster: Managing Petabytes of Data with Multi-Model Databases

Yang Chen; Feng Zhang; Yinhao Hong; Yunpeng Chai; Wei Lu; Hong Chen; Xiaoyong Du; Peipei Wang; Le Mi; Jintao Li; Xilin Tang; Yanliang Zhou; Wei Zhou; Peng Zhang; Fengyi Chen; Pengfei Li; Yu Li

Yang Chen, University of China
Feng Zhang, University of China
Yinhao Hong, University of China
Yunpeng Chai, University of China
Wei Lu, University of China
Hong Chen, University of China
Xiaoyong Du, University of China
Peipei Wang, University of China
Le Mi, University of China
Jintao Li, University of China
Xilin Tang, University of China
Yanliang Zhou, University of China
Wei Zhou, CICC Alpha (Beijing) Private Equity
Peng Zhang, Alibaba Group
Fengyi Chen, Alibaba Group
Pengfei Li, Alibaba Group
Yu Li, Alibaba Group

Abstract

With the development of big data technology, the amount of business data that Internet companies need to handle has reached the petabyte level, which puts great pressure on system processing capacity. For example, the peak order volume of Alibaba's Global Shopping Festival in 2020 reached 583,000 orders per second. To make matters worse, real business workloads involve multi-model data. Failing to deliver high-throughput, low-latency transaction processing leads to a poor user experience, which in turn can cause serious financial losses through customer churn. Although numerous optimizations have been proposed, they can fail, or become significantly less effective, when faced with petabytes of data. In this paper, we propose a novel and practical multi-model big data system that can manage petabytes of data. In particular, we present three designs for processing data at this scale. First, we partition data to reduce the amount of unnecessary data scanned. Second, to improve system efficiency, we adaptively use row storage for big tables that are frequently updated and column storage for tables that are frequently queried. Third, we apply compression to speed up I/O access. We analyze two of Alibaba's real PB-level business scenarios, Double 11 and Zhixingtong, and generate corresponding workloads and benchmarks to verify our system. Experiments show that our system can efficiently manage petabyte-scale data in real scenarios, provides high-performance querying of terabyte-scale datasets, and is suitable for various workloads.

Keywords: Big data system, multi-model, petabytes of data, Double 11, Zhixingtong, Alibaba
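
To make the first design in the abstract concrete, the following minimal sketch illustrates the general idea of partition pruning: rows are grouped by a partition key, and a query touches only the partitions whose key can match. It is not the system's implementation; the class `PartitionedTable`, the choice of a calendar day as the partition key, and the toy order rows are all assumptions introduced purely for illustration.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy in-memory table partitioned by calendar day (hypothetical)."""

    def __init__(self):
        # Partition key (a date) -> list of rows stored in that partition.
        self.partitions = defaultdict(list)
        # Number of rows examined by the most recent query.
        self.rows_scanned = 0

    def insert(self, row):
        # Route each row to the partition named by its "day" field.
        self.partitions[row["day"]].append(row)

    def query(self, start, end, predicate):
        """Return rows with start <= day <= end that satisfy predicate.

        Partitions outside [start, end] are skipped without reading
        any of their rows, which is the pruning step.
        """
        self.rows_scanned = 0
        result = []
        for day, rows in self.partitions.items():
            if not (start <= day <= end):
                continue  # pruned: this partition cannot contain matches
            for row in rows:
                self.rows_scanned += 1
                if predicate(row):
                    result.append(row)
        return result


# Assumed toy data: 30 days of orders, 100 orders per day.
table = PartitionedTable()
for d in range(1, 31):
    for i in range(100):
        table.insert({"day": date(2020, 11, d), "order_id": d * 1000 + i, "amount": i})

# Query a single day; only that day's partition is scanned.
hits = table.query(date(2020, 11, 11), date(2020, 11, 11),
                   lambda r: r["amount"] >= 90)
print(len(hits), "matching rows;", table.rows_scanned, "of 3000 rows scanned")
```

In this toy setting a single-day query examines 100 of the 3,000 stored rows. A real system would apply the same principle to on-disk partitions and would typically combine it with the storage-layout and compression choices described above.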