Taming the Big Data Monster: Managing Petabytes of Data with Multi-Model Databases

  • Yang Chen University of China
  • Feng Zhang University of China
  • Yinhao Hong University of China
  • Yunpeng Chai University of China
  • Wei Lu University of China
  • Hong Chen University of China
  • Xiaoyong Du University of China
  • Peipei Wang University of China
  • Le Mi University of China
  • Jintao Li University of China
  • Xilin Tang University of China
  • Yanliang Zhou University of China
  • Wei Zhou CICC Alpha (Beijing) Private Equity
  • Peng Zhang Alibaba Group
  • Fengyi Chen Alibaba Group
  • Pengfei Li Alibaba Group
  • Yu Li Alibaba Group

Abstract

With the development of big data technology, the volume of business data that Internet companies must handle has reached the petabyte level, putting great pressure on system processing capacity. For example, the peak order volume of Alibaba's Global Shopping Festival in 2020 reached 583,000 orders per second. Worse still, real business workloads involve multi-model data. Failure to deliver high-throughput, low-latency transaction processing results in a poor user experience, which can lead to serious financial losses through customer churn. Although numerous optimizations have been proposed, they can fail, or become significantly less effective, when faced with petabytes of data. In this paper, we propose a novel and practical multi-model big data system that can manage petabytes of data. In particular, we present three designs for processing data at this scale. First, we partition the data to reduce the amount of unnecessary data scanned. Second, we adaptively use row storage for large tables that are frequently updated and column storage for tables that are frequently queried, improving system efficiency. Third, we compress data to speed up I/O access. We analyze two of Alibaba's real PB-level business scenarios, Double 11 and Zhixingtong, and generate workloads and benchmarks accordingly to evaluate our system. Experiments show that our system efficiently manages petabyte-scale data in real scenarios, provides high-performance querying of terabyte-scale datasets, and is suitable for various workloads.
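
The three designs named in the abstract (partitioning, adaptive row/column storage, and compression) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Partition` class, the `prune`, `choose_layout`, and `compress_column` helpers, and the update/query threshold are all hypothetical names and values chosen for illustration, and `zlib` stands in for whatever codec the real system uses.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Partition:
    lo: int  # inclusive lower bound of the partition key (hypothetical)
    hi: int  # exclusive upper bound of the partition key (hypothetical)
    rows: list = field(default_factory=list)


def prune(partitions, key_lo, key_hi):
    """Keep only partitions whose key range overlaps [key_lo, key_hi).

    Partitions that cannot contain matching rows are never scanned.
    """
    return [p for p in partitions if p.lo < key_hi and key_lo < p.hi]


def choose_layout(updates, queries, threshold=1.0):
    """Assumed heuristic: row store when updates dominate, else column store."""
    return "row" if updates > threshold * queries else "column"


def compress_column(values):
    """Serialize and compress one column; fewer bytes means less I/O per scan."""
    raw = ",".join(map(str, values)).encode()
    return raw, zlib.compress(raw)


partitions = [Partition(0, 100), Partition(100, 200), Partition(200, 300)]
hit = prune(partitions, 120, 180)          # only the middle partition overlaps
layout_hot = choose_layout(updates=1000, queries=10)
layout_cold = choose_layout(updates=10, queries=1000)
raw, packed = compress_column([7] * 1000)  # highly repetitive column
```

In this sketch a range query touches one of three partitions, the update-heavy table is assigned the row layout, and the repetitive column shrinks under compression; a real system would base these decisions on collected workload statistics rather than fixed counts.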
Published
2022-11-02
How to Cite
CHEN, Yang et al. Taming the Big Data Monster: Managing Petabytes of Data with Multi-Model Databases. Proceedings of the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), [S.l.], p. 283-292, Nov. 2022. ISSN 0000-0000. Available at: <https://sol.sbc.org.br/index.php/sbac-pad/article/view/28255>. Accessed: 17 May 2024.