Forecasting File Lifecycles for Intelligent Data Placement in Hierarchical Storage

  • Adrian Khelili Eviden Atos BDS R&D Data Management / UPSaclay / UVSQ
  • Sophie Robert Hayek Eviden Atos BDS R&D Data Management
  • Soraya Zertal UPSaclay / UVSQ


The increasing disparity between computing speed and data access latency poses significant challenges in managing data storage, particularly for massively parallel supercomputers. To address this issue, storage systems have evolved into hierarchical architectures whose tiers offer varying performance, cost, and capacity depending on the underlying hardware technology. This heterogeneous, hierarchical nature of storage calls for an optimal data placement strategy. Existing strategies have primarily approached the problem from a block perspective, focusing on analyzing application I/O behavior. However, such approaches fail to capture the context in which data is used. A file-centric perspective, based on file-level usage patterns, can leverage the context and semantics of files and lead to more efficient data placement strategies. This study proposes a novel file-based approach to data placement, built on the concept of file re-use and representing files through their file life cycles (FLCs). The FLC of a file captures the sequence of operations it undergoes during its lifetime, referred to as FLCevents. By analyzing the time series data associated with the FLCs of actively used files, our algorithm detects repetitive patterns of FLCevents, predicts future events using pattern matching, and thereby anticipates file movement. This allows for data placement that aligns with the expected near-future usage. To validate our approach, we conducted experiments using traces extracted from representative high-performance computing (HPC) applications, namely NEMO, NAMD, LQCD, and an IO-Benchmark. The results are highly promising: in terms of file prediction accuracy, our proposed F-LRU (File-based Least Recently Used) achieves between 55% and 77% precision in the most difficult scenarios and 100% in the others.
It is also at least as effective as traditional LRU and LFU, and can increase the hit rate by a factor of 1.94 over LFU for LQCD and by a factor of 1.06 over LRU for IO-Bench. These findings highlight the potential of our approach to significantly enhance hierarchical storage performance, particularly in HPC environments.
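To make the idea of predicting FLCevents by pattern matching more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes an FLC is available as a plain list of event labels; the labels ("open", "read", "close"), the function name predict_next_event, and the max_pattern_len parameter are all hypothetical choices made for illustration. The sketch looks for the longest recent run of events that already occurred earlier in the same FLC and returns the event that followed it.

```python
from typing import Optional, Sequence


def predict_next_event(history: Sequence[str],
                       max_pattern_len: int = 4) -> Optional[str]:
    """Guess the next event of a file life cycle (illustrative only).

    Takes the last k events of `history` (k from max_pattern_len down
    to 1), searches for the most recent earlier occurrence of that same
    run, and returns the event that followed it. Returns None when no
    earlier occurrence exists.
    """
    n = len(history)
    # Try the longest pattern first; leave at least one earlier event.
    for k in range(min(max_pattern_len, n - 1), 0, -1):
        suffix = tuple(history[n - k:])
        # Scan earlier start positions, most recent first.
        for start in range(n - k - 1, -1, -1):
            if tuple(history[start:start + k]) == suffix:
                return history[start + k]
    return None


# Hypothetical FLC: the file is repeatedly opened, read, and closed.
flc = ["open", "read", "close", "open", "read", "close", "open", "read"]
next_event = predict_next_event(flc)
```

A placement policy could, under the same assumptions, use such a prediction to keep a file on a fast tier when an access is expected soon and to demote it otherwise.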
Keywords: I/O Prediction, File lifecycles, Time Series, Data temperature, Hierarchical storage, High Performance Computing
KHELILI, Adrian; HAYEK, Sophie Robert; ZERTAL, Soraya. Forecasting File Lifecycles for Intelligent Data Placement in Hierarchical Storage. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 35., 2023, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 181-191.