High-performance computing (HPC) storage systems
have become trusted repositories for hundreds of petabytes of data with
aggregate throughput rates in the terabytes per second. Numerous
research advances have contributed to this success. Object storage
technologies helped eliminate bottlenecks in managing space on
storage devices. The development of separate data and metadata
planes facilitated scale-out in the data plane to enable high
throughput. The adoption of network portability layers eased porting to
new HPC networking technologies. Disaggregation was adopted early,
bringing substantial cost and administrative savings and the
flexibility to serve the diverse batch workloads typical of HPC.
Together with input/output (I/O) middleware technologies, HPC storage
systems have largely addressed the throughput challenges of checkpoint
and restart for traditional Message Passing Interface (MPI) simulation
codes, a workload that was their primary driver for many years.
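To make that workload concrete, the sketch below shows a minimal MPI-IO checkpoint in C: every rank writes its slice of simulation state to a disjoint region of one shared file through a collective write, the access pattern whose aggregate throughput these systems were built to sustain. The file name, per-rank buffer size, and offset scheme are illustrative assumptions, not details from any particular system.

```c
/* Minimal MPI-IO checkpoint sketch: each rank writes its local
 * slice of simulation state to a shared file at an offset derived
 * from its rank. Sizes and the file name are assumptions. */
#include <mpi.h>
#include <stdlib.h>

#define LOCAL_DOUBLES (1 << 20)   /* per-rank state size (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *state = malloc(LOCAL_DOUBLES * sizeof *state);
    /* ... simulation time steps would fill `state` here ... */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Each rank targets a disjoint byte range; the collective call
     * lets the MPI-IO layer aggregate requests into large writes. */
    MPI_Offset offset =
        (MPI_Offset)rank * LOCAL_DOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, state, LOCAL_DOUBLES,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(state);
    MPI_Finalize();
    return 0;
}
```

Restart mirrors this pattern, reading the same regions back with MPI_File_read_at_all.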