Abstract—High-performance computing (HPC) storage
systems are a key component of the success of HPC to date. Recently
there have been significant developments in storage related technologies
as well as changes in users of HPC platforms, especially as relates to
the use of HPC for artificial intelligence and experimental data
analysis workloads. These developments merit a revisit of HPC storage
system architectural designs. In this paper we discuss these drivers,
identify key challenges to status quo posed by these developments, and
discuss directions future research might take to unlock the potential of
new technologies for the breadth of HPC applications.

High-performance computing (HPC) storage systems
have become trusted repositories for hundreds of petabytes of data with
aggregate throughput rates in the terabytes per second. Numerous
research advances have contributed to this success. Object storage
technologies helped eliminate bottlenecks related to the management of
space on storage devices. The development of separate data and metadata
planes facilitated scale-out in the data plane to enable high
throughput. The adoption of network portability layers eased porting to
new HPC networking technologies. Disaggregation was adopted early,
bringing powerful cost and administrative savings and providing
flexibility to serve the diverse batch workloads typical of HPC.
Together with input/output (I/O) middleware technologies, HPC storage
systems have largely addressed the throughput challenges of checkpoint
and restart for traditional Message Passing Interface (MPI) simulation
codes, a workload that was the primary driver of these systems for many years.
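To make the checkpoint workload concrete, the sketch below shows a
bulk-synchronous checkpoint written with collective MPI-IO, in which each
rank writes its portion of the application state to a shared file at a
disjoint offset. The file name, buffer size, and data layout are
illustrative assumptions rather than details drawn from any particular
application or storage system.

/* Minimal sketch of a bulk-synchronous checkpoint with collective MPI-IO.
 * File name, per-rank buffer size, and layout are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical per-rank state: 1M doubles of simulation data. */
    const MPI_Offset count = 1 << 20;
    double *state = malloc(count * sizeof(double));
    for (MPI_Offset i = 0; i < count; i++)
        state[i] = (double)rank;

    /* Each rank writes its block at a disjoint offset; the collective call
     * lets the MPI-IO layer and the parallel file system coordinate the
     * high-throughput write described in the text. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, state, (int)count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(state);
    MPI_Finalize();
    return 0;
}

A restart would read the same per-rank offsets back with the corresponding
collective read call (MPI_File_read_at_all).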