Abstract—High-performance computing (HPC) storage systems are a key component of the success of HPC to date. Recently there have been significant developments in storage-related technologies as well as changes in the users of HPC platforms, especially as they relate to the use of HPC for artificial intelligence and experimental data analysis workloads. These developments merit a revisit of HPC storage system architectural designs. In this paper we discuss these drivers, identify key challenges these developments pose to the status quo, and discuss directions future research might take to unlock the potential of new technologies for the breadth of HPC applications.
High-performance computing (HPC) storage systems have become trusted repositories for hundreds of petabytes of data with aggregate throughput rates in the terabytes per second. Numerous research advances have contributed to this success. Object storage technologies helped eliminate bottlenecks related to the management of space on storage devices. The development of separate data and metadata planes facilitated scale-out in the data plane to enable high throughput. The adoption of network portability layers eased porting to new HPC networking technologies. Disaggregation was adopted early, bringing significant cost and administrative savings and providing flexibility to serve the diverse batch workloads typical of HPC. Together with input/output (I/O) middleware technologies, HPC storage systems have largely addressed the throughput challenges of checkpoint and restart for traditional Message Passing Interface (MPI) simulation codes, which was their primary driver for many years.