Enabling a Future for Data-driven
Science
A great deal of effort was required to stabilize HPC storage and make it
trustworthy, but it did happen. Multiple production file system options
exist for data centers to choose from, and checkpoint and restart for
HPC codes has largely been addressed. But storage system designers
cannot rest on their laurels, and storage is not a solved problem. Even
more than for simulation codes, the potential benefits of HPC for AI and
analysis applications hinge on high-performance storage. We need not
just innovation, but innovation that goes hand in hand with these
scientific objectives.
Architecturally, the community must revisit the data path between
analysis applications and storage devices. In much the same way that
user-space RDMA access has revolutionized HPC networking (removing
handshaking, buffering, and host processing from the interprocess
communication path) and allowed networks to keep pace with memory
throughput, we must adopt new HPC storage access paradigms that minimize
obstructions in the storage data path and allow storage systems to keep
pace with NVMe capabilities. The need for RPC processing can be
minimized (by thoughtfully partitioning work onto control planes), any
remaining RPC processing or asymmetric transfer can be offloaded to
smart devices, and the complete data path can be holistically evaluated
to eliminate duplicate and superfluous protocol translations that
collectively leach performance from the system.
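As a small illustration of what trimming obstructions from the data path can look like in practice, the sketch below uses the Linux io_uring interface (via liburing) to queue and complete a read through rings shared with the kernel rather than paying a system call per request. It is only an analogy under assumed conditions: the device path and sizes are placeholders, and a production HPC storage service would go further, for example with user-space NVMe drivers or smart-device offload.

```c
/* Minimal sketch (illustrative only): one 4 KiB read issued through
 * liburing's shared submission/completion rings.
 * Build with: gcc ring_read.c -o ring_read -luring */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    void *buf;
    int fd;

    /* "/dev/nvme0n1" is a placeholder; any readable file works. */
    fd = open("/dev/nvme0n1", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* One ring pair (submission + completion) shared with the kernel. */
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        fprintf(stderr, "io_uring_queue_init failed\n");
        return 1;
    }

    if (posix_memalign(&buf, 4096, 4096) != 0) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Describe the read in a submission queue entry; no I/O happens yet. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);

    /* A single submit can flush many queued requests at once. */
    io_uring_submit(&ring);

    /* Reap the completion from the shared completion ring. */
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("read completed with result %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}
```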
From a device interface perspective, storage systems traditionally
divide responsibility between the storage device and host rigidly: the
device is responsible for handling data block updates, and the host is
responsible for data processing. But as SSDs continue to replace hard
disks at the front-line storage tier, block interface support requires
complex firmware that affects device performance and cost, and as
storage becomes disaggregated from computation, reducing data movement
between the device and host becomes crucial. Novel interfaces, like
Zoned Storage, have emerged to reduce firmware complexity by delegating
responsibilities to the host, and computational storage allows data to
be processed on the device in accordance with application needs, blurring
the divide between device and host. Future work will need to adapt
popular application types to fully leverage the capabilities of these
devices and explore the right balance of near-storage computation for
different tasks.
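To make the shifted division of labor concrete, the toy model below captures the contract that a host-managed zoned device imposes: each zone keeps a write pointer, writes are accepted only at that pointer, and space is reclaimed by explicitly resetting the zone. This is an illustrative sketch only; real devices are driven through NVMe Zoned Namespaces or SCSI/ATA zoned-block commands (for example via libzbd), and the zone geometry here is invented.

```c
/* Toy model of a host-managed zone (illustrative only). */
#include <stdbool.h>
#include <stdio.h>

#define ZONE_BLOCKS 256          /* blocks per zone (hypothetical geometry) */

struct zone {
    unsigned write_ptr;          /* next block that may be written */
};

/* Writes must be sequential: accepted only at the write pointer. */
static bool zone_write(struct zone *z, unsigned block, unsigned nblocks)
{
    if (block != z->write_ptr || z->write_ptr + nblocks > ZONE_BLOCKS)
        return false;            /* out-of-order or overflowing write */
    z->write_ptr += nblocks;     /* device advances the pointer */
    return true;
}

/* Reclaiming space is explicit: the host resets the whole zone. */
static void zone_reset(struct zone *z)
{
    z->write_ptr = 0;
}

int main(void)
{
    struct zone z = { .write_ptr = 0 };

    printf("write at 0:   %s\n", zone_write(&z, 0, 64)  ? "ok" : "rejected");
    printf("write at 100: %s\n", zone_write(&z, 100, 8) ? "ok" : "rejected");
    printf("write at 64:  %s\n", zone_write(&z, 64, 8)  ? "ok" : "rejected");

    zone_reset(&z);              /* the host, not firmware, decides when */
    printf("after reset, write at 0: %s\n",
           zone_write(&z, 0, 8) ? "ok" : "rejected");
    return 0;
}
```

The point of the model is that the bookkeeping a flash translation layer once hid behind a block interface now becomes the host's responsibility, which is exactly why host-side software such as file systems and key-value engines must be adapted to these devices.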
User abstractions are another key piece of the puzzle. Building fast and
productive storage systems will require not only addressing these
technology challenges but also understanding emerging science needs. The
HPC storage community has contributed interface advances in the past,
including concepts eventually adopted in the mainstream, but recently
most storage abstraction innovation has occurred elsewhere, with cloud
service providers offering options such as column stores, document stores, key-value
stores, streaming data infrastructure, and object stores. HPC storage researchers must work together with technology
providers and domain scientists to find abstractions that match science
needs and then to develop scalable storage services embodying those
abstractions.
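As a sketch of what a science-facing abstraction might look like from the application side, the hypothetical key-value interface below (all names invented for illustration, backed here by a trivial in-memory table) lets an application index results by meaningful keys rather than file paths and byte offsets; a scalable storage service embodying the same abstraction would distribute puts and gets across servers and devices behind this interface.

```c
/* Hypothetical key-value abstraction (names invented for illustration;
 * not an existing HPC storage API). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct kv_pair { char key[64]; char val[256]; };

struct kv_store {                /* stand-in for a handle to a remote service */
    struct kv_pair entries[128];
    size_t count;
};

static struct kv_store *kv_open(void)
{
    return calloc(1, sizeof(struct kv_store));
}

/* The service, not the application, decides placement and layout. */
static int kv_put(struct kv_store *kv, const char *key, const char *val)
{
    for (size_t i = 0; i < kv->count; i++)
        if (strcmp(kv->entries[i].key, key) == 0) {
            snprintf(kv->entries[i].val, sizeof kv->entries[i].val, "%s", val);
            return 0;
        }
    if (kv->count == 128)
        return -1;               /* toy capacity limit */
    snprintf(kv->entries[kv->count].key, sizeof kv->entries[kv->count].key,
             "%s", key);
    snprintf(kv->entries[kv->count].val, sizeof kv->entries[kv->count].val,
             "%s", val);
    kv->count++;
    return 0;
}

static const char *kv_get(struct kv_store *kv, const char *key)
{
    for (size_t i = 0; i < kv->count; i++)
        if (strcmp(kv->entries[i].key, key) == 0)
            return kv->entries[i].val;
    return NULL;
}

int main(void)
{
    struct kv_store *kv = kv_open();

    /* A science application might index results by experiment metadata
     * rather than by file path and byte offset. */
    kv_put(kv, "run42/temperature/max", "1532.7");
    printf("%s\n", kv_get(kv, "run42/temperature/max"));

    free(kv);
    return 0;
}
```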
The HPC storage research community also needs to be reinvigorated.
A misconception persists that HPC storage is a solved problem and that
new storage systems can simply be iteratively designed and deployed by
applying formulas dictated by commodity market forces and logistical
constraints. In reality, however, many unsolved problems remain in
high-performance storage, especially as it comes to the forefront
as the key to enabling both simulation and data-driven analytics use
cases. The high-performance storage community must innovate within this
space and then translate those innovations into solutions for our
data-driven science partners. HPC storage and its workloads must become
first-class citizens within computer science curricula, coordinated
research thrusts, and partnerships between industry, academia, and
governments.