Emerging Challenges in HPC Storage
Modern HPC storage architectures were shaped by the performance
characteristics of conventional hard drives, which exhibit minimal (if
any) onboard processing capability, low random access performance, and
high latency. These characteristics placed a
ceiling on overall storage system performance, and the remainder of the
storage infrastructure was designed around mitigating their limitations
as much as possible. Specifically, storage servers were designed to
mediate all access to hard drives. By doing so, they could shape traffic
(e.g., by serializing and batching), buffer data (e.g., through caching based on
locality), and process I/O requests on more powerful host CPUs (e.g.,
by handling interrupts, packing and unpacking Remote Procedure Call (RPC)
requests, and enforcing authorization) to make the most of hard drive
capabilities. Hard drive access latency also had subtle implications for
other elements of the storage system; there was no incentive to avoid
latencies in the client-side operating system or the storage fabric as
long as hard drives gated overall performance (Figure 2).
Low-latency access to storage
The architectural approach shown in Figure 2 was successful: it allowed
HPC storage systems to extract maximum aggregate throughput from vast
arrays of commodity hard drives. Its limitations are evident, however, now
that we attempt to match emerging IOPS- and response-time-sensitive
workloads to more capable, low-latency storage devices. User-space APIs
such as libaio or liburing can issue millions of operations per second
from a single core, network interface cards can inject hundreds of
millions of messages into a network per second, and these rates can be
matched by just a few hundred NVMe storage devices. Despite these
capabilities, modern storage servers are only able to process 100,000
RPCs per second from a single core. Even an exceptionally high-end storage
server with 100 high-frequency cores could service only 10 million read
or write RPCs per second. Such performance strands over 90% of the
network interface capability and saturates fewer than 10 fast NVMe
devices.
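As a concrete illustration of the client-side capability described above,
the following minimal sketch uses liburing to queue a batch of reads from a
single thread and hand them to the kernel with one system call. The device
path, queue depth, and block size are illustrative assumptions, and error
handling is abbreviated.
\begin{verbatim}
/* Minimal sketch of single-core asynchronous reads with liburing. The
 * device path, queue depth, and block size are illustrative assumptions,
 * and error handling is abbreviated. Build with: gcc demo.c -luring */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 256        /* operations queued per submission batch */
#define BLOCK_SIZE  4096       /* 4 KiB reads */

int main(void)
{
    struct io_uring ring;
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* assumed device */
    if (fd < 0 || io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
        perror("setup");
        return 1;
    }

    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return 1;

    /* Queue a batch of reads, then submit them all with a single system
     * call; the data is discarded, so one buffer suffices for the sketch. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BLOCK_SIZE, (off_t)i * BLOCK_SIZE);
    }
    io_uring_submit(&ring);

    /* Reap completions; a real benchmark would keep the queue full. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}
\end{verbatim}
Because submissions and completions pass through shared rings rather than
per-operation system calls, a single core can sustain operation rates far
beyond what a host-mediated RPC path can absorb.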
In other words, the host-based RPC processing that in the past served to
optimize access to storage devices has now become a hindrance. The
server’s ability to deserialize and process an RPC request and then
serialize and send an RPC response is now the gating factor in the IOPS
rate. The fastest RPC libraries, co-designed with high-performance
interconnects and performing no server-side processing, have been unable
to achieve even 500,000 RPCs per second per core. The traditional HPC
solution of scaling out to achieve higher IOPS is inefficient; expanding
the number of server CPU cores will increase complexity, footprint, and
power demands, offer diminishing returns on aggregate IOPS rate, and
effect no improvement in response time for individual accesses. The
classic HPC storage architecture must now be revisited in the context of
mixed workloads and the widespread availability of low-latency hardware
components.
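To make the per-RPC cost concrete, the schematic sketch below models the
stages a host-mediated storage server performs for every request. All
structures and functions are hypothetical stand-ins rather than any
particular file system's implementation; the point is simply that each
stage consumes host CPU cycles, so the per-core RPC rate is bounded by
their combined cost.
\begin{verbatim}
/* Schematic model of the per-RPC work performed by a host-mediated storage
 * server. Every function body is a hypothetical stand-in; the point is that
 * each request passes through several CPU-consuming stages, and the per-core
 * RPC rate is bounded by their combined cost. */
#include <stdint.h>
#include <stdio.h>

struct rpc_request  { uint64_t object_id, offset, length; uint32_t client_id; };
struct rpc_response { int32_t status; uint64_t bytes_moved; };

/* Stand-ins for real work: header parsing, credential checks, block I/O
 * (including interrupt handling and completion), and response marshaling. */
static void deserialize(const char *wire, struct rpc_request *req) { (void)wire; (void)req; }
static int  authorize(const struct rpc_request *req)               { (void)req; return 0; }
static void issue_device_io(const struct rpc_request *req)         { (void)req; }
static void serialize(const struct rpc_response *r, char *wire)    { (void)r; (void)wire; }

/* One full request/response cycle: the unit of work that currently limits
 * a server core to on the order of 100,000 RPCs per second. */
static void handle_rpc(const char *in, char *out)
{
    struct rpc_request  req  = { 0 };
    struct rpc_response resp = { 0 };

    deserialize(in, &req);        /* unpack the RPC */
    if (authorize(&req) == 0)     /* enforce access control */
        issue_device_io(&req);    /* touch the device */
    serialize(&resp, out);        /* pack the reply */
}

int main(void)
{
    char in[64] = { 0 }, out[64] = { 0 };
    handle_rpc(in, out);
    printf("handled one schematic RPC\n");
    return 0;
}
\end{verbatim}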
Scaling and maintaining low latency
Science teams driving these data-intensive activities are pushing the
scalability of their computations just as teams with simulation codes
have before them, and it is paramount that storage systems support that
scalability. Traditional caching and prefetching are not generally
effective for these algorithms, eliminating a common option for
accelerating access. On the other hand, the HPC networking community has
learned much that can be applied to next-generation storage systems.
Limiting the state associated with connections is an important enabler
for scale-out, especially when the communication pattern lacks the
regular structure found in many scientific codes.
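A back-of-envelope sketch of why per-connection state undermines scale-out
appears below; the client count and per-connection footprint are
illustrative assumptions, not measurements of any particular transport.
\begin{verbatim}
/* Back-of-envelope: per-server memory devoted to connection state under
 * all-to-all connectivity. Both figures are illustrative assumptions,
 * not measurements of any particular transport. */
#include <stdio.h>

int main(void)
{
    const double client_processes = 256.0 * 1024.0;  /* e.g., 4,096 nodes x 64 ranks */
    const double bytes_per_conn   = 16.0 * 1024.0;   /* queues, buffers, timers, ... */

    double per_server = client_processes * bytes_per_conn;
    printf("connection state per server: %.1f GiB\n",
           per_server / (1024.0 * 1024.0 * 1024.0));
    /* Roughly 4 GiB per server of bookkeeping, before reconnection storms
     * and failure handling, that a connectionless model avoids entirely. */
    return 0;
}
\end{verbatim}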
Devices supporting protocols that require connection establishment are
extremely challenging to employ at HPC scale, but unfortunately, that
is the current direction of network-accessible device protocols such as
NVMe-oF. Connectionless models of communication have been demonstrated
in HPC \cite{barrett2012portals} and supported in production hardware \cite{derradji2015bxi}: it is up to
HPC to invent the fast, direct access to remote storage devices that
will be a key enabling technology for scalable storage systems. HPC
platforms have likewise been at the leading edge of requirements for
high-concurrency, low-latency access to remote memory, and extending
proven techniques to enable similarly parallel and low-latency access to
storage is a natural research direction. Alterations and alternatives to
existing data transport methods for storage—perhaps built using
compute-enabled devices—should be investigated and their potential
demonstrated. User-land access to resources has also been shown to be
critical for maintaining low latencies, which will be essential in the
data plane, if not also in at least some aspects of the metadata plane.
Approaches along these lines have begun to be explored in the larger
storage community \cite{chen2021scalable} but must be adapted to the scales and
networks of HPC.
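The sketch below illustrates, in the spirit of the connectionless models
cited above, what direct user-level access to a remote storage device could
look like. Every type and function is hypothetical; the point is only that
the target is named by a globally meaningful address rather than a
per-client session, so the device retains no per-client state between
requests.
\begin{verbatim}
/* Hypothetical interface sketch: connectionless, one-sided read of a block
 * from a remote NVMe namespace. Nothing here corresponds to an existing
 * API (Portals, BXI, and NVMe-oF all differ); the point is that the target
 * is named by a global address rather than a per-client connection, so the
 * device keeps no per-client state between requests. */
#include <stdint.h>
#include <stdio.h>

/* Globally meaningful target: no handshake, no session, no queue pair. */
struct storage_target {
    uint32_t node;          /* fabric-routable endpoint id */
    uint32_t namespace_id;  /* namespace on that node's NVMe device */
};

/* One-sided get: the initiator supplies everything needed to complete the
 * operation; completion is matched locally by a counter or event queue.
 * (Stub implementation so the sketch is self-contained.) */
static int storage_get(struct storage_target t, uint64_t offset,
                       void *buf, uint64_t len)
{
    (void)t; (void)offset; (void)buf; (void)len;
    return 0; /* a real implementation would post a fabric-level get */
}

int main(void)
{
    struct storage_target t = { .node = 42, .namespace_id = 1 };
    char block[4096];
    if (storage_get(t, 0, block, sizeof block) == 0)
        printf("posted one connectionless read to node %u\n", (unsigned)t.node);
    return 0;
}
\end{verbatim}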
Securing access to storage devices
In addition to providing efficient access to storage devices, storage
system software is also responsible for controlling access to the data
stored within high-performance storage systems. In the current model of
server-mediated access to storage, the system software enforces all data
access controls. As we move to a storage access
paradigm that supports faster, low-latency access to storage devices, a
server-mediated access control scheme becomes a bottleneck that
paralyzes emerging workloads rather than acting as a useful enforcement
mechanism. At the same time, storage devices have gained richer
interfaces and capabilities, including zoned namespaces (ZNS) and
embedded functions in the form of computational storage, and thus it is
clear that security models that treat storage devices as only a
repository for stored data are obsolete.
More direct access to storage devices from large numbers of client
processes, which may include user-space access to remote storage
devices, will require new security models not currently provided by
either network protocols or storage devices. While the NVMe
standards body has defined multiple methods for securely accessing
storage, none of these mechanisms are currently a good match for
data-intensive scientific discovery. The two most common NVMe security
methods, in-band authentication and per-request security, are focused on
ensuring that clients are authenticated with servers but cannot
differentiate between data plane operations that read data or write data
and control plane operations that create or destroy on-device
namespaces. And while key-per-IO is a novel model that enables every
disk access to be secured separately, the overhead of checking an
encryption key for every operation is antithetical to low-latency access
to storage devices. Instead, new security models that expose the
performance advantages of zoned namespaces \cite{bjorling2021zns} and leverage scalable
approaches to embedded compute, such as computational storage and
SmartNICs \cite{li2020leapio}, require additional research.
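One possible direction is sketched below purely as an illustration: a
capability-style token, issued out of band by the control plane, that a
device verifies once when a client opens a zone for direct access,
amortizing the authorization cost instead of repeating it on every I/O as
key-per-IO does. All structures and checks shown are hypothetical and are
not part of any NVMe specification.
\begin{verbatim}
/* Hypothetical capability check for direct access to a ZNS zone. This is
 * not an NVMe mechanism; it sketches how authorization could be enforced
 * once per zone open (amortized) instead of per I/O (as with key-per-IO). */
#include <stdint.h>
#include <stdio.h>

enum zone_rights { ZONE_READ = 1, ZONE_APPEND = 2 };

struct zone_capability {
    uint32_t namespace_id;  /* which namespace the grant applies to */
    uint32_t zone_id;       /* which zone within it */
    uint32_t rights;        /* bitmask of zone_rights */
    uint64_t expires;       /* expiry time set by the control plane */
    uint8_t  mac[32];       /* integrity tag issued by the control plane */
};

/* Stand-in for a real MAC verification against a device-held secret. */
static int mac_is_valid(const struct zone_capability *cap) { (void)cap; return 1; }

/* Checked once when the client opens the zone for direct access; subsequent
 * reads/appends to that open zone need no further authorization work. */
static int authorize_zone_open(const struct zone_capability *cap,
                               uint32_t ns, uint32_t zone, uint32_t want,
                               uint64_t now)
{
    return mac_is_valid(cap) &&
           cap->namespace_id == ns &&
           cap->zone_id == zone &&
           (cap->rights & want) == want &&
           now < cap->expires;
}

int main(void)
{
    struct zone_capability cap = { .namespace_id = 1, .zone_id = 7,
                                   .rights = ZONE_APPEND, .expires = 1ULL << 40 };
    printf("append to zone 7 %s\n",
           authorize_zone_open(&cap, 1, 7, ZONE_APPEND, 1000) ? "granted" : "denied");
    return 0;
}
\end{verbatim}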