Methods
Data
Daily flow data were gathered from western North America (WNA) stream
gauges in the Reference Hydrometric Basin Network (RHBN) and the
Hydro-Climatic Data Network (HCDN), which are operated by the Water Survey
of Canada (WSC) and the United States Geological Survey (USGS), respectively.
Streams selected for the networks have predominantly natural flow
regimes, with minimal human disturbance (e.g. significant land-use
change, dams, reservoirs, and hydro-power stations) during long-term
observation periods. In total, 304 HCDN and RHBN stream gauges were
deemed suitable in WNA, spanning four Canadian provinces and territories
(British Columbia, Alberta, Yukon, and the Northwest Territories) and eleven
American states (Washington, Oregon, California, Idaho, Nevada, Montana,
Utah, Wyoming, Colorado, Arizona and New Mexico) (Fig. 1). The
observation length varies among gauges, ranging from ten to over one
hundred years. At each site, the long-term daily hydrograph was split
into annual daily hydrographs (ADHs). Each ADH contains 365 values
of daily flow over a year from 1 January to 31 December (leap days
excluded where applicable). Small gaps (≤ 7 consecutive days) in ADHs were
filled by linear interpolation. ADHs with gaps of more than 7 consecutive
days were excluded.
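The ADH construction and gap handling described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study: the function names `split_into_adhs` and `fill_small_gaps`, the use of `None` for missing days, and the decision to exclude an ADH whose gap touches 1 January or 31 December (where no neighbour within the year is available for interpolation) are all assumptions.

```python
from datetime import date, timedelta

MAX_GAP = 7  # longest run of missing days that is still interpolated


def split_into_adhs(daily):
    """Group a {date: flow} mapping into {year: [365 values]}.

    Missing days are stored as None; 29 February is dropped.
    """
    adhs = {}
    for year in sorted({d.year for d in daily}):
        values = []
        day = date(year, 1, 1)
        while day.year == year:
            if not (day.month == 2 and day.day == 29):  # exclude leap days
                values.append(daily.get(day))
            day += timedelta(days=1)
        adhs[year] = values
    return adhs


def fill_small_gaps(adh, max_gap=MAX_GAP):
    """Linearly interpolate gaps of at most max_gap days.

    Returns None when the ADH should be excluded: a gap is longer than
    max_gap, or (assumption) a gap touches the start or end of the year.
    """
    filled = list(adh)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        j = i
        while j < n and filled[j] is None:
            j += 1
        # The gap covers indices i .. j-1.
        if j - i > max_gap or i == 0 or j == n:
            return None
        left, right = filled[i - 1], filled[j]
        span = j - (i - 1)
        for k in range(i, j):
            filled[k] = left + (right - left) * (k - (i - 1)) / span
        i = j
    return filled
```

A gauge record would first pass through `split_into_adhs`, and each resulting year through `fill_small_gaps`; years for which the latter returns `None` are dropped.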
Figure 1: Selected HCDN and RHBN streamflow gauges in western North
America
During initial data screening, ADHs with atypical shapes (e.g. sudden
zig-zag patterns) were detected and excluded due to uncertainty in data
quality. Furthermore, we excluded the ADHs of extreme years, as they are
less representative of the general flow pattern. Here, extreme
wet years are defined as ADHs with a maximum flow more than 100 times
the gauge's long-term average, while extreme dry years are ADHs with
a minimum flow less than one hundredth of that average. From an
initial set of 19499 ADHs, 17110 ADHs were preserved for analysis.
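The extreme-year criteria can be expressed as a short filter. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the long-term average is the mean of all daily flows across the gauge's ADHs, which the text does not spell out.

```python
def is_extreme_year(adh, long_term_mean, factor=100.0):
    """Flag an ADH as an extreme wet or extreme dry year."""
    extreme_wet = max(adh) > factor * long_term_mean
    extreme_dry = min(adh) < long_term_mean / factor
    return extreme_wet or extreme_dry


def screen_adhs(adhs_by_year, factor=100.0):
    """Keep only the ADHs of one gauge that are not extreme years.

    Assumption: the long-term average is the mean of every daily value
    in the gauge's ADHs.
    """
    all_values = [q for adh in adhs_by_year.values() for q in adh]
    long_term_mean = sum(all_values) / len(all_values)
    return {
        year: adh
        for year, adh in adhs_by_year.items()
        if not is_extreme_year(adh, long_term_mean, factor)
    }
```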
Selected ADHs were pre-processed prior to analysis. First, ADHs for a
given stream were divided by its long-term average, which limits the
scaling effect of watershed size and enhances comparability among streams.
Subsequently, a log transformation was applied to reduce skewness of the
data, as machine learning algorithms typically perform better on
approximately normally distributed data. A small constant ($10^{-6}$)
was added to ADHs to avoid invalid values during log transformation, since
the logarithm of zero flow is undefined (see Eq. 1). Furthermore, min-max
normalization (see Eq. 2) was applied to scale ADH values to the range
[0, 1], as recommended in the original t-SNE paper (van der Maaten and
Hinton, 2008).
$$f(x) = \log\left(x + 10^{-6}\right) \tag{1}$$
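The pre-processing chain (scaling by the long-term average, the log transformation of Eq. 1, then min-max normalization) could look like the sketch below. Several details are assumptions rather than statements from the text: the natural logarithm is used, the minimum and maximum are taken per ADH rather than over the whole data set, and a constant ADH is mapped to zeros to avoid division by zero. The function name `preprocess_adh` is hypothetical.

```python
import math

EPS = 1e-6  # small constant added before the log transformation


def preprocess_adh(adh, long_term_mean, eps=EPS):
    """Scale, log-transform, and min-max normalize one ADH."""
    # 1. Divide by the stream's long-term average flow.
    scaled = [q / long_term_mean for q in adh]
    # 2. Log transformation; eps keeps zero flows valid (natural log assumed).
    logged = [math.log(s + eps) for s in scaled]
    # 3. Min-max normalization to [0, 1] (per-ADH extrema assumed).
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0] * len(logged)
    return [(v - lo) / (hi - lo) for v in logged]
```

Whether the extrema should be computed per ADH or across all ADHs affects how much of the between-year magnitude difference survives normalization, so this choice should be checked against the definition accompanying Eq. 2.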