Methods

Data

Daily flow data were gathered from western North America (WNA) stream gauges in the Reference Hydrometric Basin Network (RHBN) and the Hydro-Climatic Data Network (HCDN), which are operated by the Water Survey of Canada (WSC) and the United States Geological Survey (USGS), respectively. Streams selected for these networks have predominantly natural flow regimes, with minimal human disturbance (e.g. significant land-use change, dams, reservoirs, and hydro-power stations) during long-term observation periods. In total, 304 HCDN and RHBN stream gauges were deemed suitable in WNA, covering four Canadian provinces/territories (British Columbia, Alberta, Yukon, and Northwest Territories) and eleven American states (Washington, Oregon, California, Idaho, Nevada, Montana, Utah, Wyoming, Colorado, Arizona and New Mexico) (Fig. 1). The observation length varies among gauges, ranging from ten to over one hundred years. At each site, the long-term daily hydrograph was split into Annual Daily Hydrographs (ADHs). Each ADH contains 365 daily flow values for one calendar year, from 1 January to 31 December (leap days were excluded where applicable). Small gaps (≤ 7 consecutive days) in ADHs were filled by linear interpolation. ADHs with gaps > 7 consecutive days were excluded.
Figure 1: Selected HCDN and RHBN streamflow gauges in western North America
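The gap-handling rule for ADHs described above can be illustrated with a minimal Python sketch. The function name `fill_short_gaps`, the use of `None` for missing days, and the treatment of gaps that touch the start or end of the year (the ADH is excluded here, because no bracketing values exist within the year) are assumptions made for illustration and are not taken from the study.

```python
MAX_GAP_DAYS = 7  # threshold stated in the text


def fill_short_gaps(adh, max_gap=MAX_GAP_DAYS):
    """Return a gap-filled copy of one ADH, or None if the ADH is excluded.

    `adh` is a list of daily flows in which None marks a missing day
    (a hypothetical encoding chosen for this sketch).
    """
    filled = list(adh)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        # Locate the run of consecutive missing days [i, j).
        j = i
        while j < n and filled[j] is None:
            j += 1
        gap = j - i
        # Exclude the ADH if the gap is too long. Assumption: gaps touching
        # either end of the year are also treated as not fillable.
        if gap > max_gap or i == 0 or j == n:
            return None
        left, right = filled[i - 1], filled[j]
        for k in range(i, j):
            frac = (k - i + 1) / (gap + 1)
            filled[k] = left + frac * (right - left)
        i = j
    return filled
```

For example, `fill_short_gaps([1.0, None, None, 4.0])` interpolates the two missing days from their neighbours, whereas a run of eight missing days causes the function to return `None`.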
During initial data screening, ADHs with atypical shapes (e.g. sudden zig-zag patterns) were detected and excluded because of uncertainty in data quality. Furthermore, we excluded the ADHs of extreme years, as they are less representative of the general flow pattern. Here, extreme wet years are defined as ADHs whose maximum flow is more than 100 times the stream's long-term average flow, while extreme dry years are ADHs whose minimum flow is less than one hundredth of that average. From an initial set of 19,499 ADHs, 17,110 ADHs were retained for analysis.
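The extreme-year criterion above can be sketched in Python as follows. The function names and the way the long-term average is computed (the mean of all daily values supplied for the gauge, before any exclusion) are assumptions for illustration; the study does not state these details.

```python
def is_extreme_year(adh, long_term_mean, factor=100.0):
    """Flag an ADH as an extreme wet or extreme dry year."""
    # Extreme wet: peak flow more than `factor` times the long-term average.
    wet = max(adh) > factor * long_term_mean
    # Extreme dry: minimum flow less than 1/`factor` of the long-term average.
    dry = min(adh) < long_term_mean / factor
    return wet or dry


def screen_adhs(adhs, factor=100.0):
    """Keep only the ADHs of one gauge that are not extreme years.

    Assumption: the long-term average is the mean of all daily flows
    passed in for this gauge.
    """
    flows = [q for adh in adhs for q in adh]
    long_term_mean = sum(flows) / len(flows)
    return [adh for adh in adhs
            if not is_extreme_year(adh, long_term_mean, factor)]
```

The screening for atypical hydrograph shapes is not covered by this sketch, since the text does not specify how such shapes were detected.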
Selected ADHs were pre-processed prior to analysis. First, the ADHs of a given stream were divided by the stream's long-term average flow, which limits the scaling effect of watershed size and improves comparability among streams. Subsequently, a log transformation was applied to reduce the skewness of the data, as machine learning algorithms typically perform better on approximately normally distributed data. A small constant ($10^{-6}$) was added to the ADHs to avoid invalid values during the log transformation (see Eq. 1). Finally, min-max normalization (see Eq. 2) was applied to scale the ADH values into the range from 0 to 1, as recommended in the original t-SNE paper (van der Maaten and Hinton, 2008).
$$f(x) = \log\left(x + 10^{-6}\right) \qquad (1)$$
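The pre-processing chain (scaling by the long-term average, the log transformation of Eq. 1, and min-max normalization) can be sketched in Python as below. Several details are assumptions made for illustration only: the function name `preprocess_adh`, the use of the natural logarithm, the standard min-max form $(x - x_{\min}) / (x_{\max} - x_{\min})$, and the choice to take the minimum and maximum within each ADH rather than across the whole data set.

```python
import math

EPS = 1e-6  # small constant added before the log transform (Eq. 1)


def preprocess_adh(adh, long_term_mean, eps=EPS):
    """Scale, log-transform, and min-max normalize one ADH."""
    # Step 1: divide by the stream's long-term average flow.
    scaled = [q / long_term_mean for q in adh]
    # Step 2: log transform with a small offset so zero flows stay valid.
    # Assumption: natural logarithm.
    logged = [math.log(v + eps) for v in scaled]
    # Step 3: min-max normalization to [0, 1].
    # Assumption: min and max are taken within this ADH.
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # A constant hydrograph has no range; map it to zeros.
        return [0.0] * len(logged)
    return [(v - lo) / (hi - lo) for v in logged]
```

With these assumptions, a day of zero flow maps to a finite value (the logarithm of the offset) rather than raising an error, which is the purpose of the added constant.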