Figure 2 is here.
Fig. 2 Illustration of the symbolic strings method: a) the basic approach; b) data aggregation using an Aggregation Length of AL = 2.
The first step is to determine the median value of each streamflow time series. The next step is to map each value of the streamflow time series to 0 if it is below or at the median; otherwise it is mapped to 1. Once the corresponding binarized time series is created, we define a window length L (\(L \in \mathbb{N}\)), also called the word length, composed of L consecutive symbols. Thus, the number of different possible words that can be encountered in a studied system is \(2^{L}\). As illustrated in Fig. 2a, if the defined word length is L = 2, the possible words that can be encountered are 00, 01, 10, and 11, and hence each word describes a state of the system. The next step, as stated by Wolf (1999), is to find the primary ingredients needed to evaluate the information- and complexity-based metrics, i.e. the following three sets of probabilities: i) \(p_{L,i}\): the state probability of the i-th L word, where i = 1, 2, …, \(2^{L}\); ii) \(p_{L,ij}\): the probability of shifting from the i-th to the j-th L word immediately, where i = 1, 2, …, \(2^{L}\) and j = 1, 2, …, \(2^{L}\); and iii) \(p_{L,i\rightarrow j}\): the conditional probability of the occurrence of the j-th L word, given that the i-th L word has been observed before. Once the aforementioned probabilities are determined, the information- and complexity-based metrics can be estimated.
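To make this encoding step concrete, the following minimal sketch (in Python, assuming the streamflow record is stored in a NumPy array q; the data and variable names are hypothetical and not those of the studied stations) shows how a series can be binarized at its median and how the three sets of probabilities can be estimated for a word length L = 2:

```python
# Minimal sketch of the symbolic encoding step (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
q = rng.gamma(shape=2.0, scale=10.0, size=1000)   # placeholder hourly streamflow series

# 1) Binarize: 0 if the value is below or at the median, 1 otherwise
s = (q > np.median(q)).astype(int)

# 2) Build overlapping words of length L and estimate the three probability sets
L = 2
words = np.array([int("".join(map(str, s[t:t + L])), 2)
                  for t in range(len(s) - L + 1)])

p_i = np.bincount(words, minlength=2 ** L) / len(words)   # state probabilities p_{L,i}

pairs = np.zeros((2 ** L, 2 ** L))
for a, b in zip(words[:-1], words[1:]):                    # word j follows word i one step later
    pairs[a, b] += 1
p_ij = pairs / pairs.sum()                                 # joint probabilities p_{L,ij}

row = pairs.sum(axis=1, keepdims=True)
p_cond = np.divide(pairs, row,                             # conditional probabilities p_{L,i->j}
                   out=np.zeros_like(pairs), where=row > 0)
```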
In this work, we employ two information measures and two complexity measures: the metric entropy and the mean information gain are used to quantify the information content of our data, whereas the effective measure of complexity and the fluctuation complexity are used to quantify the complexity content of our streamflow data. These metrics are explained below.
Basically, the information entropy proposed by Shannon (1948) is a popular measure that quantifies the randomness in a dataset. While the Shannon entropy is given by Eq. (3), the metric entropy is the Shannon entropy (\(H_{S}\)) divided by the word length (L); it is thus a normalization of the Shannon entropy that characterizes the information contained in a dataset independently of the word length (L).
\(H_{S}=-\sum_{i=1}^{2^{L}}p_{L,i}\,\text{Log}_{2}\,p_{L,i}\) (3)
The metric entropy is zero for a steady (constant) sequence of data; conversely, it increases monotonically as the disorder of the sequence increases and reaches its maximum of 1 for evenly distributed random sequences.
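As a brief illustration, the metric entropy can be computed from the state probabilities as follows (a sketch under the same assumptions as the encoding example above; the probability vector used here is a hypothetical uniform case):

```python
# Sketch of Eq. (3) and the metric entropy H_S / L (hypothetical probabilities).
import numpy as np

def shannon_entropy(p):
    p = p[p > 0]                      # 0 * log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

L = 2
p_i = np.array([0.25, 0.25, 0.25, 0.25])   # evenly distributed words
metric_entropy = shannon_entropy(p_i) / L   # equals 1 for this uniform case
```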
Alternatively, the Mean Information Gain (MIG) is an entropy (randomness) measure defined as the mean amount of information that can be gained about a dataset, and it is given as:
\(MIG=-\sum_{i,j=1}^{2^{L}}p_{L,ij}\,\text{Log}_{2}\,p_{L,i\rightarrow j}\) (4)
The above equation can also be expressed as a difference of Shannon entropies:
\(MIG=H_{S}(L+1)-H_{S}(L)\) (5)
Pachepsky et al. (2016) pointed out that larger values of information gain correspond to a greater chance of a system varying from one state to another.
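For illustration, Eq. (5) can be evaluated directly from a binarized series by comparing the Shannon entropies of words of length L and L + 1; the short binary sequence below is a hypothetical example, not observed data:

```python
# Sketch of Eq. (5): MIG = H_S(L+1) - H_S(L) (hypothetical binary sequence).
import numpy as np

def word_entropy(s, L):
    words = [int("".join(map(str, s[t:t + L])), 2) for t in range(len(s) - L + 1)]
    p = np.bincount(np.array(words), minlength=2 ** L) / len(words)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

s = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
L = 2
mig = word_entropy(s, L + 1) - word_entropy(s, L)   # Eq. (5)
```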
In contrast, the complexity metrics are helpful measures that capture the existence of internal patterns in the studied datasets (Pan et al., 2012). The effective measure of complexity (EMC), as stated by Grassberger (1986), is the least quantity of information that has to be amassed to deliver the best possible prediction of the next data element; it can be approximated and computed using Eq. (6):
\(EMC\approx(L+1)\,H_{S}(L)-L\,H_{S}(L+1)\) (6)
Alternatively, EMC can also be evaluated using Eq. (7) as:
\(EMC\approx\sum_{i,j=1}^{2^{L}}p_{L,ij}\,\text{Log}_{2}\frac{p_{L,i\rightarrow j}^{L}}{p_{L,i}}\) (7)
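Analogously, Eq. (6) can be approximated from the same word entropies; the sketch below reuses the hypothetical sequence of the previous example and is not an implementation of the station analyses:

```python
# Sketch of Eq. (6): EMC ≈ (L+1) H_S(L) - L H_S(L+1) (hypothetical sequence).
import numpy as np

def word_entropy(s, L):
    words = [int("".join(map(str, s[t:t + L])), 2) for t in range(len(s) - L + 1)]
    p = np.bincount(np.array(words), minlength=2 ** L) / len(words)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

s = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
L = 2
emc = (L + 1) * word_entropy(s, L) - L * word_entropy(s, L + 1)   # Eq. (6)
```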
Finally, the fluctuation complexity (\(\sigma_{\Gamma}^{2}\)) is one of the most important complexity measures since it describes the fluctuations that occur in a system, i.e. how a system transforms from one pattern to another. The fluctuation complexity is therefore a measure of the changes of the net information gain over one or more time steps. Hence, data that exhibit a high level of fluctuation yield a larger fluctuation complexity (Bates and Shepard, 1993). The fluctuation complexity is estimated as:
\(\sigma_{\Gamma}^{2}=\sum_{i,j=1}^{2^{L}}p_{L,ij}\left(\text{Log}_{2}\frac{p_{L,i}}{p_{L,j}}\right)^{2}\) (8)
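A minimal sketch of Eq. (8), assuming the state and joint probabilities are estimated as in the encoding example above (the arrays below are a small hypothetical case), is:

```python
# Sketch of Eq. (8): fluctuation complexity from p_{L,i} and p_{L,ij} (hypothetical values).
import numpy as np

p_i = np.array([0.4, 0.2, 0.2, 0.2])      # state probabilities p_{L,i}
p_ij = np.outer(p_i, p_i)                  # example joint probabilities p_{L,ij}

mask = (p_i[:, None] > 0) & (p_i[None, :] > 0)   # guard against log2 of zero
log_ratio = np.zeros_like(p_ij)
log_ratio[mask] = np.log2(p_i[:, None] / p_i[None, :])[mask]
fluctuation_complexity = np.sum(p_ij * log_ratio ** 2)   # Eq. (8)
```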

Temporal discharge characteristics using information and complexity measures within low-frequency and high-frequency scales

Temporal discharge characterization by means of information and complexity theory was performed at different time domains using growing aggregation lengths (AL). To illustrate, consider a word length of two characters: in the case of AL = 1 (the basic approach), each hour of a studied streamflow record is substituted directly as one character to form part of a word (Fig. 2a). Alternatively, in the case of AL = 2, every two successive hours of each studied streamflow record are gathered, averaged, binarized and then substituted as one character to compose a word; for more clarification see Fig. 2b and the sketch below.
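The aggregation step can be sketched as follows, assuming an hourly series q; whether the median threshold is taken before or after aggregation is not stated above, so binarizing the aggregated series here is an assumption of the sketch:

```python
# Sketch of the aggregation step (Fig. 2b) with AL = 2 (hypothetical data).
import numpy as np

rng = np.random.default_rng(1)
q = rng.gamma(shape=2.0, scale=10.0, size=240)    # ten days of hourly values

AL = 2
n = (len(q) // AL) * AL                           # drop a trailing incomplete block, if any
q_agg = q[:n].reshape(-1, AL).mean(axis=1)         # average each block of AL successive hours

# Binarize the aggregated series; each aggregated value becomes one character of a word
# (assumption: the median of the aggregated series is used as the threshold).
s = (q_agg > np.median(q_agg)).astype(int)
```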
In the present research, low-frequency and high-frequency refer to streamflow variations, as described by the information and complexity measures, over long and short time scales, respectively. In fact, the availability of high-quality streamflow data (roughly six months) measured by means of the FAT close to the Ozekiyama gauging station (Fig. 1) offers a unique opportunity to examine the temporal variations of streamflow patterns over short periods. Hence, the high-frequency analyses concentrate on the hourly discharge data from January 2016 to the end of June 2016 at the Awaya, Minamihatachiki, Miyoshi, Ozekiyama, and FAT stations, whereas the low-frequency analyses comprise hourly flow data from January 2002 to the end of December 2017 covering the Awaya, Minamihatachiki, Miyoshi, and Ozekiyama stations.
For both the low- and high-frequency analyses we used a word length of four characters to describe the different states of the system patterns; thus the number of potential patterns is \(2^{L} = 2^{4} = 16\) possible words.

An extension of the information and complexity measures to examine flood events

Herein, we provide an extension of the information and complexity theory aiming to assess the annual variation of flood events. According to the symbolic strings method described above (Section 3.1), the \(Q_{Median}\) value of each streamflow record was used as a threshold to map each streamflow value to 0 if it is equal to or less than the median and to 1 if it is greater than the median. In this approach, the same procedures were performed, with two changes: i) the maximum daily discharge was investigated instead of the hourly discharge, and ii) the threshold discharge values for the observed stations were changed as presented in Table 1; thus, if a maximum daily discharge value is equal to or greater than the threshold it is converted to 1, otherwise it is assigned 0. In the case of the flood assessment, the word length was set to 2 characters, and hence 4 possible words describe the different patterns of the system; additionally, one information metric and one complexity metric were used in these analyses, namely the metric entropy and the effective measure of complexity.
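A hedged sketch of this flood-oriented encoding, assuming a daily-maximum discharge series and a station-specific threshold (the numerical values below are placeholders and not the thresholds of Table 1), is:

```python
# Sketch of the flood-event extension: threshold-based binarization of daily maxima
# and formation of L = 2 words (hypothetical data and threshold).
import numpy as np

rng = np.random.default_rng(2)
q_daily_max = rng.gamma(shape=2.0, scale=50.0, size=365)   # one year of daily maxima
q_threshold = 150.0                                         # hypothetical flood threshold

# 1 if the daily maximum is equal to or greater than the threshold, 0 otherwise
s = (q_daily_max >= q_threshold).astype(int)

# Words of length L = 2, i.e. 2**2 = 4 possible flood/no-flood patterns
L = 2
words = [int("".join(map(str, s[t:t + L])), 2) for t in range(len(s) - L + 1)]
```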