Figure 2 is here.
Fig. 2 Illustration of the symbolic strings method: a) the basic approach; b) data aggregation using an Aggregation Length (AL = 2).
The first step is to determine the median of each streamflow time series. Each value of the series is then mapped to 0 if it is at or below the median, and to 1 otherwise.
Once the corresponding binarized time series is created, we define a word length L (\(L \in \mathbb{N}\)), also called the window length, composed of L consecutive symbols. Thus, the number of different possible words that can be encountered in a studied system is \(2^{L}\). As illustrated in Fig. 2a, if the defined word length is L = 2, the possible words are 00, 01, 10, and 11, and hence each word describes a state of the system. The next step, as stated by Wolf (1999), is to find the primary ingredients needed to evaluate the information- and complexity-based metrics, i.e. the three sets of probabilities: i) \(p_{L,i}\): the state probability of the i-th L-word, where i = 1, 2, …, \(2^{L}\); ii) \(p_{L,ij}\): the joint probability of shifting from the i-th to the j-th L-word in a single step, where i = 1, 2, …, \(2^{L}\) and j = 1, 2, …, \(2^{L}\); and iii) \(p_{L,i\rightarrow j}\): the conditional probability of the occurrence of the j-th L-word, given that the i-th L-word has been observed immediately before. Once these probabilities are determined, the information- and complexity-based metrics can be estimated.
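As an illustration, the following minimal Python sketch (written for this description; the function names, the use of overlapping words shifted by one symbol, and the handling of unobserved words are assumptions, not part of the cited methods) binarizes a record against its median and estimates the three probability sets for a word length L:

import numpy as np

def binarize(series):
    """Map each value to 0 if it is at or below the series median, 1 otherwise."""
    series = np.asarray(series, dtype=float)
    return (series > np.median(series)).astype(int)

def word_probabilities(symbols, L=2):
    """Estimate p_{L,i}, p_{L,ij} and p_{L,i->j} for overlapping words of length L."""
    n_words = 2 ** L
    # Encode each window of L consecutive binary symbols as an integer word index.
    words = np.array([int("".join(map(str, symbols[k:k + L])), 2)
                      for k in range(len(symbols) - L + 1)])
    # i) state probability of each L-word
    p_i = np.bincount(words, minlength=n_words) / len(words)
    # ii) joint probability of the i-th word being followed, one symbol later, by the j-th word
    p_ij = np.zeros((n_words, n_words))
    for a, b in zip(words[:-1], words[1:]):
        p_ij[a, b] += 1.0
    p_ij /= len(words) - 1
    # iii) conditional probability p(j | i); rows of unobserved words remain zero
    row_sums = p_ij.sum(axis=1, keepdims=True)
    p_cond = np.divide(p_ij, row_sums, out=np.zeros_like(p_ij), where=row_sums > 0)
    return p_i, p_ij, p_cond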
In this work, we consider two information measures and two complexity measures: the mean information gain and the metric entropy were used to quantify the information content of our data, while the effective measure of complexity and the fluctuation complexity were selected to quantify the complexity content of our streamflow data. These metrics are explained below.
The information entropy proposed by Shannon (1948) is a popular measure that quantifies the randomness in a dataset. While the Shannon entropy is given by Eq. (3), the metric entropy is the Shannon entropy (\(H_{S}\)) divided by the word length (L); it is thus a normalization of the Shannon entropy that characterizes the information contained in a dataset while being independent of the word length (L).
\(H_{S} = -\sum_{i=1}^{2^{L}} p_{L,i}\,\log_{2} p_{L,i}\) (3)
The metric entropy is zero for a steady (constant) sequence; conversely, it increases monotonically as the disorder of the sequence increases and reaches its maximum of 1 for uniformly distributed random sequences.
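A possible implementation of the metric entropy, reusing the state probabilities \(p_{L,i}\) returned by the sketch above (treating unobserved words with the usual convention 0·log 0 = 0 is an assumption made here):

import numpy as np

def metric_entropy(p_i, L):
    """Shannon entropy of the L-word distribution (Eq. 3) normalized by the word length L."""
    p = p_i[p_i > 0]                      # skip words that never occur (0 * log 0 := 0)
    h_shannon = -np.sum(p * np.log2(p))   # Eq. (3)
    return h_shannon / L                  # metric entropy in [0, 1]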
Alternatively, the Mean Information Gain (MIG) is an entropy (randomness) measure defined as the mean amount of information that can be gained about a dataset, and is given as:
\(MIG = -\sum_{i,j=1}^{2^{L}} p_{L,ij}\,\log_{2} p_{L,i \rightarrow j}\) (4)
The above equation can also be expressed as a difference of Shannon entropies as:
\(MIG=H_{S}(L+1)-H_{S}(L)\) (5)
Pachepsky et al. (2016) pointed out that larger values of the information gain indicate a greater chance of the system varying from one state to another.
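A minimal sketch of Eq. (4), based on the probability arrays from the earlier sketch (illustrative only, not necessarily the exact implementation used for the results reported here):

import numpy as np

def mean_information_gain(p_ij, p_cond):
    """MIG from Eq. (4): -sum_{i,j} p_{L,ij} * log2 p_{L,i->j}."""
    mask = p_ij > 0                       # observed transitions only
    return -np.sum(p_ij[mask] * np.log2(p_cond[mask]))

# Equivalently (Eq. 5), MIG = H_S(L+1) - H_S(L), since each overlapping (L+1)-word
# corresponds to one (i, j) pair counted in p_ij.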
In contrast, complexity metrics are helpful measures that capture the existence of internal patterns in the studied datasets (Pan et al., 2012). The effective measure of complexity (EMC), as defined by Grassberger (1986), is the least quantity of information that has to be amassed to deliver the best possible prediction of the next data element; it can be approximated and computed using Eq. (6):
\(EMC \approx (L+1)\,H_{S}(L) - L\,H_{S}(L+1)\) (6)
Alternatively, the EMC can also be evaluated using Eq. (7) as:
\(EMC \approx \sum_{i,j=1}^{2^{L}} p_{L,ij}\,\log_{2}\frac{p_{L,i \rightarrow j}^{L}}{p_{L,i}}\) (7)
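The sketch below follows Eq. (7) using the probability arrays from the earlier sketch (the explicit double loop is purely illustrative):

import numpy as np

def effective_measure_complexity(p_i, p_ij, p_cond, L):
    """EMC approximated via Eq. (7): sum_{i,j} p_{L,ij} * log2( p_{L,i->j}^L / p_{L,i} )."""
    emc = 0.0
    for i in range(len(p_i)):
        for j in range(len(p_i)):
            if p_ij[i, j] > 0:            # implies p_i[i] > 0 and p_cond[i, j] > 0
                emc += p_ij[i, j] * np.log2(p_cond[i, j] ** L / p_i[i])
    return emc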
Finally, the fluctuation complexity (\(\sigma_{\Gamma}^{2}\)) is one of the most important complexity measures since it captures the fluctuations that occur in a system, i.e. how the system moves from one pattern to another. The fluctuation complexity is therefore a measure of the changes of the net information gain over one or more time steps. Hence, data that exhibit a high level of fluctuation yield a larger fluctuation complexity (Bates and Shepard, 1993). The fluctuation complexity is estimated as:
\(\sigma_{\Gamma}^{2} = \sum_{i,j=1}^{2^{L}} p_{L,ij}\left(\log_{2}\frac{p_{L,i}}{p_{L,j}}\right)^{2}\) (8)
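A corresponding sketch of Eq. (8), using the state and joint probabilities from the earlier sketch (again illustrative):

import numpy as np

def fluctuation_complexity(p_i, p_ij):
    """Fluctuation complexity from Eq. (8): sum_{i,j} p_{L,ij} * (log2(p_{L,i} / p_{L,j}))^2."""
    sigma2 = 0.0
    for i in range(len(p_i)):
        for j in range(len(p_i)):
            if p_ij[i, j] > 0:            # both p_i[i] and p_i[j] are then non-zero
                sigma2 += p_ij[i, j] * np.log2(p_i[i] / p_i[j]) ** 2
    return sigma2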
Temporal discharge characteristics using information and complexity measures within low-frequency and high-frequency scales
Temporal discharge characterization by means of information and complexity theory was performed at different time domains using growing aggregation lengths (AL). To illustrate, consider a word length of two characters: in the case of AL = 1 (the basic approach), each hour of a studied streamflow record is substituted directly as one character to form part of a word (Fig. 2a). Alternatively, in the case of AL = 2, every two successive hours of each studied streamflow record are gathered, averaged, binarized and then substituted as one character to compose a word; for further clarification see Fig. 2b.
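A minimal sketch of this aggregation step (how an incomplete final block is treated is an assumption made here and is not specified above):

import numpy as np

def aggregate(hourly_q, AL):
    """Average every AL successive hourly values into one value (Fig. 2b).
    AL = 1 returns the original hourly record (the basic approach, Fig. 2a)."""
    q = np.asarray(hourly_q, dtype=float)
    n = (len(q) // AL) * AL               # drop a possibly incomplete final block
    return q[:n].reshape(-1, AL).mean(axis=1)

# e.g. symbols = binarize(aggregate(hourly_q, AL=2)); words are then formed as before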
In the present research, low-frequency and high-frequency indicate streamflow variations characterized by the information and complexity measures over long and short time scales, respectively. In fact, the availability of high-quality streamflow data (roughly six months) measured by means of the FAT close to the Ozekiyama gauging station (Fig. 1) offers a unique opportunity to examine the temporal variations of streamflow patterns over short periods. Hence, the high-frequency analyses concentrate on hourly discharge data from January 2016 to the end of June 2016 at the Awaya, Minamihatachiki, Miyoshi, Ozekiyama, and FAT stations, whereas the low-frequency analyses comprise hourly flow data from January 2002 to the end of December 2017 covering the Awaya, Minamihatachiki, Miyoshi, and Ozekiyama stations.
For both the low- and high-frequency analyses, we used words of four characters to describe the different states of the system patterns; thus, the number of potential patterns is \(2^{L} = 2^{4} = 16\) possible words.
An extension of the information and complexity measures to examine flood events
Herein, we provide an extension of the information and complexity theory aiming to assess the annual variation of flood events. In the symbolic strings method described above (Section 3.1), the median value (\(Q_{Median}\)) of each streamflow record was used as a threshold to map each streamflow value to 0, if it is equal to or less than the median, or to 1, if it is greater than the median. In this extension, the same procedures were performed, with two changes: i) the maximum daily discharge was taken and investigated instead of the hourly discharge, and ii) the threshold discharge values for the observed stations were changed as presented in Table 1; thus, if a maximum daily discharge value is equal to or greater than the threshold it is converted to 1, otherwise it is assigned 0. In the case of the flood assessment, the word length was set to 2 characters, and hence 4 possible words describe the different patterns of the system; additionally, one information metric and one complexity metric were used in these analyses, namely the metric entropy and the effective measure of complexity.
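A minimal sketch of this flood-oriented binarization (the function name and the reuse of the earlier word-probability sketch are illustrative assumptions; the station thresholds are those listed in Table 1):

import numpy as np

def flood_symbols(daily_max_q, threshold):
    """Binarize daily maximum discharge against a station-specific threshold (Table 1):
    1 if the daily maximum is equal to or greater than the threshold, 0 otherwise."""
    return (np.asarray(daily_max_q, dtype=float) >= threshold).astype(int)

# With L = 2 (words 00, 01, 10, 11), metric entropy and EMC follow from the sketches above:
# p_i, p_ij, p_cond = word_probabilities(flood_symbols(daily_max_q, threshold), L=2)
# me  = metric_entropy(p_i, L=2)
# emc = effective_measure_complexity(p_i, p_ij, p_cond, L=2)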