# KNN vs. Bluecat—Machine Learning vs. Classical Statistics

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Case Studies

^{2}. The observed data include the mean areal daily rainfall, the evapotranspiration, and the discharge at the basin exit. The period of the available data starts from 2 January 1992 and ends on 1 January 2014.

^{2}. The observed data include the mean areal hourly rainfall, the evapotranspiration, and the discharge at the basin exit. The available data cover the period from 3 June 1992 to 2 January 1997, with a gap from 1 January 1995 to 2 June 1995. The flow regime of Sieve River is intermittent: 4% of the observed streamflow values are zero.

#### 2.2. Estimate Uncertainty with KNN

$$f(\mathbf{KNN}(k, \mathbf{x})) \qquad (1)$$

where $\mathbf{x}$ is a vector (or a scalar) that defines the state of the model (the features, in machine learning terminology), $\mathbf{KNN}(k, \mathbf{x})$ returns the set of the $k$ observations that, according to KNN, are most closely related to $\mathbf{x}$, and $f: \mathbb{R}^k \to \mathbb{R}$ is a function that returns a value related to some statistical property of the set returned by $\mathbf{KNN}(k, \mathbf{x})$; in typical regression applications of KNN, this is the average. These quantities refer to the time instance $t$ of a simulation obtained by the hydrological model. The symbol $t$ is omitted from Equation (1) for the sake of simplicity (normally it would appear as either a subscript or a superscript).
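As an illustration of Equation (1), the following minimal Python sketch treats the 1D case, where the state is a single simulated discharge. The helper names (`knn_observations`, `empirical_quantile`, `knn_estimate`) and the toy calibration data are assumptions made for this example and are not taken from the study. The function `f` is passed in as an ordinary callable so that the average and a quantile can be compared.

```python
import math

def knn_observations(k, x, sim_calib, obs_calib):
    """Return the k observations whose paired simulated values are closest to x (1D case)."""
    order = sorted(range(len(sim_calib)), key=lambda i: abs(sim_calib[i] - x))
    return [obs_calib[i] for i in order[:k]]

def empirical_quantile(values, p):
    """Inverse empirical distribution: smallest sample value whose empirical CDF is >= p."""
    ordered = sorted(values)
    idx = max(0, math.ceil(p * len(ordered)) - 1)
    return ordered[idx]

def knn_estimate(k, x, sim_calib, obs_calib, f):
    """Apply f to the set returned by KNN, as in f(KNN(k, x))."""
    return f(knn_observations(k, x, sim_calib, obs_calib))

# Hypothetical calibration data: simulated discharges and the paired observations.
sim_calib = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
obs_calib = [1.2, 1.8, 3.5, 3.9, 5.5, 5.0, 7.4, 9.0]

# f as the average (typical KNN regression) and f as the median.
mean_est = knn_estimate(3, 4.1, sim_calib, obs_calib, lambda s: sum(s) / len(s))
median_est = knn_estimate(3, 4.1, sim_calib, obs_calib, lambda s: empirical_quantile(s, 0.5))
```

Replacing `0.5` with `0.1` or `0.9` in the last call would give lower and upper bounds from the same neighbour set.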

For the status vector $\mathbf{x}$, the following options were tested:

- 1D. One-dimensional. This is the simplest approach; it includes only the assessed discharge simulated by the hydrological model at time step $t$, $x = Q_t$. KNN returns the $k$ observations that correspond to the $k$ simulated discharges of the calibration period that are closest to $Q_t$.
- 2D Option 1. The vector elements are two successive simulated discharges, $\mathbf{x} = (Q_t, Q_{t-1})$. KNN returns the $k$ observations that correspond to the $k$ vectors of the calibration period that are closest, in 2D Euclidean space, to the vector $(Q_t, Q_{t-1})$.
- 2D Option 2. The vector elements are the discharge $Q_t$ and the change of the simulated discharge between $t - 1$ and $t$, $\mathbf{x} = (Q_t, Q_{t-1} - Q_t)$.
- 2D Option 3. The vector elements are the discharge $Q_t$ and a binary value, 0 if the discharge increases and 1 if it does not increase. This binary value can be obtained with the function $\varphi(\cdot) = \max(0, (\cdot)/|\cdot|)$, so that $\mathbf{x} = (Q_t, \varphi(Q_{t-1} - Q_t))$.

The elements of the vector $\mathbf{x}$ (i.e., the features) need to be scaled so that the Euclidean distance is equally sensitive to both dimensions. For this reason, z-score normalisation was employed [20]. To avoid data leakage [21], the normalisation parameters (i.e., the mean and standard deviation) were obtained from the training set only, and the normalisation was then applied to both sets.
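The scaling step can be sketched as follows, using features of the form $(Q_t, Q_{t-1} - Q_t)$ as in 2D Option 2. This is a minimal illustration under stated assumptions: the function names, the toy discharge series, and the use of the population standard deviation are choices made for this example, not details confirmed by the study.

```python
import math
from statistics import mean, pstdev

def build_features(q_sim):
    """Features (Q_t, Q_{t-1} - Q_t) for t = 1 .. n-1 (2D Option 2 style)."""
    return [(q_sim[t], q_sim[t - 1] - q_sim[t]) for t in range(1, len(q_sim))]

def fit_zscore(train_rows):
    """Mean and standard deviation of each feature, from the training set only."""
    return [(mean(col), pstdev(col)) for col in zip(*train_rows)]

def apply_zscore(rows, params):
    """Scale rows with previously fitted parameters (no refitting, so no leakage)."""
    return [tuple((v - mu) / sd for v, (mu, sd) in zip(row, params)) for row in rows]

def nearest_indices(k, x, train_rows):
    """Indices of the k training rows closest to x in Euclidean distance."""
    order = sorted(range(len(train_rows)), key=lambda i: math.dist(x, train_rows[i]))
    return order[:k]

# Hypothetical simulated discharges for the training and validation periods.
q_train = [10.0, 14.0, 20.0, 17.0, 13.0, 11.0, 10.0]
q_valid = [12.0, 18.0, 15.0]

train_feats = build_features(q_train)
params = fit_zscore(train_feats)          # fitted on the training set only
train_z = apply_zscore(train_feats, params)
valid_z = apply_zscore(build_features(q_valid), params)  # same parameters reused

neighbours = nearest_indices(2, valid_z[0], train_z)
```

Fitting the parameters on the combined series instead would let validation data influence the scaling, which is the leakage the text warns against.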

#### 2.3. Estimate Uncertainty with Bluecat

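$F_{q|Q}(q|Q)$, which is defined by the following formula.

$$F_{q|Q}(q|Q) := P\{\underline{q} \le q \mid \underline{Q} = Q\}$$

which is approximated by

$$F_{q|Q}(q|Q) \approx P\{\underline{q} \le q \mid Q - \Delta Q_1 \le \underline{Q} \le Q + \Delta Q_2\}$$

where underlined symbols denote random variables, and $\Delta Q_1$ and $\Delta Q_2$ define a neighbourhood of $Q$ such that the intervals above and below $Q$ contain appropriate numbers of simulation values, say $2m + 1$ in total if the value closest to $Q$ is selected together with $m$ values above and $m$ values below it.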
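The neighbourhood selection described above can be sketched in a few lines of Python. This is an illustrative reading of the text rather than the Bluecat implementation: the function names and toy data are hypothetical, and the handling of the ends of the sorted sample (the window is simply truncated) is an assumption, since that detail is not given here.

```python
def symmetric_neighbourhood(Q, m, sim_calib, obs_calib):
    """Observations paired with the simulation closest to Q plus up to m
    simulations below and m above it, taken in rank order."""
    order = sorted(range(len(sim_calib)), key=lambda i: sim_calib[i])
    centre = min(range(len(order)), key=lambda r: abs(sim_calib[order[r]] - Q))
    lo = max(0, centre - m)                    # assumption: truncate at the lower end
    hi = min(len(order), centre + m + 1)       # assumption: truncate at the upper end
    return [obs_calib[i] for i in order[lo:hi]]

def conditional_cdf(q, Q, m, sim_calib, obs_calib):
    """Empirical estimate of P{obs <= q | simulation near Q} from the neighbourhood."""
    sample = symmetric_neighbourhood(Q, m, sim_calib, obs_calib)
    count = sum(1 for o in sample if o <= q)   # observations not exceeding q
    return count / len(sample)

# Hypothetical calibration data: simulated discharges and the paired observations.
sim_calib = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
obs_calib = [1.2, 1.8, 3.5, 3.9, 5.5, 5.0, 7.4, 9.0]

sample = symmetric_neighbourhood(4.1, 2, sim_calib, obs_calib)
prob = conditional_cdf(4.0, 4.1, 2, sim_calib, obs_calib)
```

With `m = 2`, the window holds $2m + 1 = 5$ values away from the ends of the sample, and the estimate is the fraction of those five observations that do not exceed `q`.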

## 3. Results

#### 3.1. Case Study—Arno

#### 3.2. Case Study—Sieve

^{3}/s. The “Low” line of Bluecat tends to be higher at high flows. The “Median” line of Bluecat seems to deviate from the “True” line at high flows.

## 4. Discussion

$(Q_t, Q_{t-1})$, where $Q_t = 99$ and $Q_{t-1} = 100$. In this case, the vector (100, 99) is closer to the assessed status vector than the vector (100, 102). Yet the former corresponds to a rising part of the hydrograph, whereas the latter and the status vector correspond to a recession. The distinction between rising and falling limbs of the hydrograph is guaranteed with Option 2 and Option 3. However, these options may overemphasise this distinction. For example, Figure 8 displays the plots of the elements (a.k.a. features) of the status vector $\mathbf{x}$ for the three 2D options (normalised values in Options 2 and 3) for the Arno River case study. It is evident that, in Option 2, two status vectors corresponding to successive simulated discharges may have a very large Euclidean distance because the difference $Q_{t-1} - Q_t$ fluctuates strongly when passing from the rising to the falling limb (Figure 8b before and after 5080). This is probably the reason the boundaries of the confidence interval and the median value in Figure A2b exhibit intense fluctuations. This effect is mitigated in Option 3. However, according to Figure A1 and Figure A2, neither option appears to offer any advantage over the simplest 1D option. It should be noted that this may be happening simply because the error of the hydrological model used in this study, for these two specific case studies, is similar in the rising and falling parts of the hydrograph. If this is the case, then the uncertainty depends only on the model output and not on its state. Therefore, the model output alone can be used in a data-driven method to obtain estimations of its uncertainty.
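The numerical example with the status vector (99, 100) can be checked directly. The short sketch below computes the Euclidean distances under Option 1 and then re-expresses the same three vectors with Option 2 features. Normalisation is omitted for brevity, which is a simplification of the procedure described in the methods; the variable names are chosen for this illustration only.

```python
import math

# Vectors are (Q_t, Q_{t-1}), as in 2D Option 1.
status = (99.0, 100.0)    # assessed vector: recession (99 < 100)
rising = (100.0, 99.0)    # candidate on a rising limb
falling = (100.0, 102.0)  # candidate on a recession

d1_rising = math.dist(status, rising)    # sqrt(1 + 1)
d1_falling = math.dist(status, falling)  # sqrt(1 + 4)

def option2(v):
    """Convert (Q_t, Q_{t-1}) to Option 2 features (Q_t, Q_{t-1} - Q_t)."""
    return (v[0], v[1] - v[0])

d2_rising = math.dist(option2(status), option2(rising))    # sqrt(1 + 4)
d2_falling = math.dist(option2(status), option2(falling))  # sqrt(1 + 1)
```

Under Option 1 the rising-limb vector is the nearer neighbour, whereas under Option 2 the ordering is reversed and the recession vector is nearer, which is the behaviour the paragraph describes.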

$$F_{q|Q}(q|Q) \approx \frac{(m_Q/n_q)\,(n_q/n)}{(2m + 1)/n} = \frac{m_Q}{2m + 1} \qquad (5)$$

where $m_Q$ is the number of simulation values within the range $(Q - \Delta Q_1, Q + \Delta Q_2)$ for which the corresponding observations are less than $q$, $n_q$ is the number of observations that are less than $q$, $n$ is the total number of observations, and $2m + 1$ is the predetermined total number of simulation values within the range $(Q - \Delta Q_1, Q + \Delta Q_2)$.

To obtain $m_Q$, the observations that correspond to the $2m + 1$ simulations within the range $(Q - \Delta Q_1, Q + \Delta Q_2)$ need to be identified. These observations are more or less the output of $\mathbf{KNN}(k, \mathbf{x})$ in Equation (1), with only one difference: Equation (1) does not take extra care to ensure an equal number of values above and below $Q$. Furthermore, the right side of Equation (5) is the empirical distribution of the $2m + 1$ observations. It is exactly the inverse of this empirical distribution that is used as $f$ in Equation (1) to obtain the 90th and 10th percentiles and the median value.

$F_{q|Q}(q|Q)$ at high and low $Q$ by KNN because of an unbalanced number of values below and above $Q$. That is, the higher the assessed value $Q$, the fewer the simulated values in the calibration period that are higher than $Q$. As a result, for very high $Q$ values, KNN will return mostly observations corresponding to neighbours of $Q$ that are lower than $Q$. This bias is the reason that the upper bound of the 80% confidence interval in Figure 2a coincides with the hydrological model simulation at the two peaks before 1 April 2013. It is also the reason for the differences between Figure 3a,b.
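The imbalance described above is easy to reproduce on synthetic data. In the sketch below, the calibration simulations are assumed to be the integers 1 to 100, and the helper `knn_balance` (a name introduced for this example) counts how many of the $k$ nearest simulated values lie below and above the assessed value.

```python
def knn_balance(k, Q, sim_calib):
    """Count how many of the k simulated values nearest to Q lie below and above Q."""
    nearest = sorted(sim_calib, key=lambda s: abs(s - Q))[:k]
    below = sum(1 for s in nearest if s < Q)
    above = sum(1 for s in nearest if s > Q)
    return below, above

# Hypothetical calibration simulations: 1, 2, ..., 100.
sim_calib = [float(v) for v in range(1, 101)]

mid_range = knn_balance(10, 50.5, sim_calib)   # well inside the sampled range
high_range = knn_balance(10, 98.5, sim_calib)  # close to the largest simulated value
```

In the middle of the range the neighbours split evenly, while near the top of the range most of them come from below the assessed value, so the returned observations mostly reflect conditions at lower discharges.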

^{3}/s. A rough representation of this histogram can be obtained from the values of the upper and lower confidence interval bounds and the median value that correspond to 275 m$^3$/s (see the vertical black and dotted lines in Figure 9). Therefore, the lines “High”, “Median”, and “Low” provide a rough representation of the histograms (or the graphical representation of the probability density function of $F_{q|Q}(q|Q)$) of all assessed $Q$ values.

$F_{q|Q}(q|Q)$ for $Q$ values beyond the available information, is to extrapolate them with linear regression. The application of this simple approach for the two case studies is demonstrated in Appendix B.
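One possible reading of this extrapolation step is sketched below: a straight line is fitted by ordinary least squares to pairs of simulated discharge and an estimated bound, and the line is then evaluated at a discharge beyond the fitted range. Which points enter the regression is not specified in this section, so the choice made here, together with the function name `fit_line` and the numbers, is purely hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical pairs: simulated discharge and the upper bound estimated within range.
q_sim = [100.0, 150.0, 200.0, 250.0]
upper = [130.0, 190.0, 250.0, 310.0]

slope, intercept = fit_line(q_sim, upper)

# Evaluate the fitted line at a simulated discharge larger than any in q_sim.
upper_at_300 = slope * 300.0 + intercept
```

The same fit could be repeated for the lower bound and the median, giving three lines that extend the estimates to discharges not present in the calibration data.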

## 5. Conclusions

- The machine learning method is more flexible than the statistical method, which allows more complex sampling schemes at higher dimensions to be used (e.g., model simulation values from multiple time steps). This may improve the reliability of the estimated uncertainty in some cases. However, the application in the two case studies did not reveal any advantage over the simplest approach (1D sampling, only the discharge). This finding cannot be generalised since it depends on the performance of the selected hydrological model in each specific case study. Nevertheless, it appears that the simplest approach successfully captures most (if not all) of the characteristics of the uncertainty.
- Machine learning is usually considered a black-box approach with some abstract/intuitive understanding of its functionality. However, in some applications, a close inspection can reveal similarities, or even equivalence, with rigorous mathematical approaches. Identifying where the algorithm underlying a machine learning method deviates from the rigorous approach makes it possible to detect the conditions under which the machine learning model may perform poorly and thus increases its credibility.
- A very simple approach based on linear regression was employed to estimate the statistical structure of the assessed hydrological model uncertainty at conditions never met in the available data. This approach was tested in the two case studies and was found to perform satisfactorily.

## Appendix A

**Figure A1.** Scatter plot of the KNN application in the Arno River case study during the validation period: (**a**) 2D Option 1; (**b**) 2D Option 2; (**c**) 2D Option 3.

**Figure A2.** Scatter plot of the KNN application in the Sieve River case study during the validation period: (**a**) 2D Option 1; (**b**) 2D Option 2; (**c**) 2D Option 3.

## Appendix B

^{3}/s for the Arno River and Sieve River case studies, respectively. The results when applying the extrapolation method are displayed in the following figures.

**Figure A3.** Discharge of Arno River, 100 daily time steps of the validation period starting from 1 January 2013: (**a**) KNN applied to the trimmed training data set; (**b**) KNN with linear extrapolation applied to the trimmed training data set. “Det. Model” is the simulation with HyMod.

**Figure A4.** Scatter plot of the Arno River case study: (**a**) KNN applied to the trimmed training data set; (**b**) KNN with linear extrapolation applied to the trimmed training data set.

**Figure A5.** Discharge of Sieve River, 150 hourly time steps of the validation period starting from 5 January 1996: (**a**) KNN applied to the trimmed training data set; (**b**) KNN with linear extrapolation applied to the trimmed training data set. “Det. Model” is the simulation with HyMod.

**Figure A6.** Scatter plot of the Sieve River case study: (**a**) KNN applied to the trimmed training data set; (**b**) KNN with linear extrapolation applied to the trimmed training data set.

## References

1. Rosenblatt, F. The Perceptron, a Perceiving and Recognizing Automaton Project Para; Cornell Aeronautical Laboratory, Inc.: Buffalo, NY, USA, 1957.
2. Minsky, M.; Papert, S. Perceptrons: An Introduction to Computational Geometry; MIT Press: Cambridge, MA, USA, 1969.
3. Rumelhart, D.; Hinton, G.; Williams, R. Learning representations by back-propagating errors. Nature **1986**, 323, 533–536.
4. Shen, C.; Laloy, E.; Elshorbagy, A.; Albert, A.; Bales, J.; Chang, F.; Ganguly, S.; Hsu, K.; Kifer, D.; Fang, Z.; et al. HESS Opinions: Incubating deep-learning-powered hydrologic science advances as a community. Hydrol. Earth Syst. Sci. **2018**, 22, 5639–5656.
5. Shen, C. A Transdisciplinary Review of Deep Learning Research and Its Relevance for Water Resources Scientists. Water Resour. Res. **2018**, 54, 8558–8593.
6. Rozos, E.; Dimitriadis, P.; Mazi, K.; Koussis, A.D. A Multilayer Perceptron Model for Stochastic Synthesis. Hydrology **2021**, 8, 67.
7. Rozos, E.; Dimitriadis, P.; Bellos, V. Machine Learning in Assessing the Performance of Hydrological Models. Hydrology **2022**, 9, 5.
8. Sikorska-Senoner, A.; Quilty, J. A novel ensemble-based conceptual-data-driven approach for improved streamflow simulations. Environ. Model. Softw. **2021**, 143, 105094.
9. Sikorska, A.; Montanari, A.; Koutsoyiannis, D. Estimating the Uncertainty of Hydrological Predictions through Data-Driven Resampling Techniques. J. Hydrol. Eng. **2015**, 20, A4014009.
10. Solomatine, D.P.; Shrestha, D.L. A novel method to estimate model uncertainty using machine learning techniques. Water Resour. Res. **2009**, 45, W00B11.
11. Karlsson, M.; Yakowitz, S. Nearest-neighbor methods for nonparametric rainfall-runoff forecasting. Water Resour. Res. **1987**, 23, 1300–1308.
12. Koutsoyiannis, D.; Montanari, A. Bluecat: A Local Uncertainty Estimator for Deterministic Simulations and Predictions. Water Resour. Res. **2022**, 58, e2021WR031215.
13. Ehteram, M.; Mousavi, S.; Karami, H.; Farzin, S.; Singh, V.; Chau, K.; El-Shafie, A. Reservoir operation based on evolutionary algorithms and multi-criteria decision-making under climate change and uncertainty. J. Hydroinformatics **2018**, 20, 332–355.
14. Sharafati, A.; Pezeshki, E. A strategy to assess the uncertainty of a climate change impact on extreme hydrological events in the semi-arid Dehbar catchment in Iran. Theor. Appl. Climatol. **2019**, 139, 389–402.
15. Zhao, C.; Huang, Y.; Li, Z.; Chen, M. Drought Monitoring of Southwestern China Using Insufficient GRACE Data for the Long-Term Mean Reference Frame under Global Change. J. Clim. **2018**, 31, 6897–6911.
16. Boyle, D. Multicriteria Calibration of Hydrological Models. Doctoral Dissertation, University of Arizona, Tucson, AZ, USA, 2000, unpublished.
17. Montanari, A. Large sample behaviors of the generalized likelihood uncertainty estimation (GLUE) in assessing the uncertainty of rainfall-runoff simulations. Water Resour. Res. **2005**, 41.
18. K-Nearest Neighbor(KNN) Algorithm for Machine Learning, Javatpoint. Available online: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning (accessed on 12 May 2022).
19. Russell, S.; Norvig, P. Artificial Intelligence; Prentice-Hall: Upper Saddle River, NJ, USA, 2010.
20. Jordan, J. Normalizing Your Data (Specifically, Input and Batch Normalization). 2021. Available online: https://www.jeremyjordan.me/batch-normalization/ (accessed on 2 February 2021).
21. Preventing Data Leakage in Your Machine Learning Model. Available online: https://towardsdatascience.com/preventing-data-leakage-in-your-machine-learning-model-9ae54b3cd1fb (accessed on 1 May 2022).
22. Documentation mlpack-3-4-2. Available online: https://www.mlpack.org/doc/stable/cli_documentation.html#knn (accessed on 4 May 2022).
23. Koutsoyiannis, D.; Montanari, A. Climate Extrapolations in Hydrology: The Expanded Bluecat Methodology. Hydrology **2022**, 9, 86.

**Figure 2.** Discharge of Arno River, 100 days of the validation period starting from 1 January 2013: (**a**) KNN; (**b**) Bluecat. “Det. Model” is the simulation with HyMod.

**Figure 4.** CPP plots of the Arno River case study: (**a**) KNN; (**b**) Bluecat. “Det. Model” is the simulation with HyMod.

**Figure 5.** Discharge of Sieve River, 150 hourly time steps of the validation period starting from 5 January 1996: (**a**) KNN; (**b**) Bluecat. “Det. Model” is the simulation with HyMod.

**Figure 7.** CPP plots of the Sieve River case study: (**a**) KNN; (**b**) Bluecat. “Det. Model” is the simulation with HyMod.

**Figure 9.** Visual explanation of the confidence interval bounds and median line for a discharge value of the hydrological simulation during the validation period equal to 275 m$^3$/s (case study of Sieve River). The histogram on the right organises into classes the observed values that correspond to the 200 simulation values closest to 275 m$^3$/s.

**Table 1.** Mean value of the time series of observations, of the HyMod simulation, and of the Bluecat Median and KNN Median.

| | Observations (m$^3$/s) | HyMod (m$^3$/s) | Bluecat Median (m$^3$/s) | KNN Median (m$^3$/s) |
|---|---|---|---|---|
| Arno River | 10.98 | 16.24 | 12.24 | 11.32 |
| Sieve River | 12.56 | 17.79 | 11.78 | 11.58 |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rozos, E.; Koutsoyiannis, D.; Montanari, A.
KNN vs. Bluecat—Machine Learning vs. Classical Statistics. *Hydrology* **2022**, *9*, 101.
https://doi.org/10.3390/hydrology9060101
