statstools¶
This module implements statistical functionalities frequently used in hydrological modelling.
Module statstools
implements the following members:
SimObs
A named tuple containing one array of simulated and one array of observed values.
filter_series()
Filter time series for the given date ranges or months.
prepare_arrays()
Prepare and return twonumpy
arrays based on the given arguments.
Criterion
Callback protocol for efficiency criteria likense()
.
rmse()
Calculate the root-mean-square error.
nse()
Calculate the efficiency criteria after Nash & Sutcliffe.
nse_log()
Calculate the efficiency criteria after Nash & Sutcliffe for logarithmic values.
corr2()
Calculate the coefficient of determination via the square of the coefficient of correlation according to Bravais-Pearson.
kge()
Calculate the Kling-Gupta efficiency according to Kling et al. (2012).
bias_abs()
Calculate the absolute difference between the means of the simulated and the observed values.
bias_rel()
Calculate the relative difference between the means of the simulated and the observed values.
std_ratio()
Calculate the ratio between the standard deviation of the simulated and the observed values.
var_ratio()
Calculate the ratio between the variation coefficients of the simulated and the observed values.
corr()
Calculate the product-moment correlation coefficient after Pearson.
hsepd_pdf()
Calculate the probability densities based on the heteroskedastic skewed exponential power distribution.
hsepd_manual()
Calculate the mean of the logarithmic probability densities of the heteroskedastic skewed exponential power distribution.
hsepd()
Calculate the mean of the logarithmic probability densities of the heteroskedastic skewed exponential power distribution.
calc_mean_time()
Return the weighted mean of the given time points.
calc_mean_time_deviation()
Return the weighted deviation of the given timepoints from their mean time.
calc_weights()
Calculate “statistical” weights for all given nodes based on the number of observations within the evaluation period.
SummaryRow
Abstract base class forSummaryRowSimple
andSummaryRowWeighted
.
SummaryRowSimple
Helper to define additional “summary rows” in evaluation tables based on simple (non-weighted) averages.
SummaryRowWeighted
Helper to define additional “summary rows” in evaluation tables based on weighted averages.
print_evaluationtable()
Print a table containing the results of the given evaluation criteria for the givenNode
objects.
- class hydpy.auxs.statstools.SimObs(sim: ndarray[Any, dtype[float64]], obs: ndarray[Any, dtype[float64]])[source]¶
Bases:
NamedTuple
A named tuple containing one array of simulated and one array of observed values.
- hydpy.auxs.statstools.filter_series(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, date_ranges: Iterable[tuple[Date | datetime | str, Date | datetime | str]] | None = None, months: Iterable[int] | None = None) SimObs [source]¶
Filter time series for the given date ranges or months.
Often, we want to apply objective functions like
nse()
on a subset of the available simulated and observed values. The functionfilter_series()
helps to extract the relevant data either by data ranges or by months. Common examples are to pass a single date range to ignore the first non-optimal values of a warm-up period, to pass a set of date ranges to focus on certain events or to pass a set of months to perform a seasonal analysis.To show how
filter_series()
works, we prepare a daily initialisation time grid spanning two hydrological years:>>> from hydpy import filter_series, pub, Node >>> pub.timegrids = "2001-11-01", "2003-11-01", "1d"
Next, we prepare a
Node
object and assign some constantly increasing and decreasing values to its simulation and the observation series, respectively:>>> import numpy >>> node = Node("test") >>> node.prepare_allseries() >>> node.sequences.sim.series = numpy.arange(1, 2*365+1) >>> node.sequences.obs.series = numpy.arange(2*365, 0, -1)
First, we select data of arbitrary sub-periods via the data_ranges argument. Each data range consists of the start-point and the end-point of a sub-period. Here, we choose all values that belong to 31 October or 1 November (note that unsorted data ranges are acceptable):
>>> date_ranges = [("2001-11-01", "2001-11-02"), ... ("2002-10-31", "2002-11-02"), ... ("2003-10-31", "2003-11-01")] >>> results = filter_series(node=node, date_ranges=date_ranges)
filter_series()
returns the data within index-sortedSeries
objects (note that the index addresses the left boundary of each time step):>>> results.sim 2001-11-01 1.0 2002-10-31 365.0 2002-11-01 366.0 2003-10-31 730.0 Name: sim... >>> results.obs 2001-11-01 730.0 2002-10-31 366.0 2002-11-01 365.0 2003-10-31 1.0 Name: obs...
To help avoiding possible hard-to-find errors,
filter_series()
performs the following checks:>>> date_ranges = [("2001-10-31", "2003-11-01")] >>> filter_series(node=node, date_ranges=date_ranges) Traceback (most recent call last): ... ValueError: While trying to filter the given series, the following error occurred: The given date (2001-10-31 00:00:00) is before the first date of the initialisation period (2001-11-01 00:00:00).
>>> date_ranges = [("2001-11-01", "2003-11-02")] >>> filter_series(node=node, date_ranges=date_ranges) Traceback (most recent call last): ... ValueError: While trying to filter the given series, the following error occurred: The given date (2003-11-02 00:00:00) is behind the last date of the initialisation period (2003-11-01 00:00:00).
>>> date_ranges = [("2001-11-02", "2001-11-02")] >>> filter_series(node=node, date_ranges=date_ranges) Traceback (most recent call last): ... ValueError: While trying to filter the given series, the following error occurred: The given first date `2001-11-02 00:00:00` is not before than the given last date `2001-11-02 00:00:00`.
Note that function
filter_series()
does not remove any duplicates:>>> date_ranges = [("2001-11-01", "2001-11-05"), ... ("2001-11-01", "2001-11-02"), ... ("2001-11-04", "2001-11-06")] >>> sim = filter_series(node=node, date_ranges=date_ranges).sim >>> sim 2001-11-01 1.0 2001-11-01 1.0 2001-11-02 2.0 2001-11-03 3.0 2001-11-04 4.0 2001-11-04 4.0 2001-11-05 5.0 Name: sim...
Instead of date ranges, one can specify months via integer numbers. We begin with selecting October (10) and November (11) individually:
>>> sim = filter_series(node=node, months=[11]).sim >>> len(sim) 60 >>> sim 2001-11-01 1.0 2001-11-02 2.0 ... 2002-11-29 394.0 2002-11-30 395.0 Name: sim...
>>> sim = filter_series(node=node, months=[10]).sim >>> len(sim) 62 >>> sim 2002-10-01 335.0 2002-10-02 336.0 ... 2003-10-30 729.0 2003-10-31 730.0 Name: sim...
One can select multiple months, which neither need to be sorted nor consecutive:
>>> sim = filter_series(node=node, months=[4, 1]).sim >>> len(sim) 122 >>> sim 2002-01-01 62.0 2002-01-02 63.0 ... 2003-04-29 545.0 2003-04-30 546.0 Name: sim...
Note that you are also free to either pass the sim and obs series directly instead of a node (see function
prepare_arrays()
for further information):>>> xs = node.sequences.sim.series >>> ys = node.sequences.obs.series >>> filter_series(sim=xs, obs=ys, months=[4, 1]).sim 2002-01-01 62.0 2002-01-02 63.0 ... 2003-04-29 545.0 2003-04-30 546.0 Name: sim...
Missing or double information for arguments date_ranges and months results in the following error messages:
>>> filter_series(node=node) Traceback (most recent call last): ... ValueError: While trying to filter the given series, the following error occurred: You need to define either the `date_ranges` or `months` argument, but none of them is given.
>>> filter_series(node=node, date_ranges=[], months=[]) Traceback (most recent call last): ... ValueError: While trying to filter the given series, the following error occurred: You need to define either the `date_ranges` or `months` argument, but both of them are given.
- hydpy.auxs.statstools.prepare_arrays(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) SimObs [source]¶
Prepare and return two
numpy
arrays based on the given arguments.Note that many functions provided by module
statstools
apply functionprepare_arrays()
internally (e.g.nse()
). But you can also use it manually, as shown in the following examples.Function
prepare_arrays()
can extract time series data fromNode
objects. To set up an example for this, we define an initialisation period and prepare aNode
object:>>> from hydpy import pub, Node, round_, nan >>> pub.timegrids = "01.01.2000", "07.01.2000", "1d" >>> node = Node("test")
Next, we assign some values to the simulation and the observation sequences of the node:
>>> node.prepare_allseries() >>> with pub.options.checkseries(False): ... node.sequences.sim.series = 1.0, nan, nan, nan, 2.0, 3.0 ... node.sequences.obs.series = 4.0, 5.0, nan, nan, nan, 6.0
Now we can pass the node object to function
prepare_arrays()
and get the (unmodified) time series data:>>> from hydpy import prepare_arrays >>> arrays = prepare_arrays(node=node) >>> round_(arrays.sim) 1.0, nan, nan, nan, 2.0, 3.0 >>> round_(arrays.obs) 4.0, 5.0, nan, nan, nan, 6.0
Alternatively, we can pass directly any iterable (e.g.
list
andtuple
objects) containing the simulated and observed data:>>> arrays = prepare_arrays(sim=list(node.sequences.sim.series), ... obs=tuple(node.sequences.obs.series)) >>> round_(arrays.sim) 1.0, nan, nan, nan, 2.0, 3.0 >>> round_(arrays.obs) 4.0, 5.0, nan, nan, nan, 6.0
The optional skip_nan flag allows skipping all values, which are no numbers. Note that
prepare_arrays()
returns only those pairs of simulated and observed values that do not contain any nan value:>>> arrays = prepare_arrays(node=node, skip_nan=True) >>> round_(arrays.sim) 1.0, 3.0 >>> round_(arrays.obs) 4.0, 6.0
If you are interested in analysing a sub-period, adapt the global
eval_
Timegrid
beforehand. When passing aNode
object, functionprepare_arrays()
then returns the data of the current evaluation sub-period only:>>> pub.timegrids.eval_.dates = "02.01.2000", "06.01.2000" >>> arrays = prepare_arrays(node=node) >>> round_(arrays.sim) nan, nan, nan, 2.0 >>> round_(arrays.obs) 5.0, nan, nan, nan
Suppose one instead passes the simulation and observation time series directly (which possibly fit the evaluation period already). In that case, function
prepare_arrays()
ignores the currenteval_
Timegrid
by default:>>> arrays = prepare_arrays(sim=arrays.sim, obs=arrays.obs) >>> round_(arrays.sim) nan, nan, nan, 2.0 >>> round_(arrays.obs) 5.0, nan, nan, nan
Use the subperiod argument to deviate from the default behaviour:
>>> arrays = prepare_arrays(node=node, subperiod=False) >>> round_(arrays.sim) 1.0, nan, nan, nan, 2.0, 3.0 >>> round_(arrays.obs) 4.0, 5.0, nan, nan, nan, 6.0
>>> arrays = prepare_arrays(sim=arrays.sim, obs=arrays.obs, subperiod=True) >>> round_(arrays.sim) nan, nan, nan, 2.0 >>> round_(arrays.obs) 5.0, nan, nan, nan
The final examples show the error messages returned in case of invalid combinations of input arguments:
>>> prepare_arrays() Traceback (most recent call last): ... ValueError: Neither a `Node` object is passed to argument `node` nor are arrays passed to arguments `sim` and `obs`.
>>> prepare_arrays(sim=node.sequences.sim.series, node=node) Traceback (most recent call last): ... ValueError: Values are passed to both arguments `sim` and `node`, which is not allowed.
>>> prepare_arrays(obs=node.sequences.obs.series, node=node) Traceback (most recent call last): ... ValueError: Values are passed to both arguments `obs` and `node`, which is not allowed.
>>> prepare_arrays(sim=node.sequences.sim.series) Traceback (most recent call last): ... ValueError: A value is passed to argument `sim` but no value is passed to argument `obs`.
>>> prepare_arrays(obs=node.sequences.obs.series) Traceback (most recent call last): ... ValueError: A value is passed to argument `obs` but no value is passed to argument `sim`.
- class hydpy.auxs.statstools.Criterion(*args, **kwargs)[source]¶
Bases:
Protocol
Callback protocol for efficiency criteria like
nse()
.
- hydpy.auxs.statstools.rmse(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the root-mean-square error.
>>> from hydpy import rmse, round_ >>> round_(rmse(sim=[1.0, 2.0, 3.0], obs=[1.0, 2.0, 3.0])) 0.0 >>> round_(rmse(sim=[1.0, 2.0, 3.0], obs=[0.5, 2.0, 4.5])) 0.912871
See the documentation on function
prepare_arrays()
for some additional instructions for usingrmse()
.
- hydpy.auxs.statstools.nse(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the efficiency criteria after Nash & Sutcliffe.
If the simulated values predict the observed values and the average observed value (regarding the mean square error), the NSE value is zero:
>>> from hydpy import nse, round_ >>> round_(nse(sim=[2.0, 2.0, 2.0], obs=[1.0, 2.0, 3.0])) 0.0 >>> round_(nse(sim=[0.0, 2.0, 4.0], obs=[1.0, 2.0, 3.0])) 0.0
For worse and better agreement, the NSE is negative or positive, respectively:
>>> round_(nse(sim=[3.0, 2.0, 1.0], obs=[1.0, 2.0, 3.0])) -3.0 >>> round_(nse(sim=[1.0, 2.0, 2.0], obs=[1.0, 2.0, 3.0])) 0.5
The highest possible value is one:
>>> round_(nse(sim=[1.0, 2.0, 3.0], obs=[1.0, 2.0, 3.0])) 1.0
See the documentation on function
prepare_arrays()
for some additional instructions for usingnse()
.
- hydpy.auxs.statstools.nse_log(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the efficiency criteria after Nash & Sutcliffe for logarithmic values.
The following calculations repeat the ones of the documentation on function
nse()
but with exponentiated values. Hence, the results are similar or, as in the first and the last example, even identical:>>> from hydpy import nse_log, round_ >>> from numpy import exp >>> round_(nse_log(sim=exp([2.0, 2.0, 2.0]), obs=exp([1.0, 2.0, 3.0]))) 0.0 >>> round_(nse_log(sim=exp([0.0, 2.0, 4.0]), obs=exp([1.0, 2.0, 3.0]))) 0.0
>>> round_(nse(sim=exp([3.0, 2.0, 1.0]), obs=exp([1.0, 2.0, 3.0]))) -2.734185 >>> round_(nse(sim=exp([1.0, 2.0, 2.0]), obs=exp([1.0, 2.0, 3.0]))) 0.002139
>>> round_(nse(sim=exp([1.0, 2.0, 3.0]), obs=exp([1.0, 2.0, 3.0]))) 1.0
See the documentation on function
prepare_arrays()
for some additional instructions for usingnse_log()
.
- hydpy.auxs.statstools.corr2(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the coefficient of determination via the square of the coefficient of correlation according to Bravais-Pearson.
For perfect positive or negative correlation,
corr2()
returns 1:>>> from hydpy import corr2, round_ >>> round_(corr2(sim=[1.0, 2.0, 3.0], obs=[1.0, 2.0, 3.0])) 1.0 >>> round_(corr2(sim=[3.0, 2.0, 1.0], obs=[1.0, 2.0, 3.0])) 1.0
If there is no correlation at all,
corr2()
returns 0:>>> round_(corr2(sim=[1.0, 2.0, 3.0], obs=[1.0, 2.0, 1.0])) 0.0
An intermediate example:
>>> round_(corr2(sim=[2.0, 2.0, 1.0], obs=[1.0, 2.0, 3.0])) 0.75
Take care if there is no variation in one of the data series. Then the correlation coefficient is not defined, and
corr2()
returnsnan
:>>> round_(corr2(sim=[2.0, 2.0, 2.0], obs=[2.0, 2.0, 3.0])) nan
See the documentation on function
prepare_arrays()
for some additional instructions for usingcorr2()
.
- hydpy.auxs.statstools.kge(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the Kling-Gupta efficiency according to Kling et al. (2012).
For a perfect fit,
kge()
returns one:>>> from hydpy import kge, round_ >>> round_(kge(sim=[1.0, 2.0, 3.0], obs=[1.0, 2.0, 3.0])) 1.0
In each of the following three examples, only one of the KGE components deviates from one:
>>> round_(kge(sim=[3.0, 2.0, 1.0], obs=[1.0, 2.0, 3.0])) # imperfect correlation -1.0 >>> round_(kge(sim=[3.0, 2.0, 1.0], obs=[6.0, 4.0, 2.0])) # imperfect average 0.5 >>> round_(kge(sim=[3.0, 2.0, 1.0], obs=[4.0, 2.0, 0.0])) # imperfect variation 0.5
Finally, a mixed example, where all components deviate from one:
>>> round_(kge(sim=[3.0, 2.0, 1.0], obs=[2.0, 2.0, 1.0])) 0.495489
See the documentation on function
prepare_arrays()
for some additional instructions for usingkge()
.
- hydpy.auxs.statstools.bias_abs(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the absolute difference between the means of the simulated and the observed values.
>>> from hydpy import bias_abs, round_ >>> round_(bias_abs(sim=[2.0, 2.0, 2.0], obs=[1.0, 2.0, 3.0])) 0.0 >>> round_(bias_abs(sim=[5.0, 2.0, 2.0], obs=[1.0, 2.0, 3.0])) 1.0 >>> round_(bias_abs(sim=[1.0, 1.0, 1.0], obs=[1.0, 2.0, 3.0])) -1.0
See the documentation on function
prepare_arrays()
for some additional instructions for usingbias_abs()
.
- hydpy.auxs.statstools.bias_rel(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the relative difference between the means of the simulated and the observed values.
>>> from hydpy import bias_rel, round_ >>> round_(bias_rel(sim=[2.0, 2.0, 2.0], obs=[1.0, 2.0, 3.0])) 0.0 >>> round_(bias_rel(sim=[5.0, 2.0, 2.0], obs=[1.0, 2.0, 3.0])) 0.5 >>> round_(bias_rel(sim=[1.0, 1.0, 1.0], obs=[1.0, 2.0, 3.0])) -0.5
See the documentation on function
prepare_arrays()
for some additional instructions for usingbias_rel()
.
- hydpy.auxs.statstools.std_ratio(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the ratio between the standard deviation of the simulated and the observed values.
>>> from hydpy import round_, std_ratio >>> round_(std_ratio(sim=[1.0, 2.0, 3.0], obs=[1.0, 2.0, 3.0])) 0.0 >>> round_(std_ratio(sim=[1.0, 1.0, 1.0], obs=[1.0, 2.0, 3.0])) -1.0 >>> round_(std_ratio(sim=[0.0, 3.0, 6.0], obs=[1.0, 2.0, 3.0])) 2.0
See the documentation on function
prepare_arrays()
for some additional instructions for usingstd_ratio()
.
- hydpy.auxs.statstools.var_ratio(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the ratio between the variation coefficients of the simulated and the observed values.
>>> from hydpy import round_, var_ratio >>> round_(var_ratio(sim=[1.0, 2.0, 3.0], obs=[1.0, 2.0, 3.0])) 0.0 >>> round_(var_ratio(sim=[1.0, 2.0, 3.0], obs=[0.0, 1.0, 2.0])) -0.5 >>> round_(var_ratio(sim=[1.0, 2.0, 3.0], obs=[0.0, 2.0, 4.0])) -0.5
See the documentation on function
prepare_arrays()
for some additional instructions for usingvar_ratio()
.
- hydpy.auxs.statstools.corr(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the product-moment correlation coefficient after Pearson.
>>> from hydpy import corr, round_ >>> round_(corr(sim=[0.5, 1.0, 1.5], obs=[1.0, 2.0, 3.0])) 1.0 >>> round_(corr(sim=[4.0, 2.0, 0.0], obs=[1.0, 2.0, 3.0])) -1.0 >>> round_(corr(sim=[1.0, 2.0, 1.0], obs=[1.0, 2.0, 3.0])) 0.0
Take care if there is no variation in one of the data series. Then the correlation coefficient is not defined, and
corr()
returnsnan
:>>> round_(corr(sim=[2.0, 2.0, 2.0], obs=[1.0, 2.0, 3.0])) nan
See the documentation on function
prepare_arrays()
for some additional instructions for use of functioncorr()
.
- hydpy.auxs.statstools.hsepd_pdf(*, sigma1: float, sigma2: float, xi: float, beta: float, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) ndarray[Any, dtype[float64]] [source]¶
Calculate the probability densities based on the heteroskedastic skewed exponential power distribution.
For convenience, we store the required parameters of the probability density function as well as the simulated and observed values in a dictionary:
>>> import numpy >>> from hydpy import hsepd_pdf, round_ >>> general = {"sigma1": 0.2, ... "sigma2": 0.0, ... "xi": 1.0, ... "beta": 0.0, ... "sim": numpy.arange(10.0, 41.0), ... "obs": numpy.full(31, 25.0)}
The following test function allows for varying one parameter and prints some and plots all the probability density values corresponding to different simulated values:
>>> def test(**kwargs): ... from matplotlib import pyplot ... special = general.copy() ... name, values = list(kwargs.items())[0] ... results = numpy.zeros((len(general["sim"]), len(values)+1)) ... results[:, 0] = general["sim"] ... for jdx, value in enumerate(values): ... special[name] = value ... results[:, jdx+1] = hsepd_pdf(**special) ... pyplot.plot(results[:, 0], results[:, jdx+1], ... label="%s=%.1f" % (name, value)) ... pyplot.legend() ... for idx, result in enumerate(results): ... if not (idx % 5): ... round_(result)
When varying beta, the resulting probabilities correspond to the Laplace distribution (1.0), normal distribution (0.0), and the uniform distribution (-1.0), respectively. Note that we use -0.99 instead of -1.0 for approximating the uniform distribution to prevent from running into numerical problems, which are not solved yet:
>>> test(beta=[1.0, 0.0, -0.99]) 10.0, 0.002032, 0.000886, 0.0 15.0, 0.008359, 0.010798, 0.0 20.0, 0.034382, 0.048394, 0.057739 25.0, 0.141421, 0.079788, 0.057739 30.0, 0.034382, 0.048394, 0.057739 35.0, 0.008359, 0.010798, 0.0 40.0, 0.002032, 0.000886, 0.0
When varying xi, the resulting density is negatively skewed (0.2), symmetric (1.0), and positively skewed (5.0), respectively:
>>> test(xi=[0.2, 1.0, 5.0]) 10.0, 0.0, 0.000886, 0.003175 15.0, 0.0, 0.010798, 0.012957 20.0, 0.092845, 0.048394, 0.036341 25.0, 0.070063, 0.079788, 0.070063 30.0, 0.036341, 0.048394, 0.092845 35.0, 0.012957, 0.010798, 0.0 40.0, 0.003175, 0.000886, 0.0
In the above examples, the actual sigma (5.0) is calculated by multiplying sigma1 (0.2) with the mean simulated value (25.0) internally. This can be done for modelling homoscedastic errors. Instead, sigma2 is multiplied with the individual simulated values to account for heteroscedastic errors. With increasing values of sigma2, the resulting densities are modified as follows:
>>> test(sigma2=[0.0, 0.1, 0.2]) 10.0, 0.000886, 0.002921, 0.005737 15.0, 0.010798, 0.018795, 0.022831 20.0, 0.048394, 0.044159, 0.037988 25.0, 0.079788, 0.053192, 0.039894 30.0, 0.048394, 0.04102, 0.032708 35.0, 0.010798, 0.023493, 0.023493 40.0, 0.000886, 0.011053, 0.015771
See the documentation on function
prepare_arrays()
for some additional instructions for usinghsepd_pdf()
.
- hydpy.auxs.statstools.hsepd_manual(*, sigma1: float, sigma2: float, xi: float, beta: float, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None) float [source]¶
Calculate the mean of the logarithmic probability densities of the heteroskedastic skewed exponential power distribution.
The following examples stem from the documentation of function
hsepd_pdf()
, which is used by functionhsepd_manual()
. The first one deals with a heteroscedastic normal distribution:>>> from hydpy import hsepd_manual, round_ >>> round_(hsepd_manual(sigma1=0.2, sigma2=0.2, ... xi=1.0, beta=0.0, ... sim=numpy.arange(10.0, 41.0), ... obs=numpy.full(31, 25.0))) -3.682842
Too small probability density values are set to 1e-200 before calculating their logarithm (which means that the lowest possible value returned by function
hsepd_manual()
is approximately -460):>>> round_(hsepd_manual(sigma1=0.2, sigma2=0.0, ... xi=1.0, beta=-0.99, ... sim=numpy.arange(10.0, 41.0), ... obs=numpy.full(31, 25.0))) -209.539335
See the documentation on function
prepare_arrays()
for some additional instructions for usinghsepd_manual()
.
- hydpy.auxs.statstools.hsepd(*, sim: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, obs: Sequence[float] | ndarray[Any, dtype[float64]] | None = None, node: Node | None = None, skip_nan: bool = False, subperiod: bool | None = None, inits: Iterable[float] | None = None, return_pars: bool = False, silent: bool = True) float | tuple[float, tuple[float, float, float, float]] [source]¶
Calculate the mean of the logarithmic probability densities of the heteroskedastic skewed exponential power distribution.
Function
hsepd()
serves the same purpose as functionhsepd_manual()
but tries to estimate the parameters of the heteroscedastic skewed exponential distribution via an optimisation algorithm. This is shown by generating a random sample. One thousand simulated values are scattered around the observed (true) value of 10.0 with a standard deviation of 2.0:>>> import numpy >>> numpy.random.seed(0) >>> sim = numpy.random.normal(10.0, 2.0, 1000) >>> obs = numpy.full(1000, 10.0)
First, as a reference, we calculate the “true” value based on function
hsepd_manual()
and the correct distribution parameters:>>> from hydpy import hsepd, hsepd_manual, round_ >>> round_(hsepd_manual(sigma1=0.2, sigma2=0.0, xi=1.0, beta=0.0, sim=sim, obs=obs)) -2.100093
When using function
hsepd()
, the returned value is even a little “better”:>>> round_(hsepd(sim=sim, obs=obs)) -2.09983
This is due to the deviation from the random sample to its theoretical distribution. This is reflected by small differences between the estimated values and the theoretical values of sigma1 (0.2), sigma2 (0.0), xi (1.0), and beta (0.0). The estimated values are returned in the mentioned order by enabling the return_pars option:
>>> value, pars = hsepd(sim=sim, obs=obs, return_pars=True) >>> round_(pars, decimals=5) 0.19966, 0.0, 0.96836, 0.0188
There is no guarantee that the optimisation numerical optimisation algorithm underlying function
hsepd()
will always find the parameters resulting in the largest value returned by functionhsepd_manual()
. You can increase its robustness (and decrease computation time) by supplying close initial parameter values:>>> value, pars = hsepd(sim=sim, obs=obs, return_pars=True, ... inits=(0.2, 0.0, 1.0, 0.0)) >>> round_(pars, decimals=5) 0.19966, 0.0, 0.96836, 0.0188
However, the following example shows a case when this strategy results in worse results:
>>> value, pars = hsepd(sim=sim, obs=obs, return_pars=True, ... inits=(0.0, 0.2, 1.0, 0.0)) >>> round_(value) -2.174492 >>> round_(pars) 0.0, 0.213179, 1.705485, 0.505112
See the documentation on function
prepare_arrays()
for some additional instructions for usinghsepd()
.
- hydpy.auxs.statstools.calc_mean_time(timepoints: Sequence[float] | ndarray[Any, dtype[float64]], weights: Sequence[float] | ndarray[Any, dtype[float64]]) float [source]¶
Return the weighted mean of the given time points.
With equal given weights, the result is simply the mean of the given time points:
>>> from hydpy import calc_mean_time, round_ >>> round_(calc_mean_time(timepoints=[3.0, 7.0], weights=[2.0, 2.0])) 5.0
With different weights, the resulting time is shifted to the larger ones:
>>> round_(calc_mean_time(timepoints=[3.0, 7.0], weights=[1.0, 3.0])) 6.0
Or, in the most extreme case:
>>> round_(calc_mean_time(timepoints=[3.0, 7.0], weights=[0.0, 4.0])) 7.0
There are some checks for input plausibility, e.g.:
>>> calc_mean_time(timepoints=[3.0, 7.0], weights=[-2.0, 2.0]) Traceback (most recent call last): ... ValueError: While trying to calculate the weighted mean time, the following error occurred: For the following objects, at least one value is negative: weights.
- hydpy.auxs.statstools.calc_mean_time_deviation(timepoints: Sequence[float] | ndarray[Any, dtype[float64]], weights: Sequence[float] | ndarray[Any, dtype[float64]], mean_time: float | None = None) float [source]¶
Return the weighted deviation of the given timepoints from their mean time.
With equal given weights, the is simply the standard deviation of the given time points:
>>> from hydpy import calc_mean_time_deviation, round_ >>> round_(calc_mean_time_deviation(timepoints=[3.0, 7.0], weights=[2.0, 2.0])) 2.0
One can pass a precalculated mean time:
>>> from hydpy import round_ >>> round_(calc_mean_time_deviation( ... timepoints=[3.0, 7.0], weights=[2.0, 2.0], mean_time=4.0)) 2.236068
>>> round_(calc_mean_time_deviation(timepoints=[3.0, 7.0], weights=[1.0, 3.0])) 1.732051
Or, in the most extreme case:
>>> round_(calc_mean_time_deviation(timepoints=[3.0, 7.0], weights=[0.0, 4.0])) 0.0
There are some checks for input plausibility, e.g.:
>>> calc_mean_time_deviation(timepoints=[3.0, 7.0], weights=[-2.0, 2.0]) Traceback (most recent call last): ... ValueError: While trying to calculate the weighted time deviation from mean time, the following error occurred: For the following objects, at least one value is negative: weights.
- hydpy.auxs.statstools.calc_weights(nodes: Collection[Node]) dict[Node, float] [source]¶
Calculate “statistical” weights for all given nodes based on the number of observations within the evaluation period.
>>> from hydpy import calc_weights, nan, Node, print_vector, pub >>> pub.timegrids = "01.01.2000", "04.01.2000", "1d" >>> test1, test2 = Node("test1"), Node("test2") >>> test1.prepare_obsseries() >>> test1.sequences.obs.series = 4.0, 5.0, 6.0 >>> test2.prepare_obsseries() >>> with pub.options.checkseries(False): ... test2.sequences.obs.series = 3.0, nan, 1.0
>>> print_vector(calc_weights((test1, test2)).values()) 0.6, 0.4
>>> pub.timegrids.eval_.lastdate = "03.01.2000" >>> print_vector(calc_weights((test1, test2)).values()) 0.666667, 0.333333
>>> pub.timegrids.eval_.firstdate = "02.01.2000" >>> print_vector(calc_weights((test1, test2)).values()) 1.0, 0.0
>>> print_vector(calc_weights((test1,)).values()) 1.0
>>> print_vector(calc_weights((test2,)).values()) Traceback (most recent call last): ... RuntimeError: None of the given nodes (test2) provides any observation values for the current evaluation period (Timegrid("02.01.2000 00:00:00", "03.01.2000 00:00:00", "1d")).
>>> calc_weights(()) {}
- class hydpy.auxs.statstools.SummaryRow(name: str, nodes: Collection[Node])[source]¶
Bases:
ABC
Abstract base class for
SummaryRowSimple
andSummaryRowWeighted
.The documentation on function
print_evaluationtable()
explains the intended use of the availableSummaryRow
subclasses. Here, we demonstrate their configuration in more detail based on the subclassSummaryRowSimple
, which calculates simple (non-weighted) averages. You only need to pass the name and the node objects relevant for the corresponding row for initialising:>>> from hydpy import Nodes, print_vector, SummaryRowSimple >>> n1, n2, n3 = Nodes("n1", "n2", "n3") >>> s = SummaryRowSimple("s", (n1, n2))
print_evaluationtable()
calculates values for all node-criterion combinations and passes them tosummarise_criteria()
. If the nodes passed toprint_evaluationtable()
and theSummaryRow
instance are identical,SummaryRowSimple
just calculates the average for each criterion:>>> print_vector(s.summarise_criteria(2, {n1: [1.0, 2.0], n2: [3.0, 6.0]})) 2.0, 4.0
Nodes passed to
print_evaluationtable()
but not toSummaryRow
are considered irrelevant for the corresponding row and thus not taken into account for averaging:>>> print_vector(s.summarise_criteria(1, {n1: [1.0], n2: [3.0], n3: [5.0]})) 2.0
If the
SummaryRow
instance expects a node not passed toprint_evaluationtable()
, it raises the following error:>>> print_vector(s.summarise_criteria(1, {n1: [1.0]})) Traceback (most recent call last): ... RuntimeError: While trying to calculate the values of row `s` based on class `SummaryRowSimple`, the following error occurred: Missing information for node `n2`.
summarise_criteria()
generally returnsnan
values for allSummaryRow
instances that select no nodes:>>> SummaryRowSimple("s", ()).summarise_criteria(2, {n1: [1.0, 2.0]}) (nan, nan)
- class hydpy.auxs.statstools.SummaryRowSimple(name: str, nodes: Collection[Node])[source]¶
Bases:
SummaryRow
Helper to define additional “summary rows” in evaluation tables based on simple (non-weighted) averages.
See the documentation on class
SummaryRow
for further information.
- class hydpy.auxs.statstools.SummaryRowWeighted(name: str, nodes: Collection[Node], weights: Collection[float] | None = None)[source]¶
Bases:
SummaryRow
Helper to define additional “summary rows” in evaluation tables based on weighted averages.
The documentation on class
SummaryRow
provides general information on usingSummaryRow
subclasses, while the following examples focus on the unique features of classSummaryRowWeighted
.First, we prepare two nodes. n1 provides a complete and n2 provides an incomplete observation time series:
>>> from hydpy import print_vector, pub, Node, nan >>> pub.timegrids = "2000-01-01", "2000-01-04", "1d" >>> n1, n2 = Node("n1"), Node("n2") >>> n1.prepare_obsseries() >>> n1.sequences.obs.series = 4.0, 5.0, 6.0 >>> n2.prepare_obsseries() >>> with pub.options.checkseries(False): ... n2.sequences.obs.series = 3.0, nan, 1.0
We can pass predefined weighting coefficients to
SummaryRowWeighted
. Then, the completeness of the observation series is irrelevant:>>> sumrow = SummaryRowWeighted("sumrow", (n1, n2), (0.1, 0.9)) >>> print_vector(sumrow.summarise_criteria(2, {n1: [-1.0, 2.0], n2: [1.0, 6.0]})) 0.8, 5.6
If we do not pass any weights,
SummaryRowWeighted
determines them automatically based on the number of available observations per node by invoking functioncalc_weights()
:>>> sumrow = SummaryRowWeighted("sumrow", (n1, n2)) >>> print_vector(sumrow.summarise_criteria(2, {n1: [-1.0, 2.0], n2: [1.0, 6.0]})) -0.2, 3.6
SummaryRowWeighted
reuses the internally calculated weights but updates them when the evaluation time grid changes in the meantime:>>> pub.timegrids.eval_.firstdate = "2000-01-02" >>> print_vector(sumrow.summarise_criteria(2, {n1: [-1.0, 2.0], n2: [1.0, 6.0]})) -0.333333, 3.333333
nan
values calculated for individual nodes due to completely missing observations within the evaluation period do not leak into the results ofsummarise_criteria()
(if the corresponding weights are zero, as they should):>>> pub.timegrids.eval_.lastdate = "2000-01-03" >>> print_vector(sumrow.summarise_criteria(2, {n1: [-1.0, 2.0], n2: [nan, nan]})) -1.0, 2.0
- hydpy.auxs.statstools.print_evaluationtable(*, nodes: Collection[Node], criteria: Collection[Criterion], nodenames: Collection[str] | None = None, critnames: Collection[str] | None = None, critfactors: float | Collection[float] = 1.0, critdigits: int | Collection[int] = 2, subperiod: bool = True, average: bool = True, averagename: str = 'mean', summaryrows: Collection[SummaryRow] = (), filter_: float = 0.0, stepsize: Literal['daily', 'd', 'monthly', 'm'] | None = None, aggregator: str | Callable[[Sequence[float] | ndarray[Any, dtype[float64]]], float] = 'mean', missingvalue: str = '-', decimalseperator: str = '.', file_: str | TextIO | None = None) None [source]¶
Print a table containing the results of the given evaluation criteria for the given
Node
objects.First, we define two nodes with different simulation and observation data (see function
prepare_arrays()
for some explanations):>>> from hydpy import pub, Node, nan >>> pub.timegrids = "01.01.2000", "04.01.2000", "1d" >>> nodes = Node("test1"), Node("test2") >>> for node in nodes: ... node.prepare_allseries() >>> nodes[0].sequences.sim.series = 1.0, 2.0, 3.0 >>> nodes[0].sequences.obs.series = 4.0, 5.0, 6.0 >>> nodes[1].sequences.sim.series = 1.0, 2.0, 3.0 >>> with pub.options.checkseries(False): ... nodes[1].sequences.obs.series = 3.0, nan, 1.0
Selecting functions
corr()
andbias_abs()
as evaluation criteria, functionprint_evaluationtable()
prints the following table:>>> from hydpy import bias_abs, corr, print_evaluationtable >>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs)) corr bias_abs test1 1.00 -3.00 test2 -1.00 0.00 mean 0.00 -1.50
One can pass alternative names for the node objects, the criteria functions, and the row containing the average values. Also, one can use the filter_ argument to suppress printing statistics in case of incomplete observation data. In the following example, we set the minimum fraction of required data to 80 %:
>>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... nodenames=("first node", "second node"), ... critnames=("corrcoef", "bias"), ... critdigits=1, ... averagename="average", ... filter_=0.8) corrcoef bias first node 1.0 -3.0 second node - - average 1.0 -3.0
The number of assigned node objects and criteria functions must match the number of given alternative names:
>>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... nodenames=("first node",)) Traceback (most recent call last): ... ValueError: While trying to evaluate the simulation results of some node objects, the following error occurred: 2 node objects are given which does not match with number of given alternative names being 1.
>>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... critnames=("corrcoef",)) Traceback (most recent call last): ... ValueError: While trying to evaluate the simulation results of some node objects, the following error occurred: 2 criteria functions are given which does not match with number of given alternative names being 1.
Set the average argument to
False
to omit the row containing the average values:>>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... average=False) corr bias_abs test1 1.00 -3.00 test2 -1.00 0.00
The summaryrows argument is a more flexible alternative to the standard averaging across nodes. You can pass an arbitrary number of
SummaryRow
instances. Their names define the descriptions in the first column. Here, we include additional lines giving the complete averages for all nodes, averages for a subset of nodes (in fact, the “average” for the single node test2), automatically weighted averages (based on the number of available observations), and manually weighted averages (based on predefined weights):>>> from hydpy import SummaryRowSimple, SummaryRowWeighted >>> summaryrows = (SummaryRowSimple("complete", nodes), ... SummaryRowSimple("selective", (nodes[1],)), ... SummaryRowWeighted("automatically weighted", nodes), ... SummaryRowWeighted("manually weighted", nodes, (0.1, 0.9))) >>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... average=False, ... summaryrows=summaryrows) corr bias_abs test1 1.00 -3.00 test2 -1.00 0.00 complete 0.00 -1.50 selective -1.00 0.00 automatically weighted 0.20 -1.80 manually weighted -0.80 -0.30
You can use the arguments critfactors and critdigits by passing either a single number or a sequence of criteria-specific numbers to modify the printed values:
>>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... critfactors=(10.0, 0.1), ... critdigits=1) corr bias_abs test1 10.0 -0.3 test2 -10.0 0.0 mean 0.0 -0.2
By default, function
print_evaluationtable()
prints the statics relevant for the actual evaluation period only:>>> pub.timegrids.eval_.dates = "01.01.2000", "02.01.2000" >>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs)) corr bias_abs test1 - -3.00 test2 - -2.00 mean - -2.50
You can deviate from this default behaviour by setting the subperiod argument to
False
:>>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... subperiod=False) corr bias_abs test1 1.00 -3.00 test2 -1.00 0.00 mean 0.00 -1.50
Use the stepsize argument (eventually in combination with argument aggregator) to print the statistics of previously aggregated time series. See
aggregate_series()
for further information.Here, the daily aggregation step size results in identical results as the original step size is also one day:
>>> pub.timegrids.eval_ = pub.timegrids.init >>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... stepsize="daily", ... aggregator="mean") corr bias_abs test1 1.00 -3.00 test2 -1.00 0.00 mean 0.00 -1.50
For the monthly step size, the result table is empty due to the too short initialisation period covering less than a month:
>>> pub.timegrids.eval_.dates = pub.timegrids.init.dates >>> print_evaluationtable(nodes=nodes, ... criteria=(corr, bias_abs), ... stepsize="monthly", ... aggregator="mean") corr bias_abs test1 - - test2 - - mean - -