# Complexity measures

Be sure you have gone through the Tutorial before reading the API documented here, so that you have a good grasp of the terminology used in ComplexityMeasures.jl.

## Complexity measures API

The complexity measure API is defined by the `complexity` function, which takes as input a `ComplexityEstimator`. The function `complexity_normalized` is also useful.

`ComplexityMeasures.complexity` — Function

`complexity(c::ComplexityEstimator, x) → m::Real`

Estimate a complexity measure according to `c` for input data `x`, where `c` is an instance of any subtype of `ComplexityEstimator`.

`ComplexityMeasures.complexity_normalized` — Function

`complexity_normalized(c::ComplexityEstimator, x) → m::Real ∈ [a, b]`

The same as `complexity`, but the result is normalized to the interval `[a, b]`, where `[a, b]` depends on `c`.

`ComplexityMeasures.ComplexityEstimator` — Type

`ComplexityEstimator`

Supertype for estimators of various complexity measures that are not entropies in the strict mathematical sense.

See `complexity` for all available estimators.
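To get a feel for how the pieces fit together, here is a minimal usage sketch (the data and the choice of `SampleEntropy`, documented below, are purely illustrative):

```
using ComplexityMeasures

x = randn(10_000)                # some input timeseries
est = SampleEntropy(x)           # a ComplexityEstimator; the radius r is inferred from x
complexity(est, x)               # estimate the complexity measure defined by est
complexity_normalized(est, x)    # the same measure, normalized to [a, b] = [0, 1]
```

The same two-argument pattern applies to every estimator listed on this page.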

## Approximate entropy

`ComplexityMeasures.ApproximateEntropy` — Type

```
ApproximateEntropy <: ComplexityEstimator
ApproximateEntropy([x]; r = 0.2std(x), kwargs...)
```

An estimator for the approximate entropy (Pincus, 1991) complexity measure, used with `complexity`.

The keyword argument `r` is mandatory if an input timeseries `x` is not provided.

**Keyword arguments**

- `r::Real`: The radius used when querying for nearest neighbors around points. Its value should be determined from the input data, for example as some proportion of the standard deviation of the data.
- `m::Int = 2`: The embedding dimension.
- `τ::Int = 1`: The embedding lag.
- `base::Real = MathConstants.e`: The base to use for the logarithm. Pincus (1991) uses the natural logarithm.

**Description**

Approximate entropy (ApEn) is defined as

\[ApEn(m, r) = \lim_{N \to \infty} \left[ \phi(x, m, r) - \phi(x, m + 1, r) \right].\]

Approximate entropy is estimated for a timeseries `x` by first embedding `x` using embedding dimension `m` and embedding lag `τ`, then searching for similar vectors within tolerance radius `r`, using the estimator described below, with logarithms to the given `base` (the natural logarithm is used in Pincus, 1991).

Specifically, for a finite-length timeseries `x`, an estimator for $ApEn(m, r)$ is

\[ApEn(m, r, N) = \phi(x, m, r, N) - \phi(x, m + 1, r, N),\]

where `N = length(x)` and

\[\phi(x, k, r, N) = \dfrac{1}{N-(k-1)\tau} \sum_{i=1}^{N - (k-1)\tau} \log{\left( \sum_{j = 1}^{N-(k-1)\tau} \dfrac{\theta(d({\bf x}_i^k, {\bf x}_j^k) \leq r)}{N-(k-1)\tau} \right)}.\]

Here, $\theta(\cdot)$ returns 1 if the argument is true and 0 otherwise, $d({\bf x}_i, {\bf x}_j)$ returns the Chebyshev distance between vectors ${\bf x}_i$ and ${\bf x}_j$, and the `k`-dimensional embedding vectors are constructed from the input timeseries $x(t)$ as

\[{\bf x}_i^k = (x(i), x(i+\tau), x(i+2\tau), \ldots, x(i+(k-1)\tau)).\]

In the original paper, they fix `τ = 1`. In our implementation, the normalization constant is modified to account for embeddings with `τ != 1`.
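A minimal usage sketch (the data and parameter values below are illustrative only):

```
using ComplexityMeasures

x = randn(2000)
# Passing x to the constructor sets the radius to the default r = 0.2*std(x)
est = ApproximateEntropy(x; m = 2, τ = 1)
complexity(est, x)
```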

## Sample entropy

`ComplexityMeasures.SampleEntropy` — Type

`SampleEntropy([x]; r = 0.2std(x), kwargs...) <: ComplexityEstimator`

An estimator for the sample entropy complexity measure (Richman and Moorman, 2000), used with `complexity` and `complexity_normalized`.

The keyword argument `r` is mandatory if an input timeseries `x` is not provided.

**Keyword arguments**

- `r::Real`: The radius used when querying for nearest neighbors around points. Its value should be determined from the input data, for example as some proportion of the standard deviation of the data.
- `m::Int = 2`: The embedding dimension.
- `τ::Int = 1`: The embedding lag.

**Description**

An *estimator* for sample entropy using radius `r`, embedding dimension `m`, and embedding lag `τ` is

\[SampEn(m, r, N) = -\ln{\dfrac{A(r, m, N)}{B(r, m, N)}}.\]

Here,

\[\begin{aligned} B(r, m, N) = \sum_{i = 1}^{N-m\tau} \sum_{j = 1, j \neq i}^{N-m\tau} \theta(d({\bf x}_i^m, {\bf x}_j^m) \leq r) \\ A(r, m, N) = \sum_{i = 1}^{N-m\tau} \sum_{j = 1, j \neq i}^{N-m\tau} \theta(d({\bf x}_i^{m+1}, {\bf x}_j^{m+1}) \leq r) \\ \end{aligned},\]

where $\theta(\cdot)$ returns 1 if the argument is true and 0 otherwise, $d(x, y)$ computes the Chebyshev distance between $x$ and $y$, and ${\bf x}_i^{m}$ and ${\bf x}_i^{m+1}$ are `m`-dimensional and `m+1`-dimensional embedding vectors, where `k`-dimensional embedding vectors are constructed from the input timeseries $x(t)$ as

\[{\bf x}_i^k = (x(i), x(i+\tau), x(i+2\tau), \ldots, x(i+(k-1)\tau)).\]

Quoting Richman & Moorman (2000): "SampEn(m,r,N) will be defined except when B = 0, in which case no regularity has been detected, or when A = 0, which corresponds to a conditional probability of 0 and an infinite value of SampEn(m,r,N)". In these cases, `NaN` is returned.

If computing the normalized measure, then the resulting sample entropy is on `[0, 1]`.

The original algorithm fixes `τ = 1`. All formulas here are modified to account for any `τ`.

See also: `entropy_sample`.
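A minimal usage sketch, here constructing the estimator without passing the data, so that `r` must be given explicitly (values are illustrative):

```
using ComplexityMeasures, Statistics

x = randn(2000)
# Without an input timeseries, the radius r is a mandatory keyword
est = SampleEntropy(r = 0.2 * std(x), m = 2, τ = 1)
complexity(est, x)              # sample entropy
complexity_normalized(est, x)   # normalized to [0, 1]
```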

## Missing dispersion patterns

`ComplexityMeasures.MissingDispersionPatterns` — Type

```
MissingDispersionPatterns <: ComplexityEstimator
MissingDispersionPatterns(o = Dispersion()) → mdp
```

An estimator for the number of missing dispersion patterns (MDP), a complexity measure which can be used to detect nonlinearity in time series (Zhou *et al.*, 2023).

Used with `complexity` or `complexity_normalized`.

**Description**

When used with `complexity`, `complexity(mdp, x)` is syntactically equivalent to `missing_outcomes(o, x)`. When used with `complexity_normalized`, the normalization is simply `missing_outcomes(o, x)/total_outcomes(o)`.

`Dispersion`'s linear mapping from CDFs to integers is based on equidistant partitioning of the interval `[0, 1]`. This is slightly different from Zhou *et al.* (2023), which uses the linear mapping $s_i := \text{round}(y + 0.5)$.

**Usage**

In Zhou *et al.* (2023), `MissingDispersionPatterns` is used to detect nonlinearity in time series by comparing the MDP for a time series `x` to values for an ensemble of surrogates of `x`, as per the standard analysis of TimeseriesSurrogates.jl. If the MDP value of $x$ is significantly larger than some high quantile of the surrogate distribution, then it is taken as evidence of nonlinearity.

See also: `Dispersion`, `ReverseDispersion`, `total_outcomes`.
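A minimal usage sketch (the `Dispersion` parameters and data are illustrative; a relatively short series is used so that some of the possible patterns are likely to be missing):

```
using ComplexityMeasures

x = randn(500)
o = Dispersion(c = 5, m = 3)      # 5^3 = 125 possible dispersion patterns
est = MissingDispersionPatterns(o)
complexity(est, x)                # number of missing dispersion patterns
complexity_normalized(est, x)     # fraction of patterns that are missing
```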

## Reverse dispersion entropy

`ComplexityMeasures.ReverseDispersion` — Type

```
ReverseDispersion <: ComplexityEstimator
ReverseDispersion(; c = 3, m = 2, τ = 1, check_unique = true)
```

Estimator for the reverse dispersion entropy complexity measure (Li *et al.*, 2019).

**Description**

Li *et al.* (2019) defines the reverse dispersion entropy as

\[H_{rde} = \sum_{i = 1}^{c^m} \left(p_i - \dfrac{1}{{c^m}} \right)^2 = \left( \sum_{i=1}^{c^m} p_i^2 \right) - \dfrac{1}{c^{m}}\]

where the probabilities $p_i$ are obtained precisely as for the `Dispersion` probability estimator. Relative frequencies of dispersion patterns are computed using the given `encoding` scheme, which defaults to encoding using the normal cumulative distribution function (NCDF), as implemented by `GaussianCDFEncoding`, using embedding dimension `m` and embedding delay `τ`. Recommended parameter values (Li *et al.*, 2019) are `m ∈ [2, 3]`, `τ = 1` for the embedding, and `c ∈ [3, 4, …, 8]` categories for the Gaussian mapping.

If normalizing, then the reverse dispersion entropy is normalized to `[0, 1]`.

The minimum value of $H_{rde}$ is zero, which occurs precisely when the dispersion pattern distribution is flat, i.e. when all $p_i$ are equal to $1/c^m$. Because $H_{rde} \geq 0$, $H_{rde}$ can therefore be said to be a measure of how far the dispersion pattern probability distribution is from white noise.

**Data requirements**

The input must have more than one unique element for the default `GaussianCDFEncoding` to be well-defined. Li *et al.* (2019) recommends that `x` has at least 1000 data points.

If `check_unique == true` (default), then it is checked that the input has more than one unique value. If `check_unique == false` and the input only has one unique element, then an `InexactError` is thrown when trying to compute probabilities.
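A minimal usage sketch, using parameters within the recommended ranges above (the data are illustrative):

```
using ComplexityMeasures

x = randn(1000)
est = ReverseDispersion(c = 3, m = 2, τ = 1)
complexity(est, x)              # reverse dispersion entropy
complexity_normalized(est, x)   # normalized to [0, 1]
```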

## Statistical complexity

`ComplexityMeasures.StatisticalComplexity` — Type

```
StatisticalComplexity <: ComplexityEstimator
StatisticalComplexity(; kwargs...)
```

An estimator for the statistical complexity and entropy, originally introduced by Rosso *et al.* (2007) and generalized by Rosso *et al.* (2013).

Our implementation extends the generalization to any valid distance metric, any `OutcomeSpace` with a priori known `total_outcomes`, any `ProbabilitiesEstimator`, and any normalizable discrete `InformationMeasure`.

Used with `complexity`.

**Keyword arguments**

- `o::OutcomeSpace = OrdinalPatterns{3}()`: The `OutcomeSpace`, which controls how the input data are discretized.
- `pest::ProbabilitiesEstimator = RelativeAmount()`: The `ProbabilitiesEstimator` used to estimate probabilities over the discretized input data.
- `hest = Renyi()`: A `DiscreteInfoEstimator` or an `InformationMeasure`. Any information measure that defines `information_maximum` is valid here, including extropies. The measure will be estimated using the `PlugIn` estimator if not given an estimator.
- `dist <: SemiMetric = JSDivergence()`: The distance measure (from Distances.jl) to use for estimating the distance between the estimated probability distribution and a uniform distribution with the same maximal number of outcomes.
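As a sketch of how these keywords combine (the particular choices below are illustrative, not recommendations):

```
using ComplexityMeasures

x = randn(10_000)
c = StatisticalComplexity(
    o = OrdinalPatterns{4}(),    # outcome space: ordinal patterns of order 4
    pest = RelativeAmount(),     # plain relative-frequency probabilities
    hest = Renyi(q = 2),         # information measure (estimated with PlugIn by default)
)
complexity(c, x)
```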

**Description**

Statistical complexity is defined as

\[C_q[P] = \mathcal{H}_q\cdot \mathcal{Q}_q[P],\]

where $\mathcal{Q}_q$ is a "disequilibrium" obtained from a distance measure and $\mathcal{H}_q$ is a disorder measure. In the original paper (Rosso *et al.*, 2007), this complexity measure was defined via an ordinal pattern-based probability distribution (see `OrdinalPatterns`), using `Shannon` entropy as the information measure, and the Jensen-Shannon divergence as the distance measure.

Our implementation is a further generalization of the complexity measure developed in Rosso *et al.* (2013). We let $\mathcal{H}_q$ be any normalizable `InformationMeasure`, e.g. `Shannon`, `Renyi` or `Tsallis` entropy, and we let $\mathcal{Q}_q$ be either the Euclidean, Wootters, Kullback, q-Kullback, Jensen or q-Jensen distance as

\[\mathcal{Q}_q[P] = \mathcal{Q}_q^0\cdot D[P, P_e],\]

where $D[P, P_e]$ is the distance between the obtained distribution $P$ and a uniform distribution with the same maximum number of bins, measured by the distance measure `dist`.

**Usage**

The statistical complexity is exclusively used in combination with the chosen information measure (typically an entropy), whose estimated value is stored as a `Ref` value in the estimator struct. For example:

```
x = randn(100)
c = StatisticalComplexity()
compl = complexity(c, x)
entr = first(entropy_complexity(c, x)) # the corresponding information measure (entropy) value
```

`complexity(c::StatisticalComplexity, x)` returns only the statistical complexity. To obtain both the value of the entropy (or other information measure) and the statistical complexity together as a `Tuple`, use the wrapper `entropy_complexity`.

See also: `entropy_complexity_curves`.

`ComplexityMeasures.entropy_complexity` — Function

`entropy_complexity(c::StatisticalComplexity, x) → (h, compl)`

Return an information measure value `h` and the corresponding `StatisticalComplexity` value `compl`.

Useful when wanting to plot data on the "entropy-complexity plane". See also `entropy_complexity_curves`.
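A minimal sketch (the data are illustrative):

```
using ComplexityMeasures

x = randn(10_000)
c = StatisticalComplexity()
h, compl = entropy_complexity(c, x)   # one point (h, compl) on the entropy-complexity plane
```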

`ComplexityMeasures.entropy_complexity_curves` — Function

```
entropy_complexity_curves(c::StatisticalComplexity;
    num_max = 1, num_min = 1000) -> (min_entropy_complexity, max_entropy_complexity)
```

Calculate the minimum and maximum complexity-entropy curves for the statistical complexity according to Rosso *et al.* (2007), using `num_max * total_outcomes(c.o)` different values of the normalized information measure of choice for the maximum complexity curve, and `num_min` different values of the normalized information measure of choice for the minimum complexity curve.

This function can also be used to compute the maximum "complexity-extropy curve" if `c.hest` is e.g. `ShannonExtropy`, which is the equivalent of the complexity-entropy curves, but using extropy instead of entropy.

**Description**

The way the statistical complexity is designed, there is a minimum and maximum possible complexity for data with a given value of an information measure. The calculation time of the maximum complexity curve grows as `O(total_outcomes(c.o)^2)`, and thus takes very long for high numbers of outcomes. This function is inspired by S. Sippel's implementation in statcomp (Sippel *et al.*, 2016).

This function will work with any `ProbabilitiesEstimator` where `total_outcomes` is known a priori.
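A sketch of computing the bounding curves for a given estimator configuration (the configuration and keyword values are illustrative):

```
using ComplexityMeasures

c = StatisticalComplexity(o = OrdinalPatterns{4}())
min_curve, max_curve = entropy_complexity_curves(c; num_min = 500)
# Each curve is a collection of (entropy, complexity) points; measured (h, compl)
# values for real data fall between these curves on the entropy-complexity plane.
```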

## Lempel-Ziv complexity

`ComplexityMeasures.LempelZiv76` — Type

```
LempelZiv76 <: ComplexityEstimator
LempelZiv76()
```

The Lempel-Ziv, or `LempelZiv76`, complexity measure (Lempel and Ziv, 1976), which is used with `complexity` and `complexity_normalized`.

For results to be comparable across sequences with different lengths, use the normalized version. Normalized `LempelZiv76` complexity is implemented as given in Amigó *et al.* (2004). The normalized measure is close to zero for very regular signals, while for random sequences it is close to 1 with high probability (Amigó *et al.*, 2004). Note: the normalized `LempelZiv76` complexity can be higher than 1 (Amigó *et al.*, 2004).

The `LempelZiv76` measure applies only to binary sequences, i.e. sequences with a two-element alphabet (precisely two distinct outcomes). For performance optimization, we do not check the number of unique elements in the input. If your input sequence is not binary, you must `encode` it first using one of the implemented `Encoding` schemes (or encode your data manually).
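A minimal usage sketch on a binary sequence (the data are illustrative):

```
using ComplexityMeasures

x = rand(0:1, 10_000)            # a binary (two-symbol) sequence
est = LempelZiv76()
complexity(est, x)               # raw LempelZiv76 complexity
complexity_normalized(est, x)    # normalized version, comparable across sequence lengths
```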

## Bubble entropy

`ComplexityMeasures.BubbleEntropy` — Type

```
BubbleEntropy <: ComplexityEstimator
BubbleEntropy(; m = 3, τ = 1, definition = Renyi(q = 2))
```

The `BubbleEntropy` complexity estimator (Manis *et al.*, 2017) is just a difference between two entropies, each computed with the `BubbleSortSwaps` outcome space, for embedding dimensions `m + 1` and `m`, respectively.

Manis *et al.* (2017) use the `Renyi` entropy of order `q = 2` as the information measure `definition`, but here you can use any `InformationMeasure`. Manis *et al.* (2017) formulate the "bubble entropy" as the normalized measure below, while here you can also compute the unnormalized measure.

**Definition**

For input data `x`, the "bubble entropy" is computed by first embedding the input data using embedding dimension `m` and embedding delay `τ` (call the embedded points `y`), and then computing the difference between the two entropies:

\[BubbleEn_T(\tau) = H_T(y, m + 1) - H_T(y, m)\]

where $H_T(y, m)$ and $H_T(y, m + 1)$ are entropies of type $T$ (e.g. `Renyi`) computed with the input data `x` embedded to dimension $m$ and $m+1$, respectively. Use `complexity` to compute this non-normalized version. Use `complexity_normalized` to compute the normalized difference of entropies:

\[BubbleEn_T(\tau)^{norm} = \dfrac{H_T(y, m + 1) - H_T(y, m)}{\max(H_T(y, m + 1)) - \max(H_T(y, m))},\]

where the maxima of the entropies for dimensions `m` and `m + 1` are computed using `information_maximum`.

**Example**

```
using ComplexityMeasures
x = rand(1000)
est = BubbleEntropy(m = 5, τ = 3)
complexity(est, x)
```