Discretization API

Encoding multiple input datasets

A fundamental operation when computing multivariate information measures from data is discretization. When discretizing, we "encode" input data into an intermediate representation indexed by the positive integers. This intermediate representation is called an "encoding". It is useful because, once data are mapped to integers, occurrences of outcomes can be counted and turned into the probability estimates that the information measures build on.

The following functions and types are used by Associations.jl to perform discretization of input data.

Associations.CodifyVariablesType
CodifyVariables <: Discretization
CodifyVariables(outcome_space::OutcomeSpace)

The CodifyVariables discretization scheme quantises input data in a column-wise manner using the given outcome_space.

Compatible outcome spaces

Description

The main difference between CodifyVariables and CodifyPoints is that the former uses OutcomeSpaces for discretization. This usually means that some transformation is applied to the data before discretizing. For example, some outcome spaces construct a delay embedding from the input (and thus encode sequential information) before encoding the data.

Specifically, given x::AbstractStateSpaceSet..., where the i-th dataset x[i] is assumed to represent a single series of measurements, CodifyVariables encodes x[i] by codify-ing it into a series of integers using an appropriate OutcomeSpace. This is typically done by first sequentially transforming the data, then running a sliding window (whose width is controlled by outcome_space) across the data, and encoding the values within each window as an integer.

Examples

using Associations
x, y = rand(100), rand(100)
d = CodifyVariables(OrdinalPatterns(m=2))
cx, cy = codify(d, x, y)
source
Associations.CodifyPointsType
CodifyPoints{N}
CodifyPoints(encodings::NTuple{N, Encoding})

CodifyPoints is a Discretization scheme that encodes input data points without applying any sequential transformation to the input (as opposed to CodifyVariables, which may apply some transformation before encoding).

Usage

  • Use with codify to encode/discretize input variables on a point-by-point basis.

Compatible encodings

Description

Given x::AbstractStateSpaceSet..., where the i-th dataset is assumed to represent a single series of measurements, CodifyPoints encodes each point pₖ ∈ x[i] using some Encoding(s), without applying any (sequential) transformation to the x[i] first. This behaviour is different to CodifyVariables, which does apply a transformation to x[i] before encoding.

If length(x) == N (i.e. there are N input datasets), then encodings must be a tuple of N Encodings. Alternatively, if encodings is a single Encoding, then that same encoding is applied to every x[i].

Examples

using Associations

# The same encoding on two input datasets
x = StateSpaceSet(rand(100, 3))
y = StateSpaceSet(rand(100, 3))
encoding_ord = OrdinalPatternEncoding(3)
cx, cy = codify(CodifyPoints(encoding_ord), x, y)

# Different encodings on multiple datasets
z = StateSpaceSet(rand(100, 2))
encoding_bin = RectangularBinEncoding(RectangularBinning(3), z)
d = CodifyPoints(encoding_ord, encoding_ord, encoding_bin)
cx, cy, cz = codify(d, x, y, z)
source
ComplexityMeasures.codifyFunction
codify(o::OutcomeSpace, x::Vector) → s::Vector{Int}
codify(o::OutcomeSpace, x::AbstractStateSpaceSet{D}) → s::NTuple{D, Vector{Int}}

Codify x according to the outcome space o. If x is a Vector, then a Vector{<:Integer} is returned. If x is a StateSpaceSet{D}, then symbolization is done column-wise and an NTuple{D, Vector{<:Integer}} is returned, where D = dimension(x).

Description

The reason this function exists is that we don't always want to encode the entire input x at once. Sometimes it is desirable to first apply some transformation to x, then apply Encodings in a point-wise manner in the transformed space (the OutcomeSpace dictates this transformation). This is useful for encoding timeseries data.

The length of the returned s depends on the OutcomeSpace. Some outcome spaces preserve the input data length (e.g. UniqueElements), while others (e.g. OrdinalPatterns) first perform a delay embedding before encoding, so that length(s) < length(x).
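As a quick illustration (a hedged sketch using outcome spaces documented later in this section):

using ComplexityMeasures
x = rand(100)
length(codify(UniqueElements(), x))        # 100: input length is preserved
length(codify(OrdinalPatterns(m = 3), x))  # 98: the delay embedding consumes (m-1)*τ points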

source

In summary, the two main ways of discretizing data in Associations are as follows.

  • The CodifyPoints discretization scheme encodes input data on a point-by-point basis by applying some Encoding to each point.
  • The CodifyVariables discretization scheme encodes input data on a column-by-column basis by applying a sliding window to each column and encoding the data within each window according to some OutcomeSpace (internally, this uses codify).
Note

Encoding, OutcomeSpace and codify are all from ComplexityMeasures.jl. In this package, they are used to discretize multiple input variables instead of just one input variable.
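To make the contrast concrete, here is a minimal sketch (parameter values are arbitrary, and we assume that single-input codify returns one integer vector per input variable/column):

using Associations
x = StateSpaceSet(rand(100, 3))
# Point-by-point: each 3-dimensional point is mapped to one integer; no transformation applied.
cp = codify(CodifyPoints(OrdinalPatternEncoding(3)), x)
# Column-by-column: each column is delay-embedded by a sliding window, then encoded.
cv = codify(CodifyVariables(OrdinalPatterns(m = 2)), x) # one integer vector per column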

Encoding per point/row

In some cases, it may be desirable to encode data on a row-wise basis. This typically happens when working with pre-embedded time series or StateSpaceSets (respecting the fact that time ordering is already taken care of by the embedding procedure). If we want to apply something like OrdinalPatternEncoding to such data, then we must encode each point individually, such that vectors like [1.2, 2.4, 4.5] or ["howdy", "partner"] get mapped to integers. The CodifyPoints discretization instruction ensures input data are encoded on a point-by-point basis.

A point-by-point discretization using CodifyPoints is formally done by applying some Encoding to each input data point. You can pick between the following encodings, or combine them in arbitrary ways using CombinationEncoding.

ComplexityMeasures.EncodingType
Encoding

The supertype for all encoding schemes. Encodings always encode elements of input data into the positive integers. The encoding API is defined by the functions encode and decode. Some probability estimators utilize encodings internally.

Currently available encodings are GaussianCDFEncoding, OrdinalPatternEncoding, RelativeMeanEncoding, RelativeFirstDifferenceEncoding, UniqueElementsEncoding, and CombinationEncoding, all documented below.

source
ComplexityMeasures.GaussianCDFEncodingType
GaussianCDFEncoding <: Encoding
GaussianCDFEncoding{m}(; μ, σ, c::Int = 3)

An encoding scheme that encodes a scalar or vector χ into one of the integers sᵢ ∈ [1, 2, …, c] based on the normal cumulative distribution function (NCDF), and decodes the sᵢ into subintervals of [0, 1] (with some loss of information).

Initializing a GaussianCDFEncoding

The size of the input to be encoded must be known beforehand. One must therefore set m = length(χ), where χ is the input (m = 1 for scalars, m ≥ 2 for vectors). To do so, one must explicitly give m as a type parameter: e.g. encoding = GaussianCDFEncoding{3}(; μ = 0.0, σ = 0.1) to encode 3-element vectors, or encoding = GaussianCDFEncoding{1}(; μ = 0.0, σ = 0.1) to encode scalars.

Description

Encoding/decoding scalars

GaussianCDFEncoding first maps an input scalar $χ$ to a new real number $y \in [0, 1]$ by using the normal cumulative distribution function (CDF) with the given mean μ and standard deviation σ, according to the map

\[\chi \to y : y = \dfrac{1}{\sigma \sqrt{2 \pi}} \int_{-\infty}^{\chi} e^{-(t - \mu)^2/(2 \sigma^2)} \, dt.\]

Next, the interval [0, 1] is equidistantly binned and enumerated $1, 2, \ldots, c$, and $y$ is linearly mapped to one of these integers using the linear map $y \to z : z = \text{floor}(y(c-1)) + 1$.

Because of the floor operation, some information is lost, so when used with decode, each decoded sᵢ is mapped to a subinterval of [0, 1]. This subinterval is returned as a length-1 Vector{SVector}.

Notice that the decoding step does not yield an element of any outcome space of the estimators that use GaussianCDFEncoding internally, such as Dispersion. That is because these estimators additionally delay embed the encoded data.

Encoding/decoding vectors

If GaussianCDFEncoding is used with a vector χ, then each element of χ is encoded separately, resulting in a sequence of length(χ) integers which may be treated as a CartesianIndex. The encoded symbol s ∈ [1, 2, …, c^m] is then just the linear index corresponding to this cartesian index (similar to how CombinationEncoding works).

When decoded, the integer symbol s is converted back into its CartesianIndex representation, which is just a sequence of integers that refer to subdivisions of the [0, 1] interval. The relevant subintervals are then returned as a length(χ)-element Vector{SVector}.
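A hedged sketch of vector-valued encoding (the parameter values here are chosen arbitrarily):

using ComplexityMeasures
encoding = GaussianCDFEncoding{2}(; μ = 0.0, σ = 1.0, c = 3)
χ = [0.5, -0.5]                    # 2-element input vector
s = encode(encoding, χ)             # a single integer in 1:c^2 = 1:9
subintervals = decode(encoding, s)  # one subinterval of [0, 1] per element of χ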

Examples

julia> using ComplexityMeasures, Statistics

julia> x = [0.1, 0.4, 0.7, -2.1, 8.0];

julia> μ, σ = mean(x), std(x); encoding = GaussianCDFEncoding(; μ, σ, c = 5);

julia> es = encode.(Ref(encoding), x)
5-element Vector{Int64}:
 2
 2
 3
 1
 5

julia> decode(encoding, 3)
2-element SVector{2, Float64} with indices SOneTo(2):
 0.4
 0.6
source
ComplexityMeasures.OrdinalPatternEncodingType
OrdinalPatternEncoding <: Encoding
OrdinalPatternEncoding{m}(lt = ComplexityMeasures.isless_rand)

An encoding scheme that encodes length-m vectors into their permutation/ordinal patterns and then into the integers based on the Lehmer code. It is used by OrdinalPatterns and similar estimators; see their documentation for a description of the outcome space.

The ordinal/permutation pattern of a vector χ is simply sortperm(χ), which gives the indices that would sort χ in ascending order.

Description

The Lehmer code, as implemented here, is a bijection between the set of factorial(m) possible permutations for a length-m sequence, and the integers 1, 2, …, factorial(m). The encoding step uses algorithm 1 in Berger et al. (2019), which is highly optimized. The decoding step is much slower due to missing optimizations (pull requests welcomed!).

Example

julia> using ComplexityMeasures

julia> χ = [4.0, 1.0, 9.0];

julia> c = OrdinalPatternEncoding(3);

julia> i = encode(c, χ)
3

julia> decode(c, i)
3-element SVector{3, Int64} with indices SOneTo(3):
 2
 1
 3

If you want to encode something that is already a permutation pattern, then you can use the non-exported permutation_to_integer function.

source
ComplexityMeasures.RelativeMeanEncodingType
RelativeMeanEncoding <: Encoding
RelativeMeanEncoding(minval::Real, maxval::Real; n = 2)

RelativeMeanEncoding encodes a vector based on the relative position the mean of the vector has with respect to a predefined minimum and maximum value (minval and maxval, respectively).

Description

This encoding is inspired by Azami and Escudero (2016)'s algorithm for amplitude-aware permutation entropy. They use a linear combination of amplitude information and first differences information of state vectors to correct probabilities. Here, however, we explicitly encode the amplitude part of the correction as an integer symbol Λ ∈ [1, 2, …, n]. The first-difference part of the encoding is available as the RelativeFirstDifferenceEncoding encoding.

Encoding/decoding

When used with encode, an $m$-element state vector $\bf{x} = (x_1, x_2, \ldots, x_m)$ is encoded as $Λ = \dfrac{1}{m}\sum_{i=1}^m |x_i|$. The value of $Λ$ is then normalized to lie on the interval [0, 1], assuming that the minimum/maximum value any single element $x_i$ can take is minval/maxval, respectively. Finally, the interval [0, 1] is discretized into n discrete bins, enumerated by the positive integers 1, 2, …, n, and the number of the bin that the normalized $Λ$ falls into is returned.

When used with decode, the left-edge of the bin that the normalized $Λ$ fell into is returned.
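For example (a small sketch; the values are chosen so the arithmetic is easy to follow):

using ComplexityMeasures
encoding = RelativeMeanEncoding(0, 1; n = 4)
x = [0.1, 0.4, 0.7]  # mean of |xᵢ| is 0.4, so the normalized Λ is 0.4
encode(encoding, x)   # 0.4 lies in the second of four bins of [0, 1], giving the symbol 2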

source
ComplexityMeasures.RelativeFirstDifferenceEncodingType
RelativeFirstDifferenceEncoding <: Encoding
RelativeFirstDifferenceEncoding(minval::Real, maxval::Real; n = 2)

RelativeFirstDifferenceEncoding encodes a vector based on the relative position that the average of the vector's first differences has with respect to a predefined minimum and maximum value (minval and maxval, respectively).

Description

This encoding is inspired by Azami and Escudero (2016)'s algorithm for amplitude-aware permutation entropy. They use a linear combination of amplitude information and first differences information of state vectors to correct probabilities. Here, however, we explicitly encode the first differences part of the correction as an integer symbol Λ ∈ [1, 2, …, n]. The amplitude part of the encoding is available as the RelativeMeanEncoding encoding.

Encoding/decoding

When used with encode, an $m$-element state vector $\bf{x} = (x_1, x_2, \ldots, x_m)$ is encoded as $Λ = \dfrac{1}{m - 1}\sum_{k=2}^m |x_{k} - x_{k-1}|$. The value of $Λ$ is then normalized to lie on the interval [0, 1], assuming that the minimum/maximum value any single $|x_k - x_{k-1}|$ can take is minval/maxval, respectively. Finally, the interval [0, 1] is discretized into n discrete bins, enumerated by the positive integers 1, 2, …, n, and the number of the bin that the normalized $Λ$ falls into is returned. The smaller the mean first difference of the state vector, the smaller the bin number; the larger the mean first difference, the larger the bin number.

When used with decode, the left-edge of the bin that the normalized $Λ$ fell into is returned.
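For example (a small sketch, analogous to the RelativeMeanEncoding example above):

using ComplexityMeasures
encoding = RelativeFirstDifferenceEncoding(0, 1; n = 2)
x = [0.1, 0.4, 0.7]  # mean |first difference| = (0.3 + 0.3)/2 = 0.3, so the normalized Λ is 0.3
encode(encoding, x)   # 0.3 lies in the first of two bins of [0, 1], giving the symbol 1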

Performance tips

If you are encoding multiple input vectors, it is more efficient to construct a RelativeFirstDifferenceEncoding instance and re-use it:

using ComplexityMeasures
minval, maxval = 0, 1
encoding = RelativeFirstDifferenceEncoding(minval, maxval; n = 4)
pts = [rand(3) for i = 1:1000]
[encode(encoding, x) for x in pts]
source
ComplexityMeasures.UniqueElementsEncodingType
UniqueElementsEncoding <: Encoding
UniqueElementsEncoding(x)

UniqueElementsEncoding is a generic encoding that encodes each xᵢ ∈ unique(x) to one of the positive integers. The xᵢ are encoded according to the order of their first appearance in the input data.

The constructor requires the input data x, since the number of possible symbols is length(unique(x)).

Example

using ComplexityMeasures
x = ['a', 2, 5, 2, 5, 'a']
e = UniqueElementsEncoding(x)
encode.(Ref(e), x) == [1, 2, 3, 2, 3, 1] # true
source
ComplexityMeasures.CombinationEncodingType
CombinationEncoding <: Encoding
CombinationEncoding(encodings)

A CombinationEncoding takes multiple Encodings and creates a combined encoding that can be used to encode inputs that are compatible with the given encodings.

Encoding/decoding

When used with encode, the i-th Encoding in encodings returns an integer in the set 1, 2, …, nᵢ, where nᵢ is the total number of outcomes for that encoding. For k different encodings, we can thus construct the cartesian coordinate (c₁, c₂, …, cₖ) (cᵢ ∈ 1, 2, …, nᵢ), which can be uniquely identified by an integer. We can thus identify each unique combined encoding with a single integer.

When used with decode, the integer symbol is converted to its corresponding cartesian coordinate, which is used to retrieve the decoded symbols for each of the encodings, and a tuple of the decoded symbols are returned.

The total number of outcomes is prod(total_outcomes(e) for e in encodings).

Examples

using ComplexityMeasures

# We want to encode the vector `x`.
x = [0.9, 0.2, 0.3]

# To do so, we will use a combination of first-difference encoding, amplitude encoding,
# and ordinal pattern encoding.

encodings = (
    RelativeFirstDifferenceEncoding(0, 1; n = 2),
    RelativeMeanEncoding(0, 1; n = 5),
    OrdinalPatternEncoding(3) # x is a three-element vector
    )
c = CombinationEncoding(encodings)

# Encode `x` as integer
ω = encode(c, x)

# Decode symbol (into a vector of decodings, one for each encodings `e ∈ encodings`).
# In this particular case, the first two elements will be left-bin edges, and
# the last element will be the decoded ordinal pattern (indices that would sort `x`).
d = decode(c, ω)
source

Examples: encoding rows (one point at a time)

Here, we'll use OrdinalPatternEncoding with differing parameter m to encode multiple StateSpaceSets of differing dimensions.

using Associations
using StateSpaceSets
using Random; rng = Xoshiro(1234)

# The first variable is 2-dimensional and has 50 points
x = StateSpaceSet(rand(rng, 50, 2))
# The second variable is 3-dimensional and has 50 points
y = StateSpaceSet(rand(rng, 50, 3))
# The third variable is 4-dimensional and has 50 points
z = StateSpaceSet(rand(rng, 50, 4))

# One encoding scheme per input variable
# encode `x` using `ox` on a point-by-point basis (Vector{SVector{2}} → Vector{Int})
# encode `y` using `oy` on a point-by-point basis (Vector{SVector{3}} → Vector{Int})
# encode `z` using `oz` on a point-by-point basis (Vector{SVector{4}} → Vector{Int})
ox = OrdinalPatternEncoding(2)
oy = OrdinalPatternEncoding(3)
oz = OrdinalPatternEncoding(4)

# This gives three column vectors of integers.
cx, cy, cz = codify(CodifyPoints(ox, oy, oz), x, y, z)

[cx cy cz]
50×3 Matrix{Int64}:
 2  1   9
 1  4   5
 2  6   2
 1  3  22
 1  1  23
 1  1  17
 2  2   2
 2  6   7
 2  4  20
 1  3  22
 ⋮     
 2  3  12
 1  6  19
 2  1  24
 1  4   5
 2  2   2
 2  6  11
 1  6  18
 1  3  21
 2  1   2

Notice that the 2-dimensional x has been encoded into the integer values 1 and 2, because there are 2! = 2 possible ordinal patterns for dimension m = 2. The 3-dimensional y has been encoded into integers in the range 1 to 3! = 6, while the 4-dimensional z is encoded into an even larger range of integers, because there are 4! = 24 possible ordinal patterns for 4-dimensional embedding vectors.
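We can check these ranges numerically (a quick sanity check on the variables above):

all(∈(1:factorial(2)), cx)  # true
all(∈(1:factorial(3)), cy)  # true
all(∈(1:factorial(4)), cz)  # true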

Encoding per variable/column

Sometimes, it may be desirable to encode input data one variable/column at a time. This typically happens when the input is one or more timeseries.

To encode columns, we move a sliding window across each input variable/column and encode points within that window. Formally, such a sliding-window discretization is done by using the CodifyVariables discretization scheme, which takes as input some OutcomeSpace that dictates how each window is encoded, and also dictates the width of the encoding windows.

For column/variable-wise encoding, you can pick between the following outcome spaces.

ComplexityMeasures.OutcomeSpaceType
OutcomeSpace

The supertype for all outcome space implementations.

Description

In ComplexityMeasures.jl, an outcome space defines a set of possible outcomes $\Omega = \{\omega_1, \omega_2, \ldots, \omega_L \}$ (some form of discretization). In the literature, the outcome space is often also called an "alphabet", while each outcome is called a "symbol" or an "event".

An outcome space also defines a set of rules for mapping input data to each outcome $\omega_i$, a process called encoding or symbolizing or discretizing in the literature (see encodings). Some OutcomeSpaces first apply a transformation, e.g. a delay embedding, to the data before discretizing/encoding, while other OutcomeSpaces discretize/encode the data directly.

Implementations

| Outcome space                 | Principle                         | Input data            | Counting-compatible |
|-------------------------------|-----------------------------------|-----------------------|---------------------|
| UniqueElements                | Count of unique elements          | Any                   | yes                 |
| ValueBinning                  | Binning (histogram)               | Vector, StateSpaceSet | yes                 |
| OrdinalPatterns               | Ordinal patterns                  | Vector, StateSpaceSet | yes                 |
| SpatialOrdinalPatterns        | Ordinal patterns in space         | Array                 | yes                 |
| Dispersion                    | Dispersion patterns               | Vector                | yes                 |
| SpatialDispersion             | Dispersion patterns in space      | Array                 | yes                 |
| CosineSimilarityBinning       | Cosine similarity                 | Vector                | yes                 |
| BubbleSortSwaps               | Swap counts when sorting          | Vector                | yes                 |
| SequentialPairDistances       | Sequential state vector distances | Vector, StateSpaceSet | yes                 |
| TransferOperator              | Binning (transfer operator)       | Vector, StateSpaceSet | no                  |
| NaiveKernel                   | Kernel density estimation         | StateSpaceSet         | no                  |
| WeightedOrdinalPatterns       | Ordinal patterns                  | Vector, StateSpaceSet | no                  |
| AmplitudeAwareOrdinalPatterns | Ordinal patterns                  | Vector, StateSpaceSet | no                  |
| WaveletOverlap                | Wavelet transform                 | Vector                | no                  |
| PowerSpectrum                 | Fourier transform                 | Vector                | no                  |

In the column "input data" it is assumed that the eltype of the input is <: Real.

Usage

Outcome spaces are used as input to functions such as codify, probabilities, and information.

Counting-compatible vs. non-counting compatible outcome spaces

There are two main types of outcome spaces.

  • Counting-compatible outcome spaces have a well-defined way of counting how often each point in the (encoded) input data is mapped to a particular outcome $\omega_i$. These outcome spaces use encode to discretize the input data. Examples are OrdinalPatterns (which encodes input data into ordinal patterns) and ValueBinning (which discretizes points onto a regular grid). The table above lists which outcome spaces are counting compatible.
  • Non-counting compatible outcome spaces have no well-defined way of explicitly counting how often each point in the input data is mapped to a particular outcome $\omega_i$. Instead, these outcome spaces return a vector of pre-normalized "relative counts", one for each outcome $\omega_i$. Examples are WaveletOverlap and PowerSpectrum.

Counting-compatible outcome spaces can be used with any ProbabilitiesEstimator to convert counts into probability mass functions. Non-counting-compatible outcome spaces can only be used with the maximum likelihood (RelativeAmount) probabilities estimator, which estimates probabilities precisely by the relative frequency of each outcome (formally speaking, the RelativeAmount estimator also requires counts, but for the sake of code consistency, we allow it to be used with relative frequencies as well).

The function is_counting_based can be used to check whether an outcome space is based on counting.
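For example (a short sketch; both outcome spaces are documented in this section):

using ComplexityMeasures
is_counting_based(OrdinalPatterns())  # true: each outcome can be counted explicitly
is_counting_based(PowerSpectrum())    # false: only pre-normalized relative weights are available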

Deducing the outcome space (from data)

Some outcome space models can deduce $\Omega$ without knowledge of the input, such as OrdinalPatterns. Other outcome spaces require knowledge of the input data for concretely specifying $\Omega$, such as ValueBinning with RectangularBinning. If o is some outcome space model and x some input data, then outcome_space(o, x) returns the possible outcomes $\Omega$. To get the cardinality of $\Omega$, use total_outcomes.
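A brief sketch of this difference (assuming default parameters are acceptable):

using ComplexityMeasures
total_outcomes(OrdinalPatterns(m = 3))  # 6 = factorial(3); no input data needed
x = rand(100)
o = ValueBinning(RectangularBinning(4))
outcome_space(o, x)                     # bin edges depend on the extrema of x
total_outcomes(o, x)                    # cardinality of Ω; here 4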

Implementation details

The element type of $\Omega$ varies between outcome space models, but it is guaranteed to be hashable and sortable. This allows for conveniently tracking the counts of a specific event across experimental realizations, by using the outcome as a dictionary key and the counts as the value for that key (or, alternatively, the key remains the outcome and one has a vector of probabilities, one for each experimental realization).

source
ComplexityMeasures.UniqueElementsType
UniqueElements()

An OutcomeSpace based on straightforward counting of distinct elements in a univariate time series or multivariate dataset. This is the same as giving no estimator to probabilities.

Outcome space

The outcome space is the unique sorted values of the input. Hence, input x is needed for a well-defined outcome_space.

Implements

  • codify. Used for encoding inputs where ordering matters (e.g. time series).
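A minimal sketch (assuming codify here uses UniqueElementsEncoding under the hood, so integers follow the order of first appearance):

using ComplexityMeasures
x = ['a', 'b', 'a', 'c', 'b']
codify(UniqueElements(), x)  # [1, 2, 1, 3, 2]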
source
ComplexityMeasures.CosineSimilarityBinningType
CosineSimilarityBinning(; m::Int, τ::Int, nbins::Int)

An OutcomeSpace based on the cosine similarity (Wang et al., 2020).

It can be used with information to compute the "diversity entropy" of an input timeseries (Wang et al., 2020).

The implementation here allows for τ != 1, which was not considered in the original paper.

Description

CosineSimilarityBinning probabilities are computed as follows.

  1. From the input time series x, using embedding lag τ and embedding dimension m, construct the embedding $Y = \{ \bf{x}_i \} = \{ (x_{i}, x_{i+\tau}, x_{i+2\tau}, \ldots, x_{i+(m-1)\tau}) \}_{i = 1}^{N-(m-1)\tau}$.
  2. Compute $D = \{ d(\bf{x}_t, \bf{x}_{t+1}) \}_{t=1}^{N-(m-1)\tau - 1}$, where $d(\cdot, \cdot)$ is the cosine similarity between two m-dimensional vectors in the embedding.
  3. Divide the interval [-1, 1] into nbins equally sized subintervals (including the value +1).
  4. Construct a histogram of cosine similarities $d \in D$ over those subintervals.
  5. Sum-normalize the histogram to obtain probabilities.

Outcome space

The outcome space for CosineSimilarityBinning is the bins of the [-1, 1] interval, and the return configuration is the same as in ValueBinning (left bin edge).

Implements

  • codify. Used for encoding inputs where ordering matters (e.g. time series).
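A hedged usage sketch (the parameter values are arbitrary):

using ComplexityMeasures
x = rand(1000)
o = CosineSimilarityBinning(m = 3, τ = 1, nbins = 10)
cx = codify(o, x)  # integers in 1:nbins, i.e. bin indices of successive cosine similarities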
source
ComplexityMeasures.DispersionType
Dispersion(; c = 5, m = 2, τ = 1, check_unique = true)

An OutcomeSpace based on dispersion patterns, originally used by Rostaghi and Azami (2016) to compute the "dispersion entropy", which characterizes the complexity and irregularity of a time series.

Recommended parameter values (Li et al., 2019) are m ∈ [2, 3], τ = 1 for the embedding, and c ∈ [3, 4, …, 8] categories for the Gaussian symbol mapping.

Description

Assume we have a univariate time series $X = \{x_i\}_{i=1}^N$. First, this time series is encoded into a symbol timeseries $S$ using the Gaussian encoding GaussianCDFEncoding with empirical mean μ and empirical standard deviation σ (both determined from $X$), and c as given to Dispersion.

Then, $S$ is embedded into an $m$-dimensional time series, using an embedding lag of $\tau$, which yields a total of $N - (m - 1)\tau$ delay vectors $z_i$, or "dispersion patterns". Since each element of $z_i$ can take on c different values, and each delay vector has m entries, there are c^m possible dispersion patterns. This number is used for normalization when computing dispersion entropy.

The returned probabilities are simply the frequencies of the unique dispersion patterns present in $S$ (i.e., the UniqueElements of $S$).

Outcome space

The outcome space for Dispersion is the unique delay vectors whose elements are the symbols (integers) encoded by the Gaussian CDF, i.e., the unique elements of $S$.

Data requirements and parameters

The input must have more than one unique element for the Gaussian mapping to be well-defined. Li et al. (2019) recommends that x has at least 1000 data points.

If check_unique == true (default), then it is checked that the input has more than one unique value. If check_unique == false and the input only has one unique element, then an InexactError is thrown when trying to compute probabilities.

Why 'dispersion patterns'?

Each embedding vector is called a "dispersion pattern". Why? Let's consider the case when $m = 5$ and $c = 3$, and use some very imprecise terminology for illustration:

When $c = 3$, values clustering far below mean are in one group, values clustered around the mean are in one group, and values clustering far above the mean are in a third group. Then the embedding vector $[2, 2, 2, 2, 2]$ consists of values that are close together (close to the mean), so it represents a set of numbers that are not very spread out (less dispersed). The embedding vector $[1, 1, 2, 3, 3]$, however, represents numbers that are much more spread out (more dispersed), because the categories representing "outliers" both above and below the mean are represented, not only values close to the mean.

For a version of this estimator that can be used on high-dimensional arrays, see SpatialDispersion.

Implements

  • codify. Used for encoding inputs where ordering matters (e.g. time series).
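Analogously, a hedged sketch of codify usage (parameter values arbitrary):

using ComplexityMeasures
x = rand(1000)
o = Dispersion(c = 4, m = 2, τ = 1)
cx = codify(o, x)  # one integer per dispersion pattern; at most c^m = 16 distinct values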
source
ComplexityMeasures.OrdinalPatternsType
OrdinalPatterns <: OutcomeSpace
OrdinalPatterns{m}(τ = 1, lt::Function = ComplexityMeasures.isless_rand)

An OutcomeSpace based on length-m ordinal permutation patterns, originally introduced in Bandt and Pompe (2002)'s paper on permutation entropy. Note that m is given as a type parameter, so that when it is a literal integer there are performance accelerations.

When passed to probabilities the output depends on the input data type:

  • Univariate data. If applied to a univariate timeseries (AbstractVector), then the timeseries is first embedded using embedding delay τ and dimension m, resulting in embedding vectors $\{ \bf{x}_i \}_{i=1}^{N-(m-1)\tau}$. Then, for each $\bf{x}_i$, we find its permutation pattern $\pi_{i}$. Probabilities are then estimated as the frequencies of the encoded permutation symbols by using UniqueElements. When giving the resulting probabilities to information, the original permutation entropy is computed (Bandt and Pompe, 2002).
  • Multivariate data. If applied to a D-dimensional StateSpaceSet, then no embedding is constructed; m must equal D, and τ is ignored. Each vector $\bf{x}_i$ of the dataset is mapped directly to its permutation pattern $\pi_{i}$ by comparing the relative magnitudes of the elements of $\bf{x}_i$. Like above, probabilities are estimated as the frequencies of the permutation symbols. The resulting probabilities can be used to compute multivariate permutation entropy (He et al., 2016), although here we don't perform any further subdivision of the permutation patterns (as in Figure 3 of He et al. (2016)).

Internally, OrdinalPatterns uses the OrdinalPatternEncoding to represent ordinal patterns as integers for efficient computations.

See WeightedOrdinalPatterns and AmplitudeAwareOrdinalPatterns for estimators that not only consider ordinal (sorting) patterns, but also incorporate information about within-state-vector amplitudes. For a version of this estimator that can be used on spatial data, see SpatialOrdinalPatterns.

Handling equal values in ordinal patterns

In Bandt and Pompe (2002), equal values are ordered after their order of appearance, but this can lead to erroneous temporal correlations, especially for data with low amplitude resolution (Zunino et al., 2017). Here, by default, if two values are equal, then one of them is randomly assigned as "the largest", using lt = ComplexityMeasures.isless_rand. To get the behaviour from Bandt and Pompe (2002), use lt = Base.isless.

Outcome space

The outcome space Ω for OrdinalPatterns is the set of length-m ordinal patterns (i.e. permutations) that can be formed by the integers 1, 2, …, m. There are factorial(m) such patterns.

For example, the outcome [2, 3, 1] corresponds to the ordinal pattern where the smallest value is in the second position, the next smallest value is in the third position, and the largest value is in the first position. See also OrdinalPatternEncoding.
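We can verify this correspondence directly (sortperm is from Julia Base):

χ = [8.0, 1.0, 5.0]  # smallest value in 2nd position, next smallest in 3rd, largest in 1st
sortperm(χ)           # [2, 3, 1], matching the outcome described above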

In-place symbolization

OrdinalPatterns also implements the in-place probabilities! for StateSpaceSet input (or embedded vector input) for reducing allocations in looping scenarios. The length of the pre-allocated symbol vector must be the length of the dataset. For example

using ComplexityMeasures
m, N, τ = 2, 100, 1
est = OrdinalPatterns{m}(τ)
x = StateSpaceSet(rand(N, m)) # some input dataset
πs_ts = zeros(Int, N) # length must match length of `x`
p = probabilities!(πs_ts, est, x)
source
ComplexityMeasures.BubbleSortSwapsType
BubbleSortSwaps <: CountBasedOutcomeSpace
BubbleSortSwaps(; m = 3, τ = 1)

The BubbleSortSwaps outcome space is based on Manis et al. (2017)'s paper on "bubble entropy".

Description

BubbleSortSwaps does the following:

  • Embeds the input data using embedding dimension m and embedding lag τ
  • For each state vector in the embedding, counts how many swaps are necessary for the bubble sort algorithm to sort it.

For counts_and_outcomes, we then define a distribution over the number of necessary swaps. This distribution can then be used to estimate probabilities using probabilities_and_outcomes, which again can be used to estimate any InformationMeasure. An example of how to compute the "Shannon bubble entropy" is given below.

Outcome space

The outcome_space for BubbleSortSwaps is the integers 0:N, where N = (m * (m - 1)) / 2 (the worst-case number of swaps). Hence, the number of total_outcomes is N + 1.

Implements

  • codify. Returns the number of swaps required for each embedded state vector.

Examples

With the BubbleSortSwaps outcome space, we can easily compute a "bubble entropy" inspired by (Manis et al., 2017). Note: this is not actually a new entropy - it is just a new way of discretizing the input data. To reproduce the bubble entropy complexity measure from (Manis et al., 2017), see BubbleEntropy.


using ComplexityMeasures
x = rand(100000)
o = BubbleSortSwaps(; m = 5) # 5-dimensional embedding vectors
information(Shannon(; base = 2), o, x)

# We can also compute any other "bubble quantity", for example the 
# "Tsallis bubble extropy", with arbitrary probabilities estimators:
information(TsallisExtropy(), BayesianRegularization(), o, x)
source
ComplexityMeasures.ValueBinningType
ValueBinning(b::AbstractBinning) <: OutcomeSpace

An OutcomeSpace based on binning the values of the data as dictated by the binning scheme b and formally computing their histogram, i.e., the frequencies of points in the bins. An alias to this is VisitationFrequency. Available binnings are subtypes of AbstractBinning.

The ValueBinning estimator has a linearithmic time complexity (n log(n) for n = length(x)) and a linear space complexity (l for l = dimension(x)). This allows computation of probabilities (histograms) of high-dimensional datasets and with small box sizes ε without memory overflow and with maximum performance. For performance reasons, the probabilities returned never contain 0s and are arbitrarily ordered.

ValueBinning(ϵ::Union{Real,Vector})

A convenience method that accepts same input as RectangularBinning and initializes this binning directly.

Outcomes

The outcome space for ValueBinning is the unique bins constructed from b. Each bin is identified by its left (lowest-value) corner, because bins are always left-closed-right-open intervals [a, b). The bins are in data units, not integer (cartesian indices units), and are returned as SVectors, i.e., same type as input data.

For convenience, outcome_space returns the outcomes in the same array format as the underlying binning (e.g., Matrix for 2D input).

For FixedRectangularBinning the outcome_space is well-defined from the binning, but for RectangularBinning input x is needed as well.

Implements

  • codify. Used for encoding inputs where ordering matters (e.g. time series).
source
ComplexityMeasures.RectangularBinningType
RectangularBinning(ϵ, precise = false) <: AbstractBinning

Rectangular box partition of state space using the scheme ϵ, deducing the histogram extent and bin width from the input data.

RectangularBinning is a convenience struct. It is re-cast into FixedRectangularBinning once the data are provided, so see that docstring for info on the bin calculation and the meaning of precise.

Binning instructions are deduced from the type of ϵ as follows:

  1. ϵ::Int divides each coordinate axis into ϵ equal-length intervals that cover all data.
  2. ϵ::Float64 divides each coordinate axis into intervals of fixed size ϵ, starting from the axis minima until the data is completely covered by boxes.
  3. ϵ::Vector{Int} divides the i-th coordinate axis into ϵ[i] equal-length intervals that cover all data.
  4. ϵ::Vector{Float64} divides the i-th coordinate axis into intervals of fixed size ϵ[i], starting from the axis minima until the data is completely covered by boxes.

RectangularBinning ensures all input data are covered by extending the created ranges if need be.
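For instance, cases 1 and 2 above could look like this (a sketch; RectangularBinEncoding is used here as in the CodifyPoints example earlier):

using ComplexityMeasures
x = StateSpaceSet(rand(200, 2))
enc_count = RectangularBinEncoding(RectangularBinning(5), x)    # case 1: 5 intervals per axis
enc_width = RectangularBinEncoding(RectangularBinning(0.25), x) # case 2: bin width 0.25 per axis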

source
ComplexityMeasures.FixedRectangularBinningType
FixedRectangularBinning <: AbstractBinning
FixedRectangularBinning(ranges::Tuple{<:AbstractRange...}, precise = false)

Rectangular box partition of state space where the partition along each dimension is explicitly given by each range ranges, which is a tuple of AbstractRange subtypes. Typically, each range is the output of the range Base function, e.g., ranges = (0:0.1:1, range(0, 1; length = 101), range(2.1, 3.2; step = 0.33)). All ranges must be sorted.

The optional second argument precise dictates whether Julia Base's TwicePrecision is used for when searching where a point falls into the range. Useful for edge cases of points being almost exactly on the bin edges, but it is exactly four times as slow, so by default it is false.

Points falling outside the partition do not contribute to probabilities. Bins are always left-closed-right-open: [a, b). This means that the last value of each of the ranges dictates the last right-closing value. This value does not belong to the histogram! E.g., if given a range r = range(0, 1; length = 11), with r[end] = 1, the value 1 is outside the partition and would not attribute any increase of the probability corresponding to the last bin (here [0.9, 1))!

Equivalently, the size of the histogram is histsize = map(r -> length(r)-1, ranges)!
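A small sketch (ranges chosen arbitrarily):

using ComplexityMeasures
ranges = (range(0, 1; length = 11), range(0, 1; length = 11))
binning = FixedRectangularBinning(ranges)
# histsize is (10, 10): the last value of each range only closes the final bin [0.9, 1)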

FixedRectangularBinning leads to a well-defined outcome space without knowledge of input data, see ValueBinning.

source

Example: encoding columns (one variable at a time)

Some OutcomeSpaces, when used with CodifyVariables, dictate a sliding window of width one element. ValueBinning is such an outcome space.

using Associations
using Random; rng = Xoshiro(1234)

x = rand(rng, 100)
o = ValueBinning(3)
cx = codify(CodifyVariables(o), x)
100-element Vector{Int64}:
 2
 2
 3
 1
 2
 2
 3
 3
 3
 3
 ⋮
 1
 3
 1
 3
 2
 1
 3
 2
 3

We can verify that ValueBinning preserves the cardinality of the input dataset.

length(x) == length(cx)
true

Other outcome spaces such as Dispersion or OrdinalPatterns do not preserve the cardinality of the input dataset when used with CodifyVariables. This is because they compress each sliding window, which may contain multiple points, into a single integer. Consequently, some points at the end of each input variable are lost. For example, with OrdinalPatterns, the number of encoded points decreases as the embedding parameter m increases.

using Associations
using Random; rng = Xoshiro(1234)

x = rand(rng, 100)
o = OrdinalPatterns(m = 3)
cx = codify(CodifyVariables(o), x)
98-element Vector{Int64}:
 3
 5
 4
 1
 1
 1
 5
 6
 6
 6
 ⋮
 5
 3
 5
 4
 2
 6
 3
 2
 4

We can simultaneously encode multiple variables/columns of a StateSpaceSet using the same outcome space, as long as the operation will result in the same number of encoded data points for each column.

using Associations
using Random; rng = Xoshiro(1234)

x = rand(rng, 100)
y = rand(rng, 100)
o = OrdinalPatterns(m = 3)
# Alternatively provide a tuple of input time series: codify(CodifyVariables(o), (x, y))
cx, cy = codify(CodifyVariables(o), StateSpaceSet(x, y))

[cx cy]
98×2 Matrix{Int64}:
 3  1
 5  5
 4  6
 1  4
 1  2
 1  3
 5  1
 6  5
 6  4
 6  1
 ⋮  
 5  5
 3  4
 5  1
 4  5
 2  3
 6  2
 3  4
 2  5
 4  4