Discretization API
Encoding multiple input datasets
A fundamental operation when computing multivariate information measures from data is discretization. When discretizing, we "encode" the input data into an intermediate representation indexed by the positive integers. This intermediate representation is called an "encoding", and it is useful in several ways:
- Once a dataset has been encoded into integers, we can estimate Counts or Probabilities (tutorial).
- Once probabilities have been estimated, one can use these to estimate a MultivariateInformationMeasure (tutorial).
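For instance, here is a minimal sketch of that pipeline. Only codify is documented on this page; the multivariate counts and probabilities signatures are assumed here from the tutorials.
using Associations
x, y = rand(200), rand(200)
# Column-wise discretization using ordinal patterns of length 2
d = CodifyVariables(OrdinalPatterns(m = 2))
cx, cy = codify(d, x, y) # integer-encoded variables
# Joint counts/probabilities over the encoded outcomes (signatures assumed from the tutorials)
c = counts(d, x, y)
p = probabilities(d, x, y)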
The following functions and types are used by Associations.jl to perform discretization of input data.
Associations.Discretization
— TypeDiscretization
The supertype of all discretization schemes.
Concrete implementations
Associations.CodifyVariables
— TypeCodifyVariables <: Discretization
CodifyVariables(outcome_space::OutcomeSpace)
The CodifyVariables
discretization scheme quantises input data in a column-wise manner using the given outcome_space
.
Compatible outcome spaces
- UniqueElements (for when data are pre-discretized)
- BubbleSortSwaps
- CosineSimilarityBinning
- OrdinalPatterns
- Dispersion
Description
The main difference between CodifyVariables and CodifyPoints is that the former uses OutcomeSpaces for discretization. This usually means that some transformation is applied to the data before discretizing. For example, some outcome spaces construct a delay embedding from the input (and thus encode sequential information) before encoding the data.
Specifically, given x::AbstractStateSpaceSet..., where the i-th dataset x[i] is assumed to represent a single series of measurements, CodifyVariables encodes x[i] by codify-ing it into a series of integers using an appropriate OutcomeSpace. This is typically done by first sequentially transforming the data, then running a sliding window (whose width is controlled by outcome_space) across the data, and encoding the values within each window to an integer.
Examples
using Associations
x, y = rand(100), rand(100)
d = CodifyVariables(OrdinalPatterns(m=2))
cx, cy = codify(d, x, y)
Associations.CodifyPoints
— TypeCodifyPoints{N}
CodifyPoints(encodings::NTuple{N, Encoding})
CodifyPoints is a Discretization scheme that encodes input data points without applying any sequential transformation to the input (as opposed to CodifyVariables, which may apply some transformation before encoding).
Usage
- Use with codify to encode/discretize input variables on a point-by-point basis.
Compatible encodings
GaussianCDFEncoding
OrdinalPatternEncoding
RelativeMeanEncoding
RelativeFirstDifferenceEncoding
UniqueElementsEncoding
RectangularBinEncoding
CombinationEncoding
Description
Given x::AbstractStateSpaceSet..., where the i-th dataset is assumed to represent a single series of measurements, CodifyPoints encodes each point pₖ ∈ x[i] using some Encoding(s), without first applying any (sequential) transformation to x[i]. This behaviour differs from CodifyVariables, which applies a transformation to x[i] before encoding.
If length(x) == N (i.e. there are N input datasets), then encodings must be a tuple of N Encodings. Alternatively, if encodings is a single Encoding, then that same encoding is applied to every x[i].
Examples
using Associations
# The same encoding on two input datasets
x = StateSpaceSet(rand(100, 3))
y = StateSpaceSet(rand(100, 3))
encoding_ord = OrdinalPatternEncoding(3)
cx, cy = codify(CodifyPoints(encoding_ord), x, y)
# Different encodings on multiple datasets
z = StateSpaceSet(rand(100, 2))
encoding_bin = RectangularBinEncoding(RectangularBinning(3), z)
d = CodifyPoints(encoding_ord, encoding_ord, encoding_bin)
cx, cy, cz = codify(d, x, y, z)
ComplexityMeasures.codify
— Functioncodify(o::OutcomeSpace, x::Vector) → s::Vector{Int}
Codify x
according to the outcome space o
.
Description
This function exists because we don't always want to encode the entire input x at once. Sometimes, it is desirable to first apply some transformation to x, then apply Encodings in a point-wise manner in the transformed space (the OutcomeSpace dictates this transformation). This is useful for encoding timeseries data.
The length of the returned s depends on the OutcomeSpace. Some outcome spaces preserve the input data length (e.g. UniqueElements), while others (e.g. OrdinalPatterns) perform a delay embedding before encoding, so that length(s) < length(x).
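A minimal sketch of this length difference, using two outcome spaces documented further below:
using ComplexityMeasures
x = rand(100)
s1 = codify(UniqueElements(), x)       # no transformation: length(s1) == length(x)
s2 = codify(OrdinalPatterns(m = 3), x) # delay embedding first: length(s2) == length(x) - 2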
codify(encoding::CodifyPoints{N}, x::Vararg{<:AbstractStateSpaceSet, N})
Codify each timeseries xᵢ ∈ x
according to the given encoding
.
Examples
using Associations
x = StateSpaceSet(rand(10000, 2))
y = StateSpaceSet(rand(10000, 3))
z = StateSpaceSet(rand(10000, 2))
# For `x`, we use a relative mean encoding.
ex = RelativeMeanEncoding(0.0, 1.0, n = 3)
# For `y`, we use a combination encoding.
ey = CombinationEncoding(
RelativeMeanEncoding(0.0, 1.0, n = 2),
OrdinalPatternEncoding(3)
)
# For `z`, we use ordinal patterns to encode.
ez = OrdinalPatternEncoding(2)
# Codifying two input datasets gives a 2-tuple of Vector{Int}
codify(CodifyPoints(ex, ey), x, y)
# Codifying three input datasets gives a 3-tuple of Vector{Int}
codify(CodifyPoints(ex, ey, ez), x, y, z)
codify(d::CodifyVariables, x::Vararg{<:AbstractStateSpaceSet, N})
codify(d::CodifyPoints, x::Vararg{<:AbstractStateSpaceSet, N})
Codify each timeseries xᵢ ∈ x
according to the given encoding/discretization d
.
Compatible discretizations
- CodifyVariables
- CodifyPoints
Examples
using Associations
# Sliding window encoding
x = [0.1, 0.2, 0.3, 0.2, 0.1, 0.0, 0.5, 0.3, 0.5]
xc1 = codify(CodifyVariables(OrdinalPatterns(m=2)), x) # should give [1, 1, 2, 2, 2, 1, 2, 1]
xc2 = codify(OrdinalPatterns(m=2), x) # equivalent
length(xc1) < length(x) # should be true, because `OrdinalPatterns` delay embeds.
# Point-by-point encoding
x, y = StateSpaceSet(rand(100, 3)), StateSpaceSet(rand(100, 3))
cx, cy = codify(CodifyPoints(OrdinalPatternEncoding(3)), x, y)
In summary, the two main ways of discretizing data in Associations are as follows.
- The CodifyPoints discretization scheme encodes input data on a point-by-point basis by applying some Encoding to each point.
- The CodifyVariables discretization scheme encodes input data on a column-by-column basis by applying a sliding window to each column, and encoding the data within the sliding window according to some OutcomeSpace (internally, this uses codify).
Encoding
, OutcomeSpace
and codify
are all from ComplexityMeasures.jl. In this package, they are used to discretize multiple input variables instead of just one input variable.
Encoding per point/row
In some cases, it may be desirable to encode data on a row-wise basis. This typically happens when working with pre-embedded time series or StateSpaceSets (respecting the fact that time ordering is already taken care of by the embedding procedure). If we want to apply something like OrdinalPatternEncoding to such data, then we must encode each point individually, such that vectors like [1.2, 2.4, 4.5] or ["howdy", "partner"] get mapped to an integer. The CodifyPoints discretization instruction ensures input data are encoded on a point-by-point basis.
A point-by-point discretization using CodifyPoints
is formally done by applying some Encoding
to each input data point. You can pick between the following encodings, or combine them in arbitrary ways using CombinationEncoding
.
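For instance, a single point is mapped to an integer with encode; CodifyPoints simply does this for every point in the input. A minimal sketch using OrdinalPatternEncoding:
using ComplexityMeasures
# Map one 3-element point to an integer (one of the 3! = 6 possible ordinal patterns)
encode(OrdinalPatternEncoding(3), [1.2, 2.4, 4.5])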
ComplexityMeasures.Encoding
— TypeEncoding
The supertype for all encoding schemes. Encodings always encode elements of input data into the positive integers. The encoding API is defined by the functions encode
and decode
. Some probability estimators utilize encodings internally.
Current available encodings are:
- OrdinalPatternEncoding
- GaussianCDFEncoding
- RectangularBinEncoding
- RelativeMeanEncoding
- RelativeFirstDifferenceEncoding
- UniqueElementsEncoding
- BubbleSortSwapsEncoding
- PairDistanceEncoding
- CombinationEncoding, which can combine any of the above encodings.
ComplexityMeasures.GaussianCDFEncoding
— TypeGaussianCDFEncoding <: Encoding
GaussianCDFEncoding{m}(; μ, σ, c::Int = 3)
An encoding scheme that encode
s a scalar or vector χ
into one of the integers sᵢ ∈ [1, 2, …, c]
based on the normal cumulative distribution function (NCDF), and decode
s the sᵢ
into subintervals of [0, 1]
(with some loss of information).
Initializing a GaussianCDFEncoding
The size of the input to be encoded must be known beforehand. One must therefore set m = length(χ)
, where χ
is the input (m = 1
for scalars, m ≥ 2
for vectors). To do so, one must explicitly give m
as a type parameter: e.g. encoding = GaussianCDFEncoding{3}(; μ = 0.0, σ = 0.1)
to encode 3-element vectors, or encoding = GaussianCDFEncoding{1}(; μ = 0.0, σ = 0.1)
to encode scalars.
Description
Encoding/decoding scalars
GaussianCDFEncoding
first maps an input scalar $χ$ to a new real number $y \in [0, 1]$ by using the normal cumulative distribution function (CDF) with the given mean μ
and standard deviation σ
, according to the map
\[x \to y : y = \dfrac{1}{ \sigma \sqrt{2 \pi}} \int_{-\infty}^{x} e^{(-(t - \mu)^2)/(2 \sigma^2)} dt.\]
Next, the interval [0, 1]
is equidistantly binned and enumerated $1, 2, \ldots, c$, and $y$ is linearly mapped to one of these integers using the linear map $y \to z : z = \text{floor}(y(c-1)) + 1$.
Because of the floor operation, some information is lost, so when used with decode
, each decoded sᵢ
is mapped to a subinterval of [0, 1]
. This subinterval is returned as a length-1
Vector{SVector}
.
Notice that the decoding step does not yield an element of any outcome space of the estimators that use GaussianCDFEncoding
internally, such as Dispersion
. That is because these estimators additionally delay embed the encoded data.
Encoding/decoding vectors
If GaussianCDFEncoding
is used with a vector χ
, then each element of χ
is encoded separately, resulting in a length(χ)
sequence of integers which may be treated as a CartesianIndex
. The encoded symbol s ∈ [1, 2, …, c]
is then just the linear index corresponding to this cartesian index (similar to how CombinationEncoding
works).
When decode
d, the integer symbol s
is converted back into its CartesianIndex
representation, which is just a sequence of integers that refer to subdivisions of the [0, 1]
interval. The relevant subintervals are then returned as a length(χ)-element Vector{SVector}.
Examples
julia> using ComplexityMeasures, Statistics
julia> x = [0.1, 0.4, 0.7, -2.1, 8.0];
julia> μ, σ = mean(x), std(x); encoding = GaussianCDFEncoding(; μ, σ, c = 5)
julia> es = encode.(Ref(encoding), x)
5-element Vector{Int64}:
2
2
3
1
5
julia> decode(encoding, 3)
2-element SVector{2, Float64} with indices SOneTo(2):
0.4
0.6
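The example above encodes scalars (the default m = 1). A minimal sketch of the vector case (m = 3), where each element is passed through the Gaussian CDF separately and the resulting symbols are combined into a single integer:
using ComplexityMeasures
encoding = GaussianCDFEncoding{3}(; μ = 0.5, σ = 0.25, c = 3)
ω = encode(encoding, [0.1, 0.5, 0.9]) # an integer in 1:c^3
decode(encoding, ω)                   # three subintervals of [0, 1], one per element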
ComplexityMeasures.OrdinalPatternEncoding
— TypeOrdinalPatternEncoding <: Encoding
OrdinalPatternEncoding{m}(lt = ComplexityMeasures.isless_rand)
An encoding scheme that encode
s length-m
vectors into their permutation/ordinal patterns and then into the integers based on the Lehmer code. It is used by OrdinalPatterns
and similar estimators, see that for a description of the outcome space.
The ordinal/permutation pattern of a vector χ
is simply sortperm(χ)
, which gives the indices that would sort χ
in ascending order.
Description
The Lehmer code, as implemented here, is a bijection between the set of factorial(m)
possible permutations for a length-m
sequence, and the integers 1, 2, …, factorial(m)
. The encoding step uses algorithm 1 in Berger et al. (2019), which is highly optimized. The decoding step is much slower due to missing optimizations (pull requests welcomed!).
Example
julia> using ComplexityMeasures
julia> χ = [4.0, 1.0, 9.0];
julia> c = OrdinalPatternEncoding(3);
julia> i = encode(c, χ)
3
julia> decode(c, i)
3-element SVector{3, Int64} with indices SOneTo(3):
2
1
3
If you want to encode something that is already a permutation pattern, then you can use the non-exported permutation_to_integer
function.
ComplexityMeasures.RelativeMeanEncoding
— TypeRelativeMeanEncoding <: Encoding
RelativeMeanEncoding(minval::Real, maxval::Real; n = 2)
RelativeMeanEncoding
encodes a vector based on the relative position the mean of the vector has with respect to a predefined minimum and maximum value (minval
and maxval
, respectively).
Description
This encoding is inspired by Azami and Escudero (2016)'s algorithm for amplitude-aware permutation entropy. They use a linear combination of amplitude information and first-differences information of state vectors to correct probabilities. Here, however, we explicitly encode the amplitude part of the correction as an integer symbol Λ ∈ [1, 2, …, n]
. The first-difference part of the encoding is available as the RelativeFirstDifferenceEncoding
encoding.
Encoding/decoding
When used with encode
, an $m$-element state vector $\bf{x} = (x_1, x_2, \ldots, x_m)$ is encoded as $Λ = \dfrac{1}{m}\sum_{i=1}^m |x_i|$. The value of $Λ$ is then normalized to lie on the interval [0, 1]
, assuming that the minimum/maximum value any single element $x_i$ can take is minval
/maxval
, respectively. Finally, the interval [0, 1]
is discretized into n
discrete bins, enumerated by positive integers 1, 2, …, n
, and the number of the bin that the normalized $Λ$ falls into is returned.
When used with decode
, the left-edge of the bin that the normalized $Λ$ fell into is returned.
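A minimal sketch, assuming the elements of the input vector lie in [minval, maxval] = [0, 1]:
using ComplexityMeasures
encoding = RelativeMeanEncoding(0.0, 1.0; n = 4)
x = [0.1, 0.4, 0.3]     # mean ≈ 0.27, so Λ is expected to fall in the second of four bins
ω = encode(encoding, x) # an integer in 1:4
decode(encoding, ω)     # left edge of the bin that the normalized Λ fell into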
ComplexityMeasures.RelativeFirstDifferenceEncoding
— TypeRelativeFirstDifferenceEncoding <: Encoding
RelativeFirstDifferenceEncoding(minval::Real, maxval::Real; n = 2)
RelativeFirstDifferenceEncoding
encodes a vector based on the relative position the average of the first differences of the vectors has with respect to a predefined minimum and maximum value (minval
and maxval
, respectively).
Description
This encoding is inspired by Azami and Escudero (2016)'s algorithm for amplitude-aware permutation entropy. They use a linear combination of amplitude information and first-differences information of state vectors to correct probabilities. Here, however, we explicitly encode the first-differences part of the correction as an integer symbol Λ ∈ [1, 2, …, n]
. The amplitude part of the encoding is available as the RelativeMeanEncoding
encoding.
Encoding/decoding
When used with encode
, an $m$-element state vector $\bf{x} = (x_1, x_2, \ldots, x_m)$ is encoded as $Λ = \dfrac{1}{m - 1}\sum_{k=2}^m |x_{k} - x_{k-1}|$. The value of $Λ$ is then normalized to lie on the interval [0, 1]
, assuming that the minimum/maximum value any single $abs(x_k - x_{k-1})$ can take is minval
/maxval
, respectively. Finally, the interval [0, 1]
is discretized into n
discrete bins, enumerated by positive integers 1, 2, …, n
, and the number of the bin that the normalized $Λ$ falls into is returned. The smaller the mean first difference of the state vector, the smaller the bin number; the larger the mean first difference, the larger the bin number.
When used with decode
, the left-edge of the bin that the normalized $Λ$ fell into is returned.
Performance tips
If you are encoding multiple input vectors, it is more efficient to construct a RelativeFirstDifferenceEncoding
instance and re-use it:
using ComplexityMeasures
minval, maxval = 0, 1
encoding = RelativeFirstDifferenceEncoding(minval, maxval; n = 4)
pts = [rand(3) for i = 1:1000]
[encode(encoding, x) for x in pts]
ComplexityMeasures.UniqueElementsEncoding
— TypeUniqueElementsEncoding <: Encoding
UniqueElementsEncoding(x)
UniqueElementsEncoding
is a generic encoding that encodes each xᵢ ∈ unique(x)
to one of the positive integers. The xᵢ
are encoded according to the order of their first appearance in the input data.
The constructor requires the input data x
, since the number of possible symbols is length(unique(x))
.
Example
using ComplexityMeasures
x = ['a', 2, 5, 2, 5, 'a']
e = UniqueElementsEncoding(x)
encode.(Ref(e), x) == [1, 2, 3, 2, 3, 1] # true
ComplexityMeasures.RectangularBinEncoding
— TypeRectangularBinEncoding <: Encoding
RectangularBinEncoding(binning::RectangularBinning, x)
RectangularBinEncoding(binning::FixedRectangularBinning)
An encoding scheme that encode
s points χ ∈ x
into their histogram bins.
The first call signature simply initializes a FixedRectangularBinning
and then calls the second call signature.
See FixedRectangularBinning
for info on mapping points to bins.
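A minimal sketch using the first call signature, where the binning is deduced from the input data:
using ComplexityMeasures
x = StateSpaceSet(rand(1000, 2))
encoding = RectangularBinEncoding(RectangularBinning(5), x)
# Encode each 2-dimensional point into the integer index of its bin
codes = [encode(encoding, χ) for χ in x]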
ComplexityMeasures.CombinationEncoding
— TypeCombinationEncoding <: Encoding
CombinationEncoding(encodings)
A CombinationEncoding
takes multiple Encoding
s and creates a combined encoding that can be used to encode inputs that are compatible with the given encodings
.
Encoding/decoding
When used with encode
, each Encoding
in encodings
returns integers in the set 1, 2, …, n_e
, where n_e
is the total number of outcomes for a particular encoding. For k
different encodings, we can thus construct the cartesian coordinate (c₁, c₂, …, cₖ)
(cᵢ ∈ 1, 2, …, n_i
), which can uniquely be identified by an integer. We can thus identify each unique combined encoding with a single integer.
When used with decode
, the integer symbol is converted to its corresponding cartesian coordinate, which is used to retrieve the decoded symbols for each of the encodings, and a tuple of the decoded symbols are returned.
The total number of outcomes is prod(total_outcomes(e) for e in encodings)
.
Examples
using ComplexityMeasures
# We want to encode the vector `x`.
x = [0.9, 0.2, 0.3]
# To do so, we will use a combination of first-difference encoding, amplitude encoding,
# and ordinal pattern encoding.
encodings = (
RelativeFirstDifferenceEncoding(0, 1; n = 2),
RelativeMeanEncoding(0, 1; n = 5),
OrdinalPatternEncoding(3) # x is a three-element vector
)
c = CombinationEncoding(encodings)
# Encode `x` as integer
ω = encode(c, x)
# Decode symbol (into a vector of decodings, one for each encoding `e ∈ encodings`).
# In this particular case, the first two elements will be left bin edges, and
# the last element will be the decoded ordinal pattern (indices that would sort `x`).
d = decode(c, ω)
Examples: encoding rows (one point at a time)
Here, we'll use OrdinalPatternEncoding with different values of the parameter m to encode multiple StateSpaceSets of differing dimensions.
using Associations
using StateSpaceSets
using Random; rng = Xoshiro(1234)
# The first variable is 2-dimensional and has 50 points
x = StateSpaceSet(rand(rng, 50, 2))
# The second variable is 3-dimensional and has 50 points
y = StateSpaceSet(rand(rng, 50, 3))
# The third variable is 4-dimensional and has 50 points
z = StateSpaceSet(rand(rng, 50, 4))
# One encoding scheme per input variable:
# encode `x` using `ox` on a point-by-point basis (Vector{SVector{2}} → Vector{Int})
# encode `y` using `oy` on a point-by-point basis (Vector{SVector{3}} → Vector{Int})
# encode `z` using `oz` on a point-by-point basis (Vector{SVector{4}} → Vector{Int})
ox = OrdinalPatternEncoding(2)
oy = OrdinalPatternEncoding(3)
oz = OrdinalPatternEncoding(4)
# This gives three column vectors of integers.
cx, cy, cz = codify(CodifyPoints(ox, oy, oz), x, y, z)
[cx cy cz]
50×3 Matrix{Int64}:
2 1 9
1 4 5
2 6 2
1 3 22
1 1 23
1 1 17
2 2 2
2 6 7
2 4 20
1 3 22
⋮
2 3 12
1 6 19
2 1 24
1 4 5
2 2 2
2 6 11
1 6 18
1 3 21
2 1 2
Notice that the 2-dimensional x
has been encoded into integer values 1
or 2
, because there are 2!
possible ordinal patterns for dimension m = 2
. The 3-dimensional y
has been encoded into integers in the range 1
to 3! = 6
, while the 4-dimensional z
is encoded into an even larger range of integers, because the number of possible ordinal patterns is 4! = 24
for 4-dimensional embedding vectors.
Encoding per variable/column
Sometimes, it may be desirable to encode input data one variable/column at a time. This typically happens when the input is either a single timeseries or multiple timeseries.
To encode columns, we move a sliding window across each input variable/column and encode points within that window. Formally, such a sliding-window discretization is done by using the CodifyVariables
discretization scheme, which takes as input some OutcomeSpace
that dictates how each window is encoded, and also dictates the width of the encoding windows.
For column/variable-wise encoding, you can pick between the following outcome spaces.
ComplexityMeasures.OutcomeSpace
— TypeOutcomeSpace
The supertype for all outcome space implementations.
Description
In ComplexityMeasures.jl, an outcome space defines a set of possible outcomes $\Omega = \{\omega_1, \omega_2, \ldots, \omega_L \}$ (some form of discretization). In the literature, the outcome space is often also called an "alphabet", while each outcome is called a "symbol" or an "event".
An outcome space also defines a set of rules for mapping input data to each outcome $\omega_i$, a process called encoding, symbolizing, or discretizing in the literature (see encodings). Some OutcomeSpaces first apply a transformation, e.g. a delay embedding, to the data before discretizing/encoding, while other OutcomeSpaces discretize/encode the data directly.
Implementations
| Outcome space | Principle | Input data | Counting-compatible |
|---|---|---|---|
| UniqueElements | Count of unique elements | Any | ✔ |
| ValueBinning | Binning (histogram) | Vector, StateSpaceSet | ✔ |
| OrdinalPatterns | Ordinal patterns | Vector, StateSpaceSet | ✔ |
| SpatialOrdinalPatterns | Ordinal patterns in space | Array | ✔ |
| Dispersion | Dispersion patterns | Vector | ✔ |
| SpatialDispersion | Dispersion patterns in space | Array | ✔ |
| CosineSimilarityBinning | Cosine similarity | Vector | ✔ |
| BubbleSortSwaps | Swap counts when sorting | Vector | ✔ |
| SequentialPairDistances | Sequential state vector distances | Vector, StateSpaceSet | ✔ |
| TransferOperator | Binning (transfer operator) | Vector, StateSpaceSet | ✖ |
| NaiveKernel | Kernel density estimation | StateSpaceSet | ✖ |
| WeightedOrdinalPatterns | Ordinal patterns | Vector, StateSpaceSet | ✖ |
| AmplitudeAwareOrdinalPatterns | Ordinal patterns | Vector, StateSpaceSet | ✖ |
| WaveletOverlap | Wavelet transform | Vector | ✖ |
| PowerSpectrum | Fourier transform | Vector | ✖ |
In the column "input data" it is assumed that the eltype
of the input is <: Real
.
Usage
Outcome spaces are used as input to
- probabilities/allprobabilities_and_outcomes for computing probability mass functions.
- outcome_space, which returns the elements of the outcome space.
- total_outcomes, which returns the cardinality of the outcome space.
- counts/counts_and_outcomes/allcounts_and_outcomes, for obtaining raw counts instead of probabilities (only for counting-compatible outcome spaces).
Counting-compatible vs. non-counting compatible outcome spaces
There are two main types of outcome spaces.
- Counting-compatible outcome spaces have a well-defined way of counting how often each point in the (encoded) input data is mapped to a particular outcome $\omega_i$. These outcome spaces use encode to discretize the input data. Examples are OrdinalPatterns (which encodes input data into ordinal patterns) and ValueBinning (which discretizes points onto a regular grid). The table above lists which outcome spaces are counting-compatible.
- Non-counting-compatible outcome spaces have no well-defined way of explicitly counting how often each point in the input data is mapped to a particular outcome $\omega_i$. Instead, these outcome spaces return a vector of pre-normalized "relative counts", one for each outcome $\omega_i$. Examples are WaveletOverlap and PowerSpectrum.
Counting-compatible outcome spaces can be used with any ProbabilitiesEstimator
to convert counts into probability mass functions. Non-counting-compatible outcome spaces can only be used with the maximum likelihood (RelativeAmount
) probabilities estimator, which estimates probabilities precisely by the relative frequency of each outcome (formally speaking, the RelativeAmount
estimator also requires counts, but for the sake of code consistency, we allow it to be used with relative frequencies as well).
The function is_counting_based
can be used to check whether an outcome space is based on counting.
Deducing the outcome space (from data)
Some outcome space models can deduce $\Omega$ without knowledge of the input, such as OrdinalPatterns
. Other outcome spaces require knowledge of the input data for concretely specifying $\Omega$, such as ValueBinning
with RectangularBinning
. If o is some outcome space model and x some input data, then outcome_space(o, x) returns the possible outcomes $\Omega$. To get the cardinality of $\Omega$, use total_outcomes.
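A brief sketch of these functions, for one outcome space that can be deduced without data (OrdinalPatterns) and one that cannot (ValueBinning):
using ComplexityMeasures
o = OrdinalPatterns(m = 3)
outcome_space(o)     # the 3! = 6 possible ordinal patterns; no input data needed
total_outcomes(o)    # 6
is_counting_based(o) # true
x = rand(100)
outcome_space(ValueBinning(RectangularBinning(4)), x) # needs `x` to deduce the bin edges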
Implementation details
The element type of $\Omega$ varies between outcome space models, but it is guaranteed to be hashable and sortable. This allows for conveniently tracking the counts of a specific event across experimental realizations, by using the outcome as a dictionary key and the counts as the value for that key (or, alternatively, the key remains the outcome and one has a vector of probabilities, one for each experimental realization).
ComplexityMeasures.UniqueElements
— TypeUniqueElements()
An OutcomeSpace
based on straight-forward counting of distinct elements in a univariate time series or multivariate dataset. This is the same as giving no estimator to probabilities
.
Outcome space
The outcome space is the unique sorted values of the input. Hence, input x
is needed for a well-defined outcome_space
.
Implements
codify
. Used for encoding inputs where ordering matters (e.g. time series).
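A minimal sketch using categorical input:
using ComplexityMeasures
x = ["yes", "no", "no", "yes", "maybe"]
codify(UniqueElements(), x)        # e.g. [1, 2, 2, 1, 3]
probabilities(UniqueElements(), x) # relative frequency of each unique element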
ComplexityMeasures.CosineSimilarityBinning
— TypeCosineSimilarityBinning(; m::Int, τ::Int, nbins::Int)
An OutcomeSpace
based on the cosine similarity (Wang et al., 2020).
It can be used with information
to compute the "diversity entropy" of an input timeseries (Wang et al., 2020).
The implementation here allows for τ != 1
, which was not considered in the original paper.
Description
CosineSimilarityBinning probabilities are computed as follows.
- From the input time series x, using embedding lag τ and embedding dimension m, construct the embedding $Y = \{\bf x_i \} = \{(x_{i}, x_{i+\tau}, x_{i+2\tau}, \ldots, x_{i+m\tau - 1})\}_{i = 1}^{N-mτ}$.
- Compute $D = \{d(\bf x_t, \bf x_{t+1}) \}_{t=1}^{N-mτ-1}$, where $d(\cdot, \cdot)$ is the cosine similarity between two m-dimensional vectors in the embedding.
- Divide the interval [-1, 1] into nbins equally sized subintervals (including the value +1).
- Construct a histogram of cosine similarities $d \in D$ over those subintervals.
- Sum-normalize the histogram to obtain probabilities.
Outcome space
The outcome space for CosineSimilarityBinning
is the bins of the [-1, 1]
interval, and the return configuration is the same as in ValueBinning
(left bin edge).
Implements
codify
. Used for encoding inputs where ordering matters (e.g. time series).
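A minimal sketch, computing the "diversity entropy" of a random timeseries:
using ComplexityMeasures
x = rand(1000)
o = CosineSimilarityBinning(m = 3, τ = 1, nbins = 10)
# Shannon entropy over the histogram of cosine similarities
information(Shannon(; base = 2), o, x)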
ComplexityMeasures.Dispersion
— TypeDispersion(; c = 5, m = 2, τ = 1, check_unique = true)
An OutcomeSpace
based on dispersion patterns, originally used by Rostaghi and Azami (2016) to compute the "dispersion entropy", which characterizes the complexity and irregularity of a time series.
Recommended parameter values (Li et al., 2019) are m ∈ [2, 3]
, τ = 1
for the embedding, and c ∈ [3, 4, …, 8]
categories for the Gaussian symbol mapping.
Description
Assume we have a univariate time series $X = \{x_i\}_{i=1}^N$. First, this time series is encoded into a symbol timeseries $S$ using the Gaussian encoding GaussianCDFEncoding
with empirical mean μ
and empirical standard deviation σ
(both determined from $X$), and c
as given to Dispersion
.
Then, $S$ is embedded into an $m$-dimensional time series, using an embedding lag of $\tau$, which yields a total of $N - (m - 1)\tau$ delay vectors $z_i$, or "dispersion patterns". Since each element of $z_i$ can take on c
different values, and each delay vector has m
entries, there are c^m
possible dispersion patterns. This number is used for normalization when computing dispersion entropy.
The returned probabilities are simply the frequencies of the unique dispersion patterns present in $S$ (i.e., the UniqueElements
of $S$).
Outcome space
The outcome space for Dispersion
is the unique delay vectors whose elements are the symbols (integers) encoded by the Gaussian CDF, i.e., the unique elements of $S$.
Data requirements and parameters
The input must have more than one unique element for the Gaussian mapping to be well-defined. Li et al. (2019) recommends that x
has at least 1000 data points.
If check_unique == true
(default), then it is checked that the input has more than one unique value. If check_unique == false
and the input only has one unique element, then an InexactError
is thrown when trying to compute probabilities.
Each embedding vector is called a "dispersion pattern". Why? Let's consider the case when $m = 5$ and $c = 3$, and use some very imprecise terminology for illustration:
When $c = 3$, values clustering far below the mean are in one group, values clustered around the mean are in a second group, and values clustering far above the mean are in a third group. Then the embedding vector $[2, 2, 2, 2, 2]$ consists of values that are close together (close to the mean), so it represents a set of numbers that are not very spread out (less dispersed). The embedding vector $[1, 1, 2, 3, 3]$, however, represents numbers that are much more spread out (more dispersed), because the categories representing "outliers" both above and below the mean are represented, not only values close to the mean.
For a version of this estimator that can be used on high-dimensional arrays, see SpatialDispersion
.
Implements
codify
. Used for encoding inputs where ordering matters (e.g. time series).
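A minimal sketch:
using ComplexityMeasures
x = rand(2000)
o = Dispersion(c = 4, m = 3)
cx = codify(o, x)                # one integer per dispersion pattern (sliding window)
probabilities_and_outcomes(o, x) # frequencies of the unique dispersion patterns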
ComplexityMeasures.OrdinalPatterns
— TypeOrdinalPatterns <: OutcomeSpace
OrdinalPatterns{m}(τ = 1, lt::Function = ComplexityMeasures.isless_rand)
An OutcomeSpace
based on length-m
ordinal permutation patterns, originally introduced in Bandt and Pompe (2002)'s paper on permutation entropy. Note that m
is given as a type parameter, so that when it is a literal integer there are performance accelerations.
When passed to probabilities
the output depends on the input data type:
- Univariate data. If applied to a univariate timeseries (AbstractVector), then the timeseries is first embedded using embedding delay τ and dimension m, resulting in embedding vectors $\{ \bf{x}_i \}_{i=1}^{N-(m-1)\tau}$. Then, for each $\bf{x}_i$, we find its permutation pattern $\pi_{i}$. Probabilities are then estimated as the frequencies of the encoded permutation symbols by using UniqueElements. When giving the resulting probabilities to information, the original permutation entropy is computed (Bandt and Pompe, 2002).
- Multivariate data. If applied to a D-dimensional StateSpaceSet, then no embedding is constructed, m must be equal to D, and τ is ignored. Each vector $\bf{x}_i$ of the dataset is mapped directly to its permutation pattern $\pi_{i}$ by comparing the relative magnitudes of the elements of $\bf{x}_i$. Like above, probabilities are estimated as the frequencies of the permutation symbols. The resulting probabilities can be used to compute multivariate permutation entropy (He et al., 2016), although here we don't perform any further subdivision of the permutation patterns (as in Figure 3 of He et al. (2016)).
Internally, OrdinalPatterns
uses the OrdinalPatternEncoding
to represent ordinal patterns as integers for efficient computations.
See WeightedOrdinalPatterns
and AmplitudeAwareOrdinalPatterns
for estimators that not only consider ordinal (sorting) patterns, but also incorporate information about within-state-vector amplitudes. For a version of this estimator that can be used on spatial data, see SpatialOrdinalPatterns
.
In Bandt and Pompe (2002), equal values are ordered after their order of appearance, but this can lead to erroneous temporal correlations, especially for data with low amplitude resolution (Zunino et al., 2017). Here, by default, if two values are equal, then one of them is randomly assigned as "the largest", using lt = ComplexityMeasures.isless_rand
. To get the behaviour from Bandt and Pompe (2002), use lt = Base.isless
.
Outcome space
The outcome space Ω
for OrdinalPatterns
is the set of length-m
ordinal patterns (i.e. permutations) that can be formed by the integers 1, 2, …, m
. There are factorial(m)
such patterns.
For example, the outcome [2, 3, 1]
corresponds to the ordinal pattern of having the smallest value in the second position, the next smallest value in the third position, and the next smallest, i.e. the largest value in the first position. See also OrdinalPatternEncoding
.
In-place symbolization
OrdinalPatterns
also implements the in-place probabilities!
for StateSpaceSet
input (or embedded vector input) for reducing allocations in looping scenarios. The length of the pre-allocated symbol vector must be the length of the dataset. For example
using ComplexityMeasures
m, N = 2, 100
τ = 1
est = OrdinalPatterns{m}(τ)
x = StateSpaceSet(rand(N, m)) # some input dataset
πs_ts = zeros(Int, N) # length must match length of `x`
p = probabilities!(πs_ts, est, x)
ComplexityMeasures.BubbleSortSwaps
— TypeBubbleSortSwaps <: CountBasedOutcomeSpace
BubbleSortSwaps(; m = 3, τ = 1)
The BubbleSortSwaps
outcome space is based on Manis et al. (2017)'s paper on "bubble entropy".
Description
BubbleSortSwaps
does the following:
- Embeds the input data using embedding dimension m and embedding lag τ.
- For each state vector in the embedding, counts how many swaps the bubble sort algorithm needs to sort it.
For counts_and_outcomes
, we then define a distribution over the number of necessary swaps. This distribution can then be used to estimate probabilities using probabilities_and_outcomes
, which again can be used to estimate any InformationMeasure
. An example of how to compute the "Shannon bubble entropy" is given below.
Outcome space
The outcome_space
for BubbleSortSwaps
is the integers 0:N, where N = (m * (m - 1)) / 2 is the worst-case number of swaps. Hence, the number of total_outcomes is N + 1.
Implements
codify
. Returns the number of swaps required for each embedded state vector.
Examples
With the BubbleSortSwaps
outcome space, we can easily compute a "bubble entropy" inspired by Manis et al. (2017). Note: this is not actually a new entropy; it is just a new way of discretizing the input data. To reproduce the bubble entropy complexity measure from Manis et al. (2017), see BubbleEntropy
.
Examples
using ComplexityMeasures
x = rand(100000)
o = BubbleSortSwaps(; m = 5) # 5-dimensional embedding vectors
information(Shannon(; base = 2), o, x)
# We can also compute any other "bubble quantity", for example the
# "Tsallis bubble extropy", with arbitrary probabilities estimators:
information(TsallisExtropy(), BayesianRegularization(), o, x)
ComplexityMeasures.ValueBinning
— TypeValueBinning(b::AbstractBinning) <: OutcomeSpace
An OutcomeSpace
based on binning the values of the data as dictated by the binning scheme b
and formally computing their histogram, i.e., the frequencies of points in the bins. An alias to this is VisitationFrequency
. Available binnings are subtypes of AbstractBinning
.
The ValueBinning
estimator has a linearithmic time complexity (n log(n)
for n = length(x)
) and a linear space complexity (l
for l = dimension(x)
). This allows computation of probabilities (histograms) of high-dimensional datasets and with small box sizes ε
without memory overflow and with maximum performance. For performance reasons, the probabilities returned never contain 0s and are arbitrarily ordered.
ValueBinning(ϵ::Union{Real,Vector})
A convenience method that accepts same input as RectangularBinning
and initializes this binning directly.
Outcomes
The outcome space for ValueBinning
is the unique bins constructed from b
. Each bin is identified by its left (lowest-value) corner, because bins are always left-closed-right-open intervals [a, b)
. The bins are in data units, not integer (cartesian indices units), and are returned as SVector
s, i.e., same type as input data.
For convenience, outcome_space
returns the outcomes in the same array format as the underlying binning (e.g., Matrix
for 2D input).
For FixedRectangularBinning
the outcome_space
is well-defined from the binning, but for RectangularBinning
input x
is needed as well.
Implements
codify
. Used for encoding inputs where ordering matters (e.g. time series).
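A minimal sketch showing that the outcomes are bin edges in data units:
using ComplexityMeasures
x = rand(500)
o = ValueBinning(RectangularBinning(5))
# `outs` holds the left (lowest-value) corners of the occupied bins
probs, outs = probabilities_and_outcomes(o, x)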
ComplexityMeasures.RectangularBinning
— TypeRectangularBinning(ϵ, precise = false) <: AbstractBinning
Rectangular box partition of state space using the scheme ϵ
, deducing the histogram extent and bin width from the input data.
RectangularBinning
is a convenience struct. It is re-cast into FixedRectangularBinning
once the data are provided, so see that docstring for info on the bin calculation and the meaning of precise
.
Binning instructions are deduced from the type of ϵ
as follows:
- ϵ::Int divides each coordinate axis into ϵ equal-length intervals that cover all data.
- ϵ::Float64 divides each coordinate axis into intervals of fixed size ϵ, starting from the axis minima until the data is completely covered by boxes.
- ϵ::Vector{Int} divides the i-th coordinate axis into ϵ[i] equal-length intervals that cover all data.
- ϵ::Vector{Float64} divides the i-th coordinate axis into intervals of fixed size ϵ[i], starting from the axis minima until the data is completely covered by boxes.
RectangularBinning
ensures all input data are covered by extending the created ranges if need be.
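For instance, each of the following is a valid binning specification:
using ComplexityMeasures
RectangularBinning(5)           # 5 equal-length intervals along each axis
RectangularBinning(0.2)         # intervals of fixed width 0.2 along each axis
RectangularBinning([3, 4])      # 3 intervals along axis 1, 4 along axis 2
RectangularBinning([0.1, 0.25]) # fixed widths 0.1 and 0.25 along axes 1 and 2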
ComplexityMeasures.FixedRectangularBinning
— TypeFixedRectangularBinning <: AbstractBinning
FixedRectangularBinning(ranges::Tuple{<:AbstractRange...}, precise = false)
Rectangular box partition of state space where the partition along each dimension is explicitly given by each range ranges
, which is a tuple of AbstractRange
subtypes. Typically, each range is the output of the range
Base function, e.g., ranges = (0:0.1:1, range(0, 1; length = 101), range(2.1, 3.2; step = 0.33))
. All ranges must be sorted.
The optional second argument precise
dictates whether Julia Base's TwicePrecision
is used when searching where a point falls into the range. Useful for edge cases of points being almost exactly on the bin edges, but it is exactly four times as slow, so by default it is false
.
Points falling outside the partition do not contribute to probabilities. Bins are always left-closed-right-open: [a, b)
. This means that the last value of each of the ranges dictates the last right-closing value. This value does not belong to the histogram! E.g., if given a range r = range(0, 1; length = 11)
, with r[end] = 1
, the value 1
is outside the partition and does not contribute to the probability of the last bin (here [0.9, 1)
)!
Equivalently, the size of the histogram is histsize = map(r -> length(r)-1, ranges)
!
FixedRectangularBinning
leads to a well-defined outcome space without knowledge of input data, see ValueBinning
.
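A minimal sketch (assuming encode accepts a plain vector as the point to be binned):
using ComplexityMeasures
# One range per dimension; r[end] closes, but does not belong to, the last bin
ranges = (range(0, 1; length = 11), range(0, 1; length = 11))
fb = FixedRectangularBinning(ranges)
encoding = RectangularBinEncoding(fb)
encode(encoding, [0.35, 0.99]) # integer index of the bin containing the point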
Example: encoding columns (one variable at a time)
Some OutcomeSpace
s dictate a sliding window which has the width of one element when used with CodifyVariables
. ValueBinning
is such an outcome space.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
o = ValueBinning(3)
cx = codify(CodifyVariables(o), x)
100-element Vector{Int64}:
2
2
3
1
2
2
3
3
3
3
⋮
1
3
1
3
2
1
3
2
3
We can verify that ValueBinning
preserves the cardinality of the input dataset.
length(x) == length(cx)
true
Other outcome spaces such as Dispersion
or OrdinalPatterns
do not preserve the cardinality of the input dataset when used with CodifyVariables
. This is because when they are applied in a sliding window, they compress sliding windows consisting of potentially multiple points into single integers. This means that some points at the end of each input variable are lost. For example, with OrdinalPatterns
, the number of encoded points decreases with the embedding parameter m
.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
o = OrdinalPatterns(m = 3)
cx = codify(CodifyVariables(o), x)
98-element Vector{Int64}:
3
5
4
1
1
1
5
6
6
6
⋮
5
3
5
4
2
6
3
2
4
We can simultaneously encode multiple variables/columns of a StateSpaceSet
using the same outcome space, as long as the operation will result in the same number of encoded data points for each column.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
y = rand(rng, 100)
o = OrdinalPatterns(m = 3)
# Alternatively provide a tuple of input time series: codify(CodifyVariables(o), (x, y))
cx, cy = codify(CodifyVariables(o), StateSpaceSet(x, y))
[cx cy]
98×2 Matrix{Int64}:
3 1
5 5
4 6
1 4
1 2
1 3
5 1
6 5
6 4
6 1
⋮
5 5
3 4
5 1
4 5
2 3
6 2
3 4
2 5
4 4