Discretization API
Encoding multiple input datasets
A fundamental operation when computing multivariate information measures from data is discretization. Discretizing means "encoding" the input data into an intermediate representation indexed by the positive integers; this intermediate representation is called an "encoding". This is useful in several ways:
- Once a dataset has been encoded into integers, we can estimate Counts or Probabilities (tutorial).
- Once probabilities have been estimated, one can use these to estimate MultivariateInformationMeasure (tutorial).
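As a rough illustration of this pipeline, the sketch below uses plain Julia (no Associations.jl calls) to turn an integer-encoded series into counts and relative-frequency probabilities. The names `encoded`, `counts` and `probs` and the data are made up for this sketch; the library's own Counts and Probabilities estimators are more general than this.

```julia
# Illustrative integer-encoded data (made up for this sketch).
encoded = [1, 2, 2, 3, 1, 2]

# Count how often each integer outcome occurs.
counts = Dict{Int, Int}()
for c in encoded
    counts[c] = get(counts, c, 0) + 1
end

# Relative frequencies as a simple plug-in probability estimate.
probs = Dict(k => v / length(encoded) for (k, v) in counts)
```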
The following functions and types are used by Associations.jl to perform discretization of input data.
Associations.Discretization — Type
Discretization
The supertype of all discretization schemes.
Concrete implementations
Associations.CodifyVariables — Type
CodifyVariables <: Discretization
CodifyVariables(outcome_space::OutcomeSpace)
The CodifyVariables discretization scheme quantises input data in a column-wise manner using the given outcome_space.
Compatible outcome spaces
UniqueElements (for when data are pre-discretized)
BubbleSortSwaps
CosineSimilarityBinning
OrdinalPatterns
Dispersion
Description
The main difference between CodifyVariables and CodifyPoints is that the former uses OutcomeSpaces for discretization. This usually means that some transformation is applied to the data before discretizing. For example, some outcome spaces construct a delay embedding from the input (and thus encode sequential information) before encoding the data.
Specifically, given x::AbstractStateSpaceSet..., where the i-th dataset x[i] is assumed to represent a single series of measurements, CodifyVariables encodes x[i] by codify-ing it into a series of integers using an appropriate OutcomeSpace. This is typically done by first sequentially transforming the data, then running a sliding window (whose width is controlled by outcome_space) across the data, and finally encoding the values within each window to an integer.
Examples
using Associations
x, y = rand(100), rand(100)
d = CodifyVariables(OrdinalPatterns(m=2))
cx, cy = codify(d, x, y)
Associations.CodifyPoints — Type
CodifyPoints{N}
CodifyPoints(encodings::NTuple{N, Encoding})
CodifyPoints is a Discretization scheme that encodes input data points without applying any sequential transformation to the input (as opposed to CodifyVariables, which may apply some transformation before encoding).
Usage
- Use with codify to encode/discretize input variables on a point-by-point basis.
Compatible encodings
GaussianCDFEncoding
OrdinalPatternEncoding
RelativeMeanEncoding
RelativeFirstDifferenceEncoding
UniqueElementsEncoding
RectangularBinEncoding
CombinationEncoding
Description
Given x::AbstractStateSpaceSet..., where the i-th dataset is assumed to represent a single series of measurements, CodifyPoints encodes each point pₖ ∈ x[i] using some Encoding(s), without first applying any (sequential) transformation to x[i]. This behaviour differs from CodifyVariables, which may apply a transformation to x[i] before encoding.
If length(x) == N (i.e. there are N input datasets), then encodings must be a tuple of N Encodings. Alternatively, if encodings is a single Encoding, then that same encoding is applied to every x[i].
Examples
using Associations
# The same encoding on two input datasets
x = StateSpaceSet(rand(100, 3))
y = StateSpaceSet(rand(100, 3))
encoding_ord = OrdinalPatternEncoding(3)
cx, cy = codify(CodifyPoints(encoding_ord), x, y)
# Different encodings on multiple datasets
z = StateSpaceSet(rand(100, 2))
encoding_bin = RectangularBinEncoding(RectangularBinning(3), z)
d = CodifyPoints(encoding_ord, encoding_ord, encoding_bin)
cx, cy, cz = codify(d, x, y, z)
ComplexityMeasures.codify — Function
codify(encoding::CodifyPoints{N}, x::Vararg{<:AbstractStateSpaceSet, N})
Codify each timeseries xᵢ ∈ x according to the given encoding.
Examples
using Associations
x = StateSpaceSet(rand(10000, 2))
y = StateSpaceSet(rand(10000, 3))
z = StateSpaceSet(rand(10000, 2))
# For `x`, we use a relative mean encoding.
ex = RelativeMeanEncoding(0.0, 1.0, n = 3)
# For `y`, we use a combination encoding.
ey = CombinationEncoding(
    RelativeMeanEncoding(0.0, 1.0, n = 2),
    OrdinalPatternEncoding(3)
)
# For `z`, we use ordinal patterns to encode.
ez = OrdinalPatternEncoding(2)
# Codifying two input datasets gives a 2-tuple of Vector{Int}
codify(CodifyPoints(ex, ey), x, y)
# Codifying three input datasets gives a 3-tuple of Vector{Int}
codify(CodifyPoints(ex, ey, ez), x, y, z)
codify(d::CodifyVariables, x::Vararg{<:AbstractStateSpaceSet, N})
codify(d::CodifyPoints, x::Vararg{<:AbstractStateSpaceSet, N})
Codify each timeseries xᵢ ∈ x according to the given encoding/discretization d.
Compatible discretizations
CodifyVariables
CodifyPoints
Examples
using Associations
# Sliding window encoding
x = [0.1, 0.2, 0.3, 0.2, 0.1, 0.0, 0.5, 0.3, 0.5]
xc1 = codify(CodifyVariables(OrdinalPatterns(m=2)), x) # should give [1, 1, 2, 2, 2, 1, 2, 1]
xc2 = codify(OrdinalPatterns(m=2), x) # equivalent
length(xc1) < length(x) # should be true, because `OrdinalPatterns` delay embeds.
# Point-by-point encoding
x, y = StateSpaceSet(rand(100, 3)), StateSpaceSet(rand(100, 3))
cx, cy = codify(CodifyPoints(OrdinalPatternEncoding(3)), x, y)
In summary, the two main ways of discretizing data in Associations are as follows.
- The CodifyPoints discretization scheme encodes input data on a point-by-point basis by applying some Encoding to each point.
- The CodifyVariables discretization scheme encodes input data on a column-by-column basis by applying a sliding window to each column, and encoding the data within the sliding window according to some OutcomeSpace (internally, this uses codify).
Encoding, OutcomeSpace and codify are all from ComplexityMeasures.jl. In this package, they are used to discretize multiple input variables instead of just one input variable.
Encoding per point/row
In some cases, it may be desirable to encode data on a row-wise basis. This typically happens when working with pre-embedded time series or StateSpaceSets (respecting the fact that time ordering is already taken care of by the embedding procedure). If we want to apply something like OrdinalPatternEncoding to such data, then we must encode each point individually, such that vectors like [1.2, 2.4, 4.5] or ["howdy", "partner"] get mapped to an integer. The CodifyPoints discretization instruction ensures input data are encoded on a point-by-point basis.
A point-by-point discretization using CodifyPoints
is formally done by applying some Encoding
to each input data point. You can pick between the following encodings, or combine them in arbitrary ways using CombinationEncoding
.
Encoding
GaussianCDFEncoding
OrdinalPatternEncoding
RelativeMeanEncoding
RelativeFirstDifferenceEncoding
UniqueElementsEncoding
RectangularBinEncoding
CombinationEncoding
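To give an intuition for what combining encodings can mean, the sketch below merges two integer codes into a single integer by treating them as a Cartesian index and converting it to a linear index. This is plain Julia; `combine_codes` is a hypothetical helper written for this illustration, and the assumption that a combined code can be formed this way is not a statement about how CombinationEncoding is implemented.

```julia
# Hypothetical helper: merge two integer codes into one integer.
# `n1` and `n2` are the numbers of possible outcomes of each sub-encoding.
function combine_codes(c1::Int, c2::Int, n1::Int, n2::Int)
    return LinearIndices((n1, n2))[c1, c2]
end

# A 2-outcome code combined with a 6-outcome code yields 2 * 6 = 12 distinct outcomes.
combined = [combine_codes(c1, c2, 2, 6) for c1 in 1:2, c2 in 1:6]
```

Each pair of sub-codes maps to a unique integer, so no information carried by the individual codes is lost in the combination.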
Examples: encoding rows (one point at a time)
Here we use OrdinalPatternEncoding with differing values of the parameter m to encode multiple StateSpaceSets of differing dimensions.
using Associations
using StateSpaceSets
using Random; rng = Xoshiro(1234)
# The first variable is 2-dimensional and has 50 points
x = StateSpaceSet(rand(rng, 50, 2))
# The second variable is 3-dimensional and has 50 points
y = StateSpaceSet(rand(rng, 50, 3))
# The third variable is 4-dimensional and has 50 points
z = StateSpaceSet(rand(rng, 50, 4))
# One encoding scheme per input variable
# encode `x` using `ox` on a point-by-point basis (Vector{SVector{2}} → Vector{Int})
# encode `y` using `oy` on a point-by-point basis (Vector{SVector{3}} → Vector{Int})
# encode `z` using `oz` on a point-by-point basis (Vector{SVector{4}} → Vector{Int})
ox = OrdinalPatternEncoding(2)
oy = OrdinalPatternEncoding(3)
oz = OrdinalPatternEncoding(4)
# This gives three vectors of integers.
cx, cy, cz = codify(CodifyPoints(ox, oy, oz), x, y, z)
[cx cy cz]
50×3 Matrix{Int64}:
2 1 9
1 4 5
2 6 2
1 3 22
1 1 23
1 1 17
2 2 2
2 6 7
2 4 20
1 3 22
⋮
2 3 12
1 6 19
2 1 24
1 4 5
2 2 2
2 6 11
1 6 18
1 3 21
2 1 2
Notice that the 2-dimensional x
has been encoded into integer values 1
or 2
, because there are 2!
possible ordinal patterns for dimension m = 2
. The 3-dimensional y
has been encoded into integers in the range 1
to 3! = 6
, while the 4-dimensional z
is encoded into an even larger range of integers, because the number of possible ordinal patterns is 4! = 24
for 4-dimensional embedding vectors.
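The factorial counts can be made concrete with a small sketch in plain Julia. The helper `ordinal_code` below is hypothetical (it is not part of Associations.jl or ComplexityMeasures.jl): it maps a point to an integer in 1:factorial(m) using the Lehmer code of the point's sorting permutation. The integer labels it produces need not agree with those assigned by OrdinalPatternEncoding; only the number of distinct labels is the point here.

```julia
# Hypothetical helper, not the library's implementation.
function ordinal_code(p)
    perm = sortperm(collect(p))   # permutation that sorts the point
    m = length(perm)
    code = 0
    for i in 1:m
        # Lehmer digit: how many later entries are smaller than perm[i].
        smaller = count(j -> perm[j] < perm[i], (i + 1):m)
        code += smaller * factorial(m - i)
    end
    return code + 1               # shift to 1-based labels
end

ordinal_code([1.2, 2.4, 4.5])     # increasing point, label 1
ordinal_code([4.5, 2.4, 1.2])     # decreasing point, label factorial(3) = 6
[factorial(m) for m in 2:4]       # number of possible labels: 2, 6, 24
```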
Encoding per variable/column
Sometimes, it may be desirable to encode input data one variable/column at a time. This typically happens when the input is either a single time series or multiple time series.
To encode columns, we move a sliding window across each input variable/column and encode points within that window. Formally, such a sliding-window discretization is done by using the CodifyVariables
discretization scheme, which takes as input some OutcomeSpace
that dictates how each window is encoded, and also dictates the width of the encoding windows.
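The sliding-window idea can be sketched in plain Julia for the simplest case, a window of width 2. The helper `sliding_ordinal2` is hypothetical and written only for this illustration; in particular, it treats ties as non-decreasing, which is a simplification and may differ from how the library's OrdinalPatterns handles ties.

```julia
# Hypothetical helper: encode each length-2 window as
# 1 (non-decreasing) or 2 (decreasing).
function sliding_ordinal2(x::AbstractVector{<:Real})
    return [x[i] <= x[i + 1] ? 1 : 2 for i in 1:(length(x) - 1)]
end

x = [0.1, 0.2, 0.3, 0.2, 0.1, 0.0, 0.5, 0.3, 0.5]
codes = sliding_ordinal2(x)   # 8 windows from 9 points: [1, 1, 2, 2, 2, 1, 2, 1]
```

Because each window spans two consecutive values, the encoded series is one element shorter than the input.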
For column/variable-wise encoding, you can pick between the following outcome spaces.
OutcomeSpace
UniqueElements
CosineSimilarityBinning
Dispersion
OrdinalPatterns
BubbleSortSwaps
ValueBinning
RectangularBinning
FixedRectangularBinning
Example: encoding columns (one variable at a time)
Some OutcomeSpace
s dictate a sliding window which has the width of one element when used with CodifyVariables
. ValueBinning
is such an outcome space.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
o = ValueBinning(3)
cx = codify(CodifyVariables(o), x)
100-element Vector{Int64}:
2
2
3
1
2
2
3
3
3
3
⋮
1
3
1
3
2
1
3
2
3
We can verify that ValueBinning
preserves the cardinality of the input dataset.
length(x) == length(cx)
true
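Why a width-one window preserves the number of points can be seen from a minimal binning sketch in plain Julia: every value is mapped to a bin independently of its neighbours. The helper `bin_values` is hypothetical and assumes equal-width bins spanning the data range with non-constant input; the library's ValueBinning may place bin edges differently.

```julia
# Hypothetical helper: assign each value to one of `n` equal-width bins
# spanning the data range. Assumes the data are not all identical.
function bin_values(x::AbstractVector{<:Real}, n::Int)
    lo, hi = extrema(x)
    width = (hi - lo) / n
    # `min` keeps the maximum value in bin `n` instead of creating bin n + 1.
    return [min(floor(Int, (v - lo) / width) + 1, n) for v in x]
end

x = [0.05, 0.40, 0.95, 0.10, 0.70, 0.30]
cx = bin_values(x, 3)
length(cx) == length(x)   # one integer per input value
```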
Other outcome spaces such as Dispersion or OrdinalPatterns do not preserve the cardinality of the input dataset when used with CodifyVariables. This is because when they are applied in a sliding window, they compress sliding windows consisting of potentially multiple points into single integers. This means that some points at the end of each input variable are lost. For example, with OrdinalPatterns, the number of encoded points decreases as the embedding parameter m increases.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
o = OrdinalPatterns(m = 3)
cx = codify(CodifyVariables(o), x)
98-element Vector{Int64}:
3
5
4
1
1
1
5
6
6
6
⋮
5
3
5
4
2
6
3
2
4
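The length of the output can be checked with simple arithmetic. Assuming a standard delay embedding with window size `m` and delay `τ` (here `τ = 1`), a series of `N` points yields `N - (m - 1) * τ` windows. The function `n_encoded` is a hypothetical helper for this check, not a library call.

```julia
# Number of encoded points for a series of length `N`,
# window size `m`, and embedding delay `τ`.
n_encoded(N::Int, m::Int; τ::Int = 1) = N - (m - 1) * τ

n_encoded(100, 3)   # 98, matching the 98-element output above
n_encoded(9, 2)     # 8
```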
We can simultaneously encode multiple variables/columns of a StateSpaceSet using the same outcome space, as long as the operation results in the same number of encoded data points for each column.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
y = rand(rng, 100)
o = OrdinalPatterns(m = 3)
# Alternatively provide a tuple of input time series: codify(CodifyVariables(o), (x, y))
cx, cy = codify(CodifyVariables(o), StateSpaceSet(x, y))
[cx cy]
98×2 Matrix{Int64}:
3 1
5 5
4 6
1 4
1 2
1 3
5 1
6 5
6 4
6 1
⋮
5 5
3 4
5 1
4 5
2 3
6 2
3 4
2 5
4 4