Discretization API
Encoding multiple input datasets
A fundamental operation when computing multivariate information measures from data is discretization. Discretizing means "encoding" the input data into an intermediate representation indexed by the positive integers. This intermediate representation is called an "encoding". This is useful in several ways:
- Once a dataset has been encoded into integers, we can estimate Counts or Probabilities (tutorial).
- Once probabilities have been estimated, one can use these to estimate a MultivariateInformationMeasure (tutorial).
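To make the two steps concrete, here is a minimal sketch with a hand-rolled tally; the Counts/Probabilities API from the tutorial automates exactly this kind of bookkeeping:

using Associations
x, y = rand(1000), rand(1000)
d = CodifyVariables(OrdinalPatterns(m = 3))
cx, cy = codify(d, x, y) # two integer-valued series
# Tally joint outcomes by hand to see what a count estimate is made of.
joint = Dict{Tuple{Int,Int},Int}()
for (a, b) in zip(cx, cy)
    joint[(a, b)] = get(joint, (a, b), 0) + 1
end
joint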
The following functions and types are used by Associations.jl to perform discretization of input data.
Associations.Discretization — Type

Discretization

The supertype of all discretization schemes.

Concrete implementations

- CodifyVariables
- CodifyPoints
Associations.CodifyVariables — Type

CodifyVariables <: Discretization
CodifyVariables(outcome_space::OutcomeSpace)

The CodifyVariables discretization scheme quantises input data in a column-wise manner using the given outcome_space.
Compatible outcome spaces

- UniqueElements (for when data are pre-discretized)
- BubbleSortSwaps
- CosineSimilarityBinning
- OrdinalPatterns
- Dispersion
Description
The main difference between CodifyVariables and CodifyPoints is that the former uses OutcomeSpaces for discretization. This usually means that some transformation is applied to the data before discretizing. For example, some outcome spaces construct a delay embedding from the input (and thus encode sequential information) before encoding the data.
Specifically, given x::AbstractStateSpaceSet..., where the i-th dataset x[i] is assumed to represent a single series of measurements, CodifyVariables encodes x[i] by codify-ing it into a series of integers using an appropriate OutcomeSpace. This is typically done by first sequentially transforming the data, then running a sliding window (whose width is controlled by outcome_space) across the data, and encoding the values within each window as an integer.
Examples
using Associations
x, y = rand(100), rand(100)
d = CodifyVariables(OrdinalPatterns(m=2))
cx, cy = codify(d, x, y)
Associations.CodifyPoints — Type

CodifyPoints{N}
CodifyPoints(encodings::NTuple{N, Encoding})

CodifyPoints is a Discretization scheme that encodes input data points without applying any sequential transformation to the input (as opposed to CodifyVariables, which may apply some transformation before encoding).
Usage

- Use with codify to encode/discretize input variables on a point-by-point basis.
Compatible encodings

- GaussianCDFEncoding
- OrdinalPatternEncoding
- RelativeMeanEncoding
- RelativeFirstDifferenceEncoding
- UniqueElementsEncoding
- RectangularBinEncoding
- CombinationEncoding
Description
Given x::AbstractStateSpaceSet..., where the i-th dataset is assumed to represent a single series of measurements, CodifyPoints encodes each point pₖ ∈ x[i] using some Encoding(s), without first applying any (sequential) transformation to x[i]. This behaviour is different from CodifyVariables, which does apply a transformation to x[i] before encoding.
If length(x) == N (i.e. there are N input datasets), then encodings must be a tuple of N Encodings. Alternatively, if encodings is a single Encoding, then that same encoding is applied to every x[i].
Examples
using Associations
# The same encoding on two input datasets
x = StateSpaceSet(rand(100, 3))
y = StateSpaceSet(rand(100, 3))
encoding_ord = OrdinalPatternEncoding(3)
cx, cy = codify(CodifyPoints(encoding_ord), x, y)
# Different encodings on multiple datasets
z = StateSpaceSet(rand(100, 2))
encoding_bin = RectangularBinEncoding(RectangularBinning(3), z)
d = CodifyPoints(encoding_ord, encoding_ord, encoding_bin)
cx, cy, cz = codify(d, x, y, z)
ComplexityMeasures.codify — Function

codify(o::OutcomeSpace, x::Vector) → s::Vector{Int}
codify(o::OutcomeSpace, x::AbstractStateSpaceSet{D}) → s::NTuple{D, Vector{Int}}

Codify x according to the outcome space o. If x is a Vector, then a Vector{<:Integer} is returned. If x is a StateSpaceSet{D}, then symbolization is done column-wise and an NTuple{D, Vector{<:Integer}} is returned, where D = dimension(x).
Description
The reason this function exists is that we don't always want to encode the entire input x at once. Sometimes, it is desirable to first apply some transformation to x (the OutcomeSpace dictates this transformation), and then apply Encodings in a point-wise manner in the transformed space. This is useful for encoding timeseries data.
The length of the returned s depends on the OutcomeSpace. Some outcome spaces preserve the input data length (e.g. UniqueElements), while others (e.g. OrdinalPatterns) perform e.g. a delay embedding before encoding, so that length(s) < length(x).
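For example, a quick check of how the output length depends on the chosen outcome space:

using ComplexityMeasures
x = rand(50)
length(codify(UniqueElements(), x))        # 50: input length preserved
length(codify(OrdinalPatterns(m = 3), x))  # 48: the delay embedding shortens the output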
In summary, the two main ways of discretizing data in Associations are as follows.

- The CodifyPoints discretization scheme encodes input data on a point-by-point basis by applying some Encoding to each point.
- The CodifyVariables discretization scheme encodes input data on a column-by-column basis by applying a sliding window to each column, and encoding the data within the sliding window according to some OutcomeSpace (internally, this uses codify).
Encoding, OutcomeSpace and codify are all from ComplexityMeasures.jl. In this package, they are used to discretize multiple input variables instead of just one input variable.
Encoding per point/row
In some cases, it may be desirable to encode data on a row-wise basis. This typically happens when working with pre-embedded time series or StateSpaceSets (respecting the fact that time ordering is already taken care of by the embedding procedure). If we want to apply something like OrdinalPatternEncoding to such data, then we must encode each point individually, such that vectors like [1.2, 2.4, 4.5] or ["howdy", "partner"] get mapped to an integer. The CodifyPoints discretization instruction ensures input data are encoded on a point-by-point basis.
A point-by-point discretization using CodifyPoints is formally done by applying some Encoding to each input data point. You can pick between the following encodings, or combine them in arbitrary ways using CombinationEncoding.
ComplexityMeasures.Encoding — Type

Encoding

The supertype for all encoding schemes. Encodings always encode elements of input data into the positive integers. The encoding API is defined by the functions encode and decode. Some probability estimators utilize encodings internally.
Current available encodings are:

- OrdinalPatternEncoding
- GaussianCDFEncoding
- RectangularBinEncoding
- RelativeMeanEncoding
- RelativeFirstDifferenceEncoding
- UniqueElementsEncoding
- BubbleSortSwapsEncoding
- PairDistanceEncoding
- CombinationEncoding, which can combine any of the above encodings.
ComplexityMeasures.GaussianCDFEncoding — Type

GaussianCDFEncoding <: Encoding
GaussianCDFEncoding{m}(; μ, σ, c::Int = 3)

An encoding scheme that encodes a scalar or vector χ into one of the integers sᵢ ∈ [1, 2, …, c] based on the normal cumulative distribution function (NCDF), and decodes the sᵢ into subintervals of [0, 1] (with some loss of information).
Initializing a GaussianCDFEncoding

The size of the input to be encoded must be known beforehand. One must therefore set m = length(χ), where χ is the input (m = 1 for scalars, m ≥ 2 for vectors). To do so, one must explicitly give m as a type parameter: e.g. encoding = GaussianCDFEncoding{3}(; μ = 0.0, σ = 0.1) to encode 3-element vectors, or encoding = GaussianCDFEncoding{1}(; μ = 0.0, σ = 0.1) to encode scalars.
Description
Encoding/decoding scalars
GaussianCDFEncoding first maps an input scalar $χ$ to a new real number $y \in [0, 1]$ by using the normal cumulative distribution function (CDF) with the given mean μ and standard deviation σ, according to the map

\[x \to y : y = \dfrac{1}{\sigma \sqrt{2 \pi}} \int_{-\infty}^{x} e^{-(t - \mu)^2/(2 \sigma^2)} \, dt.\]
Next, the interval [0, 1] is equidistantly binned and enumerated $1, 2, \ldots, c$, and $y$ is linearly mapped to one of these integers using the linear map $y \to z : z = \text{floor}(y(c-1)) + 1$.

Because of the floor operation, some information is lost, so when used with decode, each decoded sᵢ is mapped to a subinterval of [0, 1]. This subinterval is returned as a length-1 Vector{SVector}.
Notice that the decoding step does not yield an element of any outcome space of the estimators that use GaussianCDFEncoding internally, such as Dispersion. That is because these estimators additionally delay embed the encoded data.
Encoding/decoding vectors

If GaussianCDFEncoding is used with a vector χ, then each element of χ is encoded separately, resulting in a length(χ) sequence of integers which may be treated as a CartesianIndex. The encoded symbol s ∈ [1, 2, …, cᵐ] is then just the linear index corresponding to this cartesian index (similar to how CombinationEncoding works).

When decoded, the integer symbol s is converted back into its CartesianIndex representation, which is just a sequence of integers that refer to subdivisions of the [0, 1] interval. The relevant subintervals are then returned as a length(χ)-element Vector{SVector}.
Examples
julia> using ComplexityMeasures, Statistics
julia> x = [0.1, 0.4, 0.7, -2.1, 8.0];
julia> μ, σ = mean(x), std(x); encoding = GaussianCDFEncoding(; μ, σ, c = 5)
julia> es = encode.(Ref(encoding), x)
5-element Vector{Int64}:
2
2
3
1
5
julia> decode(encoding, 3)
2-element SVector{2, Float64} with indices SOneTo(2):
0.4
0.6
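The vector case follows the same pattern. A minimal sketch based on the description above (the container type of the decoded subintervals may vary between versions):

using ComplexityMeasures
enc = GaussianCDFEncoding{2}(; μ = 0.0, σ = 1.0, c = 3)
χ = [0.5, -0.5]            # a 2-element state vector
s = encode(enc, χ)         # a single integer: the linear index over the 3×3 grid
intervals = decode(enc, s) # one subinterval of [0, 1] per element of χ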
ComplexityMeasures.OrdinalPatternEncoding — Type

OrdinalPatternEncoding <: Encoding
OrdinalPatternEncoding{m}(lt = ComplexityMeasures.isless_rand)

An encoding scheme that encodes length-m vectors into their permutation/ordinal patterns and then into the integers based on the Lehmer code. It is used by OrdinalPatterns and similar estimators; see those for a description of the outcome space.

The ordinal/permutation pattern of a vector χ is simply sortperm(χ), which gives the indices that would sort χ in ascending order.
Description
The Lehmer code, as implemented here, is a bijection between the set of factorial(m) possible permutations for a length-m sequence, and the integers 1, 2, …, factorial(m). The encoding step uses algorithm 1 in Berger et al. (2019), which is highly optimized. The decoding step is much slower due to missing optimizations (pull requests welcomed!).
Example
julia> using ComplexityMeasures
julia> χ = [4.0, 1.0, 9.0];
julia> c = OrdinalPatternEncoding(3);
julia> i = encode(c, χ)
3
julia> decode(c, i)
3-element SVector{3, Int64} with indices SOneTo(3):
2
1
3
If you want to encode something that is already a permutation pattern, then you can use the non-exported permutation_to_integer function.
ComplexityMeasures.RelativeMeanEncoding — Type

RelativeMeanEncoding <: Encoding
RelativeMeanEncoding(minval::Real, maxval::Real; n = 2)

RelativeMeanEncoding encodes a vector based on the relative position the mean of the vector has with respect to a predefined minimum and maximum value (minval and maxval, respectively).
Description
This encoding is inspired by Azami and Escudero (2016)'s algorithm for amplitude-aware permutation entropy. They use a linear combination of amplitude information and first-difference information of state vectors to correct probabilities. Here, however, we explicitly encode the amplitude part of the correction as an integer symbol Λ ∈ [1, 2, …, n]. The first-difference part of the encoding is available as the RelativeFirstDifferenceEncoding encoding.
Encoding/decoding
When used with encode, an $m$-element state vector $\bf{x}$ $= (x_1, x_2, \ldots, x_m)$ is encoded as $Λ = \dfrac{1}{m}\sum_{i=1}^m |x_i|$. The value of $Λ$ is then normalized to lie on the interval [0, 1], assuming that the minimum/maximum value any single element $x_i$ can take is minval/maxval, respectively. Finally, the interval [0, 1] is discretized into n discrete bins, enumerated by the positive integers 1, 2, …, n, and the number of the bin that the normalized $Λ$ falls into is returned.

When used with decode, the left edge of the bin that the normalized $Λ$ fell into is returned.
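A minimal sketch of the encode/decode round trip, assuming the data live on [0, 1]:

using ComplexityMeasures
enc = RelativeMeanEncoding(0.0, 1.0; n = 4)
χ = [0.1, 0.2, 0.3]  # mean of absolute values is 0.2
s = encode(enc, χ)   # 0.2 falls in the first of the four bins on [0, 1], so s == 1
decode(enc, s)       # left edge of that bin: 0.0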
ComplexityMeasures.RelativeFirstDifferenceEncoding — Type

RelativeFirstDifferenceEncoding <: Encoding
RelativeFirstDifferenceEncoding(minval::Real, maxval::Real; n = 2)

RelativeFirstDifferenceEncoding encodes a vector based on the relative position the average of the first differences of the vector has with respect to a predefined minimum and maximum value (minval and maxval, respectively).
Description
This encoding is inspired by Azami and Escudero (2016)'s algorithm for amplitude-aware permutation entropy. They use a linear combination of amplitude information and first differences information of state vectors to correct probabilities. Here, however, we explicitly encode the first-differences part of the correction as an integer symbol Λ ∈ [1, 2, …, n]. The amplitude part of the encoding is available as the RelativeMeanEncoding encoding.
Encoding/decoding
When used with encode, an $m$-element state vector $\bf{x}$ $= (x_1, x_2, \ldots, x_m)$ is encoded as $Λ = \dfrac{1}{m - 1}\sum_{k=2}^m |x_{k} - x_{k-1}|$. The value of $Λ$ is then normalized to lie on the interval [0, 1], assuming that the minimum/maximum value any single $|x_k - x_{k-1}|$ can take is minval/maxval, respectively. Finally, the interval [0, 1] is discretized into n discrete bins, enumerated by the positive integers 1, 2, …, n, and the number of the bin that the normalized $Λ$ falls into is returned. The smaller the mean first difference of the state vector is, the smaller the bin number; the larger the mean first difference, the larger the bin number.

When used with decode, the left edge of the bin that the normalized $Λ$ fell into is returned.
Performance tips
If you are encoding multiple input vectors, it is more efficient to construct a RelativeFirstDifferenceEncoding instance and re-use it:
minval, maxval = 0, 1
encoding = RelativeFirstDifferenceEncoding(minval, maxval; n = 4)
pts = [rand(3) for i = 1:1000]
[encode(encoding, x) for x in pts]
ComplexityMeasures.UniqueElementsEncoding — Type

UniqueElementsEncoding <: Encoding
UniqueElementsEncoding(x)

UniqueElementsEncoding is a generic encoding that encodes each xᵢ ∈ unique(x) to one of the positive integers. The xᵢ are encoded according to the order of their first appearance in the input data.

The constructor requires the input data x, since the number of possible symbols is length(unique(x)).
Example
using ComplexityMeasures
x = ['a', 2, 5, 2, 5, 'a']
e = UniqueElementsEncoding(x)
encode.(Ref(e), x) == [1, 2, 3, 2, 3, 1] # true
ComplexityMeasures.RectangularBinEncoding — Type

RectangularBinEncoding <: Encoding
RectangularBinEncoding(binning::RectangularBinning, x)
RectangularBinEncoding(binning::FixedRectangularBinning)

An encoding scheme that encodes points χ ∈ x into their histogram bins.

The first call signature simply initializes a FixedRectangularBinning and then calls the second call signature.

See FixedRectangularBinning for info on mapping points to bins.
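A minimal sketch of point-wise binning (the decoded value is the bin's left corner, in data units):

using ComplexityMeasures
x = StateSpaceSet(rand(1000, 2))
enc = RectangularBinEncoding(RectangularBinning(5), x) # 5 bins per axis, extent deduced from x
s = encode(enc, x[1]) # the integer bin that the first point falls into
decode(enc, s)        # the left (lowest-value) corner of that bin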
ComplexityMeasures.CombinationEncoding — Type

CombinationEncoding <: Encoding
CombinationEncoding(encodings)

A CombinationEncoding takes multiple Encodings and creates a combined encoding that can be used to encode inputs that are compatible with the given encodings.
Encoding/decoding
When used with encode, each Encoding in encodings returns integers in the set 1, 2, …, nᵢ, where nᵢ is the total number of outcomes for that particular encoding. For k different encodings, we can thus construct the cartesian coordinate (c₁, c₂, …, cₖ) (cᵢ ∈ 1, 2, …, nᵢ), which can be uniquely identified by an integer. We can thus identify each unique combined encoding with a single integer.

When used with decode, the integer symbol is converted to its corresponding cartesian coordinate, which is used to retrieve the decoded symbols for each of the encodings, and a tuple of the decoded symbols is returned.

The total number of outcomes is prod(total_outcomes(e) for e in encodings).
Examples
using ComplexityMeasures
# We want to encode the vector `x`.
x = [0.9, 0.2, 0.3]
# To do so, we will use a combination of first-difference encoding, amplitude encoding,
# and ordinal pattern encoding.
encodings = (
RelativeFirstDifferenceEncoding(0, 1; n = 2),
RelativeMeanEncoding(0, 1; n = 5),
OrdinalPatternEncoding(3) # x is a three-element vector
)
c = CombinationEncoding(encodings)
# Encode `x` as integer
ω = encode(c, x)
# Decode symbol (into a vector of decodings, one for each encodings `e ∈ encodings`).
# In this particular case, the first two elements will be left-bin edges, and
# the last element will be the decoded ordinal pattern (indices that would sort `x`).
d = decode(c, ω)
Examples: encoding rows (one point at a time)

We'll here use OrdinalPatternEncoding with differing parameter m to encode multiple StateSpaceSets of differing dimensions.
using Associations
using StateSpaceSets
using Random; rng = Xoshiro(1234)
# The first variable is 2-dimensional and has 50 points
x = StateSpaceSet(rand(rng, 50, 2))
# The second variable is 3-dimensional and has 50 points
y = StateSpaceSet(rand(rng, 50, 3))
# The third variable is 4-dimensional and has 50 points
z = StateSpaceSet(rand(rng, 50, 4))
# One encoding scheme per input variable
# encode `x` using `ox` on a point-by-point basis (Vector{SVector{2}} → Vector{Int})
# encode `y` using `oy` on a point-by-point basis (Vector{SVector{3}} → Vector{Int})
# encode `z` using `oz` on a point-by-point basis (Vector{SVector{4}} → Vector{Int})
ox = OrdinalPatternEncoding(2)
oy = OrdinalPatternEncoding(3)
oz = OrdinalPatternEncoding(4)
# This gives three column vectors of integers.
cx, cy, cz = codify(CodifyPoints(ox, oy, oz), x, y, z)
[cx cy cz]
50×3 Matrix{Int64}:
2 1 9
1 4 5
2 6 2
1 3 22
1 1 23
1 1 17
2 2 2
2 6 7
2 4 20
1 3 22
⋮
2 3 12
1 6 19
2 1 24
1 4 5
2 2 2
2 6 11
1 6 18
1 3 21
2 1 2
Notice that the 2-dimensional x has been encoded into integer values 1 or 2, because there are 2! = 2 possible ordinal patterns for dimension m = 2. The 3-dimensional y has been encoded into integers in the range 1 to 3! = 6, while the 4-dimensional z is encoded into an even larger range of integers, because the number of possible ordinal patterns is 4! = 24 for 4-dimensional embedding vectors.
Encoding per variable/column

Sometimes, it may be desirable to encode input data one variable/column at a time. This typically happens when the input is one or more timeseries.

To encode columns, we move a sliding window across each input variable/column and encode the points within that window. Formally, such a sliding-window discretization is done by using the CodifyVariables discretization scheme, which takes as input some OutcomeSpace that dictates how each window is encoded, and also dictates the width of the encoding windows.
For column/variable-wise encoding, you can pick between the following outcome spaces.
ComplexityMeasures.OutcomeSpace — Type

OutcomeSpace

The supertype for all outcome space implementations.

Description

In ComplexityMeasures.jl, an outcome space defines a set of possible outcomes $\Omega = \{\omega_1, \omega_2, \ldots, \omega_L \}$ (some form of discretization). In the literature, the outcome space is often also called an "alphabet", while each outcome is called a "symbol" or an "event".

An outcome space also defines a set of rules for mapping input data to each outcome $\omega_i$, a process called encoding, symbolizing or discretizing in the literature (see encodings). Some OutcomeSpaces first apply a transformation, e.g. a delay embedding, to the data before discretizing/encoding, while other OutcomeSpaces discretize/encode the data directly.
Implementations
| Outcome space | Principle | Input data | Counting-compatible |
|---|---|---|---|
| UniqueElements | Count of unique elements | Any | ✔ |
| ValueBinning | Binning (histogram) | Vector, StateSpaceSet | ✔ |
| OrdinalPatterns | Ordinal patterns | Vector, StateSpaceSet | ✔ |
| SpatialOrdinalPatterns | Ordinal patterns in space | Array | ✔ |
| Dispersion | Dispersion patterns | Vector | ✔ |
| SpatialDispersion | Dispersion patterns in space | Array | ✔ |
| CosineSimilarityBinning | Cosine similarity | Vector | ✔ |
| BubbleSortSwaps | Swap counts when sorting | Vector | ✔ |
| SequentialPairDistances | Sequential state vector distances | Vector, StateSpaceSet | ✔ |
| TransferOperator | Binning (transfer operator) | Vector, StateSpaceSet | ✖ |
| NaiveKernel | Kernel density estimation | StateSpaceSet | ✖ |
| WeightedOrdinalPatterns | Ordinal patterns | Vector, StateSpaceSet | ✖ |
| AmplitudeAwareOrdinalPatterns | Ordinal patterns | Vector, StateSpaceSet | ✖ |
| WaveletOverlap | Wavelet transform | Vector | ✖ |
| PowerSpectrum | Fourier transform | Vector | ✖ |
In the column "input data" it is assumed that the eltype
of the input is <: Real
.
Usage

Outcome spaces are used as input to

- probabilities/allprobabilities_and_outcomes, for computing probability mass functions.
- outcome_space, which returns the elements of the outcome space.
- total_outcomes, which returns the cardinality of the outcome space.
- counts/counts_and_outcomes/allcounts_and_outcomes, for obtaining raw counts instead of probabilities (only for counting-compatible outcome spaces).
Counting-compatible vs. non-counting compatible outcome spaces
There are two main types of outcome spaces.
- Counting-compatible outcome spaces have a well-defined way of counting how often each point in the (encoded) input data is mapped to a particular outcome $\omega_i$. These outcome spaces use encode to discretize the input data. Examples are OrdinalPatterns (which encodes input data into ordinal patterns) and ValueBinning (which discretizes points onto a regular grid). The table above lists which outcome spaces are counting-compatible.
- Non-counting-compatible outcome spaces have no well-defined way of counting explicitly how often each point in the input data is mapped to a particular outcome $\omega_i$. Instead, these outcome spaces return a vector of pre-normalized "relative counts", one for each outcome $\omega_i$. Examples are WaveletOverlap and PowerSpectrum.
Counting-compatible outcome spaces can be used with any ProbabilitiesEstimator to convert counts into probability mass functions. Non-counting-compatible outcome spaces can only be used with the maximum likelihood (RelativeAmount) probabilities estimator, which estimates probabilities precisely by the relative frequency of each outcome (formally speaking, the RelativeAmount estimator also requires counts, but for the sake of code consistency, we allow it to be used with relative frequencies as well).

The function is_counting_based can be used to check whether an outcome space is based on counting.
Deducing the outcome space (from data)

Some outcome space models can deduce $\Omega$ without knowledge of the input, such as OrdinalPatterns. Other outcome spaces require knowledge of the input data to concretely specify $\Omega$, such as ValueBinning with RectangularBinning. If o is some outcome space model and x some input data, then outcome_space(o, x) returns the possible outcomes $\Omega$. To get the cardinality of $\Omega$, use total_outcomes.
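For example, a small sketch contrasting the data-free and data-dependent cases:

using ComplexityMeasures
o = OrdinalPatterns(m = 3)
outcome_space(o)  # all factorial(3) = 6 permutation patterns; no data needed
total_outcomes(o) # 6

b = ValueBinning(RectangularBinning(4))
x = rand(100)
outcome_space(b, x) # bins are deduced from the data, so `x` is required here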
Implementation details
The element type of $\Omega$ varies between outcome space models, but it is guaranteed to be hashable and sortable. This allows for conveniently tracking the counts of a specific event across experimental realizations, by using the outcome as a dictionary key and the counts as the value for that key (or, alternatively, the key remains the outcome and one has a vector of probabilities, one for each experimental realization).
ComplexityMeasures.UniqueElements — Type

UniqueElements()

An OutcomeSpace based on straightforward counting of distinct elements in a univariate time series or multivariate dataset. This is the same as giving no estimator to probabilities.

Outcome space

The outcome space is the unique sorted values of the input. Hence, input x is needed for a well-defined outcome_space.

Implements

codify. Used for encoding inputs where ordering matters (e.g. time series).
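A quick sketch of codify with pre-discretized data:

using ComplexityMeasures
x = [0.5, 0.5, 1.5, 0.5, 2.5]
codify(UniqueElements(), x) # [1, 1, 2, 1, 3]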
ComplexityMeasures.CosineSimilarityBinning — Type

CosineSimilarityBinning(; m::Int, τ::Int, nbins::Int)

An OutcomeSpace based on the cosine similarity (Wang et al., 2020).

It can be used with information to compute the "diversity entropy" of an input timeseries (Wang et al., 2020).

The implementation here allows for τ != 1, which was not considered in the original paper.
Description
CosineSimilarityBinning probabilities are computed as follows.

- From the input time series x, using embedding lag τ and embedding dimension m, construct the embedding $Y = \{ {\bf x}_i \}_{i=1}^{N-(m-1)\tau}$, where ${\bf x}_i = (x_{i}, x_{i+\tau}, x_{i+2\tau}, \ldots, x_{i+(m-1)\tau})$.
- Compute $D = \{d({\bf x}_t, {\bf x}_{t+1}) \}_{t=1}^{N-(m-1)\tau-1}$, where $d(\cdot, \cdot)$ is the cosine similarity between two m-dimensional vectors in the embedding.
- Divide the interval [-1, 1] into nbins equally sized subintervals (including the value +1).
- Construct a histogram of cosine similarities $d \in D$ over those subintervals.
- Sum-normalize the histogram to obtain probabilities.
Outcome space

The outcome space for CosineSimilarityBinning is the bins of the [-1, 1] interval, and the return configuration is the same as in ValueBinning (left bin edge).

Implements

codify. Used for encoding inputs where ordering matters (e.g. time series).
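A minimal sketch of computing the "diversity entropy" mentioned above:

using ComplexityMeasures
x = rand(1000)
o = CosineSimilarityBinning(m = 3, τ = 1, nbins = 10)
information(Shannon(; base = 2), o, x) # "diversity entropy" of `x`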
ComplexityMeasures.Dispersion — Type

Dispersion(; c = 5, m = 2, τ = 1, check_unique = true)

An OutcomeSpace based on dispersion patterns, originally used by Rostaghi and Azami (2016) to compute the "dispersion entropy", which characterizes the complexity and irregularity of a time series.

Recommended parameter values (Li et al., 2019) are m ∈ [2, 3], τ = 1 for the embedding, and c ∈ [3, 4, …, 8] categories for the Gaussian symbol mapping.
Description
Assume we have a univariate time series $X = \{x_i\}_{i=1}^N$. First, this time series is encoded into a symbol timeseries $S$ using the Gaussian encoding GaussianCDFEncoding with empirical mean μ and empirical standard deviation σ (both determined from $X$), and c as given to Dispersion.

Then, $S$ is embedded into an $m$-dimensional time series, using an embedding lag of $\tau$, which yields a total of $N - (m - 1)\tau$ delay vectors $z_i$, or "dispersion patterns". Since each element of $z_i$ can take on c different values, and each delay vector has m entries, there are c^m possible dispersion patterns. This number is used for normalization when computing dispersion entropy.

The returned probabilities are simply the frequencies of the unique dispersion patterns present in $S$ (i.e., the UniqueElements of $S$).
Outcome space

The outcome space for Dispersion is the unique delay vectors whose elements are the symbols (integers) encoded by the Gaussian CDF, i.e., the unique elements of $S$.

Data requirements and parameters

The input must have more than one unique element for the Gaussian mapping to be well-defined. Li et al. (2019) recommend that x has at least 1000 data points.

If check_unique == true (default), then it is checked that the input has more than one unique value. If check_unique == false and the input only has one unique element, then an InexactError is thrown when trying to compute probabilities.
Each embedding vector is called a "dispersion pattern". Why? Let's consider the case when $m = 5$ and $c = 3$, and use some very imprecise terminology for illustration:
When $c = 3$, values clustering far below mean are in one group, values clustered around the mean are in one group, and values clustering far above the mean are in a third group. Then the embedding vector $[2, 2, 2, 2, 2]$ consists of values that are close together (close to the mean), so it represents a set of numbers that are not very spread out (less dispersed). The embedding vector $[1, 1, 2, 3, 3]$, however, represents numbers that are much more spread out (more dispersed), because the categories representing "outliers" both above and below the mean are represented, not only values close to the mean.
For a version of this estimator that can be used on high-dimensional arrays, see SpatialDispersion.

Implements

codify. Used for encoding inputs where ordering matters (e.g. time series).
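A small sketch of estimating dispersion-pattern probabilities:

using ComplexityMeasures
x = rand(2000) # comfortably above the recommended 1000 points
o = Dispersion(c = 4, m = 3, τ = 1)
probabilities(o, x) # frequencies of the dispersion patterns in `x`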
ComplexityMeasures.OrdinalPatterns — Type

OrdinalPatterns <: OutcomeSpace
OrdinalPatterns{m}(τ = 1, lt::Function = ComplexityMeasures.isless_rand)

An OutcomeSpace based on length-m ordinal permutation patterns, originally introduced in Bandt and Pompe (2002)'s paper on permutation entropy. Note that m is given as a type parameter, so that when it is a literal integer there are performance accelerations.
When passed to probabilities the output depends on the input data type:

- Univariate data. If applied to a univariate timeseries (AbstractVector), then the timeseries is first embedded using embedding delay τ and dimension m, resulting in embedding vectors $\{ \bf{x}_i \}_{i=1}^{N-(m-1)\tau}$. Then, for each $\bf{x}_i$, we find its permutation pattern $\pi_{i}$. Probabilities are then estimated as the frequencies of the encoded permutation symbols by using UniqueElements. When giving the resulting probabilities to information, the original permutation entropy is computed (Bandt and Pompe, 2002).
- Multivariate data. If applied to a D-dimensional StateSpaceSet, then no embedding is constructed, m must be equal to D and τ is ignored. Each vector $\bf{x}_i$ of the dataset is mapped directly to its permutation pattern $\pi_{i}$ by comparing the relative magnitudes of the elements of $\bf{x}_i$. Like above, probabilities are estimated as the frequencies of the permutation symbols. The resulting probabilities can be used to compute multivariate permutation entropy (He et al., 2016), although here we don't perform any further subdivision of the permutation patterns (as in Figure 3 of He et al. (2016)).
Internally, OrdinalPatterns uses the OrdinalPatternEncoding to represent ordinal patterns as integers for efficient computations.

See WeightedOrdinalPatterns and AmplitudeAwareOrdinalPatterns for estimators that not only consider ordinal (sorting) patterns, but also incorporate information about within-state-vector amplitudes. For a version of this estimator that can be used on spatial data, see SpatialOrdinalPatterns.

In Bandt and Pompe (2002), equal values are ordered after their order of appearance, but this can lead to erroneous temporal correlations, especially for data with low amplitude resolution (Zunino et al., 2017). Here, by default, if two values are equal, then one of them is randomly assigned as "the largest", using lt = ComplexityMeasures.isless_rand. To get the behaviour from Bandt and Pompe (2002), use lt = Base.isless.
Outcome space

The outcome space Ω for OrdinalPatterns is the set of length-m ordinal patterns (i.e. permutations) that can be formed by the integers 1, 2, …, m. There are factorial(m) such patterns.

For example, the outcome [2, 3, 1] corresponds to the ordinal pattern of having the smallest value in the second position, the next smallest value in the third position, and the next smallest, i.e. the largest value, in the first position. See also OrdinalPatternEncoding.
In-place symbolization

OrdinalPatterns also implements the in-place probabilities! for StateSpaceSet input (or embedded vector input) for reducing allocations in looping scenarios. The length of the pre-allocated symbol vector must be the length of the dataset. For example

using ComplexityMeasures
m, τ, N = 2, 1, 100
est = OrdinalPatterns{m}(τ)
x = StateSpaceSet(rand(N, m)) # some input dataset
πs_ts = zeros(Int, N) # length must match length of `x`
p = probabilities!(πs_ts, est, x)
ComplexityMeasures.BubbleSortSwaps — Type

BubbleSortSwaps <: CountBasedOutcomeSpace
BubbleSortSwaps(; m = 3, τ = 1)

The BubbleSortSwaps outcome space is based on Manis et al. (2017)'s paper on "bubble entropy".
Description
BubbleSortSwaps does the following:

- Embeds the input data using embedding dimension m and embedding lag τ.
- For each state vector in the embedding, counts how many swaps the bubble sort algorithm needs to sort it.

For counts_and_outcomes, we then define a distribution over the number of necessary swaps. This distribution can then be used to estimate probabilities using probabilities_and_outcomes, which again can be used to estimate any InformationMeasure. An example of how to compute the "Shannon bubble entropy" is given below.
Outcome space

The outcome_space for BubbleSortSwaps is the integers 0:N, where N = (m * (m - 1)) / 2 (the worst-case number of swaps). Hence, the number of total_outcomes is N + 1.
Implements

codify. Returns the number of swaps required for each embedded state vector.
Examples

With the BubbleSortSwaps outcome space, we can easily compute a "bubble entropy" inspired by Manis et al. (2017). Note: this is not actually a new entropy - it is just a new way of discretizing the input data. To reproduce the bubble entropy complexity measure from Manis et al. (2017), see BubbleEntropy.
using ComplexityMeasures
x = rand(100000)
o = BubbleSortSwaps(; m = 5) # 5-dimensional embedding vectors
information(Shannon(; base = 2), o, x)
# We can also compute any other "bubble quantity", for example the
# "Tsallis bubble extropy", with arbitrary probabilities estimators:
information(TsallisExtropy(), BayesianRegularization(), o, x)
ComplexityMeasures.ValueBinning — Type

ValueBinning(b::AbstractBinning) <: OutcomeSpace

An OutcomeSpace based on binning the values of the data as dictated by the binning scheme b and formally computing their histogram, i.e., the frequencies of points in the bins. An alias to this is VisitationFrequency. Available binnings are subtypes of AbstractBinning.

The ValueBinning estimator has a linearithmic time complexity (n log(n) for n = length(x)) and a linear space complexity (l for l = dimension(x)). This allows computation of probabilities (histograms) of high-dimensional datasets with small box sizes ε without memory overflow and with maximum performance. For performance reasons, the probabilities returned never contain 0s and are arbitrarily ordered.

ValueBinning(ϵ::Union{Real,Vector})

A convenience method that accepts the same input as RectangularBinning and initializes this binning directly.
Outcomes

The outcome space for ValueBinning is the unique bins constructed from b. Each bin is identified by its left (lowest-value) corner, because bins are always left-closed-right-open intervals [a, b). The bins are in data units, not in integer (cartesian index) units, and are returned as SVectors, i.e., the same type as the input data.

For convenience, outcome_space returns the outcomes in the same array format as the underlying binning (e.g., Matrix for 2D input).

For FixedRectangularBinning the outcome_space is well-defined from the binning, but for RectangularBinning input x is needed as well.

Implements

codify. Used for encoding inputs where ordering matters (e.g. time series).
ComplexityMeasures.RectangularBinning — Type

RectangularBinning(ϵ, precise = false) <: AbstractBinning

Rectangular box partition of state space using the scheme ϵ, deducing the histogram extent and bin width from the input data.

RectangularBinning is a convenience struct. It is re-cast into FixedRectangularBinning once the data are provided, so see that docstring for info on the bin calculation and the meaning of precise.
Binning instructions are deduced from the type of ϵ as follows:

- ϵ::Int divides each coordinate axis into ϵ equal-length intervals that cover all data.
- ϵ::Float64 divides each coordinate axis into intervals of fixed size ϵ, starting from the axis minima until the data is completely covered by boxes.
- ϵ::Vector{Int} divides the i-th coordinate axis into ϵ[i] equal-length intervals that cover all data.
- ϵ::Vector{Float64} divides the i-th coordinate axis into intervals of fixed size ϵ[i], starting from the axis minima until the data is completely covered by boxes.

RectangularBinning ensures all input data are covered by extending the created ranges if need be.
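The four forms side by side:

using ComplexityMeasures
RectangularBinning(5)          # 5 equal-length intervals per axis
RectangularBinning(0.2)        # intervals of fixed width 0.2 on every axis
RectangularBinning([4, 6])     # 4 intervals on axis 1, 6 on axis 2
RectangularBinning([0.2, 0.5]) # widths 0.2 and 0.5 on axes 1 and 2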
ComplexityMeasures.FixedRectangularBinning — Type

FixedRectangularBinning <: AbstractBinning
FixedRectangularBinning(ranges::Tuple{<:AbstractRange...}, precise = false)

Rectangular box partition of state space where the partition along each dimension is explicitly given by each range in ranges, which is a tuple of AbstractRange subtypes. Typically, each range is the output of the range Base function, e.g., ranges = (0:0.1:1, range(0, 1; length = 101), range(2.1, 3.2; step = 0.33)). All ranges must be sorted.

The optional second argument precise dictates whether Julia Base's TwicePrecision is used when searching where a point falls into the range. Useful for edge cases of points being almost exactly on the bin edges, but it is exactly four times as slow, so by default it is false.
Points falling outside the partition do not contribute to probabilities. Bins are always left-closed-right-open: [a, b). This means that the last value of each of the ranges dictates the last right-closing value. This value does not belong to the histogram! E.g., if given a range r = range(0, 1; length = 11), with r[end] = 1, the value 1 is outside the partition and would not contribute to the probability of the last bin (here [0.9, 1))!

Equivalently, the size of the histogram is histsize = map(r -> length(r)-1, ranges)!

FixedRectangularBinning leads to a well-defined outcome space without knowledge of input data; see ValueBinning.
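A small sketch, reusing the range from the example above:

using ComplexityMeasures
r = range(0, 1; length = 11) # 11 edges → 10 bins per axis
binning = FixedRectangularBinning((r, r))
# The histogram is 10×10; the value 1.0 closes the last bin [0.9, 1) but is not in it.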
Example: encoding columns (one variable at a time)

Some OutcomeSpaces dictate a sliding window which has the width of one element when used with CodifyVariables. ValueBinning is such an outcome space.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
o = ValueBinning(3)
cx = codify(CodifyVariables(o), x)
100-element Vector{Int64}:
2
2
3
1
2
2
3
3
3
3
⋮
1
3
1
3
2
1
3
2
3
We can verify that ValueBinning preserves the cardinality of the input dataset.
length(x) == length(cx)
true
Other outcome spaces such as Dispersion or OrdinalPatterns do not preserve the cardinality of the input dataset when used with CodifyVariables. This is because when they are applied in a sliding window, they compress sliding windows consisting of potentially multiple points into single integers. This means that some points at the end of each input variable are lost. For example, with OrdinalPatterns, the number of encoded points decreases with the embedding parameter m.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
o = OrdinalPatterns(m = 3)
cx = codify(CodifyVariables(o), x)
98-element Vector{Int64}:
3
5
4
1
1
1
5
6
6
6
⋮
5
3
5
4
2
6
3
2
4
We can simultaneously encode multiple variables/columns of a StateSpaceSet using the same outcome space, as long as the operation results in the same number of encoded data points for each column.
using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
y = rand(rng, 100)
o = OrdinalPatterns(m = 3)
# Alternatively provide a tuple of input time series: codify(CodifyVariables(o), (x, y))
cx, cy = codify(CodifyVariables(o), StateSpaceSet(x, y))
[cx cy]
98×2 Matrix{Int64}:
3 1
5 5
4 6
1 4
1 2
1 3
5 1
6 5
6 4
6 1
⋮
5 5
3 4
5 1
4 5
2 3
6 2
3 4
2 5
4 4