Encodings
Encodings API
Some probability estimators first "encode" input data into an intermediate representation indexed by the positive integers. This intermediate representation is called an "encoding".
The encodings API is defined by:
ComplexityMeasures.Encoding — Type

Encoding

The supertype for all encoding schemes. Encodings always encode elements of input data into the positive integers. The encoding API is defined by the functions encode and decode. Some probability estimators utilize encodings internally.
Currently available encodings are OrdinalPatternEncoding, GaussianCDFEncoding, and RectangularBinEncoding, documented below.
ComplexityMeasures.encode — Function

encode(c::Encoding, χ) -> i::Int

Encode an element χ ∈ x of input data x (the data given to probabilities) using encoding c.

The special value -1 is reserved as a return value for inappropriate elements χ that cannot be encoded according to c.
ComplexityMeasures.decode — Function

decode(c::Encoding, i::Int) -> ω

Decode an encoded element i into the outcome ω ∈ Ω it corresponds to. Ω is the outcome_space of a probabilities estimator that uses encoding c.
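The encode/decode contract can be sketched with a hypothetical toy encoding. EvenOddEncoding, encode_toy, and decode_toy below are illustrative names only, not part of the package's API:

```julia
# Hypothetical toy encoding illustrating the contract described above:
# encode maps data elements to positive integers (or -1 when an element
# cannot be encoded), and decode maps integers back to outcomes in Ω.
struct EvenOddEncoding end  # outcome space Ω = (:even, :odd)

encode_toy(::EvenOddEncoding, χ) = χ isa Integer ? (iseven(χ) ? 1 : 2) : -1
decode_toy(::EvenOddEncoding, i::Int) = (:even, :odd)[i]

c = EvenOddEncoding()
println(encode_toy(c, 4))    # 1
println(decode_toy(c, 2))    # :odd
println(encode_toy(c, 1.5))  # -1: a non-integer cannot be encoded here
```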
Available encodings
ComplexityMeasures.OrdinalPatternEncoding — Type

OrdinalPatternEncoding <: Encoding
OrdinalPatternEncoding(m::Int, lt = ComplexityMeasures.isless_rand)

An encoding scheme that encodes length-m vectors into their permutation/ordinal patterns, and then into the integers based on the Lehmer code. It is used by SymbolicPermutation and similar estimators; see their documentation for a description of the outcome space.

Description

The Lehmer code, as implemented here, is a bijection between the set of factorial(m) possible permutations of a length-m sequence and the integers 1, 2, …, factorial(m). The encoding step uses algorithm 1 in Berger et al. (2019)[Berger2019], which is highly optimized. The decoding step is much slower due to missing optimizations (pull requests are welcome!).
Example
julia> using ComplexityMeasures
julia> x = [4.0, 1.0, 9.0];
julia> c = OrdinalPatternEncoding(3);
julia> encode(c, x)
3
julia> decode(c, 3)
3-element SVector{3, Int64} with indices SOneTo(3):
2
1
3
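The Lehmer-code bijection from the Description can be sketched in plain Julia. lehmer_index below is a hypothetical helper using the naive O(m²) counting definition, not the optimized algorithm 1 of Berger et al. that the package uses:

```julia
# Naive Lehmer code: for each position i, count later entries that are
# smaller than perm[i], weight the count by factorial(m - i), and shift
# to 1-based indexing so indices run over 1, 2, …, factorial(m).
function lehmer_index(perm::AbstractVector{<:Integer})
    m = length(perm)
    n = 0
    for i in 1:m
        l = count(j -> perm[j] < perm[i], i+1:m)
        n += l * factorial(m - i)
    end
    return n + 1
end

x = [4.0, 1.0, 9.0]
println(lehmer_index(sortperm(x)))  # 3, matching encode(c, x) above
println(lehmer_index([1, 2, 3]))    # 1 (identity permutation)
println(lehmer_index([3, 2, 1]))    # 6 = factorial(3) (reversed)
```

Running lehmer_index over all six permutations of 1:3 yields each of 1, …, 6 exactly once, which is the bijection property.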
ComplexityMeasures.GaussianCDFEncoding — Type

GaussianCDFEncoding <: Encoding
GaussianCDFEncoding(; μ, σ, c::Int = 3)

An encoding scheme that encodes a scalar value into one of the integers sᵢ ∈ [1, 2, …, c] based on the normal cumulative distribution function (NCDF), and decodes the sᵢ into subintervals of [0, 1] (with some loss of information).

Notice that the decoding step does not yield an element of any outcome space of the estimators that use GaussianCDFEncoding internally, such as Dispersion. That is because these estimators additionally delay-embed the encoded data.
Description

GaussianCDFEncoding first maps an input scalar $x$ to a new real number $y \in [0, 1]$ by using the normal cumulative distribution function (CDF) with the given mean μ and standard deviation σ, according to the map

\[x \to y : y = \dfrac{1}{\sigma \sqrt{2 \pi}} \int_{-\infty}^{x} e^{-(t - \mu)^2/(2 \sigma^2)} \, dt.\]

Next, the interval [0, 1] is equidistantly binned into c subintervals enumerated $1, 2, \ldots, c$, and $y$ is mapped to one of these integers using the map $y \to z : z = \text{floor}(y \cdot c) + 1$ (with $z = c$ when $y = 1$).

Because of the floor operation, some information is lost, so when used with decode, each decoded sᵢ is mapped to a subinterval of [0, 1].
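The two steps above can be sketched in self-contained Julia. gaussian_cdf, encode_gaussian, and decode_gaussian are hypothetical helpers, not the package's implementation; the CDF integral is approximated with a trapezoidal rule instead of an erf routine, and bins are assumed equal-width subintervals of length 1/c, consistent with the decode example below:

```julia
using Statistics  # stdlib: mean, std

# Normal CDF via trapezoidal integration of the density, starting from
# μ - 8σ, which is effectively -∞ for the normal density.
function gaussian_cdf(x, μ, σ; n = 20_000)
    ts = range(μ - 8σ, x; length = n)
    f(t) = exp(-(t - μ)^2 / (2σ^2)) / (σ * sqrt(2π))
    return step(ts) * (sum(f, ts) - (f(first(ts)) + f(last(ts))) / 2)
end

# Encoding step: y ∈ [0, 1] mapped to a bin index sᵢ ∈ 1:c.
encode_gaussian(x, μ, σ, c) = min(floor(Int, gaussian_cdf(x, μ, σ) * c) + 1, c)

# Decoding step: sᵢ mapped back to its subinterval of [0, 1].
decode_gaussian(i, c) = ((i - 1) / c, i / c)

x = [0.1, 0.4, 0.7, -2.1, 8.0]
μ, σ = mean(x), std(x)
println([encode_gaussian(xi, μ, σ, 5) for xi in x])  # [2, 2, 3, 1, 5]
println(decode_gaussian(3, 5))                       # (0.4, 0.6)
```

The printed values match the es and decode outputs in the Examples section below.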
Examples
julia> using ComplexityMeasures, Statistics
julia> x = [0.1, 0.4, 0.7, -2.1, 8.0];
julia> μ, σ = mean(x), std(x); encoding = GaussianCDFEncoding(; μ, σ, c = 5);
julia> es = encode.(Ref(encoding), x)
5-element Vector{Int64}:
2
2
3
1
5
julia> decode(encoding, 3)
2-element SVector{2, Float64} with indices SOneTo(2):
0.4
0.6
ComplexityMeasures.RectangularBinEncoding — Type

RectangularBinEncoding <: Encoding
RectangularBinEncoding(binning::RectangularBinning, x; n_eps = 2)
RectangularBinEncoding(binning::FixedRectangularBinning; n_eps = 2)

Struct used in outcomes to map points of x into their respective bins. It finds the minima along each dimension and computes appropriate edge lengths for each dimension of x, given a rectangular binning.

The second signature does not need x because (1) the binning is fixed, so the size of x doesn't matter, and (2) the binning already contains the dimensionality information, as ϵmin/max is an NTuple.

Due to roundoff error when computing bin edges, a small tolerance n_eps * eps() is added to the bin widths to ensure the correct number of bins is produced.
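The widened-edge-length idea can be sketched as follows. bin_index is a hypothetical helper, not the package's implementation; it assumes a fixed per-dimension range split into nbins bins:

```julia
# Per-dimension bin index: subtract the minima, divide by an edge length
# widened by n_eps * eps(width), and floor. The tolerance keeps the
# maximum point inside bin `nbins` instead of overflowing into nbins + 1.
function bin_index(point, mins, maxs, nbins; n_eps = 2)
    widths = (maxs .- mins) ./ nbins .+ n_eps .* eps.(maxs .- mins)
    return floor.(Int, (point .- mins) ./ widths) .+ 1
end

mins, maxs = (0.0, 0.0), (1.0, 1.0)
println(bin_index((0.5, 0.99), mins, maxs, 5))  # (3, 5)
println(bin_index((1.0, 1.0), mins, maxs, 5))   # (5, 5); without the
# tolerance, floor(1.0 / 0.2) + 1 would give 6, one bin too many
```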
See also: RectangularBinning, FixedRectangularBinning.
[Berger2019]: Berger et al. "Teaching Ordinal Patterns to a Computer: Efficient Encoding Algorithms Based on the Lehmer Code." Entropy 21.10 (2019): 1023.