Encodings

Encodings API

Some probability estimators first "encode" input data into an intermediate representation indexed by the positive integers. This intermediate representation is called an "encoding".

The encodings API is defined by:

Encoding
encode
decode

ComplexityMeasures.Encoding — Type

Encoding

The supertype for all encoding schemes. Encodings always encode elements of input data into the positive integers. The encoding API is defined by the functions encode and decode. Some probability estimators utilize encodings internally.

Current available encodings are:

OrdinalPatternEncoding.
GaussianCDFEncoding.
RectangularBinEncoding.

source

ComplexityMeasures.encode — Function

encode(c::Encoding, χ) -> i::Int

Encode an element χ ∈ x of input data x (those given to probabilities) using encoding c.

The special value of -1 is reserved as a return value for inappropriate elements χ that cannot be encoded according to c.

source

ComplexityMeasures.decode — Function

decode(c::Encoding, i::Int) -> ω

Decode an encoded element i into the outcome ω ∈ Ω it corresponds to.

Ω is the outcome_space of a probabilities estimator that uses encoding c.

source

Available encodings

ComplexityMeasures.OrdinalPatternEncoding — Type

OrdinalPatternEncoding <: Encoding
OrdinalPatternEncoding(m::Int, lt = ComplexityMeasures.isless_rand)

An encoding scheme that encodes length-m vectors into their permutation/ordinal patterns and then into the integers based on the Lehmer code. It is used by SymbolicPermutation and similar estimators, see that for a description of the outcome space.

The ordinal/permutation pattern of a vector χ is simply sortperm(χ), which gives the indices that would sort χ in ascending order.

Description

The Lehmer code, as implemented here, is a bijection between the set of factorial(m) possible permutations for a length-m sequence, and the integers 1, 2, …, factorial(m). The encoding step uses algorithm 1 in Berger et al. (2019)^[Berger2019], which is highly optimized. The decoding step is much slower due to missing optimizations (pull requests welcomed!).

Example

julia> using ComplexityMeasures

julia> χ = [4.0, 1.0, 9.0];

julia> c = OrdinalPatternEncoding(3);

julia> i = encode(c, χ)
3

julia> decode(c, i)
3-element SVector{3, Int64} with indices SOneTo(3):
 2
 1
 3

If you want to encode something that is already a permutation pattern, then you can use the non-exported permutation_to_integer function.

source

ComplexityMeasures.GaussianCDFEncoding — Type

GaussianCDFEncoding <: Encoding
GaussianCDFEncoding(; μ, σ, c::Int = 3)

An encoding scheme that encodes a scalar value into one of the integers sᵢ ∈ [1, 2, …, c] based on the normal cumulative distribution function (NCDF), and decodes the sᵢ into subintervals of [0, 1] (with some loss of information).

Notice that the decoding step does not yield an element of any outcome space of the estimators that use GaussianCDFEncoding internally, such as Dispersion. That is because these estimators additionally delay embed the encoded data.

Description

GaussianCDFEncoding first maps an input point $x$ (scalar) to a new real number $y_ \in [0, 1]$ by using the normal cumulative distribution function (CDF) with the given mean μ and standard deviation σ, according to the map

\[x \to y : y = \dfrac{1}{ \sigma \sqrt{2 \pi}} \int_{-\infty}^{x} e^{(-(x - \mu)^2)/(2 \sigma^2)} dx.\]

Next, the interval [0, 1] is equidistantly binned and enumerated $1, 2, \ldots, c$, and $y$ is linearly mapped to one of these integers using the linear map $y \to z : z = \text{floor}(y(c-1)) + 1$.

Because of the floor operation, some information is lost, so when used with decode, each decoded sᵢ is mapped to a subinterval of [0, 1].

Examples

julia> using ComplexityMeasures, Statistics

julia> x = [0.1, 0.4, 0.7, -2.1, 8.0];

julia> μ, σ = mean(x), std(x); encoding = GaussianCDFEncoding(; μ, σ, c = 5)

julia> es = encode.(Ref(encoding), x)
5-element Vector{Int64}:
 2
 2
 3
 1
 5

julia> decode(encoding, 3)
2-element SVector{2, Float64} with indices SOneTo(2):
 0.4
 0.6

source

ComplexityMeasures.RectangularBinEncoding — Type

RectangularBinEncoding <: Encoding
RectangularBinEncoding(binning::RectangularBinning, x)
RectangularBinEncoding(binning::FixedRectangularBinning)

An encoding scheme that encodes points χ ∈ x into their histogram bins.

The first call signature simply initializes a FixedRectangularBinning and then calls the second call signature.

See FixedRectangularBinning for info on mapping points to bins.

source

Berger2019Berger et al. "Teaching Ordinal Patterns to a Computer: Efficient Encoding Algorithms Based on the Lehmer Code." Entropy 21.10 (2019): 1023.