Encodings
Encodings API
Some probability estimators first "encode" input data into an intermediate representation indexed by the positive integers. This intermediate representation is called an "encoding".
The encodings API is defined by:
ComplexityMeasures.Encoding — Type

Encoding

The supertype for all encoding schemes. Encodings always encode elements of input data into the positive integers. The encoding API is defined by the functions encode and decode. Some probability estimators utilize encodings internally.
Currently available encodings are OrdinalPatternEncoding, GaussianCDFEncoding, and RectangularBinEncoding, documented below.
ComplexityMeasures.encode — Function

encode(c::Encoding, χ) -> i::Int

Encode an element χ ∈ x of input data x (the data given to probabilities) using encoding c.

The special value -1 is reserved as a return value for inappropriate elements χ that cannot be encoded according to c.
ComplexityMeasures.decode — Function

decode(c::Encoding, i::Int) -> ω

Decode an encoded element i into the outcome ω ∈ Ω it corresponds to. Ω is the outcome_space of a probabilities estimator that uses encoding c.
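The encode/decode contract can be sketched with a hypothetical toy encoding. EvenOddEncoding, encode_toy, and decode_toy below are illustrative names only, not part of the package's API:

```julia
# Hypothetical toy encoding illustrating the contract described above:
# encode maps data elements to positive integers (or -1 when an element
# cannot be encoded), and decode maps integers back to outcomes in Ω.
struct EvenOddEncoding end  # outcome space Ω = (:even, :odd)

encode_toy(::EvenOddEncoding, χ) = χ isa Integer ? (iseven(χ) ? 1 : 2) : -1
decode_toy(::EvenOddEncoding, i::Int) = (:even, :odd)[i]

c = EvenOddEncoding()
println(encode_toy(c, 4))    # 1
println(decode_toy(c, 2))    # :odd
println(encode_toy(c, 1.5))  # -1: a non-integer cannot be encoded here
```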
Available encodings
ComplexityMeasures.OrdinalPatternEncoding — Type

OrdinalPatternEncoding <: Encoding
OrdinalPatternEncoding(m::Int, lt = ComplexityMeasures.isless_rand)

An encoding scheme that encodes length-m vectors into their permutation/ordinal patterns, and then into the integers based on the Lehmer code. It is used by SymbolicPermutation and similar estimators; see their documentation for a description of the outcome space.

Description

The Lehmer code, as implemented here, is a bijection between the set of factorial(m) possible permutations of a length-m sequence and the integers 1, 2, …, factorial(m). The encoding step uses algorithm 1 in Berger et al. (2019)[Berger2019], which is highly optimized. The decoding step is much slower due to missing optimizations (pull requests are welcome!).
Example
julia> using ComplexityMeasures
julia> x = [4.0, 1.0, 9.0];
julia> c = OrdinalPatternEncoding(3);
julia> encode(c, x)
3
julia> decode(c, 3)
3-element SVector{3, Int64} with indices SOneTo(3):
2
1
3
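The Lehmer-code bijection from the Description can be sketched in plain Julia. lehmer_index below is a hypothetical helper using the naive O(m²) counting definition, not the optimized algorithm 1 of Berger et al. that the package uses:

```julia
# Naive Lehmer code: for each position i, count later entries that are
# smaller than perm[i], weight the count by factorial(m - i), and shift
# to 1-based indexing so indices run over 1, 2, …, factorial(m).
function lehmer_index(perm::AbstractVector{<:Integer})
    m = length(perm)
    n = 0
    for i in 1:m
        l = count(j -> perm[j] < perm[i], i+1:m)
        n += l * factorial(m - i)
    end
    return n + 1
end

x = [4.0, 1.0, 9.0]
println(lehmer_index(sortperm(x)))  # 3, matching encode(c, x) above
println(lehmer_index([1, 2, 3]))    # 1 (identity permutation)
println(lehmer_index([3, 2, 1]))    # 6 = factorial(3) (reversed)
```

Running lehmer_index over all six permutations of 1:3 yields each of 1, …, 6 exactly once, which is the bijection property.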
ComplexityMeasures.GaussianCDFEncoding — Type

GaussianCDFEncoding <: Encoding
GaussianCDFEncoding(; μ, σ, c::Int = 3)

An encoding scheme that encodes a scalar value into one of the integers sᵢ ∈ [1, 2, …, c] based on the normal cumulative distribution function (NCDF), and decodes the sᵢ into subintervals of [0, 1] (with some loss of information).

Notice that the decoding step does not yield an element of any outcome space of the estimators that use GaussianCDFEncoding internally, such as Dispersion. That is because these estimators additionally delay-embed the encoded data.
Description

GaussianCDFEncoding first maps an input scalar $x$ to a new real number $y \in [0, 1]$ by using the normal cumulative distribution function (CDF) with the given mean μ and standard deviation σ, according to the map

\[x \to y : y = \dfrac{1}{\sigma \sqrt{2 \pi}} \int_{-\infty}^{x} e^{-(t - \mu)^2/(2 \sigma^2)} \, dt.\]

Next, the interval [0, 1] is equidistantly binned into c subintervals enumerated $1, 2, \ldots, c$, and $y$ is mapped to one of these integers using the map $y \to z : z = \text{floor}(y \cdot c) + 1$ (with $z = c$ when $y = 1$).

Because of the floor operation, some information is lost, so when used with decode, each decoded sᵢ is mapped to a subinterval of [0, 1].
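The two steps above can be sketched in self-contained Julia. gaussian_cdf, encode_gaussian, and decode_gaussian are hypothetical helpers, not the package's implementation; the CDF integral is approximated with a trapezoidal rule instead of an erf routine, and bins are assumed equal-width subintervals of length 1/c, consistent with the decode example below:

```julia
using Statistics  # stdlib: mean, std

# Normal CDF via trapezoidal integration of the density, starting from
# μ - 8σ, which is effectively -∞ for the normal density.
function gaussian_cdf(x, μ, σ; n = 20_000)
    ts = range(μ - 8σ, x; length = n)
    f(t) = exp(-(t - μ)^2 / (2σ^2)) / (σ * sqrt(2π))
    return step(ts) * (sum(f, ts) - (f(first(ts)) + f(last(ts))) / 2)
end

# Encoding step: y ∈ [0, 1] mapped to a bin index sᵢ ∈ 1:c.
encode_gaussian(x, μ, σ, c) = min(floor(Int, gaussian_cdf(x, μ, σ) * c) + 1, c)

# Decoding step: sᵢ mapped back to its subinterval of [0, 1].
decode_gaussian(i, c) = ((i - 1) / c, i / c)

x = [0.1, 0.4, 0.7, -2.1, 8.0]
μ, σ = mean(x), std(x)
println([encode_gaussian(xi, μ, σ, 5) for xi in x])  # [2, 2, 3, 1, 5]
println(decode_gaussian(3, 5))                       # (0.4, 0.6)
```

The printed values match the es and decode outputs in the Examples section below.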
Examples
julia> using ComplexityMeasures, Statistics
julia> x = [0.1, 0.4, 0.7, -2.1, 8.0];
julia> μ, σ = mean(x), std(x); encoding = GaussianCDFEncoding(; μ, σ, c = 5);
julia> es = encode.(Ref(encoding), x)
5-element Vector{Int64}:
2
2
3
1
5
julia> decode(encoding, 3)
2-element SVector{2, Float64} with indices SOneTo(2):
0.4
0.6
ComplexityMeasures.RectangularBinEncoding — Type

RectangularBinEncoding <: Encoding
RectangularBinEncoding(binning::RectangularBinning, x; n_eps = 2)
RectangularBinEncoding(binning::FixedRectangularBinning; n_eps = 2)

Struct used in outcomes to map points of x into their respective bins. It finds the minima along each dimension and computes appropriate edge lengths for each dimension of x, given a rectangular binning.

The second signature does not need x because (1) the binning is fixed, so the size of x doesn't matter, and (2) the binning already contains the dimensionality information, as ϵmin/max is an NTuple.

Due to roundoff error when computing bin edges, a small tolerance n_eps * eps() is added to the bin widths to ensure the correct number of bins is produced.
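The widened-edge-length idea can be sketched as follows. bin_index is a hypothetical helper, not the package's implementation; it assumes a fixed per-dimension range split into nbins bins:

```julia
# Per-dimension bin index: subtract the minima, divide by an edge length
# widened by n_eps * eps(width), and floor. The tolerance keeps the
# maximum point inside bin `nbins` instead of overflowing into nbins + 1.
function bin_index(point, mins, maxs, nbins; n_eps = 2)
    widths = (maxs .- mins) ./ nbins .+ n_eps .* eps.(maxs .- mins)
    return floor.(Int, (point .- mins) ./ widths) .+ 1
end

mins, maxs = (0.0, 0.0), (1.0, 1.0)
println(bin_index((0.5, 0.99), mins, maxs, 5))  # (3, 5)
println(bin_index((1.0, 1.0), mins, maxs, 5))   # (5, 5); without the
# tolerance, floor(1.0 / 0.2) + 1 would give 6, one bin too many
```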
See also: RectangularBinning, FixedRectangularBinning.
[Berger2019]: Berger et al. "Teaching Ordinal Patterns to a Computer: Efficient Encoding Algorithms Based on the Lehmer Code." Entropy 21.10 (2019): 1023.