Encodings
Encodings API
Some probability estimators first "encode" input data into an intermediate representation indexed by the positive integers. This intermediate representation is called an "encoding".
The encodings API is defined by:
ComplexityMeasures.Encoding
— TypeEncoding
The supertype for all encoding schemes. Encodings always encode elements of input data into the positive integers. The encoding API is defined by the functions encode
and decode
. Some probability estimators utilize encodings internally.
Current available encodings are:
ComplexityMeasures.encode
— Functionencode(c::Encoding, χ) -> i::Int
Encode an element χ ∈ x
of input data x
(those given to probabilities
) using encoding c
.
The special value of -1
is reserved as a return value for inappropriate elements χ
that cannot be encoded according to c
.
ComplexityMeasures.decode
— Functiondecode(c::Encoding, i::Int) -> ω
Decode an encoded element i
into the outcome ω ∈ Ω
it corresponds to.
Ω
is the outcome_space
of a probabilities estimator that uses encoding c
.
Available encodings
ComplexityMeasures.OrdinalPatternEncoding
— TypeOrdinalPatternEncoding <: Encoding
OrdinalPatternEncoding(m::Int, lt = ComplexityMeasures.isless_rand)
An encoding scheme that encode
s length-m
vectors into their permutation/ordinal patterns and then into the integers based on the Lehmer code. It is used by SymbolicPermutation
and similar estimators, see that for a description of the outcome space.
The ordinal/permutation pattern of a vector χ
is simply sortperm(χ)
, which gives the indices that would sort χ
in ascending order.
Description
The Lehmer code, as implemented here, is a bijection between the set of factorial(m)
possible permutations for a length-m
sequence, and the integers 1, 2, …, factorial(m)
. The encoding step uses algorithm 1 in Berger et al. (2019)[Berger2019], which is highly optimized. The decoding step is much slower due to missing optimizations (pull requests welcomed!).
Example
julia> using ComplexityMeasures
julia> χ = [4.0, 1.0, 9.0];
julia> c = OrdinalPatternEncoding(3);
julia> i = encode(c, χ)
3
julia> decode(c, i)
3-element SVector{3, Int64} with indices SOneTo(3):
2
1
3
If you want to encode something that is already a permutation pattern, then you can use the non-exported permutation_to_integer
function.
ComplexityMeasures.GaussianCDFEncoding
— TypeGaussianCDFEncoding <: Encoding
GaussianCDFEncoding(; μ, σ, c::Int = 3)
An encoding scheme that encode
s a scalar value into one of the integers sᵢ ∈ [1, 2, …, c]
based on the normal cumulative distribution function (NCDF), and decode
s the sᵢ
into subintervals of [0, 1]
(with some loss of information).
Notice that the decoding step does not yield an element of any outcome space of the estimators that use GaussianCDFEncoding
internally, such as Dispersion
. That is because these estimators additionally delay embed the encoded data.
Description
GaussianCDFEncoding
first maps an input point $x$ (scalar) to a new real number $y_ \in [0, 1]$ by using the normal cumulative distribution function (CDF) with the given mean μ
and standard deviation σ
, according to the map
\[x \to y : y = \dfrac{1}{ \sigma \sqrt{2 \pi}} \int_{-\infty}^{x} e^{(-(x - \mu)^2)/(2 \sigma^2)} dx.\]
Next, the interval [0, 1]
is equidistantly binned and enumerated $1, 2, \ldots, c$, and $y$ is linearly mapped to one of these integers using the linear map $y \to z : z = \text{floor}(y(c-1)) + 1$.
Because of the floor operation, some information is lost, so when used with decode
, each decoded sᵢ
is mapped to a subinterval of [0, 1]
.
Examples
julia> using ComplexityMeasures, Statistics
julia> x = [0.1, 0.4, 0.7, -2.1, 8.0];
julia> μ, σ = mean(x), std(x); encoding = GaussianCDFEncoding(; μ, σ, c = 5)
julia> es = encode.(Ref(encoding), x)
5-element Vector{Int64}:
2
2
3
1
5
julia> decode(encoding, 3)
2-element SVector{2, Float64} with indices SOneTo(2):
0.4
0.6
ComplexityMeasures.RectangularBinEncoding
— TypeRectangularBinEncoding <: Encoding
RectangularBinEncoding(binning::RectangularBinning, x)
RectangularBinEncoding(binning::FixedRectangularBinning)
An encoding scheme that encode
s points χ ∈ x
into their histogram bins.
The first call signature simply initializes a FixedRectangularBinning
and then calls the second call signature.
See FixedRectangularBinning
for info on mapping points to bins.
- Berger2019Berger et al. "Teaching Ordinal Patterns to a Computer: Efficient Encoding Algorithms Based on the Lehmer Code." Entropy 21.10 (2019): 1023.