Multivariate counts and probabilities API
To estimate counts and probabilities, Associations.jl extends the single-variable machinery in ComplexityMeasures.jl to multiple variables. See the following methods:
ComplexityMeasures.counts — Method
counts(o::UniqueElements, x₁, x₂, ..., xₙ) → Counts{N}
counts(encoding::CodifyPoints, x₁, x₂, ..., xₙ) → Counts{N}
counts(encoding::CodifyVariables, x₁, x₂, ..., xₙ) → Counts{N}
Construct an N-dimensional contingency table from the input iterables x₁, x₂, ..., xₙ, which must all have the same length: length(x₁) == length(x₂) == ⋯ == length(xₙ).
If x₁, x₂, ..., xₙ are already discrete, then use UniqueElements as the first argument to directly construct the joint contingency table.
If x₁, x₂, ..., xₙ need to be discretized, provide one of the following as the first argument:
CodifyPoints (encodes every point in each of the input variables xᵢ individually)
CodifyVariables (encodes every xᵢ individually using a sliding-window encoding). NB: If using different OutcomeSpaces for the different xᵢ, then total_outcomes must be the same for every outcome space.
Examples
using Associations
# Discretizing some non-discrete data using a sliding-window encoding for each variable
x, y = rand(100), rand(100)
c = CodifyVariables(OrdinalPatterns(m = 4))
counts(c, x, y)
# Discretizing the data by binning each individual data point
binning = RectangularBinning(3)
encoding = RectangularBinEncoding(binning, [x; y]) # give input values to ensure binning covers all data
c = CodifyPoints(encoding)
counts(c, x, y)
# Counts table for already discrete data
n = 50 # all variables must have the same number of elements
x = rand(["dog", "cat", "mouse"], n)
y = rand(1:3, n)
z = rand([(1, 2), (2, 1)], n)
counts(UniqueElements(), x, y, z)
See also: CodifyPoints, CodifyVariables, UniqueElements, OutcomeSpace, probabilities.
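To make the notion of a joint contingency table concrete, here is a minimal base-Julia sketch of what counting amounts to for two already-discrete variables. It does not use Associations.jl: the helper name joint_counts is hypothetical, and sorting the outcomes is an assumption of this sketch (the library may order outcomes differently).

```julia
# Hypothetical helper (not part of Associations.jl): count how often each
# pair (x[i], y[i]) occurs and store the result in a 2-dimensional table.
function joint_counts(x, y)
    length(x) == length(y) || throw(ArgumentError("inputs must have equal length"))
    ox, oy = sort(unique(x)), sort(unique(y))      # outcomes along each dimension
    ix = Dict(o => i for (i, o) in enumerate(ox))  # outcome => row index
    iy = Dict(o => j for (j, o) in enumerate(oy))  # outcome => column index
    table = zeros(Int, length(ox), length(oy))
    for (a, b) in zip(x, y)
        table[ix[a], iy[b]] += 1
    end
    return table, ox, oy
end

x = ["dog", "cat", "dog", "mouse", "cat", "dog"]
y = [1, 2, 1, 3, 2, 2]
table, ox, oy = joint_counts(x, y)
```

Each cell table[i, j] holds the number of samples for which x equals ox[i] and y equals oy[j], so the cells sum to the number of samples.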
ComplexityMeasures.probabilities — Method
probabilities(o::UniqueElements, x₁, x₂, ..., xₙ) → Probabilities{N}
probabilities(encoding::CodifyPoints, x₁, x₂, ..., xₙ) → Probabilities{N}
probabilities(encoding::CodifyVariables, x₁, x₂, ..., xₙ) → Probabilities{N}
Construct an N-dimensional Probabilities array from the input iterables x₁, x₂, ..., xₙ, which must all have the same length: length(x₁) == length(x₂) == ⋯ == length(xₙ).
Description
Probabilities are computed by first constructing a joint contingency matrix in the form of a Counts instance.
If x₁, x₂, ..., xₙ are already discrete, then use UniqueElements as the first argument to directly construct the joint contingency table.
If x₁, x₂, ..., xₙ need to be discretized, provide one of the following as the first argument:
CodifyPoints (encodes every point in each of the input variables xᵢ individually)
CodifyVariables (encodes every xᵢ individually using a sliding-window encoding).
Examples
using Associations
# Discretizing some non-discrete data using a sliding-window encoding for each variable
x, y = rand(100), rand(100)
c = CodifyVariables(OrdinalPatterns(m = 4))
probabilities(c, x, y)
# Discretizing the data by binning each individual data point
binning = RectangularBinning(3)
encoding = RectangularBinEncoding(binning, [x; y]) # give input values to ensure binning covers all data
c = CodifyPoints(encoding)
probabilities(c, x, y)
# Joint probabilities for already discretized data
n = 50 # all variables must have the same number of elements
x = rand(["dog", "cat", "mouse"], n)
y = rand(1:3, n)
z = rand([(1, 2), (2, 1)], n)
probabilities(UniqueElements(), x, y, z)
See also: CodifyPoints, CodifyVariables, UniqueElements, OutcomeSpace.
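As the description above says, probabilities are derived from a joint contingency table. The following base-Julia sketch illustrates that step, assuming plain relative-frequency normalization. The variable names are illustrative and the table is made-up toy data, not library output.

```julia
# A small joint contingency table (rows: outcomes of x, columns: outcomes of y).
counts_table = [0 2 0;
                2 1 0;
                0 0 1]

# Joint probabilities as relative frequencies: divide every cell by the
# total number of observations.
probs = counts_table ./ sum(counts_table)

# The cells of a joint probability mass function sum to one.
total = sum(probs)
```

Cells with zero counts get probability zero, which is why sparse tables produce many 0.0 entries in the outputs shown further down.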
The utility function marginal is also useful.
Associations.marginal — Function
marginal(p::Probabilities; dims = 1:ndims(p))
marginal(c::Counts; dims = 1:ndims(c))
Given a set of counts c (a contingency table), or a multivariate probability mass function p, return the marginal counts/probabilities along the given dims.
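Marginalization itself can be sketched in base Julia, independently of Associations.jl. The example below uses a made-up 2×3 joint PMF and Base's sum. Note that the dims keyword of sum names the dimensions that are summed out; how marginal interprets its own dims keyword should be checked against its docstring rather than inferred from this sketch.

```julia
# Toy joint PMF over two variables (rows: x outcomes, columns: y outcomes).
p = [0.1 0.2 0.1;
     0.3 0.2 0.1]

# Marginal of the first variable: sum out dimension 2 (the y outcomes).
px = vec(sum(p; dims = 2))   # ≈ [0.4, 0.6]

# Marginal of the second variable: sum out dimension 1 (the x outcomes).
py = vec(sum(p; dims = 1))   # ≈ [0.4, 0.4, 0.2]
```

Both marginals are themselves probability mass functions, so each sums to one (up to floating-point rounding).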
Example: estimating Counts and Probabilities
Estimating multivariate counts (contingency matrices) and PMFs is simple. If the data are pre-discretized, then we can use UniqueElements to simply count the number of occurrences.
using Associations
n = 50 # the number of samples must be the same for each input variable
x = rand(["dog", "cat", "snake"], n)
y = rand(1:4, n)
z = rand([(2, 1), (0, 0), (1, 1)], n)
discretization = CodifyVariables(UniqueElements())
counts(discretization, x, y, z)
3×4×3 Counts{Int64,3}
[:, :, 1]
          1  3  2  4
"dog"     2  0  2  1
"cat"     3  0  2  1
"snake"   1  2  1  1
[and 2 more slices...]
Probabilities are computed analogously, except counts are normalized to sum to 1.
discretization = CodifyVariables(UniqueElements())
probabilities(discretization, x, y, z)
3×4×3 Probabilities{Float64,3}
[:, :, 1]
          1     3     2     4
"dog"     0.04  0.0   0.04  0.02
"cat"     0.06  0.0   0.04  0.02
"snake"   0.02  0.04  0.02  0.02
[and 2 more slices...]
For numerical data, we can estimate both counts and probabilities using CodifyVariables with any count-based OutcomeSpace.
using Associations
x, y = rand(100), rand(100)
discretization = CodifyVariables(BubbleSortSwaps(m = 4))
probabilities(discretization, x, y)
7×7 Probabilities{Float64,2}
   4                     1                     …  0
3  0.04123711340206186   0.06185567010309279      0.0
2  0.04123711340206186   0.02061855670103093      0.010309278350515465
4  0.02061855670103093   0.030927835051546396     0.010309278350515465
5  0.04123711340206186   0.02061855670103093      0.010309278350515465
1  0.02061855670103093   0.030927835051546396  …  0.0
6  0.0                   0.0                      0.0
0  0.010309278350515465  0.010309278350515465     0.0
For more fine-grained control, we can use CodifyPoints with one or several Encodings.
using Associations
x, y = StateSpaceSet(rand(1000, 2)), StateSpaceSet(rand(1000, 3))
# min/max of the `rand` call is 0 and 1
precise = true # precise bin edges
r = range(0, 1; length = 3)
binning = FixedRectangularBinning(r, dimension(x), precise)
encoding_x = RectangularBinEncoding(binning, x)
encoding_y = CombinationEncoding(RelativeMeanEncoding(0.0, 1, n = 2), OrdinalPatternEncoding(3))
discretization = CodifyPoints(encoding_x, encoding_y)
# now estimate probabilities
probabilities(discretization, x, y)
4×12 Probabilities{Float64,2}
   3      5      1      10     …  7      2      8      6      4
4  0.021  0.018  0.018  0.019     0.015  0.02   0.022  0.023  0.018
3  0.02   0.022  0.03   0.025     0.021  0.019  0.014  0.029  0.017
2  0.023  0.021  0.028  0.011     0.024  0.029  0.024  0.014  0.018
1  0.018  0.026  0.021  0.021     0.021  0.018  0.022  0.015  0.017