Multivariate counts and probabilities API
For counting and probabilities, Associations.jl extends the single-variable machinery in ComplexityMeasures.jl to multiple variables.
ComplexityMeasures.Counts — Type

```julia
Counts <: Array{<:Integer, N}
Counts(counts [, outcomes [, dimlabels]]) → c
```

`Counts` stores an `N`-dimensional array of integer counts corresponding to a set of outcomes. This is typically called a "frequency table" or "contingency table". If `c isa Counts`, then `c.outcomes[i]` is an abstract vector containing the outcomes along the `i`-th dimension, `c[i][j]` is the count corresponding to the outcome `c.outcomes[i][j]`, and `c.dimlabels[i]` is the label of the `i`-th dimension. Both labels and outcomes are assigned automatically if not given. `c` itself can be manipulated and iterated over like its stored array.
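As a quick sketch of this interface (the data are illustrative; the 1-D constructor form mirrors the example further below):

```julia
using Associations

# A 1-dimensional frequency table over three named outcomes.
c = Counts([12, 16, 12], ["out1", "out2", "out3"])

c.outcomes[1]   # the outcomes along dimension 1: ["out1", "out2", "out3"]
c.dimlabels[1]  # the (automatically assigned) label of dimension 1
sum(c)          # `c` acts like its stored array, so this sums all counts: 40
```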
ComplexityMeasures.counts — Method

```julia
counts(o::UniqueElements, x₁, x₂, ..., xₙ) → Counts{N}
counts(encoding::CodifyPoints, x₁, x₂, ..., xₙ) → Counts{N}
counts(encoding::CodifyVariables, x₁, x₂, ..., xₙ) → Counts{N}
```

Construct an `N`-dimensional contingency table from the input iterables `x₁, x₂, ..., xₙ`, which must satisfy `length(x₁) == length(x₂) == ⋯ == length(xₙ)`.

If `x₁, x₂, ..., xₙ` are already discrete, use `UniqueElements` as the first argument to directly construct the joint contingency table. If `x₁, x₂, ..., xₙ` need to be discretized, provide as the first argument either

- `CodifyPoints` (encodes every point in each of the input variables `xᵢ` individually), or
- `CodifyVariables` (encodes every `xᵢ` individually using a sliding-window encoding).

NB: If using different `OutcomeSpace`s for the different `xᵢ`, then `total_outcomes` must be the same for every outcome space.
Examples
# Discretizing some non-discrete data using a sliding-window encoding for each variable
x, y = rand(100), rand(100)
c = CodifyVariables(OrdinalPatterns(m = 4))
counts(c, x, y)
# Discretizing the data by binning each individual data point
binning = RectangularBinning(3)
encoding = RectangularBinEncoding(binning, [x; y]) # give input values to ensure binning covers all data
c = CodifyPoints(encoding)
counts(c, x, y)
# Counts table for already discrete data
n = 50 # all variables must have the same number of elements
x = rand(["dog", "cat", "mouse"], n)
y = rand(1:3, n)
z = rand([(1, 2), (2, 1)], n)
counts(UniqueElements(), x, y, z)
See also: `CodifyPoints`, `CodifyVariables`, `UniqueElements`, `OutcomeSpace`, `probabilities`.
ComplexityMeasures.Probabilities — Type

```julia
Probabilities <: Array{<:AbstractFloat, N}
Probabilities(probs::Array [, outcomes [, dimlabels]]) → p
Probabilities(counts::Counts [, outcomes [, dimlabels]]) → p
```

`Probabilities` stores an `N`-dimensional array of probabilities, ensuring that the array sums to 1 (a normalized probability mass). In most cases the array is a standard vector. `p` itself can be manipulated and iterated over, just like its stored array.

The probabilities correspond to outcomes that describe the axes of the array. If `p isa Probabilities`, then `p.outcomes[i]` is an abstract vector containing the outcomes along the `i`-th dimension. The outcomes have the same ordering as the probabilities, so that `p[i][j]` is the probability for outcome `p.outcomes[i][j]`. The dimensions of the array are named, and can be accessed by `p.dimlabels`, where `p.dimlabels[i]` is the label of the `i`-th dimension. Both `outcomes` and `dimlabels` are assigned automatically if not given. If the input is a set of `Counts`, and `outcomes` and `dimlabels` are not given, then the labels and outcomes are inherited from the counts.
Examples

```julia
julia> probs = [0.2, 0.2, 0.2, 0.2]; Probabilities(probs) # will be normalized to sum to 1
 Probabilities{Float64,1} over 4 outcomes
 Outcome(1)  0.25
 Outcome(2)  0.25
 Outcome(3)  0.25
 Outcome(4)  0.25

julia> c = Counts([12, 16, 12], ["out1", "out2", "out3"]); Probabilities(c)
 Probabilities{Float64,1} over 3 outcomes
 "out1"  0.3
 "out2"  0.4
 "out3"  0.3
```
ComplexityMeasures.probabilities — Method

```julia
probabilities(o::UniqueElements, x₁, x₂, ..., xₙ) → Probabilities{N}
probabilities(encoding::CodifyPoints, x₁, x₂, ..., xₙ) → Probabilities{N}
probabilities(encoding::CodifyVariables, x₁, x₂, ..., xₙ) → Probabilities{N}
```

Construct an `N`-dimensional `Probabilities` array from the input iterables `x₁, x₂, ..., xₙ`, which must satisfy `length(x₁) == length(x₂) == ⋯ == length(xₙ)`.

Description

Probabilities are computed by first constructing a joint contingency table in the form of a `Counts` instance.

If `x₁, x₂, ..., xₙ` are already discrete, use `UniqueElements` as the first argument to directly construct the joint contingency table. If `x₁, x₂, ..., xₙ` need to be discretized, provide as the first argument either

- `CodifyPoints` (encodes every point in each of the input variables `xᵢ` individually), or
- `CodifyVariables` (encodes every `xᵢ` individually using a sliding-window encoding).
Examples

```julia
# Discretizing non-discrete data using a sliding-window encoding for each variable
x, y = rand(100), rand(100)
c = CodifyVariables(OrdinalPatterns(m = 4))
probabilities(c, x, y)

# Discretizing the data by binning each individual data point
binning = RectangularBinning(3)
encoding = RectangularBinEncoding(binning, [x; y]) # pass input values so the binning covers all data
c = CodifyPoints(encoding)
probabilities(c, x, y)

# Joint probabilities for already discretized data
n = 50 # all variables must have the same number of elements
x = rand(["dog", "cat", "mouse"], n)
y = rand(1:3, n)
z = rand([(1, 2), (2, 1)], n)
probabilities(UniqueElements(), x, y, z)
```
See also: `CodifyPoints`, `CodifyVariables`, `UniqueElements`, `OutcomeSpace`.

The utility function `marginal` is also useful.
Associations.marginal — Function

```julia
marginal(p::Probabilities; dims = 1:ndims(p))
marginal(c::Counts; dims = 1:ndims(c))
```

Given a set of counts `c` (a contingency table), or a multivariate probability mass function `p`, return the marginal counts/probabilities along the given `dims`.
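A minimal sketch of how marginalization behaves on a two-variable contingency table (the data are illustrative):

```julia
using Associations

x = rand(1:2, 100)
y = rand(1:3, 100)
c = counts(UniqueElements(), x, y)  # 2×3 joint contingency table

# Keep only the first dimension, summing counts over the second.
cx = marginal(c; dims = 1)

# Marginalization preserves the total number of observations.
sum(cx) == sum(c)  # true
```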
Example: estimating Counts and Probabilities

Estimating multivariate counts (contingency matrices) and PMFs is simple. If the data are pre-discretized, then we can use `UniqueElements` to simply count the number of occurrences.
```julia
using Associations
n = 50 # the number of samples must be the same for each input variable
x = rand(["dog", "cat", "snake"], n)
y = rand(1:4, n)
z = rand([(2, 1), (0, 0), (1, 1)], n)

discretization = CodifyVariables(UniqueElements())
counts(discretization, x, y, z)
```

```
3×4×3 Counts{Int64,3}
[:, :, 1]
     1  2  3  4
  1  2  1  2  1
  2  0  2  2  2
  3  2  1  1  2
[and 2 more slices...]
```
Probabilities are computed analogously, except counts are normalized to sum to 1.

```julia
discretization = CodifyVariables(UniqueElements())
probabilities(discretization, x, y, z)
```

```
3×4×3 Probabilities{Float64,3}
      1     2     3     4
  1  0.04  0.02  0.04  0.02
  2  0.0   0.04  0.04  0.04
  3  0.04  0.02  0.02  0.04
[and 2 more slices...]
```
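The `marginal` function documented above applies directly to such joint PMFs; a brief sketch with illustrative data:

```julia
using Associations

n = 50
x = rand(["dog", "cat", "snake"], n)
y = rand(1:4, n)

p = probabilities(CodifyVariables(UniqueElements()), x, y)

# Marginal PMF of the first variable; it is itself normalized.
px = marginal(p; dims = 1)
sum(px) ≈ 1.0  # true
```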
For numerical data, we can estimate both counts and probabilities using `CodifyVariables` with any count-based `OutcomeSpace`.

```julia
using Associations
x, y = rand(100), rand(100)
discretization = CodifyVariables(BubbleSortSwaps(m = 4))
probabilities(discretization, x, y)
```

```
7×7 Probabilities{Float64,2}
     1                     4                     …  0
  3  0.04123711340206186   0.08247422680412372      0.0
  2  0.05154639175257733   0.0309278350515464       0.010309278350515465
  0  0.010309278350515465  0.010309278350515465     0.0
  4  0.0309278350515464    0.05154639175257733      0.0
  5  0.02061855670103093   0.0309278350515464    …  0.0
  1  0.0                   0.04123711340206186      0.0
  6  0.0                   0.0                      0.0
```
For more fine-grained control, we can use `CodifyPoints` with one or several `Encoding`s.

```julia
using Associations
x, y = StateSpaceSet(rand(1000, 2)), StateSpaceSet(rand(1000, 3))

# The min/max of the `rand` call are 0 and 1
precise = true # precise bin edges
r = range(0, 1; length = 3)
binning = FixedRectangularBinning(r, dimension(x), precise)
encoding_x = RectangularBinEncoding(binning, x)
encoding_y = CombinationEncoding(RelativeMeanEncoding(0.0, 1, n = 2), OrdinalPatternEncoding(3))
discretization = CodifyPoints(encoding_x, encoding_y)

# Now estimate probabilities
probabilities(discretization, x, y)
```

```
4×12 Probabilities{Float64,2}
     11     1      7      12     …  4      6      9      8      2
  4  0.027  0.025  0.02   0.023     0.019  0.025  0.026  0.027  0.023
  2  0.024  0.018  0.015  0.019     0.029  0.022  0.024  0.023  0.023
  1  0.019  0.012  0.018  0.019     0.029  0.017  0.019  0.021  0.024
  3  0.023  0.014  0.012  0.014     0.03   0.02   0.013  0.021  0.014
```
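Whichever discretization is chosen, the returned `Probabilities` is normalized and carries the encoded outcomes on each axis; a quick sketch with illustrative data:

```julia
using Associations

x, y = rand(200), rand(200)
p = probabilities(CodifyVariables(OrdinalPatterns(m = 3)), x, y)

sum(p) ≈ 1.0   # the joint PMF sums to one
p.outcomes[1]  # ordinal-pattern outcomes along the first axis
```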