Multivariate counts and probabilities API

For counting and probabilities, Associations.jl extends the single-variable machinery in ComplexityMeasures.jl to multiple variables.

ComplexityMeasures.Counts (Type)
Counts <: Array{<:Integer, N}
Counts(counts [, outcomes [, dimlabels]]) → c

Counts stores an N-dimensional array of integer counts corresponding to a set of outcomes. This is typically called a "frequency table" or "contingency table".

If c isa Counts, then c.outcomes[i] is an abstract vector containing the outcomes along the i-th dimension, where c[i][j] is the count corresponding to the outcome c.outcomes[i][j], and c.dimlabels[i] is the label of the i-th dimension. Both labels and outcomes are assigned automatically if not given. c itself can be manipulated and iterated over like its stored array.
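As a minimal sketch of the field access described above, assuming Associations.jl is installed and re-exports Counts from ComplexityMeasures.jl, the following builds a one-dimensional Counts from hand-written values (the counts and outcome names are illustrative only) and reads back its fields.

```julia
using Associations  # assumed to re-export Counts from ComplexityMeasures.jl

# Illustrative data: three outcomes with hand-picked counts.
c = Counts([12, 16, 12], ["out1", "out2", "out3"])

c.outcomes[1]   # outcomes along the first (and only) dimension
c.dimlabels[1]  # label of the first dimension, assigned automatically here
sum(c)          # total number of observations; `c` behaves like its stored array
```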

ComplexityMeasures.counts (Method)
counts(o::UniqueElements, x₁, x₂, ..., xₙ) → Counts{N}
counts(encoding::CodifyPoints, x₁, x₂, ..., xₙ) → Counts{N}
counts(encoding::CodifyVariables, x₁, x₂, ..., xₙ) → Counts{N}

Construct an N-dimensional contingency table from the input iterables x₁, x₂, ..., xₙ which are such that length(x₁) == length(x₂) == ⋯ == length(xₙ).

If x₁, x₂, ..., xₙ are already discrete, then use UniqueElements as the first argument to directly construct the joint contingency table.

If x₁, x₂, ..., xₙ need to be discretized, provide as the first argument

  • CodifyPoints (encodes every point in each input variable xᵢ individually)
  • CodifyVariables (encodes every xᵢ individually using a sliding window encoding). NB: If using different OutcomeSpaces for the different xᵢ, then total_outcomes must be the same for every outcome space.

Examples

# Discretizing some non-discrete data using a sliding-window encoding for each variable
x, y = rand(100), rand(100)
c = CodifyVariables(OrdinalPatterns(m = 4))
counts(c, x, y)

# Discretizing the data by binning each individual data point
binning = RectangularBinning(3)
encoding = RectangularBinEncoding(binning, [x; y]) # give input values to ensure binning covers all data
c = CodifyPoints(encoding)
counts(c, x, y)

# Counts table for already discrete data
n = 50 # all variables must have the same number of elements
x = rand(["dog", "cat", "mouse"], n)
y = rand(1:3, n)
z = rand([(1, 2), (2, 1)], n)

counts(UniqueElements(), x, y, z)
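
The random inputs above give a different table on every run. As a hedged, deterministic variant (the data are made up for illustration, and Associations.jl is assumed to be installed), the sketch below uses two short discrete variables, so that the shape and total of the resulting table can be checked by hand.

```julia
using Associations

# Two already-discrete variables of equal length (illustrative data).
x = ["a", "a", "b", "a", "b"]
y = [1, 1, 2, 2, 2]

c = counts(UniqueElements(), x, y)

size(c)  # one axis per input variable, one entry per unique value
sum(c)   # equals the number of joint observations, i.e. length(x)
```

The joint observations are ("a", 1) twice, ("b", 2) twice, ("a", 2) once and ("b", 1) never, so the table holds the values 2, 2, 1 and 0 in some outcome-dependent order.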

See also: CodifyPoints, CodifyVariables, UniqueElements, OutcomeSpace, probabilities.

ComplexityMeasures.Probabilities (Type)
Probabilities <: Array{<:AbstractFloat, N}
Probabilities(probs::Array [, outcomes [, dimlabels]]) → p
Probabilities(counts::Counts [, outcomes [, dimlabels]]) → p

Probabilities stores an N-dimensional array of probabilities, while ensuring that the array sums to 1 (normalized probability mass). In most cases the array is a standard vector. p itself can be manipulated and iterated over, just like its stored array.

The probabilities correspond to outcomes that describe the axes of the array. If p isa Probabilities, then p.outcomes[i] is an abstract vector containing the outcomes along the i-th dimension. The outcomes have the same ordering as the probabilities, so that p[i][j] is the probability for outcome p.outcomes[i][j]. The dimensions of the array are named, and can be accessed by p.dimlabels, where p.dimlabels[i] is the label of the i-th dimension. Both outcomes and dimlabels are assigned automatically if not given. If the input is a set of Counts, and outcomes and dimlabels are not given, then the labels and outcomes are inherited from the counts.

Examples

julia> probs = [0.2, 0.2, 0.2, 0.2]; Probabilities(probs) # will be normalized to sum to 1
 Probabilities{Float64,1} over 4 outcomes
 Outcome(1)  0.25
 Outcome(2)  0.25
 Outcome(3)  0.25
 Outcome(4)  0.25
julia> c = Counts([12, 16, 12], ["out1", "out2", "out3"]); Probabilities(c)
 Probabilities{Float64,1} over 3 outcomes
 "out1"  0.3
 "out2"  0.4
 "out3"  0.3
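
The inheritance of outcomes and labels from a Counts input can be checked directly. This is a small sketch reusing the hand-written counts from the example above, assuming Associations.jl is installed and re-exports both types.

```julia
using Associations

c = Counts([12, 16, 12], ["out1", "out2", "out3"])
p = Probabilities(c)

sum(p) ≈ 1.0                    # the probabilities are normalized
p.outcomes[1] == c.outcomes[1]  # outcomes are inherited from the counts
p.dimlabels == c.dimlabels      # and so are the dimension labels
```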
ComplexityMeasures.probabilities (Method)
probabilities(o::UniqueElements, x₁, x₂, ..., xₙ) → Probabilities{N}
probabilities(encoding::CodifyPoints, x₁, x₂, ..., xₙ) → Probabilities{N}
probabilities(encoding::CodifyVariables, x₁, x₂, ..., xₙ) → Probabilities{N}

Construct an N-dimensional Probabilities array from the input iterables x₁, x₂, ..., xₙ which are such that length(x₁) == length(x₂) == ⋯ == length(xₙ).

Description

Probabilities are computed by first constructing a joint contingency matrix in the form of a Counts instance.

If x₁, x₂, ..., xₙ are already discrete, then use UniqueElements as the first argument to directly construct the joint contingency table.

If x₁, x₂, ..., xₙ need to be discretized, provide as the first argument

  • CodifyPoints (encodes every point in each input variable xᵢ individually)
  • CodifyVariables (encodes every xᵢ individually using a sliding window encoding).

Examples

# Discretizing some non-discrete data using a sliding-window encoding for each variable
x, y = rand(100), rand(100)
c = CodifyVariables(OrdinalPatterns(m = 4))
probabilities(c, x, y)

# Discretizing the data by binning each individual data point
binning = RectangularBinning(3)
encoding = RectangularBinEncoding(binning, [x; y]) # give input values to ensure binning covers all data
c = CodifyPoints(encoding)
probabilities(c, x, y)

# Joint probabilities for already discretized data
n = 50 # all variables must have the same number of elements
x = rand(["dog", "cat", "mouse"], n)
y = rand(1:3, n)
z = rand([(1, 2), (2, 1)], n)

probabilities(UniqueElements(), x, y, z)

See also: CodifyPoints, CodifyVariables, UniqueElements, OutcomeSpace.


The utility function marginal computes marginal counts or probabilities from a joint Counts or Probabilities instance.

Associations.marginal (Function)
marginal(p::Probabilities; dims = 1:ndims(p))
marginal(c::Counts; dims = 1:ndims(c))

Given a set of counts c (a contingency table), or a multivariate probability mass function p, return the marginal counts/probabilities along the given dims.
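
What marginalization computes can be sketched with plain Julia arrays. The block below is a conceptual illustration using only Base functions and made-up numbers. It is not the implementation behind marginal, and it does not show how marginal interprets its dims keyword.

```julia
# Joint PMF of two variables as a plain matrix (rows: X, columns: Y).
P = [0.10 0.20;
     0.30 0.40]

# Marginal PMF of X: sum out the second dimension (Y).
pX = vec(sum(P; dims = 2))  # approximately [0.3, 0.7]

# Marginal PMF of Y: sum out the first dimension (X).
pY = vec(sum(P; dims = 1))  # approximately [0.4, 0.6]

# Each marginal is itself a normalized PMF.
sum(pX) ≈ 1.0 && sum(pY) ≈ 1.0
```

The same summation applied to a table of integer counts gives marginal counts instead of marginal probabilities.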


Example: estimating Counts and Probabilities

Estimating multivariate counts (contingency tables) and probability mass functions (PMFs) is straightforward. If the data are pre-discretized, we can wrap UniqueElements in CodifyVariables to simply count the number of occurrences of each joint outcome.

using Associations
n = 50 # the number of samples must be the same for each input variable
x = rand(["dog", "cat", "snake"], n)
y = rand(1:4, n)
z = rand([(2, 1), (0, 0), (1, 1)], n)
discretization = CodifyVariables(UniqueElements())
counts(discretization, x, y, z)
 3×4×3 Counts{Int64,3}
[:, :, 1]
          1  3  2  4
 "dog"    2  1  1  1
 "snake"  1  2  1  0
 "cat"    2  2  4  2
[and 2 more slices...]

Probabilities are computed analogously, except counts are normalized to sum to 1.

discretization = CodifyVariables(UniqueElements())
probabilities(discretization, x, y, z)
 3×4×3 Probabilities{Float64,3}
[:, :, 1]
          1     3     2     4
 "dog"    0.04  0.02  0.02  0.02
 "snake"  0.02  0.04  0.02  0.0
 "cat"    0.04  0.04  0.08  0.04
[and 2 more slices...]

For numerical data, we can estimate both counts and probabilities using CodifyVariables with any count-based OutcomeSpace.

using Associations
x, y = rand(100), rand(100)
discretization = CodifyVariables(BubbleSortSwaps(m = 4))
probabilities(discretization, x, y)
 7×7 Probabilities{Float64,2}
       3                     2                     …  4
 4     0.07216494845360825   0.0618556701030928       0.04123711340206186
 2     0.0618556701030928    0.010309278350515465     0.02061855670103093
 3     0.0309278350515464    0.07216494845360825      0.07216494845360825
 5     0.010309278350515465  0.04123711340206186      0.0309278350515464
 1     0.05154639175257733   0.0309278350515464    …  0.0309278350515464
 6     0.02061855670103093   0.02061855670103093      0.04123711340206186
 0     0.010309278350515465  0.0                      0.0

For more fine-grained control, we can use CodifyPoints with one or several Encodings.

using Associations
x, y = StateSpaceSet(rand(1000, 2)), StateSpaceSet(rand(1000, 3))

 # min/max of the `rand` call is 0 and 1
precise = true # precise bin edges
r = range(0, 1; length = 3)
binning = FixedRectangularBinning(r, dimension(x), precise)
encoding_x = RectangularBinEncoding(binning, x)
encoding_y = CombinationEncoding(RelativeMeanEncoding(0.0, 1, n = 2), OrdinalPatternEncoding(3))
discretization = CodifyPoints(encoding_x, encoding_y)

# now estimate probabilities
probabilities(discretization, x, y)
 4×12 Probabilities{Float64,2}
       9      5      6      4      …  2      3      10      1      7
 3     0.016  0.021  0.021  0.015     0.022  0.019   0.017  0.026  0.018
 4     0.019  0.023  0.018  0.019     0.022  0.022   0.02   0.032  0.02
 2     0.022  0.027  0.022  0.023     0.014  0.024   0.023  0.022  0.016
 1     0.017  0.022  0.022  0.019     0.024  0.017   0.017  0.021  0.026