Contingency table API
To estimate discrete information-theoretic quantities that are functions of more than one variable, we must estimate empirical joint probability mass functions (pmfs). The function `contingency_matrix` accepts an arbitrary number of equal-length input datasets and returns the corresponding multidimensional contingency table as a `ContingencyMatrix`. From this table, we can extract the joint and marginal pmfs needed to compute any discrete function of multivariate discrete probability distributions. It is essentially the multivariate analogue of `Probabilities`.
Why use a `ContingencyMatrix` instead of some other, indirect estimation method? Because a `ContingencyMatrix` lets you compute any of the information-theoretic quantities offered in this package for any type of input data. Your input data can literally be of any hashable type, for example `String`, `Tuple{Int, String, Int}`, or `YourCustomHashableDataType`.
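As a hedged sketch (assuming the `contingency_matrix` and `probabilities` functions behave as documented below; exact return types may vary between versions), counting co-occurrences of two categorical variables might look like:

```julia
using CausalityTools  # assumed to export `contingency_matrix` and `probabilities`

# Two categorical variables; any hashable type works as input.
x = rand(["yes", "no"], 1000)
y = rand(["small", "medium", "large"], 1000)

# Count joint occurrences into a 2-by-3 contingency table.
c = contingency_matrix(x, y)

# The empirical joint pmf over all cells sums to one.
sum(probabilities(c)) ≈ 1.0
```

No manual discretization is needed here, because the inputs are already categorical.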
For numeric data, using a `ContingencyMatrix` is typically a bit slower than dedicated estimation procedures. For example, quantities like discrete Shannon-type `condmutualinfo` are faster to estimate using a formulation based on sums of four entropies (the H4 principle). This is faster because we can both utilize the blazingly fast `StateSpaceSet` structure directly, and avoid explicitly estimating the entire joint pmf, which requires many extra calculation steps. Which approach you use in practice depends on your use case and available estimation methods, but you can always fall back to contingency matrices for any discrete measure.
CausalityTools.ContingencyMatrix — Type

    ContingencyMatrix{T, N} <: Probabilities{T, N}
    ContingencyMatrix(frequencies::AbstractArray{Int, N})

A contingency matrix is essentially a multivariate analogue of `Probabilities` that also keeps track of raw frequencies.

The contingency matrix can be constructed directly from an `N`-dimensional `frequencies` array. Alternatively, the `contingency_matrix` function performs the counting for you; this works both on raw categorical data and by first discretizing the data using a `ProbabilitiesEstimator`.
Description
A `ContingencyMatrix` `c` is just a simple wrapper around an `AbstractArray{T, N}`. Indexing `c` with multiple indices `i, j, …` returns the `(i, j, …)`th element of the empirical probability mass function (pmf). The following convenience methods are defined:

- `frequencies(c; dims)` returns the multivariate raw counts along the given `dims` (defaults to all available dimensions).
- `probabilities(c; dims)` returns a multidimensional empirical probability mass function (pmf) along the given `dims` (defaults to all available dimensions), i.e. the normalized counts.
- `probabilities(c, i::Int)` returns the marginal probabilities for the `i`-th dimension.
- `outcomes(c, i::Int)` returns the marginal outcomes for the `i`-th dimension.
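A small sketch of these accessors, under the assumption that the method names behave as documented above (return containers may differ slightly between versions):

```julia
using CausalityTools  # assumed to export the accessors below

x = rand(1:2, 500)   # first variable: two possible outcomes
y = rand(1:3, 500)   # second variable: three possible outcomes
c = contingency_matrix(x, y)

frequencies(c)        # 2-by-3 array of raw joint counts; entries sum to 500
probabilities(c)      # 2-by-3 empirical joint pmf; entries sum to 1
probabilities(c, 1)   # marginal pmf of the first variable (x)
outcomes(c, 2)        # the marginal outcomes observed for the second variable (y)
```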
Ordering
The ordering of outcomes is internally consistent, but we make no promise about the ordering of outcomes relative to the input data. This means that if your input data are `x = rand(["yes", "no"], 100); y = rand(["small", "medium", "large"], 100)`, you'll get a 2-by-3 contingency matrix, but there is currently no easy way to determine which outcome the `(i, j)`-th row/column of this matrix corresponds to.
Since `ContingencyMatrix` is intended for use in information-theoretic methods that don't care about ordering, as long as the ordering is internally consistent, this is not an issue for practical applications in this package. This may change in future releases.
Usage
Contingency matrices are used when computing discrete versions of the information-theoretic quantities offered in this package.
CausalityTools.contingency_matrix — Function

    contingency_matrix(x, y, [z, ...]) → c::ContingencyMatrix
    contingency_matrix(est::ProbabilitiesEstimator, x, y, [z, ...]) → c::ContingencyMatrix
Estimate a multidimensional contingency matrix `c` from input data `x, y, …`, where the input data can be of any and different types, as long as `length(x) == length(y) == …`.
For already discretized data, use the first method. For continuous data, you should discretize it before computing the contingency table. You can do this manually and then use the first method. Alternatively, you can provide a `ProbabilitiesEstimator` as the first argument. The input variables `x, y, …` are then discretized separately according to `est` (enforcing the same outcome space for all variables) by calling `marginal_encodings`.
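For continuous input, a sketch along the lines described above (assuming `ValueHistogram` and `RectangularBinning` are available as listed under supported estimators below):

```julia
using CausalityTools  # assumed to re-export `ValueHistogram` and `RectangularBinning`

x, y = rand(1000), rand(1000)  # continuous data; must be discretized first

# Discretize each marginal into 4 bins per dimension, then count jointly.
est = ValueHistogram(RectangularBinning(4))
c = contingency_matrix(est, x, y)

# As with categorical input, the joint pmf is normalized.
sum(probabilities(c)) ≈ 1.0
```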
Utilities
CausalityTools.marginal_encodings — Function

    marginal_encodings(est::ProbabilitiesEstimator, x::VectorOrStateSpaceSet...)
Encode/discretize each input vector `xᵢ ∈ x` according to a procedure determined by `est`. Any `xᵢ ∈ x` that is multidimensional (a `StateSpaceSet`) will be encoded column-wise, i.e. each column of `xᵢ` is treated as a timeseries and encoded separately.
This is useful for computing any discrete information-theoretic quantity, and is used internally by `contingency_matrix`.
Supported estimators
- `ValueHistogram`. Bin visitation frequencies are counted in the joint space `XY`, then marginal visitations are obtained from the joint bin visits. This behaviour is the same for both `FixedRectangularBinning` and `RectangularBinning` (which adapts the grid to the data). When using `FixedRectangularBinning`, the range along the first dimension is used as a template for all other dimensions.
- `SymbolicPermutation`. Each timeseries is separately `encode`d according to its ordinal pattern.
- `Dispersion`. Each timeseries is separately `encode`d according to its dispersion pattern.
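A sketch of marginal-wise encoding with one of the supported estimators, assuming the `marginal_encodings` signature documented above (the exact container returned is an assumption here):

```julia
using CausalityTools  # assumed to export `marginal_encodings`

x, y = rand(1000), rand(1000)
est = ValueHistogram(RectangularBinning(4))

# Each input is discretized separately, yielding one symbol series per variable.
encoded = marginal_encodings(est, x, y)
length(encoded)  # one encoded series per input variable
```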
Many more implementations are possible. Each new implementation gives one new way of estimating the `ContingencyMatrix`.