Contingency table API

To estimate discrete information theoretic quantities that are functions of more than one variable, we must estimate empirical joint probability mass functions (pmfs). The function contingency_matrix accepts an arbitrary number of equal-length inputs and returns the corresponding multidimensional contingency table as a ContingencyMatrix. From this table, we can extract the joint and marginal pmfs needed to compute any discrete function of multivariate discrete probability distributions. A ContingencyMatrix is essentially the multivariate analogue of Probabilities.

Why use a ContingencyMatrix instead of some other, indirect estimation method? Because ContingencyMatrix allows you to compute any of the information theoretic quantities offered in this package for any type of input data. Your input data can be of any hashable type, for example String, Tuple{Int, String, Int}, or YourCustomHashableDataType.
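The counting idea can be illustrated outside the package. The following is a minimal Python sketch, not the package's implementation: `contingency_table` and `marginal` are hypothetical helper names, and the only assumption about the data is that each element is hashable.

```python
from collections import Counter

def contingency_table(*variables):
    """Count joint occurrences of equal-length sequences of hashable values.

    Hypothetical helper for illustration only.
    """
    lengths = {len(v) for v in variables}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")
    n = lengths.pop()
    counts = Counter(zip(*variables))                  # raw joint frequencies
    joint = {outcome: c / n for outcome, c in counts.items()}  # normalized counts
    return counts, joint

def marginal(joint, dim):
    """Sum the joint pmf over every dimension except `dim`."""
    out = Counter()
    for outcome, p in joint.items():
        out[outcome[dim]] += p
    return dict(out)

# Inputs of different element types: strings and tuples.
x = ["yes", "no", "yes", "yes"]
y = [(1, "a", 2), (1, "a", 2), (3, "b", 4), (1, "a", 2)]

counts, joint = contingency_table(x, y)
px = marginal(joint, 0)
```

Because the table is keyed by the observed outcomes themselves, nothing in the sketch depends on the data being numeric.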

For numeric data, using a ContingencyMatrix is typically somewhat slower than dedicated estimation procedures. For example, quantities like discrete Shannon-type condmutualinfo are faster to estimate using a formulation based on sums of four entropies (the H4-principle). This is faster because we can operate directly on the fast StateSpaceSet structure and avoid explicitly estimating the entire joint pmf, which requires many extra computation steps. Which approach you use in practice depends on your use case and the available estimation methods, but you can always fall back to contingency matrices for any discrete measure.
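To make the two formulations concrete, here is a hedged Python sketch that computes discrete Shannon conditional mutual information both as a sum of four entropies, I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), and directly from the joint pmf. The function names are hypothetical, and the sketch only demonstrates that the two routes agree numerically; it does not reproduce the package's code or its performance characteristics.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def cmi_four_entropies(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(list(z)))

def cmi_from_joint(x, y, z):
    """I(X;Y|Z) summed term by term over the joint table of (x, y, z)."""
    n = len(x)
    nxyz = Counter(zip(x, y, z))
    nxz = Counter(zip(x, z))
    nyz = Counter(zip(y, z))
    nz = Counter(z)
    total = 0.0
    for (a, b, c), k in nxyz.items():
        # p(z) p(x,y,z) / (p(x,z) p(y,z)); the sample size cancels in the ratio.
        total += (k / n) * log2(nz[c] * k / (nxz[(a, c)] * nyz[(b, c)]))
    return total

x = [0, 0, 1, 1, 0, 1, 1, 0]
y = [0, 1, 1, 1, 0, 0, 1, 0]
z = [0, 0, 0, 1, 1, 1, 0, 1]

h4 = cmi_four_entropies(x, y, z)
direct = cmi_from_joint(x, y, z)
```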

CausalityTools.ContingencyMatrix - Type
ContingencyMatrix{T, N} <: Probabilities{T, N}
ContingencyMatrix(frequencies::AbstractArray{Int, N})

A contingency matrix is essentially a multivariate analogue of Probabilities that also keeps track of raw frequencies.

The contingency matrix can be constructed directly from an N-dimensional frequencies array. Alternatively, the contingency_matrix function performs the counting for you; this works either on raw categorical data or by first discretizing the data using a ProbabilitiesEstimator.

Description

A ContingencyMatrix c is just a simple wrapper around AbstractArray{T, N}. Indexing c with multiple indices i, j, … returns the (i, j, …)th element of the empirical probability mass function (pmf). The following convenience methods are defined:

  • frequencies(c; dims) returns the multivariate raw counts along the given dims (defaults to all available dimensions).
  • probabilities(c; dims) returns a multidimensional empirical probability mass function (pmf) along the given dims (defaults to all available dimensions), i.e. the normalized counts.
  • probabilities(c, i::Int) returns the marginal probabilities for the i-th dimension.
  • outcomes(c, i::Int) returns the marginal outcomes for the i-th dimension.
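The relationship between raw counts, normalized counts, and marginals can be sketched as follows. This Python class is a hypothetical stand-in, not the package's ContingencyMatrix; in particular, it assumes that `dims` names the dimensions that are kept while all others are summed out, which is one possible reading of "along the given dims".

```python
from collections import Counter

class Table:
    """Hypothetical N-dimensional count table keyed by index tuples."""

    def __init__(self, counts, outcomes):
        self.counts = dict(counts)   # {(i, j, ...): count}
        self.outcomes = outcomes     # one list of outcome labels per dimension

    def frequencies(self, dims=None):
        """Raw counts over the kept dimensions (all dimensions by default)."""
        ndim = len(self.outcomes)
        dims = tuple(range(ndim)) if dims is None else tuple(dims)
        out = Counter()
        for idx, c in self.counts.items():
            out[tuple(idx[d] for d in dims)] += c
        return dict(out)

    def probabilities(self, dims=None):
        """Normalized counts over the kept dimensions."""
        freqs = self.frequencies(dims)
        total = sum(freqs.values())
        return {idx: c / total for idx, c in freqs.items()}

t = Table(
    counts={(0, 0): 3, (0, 1): 1, (1, 0): 2, (1, 1): 4},
    outcomes=[["no", "yes"], ["small", "large"]],
)

joint = t.probabilities()             # full joint pmf
row_counts = t.frequencies(dims=(0,)) # marginal counts for the first dimension
col_pmf = t.probabilities(dims=(1,))  # marginal pmf for the second dimension
```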

Ordering

The ordering of outcomes is internally consistent, but we make no promise about the ordering of outcomes relative to the input data. This means that if your input data are x = rand(["yes", "no"], 100); y = rand(["small", "medium", "large"], 100), you'll get a 2-by-3 contingency matrix, but there is currently no easy way to determine which pair of outcomes the (i, j)-th entry of this matrix corresponds to.

Since ContingencyMatrix is intended for use in information theoretic methods that don't care about ordering, as long as the ordering is internally consistent, this is not an issue for practical applications in this package. This may change in future releases.
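A small numerical check illustrates why ordering does not matter for such quantities. The Python sketch below computes plug-in mutual information from a 2-by-3 count table and from the same table with its rows and columns reordered; the table values and the function name are made up for the example.

```python
from math import log2

def mutual_information(table):
    """Plug-in mutual information (in bits) from a 2D table of counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, c in enumerate(row):
            if c > 0:  # empty cells contribute nothing
                mi += (c / total) * log2(c * total / (row_sums[i] * col_sums[j]))
    return mi

table = [[10, 2, 3],
         [1, 8, 6]]

# Swap the two rows and reorder the columns as (2, 0, 1).
permuted = [[row[j] for j in (2, 0, 1)] for row in (table[1], table[0])]

mi_original = mutual_information(table)
mi_permuted = mutual_information(permuted)
```

The two values agree up to floating-point rounding, because each cell is still paired with its own row and column sums after reordering.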

Usage

Contingency matrices are used in the computation of the discrete versions of the information theoretic quantities offered in this package.

CausalityTools.contingency_matrix - Function
contingency_matrix(x, y, [z, ...]) → c::ContingencyMatrix
contingency_matrix(est::ProbabilitiesEstimator, x, y, [z, ...]) → c::ContingencyMatrix

Estimate a multidimensional contingency matrix c from input data x, y, …. The inputs can be of any type, and the types may differ between inputs, as long as length(x) == length(y) == ….

For already discretized data, use the first method. Continuous data must be discretized before the contingency table is computed. You can do this manually and then use the first method. Alternatively, you can provide a ProbabilitiesEstimator as the first argument to contingency_matrix. The input variables x, y, … are then discretized separately according to est (enforcing the same outcome space for all variables) by calling marginal_encodings.
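The manual route (discretize first, then count) can be sketched in Python as below. The equal-width binning in `bin_index` is an assumption chosen for simplicity and stands in for whatever discretization an estimator would apply; using the same number of bins for every variable gives all variables the same outcome space {0, …, n_bins - 1}.

```python
from collections import Counter

def bin_index(values, n_bins):
    """Map each value to an equal-width bin index in 0..n_bins-1.

    Hypothetical discretization used only for this sketch.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # constant input: everything lands in bin 0
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

x = [0.1, 0.4, 0.35, 0.8, 0.95, 0.05]
y = [1.0, 2.0, 9.0, 8.0, 10.0, 0.0]

# Discretize each variable separately, with the same outcome space for both.
xb = bin_index(x, 2)
yb = bin_index(y, 2)

# Count joint occurrences of the discretized values.
counts = Counter(zip(xb, yb))
```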


Utilities

CausalityTools.marginal_encodings - Function
marginal_encodings(est::ProbabilitiesEstimator, x::VectorOrStateSpaceSet...)

Encode/discretize each input vector xᵢ ∈ x according to a procedure determined by est. Any xᵢ ∈ x that is multidimensional (a StateSpaceSet) is encoded column-wise, i.e. each column of xᵢ is treated as a time series and is encoded separately.

This is useful for computing any discrete information theoretic quantity, and is used internally by contingency_matrix.
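Column-wise encoding can be sketched in Python as follows. Both functions are hypothetical: `encode_above_mean` is a toy encoder standing in for an estimator-defined procedure, and returning one flat list of encoded series is an assumption about the output layout, not a statement about what marginal_encodings returns.

```python
def encode_above_mean(series):
    """Toy encoder: 1 if a value exceeds the series mean, else 0."""
    mean = sum(series) / len(series)
    return [1 if v > mean else 0 for v in series]

def marginal_encodings_sketch(encode, *inputs):
    """Encode each input separately; multidimensional inputs column by column."""
    encoded = []
    for inp in inputs:
        if inp and isinstance(inp[0], (tuple, list)):
            # Rows are points; each column is treated as its own time series.
            encoded.extend(encode(list(col)) for col in zip(*inp))
        else:
            encoded.append(encode(list(inp)))
    return encoded

x = [1.0, 2.0, 3.0, 4.0]                                    # univariate
y = [(0.0, 10.0), (1.0, 0.0), (0.0, 10.0), (5.0, 0.0)]      # two columns

codes = marginal_encodings_sketch(encode_above_mean, x, y)
```

Here the univariate input yields one encoded series and the two-column input yields two, so `codes` holds three integer sequences of equal length that are ready to be counted jointly.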

Supported estimators

Many more implementations are possible. Each new implementation provides one more way of estimating a ContingencyMatrix.
