Mutual information

MIShannon

Estimation using MutualInformationEstimators

When estimated using a MutualInformationEstimator, some form of bias correction is usually applied. The KraskovStögbauerGrassberger1 and KraskovStögbauerGrassberger2 estimators are perhaps the most popular. A common parametric estimator is GaussianMI.

MIShannon with GaussianMI

using CausalityTools
using Distributions
using Statistics

n = 1000
x = randn(n)
y = rand(n) .+ x
mutualinfo(GaussianMI(), x, y) # defaults to `MIShannon()`
1.8811016587266156

MIShannon with KraskovStögbauerGrassberger1

using CausalityTools
x, y = rand(1000), rand(1000)
mutualinfo(KSG1(k = 5), x, y)
-0.040659864023560074

MIShannon with KraskovStögbauerGrassberger2

using CausalityTools
x, y = rand(1000), rand(1000)
mutualinfo(KSG2(k = 5), x, y)
0.005134743658384601

MIShannon with GaoKannanOhViswanath

using CausalityTools
x, y = rand(1000), rand(1000)
mutualinfo(GaoKannanOhViswanath(k = 10), x, y)
0.010193665973455657

MIShannon with GaoOhViswanath

using CausalityTools
x, y = rand(1000), rand(1000)
mutualinfo(GaoOhViswanath(k = 10), x, y)
0.023625206674988096

Reproducing Kraskov et al. (2004)

Here, we'll reproduce Figure 4 from Kraskov et al. (2004)'s seminal paper on the nearest-neighbor based mutual information estimator. We'll estimate the mutual information between the marginals of a bivariate Gaussian for a fixed time series length of 2000, varying the number of neighbors. Note: in the original paper, multiple curves are shown, corresponding to different time series lengths. Here we show just two curves: one for the KSG1 estimator and one for the KSG2 estimator.

using CausalityTools
using LinearAlgebra: det
using Distributions: MvNormal
using StateSpaceSets: StateSpaceSet
using CairoMakie
using Statistics

N = 2000
c = 0.9
Σ = [1 c; c 1]
N2 = MvNormal([0, 0], Σ)
mitrue = -0.5*log(det(Σ)) # in nats
ks = [2; 5; 7; 10:10:70] .* 2

nreps = 30
mis_ksg1 = zeros(nreps, length(ks))
mis_ksg2 = zeros(nreps, length(ks))
for i = 1:nreps
    D2 = StateSpaceSet([rand(N2) for i = 1:N])
    X = D2[:, 1] |> StateSpaceSet
    Y = D2[:, 2] |> StateSpaceSet
    measure = MIShannon(; base = ℯ)
    mis_ksg1[i, :] = map(k -> mutualinfo(measure, KSG1(; k), X, Y), ks)
    mis_ksg2[i, :] = map(k -> mutualinfo(measure, KSG2(; k), X, Y), ks)
end
fig = Figure()
ax = Axis(fig[1, 1], xlabel = "k / N", ylabel = "Mutual information (nats)")
scatterlines!(ax, ks ./ N, mean(mis_ksg1, dims = 1) |> vec, label = "KSG1")
scatterlines!(ax, ks ./ N, mean(mis_ksg2, dims = 1) |> vec, label = "KSG2")
hlines!(ax, [mitrue], color = :black, linewidth = 3, label = "I (true)")
axislegend()
fig

MutualInformationEstimator comparison

Most estimators suffer from significant bias when applied to discrete, finite data. One possible resolution is to add a small amount of noise to discrete variables, so that the data becomes continuous in practice.
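
For illustration, here is a minimal sketch of that noise trick (the noise amplitude 1e-8 and the choice of the KSG1 estimator are illustrative, not prescriptive):

using CausalityTools

# Discrete data: integer draws represented as floats.
x = float.(rand(1:8, 1000))
y = float.(rand(1:8, 1000))

# Add a tiny amount of noise so that samples no longer coincide exactly,
# which nearest-neighbor estimators implicitly assume.
x .+= 1e-8 .* randn(length(x))
y .+= 1e-8 .* randn(length(y))

mutualinfo(KSG1(k = 5), x, y)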

Instead of adding noise to your data, you can consider using an estimator that is specifically designed to deal with continuous-discrete mixture data. One example is the GaoKannanOhViswanath estimator.

Here, we compare its performance to KSG1 on uniformly distributed discrete multivariate data. The true mutual information is zero.

using CausalityTools
using Statistics
using StateSpaceSets: StateSpaceSet
using Statistics: mean
using CairoMakie

function compare_ksg_gkov(;
        k = 5,
        base = 2,
        nreps = 15,
        Ls = [500:100:1000; 1500; 2000; 3000; 4000; 5000; 10000])

    est_gkov = GaoKannanOhViswanath(; k)
    est_ksg1 = KSG1(; k)

    mis_ksg1_mix = zeros(nreps, length(Ls))
    mis_ksg1_discrete = zeros(nreps, length(Ls))
    mis_ksg1_cont = zeros(nreps, length(Ls))
    mis_gkov_mix = zeros(nreps, length(Ls))
    mis_gkov_discrete = zeros(nreps, length(Ls))
    mis_gkov_cont = zeros(nreps, length(Ls))

    for (j, L) in enumerate(Ls)
        for i = 1:nreps
            X = StateSpaceSet(float.(rand(1:8, L, 2)))
            Y = StateSpaceSet(float.(rand(1:8, L, 2)))
            Z = StateSpaceSet(rand(L, 2))
            W = StateSpaceSet(rand(L, 2))
            measure = MIShannon(; base) # use the requested logarithm base
            mis_ksg1_discrete[i, j] = mutualinfo(measure, est_ksg1, X, Y)
            mis_gkov_discrete[i, j] = mutualinfo(measure, est_gkov, X, Y)
            mis_ksg1_mix[i, j] = mutualinfo(measure, est_ksg1, X, Z)
            mis_gkov_mix[i, j] = mutualinfo(measure, est_gkov, X, Z)
            mis_ksg1_cont[i, j] = mutualinfo(measure, est_ksg1, Z, W)
            mis_gkov_cont[i, j] = mutualinfo(measure, est_gkov, Z, W)
        end
    end
    return mis_ksg1_mix, mis_ksg1_discrete, mis_ksg1_cont,
        mis_gkov_mix, mis_gkov_discrete, mis_gkov_cont
end

fig = Figure()
ax = Axis(fig[1, 1],
    xlabel = "Sample size",
    ylabel = "Mutual information (bits)")
Ls = [100; 200; 500; 1000; 2500; 5000; 10000]
nreps = 5
k = 3
mis_ksg1_mix, mis_ksg1_discrete, mis_ksg1_cont,
    mis_gkov_mix, mis_gkov_discrete, mis_gkov_cont =
    compare_ksg_gkov(; nreps, k, Ls)

scatterlines!(ax, Ls, mean(mis_ksg1_mix, dims = 1) |> vec,
    label = "KSG1 (mixed)", color = :black,
    marker = :utriangle)
scatterlines!(ax, Ls, mean(mis_ksg1_discrete, dims = 1) |> vec,
    label = "KSG1 (discrete)", color = :black,
    linestyle = :dash, marker = '▲')
scatterlines!(ax, Ls, mean(mis_ksg1_cont, dims = 1) |> vec,
    label = "KSG1 (continuous)", color = :black,
    linestyle = :dot, marker = '●')
scatterlines!(ax, Ls, mean(mis_gkov_mix, dims = 1) |> vec,
    label = "GaoKannanOhViswanath (mixed)", color = :red,
    marker = :utriangle)
scatterlines!(ax, Ls, mean(mis_gkov_discrete, dims = 1) |> vec,
    label = "GaoKannanOhViswanath (discrete)", color = :red,
    linestyle = :dash, marker = '▲')
scatterlines!(ax, Ls, mean(mis_gkov_cont, dims = 1) |> vec,
    label = "GaoKannanOhViswanath (continuous)", color = :red,
    linestyle = :dot, marker = '●')
axislegend(position = :rb)
fig

Estimation using DifferentialEntropyEstimators

Simple example

We can compute MIShannon by naively applying a DifferentialEntropyEstimator. Note that this doesn't apply any bias correction.
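
Concretely, this corresponds to the sum-of-entropies decomposition $I^S(X; Y) = h(X) + h(Y) - h(X, Y)$, where each differential entropy term is estimated separately, so the estimation errors of the individual terms do not necessarily cancel.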

using CausalityTools
x, y = rand(1000), rand(1000)
mutualinfo(Kraskov(k = 3), x, y)
-0.05494698222381378

DifferentialEntropyEstimator comparison

Let's compare the performance of a subset of the implemented mutual information estimators. We'll use example data from Lord et al., where the analytical mutual information is known.

using CausalityTools
using LinearAlgebra: det
using StateSpaceSets: StateSpaceSet
using Distributions: MvNormal
using LaTeXStrings
using CairoMakie

# adapted from https://juliadatascience.io/makie_colors
function new_cycle_theme()
    # https://nanx.me/ggsci/reference/pal_locuszoom.html
    my_colors = ["#D43F3AFF", "#EEA236FF", "#5CB85CFF", "#46B8DAFF",
        "#357EBDFF", "#9632B8FF", "#B8B8B8FF"]
    cycle = Cycle([:color, :linestyle, :marker], covary=true) # alltogether
    my_markers = [:circle, :rect, :utriangle, :dtriangle, :diamond,
        :pentagon, :cross, :xcross]
    my_linestyle = [nothing, :dash, :dot, :dashdot, :dashdotdot]
    return Theme(
        fontsize = 22, font="CMU Serif",
        colormap = :linear_bmy_10_95_c78_n256,
        palette = (
            color = my_colors,
            marker = my_markers,
            linestyle = my_linestyle,
        ),
        Axis = (
            backgroundcolor= (:white, 0.2),
            xgridstyle = :dash,
            ygridstyle = :dash
        ),
        Lines = (
            cycle= cycle,
        ),
        ScatterLines = (
            cycle = cycle,
        ),
        Scatter = (
            cycle = cycle,
        ),
        Legend = (
            bgcolor = (:grey, 0.05),
            framecolor = (:white, 0.2),
            labelsize = 13,
        )
    )
end

estimate_mi(est; f::Function, # function that generates data
        base::Real = ℯ,
        nreps::Int = 10,
        αs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        n::Int = 1000) =
    map(α -> mutualinfo(MIShannon(; base), est, f(α, n)...), αs)

function compute_results(f::Function; estimators, k = 5, k_lord = 20,
        n = 1000, base = ℯ, nreps = 10,
        as = 7:-1:0,
        αs = [1/10^(a) for a in as])

    is = [zeros(length(αs)) for est in estimators]
    for (k, est) in enumerate(estimators)
        tmp = zeros(length(αs))
        for i = 1:nreps
            tmp .+= estimate_mi(est; f = f, αs, base, n)
        end
        is[k] .= tmp ./ nreps
    end

    return is
end

function plot_results(f::Function, ftrue::Function;
        base, estimators, k_lord, k,
        as = 7:-1:0, αs = [1/10^(a) for a in as], kwargs...
    )
    is = compute_results(f;
        base, estimators, k_lord, k, as, αs, kwargs...)
    itrue = [ftrue(α; base) for α in αs]

    xmin, xmax = minimum(αs), maximum(αs)

    ymin = floor(Int, min(minimum(itrue), minimum(Iterators.flatten(is))))
    ymax = ceil(Int, max(maximum(itrue), maximum(Iterators.flatten(is))))
    fig = Figure()
    ax = Axis(fig[1, 1],
        xlabel = "α", ylabel = "I (nats)",
        xscale = log10, aspect = 1,
        xticks = (αs, [latexstring("10^{$(-a)}") for a in as]),
        yticks = (ymin:ymax)
        )
    xlims!(ax, (1/10^first(as), 1/10^last(as)))
    ylims!(ax, (ymin, ymax))
    lines!(ax, αs, itrue,
        label = "I (true)", linewidth = 4, color = :black)
    for (i, est) in enumerate(estimators)
        es = string(typeof(est).name.name)
        lbl = occursin("Lord", es) ? "$es (k = $k_lord)" : "$es (k = $k)"
        scatter!(ax, αs, is[i], label = lbl)
        lines!(ax, αs, is[i])

    end
    axislegend()
    return fig
end

set_theme!(new_cycle_theme())
k_lord = 20
k = 5
base = ℯ

estimators = [
    Kraskov(; k),
    KozachenkoLeonenko(),
    Zhu(; k),
    ZhuSingh(; k),
    Gao(; k),
    Lord(; k = k_lord),
    KSG1(; k),
    KSG2(; k),
    GaoOhViswanath(; k),
    GaoKannanOhViswanath(; k),
    GaussianMI(),
]
11-element Vector{Any}:
 Kraskov{Int64}(5, 1, 2)
 KozachenkoLeonenko{Int64}(0, 2)
 Zhu{Int64}(5, 0, 2)
 ZhuSingh{Int64}(5, 0, 2)
 Gao{Int64}(5, 0, 2, true)
 Lord{Int64}(20, 0, 2)
 KraskovStögbauerGrassberger1{Chebyshev, Chebyshev}(5, 0, Chebyshev(), Chebyshev())
 KraskovStögbauerGrassberger2{Chebyshev, Chebyshev}(5, 0, Chebyshev(), Chebyshev())
 GaoOhViswanath{Euclidean, Euclidean}(5, 0, Euclidean(0.0), Euclidean(0.0))
 GaoKannanOhViswanath(5, 0)
 GaussianMI(false)

Family 1

In this system, samples are concentrated around the diagonal $X = Y$, and the strip of samples gets thinner as $\alpha \to 0$.

function family1(α, n::Int)
    x = rand(n)
    v = rand(n)
    y = x + α * v
    return StateSpaceSet(x), StateSpaceSet(y)
end

# True mutual information values for these data
function ifamily1(α; base = ℯ)
    mi = -log(α) - α - log(2)
    return mi / log(ℯ, base)
end

fig = plot_results(family1, ifamily1;
    k_lord = k_lord, k = k, nreps = 10,
    estimators = estimators,
    base = base)

Family 2

function family2(α, n::Int)
    Σ = [1 α; α 1]
    N2 = MvNormal(zeros(2), Σ)
    D2 = StateSpaceSet([rand(N2) for i = 1:n])
    X = StateSpaceSet(D2[:, 1])
    Y = StateSpaceSet(D2[:, 2])
    return X, Y
end

function ifamily2(α; base = ℯ)
    return (-0.5 * log(1 - α^2)) / log(ℯ, base)
end

αs = 0.05:0.05:0.95
estimators = estimators
with_theme(new_cycle_theme()) do
    f = Figure();
    ax = Axis(f[1, 1], xlabel = "α", ylabel = "I (nats)")
    is_true = map(α -> ifamily2(α), αs)
    is_est = map(est -> estimate_mi(est; f = family2, αs, nreps = 20), estimators)
    lines!(ax, αs, is_true,
        label = "I (true)", color = :black, linewidth = 3)
    for (i, est) in enumerate(estimators)
        estname = typeof(est).name.name |> String
        scatterlines!(ax, αs, is_est[i], label = estname)
    end
    axislegend(position = :lt)
    return f
end

Family 3

In this system, we draw samples from a 4D Gaussian distribution with the covariance structure specified in the family3 function below. We let $X$ be the first two variables and $Y$ be the last two variables.

function ifamily3(α; base = ℯ)
    Σ = [7 -5 -1 -3; -5 5 -1 3; -1 -1 3 -1; -3 3 -1 2+α]
    Σx = Σ[1:2, 1:2]; Σy = Σ[3:4, 3:4]
    mi = 0.5*log(det(Σx) * det(Σy) / det(Σ))
    return mi / log(ℯ, base)
end

function family3(α, n::Int)
    Σ = [7 -5 -1 -3; -5 5 -1 3; -1 -1 3 -1; -3 3 -1 2+α]
    N4 = MvNormal(zeros(4), Σ)
    D4 = StateSpaceSet([rand(N4) for i = 1:n])
    X = D4[:, 1:2]
    Y = D4[:, 3:4]
    return X, Y
end

fig = plot_results(family3, ifamily3;
    k_lord = k_lord, k = k, nreps = 10,
    n = 2000,
    estimators = estimators, base = base)

We see that the Lord estimator, which estimates local volume elements using a singular-value decomposition (SVD) of local neighborhoods, outperforms the other estimators by a large margin.

Estimation using ProbabilitiesEstimators

We can also use a ProbabilitiesEstimator to estimate Shannon mutual information. This does not apply any bias correction.

Discrete MIShannon with ValueHistogram

A ValueHistogram estimator can be used to bin the data and compute discrete Shannon mutual information.

using CausalityTools
using Random; rng = MersenneTwister(1234)
x = rand(rng, 1000)
y = rand(rng, 1000)

# Use the H3-estimation method with a discrete visitation frequency based
# probabilities estimator over a fixed grid covering the range of the data,
# which is on [0, 1].
est = ValueHistogram(FixedRectangularBinning(0, 1, 5))
mutualinfo(est, x, y)
0.0026305563117743347

Discrete MIShannon with Contingency (numerical)

The above example is in fact equivalent to using the Contingency estimator with the same underlying ValueHistogram estimator. However, the Contingency estimator is more flexible, because it can also be used on categorical data.

using CausalityTools
using Random; rng = MersenneTwister(1234)
x = rand(rng, 1000)
y = rand(rng, 1000)
est = ValueHistogram(FixedRectangularBinning(0, 1, 5))
mutualinfo(Contingency(est), x, y)
0.0026305563117749483

Discrete MIShannon with ContingencyMatrix (manual)

If you need explicit access to the estimated joint probability mass function, use a ContingencyMatrix directly.

using CausalityTools
using Random; rng = MersenneTwister(1234)
x = rand(rng, 1000)
y = rand(rng, 1000)
est = ValueHistogram(FixedRectangularBinning(0, 1, 5))
c = contingency_matrix(est, x, y)
mutualinfo(c)
0.0026305563117749483
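
For illustration, Shannon mutual information can also be computed by hand from any joint probability mass function stored as a plain matrix. The helper below, mi_from_joint_pmf, is a hypothetical name used only for this sketch and is not part of the CausalityTools API:

# Minimal sketch: Shannon MI (in bits) from a joint pmf `p`, where
# p[i, j] = P(X = xᵢ, Y = yⱼ) and sum(p) ≈ 1.
function mi_from_joint_pmf(p::AbstractMatrix)
    px = vec(sum(p, dims = 2)) # marginal distribution of X
    py = vec(sum(p, dims = 1)) # marginal distribution of Y
    mi = 0.0
    for i in axes(p, 1), j in axes(p, 2)
        pij = p[i, j]
        pij > 0 || continue
        mi += pij * log2(pij / (px[i] * py[j]))
    end
    return mi
end

# Example: a 2×2 joint pmf where X and Y always agree gives exactly 1 bit.
mi_from_joint_pmf([0.5 0.0; 0.0 0.5])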

Discrete MIShannon with Contingency (categorical)

The ContingencyMatrix approach can also be used with categorical data. For example, let's compute the Shannon mutual information between the food preferences of a population sample and two food variables: one that depends on the preferences, and one that is chosen at random.

using CausalityTools
n = 1000
preferences = rand(["neutral", "like it", "hate it"], n);
random_foods = rand(["water", "flour", "bananas", "booze", "potatoes", "beans", "soup"], n)
biased_foods = map(preferences) do preference
    if preference == "neutral"
        return rand(["water", "flour"])
    elseif preference == "like it"
        return rand(["bananas", "booze"])
    else # "hate it"
        return rand(["potatoes", "beans", "soup"])
    end
end

c_biased = contingency_matrix(preferences, biased_foods)
c_random = contingency_matrix(preferences, random_foods)
mutualinfo(c_biased), mutualinfo(c_random)
(0.9257816227191029, 0.011047884009353137)

Longer example: AR1-system and unidirectionally coupled logistic maps

In this example we generate realizations of two different systems where we know the strength of coupling between the variables. Our aim is to compute Shannon mutual information $I^S(X; Y)$ (MIShannon) between time series of each variable and assess how the magnitude of $I^S(X; Y)$ changes as we change the strength of coupling between $X$ and $Y$. We'll use two systems that ship with CausalityTools.jl:

  • A stochastic system consisting of two unidirectionally coupled first-order autoregressive processes (ar1_unidir)
  • A deterministic, chaotic system consisting of two unidirectionally coupled logistic maps (logistic2_unidir)

We use the default input parameter values (see AR1Unidir and Logistic2Unidir for details); below, we vary only the coupling strength parameter c_xy and draw random initial conditions for each realization. For each value of c_xy, we generate 1,000 unique realizations of the system and obtain 500-point time series of the coupled variables.

To estimate the mutual information, we use the binning-based ValueHistogram estimator. We summarize the distribution of $I(X; Y)$ values across all realizations using the median and the quantiles encompassing 95% of the values.

using CausalityTools
using Statistics
using CairoMakie

# Span a range of x-y coupling strengths
c = 0.0:0.1:1.0

# Number of observations in each time series
npts = 500

# Number of unique realizations of each system
n_realizations = 1000

# Get MI for multiple realizations of two systems,
# saving three quantiles for each c value
mi = zeros(length(c), 3, 2)

# Define an estimator for MI
b = RectangularBinning(4)
estimator = ValueHistogram(b)

for i in 1 : length(c)

    tmp = zeros(n_realizations, 2)

    for k in 1 : n_realizations

        # Obtain time series realizations of the two 2D systems
        # for a given coupling strength and random initial conditions
        s_logistic = system(Logistic2Unidir(; xi = rand(2), c_xy = c[i]))
        s_ar = system(AR1Unidir(xi = rand(2), c_xy = c[i]))
        lmap = first(trajectory(s_logistic, npts - 1, Ttr = 500))
        ar1 = first(trajectory(s_ar, npts - 1))

        # Compute the MI between the two coupled components of each system
        tmp[k, 1] = mutualinfo(MIShannon(), estimator, lmap[:, 1], lmap[:, 2])
        tmp[k, 2] = mutualinfo(MIShannon(), estimator, ar1[:, 1], ar1[:, 2])
    end

    # Compute lower, middle, and upper quantiles of MI for each coupling strength
    mi[i, :, 1] = quantile(tmp[:, 1], [0.025, 0.5, 0.975])
    mi[i, :, 2] = quantile(tmp[:, 2], [0.025, 0.5, 0.975])
end

# Plot distribution of MI values as a function of coupling strength for both systems
fig = with_theme(theme_minimal()) do
    fig = Figure()
    ax = Axis(fig[1, 1], xlabel = "Coupling strength", ylabel = "Mutual information")
    band!(ax, c, mi[:, 1, 1], mi[:, 3, 1], color = (:black, 0.3))
    lines!(ax, c, mi[:, 2, 1], label = "2D chaotic logistic maps", color = :black)
    band!(ax, c, mi[:, 1, 2], mi[:, 3, 2], color = (:red, 0.3))
    lines!(ax, c, mi[:, 2, 2],  label = "2D order-1 autoregressive", color = :red)
    return fig
end
fig

As expected, $I(X; Y)$ increases with coupling strength in a system-specific manner.