Examples of association measure estimation

HellingerDistance

From precomputed probabilities

using Associations
# From pre-computed PMFs
p1 = Probabilities([0.1, 0.5, 0.2, 0.2])
p2 = Probabilities([0.3, 0.3, 0.2, 0.2])
association(HellingerDistance(), p1, p2)
0.055856605640373105

JointProbabilities + OrdinalPatterns

We expect the Hellinger distance between two uncorrelated variables to be close to zero.

using Associations
using Random; rng = Xoshiro(1234)
n = 100000
x, y = rand(rng, n), rand(rng, n)
est = JointProbabilities(HellingerDistance(), CodifyVariables(OrdinalPatterns(m=3)))
div_hd = association(est, x, y) # pretty close to zero
3.1541797712493756e-6

KLDivergence

From precomputed probabilities

using Associations
# From pre-computed PMFs
p1 = Probabilities([0.1, 0.5, 0.2, 0.2])
p2 = Probabilities([0.3, 0.3, 0.2, 0.2])
association(KLDivergence(), p1, p2)
0.20998654701098748

JointProbabilities + OrdinalPatterns

We expect the KLDivergence between two uncorrelated variables to be close to zero.

using Associations
using Random; rng = Xoshiro(1234)
n = 100000
x, y = rand(rng, n), rand(rng, n)
est = JointProbabilities(KLDivergence(), CodifyVariables(OrdinalPatterns(m=3)))
div_hd = association(est, x, y) # pretty close to zero
1.2875933381844337e-5

RenyiDivergence

From precomputed probabilities

using Associations
# From pre-computed PMFs
p1 = Probabilities([0.1, 0.5, 0.2, 0.2])
p2 = Probabilities([0.3, 0.3, 0.2, 0.2])
association(RenyiDivergence(), p1, p2)
0.11627470204599202

JointProbabilities + OrdinalPatterns

We expect the RenyiDivergence between two uncorrelated variables to be close to zero.

using Associations
using Random; rng = Xoshiro(1234)
n = 100000
x, y = rand(rng, n), rand(rng, n)
est = JointProbabilities(RenyiDivergence(), CodifyVariables(OrdinalPatterns(m=3)))
div_hd = association(est, x, y) # pretty close to zero
6.4354135893600935e-6

VariationDistance

From precomputed probabilities

using Associations
# From pre-computed PMFs
p1 = Probabilities([0.1, 0.5, 0.2, 0.2])
p2 = Probabilities([0.3, 0.3, 0.2, 0.2])
association(VariationDistance(), p1, p2)
0.2

JointProbabilities + OrdinalPatterns

We expect the VariationDistance between two uncorrelated variables to be close to zero.

using Associations
using Random; rng = Xoshiro(1234)
n = 100000
x, y = rand(rng, n), rand(rng, n)
est = JointProbabilities(VariationDistance(), CodifyVariables(OrdinalPatterns(m=3)))
div_hd = association(est, x, y) # pretty close to zero
0.001530030600612034

JointEntropyShannon

JointProbabilities with Dispersion

using Associations
using Random; rng = Xoshiro(1234)
x, y = rand(rng, 100), rand(rng, 100)
measure = JointEntropyShannon()
discretization = CodifyVariables(Dispersion(m = 2, c = 3))
est = JointProbabilities(measure, discretization)
association(est, x, y)
3.115303256966459

JointEntropyTsallis

JointProbabilities with OrdinalPatterns

using Associations
using Random; rng = Xoshiro(1234)
x, y = rand(rng, 100), rand(rng, 100)
measure = JointEntropyTsallis()
discretization = CodifyVariables(OrdinalPatterns(m = 3))
est = JointProbabilities(measure, discretization)
association(est, x, y)
2.3340683377759546

JointEntropyRenyi

JointProbabilities with OrdinalPatterns

using Associations
using Random; rng = Xoshiro(1234)
x, y = rand(rng, 100), rand(rng, 100)
measure = JointEntropyRenyi(q = 0.5)
discretization = CodifyVariables(ValueBinning(2))
est = JointProbabilities(measure, discretization)
association(est, x, y)
1.9472483134811365

ConditionalEntropyShannon

Analytical examples

This is essentially example 2.2.1 in Cover & Thomas (2006), where they use the following relative frequency table as an example. Note that Julia is column-major, so we need to transpose their example. Then their X is in the first dimension of our table (along columns) and their Y is our second dimension (rows).

using Associations
freqs_yx = [1//8 1//16 1//32 1//32;
    1//16 1//8  1//32 1//32;
    1//16 1//16 1//16 1//16;
    1//4  0//1  0//1  0//1];
# `freqs_yx` is already normalized, so we can feed it directly to `Probabilities`
pxy = Probabilities(freqs_yx)
 4×4 Probabilities{Rational{Int64},2}
                Outcome(1)     Outcome(2)     Outcome(3)     Outcome(4)
 Outcome(1)             1//8           1//16          1//32          1//32
 Outcome(2)             1//16          1//8           1//32          1//32
 Outcome(3)             1//16          1//16          1//16          1//16
 Outcome(4)             1//4           0//1           0//1           0//1

The marginal distribution for x (first dimension) is

marginal(pxy, dims = 2)
 Probabilities{Rational{Int64},1} over 4 outcomes
 Outcome(1)  1//2
 Outcome(2)  1//4
 Outcome(3)  1//8
 Outcome(4)  1//8

The marginal distribution for y (second dimension) is

marginal(pxy, dims = 1)
 Probabilities{Rational{Int64},1} over 4 outcomes
 Outcome(1)  1//4
 Outcome(2)  1//4
 Outcome(3)  1//4
 Outcome(4)  1//4

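Recall that the Shannon conditional entropy is $H^S(X | Y) = -\sum_{x, y} p(x, y) \log p(x | y) = H^S(X, Y) - H^S(Y)$.
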
And the Shannon conditional entropy $H^S(X | Y)$

ce_x_given_y = association(ConditionalEntropyShannon(), pxy) |> Rational
13//8

This is the same as in their example. Hooray! To compute $H^S(Y | X)$, we just need to flip the contingency matrix.

pyx = Probabilities(transpose(freqs_yx))
ce_y_given_x = association(ConditionalEntropyShannon(), pyx) |> Rational
11//8

JointProbabilities + CodifyVariables + UniqueElements

We can of course also estimate conditional entropy from data. To do so, we'll use the JointProbabilities estimator, which constructs a multivariate PMF for us. Thus, we don't explicitly need a set of counts, as in the example above, because they are estimated under the hood for us.

Let's first demonstrate on some categorical data. For that, we must use UniqueElements as the discretization (i.e. just count unique elements).

using Associations
using Random; rng = Xoshiro(1234)
n = 1000
rating = rand(rng, 1:6, n)
movie = rand(rng, ["The Witcher: the movie", "Lord of the Rings"], n)

disc = CodifyVariables(UniqueElements())
est = JointProbabilities(ConditionalEntropyShannon(), disc)
association(est, rating, movie)
2.5744931842297527

JointProbabilities + CodifyPoints + UniqueElementsEncoding

using Associations
using Random; rng = Xoshiro(1234)
x, y, z = rand(rng, 1:5, 100), rand(rng, 1:5, 100), rand(rng, 1:3, 100)
X = StateSpaceSet(x, z)
Y = StateSpaceSet(y, z)
disc = CodifyPoints(UniqueElementsEncoding(X), UniqueElementsEncoding(Y));
est = JointProbabilities(ConditionalEntropyShannon(), disc);
association(est, X, Y)
1.8234663200044612

ConditionalEntropyTsallisAbe

JointProbabilities + CodifyVariables + UniqueElements

We'll here repeat the analysis we did for ConditionalEntropyShannon above.

using Associations
using Random; rng = Xoshiro(1234)
n = 1000
rating = rand(rng, 1:6, n)
movie = rand(rng, ["The Witcher: the movie", "Lord of the Rings"], n)

disc = CodifyVariables(UniqueElements())
est = JointProbabilities(ConditionalEntropyTsallisAbe(q =1.5), disc)
association(est, rating, movie)
1.7608975191987752

JointProbabilities + CodifyPoints + UniqueElementsEncoding

using Associations
using Random; rng = Xoshiro(1234)
x, y, z = rand(rng, 1:5, 100), rand(rng, 1:5, 100), rand(rng, 1:3, 100)
X = StateSpaceSet(x, z)
Y = StateSpaceSet(y, z)
disc = CodifyPoints(UniqueElementsEncoding(X), UniqueElementsEncoding(Y));
est = JointProbabilities(ConditionalEntropyTsallisAbe(q = 1.5), disc);
association(est, X, Y)
1.8293074543081442

ConditionalEntropyTsallisFuruichi

JointProbabilities + CodifyVariables + UniqueElements

We'll here repeat the analysis we did for ConditionalEntropyShannon and ConditionalEntropyTsallisAbe above.

using Associations
using Random; rng = Xoshiro(1234)
n = 1000
rating = rand(rng, 1:6, n)
movie = rand(rng, ["The Witcher: the movie", "Lord of the Rings"], n)

disc = CodifyVariables(UniqueElements())
est = JointProbabilities(ConditionalEntropyTsallisFuruichi(q =0.5), disc)
association(est, rating, movie)
4.087134896624493

JointProbabilities + CodifyPoints + UniqueElementsEncoding

using Associations
using Random; rng = Xoshiro(1234)
x, y, z = rand(rng, 1:5, 100), rand(rng, 1:5, 100), rand(rng, 1:3, 100)
X = StateSpaceSet(x, z)
Y = StateSpaceSet(y, z)
disc = CodifyPoints(UniqueElementsEncoding(X), UniqueElementsEncoding(Y));
est = JointProbabilities(ConditionalEntropyTsallisFuruichi(q = 0.5), disc);
association(est, X, Y)
6.6943366123303045

MIShannon

JointProbabilities + ValueBinning

using Associations
using Random; rng = MersenneTwister(1234)
x = rand(rng, 1000)
y = rand(rng, 1000)
discretization = CodifyVariables(ValueBinning(FixedRectangularBinning(0, 1, 5)))
est = JointProbabilities(MIShannon(), discretization)
association(est, x, y)
0.003096460803260294

JointProbabilities + UniqueElements

The JointProbabilities estimator can also be used with categorical data. For example, let's compare the Shannon mutual information between a population sample's food preferences and their food choices, for both a preference-biased and a completely random selection of foods.

using Associations
n = 1000
preferences = rand(["neutral", "like it", "hate it"], n);
random_foods = rand(["water", "flour", "bananas", "booze", "potatoes", "beans", "soup"], n)
biased_foods = map(preferences) do preference
    if cmp(preference, "neutral") == 1
        return rand(["water", "flour"])
    elseif cmp(preference, "like it") == 1
        return rand(["bananas", "booze"])
    else
        return rand(["potatoes", "beans", "soup"])
    end
end

est = JointProbabilities(MIShannon(), UniqueElements())
association(est, preferences, biased_foods), association(est, preferences, random_foods)
(0.9367194187941841, 0.008987336108764796)

Dedicated GaussianMI estimator

using Associations
using Distributions
using Statistics

n = 1000
using Associations
x = randn(1000)
y = rand(1000) .+ x
association(GaussianMI(MIShannon()), x, y) # defaults to `MIShannon()`
1.8417632488404279

Dedicated KraskovStögbauerGrassberger1 estimator

using Associations
x, y = rand(1000), rand(1000)
association(KSG1(MIShannon(); k = 5), x, y)
-0.04188252787571423

Dedicated KraskovStögbauerGrassberger2 estimator

using Associations
x, y = rand(1000), rand(1000)
association(KSG2(MIShannon(); k = 5), x, y)
-0.011732913223829284

Dedicated GaoKannanOhViswanath estimator

using Associations
x, y = rand(1000), rand(1000)
association(GaoKannanOhViswanath(MIShannon(); k = 10), x, y)
0.0010651304255182388

EntropyDecomposition + Kraskov

We can compute MIShannon by naively applying a DifferentialInfoEstimator. Note that this doesn't apply any bias correction.
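Recall that $I(X; Y) = H(X) + H(Y) - H(X, Y)$; EntropyDecomposition estimates each entropy term separately with the given estimator and combines them.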

using Associations
x, y = rand(1000), rand(1000)
association(EntropyDecomposition(MIShannon(), Kraskov(k = 3)), x, y)
-0.0065468213340297565

EntropyDecomposition + BubbleSortSwaps

We can also compute MIShannon by naively applying a DiscreteInfoEstimator. Note that this doesn't apply any bias correction.

using Associations
x, y = rand(1000), rand(1000)
disc = CodifyVariables(BubbleSortSwaps(m=5))
hest = PlugIn(Shannon())
association(EntropyDecomposition(MIShannon(), hest, disc), x, y)
0.061449403961530535

EntropyDecomposition + Jackknife + ValueBinning

Shannon mutual information can be written as a sum of marginal entropy terms. Here, we use CodifyVariables with ValueBinning to bin the data and compute discrete Shannon mutual information.

using Associations
using Random; rng = MersenneTwister(1234)
x = rand(rng, 50)
y = rand(rng, 50)

# Use the H3-estimation method with a discrete visitation frequency based
# probabilities estimator over a fixed grid covering the range of the data,
# which is on [0, 1].
discretization = CodifyVariables(ValueBinning(FixedRectangularBinning(0, 1, 5)))
hest = Jackknife(Shannon())
est = EntropyDecomposition(MIShannon(), hest, discretization)
association(est, x, y)
-0.14432478545988658

Reproducing Kraskov et al. (2004)

Here, we'll reproduce Figure 4 from Kraskov et al. (2004)'s seminal paper on the nearest-neighbor based mutual information estimator. We'll estimate the mutual information between marginals of a bivariate Gaussian for a fixed time series length, varying the number of neighbors. Note: in the original paper, they show multiple curves corresponding to different time series lengths. We only show two curves: one for the KraskovStögbauerGrassberger1 estimator and one for the KraskovStögbauerGrassberger2 estimator.

using Associations
using LinearAlgebra: det
using Distributions: MvNormal
using StateSpaceSets: StateSpaceSet
using CairoMakie
using Statistics

N = 800
c = 0.9
Σ = [1 c; c 1]
N2 = MvNormal([0, 0], Σ)
mitrue = -0.5*log(det(Σ)) # in nats
ks = [2; 5; 7; 10:10:70] .* 2

nreps = 10 # plot average over 10 independent realizations
mis_ksg1 = zeros(nreps, length(ks))
mis_ksg2 = zeros(nreps, length(ks))
for i = 1:nreps
    D2 = StateSpaceSet([rand(N2) for i = 1:N])
    X = D2[:, 1] |> StateSpaceSet
    Y = D2[:, 2] |> StateSpaceSet
    for (j, k) in enumerate(ks)
        est1 = KSG1(MIShannon(; base = ℯ); k)
        est2 = KSG2(MIShannon(; base = ℯ); k)
        mis_ksg1[i, j] = association(est1, X, Y)
        mis_ksg2[i, j] = association(est2, X, Y)
    end
end
fig = Figure()
ax = Axis(fig[1, 1], xlabel = "k / N", ylabel = "Mutual information (nats)")
scatterlines!(ax, ks ./ N, mean(mis_ksg1, dims = 1) |> vec, label = "KSG1")
scatterlines!(ax, ks ./ N, mean(mis_ksg2, dims = 1) |> vec, label = "KSG2")
hlines!(ax, [mitrue], color = :black, linewidth = 3, label = "I (true)")
axislegend()
fig
Example block output

Estimator comparison for MIShannon

Most estimators suffer from significant bias when applied to discrete, finite data. One possible resolution is to add a small amount of noise to discrete variables, so that the data becomes continuous in practice.
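For instance, a minimal sketch of such jittering could look like this (the noise amplitude 1e-6 is an arbitrary illustrative choice, not a package recommendation):

using Random; rng = Xoshiro(1234)
x_discrete = rand(rng, 1:8, 1000)                           # discrete-valued data
x_jittered = float.(x_discrete) .+ 1e-6 .* rand(rng, 1000)  # tiny noise breaks ties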

But instead of adding noise to your data, you can also consider using an estimator that is specifically designed to deal with continuous-discrete mixture data. One example is the GaoKannanOhViswanath estimator. Below, we compare its performance to KraskovStögbauerGrassberger1 on uniformly distributed discrete multivariate data. The true mutual information is zero. While the "naive" KraskovStögbauerGrassberger1 estimator diverges from the true value for these data, the GaoKannanOhViswanath estimator converges to the true value.

using Associations
using Statistics
using StateSpaceSets: StateSpaceSet
using Statistics: mean
using CairoMakie

function compare_ksg_gkov(;
        k = 5,
        base = 2,
        nreps = 10,
        Ls = [500:100:1000; 1500; 2500; 5000; 7000])


    mis_ksg1_mix = zeros(nreps, length(Ls))
    mis_ksg1_discrete = zeros(nreps, length(Ls))
    mis_ksg1_cont = zeros(nreps, length(Ls))
    mis_gkov_mix = zeros(nreps, length(Ls))
    mis_gkov_discrete = zeros(nreps, length(Ls))
    mis_gkov_cont = zeros(nreps, length(Ls))

    for (j, L) in enumerate(Ls)
        for i = 1:nreps
            X = StateSpaceSet(float.(rand(1:8, L, 2)))
            Y = StateSpaceSet(float.(rand(1:8, L, 2)))
            Z = StateSpaceSet(rand(L, 2))
            W = StateSpaceSet(rand(L, 2))
            est_gkov = GaoKannanOhViswanath(MIShannon(; base = ℯ); k)
            est_ksg1 = KSG1(MIShannon(; base = ℯ); k)
            mis_ksg1_discrete[i, j] = association(est_ksg1, X, Y)
            mis_gkov_discrete[i, j] = association(est_gkov, X, Y)
            mis_ksg1_mix[i, j] = association(est_ksg1, X, Z)
            mis_gkov_mix[i, j] = association(est_gkov, X, Z)
            mis_ksg1_cont[i, j] = association(est_ksg1, Z, W)
            mis_gkov_cont[i, j] = association(est_gkov, Z, W)
        end
    end
    return mis_ksg1_mix, mis_ksg1_discrete, mis_ksg1_cont,
        mis_gkov_mix, mis_gkov_discrete, mis_gkov_cont
end

fig = Figure()
ax = Axis(fig[1, 1],
    xlabel = "Sample size",
    ylabel = "Mutual information (bits)")
Ls = [100; 200; 500; 1000; 2500; 5000; 7000]
nreps = 5
k = 3
mis_ksg1_mix, mis_ksg1_discrete, mis_ksg1_cont,
    mis_gkov_mix, mis_gkov_discrete, mis_gkov_cont =
    compare_ksg_gkov(; nreps, k, Ls)

scatterlines!(ax, Ls, mean(mis_ksg1_mix, dims = 1) |> vec,
    label = "KSG1 (mixed)", color = :black,
    marker = :utriangle)
scatterlines!(ax, Ls, mean(mis_ksg1_discrete, dims = 1) |> vec,
    label = "KSG1 (discrete)", color = :black,
    linestyle = :dash, marker = '▲')
scatterlines!(ax, Ls, mean(mis_ksg1_cont, dims = 1) |> vec,
    label = "KSG1 (continuous)", color = :black,
    linestyle = :dot, marker = '●')
scatterlines!(ax, Ls, mean(mis_gkov_mix, dims = 1) |> vec,
    label = "GaoKannanOhViswanath (mixed)", color = :red,
    marker = :utriangle)
scatterlines!(ax, Ls, mean(mis_gkov_discrete, dims = 1) |> vec,
    label = "GaoKannanOhViswanath (discrete)", color = :red,
    linestyle = :dash, marker = '▲')
scatterlines!(ax, Ls, mean(mis_gkov_cont, dims = 1) |> vec,
    label = "GaoKannanOhViswanath (continuous)", color = :red,
    linestyle = :dot, marker = '●')
axislegend(position = :rb)
fig
Example block output

Estimation using DifferentialInfoEstimators: a comparison

Let's compare the performance of a subset of the implemented mutual information estimators. We'll use example data from Lord et al., where the analytical mutual information is known.

using Associations
using LinearAlgebra: det
using StateSpaceSets: StateSpaceSet
using Distributions: MvNormal
using LaTeXStrings
using CairoMakie

# adapted from https://juliadatascience.io/makie_colors
function new_cycle_theme()
    # https://nanx.me/ggsci/reference/pal_locuszoom.html
    my_colors = ["#D43F3AFF", "#EEA236FF", "#5CB85CFF", "#46B8DAFF",
        "#357EBDFF", "#9632B8FF", "#B8B8B8FF"]
    cycle = Cycle([:color, :linestyle, :marker], covary=true) # alltogether
    my_markers = [:circle, :rect, :utriangle, :dtriangle, :diamond,
        :pentagon, :cross, :xcross]
    my_linestyle = [nothing, :dash, :dot, :dashdot, :dashdotdot]
    return Theme(
        fontsize = 22, font="CMU Serif",
        colormap = :linear_bmy_10_95_c78_n256,
        palette = (
            color = my_colors,
            marker = my_markers,
            linestyle = my_linestyle,
        ),
        Axis = (
            backgroundcolor= (:white, 0.2),
            xgridstyle = :dash,
            ygridstyle = :dash
        ),
        Lines = (
            cycle= cycle,
        ),
        ScatterLines = (
            cycle = cycle,
        ),
        Scatter = (
            cycle = cycle,
        ),
        Legend = (
            bgcolor = (:grey, 0.05),
            framecolor = (:white, 0.2),
            labelsize = 13,
        )
    )
end

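# Estimate the association for each α in `αs`, using data generated by `f(α, n)`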
run(est; f::Function, # function that generates data
        base::Real = ℯ,
        nreps::Int = 10,
        αs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        n::Int = 1000) =
    map(α -> association(est, f(α, n)...), αs)

function compute_results(f::Function; estimators, k = 5, k_lord = 20,
        n = 1000, base = ℯ, nreps = 10,
        as = 7:-1:0,
        αs = [1/10^(a) for a in as])

    is = [zeros(length(αs)) for est in estimators]
    for (k, est) in enumerate(estimators)
        tmp = zeros(length(αs))
        for i = 1:nreps
            tmp .+= run(est; f = f, αs, base, n)
        end
        is[k] .= tmp ./ nreps
    end

    return is
end

function plot_results(f::Function, ftrue::Function;
        base, estimators, k_lord, k,
        as = 7:-1:0, αs = [1/10^(a) for a in as], kwargs...
    )
    is = compute_results(f;
        base, estimators, k_lord, k, as, αs, kwargs...)
    itrue = [ftrue(α; base) for α in αs]

    xmin, xmax = minimum(αs), maximum(αs)

    ymin = floor(Int, min(minimum(itrue), minimum(Iterators.flatten(is))))
    ymax = ceil(Int, max(maximum(itrue), maximum(Iterators.flatten(is))))
    f = Figure()
    ax = Axis(f[1, 1],
        xlabel = "α", ylabel = "I (nats)",
        xscale = log10, aspect = 1,
        xticks = (αs, [latexstring("10^{$(-a)}") for a in as]),
        yticks = (ymin:ymax)
        )
    xlims!(ax, (1/10^first(as), 1/10^last(as)))
    ylims!(ax, (ymin, ymax))
    lines!(ax, αs, itrue,
        label = "I (true)", linewidth = 4, color = :black)
    for (i, est) in enumerate(estimators)
        if est isa EntropyDecomposition
            es = typeof(est.est).name.name |> String
        else
            es = typeof(est).name.name |> String
        end
        @show es
        lbl = occursin("Lord", es) ? "$es (k = $k_lord)" : "$es (k = $k)"
        scatter!(ax, αs, is[i], label = lbl)
        lines!(ax, αs, is[i])

    end
    axislegend()
    return f
end

set_theme!(new_cycle_theme())
k_lord = 20
k = 5
base = ℯ

def = MIShannon(base = ℯ)
estimators = [
    EntropyDecomposition(def, Kraskov(; k)),
    EntropyDecomposition(def, KozachenkoLeonenko()),
    EntropyDecomposition(def, Zhu(; k)),
    EntropyDecomposition(def, ZhuSingh(; k)),
    EntropyDecomposition(def, Gao(; k)),
    EntropyDecomposition(def, Lord(; k = k_lord)),
    EntropyDecomposition(def, LeonenkoProzantoSavani(Shannon(); k)),
    KSG1(def; k),
    KSG2(def; k),
    GaoOhViswanath(def; k),
    GaoKannanOhViswanath(def; k),
    GaussianMI(def),
];
12-element Vector{MultivariateInformationMeasureEstimator{MIShannon{Irrational{:ℯ}}}}:
 EntropyDecomposition{MIShannon{Irrational{:ℯ}}, Kraskov{Shannon{Int64}, Int64}, Nothing, Nothing}(MIShannon{Irrational{:ℯ}}(ℯ), Kraskov(definition = Shannon(base = 2), k = 5, w = 0, base = 2), nothing, nothing)
 EntropyDecomposition{MIShannon{Irrational{:ℯ}}, KozachenkoLeonenko{Shannon{Int64}}, Nothing, Nothing}(MIShannon{Irrational{:ℯ}}(ℯ), KozachenkoLeonenko(definition = Shannon(base = 2), w = 0), nothing, nothing)
 EntropyDecomposition{MIShannon{Irrational{:ℯ}}, Zhu{Shannon{Int64}}, Nothing, Nothing}(MIShannon{Irrational{:ℯ}}(ℯ), Zhu(definition = Shannon(base = 2), k = 5, w = 0), nothing, nothing)
 EntropyDecomposition{MIShannon{Irrational{:ℯ}}, ZhuSingh{Shannon{Int64}}, Nothing, Nothing}(MIShannon{Irrational{:ℯ}}(ℯ), ZhuSingh(definition = Shannon(base = 2), k = 5, w = 0), nothing, nothing)
 EntropyDecomposition{MIShannon{Irrational{:ℯ}}, Gao{Shannon{Int64}}, Nothing, Nothing}(MIShannon{Irrational{:ℯ}}(ℯ), Gao(definition = Shannon(base = 2), k = 5, w = 0, corrected = true), nothing, nothing)
 EntropyDecomposition{MIShannon{Irrational{:ℯ}}, Lord{Shannon{Int64}}, Nothing, Nothing}(MIShannon{Irrational{:ℯ}}(ℯ), Lord(definition = Shannon(base = 2), k = 20, w = 0), nothing, nothing)
 EntropyDecomposition{MIShannon{Irrational{:ℯ}}, LeonenkoProzantoSavani{Shannon{Int64}}, Nothing, Nothing}(MIShannon{Irrational{:ℯ}}(ℯ), LeonenkoProzantoSavani(definition = Shannon(base = 2), k = 5, w = 0), nothing, nothing)
 KraskovStögbauerGrassberger1{MIShannon{Irrational{:ℯ}}, Chebyshev, Chebyshev}(MIShannon{Irrational{:ℯ}}(ℯ), 5, 0, Chebyshev(), Chebyshev())
 KraskovStögbauerGrassberger2{MIShannon{Irrational{:ℯ}}, Chebyshev, Chebyshev}(MIShannon{Irrational{:ℯ}}(ℯ), 5, 0, Chebyshev(), Chebyshev())
 GaoOhViswanath{MIShannon{Irrational{:ℯ}}, Euclidean, Euclidean}(MIShannon{Irrational{:ℯ}}(ℯ), 5, 0, Euclidean(0.0), Euclidean(0.0))
 GaoKannanOhViswanath{MIShannon{Irrational{:ℯ}}}(MIShannon{Irrational{:ℯ}}(ℯ), 5, 0)
 GaussianMI{MIShannon{Irrational{:ℯ}}}(MIShannon{Irrational{:ℯ}}(ℯ), true)

Example system: family 1

In this system, samples are concentrated around the diagonal $X = Y$, and the strip of samples gets thinner as $\alpha \to 0$.

function family1(α, n::Int)
    x = rand(n)
    v = rand(n)
    y = x + α * v
    return StateSpaceSet(x), StateSpaceSet(y)
end

# True mutual information values for these data
function ifamily1(α; base = ℯ)
    mi = -log(α) - α - log(2)
    return mi / log(base, ℯ)
end

fig = plot_results(family1, ifamily1;
    k_lord = k_lord, k = k, nreps = 10, n = 800,
    estimators = estimators,
    base = base)
Example block output

Example system: family 2

function family2(α, n::Int)
    Σ = [1 α; α 1]
    N2 = MvNormal(zeros(2), Σ)
    D2 = StateSpaceSet([rand(N2) for i = 1:n])
    X = StateSpaceSet(D2[:, 1])
    Y = StateSpaceSet(D2[:, 2])
    return X, Y
end

function ifamily2(α; base = ℯ)
    return (-0.5 * log(1 - α^2)) / log(ℯ, base)
end

αs = 0.05:0.05:0.95
estimators = estimators
with_theme(new_cycle_theme()) do
    f = Figure();
    ax = Axis(f[1, 1], xlabel = "α", ylabel = "I (nats)")
    is_true = map(α -> ifamily2(α), αs)
    is_est = map(est -> run(est; f = family2, αs, nreps = 20), estimators)
    lines!(ax, αs, is_true,
        label = "I (true)", color = :black, linewidth = 3)
    for (i, est) in enumerate(estimators)
        if est isa EntropyDecomposition
            estname = typeof(est.est).name.name |> String
        else
            estname = typeof(est).name.name |> String
        end
        scatterlines!(ax, αs, is_est[i], label = estname)
    end
    axislegend(position = :lt)
    return f
end
Example block output

Example system: family 3

In this system, we draw samples from a 4D Gaussian distribution specified in the ifamily3 function below. We let $X$ be the first two variables, and $Y$ be the last two variables.

function ifamily3(α; base = ℯ)
    Σ = [7 -5 -1 -3; -5 5 -1 3; -1 -1 3 -1; -3 3 -1 2+α]
    Σx = Σ[1:2, 1:2]; Σy = Σ[3:4, 3:4]
    mi = 0.5*log(det(Σx) * det(Σy) / det(Σ))
    return mi / log(ℯ, base)
end

function family3(α, n::Int)
    Σ = [7 -5 -1 -3; -5 5 -1 3; -1 -1 3 -1; -3 3 -1 2+α]
    N4 = MvNormal(zeros(4), Σ)
    D4 = StateSpaceSet([rand(N4) for i = 1:n])
    X = D4[:, 1:2]
    Y = D4[:, 3:4]
    return X, Y
end

fig = plot_results(family3, ifamily3;
    k_lord = k_lord, k = k, nreps = 5, n = 800,
    estimators = estimators, base = base)
Example block output

We see that the Lord estimator, which estimates local volume elements using a singular-value decomposition (SVD) of local neighborhoods, outperforms the other estimators by a large margin.

MIRenyiJizba

JointProbabilities + UniqueElements

MIRenyiJizba can be estimated for categorical data using the JointProbabilities estimator with the UniqueElements outcome space.

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, ["a", "b", "c"], 200);
y = rand(rng, ["hello", "yoyo", "heyhey"], 200);
est = JointProbabilities(MIRenyiJizba(), UniqueElements())
association(est, x, y)
-0.014289937341761379

EntropyDecomposition + LeonenkoProzantoSavani

MIRenyiJizba can also be estimated for numerical data using EntropyDecomposition in combination with any DifferentialInfoEstimator capable of estimating differential Renyi entropy.

using Associations
using Random; rng = Xoshiro(1234)
x = randn(rng, 50); y = randn(rng, 50);
def = MIRenyiJizba()
est_diff = EntropyDecomposition(def, LeonenkoProzantoSavani(Renyi(), k=3))
association(est_diff, x, y)
-0.05308726289365895

EntropyDecomposition + ValueBinning

MIRenyiJizba can also be estimated for numerical data using EntropyDecomposition in combination with any DiscreteInfoEstimator capable of estimating discrete Renyi entropy over some OutcomeSpace, e.g. ValueBinning.

using Associations
using Random; rng = Xoshiro(1234)
x = randn(rng, 50); y = randn(rng, 50);
def = MIRenyiJizba()

disc = CodifyVariables(ValueBinning(2))
est_disc = EntropyDecomposition(def, PlugIn(Renyi()), disc);
association(est_disc, x, y)
0.044138581955657896

MIRenyiSarbu

MIRenyiSarbu can be estimated using the JointProbabilities estimator in combination with any CodifyVariables or CodifyPoints discretization scheme.

JointProbabilities + UniqueElements

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, ["a", "b", "c"], 200)
y = rand(rng, ["hello", "yoyo", "heyhey"], 200)

est = JointProbabilities(MIRenyiSarbu(), CodifyVariables(UniqueElements()))
association(est, x, y)
0.01866831291309387

JointProbabilities + CosineSimilarityBinning

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 200)
y = rand(rng, 200)

est = JointProbabilities(MIRenyiSarbu(), CodifyVariables(CosineSimilarityBinning()))
association(est, x, y)
0.044214606254861136

MITsallisFuruichi

JointProbabilities + UniqueElements

MITsallisFuruichi can be estimated using the JointProbabilities estimator in combination with any CodifyVariables or CodifyPoints discretization scheme.

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 200)
y = rand(rng, 200)

est = JointProbabilities(MITsallisFuruichi(q = 0.3), UniqueElements())
association(est, x, y)
-2.8721220837828123

EntropyDecomposition + LeonenkoProzantoSavani

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 200)
y = rand(rng, 200)

est_diff = EntropyDecomposition(MITsallisFuruichi(), LeonenkoProzantoSavani(Tsallis(q= 2)))
association(est_diff, x, y)
-0.11358551843093931

EntropyDecomposition + Dispersion

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 200)
y = rand(rng, 200)
disc = CodifyVariables(Dispersion())
est_disc = EntropyDecomposition(MITsallisFuruichi(), PlugIn(Tsallis()), disc)

association(est_disc, x, y)
0.345168809758591

MITsallisMartin

JointProbabilities + UniqueElements

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 200)
y = rand(rng, 200)

est = JointProbabilities(MITsallisMartin(q = 1.5), UniqueElements())
association(est, x, y)
-26.28427124746185

EntropyDecomposition + LeonenkoProzantoSavani

MITsallisMartin can be estimated using a decomposition into entropy terms using EntropyDecomposition with any compatible estimator that can estimate differential Tsallis entropy.

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 500)
y = rand(rng, 500)

est_diff = EntropyDecomposition(MITsallisMartin(), LeonenkoProzantoSavani(Tsallis(q= 1.5)))
association(est_diff, x, y)
0.1241248567715774

EntropyDecomposition + OrdinalPatterns

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 200)
y = rand(rng, 200)
disc = CodifyVariables(OrdinalPatterns())
est_disc = EntropyDecomposition(MITsallisMartin(), PlugIn(Tsallis()), disc)

association(est_disc, x, y)
1.3973731206049993

CMIShannon

CMIShannon with GaussianCMI

using Associations
using Distributions
using Statistics

n = 1000
# A chain X → Y → Z
x = randn(1000)
y = randn(1000) .+ x
z = randn(1000) .+ y
association(GaussianCMI(), x, z, y) # defaults to `CMIShannon()`
0.8554481025988362

CMIShannon with FPVP

using Associations
using Distributions
using Statistics

n = 1000
# A chain X → Y → Z
x = rand(Normal(-1, 0.5), n)
y = rand(BetaPrime(0.5, 1.5), n) .+ x
z = rand(Chisq(100), n)
z = (z ./ std(z)) .+ y

# We expect zero (in practice: very low) CMI when computing I(X; Z | Y), because
# the link between X and Z is exclusively through Y, so when observing Y,
# X and Z should appear independent.
association(FPVP(k = 5), x, z, y) # defaults to `CMIShannon()`
-0.1575745523361

CMIShannon with MesnerShalizi

using Associations
using Distributions
using Statistics
using Random; rng = Xoshiro(1234)

n = 1000
# A chain X → Y → Z
x = rand(rng, Normal(-1, 0.5), n)
y = rand(rng, BetaPrime(0.5, 1.5), n) .+ x
z = rand(rng, Chisq(100), n)
z = (z ./ std(z)) .+ y

# We expect zero (in practice: very low) CMI when computing I(X; Z | Y), because
# the link between X and Z is exclusively through Y, so when observing Y,
# X and Z should appear independent.
association(MesnerShalizi(; k = 10), x, z, y) # defaults to `CMIShannon()`
-0.11407574318482466

CMIShannon with Rahimzamani

using Associations
using Distributions
using Statistics
using Random; rng = Xoshiro(1234)

n = 1000
# A chain X → Y → Z
x = rand(rng, Normal(-1, 0.5), n)
y = rand(rng, BetaPrime(0.5, 1.5), n) .+ x
z = rand(rng, Chisq(100), n)
z = (z ./ std(z)) .+ y

# We expect zero (in practice: very low) CMI when computing I(X; Z | Y), because
# the link between X and Z is exclusively through Y, so when observing Y,
# X and Z should appear independent.
association(Rahimzamani(CMIShannon(base = 10); k = 10), x, z, y)
-0.0343402204762932

MIDecomposition

Shannon-type conditional mutual information can be decomposed into a sum of mutual information terms, each of which we can estimate with any dedicated MutualInformationEstimator.
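The relevant identity is the chain rule $I(X; Z | Y) = I(X; Y, Z) - I(X; Y)$, where each term on the right-hand side is estimated separately.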

using Associations
using Random; rng = MersenneTwister(1234)
x = rand(rng, 300)
y = rand(rng, 300) .+ x
z = rand(rng, 300) .+ y

est = MIDecomposition(CMIShannon(), KSG1(MIShannon(base = 2), k = 3))
association(est, x, z, y) # should be near 0 (and can be negative)

CMIRenyiPoczos

PoczosSchneiderCMI

using Associations
using Distributions
using Statistics
using Random; rng = Xoshiro(1234)

n = 1000
# A chain X → Y → Z
x = rand(rng, Normal(-1, 0.5), n)
y = rand(rng, BetaPrime(0.5, 1.5), n) .+ x
z = rand(rng, Chisq(100), n)
z = (z ./ std(z)) .+ y

# We expect zero (in practice: very low) CMI when computing I(X; Z | Y), because
# the link between X and Z is exclusively through Y, so when observing Y,
# X and Z should appear independent.
est = PoczosSchneiderCMI(CMIRenyiPoczos(base = 2, q = 1.2); k = 5)
association(est, x, z, y)
-0.39801429256003873

In addition to the dedicated ConditionalMutualInformationEstimators, any MutualInformationEstimator can also be used to compute conditional mutual information using the chain rule of mutual information. However, the naive application of these estimators doesn't perform any bias correction when taking the difference of mutual information terms.

CMIShannon

MIDecomposition + KraskovStögbauerGrassberger1

using Associations
using Distributions
using Statistics

n = 1000
# A chain X → Y → Z
x = rand(Normal(-1, 0.5), n)
y = rand(BetaPrime(0.5, 1.5), n) .+ x
z = rand(Chisq(100), n)
z = (z ./ std(z)) .+ y

# We expect zero (in practice: very low) CMI when computing I(X; Z | Y), because
# the link between X and Z is exclusively through Y, so when observing Y,
# X and Z should appear independent.
est = MIDecomposition(CMIShannon(base = 2), KSG1(k = 10))
association(est, x, z, y)
-0.463585969730838

ShortExpansionConditionalMutualInformation

JointProbabilities with CodifyVariables and ValueBinning

using Associations
using Test
using Random; rng = Xoshiro(1234)
n = 20
x = rand(rng, n)
y = randn(rng, n) .+ x .^ 2
z = randn(rng, n) .* y

# An estimator for estimating the SECMI measure
est = JointProbabilities(SECMI(base = 2), CodifyVariables(ValueBinning(3)))
association(est, x, z, y)
0.2398365741064942

EntropyDecomposition + Kraskov

Any DifferentialInfoEstimator can also be used to compute conditional mutual information using a sum of entropies. For that, we use the EntropyDecomposition estimator. No bias correction is applied for EntropyDecomposition either.
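Written out in entropies, $I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)$, so four entropy terms are estimated and combined.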

using Associations
using Distributions
using Random; rng = Xoshiro(1234)
n = 500
# A chain X → Y → Z
x = rand(rng, Epanechnikov(0.5, 1.0), n)
y = rand(rng, Normal(0, 0.2), n) .+ x
z = rand(rng, FDist(3, 2), n)
est = EntropyDecomposition(CMIShannon(), Kraskov(k = 5))
association(est, x, z, y)
-0.40513415764994387

Any DiscreteInfoEstimator that computes entropy can also be used to compute conditional mutual information using a sum of entropies. For that, we also use EntropyDecomposition. In the discrete case, we also have to specify a discretization (an OutcomeSpace).

EntropyDecomposition + ValueBinning

using Associations
using Distributions
using Random; rng = Xoshiro(1234)
n = 500
# A chain X → Y → Z
x = rand(rng, Epanechnikov(0.5, 1.0), n)
y = rand(rng, Normal(0, 0.2), n) .+ x
z = rand(rng, FDist(3, 2), n)
discretization = CodifyVariables(ValueBinning(RectangularBinning(5)))
hest = PlugIn(Shannon())
est = EntropyDecomposition(CMIShannon(), hest, discretization)
association(est, x, y, z)
0.8238141245350172

CMIRenyiJizba

JointProbabilities + BubbleSortSwaps

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 100)
y = x .+ rand(rng, 100)
z = y .+ rand(rng, 100)
disc = CodifyVariables(BubbleSortSwaps(m = 4))
est = JointProbabilities(CMIRenyiJizba(), disc)
association(est, x, z, y)
0.43284279366170564

EntropyDecomposition + LeonenkoProzantoSavani

using Associations
using Random; rng = Xoshiro(1234)
x, y, z = rand(rng, 1000), rand(rng, 1000), rand(rng, 1000)
def = CMIRenyiJizba(q = 1.5)

# Using a differential Rényi entropy estimator
est = EntropyDecomposition(def, LeonenkoProzantoSavani(Renyi(), k = 10))
association(est, x, y, z)
-0.03274084267284355

EntropyDecomposition + OrdinalPatterns

using Associations
using Random; rng = Xoshiro(1234)
x, y, z = rand(rng, 1000), rand(rng, 1000), rand(rng, 1000)
def = CMIRenyiJizba(q = 1.5)

# Using a plug-in Rényi entropy estimator, discretizing using ordinal patterns.
est = EntropyDecomposition(def, PlugIn(Renyi()), CodifyVariables(OrdinalPatterns(m=2)), RelativeAmount())
association(est, x, y, z)
3.961338273517079e-6

TEShannon

EntropyDecomposition + TransferOperator

For transfer entropy examples, we'll construct some time series for which there is time-delayed forcing between variables.

using Associations
using DynamicalSystemsBase
using StableRNGs
rng = StableRNG(123)

Base.@kwdef struct Logistic4Chain{V, RX, RY, RZ, RW, C1, C2, C3, Σ1, Σ2, Σ3, RNG}
    xi::V = [0.1, 0.2, 0.3, 0.4]
    rx::RX = 3.9
    ry::RY = 3.6
    rz::RZ = 3.6
    rw::RW = 3.8
    c_xy::C1 = 0.4
    c_yz::C2 = 0.4
    c_zw::C3 = 0.35
    σ_xy::Σ1 = 0.05
    σ_yz::Σ2 = 0.05
    σ_zw::Σ3 = 0.05
    rng::RNG = Random.default_rng()
end

function eom_logistic4_chain(u, p::Logistic4Chain, t)
    (; xi, rx, ry, rz, rw, c_xy, c_yz, c_zw, σ_xy, σ_yz, σ_zw, rng) = p
    x, y, z, w = u
    f_xy = (y +  c_xy*(x + σ_xy * rand(rng)) ) / (1 + c_xy*(1+σ_xy))
    f_yz = (z +  c_yz*(y + σ_yz * rand(rng)) ) / (1 + c_yz*(1+σ_yz))
    f_zw = (w +  c_zw*(z + σ_zw * rand(rng)) ) / (1 + c_zw*(1+σ_zw))
    dx = rx * x * (1 - x)
    dy = ry * (f_xy) * (1 - f_xy)
    dz = rz * (f_yz) * (1 - f_yz)
    dw = rw * (f_zw) * (1 - f_zw)
    return SVector{4}(dx, dy, dz, dw)
end

function system(definition::Logistic4Chain)
    return DiscreteDynamicalSystem(eom_logistic4_chain, definition.xi, definition)
end

# An example system where `X → Y → Z → W`.
sys = system(Logistic4Chain(; rng))
x, y, z, w = columns(first(trajectory(sys, 300, Ttr = 10000)))

precise = true # precise bin edges
discretization = CodifyVariables(TransferOperator(RectangularBinning(2, precise))) #
est_disc_to = EntropyDecomposition(TEShannon(), PlugIn(Shannon()), discretization);
association(est_disc_to, x, y), association(est_disc_to, y, x)
(0.2529054207336474, 0.10426214246586962)

The Shannon-type transfer entropy from x to y is stronger than from y to x, which is what we expect if x drives y.

association(est_disc_to, x, z), association(est_disc_to, x, z, y)
(0.05336193759294283, 0.023756513913722355)

The Shannon-type transfer entropy from x to z is stronger than the transfer entropy from x to z given y. This is expected, because x drives z through y, so "conditioning away" the effect of y should decrease the estimated information transfer.

CMIDecomposition

using Associations
using Random; rng = MersenneTwister(1234)
x = rand(rng, 1000)
y = rand(rng, 1000) .+ x
z = rand(rng, 1000) .+ y

# Estimate transfer entropy by representing it as a CMI and using the `FPVP` estimator.
est = CMIDecomposition(TEShannon(base = 2), FPVP(k = 3))
association(est, x, z, y) # should be near 0 (and can be negative)
-0.26323146509776046

SymbolicTransferEntropy estimator

The SymbolicTransferEntropy estimator is just a convenience wrapper which utilizes CodifyVariables with the OrdinalPatterns outcome space to discretize the input time series before computing transfer entropy.

We'll use coupled time series from the logistic4 system above, where x → y → z → w. Thus, we expect that the association for the direction x → y is larger than for y → x. We also expect an association x → z, but the association should weaken when conditioning on the intermediate value y.

using Associations
using DynamicalSystemsBase
using Random; rng = Xoshiro(1234)
sys = system(Logistic4Chain(; rng))
x, y, z, w = columns(first(trajectory(sys, 300, Ttr = 10000)))
est = SymbolicTransferEntropy(m = 5)
association(est, x, y), association(est, y, x), association(est, x, z), association(est, x, z, y)
(0.5174597240325244, 0.24773918384758742, 0.7233610447154118, 0.1322241113229709)

Comparing different estimators

Let's reproduce Figure 4 from Zhu et al. (2015), where they test some dedicated transfer entropy estimators on a bivariate autoregressive system. We will test

  • The Lindner and Zhu1 dedicated transfer entropy estimators, which try to eliminate bias.
  • The KraskovStögbauerGrassberger1 estimator, which computes TE naively as a sum of mutual information terms (without guaranteed cancellation of biases for the total sum).
  • The Kraskov estimator, which computes TE naively as a sum of entropy terms (without guaranteed cancellation of biases for the total sum).
using Associations
using CairoMakie
using Statistics
using Distributions: Normal

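# The bivariate autoregressive system from Zhu et al. (2015)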
function model2(n::Int)
    𝒩x = Normal(0, 0.1)
    𝒩y = Normal(0, 0.1)
    x = zeros(n+2)
    y = zeros(n+2)
    x[1] = rand(𝒩x)
    x[2] = rand(𝒩x)
    y[1] = rand(𝒩y)
    y[2] = rand(𝒩y)

    for i = 3:n+2
        x[i] = 0.45*sqrt(2)*x[i-1] - 0.9*x[i-2] - 0.6*y[i-2] + rand(𝒩x)
        y[i] = 0.6*x[i-2] - 0.175*sqrt(2)*y[i-1] + 0.55*sqrt(2)*y[i-2] + rand(𝒩y)
    end
    return x[3:end], y[3:end]
end
te_true = 0.42 # eyeball the theoretical value from their Figure 4.

m = TEShannon(embedding = EmbeddingTE(dT = 2, dS = 2), base = ℯ)
estimators = [
    Zhu1(m, k = 8),
    Lindner(m, k = 8),
    MIDecomposition(m, KSG1(k = 8)),
    EntropyDecomposition(m, Kraskov(k = 8)),
]
Ls = [floor(Int, 2^i) for i in 8.0:0.5:11]
nreps = 8
tes_xy = [[zeros(nreps) for i = 1:length(Ls)] for e in estimators]
tes_yx = [[zeros(nreps) for i = 1:length(Ls)] for e in estimators]
for (k, est) in enumerate(estimators)
    for (i, L) in enumerate(Ls)
        for j = 1:nreps
            x, y = model2(L);
            tes_xy[k][i][j] = association(est, x, y)
            tes_yx[k][i][j] = association(est, y, x)
        end
    end
end

ymin = minimum(map(x -> minimum(Iterators.flatten(Iterators.flatten(x))), (tes_xy, tes_yx)))
estimator_names = ["Zhu1", "Lindner", "KSG1", "Kraskov"]
ls = [:dash, :dot, :dash, :dot]
mr = [:rect, :hexagon, :xcross, :pentagon]

fig = Figure(resolution = (800, 350))
ax_xy = Axis(fig[1,1], xlabel = "Signal length", ylabel = "TE (nats)", title = "x → y")
ax_yx = Axis(fig[1,2], xlabel = "Signal length", ylabel = "TE (nats)", title = "y → x")
for (k, e) in enumerate(estimators)
    label = estimator_names[k]
    marker = mr[k]
    scatterlines!(ax_xy, Ls, mean.(tes_xy[k]); label, marker)
    scatterlines!(ax_yx, Ls, mean.(tes_yx[k]); label, marker)
    hlines!(ax_xy, [te_true]; xmin = 0.0, xmax = 1.0, linestyle = :dash, color = :black)
    hlines!(ax_yx, [te_true]; xmin = 0.0, xmax = 1.0, linestyle = :dash, color = :black)
    linkaxes!(ax_xy, ax_yx)
end
axislegend(ax_xy, position = :rb)

fig
Example block output

Reproducing Schreiber (2000)

Let's try to reproduce the results from Schreiber's original paper (Schreiber, 2000) where he introduced the transfer entropy. We'll here use the JointProbabilities estimator, discretizing per column of the input data using the CodifyVariables discretization scheme with the ValueBinning outcome space.

using Associations
using DynamicalSystemsBase
using CairoMakie
using Statistics
using Random; Random.seed!(12234);

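# Lattice of unidirectionally coupled Ulam maps (as in Schreiber, 2000); the coupling strength ε is p[1]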
function ulam_system(dx, x, p, t)
    f(x) = 2 - x^2
    ε = p[1]
    dx[1] = f(ε*x[length(dx)] + (1-ε)*x[1])
    for i in 2:length(dx)
        dx[i] = f(ε*x[i-1] + (1-ε)*x[i])
    end
end

ds = DiscreteDynamicalSystem(ulam_system, rand(100) .- 0.5, [0.04])
first(trajectory(ds, 1000; Ttr = 1000));

εs = 0.02:0.02:1.0
te_x1x2 = zeros(length(εs)); te_x2x1 = zeros(length(εs))
# Guess an appropriate bin width of 0.2 for the histogram
disc = CodifyVariables(ValueHistogram(0.2))
est = JointProbabilities(TEShannon(; base = 2), disc)

for (i, ε) in enumerate(εs)
    set_parameter!(ds, 1, ε)
    tr = first(trajectory(ds, 300; Ttr = 5000))
    X1 = tr[:, 1]; X2 = tr[:, 2]
    @assert !any(isnan, X1)
    @assert !any(isnan, X2)
    te_x1x2[i] = association(est, X1, X2)
    te_x2x1[i] = association(est, X2, X1)
end

fig = Figure(size = (800, 600))
ax = Axis(fig[1, 1], xlabel = "epsilon", ylabel = "Transfer entropy (bits)")
lines!(ax, εs, te_x1x2, label = "X1 to X2", color = :black, linewidth = 1.5)
lines!(ax, εs, te_x2x1, label = "X2 to X1", color = :red, linewidth = 1.5)
axislegend(ax, position = :lt)
fig
Example block output

As expected, transfer entropy from X1 to X2 is higher than from X2 to X1 across parameter values for ε. But, by our definition of the Ulam system, dynamical coupling only occurs from X1 to X2. The results, however, show nonzero transfer entropy in both directions. What does this mean?

Computing transfer entropy from finite time series introduces bias, and so does any particular choice of entropy estimator used to calculate it. To determine whether a transfer entropy estimate should be trusted, we can employ surrogate testing. We'll generate surrogates using TimeseriesSurrogates.jl. One possible way to do so is to use a SurrogateAssociationTest with the independence function, but here we'll do the surrogate resampling manually, so we can plot and inspect the results.
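For reference, a compact version of such a test could look like the following sketch (assuming the SurrogateAssociationTest constructor accepts the estimator and an nshuffles keyword; nshuffles = 100 and the toy data below are arbitrary choices for illustration):

using Associations
using Random; rng = Xoshiro(1234)
s = rand(rng, 500)                                    # source
t = 0.5 .* circshift(s, 1) .+ 0.1 .* rand(rng, 500)   # target, driven by lagged source
est_te = JointProbabilities(TEShannon(; base = 2), CodifyVariables(ValueHistogram(0.2)))
test = SurrogateAssociationTest(est_te; nshuffles = 100)
independence(test, s, t)  # returns a test summary with a p-value; a low p-value indicates significant TE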

In the example below, we continue with the same time series generated above. However, at each value of ε, we also compute transfer entropy for nsurr = 25 randomly shuffled (permuted) versions of the source process. If the original transfer entropy exceeds a given percentile (here, the 95th) of the transfer entropy estimates for the surrogate ensemble, we will take that as "significant" transfer entropy.

nsurr = 25 # in real applications, you should use more surrogates
base = 2
te_x1x2 = zeros(length(εs)); te_x2x1 = zeros(length(εs))
te_x1x2_surr = zeros(length(εs), nsurr); te_x2x1_surr = zeros(length(εs), nsurr)

# use same bin-width as before
disc = CodifyVariables(ValueHistogram(0.2))
est = JointProbabilities(TEShannon(; base = 2), disc)

for (i, ε) in enumerate(εs)
    set_parameter!(ds, 1, ε)
    tr = first(trajectory(ds, 300; Ttr = 5000))
    X1 = tr[:, 1]; X2 = tr[:, 2]
    @assert !any(isnan, X1)
    @assert !any(isnan, X2)
    te_x1x2[i] = association(est, X1, X2)
    te_x2x1[i] = association(est, X2, X1)
    s1 = surrogenerator(X1, RandomShuffle()); s2 = surrogenerator(X2, RandomShuffle())

    for j = 1:nsurr
        te_x1x2_surr[i, j] = association(est, s1(), X2)
        te_x2x1_surr[i, j] = association(est, s2(), X1)
    end
end

# Compute 95th percentiles of the surrogates for each ε
qs_x1x2 = [quantile(te_x1x2_surr[i, :], 0.95) for i = 1:length(εs)]
qs_x2x1 = [quantile(te_x2x1_surr[i, :], 0.95) for i = 1:length(εs)]

fig = with_theme(theme_minimal(), markersize = 2) do
    fig = Figure()
    ax = Axis(fig[1, 1], xlabel = "epsilon", ylabel = "Transfer entropy (bits)")
    scatterlines!(ax, εs, te_x1x2, label = "X1 to X2", color = :black, linewidth = 1.5)
    scatterlines!(ax, εs, qs_x1x2, color = :black, linestyle = :dot, linewidth = 1.5)
    scatterlines!(ax, εs, te_x2x1, label = "X2 to X1", color = :red)
    scatterlines!(ax, εs, qs_x2x1, color = :red, linestyle = :dot)
    axislegend(ax, position = :lt)
    return fig
end
fig
Example block output

The plot above shows the original transfer entropies (solid lines) and the 95th percentile transfer entropies of the surrogate ensembles (dotted lines). As expected, using the surrogate test, the transfer entropies from X1 to X2 are mostly significant (solid black line is above dashed black line). The transfer entropies from X2 to X1, on the other hand, are mostly not significant (red solid line is below red dotted line).

TERenyiJizba

EntropyDecomposition + TransferOperator

We can perform the same type of analysis as above using TERenyiJizba instead of TEShannon.

using Associations
using DynamicalSystemsBase
using StableRNGs; rng = StableRNG(123)

# An example system where `X → Y → Z → W`.
sys = system(Logistic4Chain(; rng))
x, y, z, w = columns(first(trajectory(sys, 300, Ttr = 10000)))

precise = true # precise bin edges
discretization = CodifyVariables(TransferOperator(RectangularBinning(2, precise))) #
est_disc_to = EntropyDecomposition(TERenyiJizba(), PlugIn(Renyi()), discretization);
association(est_disc_to, x, y), association(est_disc_to, y, x)
(0.2551426443387981, 0.11917462376531485)

ConvergentCrossMapping

RandomVectors estimator

When cross-mapping with the RandomVectors estimator, a single random subsample of time indices (i.e. not in any particular order) of length l is drawn for each library size l, and cross mapping is performed using the embedding vectors corresponding to those time indices.

using Associations
using Random; rng = MersenneTwister(1234)
x, y = randn(rng, 200), randn(rng, 200)

# We'll draw a single sample at each `l ∈ libsizes`. Sampling with replacement is then
# necessary, because our 200-pt timeseries will result in embeddings with
# less than 200 points.
est = RandomVectors(ConvergentCrossMapping(d = 3); libsizes = 50:25:200, replace = true, rng)
crossmap(est, x, y)
7-element Vector{Float64}:
 0.403543378354588
 0.5856987205283815
 0.44287920268716086
 0.4966767048563532
 0.4355290972758533
 0.6832809291387226
 0.7311864792647486

To generate a distribution of cross-map estimates for each l ∈ libsizes, just call crossmap repeatedly, e.g.

using Associations
using Random; rng = MersenneTwister(1234)
using Statistics

x, y = randn(rng, 300), randn(rng, 300)
def = ConvergentCrossMapping(d = 3)
libsizes = 25:25:200

ρs = [[crossmap(RandomVectors(def; libsizes = L, replace = true, rng), x, y) for i = 1:50] for L in libsizes]

using CairoMakie
f = Figure(); ax = Axis(f[1, 1]);
plot!(ax, libsizes, mean.(ρs))
errorbars!(ax, libsizes, mean.(ρs), std.(ρs))
f

Now, the k-th element of ρs contains 50 estimates of the correspondence measure ρ at library size libsizes[k].

RandomSegment estimator

When cross-mapping with the RandomSegment estimator, a single random subsample of contiguous, ordered time indices of length l is drawn for each library size l, and cross mapping is performed using the embedding vectors corresponding to those time indices.

using Associations
using Random; rng = MersenneTwister(1234)
x, y = randn(rng, 200), randn(rng, 200)

# We'll draw a single sample at each `l ∈ libsizes`. We limit the library size to 100,
# because drawing segments of the data longer than half the available data doesn't make
# much sense.
est = RandomSegment(ConvergentCrossMapping(d = 3); libsizes = 50:25:100, rng)
crossmap(est, x, y)
3-element Vector{Float64}:
 0.16203695962111725
 0.30717005220560617
 0.27363894462301874

As above, to generate a distribution of cross-map estimates for each l ∈ libsizes, just call crossmap repeatedly, e.g.

using Associations
using Random; rng = MersenneTwister(1234)
using Statistics

x, y = randn(rng, 200), randn(rng, 200)
def = ConvergentCrossMapping(d = 3)
libsizes = 25:25:100

ρs = [[crossmap(RandomSegment(def; libsizes = L, rng), x, y) for i = 1:50] for L in libsizes]

f = Figure(); ax = Axis(f[1, 1]);
plot!(ax, libsizes, mean.(ρs))
errorbars!(ax, libsizes, mean.(ρs), std.(ρs))
f

PairwiseAsymmetricInference

We repeat the analyses above, but here use the pairwise asymmetric inference algorithm instead of the convergent cross map algorithm.

RandomVectors estimator

using Associations
using Random; rng = MersenneTwister(1234)
x, y = randn(rng, 300), randn(rng, 300)

# We'll draw a single sample at each `l ∈ libsizes`. As in the example above, we sample
# with replacement, since embedding the timeseries leaves fewer embedding points than
# raw time series points.
est = RandomVectors(PairwiseAsymmetricInference(d = 3); libsizes = 50:25:200, replace = true, rng)
crossmap(est, x, y)
7-element Vector{Float64}:
 0.02148895635319534
 0.4131471276386324
 0.16500925285819215
 0.3427771476921493
 0.43406208466336454
 0.536744307953195
 0.7005991361206654

To generate a distribution of cross-map estimates for each l ∈ libsizes, just call crossmap repeatedly, e.g.

using Associations
using Random; rng = MersenneTwister(1234)
using Statistics

x, y = randn(rng, 300), randn(rng,300)
def = PairwiseAsymmetricInference(d = 3)
libsizes = 25:25:200

ρs = [[crossmap(RandomVectors(def; libsizes = L, replace = true, rng), x, y) for i = 1:50] for L in libsizes]

using CairoMakie
f = Figure(); ax = Axis(f[1, 1]);
plot!(ax, libsizes, mean.(ρs))
errorbars!(ax, libsizes, mean.(ρs), std.(ρs))
f

RandomSegment estimator

using Associations
using Random; rng = MersenneTwister(1234)
x, y = randn(rng, 200), randn(rng, 200)

# We'll draw a single sample at each `l ∈ libsizes`. We limit the library size to 100,
# because drawing segments of the data longer than half the available data doesn't make
# much sense.
est = RandomSegment(PairwiseAsymmetricInference(d = 3); libsizes = 50:25:100, rng)
crossmap(est, x, y)
3-element Vector{Float64}:
 -0.42394655036151907
  0.04597997698646141
 -0.13236022998996033

As above, to generate a distribution of cross-map estimates for each l ∈ libsizes, just call crossmap repeatedly, e.g.

using Associations
using Random; rng = MersenneTwister(1234)
using Statistics

x, y = randn(rng, 300), randn(rng, 300)
def = PairwiseAsymmetricInference(d = 3)
libsizes = 25:25:100

ρs = [[crossmap(RandomSegment(def; libsizes = L, rng), x, y) for i = 1:50] for L in libsizes]

using CairoMakie
f = Figure(); ax = Axis(f[1, 1]);
plot!(ax, libsizes, mean.(ρs))
errorbars!(ax, libsizes, mean.(ρs), std.(ρs))
f

MCR

To quantify association by the mean conditional probability of recurrence (MCR), we'll create a chain of variables where X drives Y, which in turn drives Z. We then expect there to be significant detectable association between X and Y, between Y and Z, and also between X and Z (because Y transfers information from X to Z). We expect the association between X and Z to disappear when conditioning on Y (since we're then "removing the effect" of Y).

using Associations
using Random; rng = Xoshiro(1234);
x = rand(rng, 300); y = rand(rng, 300) .* sin.(x); z = rand(rng, 300) .* y;
est = MCR(r = 0.5)
association(est, x, y), association(est, x, z), association(est, y, z), association(est, x, z, y)
(0.7711920426340347, 0.7420689143732229, 0.924270403653336, -0.00798860118526612)

The interpretation of the MCR measure is that if two variables are symmetrically coupled, then the conditional recurrence in both directions is equal. Uncoupled variables are trivially symmetrically coupled (the coupling is zero in both directions), so for independent variables we expect the difference in conditional recurrence to be around zero.

using Associations
using Random; rng = Xoshiro(1234)
x = rand(rng, 300)
y = rand(rng, 300)
m = MCR(r = 0.5)
Δ = association(m, x, y) - association(m, y, x)
-0.008308662176546688

RMCD

To quantify association by the recurrence measure of conditional dependence (RMCD), we'll create a chain of variables where X drives Y, which in turn drives Z. We then expect there to be significant detectable association between X and Y, between Y and Z, and also between X and Z (because Y transfers information from X to Z). We expect the association between X and Z to disappear when conditioning on Y (since we're then "removing the effect" of Y).

using Associations
using Random; rng = Xoshiro(1234);
x = rand(rng, 300); y = rand(rng, 300) .* sin.(x); z = rand(rng, 300) .* y;
est = RMCD(r = 0.5)
association(est, x, y), association(est, x, z), association(est, x, z, y)
(0.022387457155671135, 0.022129475954601702, 0.0)

ChatterjeeCorrelation

using Associations
using Random; rng = Xoshiro(1234);
x = rand(rng, 120)
y = rand(rng, 120) .* sin.(x)
120-element Vector{Float64}:
 0.2755719299262952
 0.2070408113032807
 0.6193365054669813
 0.008237347924455006
 0.04148542594436429
 0.313817853101071
 0.6257653008229676
 0.43134904735215984
 0.6829986456858073
 0.3095824205696399
 ⋮
 0.02169777371442121
 0.7764979588125409
 0.1660944201176525
 0.10967358670001909
 0.03230334731057383
 0.026337556651936055
 0.11124764808654403
 0.2599193572475318
 0.4628271687638507

By construction, there will almost surely be no ties in x, so we can use the fast estimate by setting handle_ties = false.

association(ChatterjeeCorrelation(handle_ties = false), x, y)
0.2726578234599625

If we know there are ties in the data, then we should set handle_ties = true.

w = rand(rng, 1:10, 120) # there will be some ties
z = rand(rng, 1:15, 120) .* sin.(w) # introduce some dependence
association(ChatterjeeCorrelation(handle_ties = true), w, z)
0.618104077588904

AzadkiaChatterjeeCoefficient

using Associations
using Random; rng = Xoshiro(1234);
x = rand(rng, 120)
y = rand(rng, 120) .* x
z = rand(rng, 120) .+ y
120-element Vector{Float64}:
 0.44737499407485565
 0.5516487384954861
 0.7581708420215838
 0.23074778575472404
 0.2500819453735219
 0.7353227843349248
 1.1500496733900634
 0.7621477984126711
 0.851799327542775
 0.5247237075962297
 ⋮
 0.3943206855229302
 1.838156393683287
 0.9709273173395769
 0.23333040123923995
 0.26781788473231355
 0.40232936334694847
 0.26839516685160714
 1.0709803535460094
 0.8668408207007512

For the variables above, where x → y → z, we expect stronger association between x and y than between x and z. We also expect the strength of the association between x and z to drop when conditioning on y, because y is the variable that connects x and z.

m = AzadkiaChatterjeeCoefficient(theiler = 0) # only exclude self-neighbors
association(m, x, y), association(m, x, z), association(m, x, z, y)
(0.3062018195708035, 0.06951871657754011, 0.010810810810810811)