Independence testing

JointDistanceDistributionTest

Bidirectionally coupled logistic maps

Let's use the built-in logistic2_bidir discrete dynamical system to create a pair of bidirectionally coupled time series and use the JointDistanceDistributionTest to see if we can confirm from observed time series that these variables are bidirectionally coupled. We'll use a confidence level of 1 - α = 0.99, i.e. a significance level of α = 0.01.

We start by generating some time series and configuring the test.

using CausalityTools
sys = system(Logistic2Bidir(c_xy = 0.5, c_yx = 0.4))
x, y = columns(first(trajectory(sys, 2000, Ttr = 10000)))
measure = JointDistanceDistribution(D = 5, B = 5)
test = JointDistanceDistributionTest(measure)
JointDistanceDistributionTest{JointDistanceDistribution{Euclidean, Float64}, Random.TaskLocalRNG}(JointDistanceDistribution{Euclidean, Float64}(Euclidean(0.0), 5, 5, -1, 0.0), Random.TaskLocalRNG())

Now, we test for independence in both directions.

independence(test, x, y)
`JointDistanceDistributionTest` independence test
----------------------------------------------------------------------------------
H₀: μ(Δ) = 0 (the input variables are independent)
H₁: μ(Δ) > 0 (there is directional dependence between the input variables)
----------------------------------------------------------------------------------
Hypothetical μ(Δ): 0.0
Observed μ(Δ):     0.4709078138991162
p-value:           1.6810181864279627e-5
  α = 0.05:  ✓ Evidence favors dependence
  α = 0.01:  ✓ Evidence favors dependence
  α = 0.001: ✓ Evidence favors dependence
independence(test, y, x)
`JointDistanceDistributionTest` independence test
----------------------------------------------------------------------------------
H₀: μ(Δ) = 0 (the input variables are independent)
H₁: μ(Δ) > 0 (there is directional dependence between the input variables)
----------------------------------------------------------------------------------
Hypothetical μ(Δ): 0.0
Observed μ(Δ):     0.4652435338034594
p-value:           2.69017598871768e-5
  α = 0.05:  ✓ Evidence favors dependence
  α = 0.01:  ✓ Evidence favors dependence
  α = 0.001: ✓ Evidence favors dependence

As expected, the null hypothesis is rejected in both directions at the pre-determined significance level, and hence we detect directional coupling in both directions.
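
If you prefer to check the conclusion programmatically instead of reading the summaries, the p-values can be extracted with pvalue (also used further below). A minimal sketch:

rxy = independence(test, x, y)
ryx = independence(test, y, x)
# Reject the null at α = 0.01 in both directions?
pvalue(rxy) < 0.01 && pvalue(ryx) < 0.01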

Non-coupled logistic maps

What happens in the example above if there is no coupling?

sys = system(Logistic2Bidir(c_xy = 0.00, c_yx = 0.0))
x, y = columns(first(trajectory(sys, 1000, Ttr = 10000)));
rxy = independence(test, x, y)
ryx = independence(test, y, x)
pvalue(rxy), pvalue(ryx)
(0.0686678284769523, 0.10720581302326959)

At significance level α = 0.01, we can't reject the null in either direction, hence there's not enough evidence in the data to suggest directional coupling.

LocalPermutationTest

Conditional mutual information (Shannon, differential)

Chain of random variables $X \to Y \to Z$

Here, we'll create a three-variable scenario where X and Z are connected through Y, so that $I(X; Z | Y) = 0$ and $I(X; Y | Z) > 0$. We'll test for conditional independence using Shannon conditional mutual information (CMIShannon). To estimate CMI, we'll use the Kraskov differential entropy estimator, which naively computes CMI as a sum of entropy terms without guaranteed bias cancellation.
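
Concretely, the entropy-sum route estimates each term of the decomposition $I(X; Y \mid Z) = h(X, Z) + h(Y, Z) - h(Z) - h(X, Y, Z)$ separately, where $h$ denotes differential entropy estimated with the Kraskov estimator; because the errors of the four terms need not cancel, the resulting CMI estimate can be biased.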

using CausalityTools

X = randn(1000)
Y = X .+ randn(1000) .* 0.4
Z = randn(1000) .+ Y
x, y, z = StateSpaceSet.((X, Y, Z))
test = LocalPermutationTest(CMIShannon(base = 2), Kraskov(k = 10), nshuffles = 30)
test_result = independence(test, x, y, z)
`LocalPermutationTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent, given the 3rd variable"
Hₐ: "The first two variables are dependent, given the 3rd variable"
---------------------------------------------------------------------
Estimated: 0.9213969421996389
Ensemble quantiles (30 permutations):
    (99.9%): -0.08648647147323632
    (99%):   -0.08653371882228578
    (95%):   -0.08756963820227828
p-value:   0.0
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

We expect a detectable influence from $X$ to $Y$ whether or not we condition on $Z$, because $Z$ influences neither $X$ nor $Y$. The null hypothesis is that the first two variables are conditionally independent given the third, which we reject with a very low p-value. Hence, we accept the alternative hypothesis that the first two variables, $X$ and $Y$, are conditionally dependent given $Z$.

test_result = independence(test, x, z, y)
`LocalPermutationTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent, given the 3rd variable"
Hₐ: "The first two variables are dependent, given the 3rd variable"
---------------------------------------------------------------------
Estimated: -0.09728118529104712
Ensemble quantiles (30 permutations):
    (99.9%): -0.11997086678126215
    (99%):   -0.12221744449848704
    (95%):   -0.1385539114467165
p-value:   0.0
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

Ideally, we should not be able to reject the null hypothesis that $X$ and $Z$ are conditionally independent given $Y$, because $Y$ is the variable that transmits information from $X$ to $Z$. Here the test nevertheless rejects the null: as noted above, the naive entropy-sum estimator comes with no guarantee of bias cancellation, so spurious rejections like this can occur.
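
If bias is a concern, the same test can be run with a dedicated CMI estimator instead of summing entropy estimates. A minimal sketch using the FPVP estimator (which also appears in the transfer entropy example below), keeping its default parameters:

# Sketch: same conditional test, but with a dedicated CMI estimator (FPVP)
# instead of a sum of Kraskov entropy estimates.
test_fpvp = LocalPermutationTest(CMIShannon(base = 2), FPVP(), nshuffles = 30)
independence(test_fpvp, x, z, y)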

Transfer entropy (Shannon, differential)

Chain of random variables $X \to Y \to Z \to W$

Here, we demonstrate LocalPermutationTest with the TEShannon measure, using default parameters and the FPVP estimator. We'll use a system of four coupled logistic maps that are linked X → Y → Z → W.

using CausalityTools
using Random; rng = Random.default_rng()
s = system(Logistic4Chain(; xi = rand(4)))
x, y, z, w = columns(first(trajectory(s, 2000)))
test = LocalPermutationTest(TEShannon(), FPVP(), nshuffles = 50)
test_result = independence(test, x, z)
`LocalPermutationTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent"
Hₐ: "The first two variables are dependent"
---------------------------------------------------------------------
Estimated: 0.10570729402665664
Ensemble quantiles (50 permutations):
    (99.9%): -0.4454602991089679
    (99%):   -0.44825828166289167
    (95%):   -0.4590188781970154
p-value:   0.0
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

There is significant transfer entropy from X → Z. We should expect this transfer entropy to be non-significant when conditioning on Y, because all information from X to Z is transferred through Y.

test_result = independence(test, x, z, y)
`LocalPermutationTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent, given the 3rd variable"
Hₐ: "The first two variables are dependent, given the 3rd variable"
---------------------------------------------------------------------
Estimated: -1.2198764008727012
Ensemble quantiles (50 permutations):
    (99.9%): -1.1421298600618275
    (99%):   -1.1437674667616895
    (95%):   -1.1491416893659891
p-value:   1.0
α = 0.05:  ✖ Independence cannot be rejected
α = 0.01:  ✖ Independence cannot be rejected
α = 0.001: ✖ Independence cannot be rejected

As expected, we cannot reject the null hypothesis that X and Z are conditionally independent given Y.

The same goes for variables one step further along the chain:

test_result = independence(test, y, w, z)
`LocalPermutationTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent, given the 3rd variable"
Hₐ: "The first two variables are dependent, given the 3rd variable"
---------------------------------------------------------------------
Estimated: -1.2039122940560982
Ensemble quantiles (50 permutations):
    (99.9%): -1.112077640126405
    (99%):   -1.1127495489105437
    (95%):   -1.115043136902454
p-value:   1.0
α = 0.05:  ✖ Independence cannot be rejected
α = 0.01:  ✖ Independence cannot be rejected
α = 0.001: ✖ Independence cannot be rejected

SurrogateTest

Distance correlation

using CausalityTools
x = randn(1000)
y = randn(1000) .+ 0.5x
independence(SurrogateTest(DistanceCorrelation()), x, y)
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent"
Hₐ: "The first two variables are dependent"
---------------------------------------------------------------------
Estimated: 0.429178979414558
Ensemble quantiles (100 permutations):
    (99.9%): 0.084027076408428
    (99%):   0.08277641951651224
    (95%):   0.07679698601758908
p-value:   0.0
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

Partial correlation

using CausalityTools
x = randn(1000)
y = randn(1000) .+ 0.5x
z = randn(1000) .+ 0.8y
independence(SurrogateTest(PartialCorrelation()), x, z, y)
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent, given the 3rd variable"
Hₐ: "The first two variables are dependent, given the 3rd variable"
---------------------------------------------------------------------
Estimated: 0.005884331955841984
Ensemble quantiles (100 permutations):
    (99.9%): 0.06003425644462099
    (99%):   0.046554971687715385
    (95%):   0.040289761256576566
p-value:   0.38
α = 0.05:  ✖ Independence cannot be rejected
α = 0.01:  ✖ Independence cannot be rejected
α = 0.001: ✖ Independence cannot be rejected

Mutual information (MIShannon, categorical)

In this example, we expect the preference and the food variables to be independent.

using CausalityTools
# Simulate
n = 1000
preference = rand(["yes", "no"], n)
food = rand(["veggies", "meat", "fish"], n)
test = SurrogateTest(MIShannon(), Contingency())
independence(test, preference, food)
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent"
Hₐ: "The first two variables are dependent"
---------------------------------------------------------------------
Estimated: 0.002287507119021698
Ensemble quantiles (100 permutations):
    (99.9%): 0.009450018397440377
    (99%):   0.0058292437747054
    (95%):   0.00396360134950233
p-value:   0.13
α = 0.05:  ✖ Independence cannot be rejected
α = 0.01:  ✖ Independence cannot be rejected
α = 0.001: ✖ Independence cannot be rejected

As expected, there's not enough evidence to reject the null hypothesis that the variables are independent.
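
For contrast, if we let food depend on preference, the same test should reject the null. A hypothetical sketch (the mapping below is invented purely for illustration):

# Hypothetical: food choice now depends on the preference variable,
# so we would expect a small p-value.
food_dep = map(p -> p == "yes" ? rand(["veggies", "fish"]) : "meat", preference)
independence(test, preference, food_dep)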

Conditional mutual information (CMIShannon, categorical)

Here, we simulate a survey at a ski resort. The data are such that the place a person grew up in is associated with how many times they fell while skiing. The association is mediated by an intermediate variable preferred_equipment, which indicates what type of physical activity the person has engaged in in the past. Some activities, like skateboarding, lead to better overall balance, so people who are good on a skateboard don't fall, while people who do less challenging activities fall more often.

We should be able to reject places ⫫ experience, but not reject places ⫫ experience | preferred_equipment. Let's see if we can detect these relationships using (conditional) mutual information.

using CausalityTools
n = 10000

places = rand(["city", "countryside", "under a rock"], n);
preferred_equipment = map(places) do place
    if place == "city"
        return rand(["skateboard", "bmx bike"])
    elseif place == "countryside"
        return rand(["sled", "snowcarpet"])
    else
        return rand(["private jet", "ferris wheel"])
    end
end;
experience = map(preferred_equipment) do equipment
    if equipment ∈ ["skateboard", "bmx bike"]
        return "didn't fall"
    elseif equipment ∈ ["sled", "snowcarpet"]
        return "fell 3 times or less"
    else
        return "fell uncontably many times"
    end
end;

test_mi = independence(SurrogateTest(MIShannon(), Contingency()), places, experience)
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent"
Hₐ: "The first two variables are dependent"
---------------------------------------------------------------------
Estimated: 0.913798558297143
Ensemble quantiles (100 permutations):
    (99.9%): 0.0009412682118663583
    (99%):   0.0005850271905672326
    (95%):   0.0003825112305142701
p-value:   0.0
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

As expected, the evidence favors the alternative hypothesis that places and experience are dependent.

test_cmi = independence(SurrogateTest(CMIShannon(), Contingency()), places, experience, preferred_equipment)
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent, given the 3rd variable"
Hₐ: "The first two variables are dependent, given the 3rd variable"
---------------------------------------------------------------------
Estimated: 0.0
Ensemble quantiles (100 permutations):
    (99.9%): 5.122954902592361e-16
    (99%):   4.80613281793856e-16
    (95%):   4.64168491836522e-16
p-value:   0.78
α = 0.05:  ✖ Independence cannot be rejected
α = 0.01:  ✖ Independence cannot be rejected
α = 0.001: ✖ Independence cannot be rejected

Again, as expected, when conditioning on the mediating variable, the dependence disappears, and we can't reject the null hypothesis of independence.

Transfer entropy (TEShannon)

Pairwise

We'll see if we can reject independence for two unidirectionally coupled timeseries where x drives y.

using CausalityTools
sys = system(Logistic2Unidir(c_xy = 0.5)) # x affects y, but not the other way around.
x, y = columns(first(trajectory(sys, 1000, Ttr = 10000)))

test = SurrogateTest(TEShannon(), KSG1(k = 4))
independence(test, x, y)
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent"
Hₐ: "The first two variables are dependent"
---------------------------------------------------------------------
Estimated: 1.3748004482945986
Ensemble quantiles (100 permutations):
    (99.9%): -0.3680974059723177
    (99%):   -0.36849935155962027
    (95%):   -0.38057142337793703
p-value:   0.0
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

As expected, we can reject the null hypothesis that the future of y is independent of x, because x does actually influence y. This doesn't change if we compute partial (conditional) transfer entropy with respect to some random extra time series, because that time series influences neither of the other two variables.

independence(test, x, y, rand(length(x)))
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent, given the 3rd variable"
Hₐ: "The first two variables are dependent, given the 3rd variable"
---------------------------------------------------------------------
Estimated: 0.35675703047048013
Ensemble quantiles (100 permutations):
    (99.9%): -0.3165451470914358
    (99%):   -0.31728792533741806
    (95%):   -0.3228504922607298
p-value:   0.0
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

SMeasure

using CausalityTools
x, y = randn(3000), randn(3000)
measure = SMeasure(dx = 3, dy = 3)
s = s_measure(measure, x, y)
0.010801005496614738

The s statistic is larger when there is stronger coupling and smaller when there is weaker coupling. To check whether s is significant (i.e. large enough to claim directional dependence), we can use a SurrogateTest, as shown here.

test = SurrogateTest(measure)
independence(test, x, y)
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent"
Hₐ: "The first two variables are dependent"
---------------------------------------------------------------------
Estimated: 0.010801005496614738
Ensemble quantiles (100 permutations):
    (99.9%): 0.010839490873976272
    (99%):   0.010835757873726532
    (95%):   0.01062396658014453
p-value:   0.02
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✖ Independence cannot be rejected
α = 0.001: ✖ Independence cannot be rejected

The p-value is not small enough to reject the null at the stricter α = 0.01 and α = 0.001 levels. The rejection at α = 0.05 is a false positive, since x and y are by construction independent white noise; there is no genuine directional coupling from x to y.

What happens if we use coupled variables?

z = x .+ 0.1y
independence(test, x, z)
`SurrogateTest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent"
Hₐ: "The first two variables are dependent"
---------------------------------------------------------------------
Estimated: 0.4802919709473727
Ensemble quantiles (100 permutations):
    (99.9%): 0.010823601279271962
    (99%):   0.010779027446573896
    (95%):   0.010544734534271965
p-value:   0.0
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

Now we can confidently reject the null (independence), and conclude that there is evidence in the data to support directional dependence from x to z.
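
Since SMeasure is a directional statistic, the argument order matters; the opposite direction can be checked the same way (sketch):

# Sketch: test the opposite direction, z → x, with the same surrogate test.
independence(test, z, x)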

PATest

The following example demonstrates how to compute the significance of the PA directional dependence measure using a PATest. We'll use timeseries from a chain of unidirectionally coupled logistic maps that are coupled $X \to Y \to Z \to W$.

Conditional analysis

What happens if we compute $\Delta A_{X \to Z}$? We might expect there to be some information transfer $X \to Z$, even though the variables are not directly linked, because information is transferred through $Y$.

using CausalityTools
using DelayEmbeddings
using Random
rng = MersenneTwister(1234)

sys = system(Logistic4Chain(xi = [0.1, 0.2, 0.3, 0.4]; rng))
x, y, z, w = columns(first(trajectory(sys, 1000)))
τx = estimate_delay(x, "mi_min")
τy = estimate_delay(y, "mi_min")
test = PATest(PA(ηT = 1:10, τS = τx), FPVP())
ΔA_xz = independence(test, x, z)
`PATest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent"
Hₐ: "The first two variables are dependent"
---------------------------------------------------------------------
p-value:   0.00020077060192156678
α = 0.05:  ✓ Evidence favors dependence
α = 0.01:  ✓ Evidence favors dependence
α = 0.001: ✓ Evidence favors dependence

As expected, the distribution is still significantly skewed towards positive values. To determine whether the information flow between $x$ and $z$ is mediated by $y$, we can compute the conditional distribution $\Delta A_{X \to Z | Y}$. If these values are still positively skewed, we conclude that $Y$ is not a mediating variable. If conditioning on $Y$ causes $\Delta A_{X \to Z | Y}$ to no longer be skewed towards positive values, then we conclude that $Y$ is a mediating variable and that $X$ and $Z$ are linked $X \to Y \to Z$.

measure = PA(ηT = 1:10, τS = τx, τC = τy)
test = PATest(measure, FPVP())
ΔA_xzy = independence(test, x, z, y)
`PATest` independence test
---------------------------------------------------------------------
H₀: "The first two variables are independent, given the 3rd variable"
Hₐ: "The first two variables are dependent, given the 3rd variable"
---------------------------------------------------------------------
p-value:   0.9999998147993345
α = 0.05:  ✖ Independence cannot be rejected
α = 0.01:  ✖ Independence cannot be rejected
α = 0.001: ✖ Independence cannot be rejected

We can't reject independence when conditioning on $Y$, so we conclude that $Y$ is a variable responsible for transferring information from $X$ to $Z$.
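
For completeness, directly linked pairs can be tested in the same way. A sketch for the first link in the chain, reusing the delay τx estimated above; since x drives y directly, we would expect the null to be rejected:

# Sketch: test the directly coupled pair x → y.
test_xy = PATest(PA(ηT = 1:10, τS = τx), FPVP())
independence(test_xy, x, y)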

CorrTest

using CausalityTools
using StableRNGs
rng = StableRNG(1234)

# Some normally distributed data
X = randn(rng, 1000)
Y = 0.5*randn(rng, 1000) .+ X
Z = 0.5*randn(rng, 1000) .+ Y
W = randn(rng, 1000);
Let's test a few independence relationships. For example, we expect that X ⫫ W. We also expect dependence X !⫫ Z, but this dependence should vanish when conditioning on the intermediate variable, so we expect X ⫫ Z | Y.

independence(CorrTest(), X, W)
`CorrTest` independence test
----------------------------------------------------------------------------------
H₀: "The first two variables are independent (given the 3rd variable, if relevant)"
H₁: "The first two variables are dependent (given the 3rd variable, if relevant)"
----------------------------------------------------------------------------------
ρ, (partial) correlation: -0.014905921318266182
p-value (two-sided):      0.6378593400241255
  α = 0.05:  ✖ Independence cannot be rejected
  α = 0.01:  ✖ Independence cannot be rejected
  α = 0.001: ✖ Independence cannot be rejected

As expected, the outcome is that we can't reject the null hypothesis that X ⫫ W.

independence(CorrTest(), X, Z)
`CorrTest` independence test
----------------------------------------------------------------------------------
H₀: "The first two variables are independent (given the 3rd variable, if relevant)"
H₁: "The first two variables are dependent (given the 3rd variable, if relevant)"
----------------------------------------------------------------------------------
ρ, (partial) correlation: 0.8081260288756821
p-value (two-sided):      0.0
  α = 0.05:  ✓ Evidence favors dependence
  α = 0.01:  ✓ Evidence favors dependence
  α = 0.001: ✓ Evidence favors dependence

However, we can reject the null hypothesis that X ⫫ Z, so the evidence favors the alternative hypothesis X !⫫ Z.

independence(CorrTest(), X, Z, Y)
`CorrTest` independence test
----------------------------------------------------------------------------------
H₀: "The first two variables are independent (given the 3rd variable, if relevant)"
H₁: "The first two variables are dependent (given the 3rd variable, if relevant)"
----------------------------------------------------------------------------------
ρ, (partial) correlation: -0.029272091250858005
p-value (two-sided):      0.35544695038151275
  α = 0.05:  ✖ Independence cannot be rejected
  α = 0.01:  ✖ Independence cannot be rejected
  α = 0.001: ✖ Independence cannot be rejected

As expected, the correlation between X and Z vanishes when conditioning on Y, because Y is solely responsible for the observed correlation between X and Z.
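
This behaviour is easy to reason about for CorrTest, since the first-order partial correlation has the closed form $\rho_{XZ \cdot Y} = (\rho_{XZ} - \rho_{XY}\rho_{YZ}) / \sqrt{(1 - \rho_{XY}^2)(1 - \rho_{YZ}^2)}$: when the correlation between $X$ and $Z$ is fully explained by the path through $Y$ (so that $\rho_{XZ} \approx \rho_{XY}\rho_{YZ}$), the partial correlation is approximately zero, which is what the test reports here.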