18 July 2014

Predictive Causality

We know correlation does not imply causality. What does?

I have seen the following situation countless times:

1. newspaper publishes article with the title ‘humans with long noses are more prone to having babies’, or something equally ridiculous-sounding;
2. invariably one commenter points out that the authors are idiots, because ‘correlation does not imply causation’; and, d'oh!, they only showed correlation.

Which raises the question: What implies causation? Somehow, the discussion rarely proceeds to this third step.

It turns out, nothing really implies causation. So, saying that ‘correlation doesn't imply causation’ is kinda meaningless.

We can still gather evidence that makes us more or less sure that a causal relation exists. What kind of evidence? For example, correlation. But, yes, (Pearson) correlation constitutes only weak evidence. There are three problems with it:

1. It only detects linear dependences.
2. It does not say which way the influence goes.
3. It does not account for hidden variables. (Perhaps a third thing causes both observed things.)

To alleviate the first problem, you can compute mutual information instead.
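Here is a minimal sketch of the difference, using made-up data in which $y$ depends on $x$ only through $x^2$. The histogram-based estimator, the bin count, and all names are my own choices for illustration; this is a crude plug-in estimate, not a recommended method.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in estimate of mutual information (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()            # joint probabilities per cell
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nonzero = pxy > 0                    # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20_000)

# Assumed toy data: a purely nonlinear dependence, and an independent control.
y_dependent = x**2 + 0.05 * rng.normal(size=x.size)
y_independent = rng.uniform(-1.0, 1.0, size=x.size)

pearson_dependent = np.corrcoef(x, y_dependent)[0, 1]
mi_dependent = mutual_information(x, y_dependent)
mi_independent = mutual_information(x, y_independent)

print(f"Pearson, y = x^2 + noise: {pearson_dependent:+.3f}")
print(f"MI,      y = x^2 + noise: {mi_dependent:.3f} nats")
print(f"MI,      independent y:   {mi_independent:.3f} nats")
```

Pearson correlation stays close to zero for the parabola, while the mutual information estimate is clearly larger than it is for the independent control. Note that the estimate depends on the number of bins and is biased upwards for small samples.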

To alleviate the second problem, you can try to build a predictive model. Say you want to see if $x$ causes $y$. Then you come up with an algorithm that takes observations of $x$ as input and produces predictions of $y$. If the predictions turn out to be good, then you have fairly high confidence that a causal connection exists. This is basically how physics works. Still, there is always the danger that you'll discover a setup in which your algorithm's predictions are bad.

If $x$ and $y$ are time-series and you bring back the assumption that dependency is linear, then there are nice mathematical tools you can use: see Granger causality.
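The idea behind Granger causality can be sketched in a few lines: fit a linear model that predicts $y$ from its own past, fit another that also sees the past of $x$, and ask whether the second one has significantly smaller residuals. The code below is a rough illustration with simulated data; the function names, the lag count, and the coefficients of the toy series are all assumptions of mine, and a real analysis would use a statistics package and check stationarity first.

```python
import numpy as np

def lagged_design(series, lags):
    """Columns are series[t-1], ..., series[t-lags] for t = lags .. n-1."""
    n = len(series)
    return np.column_stack([series[lags - k : n - k] for k in range(1, lags + 1)])

def granger_f_statistic(x, y, lags=2):
    """F statistic for 'past x helps predict y beyond past y' (linear models)."""
    target = y[lags:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged_design(y, lags)])            # past y only
    unrestricted = np.hstack([restricted, lagged_design(x, lags)])    # past y and past x

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        return float(residual @ residual)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)

# Assumed toy system: x is white noise, and y is driven by the previous x.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

f_x_to_y = granger_f_statistic(x, y)
f_y_to_x = granger_f_statistic(y, x)
print(f"F (x -> y): {f_x_to_y:.1f}")
print(f"F (y -> x): {f_y_to_x:.1f}")
```

A large F statistic in one direction and a small one in the other suggests which way the influence goes, under the linearity assumption. It says nothing about hidden variables that drive both series.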

I don't know of any good way to account for hidden variables, apart from ‘test, test again’. A particularly good way to test is to make sure you are in control, and

• try to systematically cover all possible values of the independent variable $x$
• try to systematically cover all possible values of all other possibly relevant variables you can think of
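To see why being in control helps, here is a small simulation with an invented hidden variable $z$ that drives both $x$ and $y$, while $x$ has no effect on $y$ at all. Everything in it (the variable names, the coefficients, the ground-truth function) is hypothetical and chosen only to make the point.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

def outcome(x, z):
    # Hypothetical ground truth: y depends on the hidden z only.
    # The argument x is deliberately ignored, so x never causes y.
    return 2.0 * z + 0.5 * rng.normal(size=z.size)

z = rng.normal(size=n)  # hidden variable, never seen by the observer

# Passive observation: x is itself driven by z.
x_observed = z + 0.5 * rng.normal(size=n)
y_observed = outcome(x_observed, z)

# Controlled experiment: we choose x ourselves, sweeping its range
# independently of everything else.
x_controlled = rng.uniform(-3.0, 3.0, size=n)
y_controlled = outcome(x_controlled, z)

corr_observed = np.corrcoef(x_observed, y_observed)[0, 1]
corr_controlled = np.corrcoef(x_controlled, y_controlled)[0, 1]
print(f"correlation when observing x:   {corr_observed:+.3f}")
print(f"correlation when controlling x: {corr_controlled:+.3f}")
```

When $x$ is merely observed, it correlates strongly with $y$ because both follow $z$. Once we set $x$ ourselves, the correlation disappears, which exposes the hidden variable.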

But, in the end, we'll never be absolutely sure that one thing causes another.

Anonymous said...

The phrase "correlation does not imply causation" can be, and has been, used for great evil.

Several years ago, some were denying that there was any possible cause-effect relationship between deployment of a taser and the occasional death that sometimes followed nearly immediately after. They squawked the same "correlation is not causation" refrain as was once used by the tobacco companies. The usage pattern of taser deployments thankfully provided the built-in control. Typically a taser is drawn and warnings issued often for minutes on end, no deaths. Oftentimes the taser is fired but misses, no deaths. Only when the taser actually makes contact are there any mysterious deaths. The pattern is crystal clear and beyond coincidence.

They carried out a mischievous campaign to attribute these apparently cause-free deaths to "excited delirium". Years later, after hundreds of "taser associated" deaths, it is finally widely accepted that tasers, especially the X-26, can cause or contribute to death. Tobacco was dealt with by the famous Surgeon General Report. Taser's role in death was formalized by the Braidwood Inquiry.

In fact, correlation *does* imply causation (in many cases); it just requires additional evidence before one may claim a rational proof.

Anonymous said...

I've been wanting to read Judea Pearl's book "Causality", about causal Bayesian networks. Can you say anything about that?

Chris Hemedinger said...

Sometimes, correlation is enough to cause us to look further. In the tobacco example (cited by another commenter), perhaps that data led to the research that uncovered a precise mechanism by which tobacco use creates conditions that promote lung cancer. From then on it's not statistics, but hard science that shows you the connection.

But other times a good predictive model is enough to make decisions, and we don't need to know the underlying causes. Young males pay more for auto insurance because a predictive model says they are more likely to have auto accidents. "Being a young male" does not cause accidents -- not directly -- but the model works (at least for the insurance companies).