Chain Rule of Probability Theory

Introduction 

This is the third post in my series on probability theory.
The main focus of the series is an easy introduction to probability theory.

A bit of recap: in the last post, Two Random Variables, I discussed probabilities of two random events, including joint, conditional and marginal probabilities. That post covered both dependent and independent events.

The example was whether Emily and I go to Paris next year. If Emily and I did not know each other, Emily visiting Paris next year would have no impact on my chance of going to Paris (the two events were independent). On the other hand, if I knew Emily and she texted me about her trip to Paris, that could affect my chance of visiting Paris next year (the two events were dependent).

This post focuses on generalising these ideas to multiple events and random variables.

So, I am introducing a new event: an alien visiting Paris. We need to think about how this event will impact the chances of my visit and Emily's visit to Paris, using three random variables.
Alien in Paris

When people hear about an alien visiting Paris, they might think of it as a negative event. I don't want to impose a bad image of aliens, so let's say that the alien's visit to Paris is a positive event.

Recap: Conditional Probability

Here is a recap of my Paris example. A random variable \(X\) returns 0 for "me not visiting Paris" and 1 for "me visiting Paris". A random variable \(Y\) returns the same values for Emily's visit to Paris.

If Emily and I don't know each other and one's visit does not have any impact on the chance of the other person visiting Paris, the conditional probability of me going to Paris given Emily going to Paris is simply: \[ P(X=1|Y=1) = P(X=1) \] This is because Emily's visit to Paris has no impact on my decision. The two events are independent.

If Emily texts me that she's going to Paris next year and I want to go there as well, the formula changes: \[ P(X=1|Y=1) = \frac{P(X=1, Y=1)}{P(Y=1)} \] This formula is the ratio of the joint probability \(P(X=1, Y=1)\) and the probability of Emily going to Paris \(P(Y=1)\). The two events are dependent because Emily's decision changes the probability of me going to Paris. 
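As a quick numerical sketch, this formula can be computed from a small joint probability table. The numbers below are made up for illustration; only the ratio matters:

```python
# Conditional probability from a joint distribution (hypothetical numbers).
# P(X=1 | Y=1) = P(X=1, Y=1) / P(Y=1)

# Joint distribution over (x, y); the four values must sum to 1.
joint = {
    (0, 0): 0.30,
    (0, 1): 0.10,
    (1, 0): 0.20,
    (1, 1): 0.40,
}

# Marginal probability of Emily going to Paris: P(Y=1)
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)

# Conditional probability of me going, given that Emily goes
p_x1_given_y1 = joint[(1, 1)] / p_y1

print(p_y1)           # 0.5
print(p_x1_given_y1)  # 0.8
```

So with these made-up numbers, knowing that Emily goes raises my probability from a marginal \(P(X=1) = 0.6\) to \(P(X=1|Y=1) = 0.8\).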

Chain Rule

The joint probability of \(X\) and \(Y\) is expressed as follows: \[ P(X, Y) = P(X|Y)P(Y) \] This decomposition holds in general; when the two events are independent, \(P(X|Y)\) simply reduces to \(P(X)\). Throughout my posts, \(X\) is me going to Paris and \(Y\) is Emily going to Paris. What if we add a new random variable \(Z\)? The alien visiting Paris is \(Z=1\) and the alien not visiting Paris is \(Z=0\).

The joint probability of me, Emily and the alien all going to Paris next year is: \[ P(X, Y, Z) = P(X|Y, Z)P(Y, Z) = P(X|Y, Z) P(Y|Z) P(Z) \] We can expand this formula in any order: \[P(X, Y, Z) = P(X|Y, Z) P(Y|Z) P(Z) \] \[P(X, Y, Z) = P(X|Y, Z) P(Z|Y) P(Y) \] \[P(X, Y, Z) = P(Y|X, Z) P(X|Z) P(Z) \] \[P(X, Y, Z) = P(Y|X, Z) P(Z|X) P(X) \] \[P(X, Y, Z) = P(Z|X, Y) P(X|Y) P(Y) \] \[P(X, Y, Z) = P(Z|X, Y) P(Y|X) P(X) \] This is the chain rule of probability. We peel off one variable at a time, taking its conditional probability given the remaining random variables.
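To see that every ordering really gives the same answer, here is a small Python check using a hypothetical joint distribution over the three binary variables. It evaluates two of the six factorisations and compares them to the joint probability directly:

```python
# Hypothetical joint distribution over (x, y, z); the eight values sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05,
    (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.15,
    (1, 1, 0): 0.10, (1, 1, 1): 0.30,
}

def marginal(**fixed):
    """Probability of the fixed coordinates, summing out the rest."""
    idx = {"x": 0, "y": 1, "z": 2}
    return sum(p for key, p in joint.items()
               if all(key[idx[name]] == v for name, v in fixed.items()))

# Ordering 1: P(X|Y,Z) P(Y|Z) P(Z)
a = ((marginal(x=1, y=1, z=1) / marginal(y=1, z=1))
     * (marginal(y=1, z=1) / marginal(z=1))
     * marginal(z=1))

# Ordering 2: P(Z|X,Y) P(Y|X) P(X)
b = ((marginal(x=1, y=1, z=1) / marginal(x=1, y=1))
     * (marginal(x=1, y=1) / marginal(x=1))
     * marginal(x=1))

print(a, b, joint[(1, 1, 1)])  # all three are approximately 0.3
```

The products telescope, so any ordering recovers \(P(X=1, Y=1, Z=1)\) up to floating-point rounding.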

Take the first ordering as an example: we compute the conditional probability of \(X\) given \(Y\) and \(Z\), then the conditional probability of \(Y\) given \(Z\), and finally the probability of \(Z\) on its own.

This generalises to computing the joint probability of any number of random variables.
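Written out in general form, for \(n\) random variables the chain rule is: \[ P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1}) \] where the first factor is simply \(P(X_1)\), since there is nothing to condition on yet. For example, with four variables one valid ordering is \(P(A, B, C, D) = P(A|B, C, D) \, P(B|C, D) \, P(C|D) \, P(D)\).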

Order of random variables

I introduced the chain rule as if the order of the random variables were not an issue. Mathematically, \(P(X, Y, Z)\) is indeed the same no matter in which order we factorise the variables. However, the interpretation of what influences what changes with the order.

Remember that in the Paris example, \(X\) is me and \(Y\) is Emily. \(P(X|Y)\) is the probability of me visiting Paris given Emily's outcome, while \(P(Y|X)\) shows my influence on Emily's decision.

Conditional Independence

The joint probability of me \(X\) and Emily \(Y\) going to Paris is a simple product of \(P(X)\) and \(P(Y)\): \[ P(X, Y) = P(X|Y)P(Y) = P(X) P(Y) \] This is when Emily's decision to go to Paris \(Y\) has no impact on my chance of visiting Paris \(X\); the two events are independent.

Now, the joint probability of me, Emily, and the alien all going to Paris next year is: \[P(X, Y, Z) = P(X|Y, Z) P(Y|Z) P(Z) \] That is, the probability of me going to Paris given Emily and the alien, multiplied by the probability of Emily going to Paris given the alien, multiplied by the probability of the alien going. Lots of dependency!

What if Emily and I are both big fans of this alien? This alien is very famous and travels to different countries: the popular alien, Master Yoda. The alien is going to Paris next year, and both Emily and I want to go there to meet the alien (the condition). However, Emily and I do not know each other at all (the independence).

This is the situation where the two random variables \(X\) and \(Y\) are conditionally independent given \(Z\). Since Emily's decision to go to Paris has no impact on me, the joint probability of the three random variables is: \[P(X, Y, Z) = P(X|Y, Z) P(Y|Z) P(Z) = P(X|Z) P(Y|Z) P(Z) \] Remember that the joint probability of two independent variables \(X\) and \(Y\) is expressed as: \[ P(X, Y) = P(X|Y)P(Y) = P(X)P(Y) \] This is the same pattern as the equation above, except without the conditioning on \(Z\).
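A small Python sketch (with hypothetical conditional probabilities) shows both halves of this idea: conditioned on \(Z\) the variables factorise exactly, yet \(X\) and \(Y\) are still dependent overall:

```python
# A joint distribution that is conditionally independent by construction:
# P(x, y, z) = P(x|z) P(y|z) P(z). All numbers are hypothetical.
p_z = {0: 0.2, 1: 0.8}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}  # p_y_given_z[z][y]

joint = {(x, y, z): p_x_given_z[z][x] * p_y_given_z[z][y] * p_z[z]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# Conditional independence: P(X, Y | Z) = P(X|Z) P(Y|Z) for every outcome.
for (x, y, z), p in joint.items():
    assert abs(p / p_z[z] - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12

# But X and Y are NOT (unconditionally) independent here:
p_x1 = sum(p for (x, y, z), p in joint.items() if x == 1)  # ~0.78
p_y1 = sum(p for (x, y, z), p in joint.items() if y == 1)  # ~0.72
p_x1_y1 = sum(p for (x, y, z), p in joint.items() if x == 1 and y == 1)  # ~0.6
print(p_x1 * p_y1, p_x1_y1)  # roughly 0.5616 vs 0.6 -- not equal
```

The final print shows why the naming is confusing: marginally, \(X\) and \(Y\) look dependent (both are driven by \(Z\)), but once we condition on \(Z\) they split apart cleanly.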

The naming of conditional "independence" is confusing because:
  • Both \(X\) and \(Y\) are dependent on \(Z\).
  • But \(X\) and \(Y\) are independent of each other after conditioning on \(Z\).

Example

The joint probability when \(X\) and \(Y\) are independent conditioned on \(Z\) is: \[ P(X, Y, Z) = P(X|Z) P(Y|Z) P(Z) \] Let's say that both Emily and I are big fans of the alien, so if the alien goes to Paris, each of us goes with probability 1. The alien wants to go to Paris for shopping, but is only 80% sure. So I define the conditional probabilities and the probability of the alien visiting Paris as follows: \[ P(X=1|Z=1) = 1.0 \] \[ P(Y=1|Z=1) = 1.0 \] \[ P(Z=1)=0.8 \] The joint probability of all of us going to Paris next year is: \[ P(X=1, Y=1, Z=1) = 1.0 \times 1.0 \times 0.8 = 0.8 \] These are arbitrary values, but hopefully they give you a sense of conditional independence.
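In code, the computation above is just a product of the three factors:

```python
# The numbers from the example above: two certain fans, an 80%-sure alien.
p_x1_given_z1 = 1.0  # P(X=1 | Z=1): I definitely go if the alien goes
p_y1_given_z1 = 1.0  # P(Y=1 | Z=1): Emily definitely goes if the alien goes
p_z1 = 0.8           # P(Z=1): the alien is 80% sure

# Conditional independence given Z lets us simply multiply the factors.
p_all = p_x1_given_z1 * p_y1_given_z1 * p_z1
print(p_all)  # 0.8
```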

Conditional Dependence

If conditional independence exists, is there also a concept of conditional dependence? There is: Wikipedia - Conditional Dependence

According to the Wikipedia page, two independent events \(X\) and \(Y\) can become dependent in the presence of a third event \(Z\); that is, \(X\) and \(Y\) are dependent only after conditioning on \(Z\).

Pairwise Independence

When every pair of events is independent, this is called pairwise independence.

Using the Paris example: Emily \(Y\) and I \(X\), the alien \(Z\) and I, or Emily and the alien can each visit Paris next year as independent pairs. However, Paris has a magical field that denies entry to all three of us together. Mathematically: \[ P(X, Y) = P(X)P(Y) \] \[ P(X, Z) = P(X)P(Z) \] \[ P(Y, Z) = P(Y)P(Z) \] But \[ P(X=1, Y=1, Z=1) = P(X=1|Y=1, Z=1) P(Y=1, Z=1) = 0 \] because \(P(X=1|Y=1, Z=1) = 0\): the magical field rejects my entrance to the city when Emily and the alien are already there.
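There is a classic construction (my own illustrative choice here, not part of the original story) that behaves exactly like this magical field: let \(X\) and \(Y\) be fair coin flips and let \(Z\) be their XOR. Every pair is independent, but all three visiting together is impossible:

```python
from itertools import combinations

# X and Y are fair coin flips; Z = X XOR Y. If X = Y = 1 then Z must be 0,
# so P(X=1, Y=1, Z=1) = 0 even though every pair is independent.
joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

def prob(**fixed):
    """Probability of the fixed coordinates, summing out the rest."""
    idx = {"x": 0, "y": 1, "z": 2}
    return sum(p for key, p in joint.items()
               if all(key[idx[n]] == v for n, v in fixed.items()))

# Pairwise independence: P(A=1, B=1) == P(A=1) P(B=1) for every pair.
for a, b in combinations(("x", "y", "z"), 2):
    assert abs(prob(**{a: 1, b: 1}) - prob(**{a: 1}) * prob(**{b: 1})) < 1e-12

# But the three variables are not mutually independent:
print(prob(x=1, y=1, z=1))                # 0
print(prob(x=1) * prob(y=1) * prob(z=1))  # 0.125
```

If the three variables were mutually independent, both prints would show 0.125; the mismatch is exactly the "magical field".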

Mutual Independence 

Mutual independence means that every event is independent of every combination of the other events. There is no longer a magical field in Paris. None of us, Emily, the alien and I, know each other, so any one of our visits to Paris has no impact on the chances of the others.

Using probabilities from the Paris example: \[ P(X=1) = 0.7 \] \[ P(Y=1) = 0.5 \] \[ P(Z=1) = 0.8 \] \[ P(X=1, Y=1, Z=1) = P(X=1) P(Y=1) P(Z=1) = 0.7 \times 0.5 \times 0.8 = 0.28 \] Thanks to the mutual independence of these events, we just multiply the individual probabilities to compute the chance of all three of us being in Paris next year.

Independence Makes Probabilities Easy

One thing is very noticeable: the stronger the independence between events, the easier the computation of their joint probability.

When we have two independent random variables \(X\) and \(Y\), the joint probability of \(X\) and \(Y\) is: \[ P(X, Y) = P(X) P(Y) \] When we have three mutually independent random variables, the joint probability of \(X\), \(Y\) and \(Z\) is: \[ P(X, Y, Z) = P(X) P(Y) P(Z) \] So the joint probability is just the product of the individual probabilities of each event.
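For instance, with the mutually independent Paris probabilities this is a one-line product:

```python
import math

# With mutual independence, the joint probability is just the product of
# the marginals. Probabilities are the hypothetical Paris-example values.
p_x1, p_y1, p_z1 = 0.7, 0.5, 0.8

p_joint = math.prod([p_x1, p_y1, p_z1])
print(p_joint)  # approximately 0.28
```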

Statistics and machine learning often use the assumption of "independent and identically distributed (i.i.d.)" random variables. This means that events are assumed to be independent of each other and to follow the same distribution.

In the Paris example, i.i.d. means the alien, Emily and I do not know each other, so none of the events influence each other. This is the independent part. All of us focus only on Paris and never consider Brussels, Rome or London as a destination. This is the identically distributed part.

In statistics and machine learning, it is common to employ the i.i.d. assumption for its convenience. Collecting individual data points (e.g., individuals going to Paris) is easier than observing two or three dependent events (e.g., Emily and the alien both being in Paris).
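A quick Monte Carlo sketch shows the convenience in action: with independent draws (using the hypothetical Paris-example probabilities), the empirical frequency of all three events happening together converges to the simple product of the marginals:

```python
import random

# Simulate independent Bernoulli draws for me (0.7), Emily (0.5) and the
# alien (0.8), and estimate P(all three in Paris) by counting.
random.seed(42)  # fixed seed so the run is reproducible
n = 100_000
hits = sum(
    (random.random() < 0.7) and (random.random() < 0.5) and (random.random() < 0.8)
    for _ in range(n)
)
print(hits / n)  # close to 0.7 * 0.5 * 0.8 = 0.28
```

If the events were dependent, we would instead have to observe (or model) the joint behaviour directly, which is exactly the extra work the i.i.d. assumption avoids.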

Summary

As the number of random variables grows, the chain rule of probability looks complex, with lots of dependencies between the variables. The i.i.d. assumption comes to save us: we no longer need to look at the joint presence of two or more events. We look at individual events independently, so the joint probability of multiple events becomes a simple product of individual probabilities.
