Basic Probability

Introduction

Probability theory is the foundation of machine learning. Knowledge of machine learning is a requirement for working on a speech and language processing project today. So, probability theory is essential for speech and language processing projects!

The objective of this post is to refresh my knowledge of probability theory. I am keen to connect probability theory with real-world examples, and to avoid just throwing out a bunch of theoretical definitions. Feel free to leave comments if anything I wrote is incorrect.

Probability

Probability is the chance of an event occurring, expressed as a value between 0 and 1.

In contrast, human words are not mathematical. Even if I say "I'll go to Paris next year, 100%", I might not actually go to Paris next year.
(When I was in Paris last time.)

Unlike human words, mathematical probability has to be precise. A classical example is coin flipping (assuming the experiment is not rigged). First, let me organise the terminology.
  • Outcome: A single possible result of an experiment (coin flipping).
    • Heads
    • Tails
  • Event: A set of outcomes.
    • Heads
    • Heads Heads Tails
    • Tails Tails Tails Tails Tails Tails
  • Sample space: The set of all possible outcomes of an experiment.
    • \( S = \{Heads, Tails\} \), where \( K \) denotes the number of unique outcomes (here \( K = 2 \))
  • Probability: A chance of getting an event (one or multiple outcomes) from an experiment (coin flipping).
    • \( P(Heads) = 1/2 = 0.5 \)
    • \( P(Tails) = 1/2 = 0.5 \)
A probability distribution must meet the following condition, summing over all \( K \) outcomes \( x_i \) in the sample space:
\[ \sum_{i=1}^{K} P(x_i) = 1\]
Coming back to the Paris example, saying "let's meet again next year, 100%" is mathematically wrong. The unexpected happens in life, and a more accurate model of meeting in Paris could be: \[ P(Meet) + P(No\_more\_holiday) + P(No\_money) + P(Paris\_gone\_from\_Earth) + \ldots = 1 \]
Sample space is enormous in life!
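As a quick sanity check of the condition above, a short simulation can estimate these probabilities empirically. This is a sketch; the outcome labels simply follow the sample space \( S \):

```python
import random

# Simulate many fair coin flips to estimate P(Heads) and P(Tails).
random.seed(0)
sample_space = ["Heads", "Tails"]
n_trials = 100_000
counts = {outcome: 0 for outcome in sample_space}
for _ in range(n_trials):
    counts[random.choice(sample_space)] += 1

# Empirical probabilities; each should be close to 0.5.
probs = {outcome: counts[outcome] / n_trials for outcome in sample_space}
print(probs)

# The probabilities over the whole sample space must sum to 1.
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

The final assertion holds by construction here (the counts partition the trials), but it mirrors the mathematical condition that probabilities over a sample space sum to 1.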

Random Variable

A variable is something that varies. In Python, setting a variable a = 2 and then incrementing it with a += 1 makes a = 3. The value of a varies. An algebraic variable, in contrast, stands for a single fixed number: \( x = 5 \) satisfies \( 5x + 2 = 27 \).

I find the naming of "random variable" frustrating, because a random variable is neither random nor a variable. It is a function. Yes, it is a function :(

In probability theory, we want to express random outcomes with numeric values, so we want to assign a value to each outcome in the sample space \( S = \{Heads, Tails\} \).

We usually associate heads with positivity and tails with negativity. The random variable for coin flipping can be defined as \( X \), written as an upper-case letter.

Using this function, let's assign a value to coin flipping outcomes:
\[ X(Heads) = 1; X(Tails) = 0 \]
Coming back again to the Paris example, we can define a new random variable \( Y \) to assign an economic value to each outcome above.
\[ Y(Meet) = 1000; Y(No\_money) = 0; Y(Paris\_gone\_from\_Earth) = -1\ \text{trillion} \]
If I go to Paris next year again, I will probably spend \$1,000 on hotel, food and transportation. If I have no money and do not travel to Paris next year, no economic value goes to the city. If aliens unexpectedly arrive in Paris and destroy the city, that would be a \$1 trillion economic loss (the GDP of Paris is about \$1 trillion).

We have seen categorical outcomes so far, but a random variable can also assign a numeric value to a numeric outcome. \( Z(3) = 3 \) maps the die outcome 3 to the value 3. Because a random variable is a function, it can also assign the value 2 to the outcome 3: \( Z(3) = 2 \). We might want this if rolling a 3 earns 2 points, say in a specially designed board game where faces 1-3 are worth 2 points and faces 4-6 are worth 6 points.
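Because a random variable is just a function from outcomes to numbers, it can be written directly as one in Python. This is a sketch; the board-game scoring is the hypothetical one described above:

```python
def X(outcome: str) -> int:
    """Random variable for coin flipping: map each outcome to a number."""
    return {"Heads": 1, "Tails": 0}[outcome]

def Z(outcome: int) -> int:
    """Random variable for the made-up board game: faces 1-3 are worth
    2 points, faces 4-6 are worth 6 points."""
    return 2 if outcome <= 3 else 6

print(X("Heads"))  # 1
print(Z(3))        # 2
print(Z(5))        # 6
```

The function view makes it clear that nothing is "random" about the mapping itself; the randomness lives in which outcome the experiment produces.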

Expectation / Expected Value

The expectation, or expected value, is the number we would expect to see on average from a random outcome.

The expected value of throwing a die is 3.5, and the expected value of flipping a coin is 0.5, assuming heads is 1 and tails is 0. How do we calculate these? \( \frac{1+2+3+4+5+6}{6} = 3.5 \) for the die and \( \frac{1 + 0}{2} = 0.5 \) for the coin? Not quite. We are discussing random variables, so we need to consider probabilities.

The formula of the expectation is this: \[ \mathbb{E}[X] = \sum_{i=1}^{K} x_i P(x_i)  \] where \( x_i \) is the \( i \)th of the \( K \) outcomes in the sample space \( S \). We can now consider the probabilities in the die-throwing and coin-flipping examples. Let's define \( X_{die} \) as the random variable of throwing a die:
\[ \mathbb{E}[X_{die}] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6}  +  3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = \frac{21}{6} = 3.5 \]
Now \( X_{coin} \) is a random variable of flipping a coin and its expected value is:
\[ \mathbb{E}[X_{coin}] = 1 \times \frac{1}{2} + 0 \times \frac{1}{2} = \frac{1}{2} = 0.5 \]
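The two expectation calculations above translate directly into a few lines of Python. This is a sketch; representing each outcome as a `(value, probability)` pair is my own choice:

```python
def expectation(outcomes):
    """E[X] = sum of value * probability over the sample space.

    `outcomes` is a list of (value, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)

# A fair 6-sided die and a fair coin (heads = 1, tails = 0).
die = [(face, 1 / 6) for face in range(1, 7)]
coin = [(1, 1 / 2), (0, 1 / 2)]

print(expectation(die))   # ≈ 3.5
print(expectation(coin))  # 0.5
```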
Hmm, did we still get the same results as the earlier calculations?

This confused me at first: for throwing a die and flipping a coin, the arithmetic mean and the expected value give the same result.

A better example is the economic value Paris receives next year. Let's limit the outcomes to the following three, with a random variable: \[ X_{paris}(Meet) = 1000; X_{paris}(No\_money) = 0; X_{paris}(Paris\_gone\_from\_Earth) = -1\ \text{trillion} \] Do we want to just average those values to estimate the economic value of Paris next year? \( \frac{1000 + 0 - 1\ \text{trillion}}{3} \approx -333\ \text{billion} \). This looks wrong because the chance of aliens destroying Paris should be close to 0.

Let's say the chance that I go to Paris next year is very close to 50:50, and the chance of aliens visiting Paris is very close to one in a trillion. The expected value of the Paris economy next year is:
\[ \mathbb{E}[X_{paris}] = 1000 \times 0.499\ldots95 + 0 \times 0.499\ldots95 + (-1\ \text{trillion}) \times 10^{-12} \approx 499 \]
Thanks to the expected value, we can estimate this complex scenario too.
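As a sanity check, here is the Paris calculation in Python. The dollar values and probabilities are the illustrative ones from this post, not real estimates:

```python
# Hypothetical outcomes: (value in dollars, probability).
paris = [
    (1_000, 0.4999999999995),        # we meet in Paris
    (0, 0.4999999999995),            # no money, no trip
    (-1_000_000_000_000, 1e-12),     # aliens destroy Paris
]

# Sanity check: probabilities over the sample space sum to 1.
assert abs(sum(p for _, p in paris) - 1.0) < 1e-9

# E[X] = sum of value * probability.
expected_value = sum(value * p for value, p in paris)
print(round(expected_value))  # 499
```

Note how the trillion-dollar catastrophe contributes only about \$1 of negative value, because its probability is so tiny.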
 
(Embedded Reddit comment by u/itiswhatitis985 from r/statistics.)

This comment is a very good summary of the expected value vs the average, so I leave it here.

Variance

Another key concept in probability theory is the variance. The expected value tells us the central value of a random variable's outcomes; the variance tells us how spread out those values are.

Similar to the expected value, let's consider the case without probabilities first for simplicity. The arithmetic mean of a 6-sided die is \( \bar{x} = 3.5 \), as we saw in the Expectation section. The formula to calculate the variance without considering probabilities is as follows: \[ V(X) = \frac{1}{K} \sum^{K}_{i=1}(x_i - \bar{x})^2 \] \( K \) is the number of possible outcomes and \( x_i \) is the value of each outcome. We can calculate the variance of the 6-sided die: \[ V(X_{die}) = \frac{(-2.5)^2 + (-1.5)^2 + (-0.5)^2 + 0.5^2 + 1.5^2 + 2.5^2}{6} = \frac{17.5}{6} \approx 2.92 \] Let's then consider an 8-sided die with 0 and 7 as additional values. The average of the 8-sided die is still 3.5. However, the variance is: \[ V(X_{8SideDie}) = \frac{(-3.5)^2 + (-2.5)^2 + (-1.5)^2 + (-0.5)^2 + 0.5^2 + 1.5^2 + 2.5^2 + 3.5^2}{8} = \frac{42}{8} = 5.25 \] As promised, the variance tells us that the values of the 8-sided die are more spread out than those of the 6-sided die.
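These uniform-probability variances can be double-checked with Python's standard library. `statistics.pvariance` computes the population variance, i.e. it divides the sum of squared deviations by the number of outcomes:

```python
import statistics

# Face values of a fair 6-sided die and an 8-sided die (0-7).
die_6 = [1, 2, 3, 4, 5, 6]
die_8 = [0, 1, 2, 3, 4, 5, 6, 7]

# Population variance: mean of squared deviations from the mean.
print(statistics.pvariance(die_6))  # 17.5 / 6 ≈ 2.92
print(statistics.pvariance(die_8))  # 42 / 8 = 5.25
```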

You might have noticed that the probabilities of getting heads or tails in coin flipping, and of each value in die throwing, are all equal (a uniform distribution). That is why we didn't have to consider probabilities when calculating the arithmetic mean and the variance.

Now, let's consider randomness in the calculation of the variance. The complete formula for the variance of a random variable \( X \) is as follows: \[ V(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \] It's nothing scary if we calm down and decompose this formula.

The \( X \) is just a random variable. The \( X - \mathbb{E}[X] \) means "subtract the expected value of the random variable from the value of an outcome". In the case of a die with \( X(1) = 1 \), that would be \( 1 - 3.5 \). Next, \( (X - \mathbb{E}[X])^2 \) is just the square of what we have computed so far. Finally, the outer \( \mathbb{E} \) takes the expected value of the squared values we just computed.

Expanding the outer expectation into a sum over outcomes, this can be expressed as: \[ V(X) = \sum_{i=1}^K P(x_i) (x_i - \mathbb{E}[X])^2 \] This is the complete formula of the variance. We are now able to compute the variance of a random variable that involves different probabilities (like the chance of my, or the aliens', visit to Paris).
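The complete formula can be sketched in Python, again assuming outcomes are given as `(value, probability)` pairs:

```python
def variance(outcomes):
    """V(X) = sum of P(x_i) * (x_i - E[X])^2 over the sample space.

    `outcomes` is a list of (value, probability) pairs."""
    mean = sum(value * p for value, p in outcomes)
    return sum(p * (value - mean) ** 2 for value, p in outcomes)

# A fair 6-sided die: uniform probabilities recover the earlier result.
die = [(face, 1 / 6) for face in range(1, 7)]
print(variance(die))  # ≈ 2.92 (= 17.5 / 6)

# A biased coin (X(Heads) = 1, X(Tails) = 0) with P(Heads) = 0.9.
biased_coin = [(1, 0.9), (0, 0.1)]
print(variance(biased_coin))  # ≈ 0.09
```

The biased-coin case shows why probabilities matter: the rare outcome contributes little to the spread, so the variance is much smaller than the 0.25 a fair coin would give.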

Standard Deviation

The standard deviation is the square root of the variance: \[ SD(X) = \sqrt{V(X)} \] The variance is a little hard to interpret because every element is squared.

We saw that the variance of the fair 6-sided die is \( V(X_{die}) \approx 2.92 \) and that of the 8-sided die is \( V(X_{8SideDie}) = 5.25 \). We knew that the variance of the 8-sided die was higher than that of the 6-sided die. However, these squared values seemed unrelated to the face values of regular dice.

The standard deviations are \( SD(X_{die}) \approx 1.71 \) and \( SD(X_{8SideDie}) \approx 2.29 \), bringing these values back to the scale of the 6-sided and 8-sided dice. This applies to any unit, including litres, metres, grams, exam scores, and so on; hence it is a convenient value for interpretation.
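A quick check of the square-root relationship in Python, using the population variance from the standard library:

```python
import math
import statistics

# Face values of a fair 6-sided die and an 8-sided die (0-7).
die_6 = [1, 2, 3, 4, 5, 6]
die_8 = [0, 1, 2, 3, 4, 5, 6, 7]

# SD(X) = sqrt(V(X)): take the square root of the population variance.
sd_6 = math.sqrt(statistics.pvariance(die_6))
sd_8 = math.sqrt(statistics.pvariance(die_8))

# The standard deviations are back on the scale of the die faces.
print(round(sd_6, 2))  # 1.71
print(round(sd_8, 2))  # 2.29
```

The same values come directly from `statistics.pstdev`, which is just this square root packaged as one call.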

Note

I kept the notation \( P \) for probability and \( S \) for the sample space, but you may also see \( p \) for probability and \( \mathcal{X} \) for the sample space.

This post only considered discrete random variables. Calculating the expected value and the variance of continuous random variables requires integration instead of summation.

Statistics introduces new notations for the mean and the variance: \( \mu \), \( \sigma \), \( \bar{x} \) etc. I might write another post on statistics in the future.  

Reference

For interactive learning of probability theory basics (and beyond)

I went through many other online materials to refresh my knowledge of probability theory. We can find so many good resources these days! I won't list all of the pages I have looked at; the most important thing is that I wrote this post in my own words to explain the basic concepts of probability theory (which I believe is the best way to learn).
