Probability Distribution2: Normal distribution

Introduction

The fifth post of my probability series is about continuous probability distributions.
A quick recap: a probability distribution concerns probabilities of all possible outcomes of a sample space. The previous post introduced four different discrete probability distributions. 

The current post introduces the most popular continuous probability distribution: the normal (Gaussian) distribution. The normal distribution was an essential tool for Automatic Speech Recognition long before deep learning.

Normal distribution

The normal distribution is a bit like the Eiffel Tower. It is a distribution known for its "bell curve": pointy in the middle and wider towards the edges. The name "Gaussian" distribution comes from the mathematician Carl Friedrich Gauss.

The properties of the normal distribution are as follows:
  • Number of trials: 1
  • Value of random variable: (\(-\infty, \infty\))
  • Data type: continuous
Unless upper and lower bounds are defined in advance, the distribution covers all possible values.

Normal distribution vs Multinomial distribution

Before describing the probability density function (PDF) of the normal distribution, I will compare the normal distribution with the multinomial distribution. This should make clear the difference between distributions dealing with discrete values and those dealing with continuous values. More details on the multinomial distribution are in the previous post, Probability Distribution1.

The multinomial distribution takes these parameters:
  • \( n \): the number of trials.
  • \( p_1, ..., p_k \): The fixed probabilities of each of the \(K\) possible outcomes.
This is intuitive because we use the probability of an outcome as a direct parameter. For example, we can say that the chance of rain tomorrow is 0.3.

The normal distribution is slightly more abstract because it describes a continuous range of values. The two parameters of the normal distribution are:
  • \( \mu \): the mean (the centre of the distribution).
  • \( \sigma^2 \): the variance (how spread the values are).
We don't have an exact probability of, for example, a 60-minute waiting time to enter the Eiffel Tower. Instead, these two parameters tell us the probability of a value falling within a range, e.g., the probability of the waiting time to enter the Eiffel Tower being between 59 and 60 minutes, or between 55 and 60 minutes, depending on how the range is defined.
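Such range probabilities come from the cumulative distribution function. Here is a minimal Python sketch, assuming the \( \mu=60 \), \( \sigma=12 \) waiting-time parameters used in the example later in this post (the helper name is my own):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and std dev sigma."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 60, 12  # hypothetical Eiffel Tower waiting time in minutes

# Probability that the waiting time falls within a given range
p_59_60 = normal_cdf(60, mu, sigma) - normal_cdf(59, mu, sigma)
p_55_60 = normal_cdf(60, mu, sigma) - normal_cdf(55, mu, sigma)
print(round(p_59_60, 4))  # roughly 0.033
print(round(p_55_60, 4))  # roughly 0.162
```

Note that the probability of the narrow range [59, 60] is close to the density value at 60 times the width of the range, which is why density values are often read as approximate probabilities.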

(I wrote a post, Basic Probability, about the mean and the variance.)

Probability density function

The probability density function of the normal distribution is shown below: \[ f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \] The math looks scary, but it's not that bad if we decompose every part of the formula:
  • \( e^{-\frac{(x - \mu)^2}{2\sigma^2}} \): this is the main part of the normal distribution. The figure below illustrates step-by-step transformation of the exponential function and how the bell curve is made by this part of the formula.
  • \( \frac{1}{\sqrt{2\pi\sigma^2}} \): this is the normalisation term. The raw function \( e^{-\frac{(x - \mu)^2}{2\sigma^2}} \) has an area of \( \sqrt{2\pi\sigma^2} \). Dividing the curve by this area ensures that the probability density function integrates to 1.
First, the series of figures below shows how this exponential function forms the bell-shaped (or Eiffel Tower) curve. The full bell curve (the fifth figure) sets mean=1.0 and variance=1.44.
The evolution of the normal distribution: 1. a regular exponential function, 2. squaring \(x\) in the exponent so that both sides rise, 3. a negative exponent to flip the function upside down, 4. dividing the exponent by 2 to slightly flatten the curve and 5. setting the real mean=1.0 and variance=1.2^2 to form the shape of the normal distribution.

Hopefully, the five figures demonstrate that the part \( e^{-\frac{(x - \mu)^2}{2\sigma^2}} \) makes the bell shape of the normal distribution. In the last figure, the mean is set to \( \mu=1.0 \) and the variance to \( \sigma^2=1.44 \). That's why the highest point of the normal distribution is located at \( x=1 \), and not at the centre of the figure, \( x=0 \).
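The whole density formula can be written directly in Python. A minimal sketch, with the bell-shaped exponential and the normalisation computed separately (the function name is my own):

```python
import math

def normal_pdf(x, mu, var):
    """Density of the normal distribution N(mu, var) at point x."""
    bell = math.exp(-((x - mu) ** 2) / (2 * var))  # bell-shaped main part
    norm = math.sqrt(2 * math.pi * var)            # area of the raw bell curve
    return bell / norm

# Density at the mean, the peak of the curve used in the figures (mu=1, var=1.44)
print(normal_pdf(1.0, 1.0, 1.44))
```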

Let's take a look at the influence of different values of variance.
Comparison of three different variance values.
The figure above shows three different variance values for the same mean \( \mu=1.0 \). As we can see, a smaller variance (values less spread out) leads to a skinnier curve, while the bell shape becomes flatter with a larger variance (values more spread out).

Now, the only part left for us to understand is the normalisation part \( \frac{1}{\sqrt{2\pi\sigma^2}} \). 

Comparison of the bell curve with and without normalisation.
The left panel of the figure shows the normal distribution without the normalisation part applied. The right panel shows the normal distribution with normalisation, i.e., the exponential function divided by \( \sqrt{2\pi\sigma^2} \). Here, the number of bins (the narrow rectangles in the figure) is set to 50, so the bell shape is not completely smooth.

For the distribution to be a probability density function, it must satisfy: \[ \int f(x)dx = 1 \] As the bold text next to the bins suggests, the left distribution does not meet this condition. In other words, the bins in the left figure show the normal distribution before the normalisation part is applied.

We all learned how to calculate the area of a rectangle at school: \( width \times height \). In the left panel, the 8 bins from \(x=0 \) to \(x=2\) all exceed \(y=0.7\). That means the total area of those bins is at least \(2\times0.7=1.4\), which violates \( \int f(x)dx = 1 \) when the normalisation part is not applied.

Finally, how do we know that the area of the raw curve is \( \sqrt{2\pi\sigma^2} \), and that dividing by this value nicely normalises the exponential function \( e^{-\frac{(x - \mu)^2}{2\sigma^2}} \)? The derivation integrates the exponential function of the normal distribution using the Gaussian integral: https://en.wikipedia.org/wiki/Gaussian_integral.
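We can also check this numerically with a simple Riemann sum, in the spirit of the bins above. A sketch using the same mean and variance as the figures:

```python
import math

mu, var = 1.0, 1.44   # the mean and variance used in the figures above
step = 0.001

# Riemann sum of the raw (unnormalised) exponential over a wide range
xs = [mu + (i - 10000) * step for i in range(20001)]
raw_area = step * sum(math.exp(-(x - mu) ** 2 / (2 * var)) for x in xs)

print(round(raw_area, 3))                                 # close to sqrt(2*pi*var)
print(round(raw_area / math.sqrt(2 * math.pi * var), 3))  # close to 1.0
```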

Example

Let's say the average waiting time to enter the Eiffel Tower is 60 minutes. The variance is 144 (so the standard deviation is \( \sqrt{144} = 12 \)). \[ \mu = 60 \] \[ \sigma^2 = 144 \] What are the probabilities of waiting time being 60 minutes and 90 minutes with these parameters? Visualisation makes it very easy to understand. 
The red line corresponds to the highest point of the probability density function. \[ f(x=60; \mu=60, \sigma^2=144) \approx 0.0332 \] So, the chance of a 60-minute waiting time to enter the Eiffel Tower is about \( 3.32 \)% (strictly speaking, this is the value of the density at \( x=60 \), not an exact probability).

The short green line shows the point of the probability density function where the waiting time is 90 minutes. \[ f(x=90; \mu=60, \sigma^2=144) \approx 0.0015 \] The chance of a 90-minute waiting time to enter the Eiffel Tower is about \( 0.15 \)%.
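The two density values above can be reproduced with a few lines of Python (a minimal sketch; the helper function is my own):

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

print(round(normal_pdf(60, 60, 144), 4))  # 0.0332 at the mean (red line)
print(round(normal_pdf(90, 60, 144), 4))  # 0.0015 at 90 minutes (green line)
```

If SciPy is available, scipy.stats.norm(60, 12).pdf(60) gives the same value.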

Multivariate Normal Distribution

The normal distribution in the previous section deals with a single phenomenon (e.g., waiting time for the Eiffel Tower). We often need to analyse multiple variables at the same time in our lives. The multinomial distribution models multiple discrete random variables. For multiple continuous random variables, the multivariate normal distribution can be the distribution of choice.

An input to the multivariate normal distribution is a vector consisting of \(K\) random variables. \[ \mathbf{x} = [ x_1, ..., x_K  ] \] The probability density function of the multivariate normal distribution is: \[ f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^K \lvert\mathbf{\Sigma}\rvert}} e^{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})} \] where \( K \) is the dimension size or the number of random variables in \(\mathbf{x}\), \( \mathbf{\mu}\) is the vector of means of each random variable, \( \mathbf{\Sigma} \) is the covariance matrix and \(\lvert\mathbf{\Sigma}\rvert \) is the determinant of the covariance matrix.

The underlying principle of this formula is still the same as a single normal distribution: the exponential function is the main distribution and the denominator of the formula is the normalisation part.
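As a sketch, the formula translates to NumPy almost line by line (the function name is my own):

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Density of the K-dimensional normal distribution at vector x."""
    k = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))    # normalisation part
    expo = np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)   # main exponential part
    return expo / norm

# At the mean with an identity covariance, the density is 1 / (2*pi)
print(mvn_pdf(np.array([0.0, 0.0]), np.array([0.0, 0.0]), np.eye(2)))
```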

The covariance matrix of \(K\) random variables is expressed as follows: \[ \boldsymbol{\Sigma} = \begin{bmatrix} \text{Var}(x_1) & \text{Cov}(x_1,x_2) & \dots & \text{Cov}(x_1,x_K) \\  \text{Cov}(x_2,x_1) & \text{Var}(x_2) & \dots & \text{Cov}(x_2,x_K) \\  \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(x_K,x_1) & \text{Cov}(x_K,x_2) & \dots & \text{Var}(x_K) \end{bmatrix} \] For each pair of two random variables, for example \(x_1\) and \(x_2\), this is how to derive their covariance: \[ \text{Cov}(x_1, x_2) = \frac{1}{n-1}\sum_{i=1}^{n} (x_{1,i} - \mu_1)(x_{2,i} - \mu_2) \] where \(n\) is the number of data points for \(x_1\) and \(x_2\).

As can be seen from the formula, the two variables must have the same number of data points to compute their covariance.
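Here is a small NumPy sketch of this covariance computation, with made-up data points (the numbers are purely illustrative):

```python
import numpy as np

x1 = np.array([50.0, 62.0, 71.0, 55.0, 66.0])  # hypothetical waiting times
x2 = np.array([18.0, 21.0, 24.0, 19.0, 22.0])  # hypothetical temperatures

# Sample covariance with the 1/(n-1) factor, as in the formula above
cov_manual = ((x1 - x1.mean()) * (x2 - x2.mean())).sum() / (len(x1) - 1)

# np.cov returns the full covariance matrix; the off-diagonal entry matches
cov_np = np.cov(x1, x2)[0, 1]
print(cov_manual, cov_np)
```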

Finally, the determinant \( \lvert\mathbf{\Sigma}\rvert \) is a scalar (single value) computed from the covariance matrix.

Example

To demonstrate an example of the multivariate normal distribution, I introduce two variables: waiting time to enter the Eiffel Tower and temperature in Paris. \[ \mu_1 = 60; \sigma^2_1=144 \] \[ \mu_2 = 20; \sigma^2_2=25 \] We don't have real data points of the waiting time and temperature in Paris here. So, I will add another parameter. \[ \rho_{12}=0.6 \] This assumes that these two random variables are moderately correlated. This Wikipedia page explains the relationship between correlation, standard deviations and covariance. The covariance of waiting time and temperature is the product of the correlation and the standard deviations of the two variables: \[ \text{Cov}(x_1, x_2) = \rho_{12} \sigma_1 \sigma_2 \] The figure below visualises the multivariate normal distribution of this example.
The red line is the normal distribution of waiting time and the blue line is the normal distribution of temperature. In the middle of the figure, a 3D heatmap visualises the chance of the two random variables occurring together.

The \( \rho_{12} = 0.6 \) assumes a moderate level of correlation between the two variables, so the contours of the 3D heatmap form a tilted ellipse. When the correlation is 0, the covariance also becomes 0, and the contours become an axis-aligned ellipse (a circle only if the two variances are equal).
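Putting the example parameters together, this is a small sketch of how the covariance matrix \( \mathbf{\Sigma} \) could be built (the variable names are my own):

```python
import numpy as np

sigma1, sigma2 = 12.0, 5.0   # standard deviations: sqrt(144) and sqrt(25)
rho = 0.6                    # assumed correlation between the two variables

cov12 = rho * sigma1 * sigma2          # Cov(x1, x2) = 0.6 * 12 * 5 = 36.0
Sigma = np.array([[sigma1 ** 2, cov12],
                  [cov12, sigma2 ** 2]])

print(Sigma)                 # variances on the diagonal, covariances off it
print(np.linalg.det(Sigma))  # the determinant used in the normalisation part
```

The determinant here is \( 144 \times 25 - 36^2 = 2304 \), the scalar that enters the normalisation part of the multivariate density.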

Summary

This post covered the details of the normal (Gaussian) distribution and the multivariate normal distribution. The math of the normal distribution looks difficult because of the exponential function and the normalisation part. I hope this post explained the underlying concept of the normal distribution behind that mathematical formula.

This Jupyter notebook has everything to produce the figures in this post.
