Probability Distributions 1

Introduction

The fourth post of my probability series is about probability distributions.
In the previous posts, I discussed probabilities involving single and multiple random variables. Briefly, a random variable in an experiment or a trial maps a specific outcome from a sample space to a numeric value. 

The example of a probabilistic event which I kept using was whether I go to Paris next year \(X=1\) or not \(X=0\). A probability of a specific event can be expressed like this: \[ P(X=1) = 0.7 \] The generalised probability is then like this: \[ P(X) \] where \(X\) can be anything in a defined sample space. A probability distribution concerns the probabilities of all possible outcomes of a sample space. So, \( P(X=1)=0.7 \) is a part of a probability distribution, but the whole representation of the probability distribution is \( P(X) \).

More concrete examples: whether to go to Paris or not, the chance of getting a number from {1, 2, 3, 4, 5, 6} by throwing a die, and the chance of the weather being one of {sunny, rain, cloudy, snow}.

A probability distribution is also a great way to visualise probabilities of all possible outcomes such as going to Paris \( P(X=1) = 0.7 \) and \( P(X=0) = 0.3 \).

One important thing to note is that mathematically any probability distribution must sum to 1. This holds when we use a limited set of weather events {sunny, rain, cloud, snow} and exclude {hail, storm, cats and dogs}. In this defined sample space, the probabilities of {sunny, rain, cloud, snow} must account for everything.

Discrete vs Continuous

Before learning specific probability distributions, we need to know the difference between discrete and continuous values. 

Discrete values have a gap between two values.
  • Paris or no Paris {1, 0}
  • Numbers on a die {1, 2, 3, 4, 5, 6}
  • Shoe sizes {5.5, 6.0, 6.5, 7.0}
Regardless of the values being integers or floats, there is a gap between one value and another. 

The probabilities of discrete values are expressed with the probability mass function (PMF). \[ p(x) = P(X=x) \] If outcomes are indexed by \(i = 1, ..., K \), then the probability mass function assigns a probability \( p_i \) to each outcome. \[ P(X=i) = p_i \] The probability mass function must also meet the following conditions: \[ p_i \ge 0 \] \[ \sum^K_{i=1} p_i = 1 \] This might look complicated, but it simply says that "no probability is negative" and "the sum of all probabilities is 1".
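The two conditions are easy to check in code. Here is a small sketch of my own (not from the notebook linked at the end), using a fair die as the PMF:

```python
# A fair die as a probability mass function: each face maps to p_i = 1/6.
# (A toy example to illustrate the two PMF conditions.)
pmf = {face: 1 / 6 for face in [1, 2, 3, 4, 5, 6]}

# Condition 1: no probability is negative.
assert all(p_i >= 0 for p_i in pmf.values())

# Condition 2: the probabilities sum to 1 (up to float rounding).
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```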

Continuous values, on the other hand, have no gaps between two values.
  • Body temperature: [ 35.0 - 38.0 ] °C, and a value like 36.983854918 is valid
  • Height: [ 140 - 220 ] cm, and a value like 170.29238 is valid
  • Sound pitch: [ 0 - 20000 ] Hz, and a value like 7.28438 Hz is valid
Though the values have no gaps in-between, assigning a non-zero probability to a specific value of a body temperature such as "36.8739" is mathematically impossible. There are infinitely many possible body temperature values, so the probability of any exact number is \(P(T=36.8739) = 0\). Therefore, continuous values are considered over a range, e.g., 35.0-35.9, when computing probabilities.

The probabilities of continuous values are expressed with the probability density function (PDF). \[ f(x) \] If we set the lower and upper limits of human height at 140 and 220 cm, the density must integrate to 1 over that range. Because the values are continuous, we integrate instead of using the summation of the probability mass function. \[ \int_{140}^{220}f(x)dx = 1 \] So the integrated probability of human height from 140 cm to 220 cm must be 1 in this example.
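We can sanity-check this integral numerically. The sketch below is my own and assumes a toy uniform density \(f(x) = 1/80\) on [140, 220] (any valid density would do), approximated with the trapezoidal rule:

```python
def f(x):
    # Toy uniform density on [140, 220]: constant 1/80, so the total area is 1.
    return 1 / 80

a, b, steps = 140.0, 220.0, 1000
dx = (b - a) / steps
# Trapezoidal rule: average the two edges of each small strip, times its width.
total = sum((f(a + i * dx) + f(a + (i + 1) * dx)) / 2 * dx for i in range(steps))
print(total)  # close to 1.0
```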

Probability Distributions

A probability distribution considers probabilities of all possible outcomes in an experiment. Having a probability distribution gives us a visual interpretation of an experiment.

Values of random outcomes can be discrete or continuous, and a sample space carries certain properties e.g., whether to go to Paris or not to go to Paris is binary {0, 1}, a fair die has six equally likely outcomes {1, 2, 3, 4, 5, 6}. Therefore, some probability distributions are more suitable for certain experiments.

The following properties of an experiment are important for selecting the right probability distribution:
  • Number of trials
  • Sample space
  • Data type i.e., continuous or discrete
The remainder of this post gives an overview of four probability distributions: Bernoulli, binomial, categorical and multinomial.

Bernoulli distribution

The Bernoulli distribution is suitable for experiments with the following properties:
  • Number of trials: 1
  • Sample space: {0, 1}
  • Data type: discrete
So, examples of when the Bernoulli distribution is used are:
  • Going to Paris (1) or not going (0)
  • Coin heads (1) or tails (0)
  • A card is red (1) or black (0)
The mathematical definition of the Bernoulli distribution is below: \[ p(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \end{cases} \] Or here is the single-line version: \[ p(x) = p^x(1-p)^{1-x}, \quad x \in \{0, 1\} \] We don't need an index for \(p\) when we have only two outcomes: we can infer the probability of \(x = 0\) by subtracting \(p\) from 1. 
Bernoulli distribution of Paris (1) or no Paris (0)
The bar chart above illustrates the Bernoulli distribution of Paris \(P(X=1) = 0.7\) and no Paris \(P(X=0)=0.3\). Typically, the Bernoulli distribution shows a single trial with success (1) and failure (0).

Because we have the probabilities of a single trial, the figure shows both 0 and 1 on the x-axis.
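The single-line formula translates directly into code. This sketch is my own, mirroring the Paris example:

```python
def bernoulli_pmf(x, p):
    # p(x) = p**x * (1 - p)**(1 - x) for x in {0, 1}
    return p ** x * (1 - p) ** (1 - x)

p_paris = 0.7
print(bernoulli_pmf(1, p_paris))  # 0.7
print(bernoulli_pmf(0, p_paris))  # 0.3 (up to float rounding)
```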

Binomial distribution

The binomial distribution is a generalisation of the Bernoulli distribution to n trials.
  • Number of trials: n
  • Sample space: {0, 1}
  • Data type: discrete
The binomial distribution is suitable for the following experiments:
  • 10 year track record of going to Paris or not per year
  • 100 coin flips
  • Drawing a card 150 times with replacement and checking whether it is red or black
    • Drawing a card without replacement follows another distribution (the hypergeometric distribution) because the probabilities of red and black change with every draw.
The mathematical definition of the binomial distribution is as follows: \[ p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, \dots, n\} \] where \(n\) is the number of trials and \(x\) the number of successes i.e., the number of times the random variable \(X\) returns 1 during an experiment. We might see the combination expressed as \( {}_nC_{x} \) instead of \( \binom{n}{x} \).

The combination essentially counts the number of possible ways of choosing outcomes from a sample space without considering the order. The formula to calculate the combination is: \[ \binom{n}{x} = \frac{n!}{x!(n-x)!} \] For example, if we have 3 trials and 2 successes: \[ \binom{3}{2} = \frac{3!}{2!(3-2)!} = \frac{3\times2\times1}{(2\times1)\times1} = 3\] and the following events are possible ways to have 2 successes within 3 trials:
  • {1, 1, 0}
  • {1, 0, 1}
  • {0, 1, 1}
When the number of trials \(n=1\), the formula is identical to the Bernoulli distribution: \[ \binom{1}{1} = 1 \] \[ p(x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\} \] This confirms that the binomial distribution is the general version of the Bernoulli.
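Using Python's `math.comb` for \( \binom{n}{x} \), the binomial PMF for the 10-year Paris example can be sketched as follows (my own code, assuming \(p = 0.7\) and \(n = 10\)):

```python
from math import comb

def binomial_pmf(x, n, p):
    # p(x) = C(n, x) * p**x * (1 - p)**(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.7
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# The whole distribution sums to 1...
assert abs(sum(probs) - 1.0) < 1e-9
# ...and 7 successful trips is the most likely count.
assert max(range(n + 1), key=lambda x: probs[x]) == 7
```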
Binomial distribution of the chance of going to Paris for 10 years 
The figure illustrates the binomial distribution of me going to Paris with the chance of \( P(X=1)=0.7 \) per year for the next 10 years \( n=10 \).

Two things to note in this figure:
  • The probability of 7 successful trips is the highest.
  • The x-axis shows the number of successful trials \( x \) instead of success and failure probabilities of a single trial. 

Categorical distribution

The categorical distribution is an extension of the Bernoulli distribution to more than two outcomes for a single trial:
  • Number of trials: 1
  • Sample space: {1, ..., K}
  • Data type: discrete
Examples suitable for the categorical distribution include:
  • Throwing a die {1, 2, 3, 4, 5, 6}
  • Travelling to {Paris, Tokyo, Buenos Aires}
  • The weather tomorrow will be {Sunny, Rainy, Cloudy, Snowy}
Does extending the Bernoulli distribution mean that we need to modify this formula:  \[ p(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \end{cases} \] to this one? \[ p(x) = \begin{cases} p_1 & \text{if } x = 1 \\ p_2 & \text{if } x = 2 \\ ... \\ p_K & \text{if } x = K \end{cases} \] The extended formula looks awkward. Suppose \( \mathbf{x} \) is a one-hot vector of \( K \) elements \( [ x_1, ..., x_K ] \), where each element \( x_i \in \{0, 1\} \) and exactly one element is 1. Then we can express the categorical distribution with this formula: \[ p(\mathbf{x}) = \prod_{i=1}^K p_i^{x_i} \] For example, if we have \[ \mathbf{x} = [ 0, 1, 0] \] \[ \mathbf{p} = [ 0.3, 0.3, 0.4 ] \] the one-hot vector picks out the probability of the 2nd outcome: \[ p(\mathbf{x}=[0, 1, 0]) = 0.3^0 \times 0.3^1 \times 0.4^0 = 1 \times 0.3 \times 1 = 0.3 \]
Categorical distribution of Paris, Tokyo and Buenos Aires
The figure shows an example of the categorical distribution. This example gives the probabilities of a single trip to one of three cities: Paris, Tokyo and Buenos Aires. We have three outcomes on the x-axis. 
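The product formula with a one-hot vector is easy to verify in code. This sketch (my own) reproduces the \( \mathbf{x} = [0, 1, 0] \) example from above:

```python
def categorical_pmf(x_onehot, p):
    # p(x) = prod_i p_i**x_i, where x is a one-hot vector.
    prob = 1.0
    for x_i, p_i in zip(x_onehot, p):
        prob *= p_i ** x_i
    return prob

# Second outcome selected: only p_2 = 0.3 survives the product.
print(categorical_pmf([0, 1, 0], [0.3, 0.3, 0.4]))  # 0.3
```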

Multinomial distribution

The multinomial distribution is the distribution for multiple trials with multiple outcomes. This is the most generalised version of all four distributions in this post.

The following are the properties of the multinomial distribution:
  • Number of trials: n
  • Sample space: {1, ..., K}
  • Data type: discrete
Examples suitable for the multinomial distribution:
  • 10 year travel record of {Paris, Tokyo, Buenos Aires}
  • 5 day weather record {Sunny, Snowy, Rainy, Cloudy}
  • Throwing a die 8 times
It is worth noting that in the multinomial distribution each trial is independent and produces exactly one outcome: we cannot travel to both Paris and Tokyo in a single trial, for example.

The probability mass function of the multinomial distribution is:  \[ p(x) = \frac{n!}{x_1! x_2! \dots x_K!} \prod_{i=1}^K p_i^{x_i} \] The second part of the formula \( \prod_{i=1}^K p_i^{x_i} \) is what we saw in the categorical distribution. The first part is new, and plays a role similar to the combination in the binomial distribution. It is the generalised version of the binomial coefficient, referred to as the multinomial coefficient.

The binomial distribution has only two possible outcomes, so the number of successes \( x \) (the number of times the random variable \( X \) returns 1) is sufficient to infer the number of failures \( n-x \), where \( n \) is the number of trials. For the multinomial distribution, we need to know the number of times every possible outcome occurs: \( x_i \) denotes the number of times the i-th outcome occurs.

Here is an example. We keep a track record of trips for 2 years (\(n = 2\)). We go once to Paris, never to Tokyo, and once to Buenos Aires (\(\mathbf{x} = [1, 0, 1] \)). The multinomial coefficient for this condition is: \[ \frac{n!}{x_1! x_2! \dots x_K!} = \frac{2!}{1! 0! 1!} = \frac{2\times 1}{1 \times 1 \times 1} = 2 \] There are indeed two orderings that produce this record:
  • {Paris, Buenos Aires}
  • {Buenos Aires, Paris}
The probability when \( \mathbf{x} = [1, 0, 1] \) and \( \mathbf{p} = [0.3, 0.3, 0.4 ] \) is then: \[ p(\mathbf{x} = [1, 0, 1]) = \frac{n!}{x_1! x_2! \dots x_K!} \prod_{i=1}^K p_i^{x_i} = 2 (0.3^1 \times 0.3^0 \times 0.4^1) = 2 \times 0.3 \times 1 \times 0.4 = 0.24 \] The visualisation of the multinomial distribution using this trip example can be illustrated as follows:
The multinomial distribution indeed considers every possible outcome within the number of trials \(n=2\). The sum of these probabilities over all possible outcome vectors is 1, which satisfies the PMF condition that all probabilities sum to 1.

Visualising every possible outcome with \(k=3; n=2\) is still possible in one figure. If we increase the number of trials \(n\) or possible outcomes \(k\), what would happen?

The formula to calculate the number of possible outcomes for given \(k\) and \(n\) is: \[ \binom{n+k-1}{k-1} \] If we increase the number of trials to \(n=3\): \[ \binom{3+3-1}{3-1} = \binom{5}{2} = \frac{5!}{2!(5-2)!} = \frac{5\times4}{2\times1} = 10 \] The number of possible outcomes is 10 when \(k=3; n=3\) and 66 when \(k=3; n=10\). Having all those outcomes in a single figure would clutter the visualisation. It is guaranteed, however, that the probability mass function of the multinomial distribution still accounts for the probabilities of all possible outcomes.
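Both the multinomial PMF and the outcome-counting formula can be checked with a short sketch (my own code, reusing the trip example with \( \mathbf{p} = [0.3, 0.3, 0.4] \)):

```python
from math import comb, factorial

def multinomial_pmf(x, p):
    # n! / (x_1! ... x_K!) * prod_i p_i**x_i, with n = sum(x)
    coeff = factorial(sum(x))
    for x_i in x:
        coeff //= factorial(x_i)
    prob = float(coeff)
    for x_i, p_i in zip(x, p):
        prob *= p_i ** x_i
    return prob

# Paris once, Tokyo never, Buenos Aires once over n = 2 years.
assert abs(multinomial_pmf([1, 0, 1], [0.3, 0.3, 0.4]) - 0.24) < 1e-12

# Number of distinct outcome counts: C(n + k - 1, k - 1).
assert comb(3 + 3 - 1, 3 - 1) == 10   # k = 3, n = 3
assert comb(10 + 3 - 1, 3 - 1) == 66  # k = 3, n = 10
```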

Summary

This post summarised four probability distributions: Bernoulli, binomial, categorical and multinomial. All of these are discrete probability distributions expressed by the probability mass function (PMF). 
The figure above illustrates the relationship between the four probability distributions discussed in this post. We started with the simplest distribution with a single trial and a binary outcome (Bernoulli), and ended with the most generalised version (multinomial).

For anyone interested, this notebook should produce all of the figures in the post: https://github.com/yasumori/blog/blob/main/2026/2026_03_probability_distributions1.ipynb
