Probability Distribution3: Beta and Dirichlet Distributions

Introduction

The sixth post of my probability theory series focuses on the beta distribution.
The beta distribution has a multivariate version called the Dirichlet distribution, which was very popular in the Bayesian and natural language processing literature before the LLM era.

A common one-liner explanation of the beta distribution is "a distribution over a probability". I expect most readers to be confused by this; so was I when I heard it for the first time.

Here is an example. I visited Paris every year for the past 10 years. According to my travel history, that makes my chance of visiting Paris next year 100%.

The beta distribution asks this question: "How confident are we in this 100% chance of me visiting Paris next year?" It is a distribution ("our confidence") over probabilities ("I'll visit Paris next year with a 10%, 50% or 100% chance").

Unlike the Bernoulli distribution (introduced in Probability Distribution1), the beta distribution leaves room for uncertainty. Even with a 100% record of visiting Paris every year for the past 10 years, I might still not visit the city next year.

Beta Distribution

The characteristics of the beta distribution are as follows:
  • Number of trials: \(n\) trials 
  • Value of random variable: [0, 1]
  • Data type: continuous
The beta distribution is about the uncertainty of a random event. We can use 1 trial, 10 trials or 1,000 trials to estimate the probability of a random event. As we saw in the introduction, the value of the random variable is \( x \in [0, 1] \): the input is a probability, and the beta distribution tells us how certain this probability is. This is referred to as the probability density. The uncertainty of the probability is a continuous value.

The probability density function of the beta distribution is expressed as follows: \[ f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha, \beta)} \] More details of the two parameters \( \alpha \) and \( \beta \) are discussed later in this post. The function \( \mathrm{B} \) is the beta function: \[ \mathrm{B}(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1} dt \] where the integration variable \( t \) runs over \( [0, 1] \) because the input is a probability \( x \in [0, 1] \).
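As a sanity check, the formula above can be evaluated directly and compared against SciPy's implementation. This is just a numerical sketch; the closed form \( \mathrm{B}(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta) \) is used to avoid the integral.

```python
# Evaluate the beta PDF straight from the formula and compare with SciPy.
from scipy.special import gamma
from scipy.stats import beta as beta_dist

def beta_pdf(x, a, b):
    """f(x; alpha, beta) written directly from the density above."""
    B = gamma(a) * gamma(b) / gamma(a + b)  # closed form of the beta function
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

x = 0.7
print(beta_pdf(x, 2.0, 5.0))        # manual formula
print(beta_dist.pdf(x, 2.0, 5.0))   # SciPy's implementation, same value
```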

Similar to the probability density function of the normal distribution, the numerator \( x^{\alpha-1}(1-x)^{\beta-1} \) shapes the graph of the beta distribution. The denominator \( \mathrm{B}(\alpha, \beta) \) is a normalisation term. It ensures that the integral of the probability density function (the area under the curve) is 1, i.e., it satisfies a requirement of a probability density function.
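The normalisation claim is easy to confirm numerically: integrating the density over \( [0, 1] \) should give 1 for any choice of parameters. The pair \( \alpha = 3, \beta = 7 \) below is arbitrary.

```python
# Numerically confirm that the beta PDF integrates to 1 over [0, 1].
from scipy.stats import beta
from scipy.integrate import quad

area, _ = quad(lambda x: beta.pdf(x, 3.0, 7.0), 0.0, 1.0)
print(round(area, 6))  # 1.0
```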

\( \alpha \) and \( \beta \)

The two parameters \( \alpha \) and \( \beta \) determine the shape of the beta distribution. These two parameters enable the distribution to form a different shape flexibly.
The six different settings of alpha and beta values.
The figure above shows 6 different shapes of the beta distribution with various \( \alpha \) and \( \beta \) values. Three things to observe here:
  • The x-axis is a sweep from 0.0 to 1.0:  \( x \in [0, 1] \).
  • The values on the y-axis vary a lot across sub-plots, showing differences in density.
  • The different alpha and beta values lead to varied shapes of the distribution in each sub-plot.
The summary of what the alpha and beta parameters do is below, with the example of whether I go to Paris with a 100% chance next year:
  • \(\alpha=1, \beta=1\): the distribution is a uniform distribution. Every possibility from 0% to 100% is equally likely. Despite my 100% history of visiting Paris, my visit next year is completely uncertain.
  • \(\alpha=10, \beta=1\): a higher alpha value concentrates the density towards high probabilities. My 100% history of visiting Paris is most likely accurate.
  • \(\alpha=1, \beta=10\): a higher beta value concentrates the density towards low probabilities. My 100% history of visiting Paris is most likely misleading.
  • \(\alpha\gt1, \beta\gt1, \alpha=\beta\): the distribution is a bell shape with its peak in the middle. A 50:50 chance of me going to Paris is the most likely.
  • \(\alpha\lt1, \beta\lt1, \alpha=\beta\): the distribution shows a U shape. My 100% history of visiting Paris is either dead right or wildly wrong.
  • \(\alpha\lt1, \beta\lt1, \alpha\lt\beta\): similar to \(\alpha=1, \beta=10\), but the curve towards 0% is almost vertical. The distribution is very confident that my 100% track record of travelling to Paris is incorrect.
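The shape claims in the list above can be checked numerically with `scipy.stats.beta`; the parameter pairs below mirror some of the bullets.

```python
# Check a few of the shape claims about alpha and beta numerically.
from scipy.stats import beta

# alpha=1, beta=1: uniform, the density is 1 everywhere on (0, 1)
print(beta.pdf(0.2, 1, 1), beta.pdf(0.8, 1, 1))

# alpha=10, beta=1: the density piles up near x=1
print(beta.pdf(0.9, 10, 1) > beta.pdf(0.1, 10, 1))  # True

# alpha=0.5, beta=0.5: U shape, the endpoints are denser than the middle
print(beta.pdf(0.05, 0.5, 0.5) > beta.pdf(0.5, 0.5, 0.5))  # True
```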

Example

A classic example to demonstrate how the beta distribution works is coin flipping. I don't like this example, though, so I would like to use one closer to everyday life: how likely is your new colleague (or classmate) to do well in his new job?

When your new colleague has just started a job, you have no prior knowledge about this person. Is this person going to complete tasks well with respect to quality and speed? Let's see.
Three stages of the new colleague example.
The three sub-plots above illustrate how the beta distribution of your new colleague's task-completion success rate evolves. Initially, we have no information about the new colleague, so the chance of success or failure is completely flat: \( \alpha = 1 \) and \( \beta = 1 \).

A year has passed since the arrival of this colleague. He has been doing very well, completing 9 out of 10 tasks successfully. We expect a 90% success rate (red line in the middle plot) from him, with \( \alpha = 10 \) and \( \beta = 2 \). The density at that point is 4.26. The success rate still has some wiggle room between 0.7 and 1.0, as we can observe in the sub-plot, e.g., the relatively high density around \( x = 0.7 \).

More years have passed. He is not as successful as in the beginning: 99 successes vs 49 misses. We expect a success rate of 66% from him, with \( \alpha = 100 \) and \( \beta = 50 \). Our expectation of this guy completing tasks is now almost fixed at a 66% chance. One more success or one more failure would not shift the distribution drastically. To earn a higher expected success rate, he needs many consecutive successes to improve his reputation.
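The whole colleague story is simple Bayesian updating: starting from a flat Beta(1, 1), each success adds 1 to \( \alpha \) and each failure adds 1 to \( \beta \). A short sketch reproducing the numbers above:

```python
# The colleague example as Bayesian updating of a beta distribution.
from scipy.stats import beta

a, b = 1, 1              # no information yet: flat Beta(1, 1)
a, b = a + 9, b + 1      # year one: 9 successes, 1 failure -> Beta(10, 2)
print(round(beta.pdf(0.9, a, b), 2))   # density at a 90% success rate: 4.26

a, b = a + 90, b + 48    # later: 99 successes, 49 failures in total -> Beta(100, 50)
print(round(beta.mean(a, b), 2))       # expected success rate is now about 0.67
```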

A message here is that when you start your new job, you should try your hardest to make a good impression. Your reputation would not be ruined easily if your initial impression is excellent.

Joking aside, the real message is this. If the Bernoulli distribution (described in this post: Probability Distribution1) says the probability is 0.8, that's 0.8. There's no wiggle room. The beta distribution, on the other hand, captures the uncertainty of this probability of 0.8. The beta distribution can say that a 0.8 chance is most likely, but 0.7 or 0.9 are also plausible. Hence, a "distribution over a probability".

Dirichlet Distribution

The Bernoulli distribution is to the categorical distribution what the beta distribution is to the Dirichlet distribution.

Recalling the relationship between the Bernoulli distribution and the categorical distribution from this post (Probability Distribution1), the Dirichlet distribution is the multivariate version of the beta distribution.

The properties of the Dirichlet distribution are as follows:
  • Number of trials: \(n\) trials 
  • Value of random variables: [0, 1] and \( \sum^K_{i=1} x_i = 1 \)
  • Data type: continuous
The Dirichlet distribution is the generalised version of the beta distribution. When \( K=2 \), the distribution is identical to the Beta distribution with \( x \) and \( 1 - x \) in its numerator.

The probability density function of the Dirichlet distribution is the following: \[ f(x_1, \dots, x_K; \alpha_1, \dots, \alpha_K) = \frac{1}{\mathrm{B}(\boldsymbol{\alpha})} \prod_{i=1}^K x_i^{\alpha_i - 1} \] where \( \boldsymbol{\alpha} \) is a vector of concentration parameters for the \( K \) variables of the Dirichlet distribution, and \( \mathrm{B}(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^K \alpha_i\right)} \) is the multivariate beta function. The concentration parameters play the same role as \( \alpha \) and \( \beta \) of the beta distribution, but this time we have \( K \) different \( \alpha \) parameters.
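The reduction to the beta distribution at \( K = 2 \) is easy to verify: the density of a two-variable Dirichlet at \( (x, 1-x) \) should equal the beta density at \( x \). The values of \( \alpha, \beta \) and \( x \) below are arbitrary.

```python
# Check that the Dirichlet with K=2 reduces to the beta distribution.
from scipy.stats import dirichlet, beta

a, b, x = 3.0, 5.0, 0.4
print(dirichlet.pdf([x, 1 - x], [a, b]))  # two-variable Dirichlet
print(beta.pdf(x, a, b))                  # same value from the beta PDF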

As before, the numerator of the equation characterises the shape of the distribution, and the denominator is there to satisfy the condition of a probability density function, whose integral must always be 1.0.

Example

The math of the Dirichlet distribution looks intimidating with the vector and the product symbol, but it is actually straightforward and not too different from the beta distribution.

I'll bring back the bar chart of the categorical distribution from this post: Probability Distribution1. It shows the probabilities of going to Paris, Tokyo and Buenos Aires: 0.3, 0.3 and 0.4, respectively. The categorical distribution has no "wiggle room".
Let's say we have the same three travel destinations, Paris, Tokyo and Buenos Aires, but initially we don't know where we want to go. We have \( \boldsymbol{\alpha} = [1, 1, 1] \), illustrated in the figure below.
Each sub-plot is the marginal distribution (the distribution focusing on a single parameter) of one of the three destinations. Here, each marginal distribution is just a beta distribution: for example, in the left sub-plot for Paris, \( \alpha = 1 \), while the other two cities (Tokyo and Buenos Aires) contribute a combined count of two, so \( \beta = 2 \). We have an equal chance of visiting any of the cities, and the mean probability is 33%. We don't have a firm decision on where to go yet, so each sub-plot has lower density towards higher probabilities on the x-axis.
Things progressed a little. Paris got one vote as a travel destination, so the parameters change to \( \boldsymbol{\alpha} = [2, 1, 1] \). The sub-plot for Paris now has more density towards higher probabilities on the x-axis, while for the other two cities the lower probabilities have higher density.
Eventually, we have more travel experiences across these three cities. Paris is still our favourite. We now see higher probability density towards lower probabilities in the sub-plots for Tokyo and Buenos Aires. As we gather more data, we become more certain about which city is our favourite and most likely to be visited next time.
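The marginals described above follow a known property of the Dirichlet: the marginal of component \( i \) is \( \mathrm{Beta}(\alpha_i, \alpha_0 - \alpha_i) \), where \( \alpha_0 \) is the sum of all concentration parameters. A Monte Carlo sketch with the \( \boldsymbol{\alpha} = [2, 1, 1] \) stage from above:

```python
# The marginal of each Dirichlet component is a beta distribution:
# Beta(alpha_i, alpha_0 - alpha_i), checked here by sampling.
import numpy as np
from scipy.stats import dirichlet, beta

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 1.0])  # Paris got one extra vote
samples = dirichlet.rvs(alpha, size=200_000, random_state=rng)

paris = samples[:, 0]  # marginal samples for the Paris probability
print(round(paris.mean(), 2))                        # empirical mean, about 0.5
print(round(beta.mean(2.0, alpha.sum() - 2.0), 2))   # exact Beta(2, 2) mean: 0.5
```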

The ternary plots below are another way to illustrate this example of the Dirichlet distribution; the three destinations fit in one plot. Each is a heatmap in which the yellow region indicates high density.
The left sub-plot has \( \boldsymbol{\alpha} = [1, 1, 1] \) and each destination is equally likely (all in yellow). The middle sub-plot uses \( \boldsymbol{\alpha} = [5, 3, 2] \): it favours Paris, but the other two cities are still likely travel destinations. If the travel-destination experiment runs many more times, as in the right sub-plot, we eventually see very little wiggle room.

Summary

This post described the beta distribution and the Dirichlet distribution. These two are distributions over a probability (or probabilities). They do not tell us a single probability of an event, but how confident we are in a given probability of that event, e.g., how plausible is a 70% success rate for a new colleague's next task given his past 8 successes and 2 failures.
