Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Estimation (MAP)

Introduction

I have made several posts about probabilities and distributions. 
I used arbitrary probabilities and parameters in the series. Something like "the chance of me going to Paris next year is 70%" (Two Random Variables), "the mean wait time to enter the Eiffel Tower is 60 minutes and the variance is 144" (Probability Distribution2: Normal distribution) and "the rate of successful task completion by your new colleague is 66%" (Probability Distribution3: Beta and Dirichlet distribution). All those numbers were arbitrary.

The goal of this post is to learn how to estimate probabilities and parameters from the real data points we have in hand.

The first section of this post shows two important properties for parameter estimation methods:
  • Point estimate vs Density estimate
  • No prior knowledge vs Use of prior knowledge
Following the properties of parameter estimation methods, I will discuss two methods: Maximum likelihood estimation (MLE) and Maximum a Posteriori Estimation (MAP).

As usual, I am aiming for an easy introduction to the topic in my blog, and I hope you enjoy reading this :)

Point estimate vs Density estimate

A point estimate provides a single probability or parameter as the result of analysing data, while a density estimate outputs a distribution over probabilities or parameters.
Illustration of a point estimate and a density estimate given data points.

For example, a fair coin should land on Heads or Tails with equal probability. Suppose we flip a coin 10 times and get Heads 4 times and Tails 6 times. A point estimate says that the probability of getting Heads from this coin is \( \frac{4}{10} = 0.4 \).

A density estimate on the other hand says "the data shows \( 0.4 \) is the most likely probability but there is still some wiggle room for this probability. It could be \( 0.45 \) or even \( 0.5 \)".
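To make the contrast concrete, here is a minimal Python sketch using the coin data above. The grid of candidate probabilities and the normalisation are my own illustrative choices, and the sketch assumes every candidate value is treated as equally plausible before seeing the data. It is not the full Bayesian treatment, only a picture of "one number" versus "a spread of numbers".

```python
# Coin example from the text: 4 Heads and 6 Tails in 10 flips.
heads, tails = 4, 6

# Point estimate: a single number.
point_estimate = heads / (heads + tails)

# Density estimate (sketch): score every candidate probability on a grid
# with theta^heads * (1 - theta)^tails, then normalise so the scores sum to 1.
# Assumption: all candidates are equally plausible before seeing the data.
grid = [i / 100 for i in range(101)]
scores = [t**heads * (1 - t) ** tails for t in grid]
total = sum(scores)
density = [s / total for s in scores]

peak = grid[density.index(max(density))]

print(point_estimate)             # 0.4
print(peak)                       # 0.4
print(density[45] / density[40])  # below 1: 0.45 is plausible, just less so than 0.4
print(density[50] / density[40])  # smaller again: 0.5 is still possible
```

The point estimate keeps only the peak, while the density keeps the whole list of weights.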

Prior knowledge

Prior knowledge is information about an experiment before starting trials, or information we accumulate during trials of an experiment.

We use prior knowledge all the time in our everyday life:
Prior knowledge1: A newly built house is more expensive.
This information is useful to estimate a house price.

Prior knowledge2: Fresh coffee is hot.
A barista just made coffee. We use this prior knowledge to avoid drinking hot coffee right away.

Prior knowledge3: Rome during the holiday season is crowded. 
We use this to decide whether we want to visit the city during the high season.
An image of crowded Rome (by ChatGPT). 

Some parameter estimation methods exploit prior knowledge while others do not.

Parameter estimation methods in this post

This post covers the following two methods for estimating parameters from data:
  • Maximum likelihood estimation (MLE)
  • Maximum a posteriori estimation (MAP)
It might seem that I am mixing the terms "parameter" and "probability".
  • A probability of me going to Paris next year \( P(Paris) = 0.7 \) from my travel history.
  • The height distribution of one classroom \( \mu = 175 \) and \( \sigma^2 = 49 \) from individual medical records.
Both probabilities and parameters (of a distribution) are numeric values that summarise data points. I will call both of them "parameters" in this post.

Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation (MLE) is the simplest estimation method. MLE has the following properties:
  • Point estimate.
  • No prior information.
  • MLE works best when a large amount of data is available.

Example1

In 7 of the past 10 years, I went to Paris at least once. The probability of me going to Paris next year is: \[ P(Paris) = \frac{7}{10} = 0.7 \]

Example2

There are three boys and their heights are 170 cm, 175 cm and 180 cm. The mean of their heights is: \[ \mu_{height} = \frac{170+175+180}{3} = 175 \] So, it's fairly simple.
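Why does a plain average count as a maximum likelihood estimate? The sketch below assumes the heights follow a Normal distribution with a fixed spread (the value `sigma=5.0` is an arbitrary choice of mine, not from the example) and searches a grid of candidate means for the one that gives the data the highest log-likelihood. The function name `gaussian_log_likelihood` is just an illustrative helper.

```python
import math

heights = [170, 175, 180]

def gaussian_log_likelihood(mu, data, sigma=5.0):
    """Log-likelihood of data under Normal(mu, sigma^2).

    sigma is an assumed, fixed value used only for illustration.
    """
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )

sample_mean = sum(heights) / len(heights)

# Candidate means from 160.0 to 190.0 in steps of 0.1.
candidates = [160 + i / 10 for i in range(301)]
best_mu = max(candidates, key=lambda mu: gaussian_log_likelihood(mu, heights))

print(sample_mean)  # 175.0
print(best_mu)      # 175.0
```

Under these assumptions the grid search lands on the same value as the simple mean.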

The real goal here, however, is to understand the generalised formula of MLE: \[ \hat{\theta}_{MLE} = \arg\max_{\theta} P(D|\theta) \] where \( \theta \) is the parameter, \(D\) is data and \(\hat{\theta}\) is the estimated parameter. As the name suggests, we need to "maximise" the "likelihood" for parameter estimation using MLE.

MLE Pikachu Encounter Rate

Let's define the unknown parameter (probability) of encountering Pikachu as \( \theta \). The law of probability says that the probability of encountering Pikachu and the probability of encountering other Pokemons (\(1-\theta\)) must sum up to 1.

We record the Pokemon we encounter. We met wild Pokemon 10 times in the same area. This is our data:
  • Pikachu appeared 1 time.
  • Non-Pikachu appeared 9 times.
The likelihood of this data, as a function of \( \theta \), can be expressed as follows: \[ L(\theta) = \theta^1 \times (1-\theta)^9 \] That is, we multiply \( \theta \) once for the one Pikachu encounter and \( (1-\theta) \) nine times for the nine other encounters.
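A quick way to get a feel for this function is to evaluate it at a few candidate values. The sketch below is a direct transcription of the formula above. The candidate values of theta are arbitrary picks of mine.

```python
def likelihood(theta, pikachu=1, others=9):
    """L(theta) = theta^pikachu * (1 - theta)^others."""
    return theta**pikachu * (1 - theta) ** others

# Arbitrary candidate values, chosen only for illustration.
for theta in [0.05, 0.1, 0.2, 0.5]:
    print(theta, likelihood(theta))
```

Among these candidates, the likelihood is highest at 0.1 and drops quickly as theta moves towards 0.5.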

We then take the logarithm of the likelihood equation: \[ l(\theta) = \log(\theta) + 9 \log(1-\theta) \] I have more information about logarithms in this post: Decibel and Logarithms. Taking the log turns multiplication into addition and exponents into coefficients. This makes the likelihood much more convenient to calculate (especially when we need to multiply so many numbers). Because the logarithm is monotonically increasing, the \( \theta \) that maximises the log-likelihood also maximises the likelihood.
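Here is a small sketch of why the log form is convenient in practice. The first part checks that the log of the product equals the sum of logs for our 10 encounters. The second part uses a hypothetical data set 1000 times larger (my own assumption, not data from this post) to show that the raw product underflows to zero in floating point while the log version stays usable.

```python
import math

theta = 0.1

# Data from the text: 1 Pikachu, 9 others.
product = theta**1 * (1 - theta) ** 9
log_sum = 1 * math.log(theta) + 9 * math.log(1 - theta)
print(math.isclose(math.log(product), log_sum))  # True

# Hypothetical larger data set: 1000 Pikachu, 9000 others.
big_product = theta**1000 * (1 - theta) ** 9000
big_log_sum = 1000 * math.log(theta) + 9000 * math.log(1 - theta)
print(big_product)  # 0.0 (the product is too small for a float)
print(big_log_sum)  # about -3250.8, still a perfectly ordinary number
```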

To find the parameter \( \theta \) that maximises this likelihood, we take the derivative of the log-likelihood with respect to \( \theta \) and set it to zero. In the post Differentiation and Optimisation, I wrote more about differentiation calculus. \[ \frac{d}{d\theta} [\log(\theta)+9 \log(1-\theta)] = 0 \] \[ \frac{1}{\theta} + 9 [\frac{1}{1-\theta}\cdot(-1)] = 0 \] \[ \frac{1}{\theta} - \frac{9}{1-\theta} = 0 \] \[ \frac{1}{\theta} = \frac{9}{1-\theta} \] \[ 1-\theta = 9\theta \] \[ 10\theta = 1 \] \[ \theta = \frac{1}{10} = 0.1 \] That was a lengthy analytical solution for MLE. However, we can confirm that the final parameter is the same as simply calculating the probability like this: \[ P(Pikachu) = \frac{1}{10} = 0.1 \]
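The analytical result can be double-checked numerically. The sketch below does two things: it searches a grid of candidate values for the one with the highest log-likelihood, and it evaluates the derivative from the derivation at \( \theta = 0.1 \). The grid resolution of 0.001 is an arbitrary choice of mine.

```python
import math

def log_likelihood(theta, pikachu=1, others=9):
    """l(theta) = pikachu * log(theta) + others * log(1 - theta)."""
    return pikachu * math.log(theta) + others * math.log(1 - theta)

def derivative(theta, pikachu=1, others=9):
    """d/dtheta of the log-likelihood: pikachu/theta - others/(1 - theta)."""
    return pikachu / theta - others / (1 - theta)

# Candidate values 0.001 ... 0.999 (0 and 1 are excluded because log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)

print(theta_hat)        # 0.1
print(derivative(0.1))  # approximately 0: the slope vanishes at the maximum
```

The derivative is positive to the left of 0.1 and negative to the right, which is what we expect around a maximum.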

Maximum A Posteriori Estimation (MAP)

The Maximum a Posteriori (MAP) estimation is a slightly more advanced parameter estimation approach. MAP has the following properties:
  • Point estimate.
  • Uses prior information.
  • Works well with a small amount of data, but prior information "disappears" with a large amount of data.
The main difference between MLE and MAP is whether the estimation method uses prior information or not.

The general formula for the MAP estimate is below: \[ \hat{\theta}_{MAP} = \arg\max_{\theta} P(D|\theta) P(\theta) \] Similar to the MLE formula, we look for the parameter \(\theta\) that maximises the likelihood of data \(D\), but the likelihood is now multiplied by the prior information \(P(\theta)\).

Let's use another example of the Pikachu encounter rate to see how this prior interacts with \(P(D|\theta)\).

MAP Pikachu Encounter Rate

The encounter rate of Pikachu was 1 out of 10 times, or 10%. We learned this from the MLE Pikachu example. Let's use this as prior information.

Let's say that new data shows 1 out of 5 Pokemon is Pikachu. MLE would jump to a conclusion without considering the previous data: \[ P_{MLE}(Pikachu) = \frac{1}{5} = 0.2 \] How does the MAP estimation approach this?
  • Prior: We encounter Pikachu 1 out of 10 times
  • New data: We encounter Pikachu 1 out of 5 times 
Could we say that the previous rate (prior) of encountering Pikachu was \( \frac{1}{10}=0.1 \) and the current data shows the rate \( \frac{1}{5}=0.2 \), and then \( 0.1 \times 0.2=0.02 \)? This is clearly wrong. If we merge the data of the two trials, that will produce the rate \( \frac{1+1}{10+5} = \frac{2}{15} \approx 0.13 \). Simply multiplying the two rates results in a value too small to be true.
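The arithmetic in this paragraph can be checked in a few lines. I use Python's `fractions.Fraction` here only to keep the numbers exact. The variable names are my own.

```python
from fractions import Fraction

prior_rate = Fraction(1, 10)  # 1 Pikachu in 10 encounters
new_rate = Fraction(1, 5)     # 1 Pikachu in 5 encounters

naive_product = prior_rate * new_rate  # multiplying the two rates
pooled = Fraction(1 + 1, 10 + 5)       # merging the two data sets

print(naive_product, float(naive_product))  # 1/50 0.02
print(pooled, round(float(pooled), 4))      # 2/15 0.1333
```

The pooled value sits between 0.1 and 0.2, while the naive product falls below both rates.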

The MAP estimate integrates the prior smoothly using the concept of conjugate priors. A conjugate prior is a prior distribution that, when multiplied by the likelihood, gives a posterior distribution of the same family as the prior. In our Pikachu encounter example, the likelihood is binomial, so we can pick a Beta distribution as our prior. Thanks to the conjugate prior, the MAP estimate for this example would look like this: \[ Posterior (Beta) \propto Likelihood (Binomial) \times Prior (Beta) \] Note that the MAP estimate is a point estimate and the resulting value is at the peak of the posterior distribution.

We can now plug in all the values to compute the MAP estimate: \[ P(D|\theta)P(\theta) = \theta^1 \times (1-\theta)^4 \times \theta^{1-1} (1-\theta)^{9-1} \] The prior exponents include a subtraction of \(1\) because of the mathematical definition of the Beta distribution kernel: \[ x^{\alpha-1}(1-x)^{\beta-1} \] Here the prior uses \(\alpha=1\) and \(\beta=9\). Remember the laws of exponents: \[ 2^3 \times 2^2 = 2^5 \] More generally: \[ a^m \times a^n = a^{m+n} \] so the first formula can be simplified to \[ P(D|\theta)P(\theta) = \theta^{1+0} \times (1-\theta)^{4+8} = \theta^{1} \times (1-\theta)^{12} \] We already solved this kind of problem for the MLE, so let's take the log, differentiate and set the result to zero again. \[ \frac{d}{d\theta} [\log(\theta)+12 \log(1-\theta)] = 0 \] \[ \frac{1}{\theta} + 12 [\frac{1}{1-\theta}\cdot(-1)] = 0 \] \[ \frac{1}{\theta} - \frac{12}{1-\theta} = 0 \] \[ \frac{1}{\theta} = \frac{12}{1-\theta} \] \[ 1-\theta = 12\theta \] \[ 13\theta = 1 \] \[ \theta = \frac{1}{13} = 0.0769... \]
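As with the MLE, the result can be verified numerically. The sketch below adds the log-likelihood of the new data (1 Pikachu, 4 others) to the log of the Beta kernel with \(\alpha=1\) and \(\beta=9\), and then searches a grid for the maximum. The function name `log_posterior` and the grid resolution are my own illustrative choices, and the posterior is only computed up to a constant, which does not change where the peak is.

```python
import math

def log_posterior(theta, pikachu=1, others=4, alpha=1, beta=9):
    """Unnormalised log-posterior: log-likelihood plus log of the Beta kernel."""
    log_lik = pikachu * math.log(theta) + others * math.log(1 - theta)
    log_prior = (alpha - 1) * math.log(theta) + (beta - 1) * math.log(1 - theta)
    return log_lik + log_prior

# Candidate values 0.0001 ... 0.9999 (0 and 1 excluded because log(0) is undefined).
grid = [i / 10000 for i in range(1, 10000)]
theta_map = max(grid, key=log_posterior)

print(round(theta_map, 4))  # 0.0769, the grid point closest to 1/13
```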

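Similar to MLE, there exists a "shortcut" formula for the MAP: \[ \hat{\theta}_{MAP} = \frac{\alpha-1}{\alpha+\beta-2} \] where \(\alpha\) and \(\beta\) are the parameters of the posterior Beta distribution. In our example:
  • \(\alpha=2\): the prior \(\alpha=1\) plus the 1 time we encountered Pikachu in the new data
  • \(\beta=13\): the prior \(\beta=9\) plus the 4 times we encountered Non-Pikachu in the new data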
\[ \frac{2-1}{2+13-2} = \frac{1}{13} \]
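The shortcut is easy to wrap in a small function, which also lets us see the earlier claim that prior information "disappears" with a large amount of data. In the sketch below, `map_estimate` is a hypothetical helper of mine, and the scaled-up data sets (the same 1-in-5 rate with 10, 100 and 1000 times more encounters) are made-up numbers for illustration only.

```python
def map_estimate(pikachu, others, prior_alpha=1, prior_beta=9):
    """Mode of the Beta posterior: (alpha - 1) / (alpha + beta - 2)."""
    alpha = prior_alpha + pikachu  # posterior alpha
    beta = prior_beta + others     # posterior beta
    return (alpha - 1) / (alpha + beta - 2)

print(map_estimate(1, 4))  # 1/13, about 0.0769

# Hypothetical larger data sets with the same 1-in-5 Pikachu rate.
for scale in [1, 10, 100, 1000]:
    pikachu, others = 1 * scale, 4 * scale
    mle = pikachu / (pikachu + others)
    print(pikachu + others, mle, map_estimate(pikachu, others))
```

With 5 encounters the prior pulls the estimate well below 0.2, but with 5000 encounters the MAP estimate is almost identical to the MLE of 0.2.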

Summary

This post explored two properties of parameter estimation: point vs density estimate and the use of prior knowledge. Using our Pikachu encounter example, we demonstrated how MLE provides a point estimate while MAP integrates prior beliefs to refine that estimate.

We haven't seen a density estimate yet. Bayesian estimation produces a full density instead of a single point value. I will dive into Bayesian estimation in the next post.
