Differentiation and optimisation

Introduction

Differential calculus is a core part of machine learning optimisation. At school, differentiation is often taught as finding the slope of a curve. While the theoretical concept is important, I feel that the connection between differentiation as taught at school and differentiation as used in machine learning optimisation is rarely made explicit.

This post aims to bridge the gap between school math and machine learning optimisation, focusing on how the mathematical concept (differentiation) is essential for the practical algorithm (gradient descent) to support machine learning. The post begins with a very basic concept of slope finding.

Finding a slope between two points

The general formula to find a slope \(a\) of a function \( y = ax \) is: \[ a = \frac{y_2 - y_1}{x_2-x_1} \] The figure below illustrates rise (vertical) and run (horizontal) changes between two points, Point 1 \((x = 0, y=0)\)  and Point 2 \((x=2, y=4)\). 
The slope of this function is \( a = \frac{4-0}{2-0} = \frac{4}{2} = 2 \). The function is \( y = 2x \).

Computing the slope, therefore, means computing rise over run between two points. It's all good up to this point.
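The rise-over-run computation above can be sketched in a few lines of Python (the `slope` helper is just an illustrative name, not from any library):

```python
def slope(x1, y1, x2, y2):
    """Rise over run between Point 1 (x1, y1) and Point 2 (x2, y2)."""
    return (y2 - y1) / (x2 - x1)

# Point 1 (0, 0) and Point 2 (2, 4) from the figure.
print(slope(0, 0, 2, 4))  # 2.0
```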

Finding a slope for the parabolic function

We're good with a linear function. What about finding the slope of a parabolic function? Intuitively, the slope changes for different pairs of points.
In the figure above, we have four points: \((x = 2, y=4)\), \((x = 2.5, y=6.25)\), \((x = 3, y=9)\) and \((x = 4, y=16)\). If we set our base point at Point 1 and pair it with Point 2, Point 3 and Point 4 individually, each slope takes a different value: \[ a_{point2} = \frac{6.25-4}{2.5-2} = \frac{2.25}{0.5} = 4.5 \] \[ a_{point3} = \frac{9-4}{3-2} = \frac{5}{1} = 5 \] \[ a_{point4} = \frac{16-4}{4-2} = \frac{12}{2} = 6 \] This confirms that the slopes between Point 1 and Point 2, Point 1 and Point 3, and Point 1 and Point 4 are all different.
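The three slopes above can be reproduced numerically with a short sketch (plain Python, with the points hard-coded from the figure):

```python
def f(x):
    return x ** 2  # the parabolic function y = x^2

base_x, base_y = 2, f(2)  # Point 1 (2, 4)

# Rise over run from Point 1 to Points 2, 3 and 4.
slopes = [(f(x) - base_y) / (x - base_x) for x in (2.5, 3, 4)]
print(slopes)  # [4.5, 5.0, 6.0]
```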

So, unlike a linear function, the slope of a parabolic function varies depending on which two points are used to calculate it. Can we find a single general function that gives the slope of a parabolic function at any point? This is what differential calculus is for. A derivative is a function that gives the slope at any point.

Let's generalise what we did with Point 1 to Point 4. Define Point 1 as \((x, f(x))\), and define the horizontal change as \(h\). We can then compute rise over run to find a slope: \[ \frac{f(x+h) - f(x)}{(x+h)-x} = \frac{f(x+h) - f(x)}{h} \] This looks abstract, so let's substitute concrete values: \(f(x) = 4\) (the y coordinate of Point 1), \(f(x+h) = 9\) (the y coordinate of Point 3), and the horizontal change between Point 1 and Point 3, \(h=1\): \[ \frac{f(x+h) - f(x)}{h} = \frac{9-4}{1} = 5 \] We get the same value as \(a_{point3}\), so this is indeed the general formula for the slope between two points.

At school, in the differential calculus lesson, we were told to let the horizontal change \(h\) move closer and closer to zero. Eventually, we can calculate the slope of the parabolic function at a single point.

The horizontal change must not be 0, but "closer to 0". This is because the denominator \(h\) cannot be 0: division by 0 is undefined. So, in theory, we compute the slope between two points that are extremely close together. In practice, such a value is the slope at a single point of the function. This is called the instantaneous rate of change.
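We can watch this happen numerically by shrinking \(h\) towards 0 at the base point \(x=2\) (a quick plain-Python sketch):

```python
def f(x):
    return x ** 2

x = 2
for h in (1, 0.5, 0.1, 0.001, 0.0001):
    # rise over run between (x, f(x)) and (x + h, f(x + h))
    print(h, (f(x + h) - f(x)) / h)
# the quotient approaches 4 as h shrinks, never letting h reach 0
```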

Visualisation of the instantaneous rate of change is presented below:
The base point is still Point 1 \((x=2, y=4)\). The figure draws lines from the base point to \(x=2.0001\), \(x=2.1\), \(x=3\) and \(x=4\). It shows that the slopes change more and more mildly as \(h\) gets closer to 0. The slopes of the orange and yellow lines are still visibly different, but the green and black lines are almost identical, indicating that as \(h\) shrinks towards Point 1, the slope looks as if it were computed at the single point, Point 1.

Formally, the derivative of the parabolic function is as follows: \[ y = f(x) = x^2 \] \[\begin{aligned} \frac{dy}{dx} &= \lim_{h \to 0}\frac{(x+h)^2-x^2}{h} \\ &= \lim_{h \to 0}\frac{x^2+2xh+h^2-x^2}{h} \\ &= \lim_{h \to 0}\frac{2xh+h^2}{h} \\ &= \lim_{h \to 0}(2x+h) \\ &= 2x \end{aligned}\] The derivative tells us that a slope at any point of \(x\) for the function \( y = x^2 \) is \(2x\). 

We can check if this \( y' = 2x \) is true with the figure below:
When \( x = 5 \) (red dotted line), its slope is 10: \[ y' = 2 \times 5 = 10\]
When \( x = -3 \) (green dotted line), its slope is -6: \[ y' = 2 \times -3 = -6 \]
When \( x = 0 \) (purple dotted line), its slope is 0: \[ y' = 2 \times 0 = 0 \]

Indeed, the derivative of \( y = x^2 \) is \( y' = 2x \), and it gives the slope at any point of the parabolic function.
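The three checks above can also be verified numerically, using a very small \(h\) to approximate the instantaneous rate of change (the helper name `numeric_slope` is my own):

```python
def f(x):
    return x ** 2

def numeric_slope(x, h=1e-6):
    # rise over run with a tiny h approximates the instantaneous rate of change
    return (f(x + h) - f(x)) / h

for x in (5, -3, 0):
    print(x, numeric_slope(x), 2 * x)  # the two values nearly match
```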

Why is the derivative important for optimisation?

Up to this point, this post has just been a school maths lesson. In this section, I'll describe the relationship between finding a derivative (slope finding) and machine learning optimisation.

Machine learning often uses a loss function* for model optimisation. A loss function measures the difference between model predictions and the labels in the training data. Here is a simple example: if a machine learning model predicts \(1\) and the corresponding label is \(3\), the error is \(3-1=2\).

Let's define a very simple linear model with a single weight: \[ y = wx \] where \( w \) is a model coefficient, or weight. Let's also create training examples: \((1, 1), (2, 2), (3, 3)\). With these training examples, the correct weight of the model is \( w = 1 \). If the weight is \( w = 2 \) or \( w = -2 \), the model makes prediction errors. The two wrong weights are visualised below:
The model with \(w = 1\) correctly predicts the three given data points. One wrong model \(w = 2\) goes a little higher than the correct model. The other wrong model \(w=-2\) is further apart from correct predictions.

Having this in mind, let's define a loss (error) function with respect to the weight \(w\). Here, the Mean Squared Error (MSE) is the loss function: \[ L(w) = \frac{1}{N} \sum_{i=1}^{N} (wx_i - y_i)^2  \] where \(w\) is the model weight, \(x_i\) is the \(i\)th data point, \(y_i\) is the \(i\)th target and \(N\) is the number of training examples, a.k.a. data points.

The loss when \(w=2\) is: \[ L(w=2) = \frac{1}{3} [(2-1)^2 + (4-2)^2 + (6-3)^2] = \frac{14}{3} \approx 4.67 \] and the loss when \(w=-2\) is: \[ L(w=-2) = \frac{1}{3} [(-2-1)^2 + (-4-2)^2 + (-6-3)^2] = \frac{9+36+81}{3} = \frac{126}{3} = 42 \] Good news: we can now see the relationship between differential calculus and machine learning optimisation.
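Both losses can be recomputed with a short Python sketch (the `mse` helper and the hard-coded data points simply mirror the example above):

```python
xs = [1, 2, 3]  # training inputs
ys = [1, 2, 3]  # training targets

def mse(w):
    # Mean Squared Error of the one-weight model y = w * x
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(mse(2))   # 14/3 ≈ 4.67
print(mse(-2))  # 126/3 = 42.0
```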

As \(w\) moves away from the ideal model weight \(w=1\), the error value grows quadratically. Plotting the MSE loss gives a parabolic function. Yes, the parabolic function again!
Instead of \(y\) and \(x\), we have error values on the vertical axis and weight values on the horizontal axis.

We can think about two things:
  • The derivative of this function tells us the slope at any point of the function.
  • The model weight \(w\) is a perfect fit when the error value is 0, i.e., at the bottom of this function.
Differentiating the Mean Squared Error with respect to the weight \(w\) involves the Chain Rule. Let's focus on a single data point: \[ L = (wx-y)^2 \] Temporarily replace \((wx-y)\) with \(u\): \[ L = u^2 \] Then, we apply the Chain Rule with respect to the weight: \[ \frac{dL}{dw} = \frac{dL}{du} \frac{du}{dw} \] The outer derivative is: \[ \frac{dL}{du} = 2u = 2(wx-y) \] The inner derivative is: \[ \frac{du}{dw} = \frac{d}{dw}(wx-y) = x \] Combining the outer and inner derivatives gives \( 2(wx-y)x \) for a single data point. Averaging over all \(N\) data points: \[ L'(w) = \frac{2}{N} \sum_{i=1}^{N} (wx_i-y_i)x_i \]
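The derivative \(L'(w)\) is easy to evaluate for our three training examples (a plain-Python sketch, no ML library assumed; the `grad` name is my own):

```python
xs = [1, 2, 3]  # training inputs
ys = [1, 2, 3]  # training targets

def grad(w):
    # L'(w) = (2/N) * sum((w * x_i - y_i) * x_i)
    return 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

for w in (-2, 1, 2, 4):
    print(w, grad(w))  # -28.0, 0.0, 9.33..., 28.0
```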
The figure above shows the slopes at four weights. The black dots correspond to the previous weights \(w=-2\) and \(w=2\). The blue dots are new ones at \(w=1\) and \(w=4\). We can see that \(w=4\) and \(w=-2\) have the steepest slopes, \(28\) and \(-28\) respectively. The slope at \(w=2\) is milder, with a value of \(9.33\). When the slope is 0 at \(w=1\), the error \(L(w)\) is also 0. The slopes descend towards the optimal weight.

A summary of this post:
  • The derivative of a parabolic function gives the slope at any point of the function.
  • The loss function for machine learning optimisation can be expressed as a parabolic function.
  • The point where the slope = 0 is often the best weight for a machine learning model, because the error is at its minimum there (here, 0).
  • And that's why finding the point where the slope = 0 through differential calculus is important for machine learning.
What we saw were slopes descending towards the optimal weight, and the most common optimisation algorithm for machine learning is indeed called "gradient descent". Typically, machine learning models have multiple weights in the form of vectors or matrices, so the optimisation process keeps track of multiple slopes. Multiple slopes for weight vectors or matrices are referred to as "gradients".
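As a minimal gradient-descent sketch on our one-weight model (the learning rate and step count are assumed values I chose for illustration, not from the post):

```python
xs = [1, 2, 3]  # training inputs
ys = [1, 2, 3]  # training targets

def grad(w):
    # derivative of the MSE loss with respect to w
    return 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 4.0    # deliberately start from a wrong weight
lr = 0.05  # learning rate (an assumed value)
for _ in range(100):
    w = w - lr * grad(w)  # step against the slope

print(round(w, 6))  # approaches the optimal weight w = 1
```

Each step moves \(w\) against the sign of the slope, so the weight literally descends the parabola towards the point where the slope is 0.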

Hopefully, this post has made the connection between differential calculus and machine learning optimisation clear.

Note

*I know that unsupervised learning like clustering does not use a loss function. But most of the ML problems we work on are deep neural networks anyway.
