Gradient descent is an optimization method used in neural networks, where the weight parameters $\boldsymbol{w}$ are updated iteratively by subtracting a small fraction $\alpha$ of the gradient of the loss function, $\nabla \mathcal{L}$, in order to minimize the loss.

# Mathematics

Let’s prove that each gradient descent step leads to a smaller loss. Without loss of generality, assume the initial weight parameter $\boldsymbol{w}_0$ is a two-dimensional vector $(x_0, y_0)$.

Given a new vector $(x, y)$ that is close to $(x_0, y_0)$, the loss function $\mathcal{L}$ can be approximated by its Taylor series expansion truncated after the first-order partial derivatives:
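The expansion referenced here appears to have been omitted; a reconstruction from the surrounding argument, written in terms of the gradient so the dot product mentioned below is explicit, is:

$$\mathcal{L}(x, y) \approx \mathcal{L}(x_0, y_0) + \frac{\partial \mathcal{L}}{\partial x}(x - x_0) + \frac{\partial \mathcal{L}}{\partial y}(y - y_0) = \mathcal{L}(x_0, y_0) + \nabla \mathcal{L}(x_0, y_0) \cdot (x - x_0,\; y - y_0)$$

with the partial derivatives evaluated at $(x_0, y_0)$.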

We would like to find $x$ and $y$ that minimize $\mathcal{L}(x, y)$, which is equivalent to minimizing the dot product above, subject to the constraint that $(x, y)$ stays within a small Euclidean distance of $(x_0, y_0)$ so that the Taylor approximation remains valid.

To achieve this, we select the vector $(x, y)$ such that:
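The omitted update rule can be reconstructed from the surrounding argument: step in the direction opposite the gradient,

$$(x, y) = (x_0, y_0) - \alpha \, \nabla \mathcal{L}(x_0, y_0), \qquad \alpha > 0,$$

so the dot product becomes $\nabla \mathcal{L} \cdot \left(-\alpha \nabla \mathcal{L}\right) = -\alpha \left\lVert \nabla \mathcal{L} \right\rVert^2 \le 0$.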

Given an $\epsilon$ bounding that distance, we choose $\alpha$ small enough that the constraint is satisfied. The negative sign ensures that the dot product is non-positive, so the first-order term never increases the loss. We have therefore shown that $\mathcal{L}(x, y)$ descends at each iteration, provided the loss function is differentiable and a sufficiently small $\alpha$ is used.

# Example

Let’s say we have a training set of four examples that maps three binary features to a binary response.
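The original table is not shown, so the data below is a representative stand-in (an assumption) with the one property the text relies on: the first feature column exactly matches the response.

```python
import numpy as np

# Hypothetical training set (assumed values; the original table is not shown):
# four examples, three binary features each, and a binary response.
# Note that the first feature column equals the response exactly.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0, 0, 1, 1]]).T
```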

We first observe that the first feature is perfectly correlated with the response, so it alone could reasonably be used for future predictions. Now we construct a neural network to see whether it can capture this relationship. First, create a neural network class and randomly initialize a weight between $-1$ and $1$ for each of the three features.
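A minimal sketch of such a class (the class and attribute names are assumptions, since the original code is not shown):

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, seed=1):
        rng = np.random.default_rng(seed)
        # one weight per feature, drawn uniformly from [-1, 1)
        self.weights = rng.uniform(-1, 1, size=(3, 1))
```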

Define the sigmoid activation function $\sigma$:
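The standard definition, $\sigma(z) = 1 / (1 + e^{-z})$, which squashes any real number into $(0, 1)$:

```python
import numpy as np

def sigmoid(z):
    # maps any real-valued input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
```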

Define the loss function as the mean square error:
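Mean squared error averages the squared differences between predictions and targets:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # mean of the squared residuals over all training examples
    return np.mean((y_true - y_pred) ** 2)
```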

Calculate the gradient w.r.t. weights $w$:
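With predictions $\hat{y} = \sigma(Xw)$, the chain rule gives $\nabla_w \mathcal{L} = \frac{2}{n} X^\top \left[(\hat{y} - y) \odot \hat{y} \odot (1 - \hat{y})\right]$, using the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. A sketch of this computation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, w):
    # chain rule for MSE through the sigmoid:
    # dL/dw = (2/n) * X^T [(y_hat - y) * y_hat * (1 - y_hat)]
    y_hat = sigmoid(X @ w)
    n = X.shape[0]
    return (2.0 / n) * X.T @ ((y_hat - y) * y_hat * (1.0 - y_hat))
```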

Run forward and backward propagation.
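A self-contained training loop putting the pieces together (the data and learning rate are assumed values, as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical data: the first feature column equals the response
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([[0.0, 0.0, 1.0, 1.0]]).T

rng = np.random.default_rng(1)
w = rng.uniform(-1, 1, size=(3, 1))
alpha = 1.0  # learning rate (an assumed value)

for _ in range(10_000):
    y_hat = sigmoid(X @ w)                       # forward pass
    # backward pass: gradient of the MSE through the sigmoid
    grad = (2.0 / len(X)) * X.T @ ((y_hat - y) * y_hat * (1.0 - y_hat))
    w -= alpha * grad                            # gradient descent update
```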

Testing our initial hypothesis.
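To check the hypothesis, we can train as above (same assumed data and settings) and then inspect the learned weights and in-sample predictions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# same hypothetical data and training loop as before
X = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([[0.0, 0.0, 1.0, 1.0]]).T

rng = np.random.default_rng(1)
w = rng.uniform(-1, 1, size=(3, 1))
for _ in range(10_000):
    y_hat = sigmoid(X @ w)
    w -= (2.0 / len(X)) * X.T @ ((y_hat - y) * y_hat * (1.0 - y_hat))

# the first weight should dominate, since the first feature tracks the response
print("weights:    ", w.ravel())
print("predictions:", sigmoid(X @ w).ravel().round(2))
```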

We can see that, after $10,000$ iterations, the neural network learns to put substantial weight on the first feature and makes very accurate in-sample predictions.