# Understanding Gradient Descent in Neural Networks

`Gradient Descent` is an optimization method used in neural networks, where the weight parameters $w$ are updated iteratively by subtracting a small fraction (the learning rate) of the gradient of the loss function, $\nabla L(w)$, in order to minimize the loss.

# Mathematics

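Let's show that the gradient descent method leads to a smaller loss after each step. Without loss of generality, we assume that the initial weight parameter is a two-dimensional vector, $w^0 = (w_1^0, w_2^0)$.

Given a new vector $w = (w_1, w_2)$ that is close to $w^0$, the Taylor series expansion of the loss function $L$ can be approximated using only the first-order partial derivatives:

$$L(w) \approx L(w^0) + \frac{\partial L(w^0)}{\partial w_1}(w_1 - w_1^0) + \frac{\partial L(w^0)}{\partial w_2}(w_2 - w_2^0) = L(w^0) + \nabla L(w^0) \cdot (w - w^0)$$

We would like to find $w_1$ and $w_2$ that minimize $L(w)$, which is the same as minimizing the dot product above, subject to the constraint that $w$ stays within a small Euclidean distance $\epsilon$ of $w^0$ so that the Taylor approximation holds.

To achieve this, we select the vector $w$ such that:

$$w - w^0 = -\eta \nabla L(w^0)$$

Given an $\epsilon$, we choose the learning rate $\eta > 0$ such that the above constraint is satisfied. The negative sign ensures that the dot product is minimized: with this choice it equals $-\eta \lVert \nabla L(w^0) \rVert^2 \le 0$, and for a fixed step length no other direction makes it smaller. We have therefore shown that $L$ will descend at each iteration, provided that the loss function is differentiable and a sufficiently small $\eta$ is used.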
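The argument above can be checked numerically. The following is a minimal sketch, assuming a hypothetical quadratic loss $L(w) = (w_1 - 3)^2 + 2(w_2 + 1)^2$ and a learning rate of 0.1 (neither appears in the derivation); it takes a few gradient descent steps and records the loss after each one.

```python
import numpy as np

def loss(w):
    # Hypothetical quadratic loss, chosen only for illustration.
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def grad(w):
    # Analytic gradient of the loss above.
    return np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])

def descend(w0, eta=0.1, steps=20):
    # Repeatedly apply w <- w - eta * grad(w), recording the loss.
    w = np.asarray(w0, dtype=float)
    history = [loss(w)]
    for _ in range(steps):
        w = w - eta * grad(w)
        history.append(loss(w))
    return w, history

w_final, history = descend([0.0, 0.0])
```

For this particular loss, every entry of `history` is smaller than the one before it. With a much larger step such as `eta=1.5`, the first-order approximation no longer holds and the loss grows instead.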

# Example

Let's say we have a training set of 4 examples that maps three binary features to a binary response.

We first observe that the first feature is perfectly correlated with the response and can reasonably be used for future predictions. Now we construct a neural network to see if it can capture this relationship. First, create a neural network class and randomly initialize three weights, one for each feature, drawn from a bounded interval.

```python
import numpy as np
```
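A minimal sketch of such a class follows. The class name `NeuralNetwork`, the attribute `weights`, the `seed` argument, and the interval $[-1, 1)$ are assumptions made for illustration, not details taken from the original listing.

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, n_features=3, seed=1):
        # Seeded generator so the random initial weights are reproducible.
        rng = np.random.default_rng(seed)
        # One weight per feature, drawn uniformly from [-1, 1) (assumed range).
        self.weights = rng.uniform(-1.0, 1.0, size=(n_features, 1))

net = NeuralNetwork()
```

Storing the weights as a column vector of shape `(3, 1)` lets a batch of inputs of shape `(n, 3)` be multiplied by the weights in a single matrix product.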

Define the `sigmoid` activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$:

```python
def sigmoid(self, x):
    return 1 / (1 + np.exp(-x))
```

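Define the loss function as the mean squared error over the $n$ training examples, where $\hat{y}_i$ is the prediction for example $i$:

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$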
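As a sketch, the mean squared error can be computed as below. The function name `mse_loss` and the sample arrays are hypothetical.

```python
import numpy as np

def mse_loss(y, y_hat):
    # Mean of the squared differences between targets and predictions.
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

example = mse_loss([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.2])
```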

Calculate the gradient of the loss with respect to the weights $w$:

```python
def gradient(self, x, y, y_hat):
```
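One way this gradient could be written is sketched below, assuming predictions of the form $\hat{y} = \sigma(Xw)$ and a mean squared error averaged over all samples. The standalone function names and the sample data are hypothetical. A central finite-difference check is included to confirm that the analytic formula agrees with a numerical estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(w, x, y):
    return np.mean((y - sigmoid(x @ w)) ** 2)

def gradient(x, y, y_hat):
    # Chain rule: dL/dw = -(2/n) * X^T [(y - y_hat) * y_hat * (1 - y_hat)]
    return -2.0 * x.T @ ((y - y_hat) * y_hat * (1.0 - y_hat)) / y.size

def numerical_gradient(w, x, y, eps=1e-6):
    # Central differences, perturbing one weight at a time.
    g = np.zeros_like(w)
    for i in range(w.shape[0]):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (mse_loss(w + step, x, y) - mse_loss(w - step, x, y)) / (2 * eps)
    return g

# Hypothetical data: 4 samples, 3 binary features, binary response.
x = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
w = np.array([[0.2], [-0.4], [0.1]])

analytic = gradient(x, y, sigmoid(x @ w))
numeric = numerical_gradient(w, x, y)
```

The factor $\hat{y}(1 - \hat{y})$ is the derivative of the sigmoid, which is why only the prediction `y_hat` is needed and not the pre-activation value.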

Forward and backward propagation:

```python
def forward_propogation(self, x):
```
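The following is a sketch of what the forward and backward passes might look like. The method name `backward_propagation`, the `learning_rate` parameter, and the zero initial weights are assumptions; only `forward_propogation` (spelled as in the source), `sigmoid`, and `gradient` are names that appear in the original listing.

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, n_features=3):
        # Zero weights keep this sketch deterministic (an assumption;
        # the text initializes them randomly).
        self.weights = np.zeros((n_features, 1))

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def gradient(self, x, y, y_hat):
        # Gradient of the mean squared error for y_hat = sigmoid(x @ weights).
        return -2.0 * x.T @ ((y - y_hat) * y_hat * (1.0 - y_hat)) / y.size

    def forward_propogation(self, x):
        # Forward pass: weighted sum of the inputs, then sigmoid.
        return self.sigmoid(x @ self.weights)

    def backward_propagation(self, x, y, learning_rate=1.0):
        # Backward pass: one gradient descent update of the weights.
        y_hat = self.forward_propogation(x)
        self.weights = self.weights - learning_rate * self.gradient(x, y, y_hat)
        return np.mean((y - y_hat) ** 2)  # loss measured before the update

# Hypothetical data in which the response equals the first feature.
x = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

net = NeuralNetwork()
loss_before = net.backward_propagation(x, y)
loss_after = np.mean((y - net.forward_propogation(x)) ** 2)
```

After this single update only the first weight has moved away from zero, because on this data the contributions of the other two features to the gradient cancel out.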

Testing our initial hypothesis.

```python
if __name__ == "__main__":
```

```
random synoptic weights
```

We can see that the neural network learns to put a substantial weight on the first feature and makes very accurate in-sample predictions after the training iterations.
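To make the experiment concrete, here is a minimal end-to-end sketch. The four training rows, the seed, the learning rate of 1.0, and the 10,000 iterations are all assumptions for illustration and are not taken from the original run, so the printed weights will differ from the author's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(x, y, iterations=10000, learning_rate=1.0, seed=1):
    # Random initial weights in [-1, 1), one per feature (assumed range).
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=(x.shape[1], 1))
    for _ in range(iterations):
        y_hat = sigmoid(x @ w)
        # Gradient of the mean squared error with respect to the weights.
        grad = -2.0 * x.T @ ((y - y_hat) * y_hat * (1.0 - y_hat)) / y.size
        w = w - learning_rate * grad
    return w

# Hypothetical training set: the response equals the first feature.
x = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

weights = train(x, y)
predictions = sigmoid(x @ weights)
print("trained weights:", weights.ravel())
print("in-sample predictions:", predictions.ravel())
```

On this data the first weight ends up with the largest magnitude, and the in-sample predictions land close to the 0/1 targets.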