
Introduction To Gradient Descent algorithm and its variants

Gradient descent algorithm (source: Imad Dabbura)
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function (commonly called a loss/cost function in machine learning and deep learning). To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If, instead, one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent.

Before going through the variants of gradient descent, let’s first understand how the gradient descent algorithm works. For this, let’s take the example of a logistic regression model. For convenience, suppose that our model has only two parameters: one weight w and one bias b. The algorithm can then be written as:

i. Initialize w and b with random guesses (e.g. w=1, b=1).

ii. Select a value for the learning rate α. The learning rate is a factor that determines how big or small the step should be on each iteration. If the learning rate is too small, the algorithm will take a long time to converge and thus be computationally expensive. Similarly, if the learning rate is too large, the algorithm will overshoot the minimum and fail to converge.

iii. The data should be normalized to a suitable scale if it varies widely; this decreases the chances of error. There are various techniques of normalization. For a detailed introduction you can visit this page.

iv. Suppose the cost/loss function to be minimized is J(w,b) (for a brief overview of loss/cost functions, visit here). On each iteration, we take the partial derivatives of the cost function J(w,b) with respect to the parameters w and b:
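dw = ∂J(w,b)/∂w
db = ∂J(w,b)/∂b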


v. We then update the previous weight w and bias b as shown below:
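w = w − α · dw
b = b − α · db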


vi. We continue this process until the cost function converges.
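Putting steps i–vi together, here is a minimal sketch in Python of the procedure for our two-parameter logistic regression model. The data arrays x and y, the learning rate value, and the fixed iteration count (used here in place of a true convergence check) are all assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    w, b = 1.0, 1.0                     # step i: initial guesses
    for _ in range(n_iters):            # step vi: repeat until convergence
        y_hat = sigmoid(w * x + b)      # model prediction
        dw = np.mean((y_hat - y) * x)   # step iv: dJ(w,b)/dw (cross-entropy cost)
        db = np.mean(y_hat - y)         # step iv: dJ(w,b)/db
        w -= alpha * dw                 # step v: update w
        b -= alpha * db                 # step v: update b
    return w, b
```

Here alpha plays the role of the learning rate α from step ii, and x is assumed to be already normalized as in step iii. Calling gradient_descent(x, y) on NumPy arrays x and y returns the fitted parameters.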


Gradient Descent algorithm

Now let’s briefly discuss the variants of the gradient descent algorithm.

Batch Gradient Descent

It is also simply called gradient descent. In this algorithm, we use all of our training examples (the whole data set) in each iteration when we update the parameters.

Update equation:
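w = w − α · ∂J(w,b)/∂w
b = b − α · ∂J(w,b)/∂b

where the partial derivatives are computed over all training examples.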

Advantages:

i. It gives an unbiased (in fact, exact) estimate of the gradient, so the updates are less noisy and there is a lower chance of error.
ii. We can use a fixed learning rate throughout training.

Disadvantages:

i. It takes a long time to converge if we have a large data set.
ii. It is computationally expensive, since every single update requires a pass over the entire data set.

Mini-batch Gradient Descent

In this algorithm, instead of going through the entire data set at once, we perform the gradient descent update on several mini-batches. So even if we have a large number of training examples, they are processed in batches of a certain size (the batch size). We divide our data set into several mini-batches, say n batches, each with a certain batch size.

The batch size is a value we can tune. Generally it is chosen as a power of 2 (e.g. 32, 64, 128), because some hardware, such as GPUs, achieves better runtime with batch sizes that are powers of 2.
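As an illustration, here is a minimal sketch of how a data set can be shuffled and split into such mini-batches; the NumPy arrays X and y and the helper name make_mini_batches are hypothetical:

```python
import numpy as np

def make_mini_batches(X, y, batch_size=32):
    """Shuffle the data set and yield it in mini-batches of `batch_size`."""
    indices = np.random.permutation(len(X))       # shuffle before batching
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]
```

Each (X[batch], y[batch]) pair is then fed through the same update step as in the sketch above.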

Update equation:
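w = w − α · ∂J(w,b)/∂w
b = b − α · ∂J(w,b)/∂b

where the partial derivatives are computed on the current mini-batch only.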

Advantages

i. It is faster than batch gradient descent, since the parameters are updated more frequently.
ii. Random selection of examples helps avoid processing examples that are very similar and do not contribute much to the learning.

Disadvantages

i. Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.
ii. Error information must be accumulated across the examples in each mini-batch, as in batch gradient descent.


Stochastic Gradient Descent

Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset. In other words, it can be seen as mini-batch gradient descent with a batch size of 1, so the number of batches equals the number of training examples.

Update equation:
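w = w − α · ∂J(w,b)/∂w
b = b − α · ∂J(w,b)/∂b

where the partial derivatives are computed on a single training example.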

Advantages:

i. It is simple to understand and implement, especially for beginners.
ii. It increases the frequency of model updates, which can result in faster learning on some problems.

Disadvantages:

i. Updating the model so frequently is computationally more expensive than the above variants.
ii. The frequent updates produce a noisy gradient signal, which may cause the model parameters, and in turn the model error, to jump around.
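Below is a minimal sketch of SGD for the same hypothetical logistic regression setup as above, performing one update per training example:

```python
import numpy as np

def sgd(x, y, alpha=0.1, n_epochs=10):
    """Stochastic gradient descent for the one-weight logistic model."""
    w, b = 1.0, 1.0
    for _ in range(n_epochs):
        # visit the training examples one at a time, in random order
        for i in np.random.permutation(len(x)):
            y_hat = 1.0 / (1.0 + np.exp(-(w * x[i] + b)))
            # gradient of the cross-entropy loss on a single example
            w -= alpha * (y_hat - y[i]) * x[i]
            b -= alpha * (y_hat - y[i])
    return w, b
```

Shuffling the visit order each epoch (the np.random.permutation call) is what makes the gradient signal stochastic rather than cyclic.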
