
Loss Functions | Cost Functions in Machine Learning



Every machine learning algorithm (model) learns by optimizing a loss function (also called an error or cost function). A loss function evaluates how accurate a given prediction is: the farther the prediction deviates from the actual (true) value, the higher the numeric value the loss function returns. For a model to produce good predictions, its predictions must deviate little from the actual values, i.e. have low loss. Optimization techniques such as the gradient descent algorithm are used to reduce the loss of our predictions.

There are several loss functions in machine learning. Now, the question arises: can we use any loss function in our machine learning algorithm? The answer is no. If we pick a loss function at random, we may face problems in computing the loss, and we might introduce errors if the loss function is sensitive to outliers. So it is important to understand a loss function before using it to measure the error of our predictions. The choice should be governed by several factors, such as the algorithm being used, the ease of evaluating the function and its derivative, the presence of outliers in the data, and so on.

Depending on the type of task, i.e. classification or regression, loss functions can be divided into two groups: classification loss functions and regression loss functions (strictly speaking, there is no such formal division; we make it here only for ease of understanding).

In classification we predict the class or label of a supplied tuple (set of features) on the basis of the dataset used for modeling. That is, a categorical value (e.g. male or female, dead or alive) is predicted.

In regression we predict a continuous value for a given set of features on the basis of the dataset used for modeling.

·       Classification Loss Functions


Some of the classification loss functions are:

1.     Hinge loss/SVM loss


Hinge loss is a loss function used for training classifier models in machine learning. More precisely, it is used by maximum-margin classification algorithms such as the SVM.

Let T be the target output such that T ∈ {−1, +1}, and let Y be the classifier score. Then the hinge loss for the prediction is given as,
L(Y) = max(0, 1 − T · Y)

It should be noted that Y is not a class label but the raw numeric output of the classifier's decision function.

For example, for a linear SVM, Y = W · X + b, where (W, b) are the weights and bias that parameterize the hyperplane and X is the feature vector to classify.

Interpretation of hinge loss:

We can see that if T and Y have the same sign (i.e. the example is classified correctly) and |Y| ≥ 1, then the loss L(Y) = 0; the classification is confidently correct. On the other hand, the loss L(Y) increases linearly when T and Y have opposite signs (wrong classification), and it is also positive when they have the same sign but |Y| < 1 (a correct but low-margin prediction, called a margin error).
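To make this interpretation concrete, here is a minimal NumPy sketch (the function name hinge_loss and the sample scores are illustrative, not from any particular library):

```python
import numpy as np

def hinge_loss(t, y):
    """Hinge loss for targets t in {-1, +1} and raw classifier scores y."""
    return np.maximum(0, 1 - t * y)

# Correct sign with margin |Y| >= 1: zero loss
print(hinge_loss(np.array([1, -1]), np.array([2.0, -1.5])))  # [0. 0.]
# Correct sign but inside the margin (|Y| < 1): small positive loss
print(hinge_loss(np.array([1]), np.array([0.4])))            # [0.6]
# Wrong sign: loss grows linearly with the score
print(hinge_loss(np.array([1]), np.array([-2.0])))           # [3.]
```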


2.     Cross-entropy Loss/ Negative Log Likelihood


Cross-entropy loss (negative log likelihood) is a loss function that measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. A perfect model would have a log loss of 0, while a high value indicates a large error in our predictive model.

The general mathematical expression for cross-entropy loss is,

L = − Σ_{c=1}^{M} Y_{o,c} · log(P_{o,c})

where M = the total number of classes to be classified, e.g. if the class labels are cat, dog, and rat, then M = 3;

Y_{o,c} = a binary indicator (1 or 0) of whether class label c is the correct classification for observation o;

P_{o,c} = the predicted probability that observation o belongs to class c.
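As an illustration, here is a small NumPy sketch of this formula for a single observation (the names cross_entropy, y_true, and p_pred are our own; clipping by eps is a common trick to avoid log(0)):

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy for one observation.

    y_true : one-hot vector over the M classes (1 for the true class).
    p_pred : predicted probabilities over the M classes (sum to 1).
    """
    p_pred = np.clip(p_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(p_pred))

# Three classes: cat, dog, rat (M = 3); the true class is "dog".
y_true = np.array([0, 1, 0])
print(cross_entropy(y_true, np.array([0.1, 0.8, 0.1])))  # ~0.223 (confident and correct)
print(cross_entropy(y_true, np.array([0.7, 0.2, 0.1])))  # ~1.609 (diverges from the label)
```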


·       Regression Loss Functions


Some of the regression loss functions are:

1.     Mean Square Error/Quadratic Loss/L2 Loss (MSE):


It is given as the average of the squared differences between the actual values and the values predicted by the regression model.

Mathematically it is given as,
MSE = (1/n) · Σ_{i=1}^{n} (T_i − Y_i)²

where T_i is the true (actual) value and Y_i is the predicted value.

MSE is typically optimized using the gradient descent algorithm. Because it squares the differences, it is more sensitive to outliers than MAE.
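A minimal NumPy sketch of the formula (the helper name mse and the sample arrays are illustrative):

```python
import numpy as np

def mse(t, y):
    """Mean squared error between true values t and predictions y."""
    return np.mean((t - y) ** 2)

t = np.array([3.0, 5.0, 2.5])
y = np.array([2.5, 5.0, 4.0])
print(mse(t, y))  # (0.25 + 0 + 2.25) / 3 = 0.8333...
```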


2.     Mean Absolute Error/L1 Loss (MAE):


It is given as the average of the absolute differences between the actual values and the values predicted by the regression model.

Mathematically it is given as,
MAE = (1/n) · Σ_{i=1}^{n} |T_i − Y_i|

where T_i is the true (actual) value and Y_i is the predicted value.

MAE is also optimized using gradient descent, though the absolute value is not differentiable at zero, so in practice a subgradient is used at that point.
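The analogous sketch for MAE (again, the helper name and sample values are just for illustration); note that the same outlier-free data gives a smaller penalty than MSE for the large residual:

```python
import numpy as np

def mae(t, y):
    """Mean absolute error between true values t and predictions y."""
    return np.mean(np.abs(t - y))

t = np.array([3.0, 5.0, 2.5])
y = np.array([2.5, 5.0, 4.0])
print(mae(t, y))  # (0.5 + 0 + 1.5) / 3 = 0.6667...
```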


3.     Huber loss:

This loss function is commonly used for regression problems, particularly when the data contains outliers, because it behaves like MSE for small errors and like MAE for large ones.

Mathematically Huber loss is given as,
L_δ(T, Y) = ½ · (T − Y)²              for |T − Y| ≤ δ
L_δ(T, Y) = δ · (|T − Y| − ½ · δ)     otherwise
where δ is a tunable threshold that controls the point at which the loss switches from quadratic to linear behavior.
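A small NumPy sketch of this piecewise definition (the helper name huber and the default δ = 1.0 are our own choices):

```python
import numpy as np

def huber(t, y, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(t - y)
    return np.mean(np.where(r <= delta,
                            0.5 * r ** 2,              # MSE-like region
                            delta * (r - 0.5 * delta)))  # MAE-like region

t = np.array([3.0, 5.0, 2.5])
y = np.array([2.5, 5.0, 7.5])   # the last prediction is an outlier
print(huber(t, y, delta=1.0))   # ~1.542: the outlier is penalized only linearly
```

A smaller δ makes the loss behave more like MAE (robust to outliers); a larger δ makes it behave more like MSE.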

