Easy Explanation of Normalization/Standardization in Machine Learning (using Python)

What is Normalization or Standardization?

Normalization or standardization is the process of rescaling original data without changing its underlying behavior or nature. We define a new range (most commonly [0, 1] or [-1, 1]) and convert the data accordingly. This technique is useful in classification algorithms involving neural networks or distance-based algorithms (e.g., KNN, k-means).

Some normalization techniques are:

a)     Min-Max Normalization/standardization: It performs a linear transformation on the original data. Let (X1, X2) be the min and max boundary of an attribute and (Y1, Y2) be the new scale to which we are normalizing; then for a value Vi of the attribute, the normalized value Ui is given as,
Ui = (Vi − X1) / (X2 − X1) × (Y2 − Y1) + Y1
     Min-max normalization preserves the relationships among the original data values. If a future input value falls outside the original (X1, X2) range, its transformed value falls outside (Y1, Y2); this is known as an "out-of-bounds error."

     Let's see an example: suppose the minimum and maximum values for the price of a house are $125,000 and $925,000, respectively, and we need to normalize that price range to [0, 1]. We can use min-max normalization to transform any value between them (say, $300,000). In this case we use the above formula with,
Vi=300,000
X1= 125,000
X2= 925,000
Y1= 0
Y2= 1
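
Substituting these values into the formula:
Ui = (300,000 − 125,000) / (925,000 − 125,000) × (1 − 0) + 0 = 175,000 / 800,000 = 0.21875
So a price of $300,000 maps to about 0.219 on the new scale.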

In Python:
Here is an example that scales a toy data matrix to the [0, 1] range:
from sklearn import preprocessing
import numpy as np

# Toy data: three samples with three features each
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# Scale each feature (column) to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)
Output:
[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]
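
As noted above, values outside the fitted range fall outside [0, 1] at transform time. scikit-learn's MinMaxScaler does not actually raise an exception in this case; it simply extrapolates (you can pass clip=True to clip results into the range instead). A small sketch continuing the example above:

# A test sample with values beyond the fitted min/max of each column
X_test = np.array([[3., -2., 4.]])
print(min_max_scaler.transform(X_test))
# [[ 1.5        -0.5         1.66666667]]  -- outside [0, 1]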


b)     Z-score Normalization/standardization (zero-mean normalization/standardization): In this technique, the values are normalized based on the mean and standard deviation of attribute A. For a value Vi of attribute A, the normalized value Ui is given as,
Ui = (Vi − Avg(A)) / Std(A)
  where Avg(A) and Std(A) represent the average and standard deviation of the values of attribute A, respectively.

Let's see an example: suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, an income value of $73,600 is transformed to (73,600 − 54,000) / 16,000 = 1.225.


In Python:
from sklearn.preprocessing import StandardScaler

# Two samples with seven features each
X = [[101, 105, 222, 333, 225, 334, 556],
     [105, 105, 258, 354, 221, 334, 556]]

print("Before standardization X values are ", X)

# Standardize each feature (column) to zero mean and unit variance
sc_X = StandardScaler()
X = sc_X.fit_transform(X)

print("After standardization X values are ", X)
Output:
Before standardization X values are
[[101, 105, 222, 333, 225, 334, 556],
 [105, 105, 258, 354, 221, 334, 556]]
After standardization X values are
[[-1.  0. -1. -1.  1.  0.  0.]
 [ 1.  0.  1.  1. -1.  0.  0.]]
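
Note how the constant columns (105, 334, 556) come out as 0: when a feature's standard deviation is zero, StandardScaler leaves its scale at 1, so the mean-centered values are simply 0 rather than causing a division-by-zero error.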



c)     Decimal Scaling Normalization/standardization: In this method, we normalize a given value by moving its decimal point. The number of places to move is determined by the maximum absolute value in the data set. For a value Vi of attribute A, the normalized value Ui is given as,
Ui = Vi / 10^j
where j is the smallest integer such that max(|Ui|) < 1.

     Let's understand it with an example: suppose we have a data set whose values range from -9900 to 9877. Here the maximum absolute value is 9900, so to perform decimal scaling we divide each value in the data set by 10,000, i.e., j = 4 (the smallest power of 10 that exceeds 9900). This maps -9900 to -0.99 and 9877 to 0.9877.
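
scikit-learn has no built-in transformer for decimal scaling, but it is easy to sketch with NumPy (the decimal_scale helper below is our own, not a library function):

import numpy as np

def decimal_scale(values):
    # Decimal-scaling normalization: divide by 10^j, where j is the
    # smallest integer such that max(|Ui|) < 1
    values = np.asarray(values, dtype=float)
    max_abs = np.abs(values).max()
    j = int(np.floor(np.log10(max_abs))) + 1  # smallest j with max_abs / 10**j < 1
    return values / 10 ** j, j

data = np.array([-9900, 9877, 48, -765])
scaled, j = decimal_scale(data)
print(j)       # 4
print(scaled)  # [-0.99    0.9877  0.0048 -0.0765]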



Why is normalization important?

Let's understand it with an example. Suppose we are building a predictive model on a data set containing the net worth of citizens of some country. In such a data set there is huge variation across values: features measured in dollars can be several orders of magnitude larger than other features. If we feed this raw data to train a model, the large-scale features dominate and may produce undesirable results. To get rid of that, we opt for normalization.
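
Here is a toy sketch (the values are made up) showing how an unscaled feature dominates Euclidean distance, which is what distance-based models like KNN and k-means rely on:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: each row is (age in years, net worth in dollars)
X = np.array([[25.0, 1_000_000.0],
              [60.0, 1_020_000.0],
              [40.0, 1_050_000.0]])

# On raw data, the distance is dominated by net worth; the 35-year
# age gap between the first two people is practically invisible
print(np.linalg.norm(X[0] - X[1]))    # ~20000.0

# After min-max scaling to [0, 1], both features contribute comparably
Xs = MinMaxScaler().fit_transform(X)
print(np.linalg.norm(Xs[0] - Xs[1]))  # ~1.08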

I have written this article with reference to the book Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei. You can download this book for free here.

Also read: Understanding KNN with examples

