Data pre-processing is a data mining technique which is used to transform raw data into a useful format.
Steps Involved in Data Pre-processing:
1. Data Cleaning
“The idea of imputation is both seductive and dangerous” (R.J.A Little & D.B. Rubin)
One of the most common problems I have faced in exploratory analysis is handling missing values. I feel that there is NO good way to deal with missing data. There are loads of different approaches to data imputation depending on the kind of problem (time series analysis, ML, regression, etc.), and choosing between them is difficult. So, let’s explore the most commonly used methods and try to find some solutions that fit our needs.
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves handling of missing data, noisy data etc.
- Missing Data
This situation arises when some data is missing in our dataset. Before jumping to the various methods of handling missing data, we have to understand the reason why data goes missing.
- Data Goes Missing Randomly :
Data is missing at random when the reason a data point is missing is not related to the values in the dataset.
- Data Goes Missing not Randomly :
There are two possible cases here: the missingness depends on the missing value itself, or it depends on the value of some other variable. As an example of the first case, people with high salaries are often reluctant to reveal their incomes in surveys; we may suspect that the missing value was actually quite large and fill it with some hypothetical value. As an example of the second case, suppose women are less likely than men to reveal their age in a survey; here, the missing values in the age column depend on the gender column.
If data goes missing randomly, it is safe to remove the tuples that contain missing values, while in the other case removing observations with missing values can bias the model. So we have to be quite careful when removing tuples.
P.S. – Data imputation does not guarantee better results.
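Before choosing a method, it can help to check whether missingness in one column appears to depend on another column, as in the age and gender example above. The following is a minimal sketch using pandas; the `survey` DataFrame and its values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented for this illustration.
survey = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "age": [np.nan, 31, np.nan, 45, 28, np.nan],
})

# Number of missing values in each column.
missing_per_column = survey.isna().sum()

# Share of missing ages within each gender group. A large gap between
# groups hints that the data is not missing at random.
missing_age_by_gender = survey["age"].isna().groupby(survey["gender"]).mean()

print(missing_per_column)
print(missing_age_by_gender)
```

A gap between groups is only a hint, not proof, since missingness that depends on the unobserved value itself cannot be detected this way.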
- Dropping Observations
Tuple deletion removes all data for an observation that has one or more missing values. If the missing data is limited to a small number of observations, you may just opt to eliminate those cases from the dataset. However, in most cases, it can produce bias in the analysis because we can never be totally sure that the data has gone missing randomly.
mydata.dropna(inplace=True)
- Dropping Variables
Keeping data is generally a better choice than discarding it. Sometimes you can drop variables (columns) if the data for that column is missing in more than 60% of rows, but only if that column is insignificant. Still, dropping tuples is generally preferred over dropping columns.
# Either of the following removes the column (use one, not both)
del mydata['column_name']
mydata.drop('column_name', axis=1, inplace=True)
- Fill the missing values
There are various ways to do this task. You can choose to fill the missing values manually, by using mean, mode or median.
Utilising the overall mean, median or mode is a very straight-forward imputation method. It is quite fast to perform, but has clear disadvantages, one of them being that mean imputation reduces variance in the dataset.
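The variance reduction mentioned above can be seen on a tiny made-up series. This is only an illustrative sketch; the numbers are invented and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up series with two missing entries.
s = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 10.0])

# Fill the gaps with the mean of the observed values.
filled = s.fillna(s.mean())

# pandas skips NaN when computing the variance of the original series.
var_before = s.var()
var_after = filled.var()

print(var_before, var_after)
```

The filled-in values sit exactly on the mean, so they add observations without adding spread, which pulls the sample variance down.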
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
values = mydata.values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# strategy can be changed to 'median' or 'most_frequent'
- Regression:
Data can be made smooth by fitting it into a regression function. The regression used can be linear (having one independent variable) or multiple (having multiple independent variables).
To start, the most significant variables are identified using a correlation matrix. They are used as independent variables in a regression equation. The dependent variable is the one which has missing values. Tuples with complete data are used to fit the regression equation; the equation is then used to predict the missing values of the dependent variable.
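The steps above can be sketched as follows with scikit-learn's `LinearRegression`. The DataFrame, its column names and its values are assumptions made up for illustration, and the single predictor is simply taken as given rather than selected from a correlation matrix.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "salary" has missing values, "experience" is complete.
df = pd.DataFrame({
    "experience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "salary": [30.0, 35.0, np.nan, 45.0, 50.0, np.nan],
})

# Split rows into those with and without the dependent variable.
complete = df[df["salary"].notna()]
missing = df[df["salary"].isna()]

# Fit the regression on complete rows only.
model = LinearRegression().fit(complete[["experience"]], complete["salary"])

# Predict the missing values and write them back.
df.loc[missing.index, "salary"] = model.predict(missing[["experience"]])

print(df)
```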
Regression imputation provides good estimates for missing values. However, it has several disadvantages that tend to overshadow the advantages. First, because the inserted values are predicted from other variables, they fit together too well, so standard errors are biased downward. Second, we assume a linear relationship between the variables used in the regression equation when there may not be one.
- KNN (K Nearest Neighbours)
In this method, k neighbours are chosen based on some distance measure, and their average is used as a hypothetical value to fill in the missing data. KNN can predict both discrete values (the most frequent value among the k nearest neighbours) and continuous values (the mean among the k nearest neighbours).
Different distance measures are used depending on the type of data:
- Continuous Data: The most commonly used distance measures are Euclidean, Manhattan and Cosine.
- Categorical Data: Hamming distance is generally used for categorical imputation. It iterates through all the categorical attributes and for each, counts one if the value is not the same between two tuples for that variable. The number of attributes for which the value was different is considered as the Hamming distance.
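The Hamming distance described above can be written in a few lines. This is a minimal sketch; the function name `hamming_distance` and the example tuples are hypothetical.

```python
def hamming_distance(row_a, row_b):
    """Count the attributes on which two equally long tuples differ."""
    if len(row_a) != len(row_b):
        raise ValueError("Both tuples must have the same number of attributes")
    # Add one for every position where the categorical values differ.
    return sum(a != b for a, b in zip(row_a, row_b))


# Two made-up records with three categorical attributes each.
first = ("red", "small", "round")
second = ("red", "large", "square")
print(hamming_distance(first, second))
```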
One of the drawbacks of the KNN algorithm is that it becomes time-consuming on large datasets, because it searches the entire dataset for similar instances. Moreover, with high-dimensional data, KNN’s accuracy can degrade severely, because there tends to be little difference between the nearest and farthest neighbours in high-dimensional spaces.
from fancyimpute import KNN
# Use the 5 nearest rows which have a feature to fill in each row's missing features
# (recent fancyimpute releases use fit_transform; older ones used .complete)
knnOutput = KNN(k=5).fit_transform(mydata)
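As an alternative to fancyimpute, scikit-learn ships its own `KNNImputer`. The sketch below swaps in that class, so it is not the library used above; the small array is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: the second row is missing its second feature.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Fill each missing entry with the mean of that feature over the
# 2 nearest rows (distance is computed on the features both rows have).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Here the two rows nearest to the incomplete one are the first and third, so the gap is filled with the mean of their second feature.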
- Imputation of Categorical Variables
- Mode imputation is one method, but it can introduce bias
- Missing values can be treated as a separate category (level) of their own
- We can use predictive models like logistic regression to estimate values that can substitute for the missing data. We can divide our data set into two separate parts: one part with no missing values for the variable (training data) and the other with missing values for the variable under consideration (test data).
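The first two options above can be sketched with pandas. The `colors` series is made up, and the label "Missing" is an arbitrary choice for the extra category.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with two missing entries.
colors = pd.Series(["red", "blue", np.nan, "red", np.nan], name="color")

# Option 1: fill with the most frequent value (the mode).
mode_filled = colors.fillna(colors.mode()[0])

# Option 2: treat missing values as their own category.
category_filled = colors.fillna("Missing")

print(mode_filled.tolist())
print(category_filled.tolist())
```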
- Multiple Imputation: Fill the missing entries of the incomplete data set m times (say, m=3) using values from that data set only. This gives us m different completed data sets. Analyse each of these m data sets separately, and then combine the m analysis results to get a final result.
Multiple imputation is often the preferred method for imputation because:
- It is easy to use
- It introduces little or no bias, provided the imputation model is appropriate
Among all the methods discussed above, multiple imputation and KNN are widely used, and multiple imputation, being simpler, is generally preferred.
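One way to sketch the multiple imputation procedure (impute m times, analyse each completed data set, combine the results) is with scikit-learn's experimental `IterativeImputer`, drawing imputations with `sample_posterior=True`. This is an assumption about tooling, not the method of any specific package mentioned here; the synthetic data and the choice of "mean of the third column" as the analysis are purely illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: the third column depends on the first two.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Knock out 10 values in the third column.
X_missing = X.copy()
X_missing[rng.choice(50, size=10, replace=False), 2] = np.nan

m = 3  # number of imputed data sets
estimates = []
for seed in range(m):
    # sample_posterior=True draws a different plausible fill per seed.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(X_missing)
    # "Analysis" step: here simply the mean of the imputed column.
    estimates.append(completed[:, 2].mean())

# Combine the m analysis results into one final estimate.
pooled_mean = float(np.mean(estimates))
print(estimates, pooled_mean)
```

In a real analysis the per-data-set step would be a model fit, and the pooling would also combine the variance within and between the m results.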
2. Data Transformation
It is done in order to transform the data into a form suitable for the mining process. This involves the following:
a. Label Encoding / OneHot Encoding
Choosing the right encoding method plays a major role in your prediction model.
We often need to convert text features in our dataset to a numeric representation. The two most common ways to do this are Label Encoder and OneHot Encoder. However, not everyone is aware of the impact their choice of encoding has on their ML model; the accuracy of the model can shift considerably when the right encoding is used in the right scenario.
Label Encoder
Label Encoding in Python can be performed using Sklearn Library which gives us a tool for encoding categorical features into numeric values. LabelEncoder encodes labels with a value between 0 and n-1 (where n is the number of distinct categories present in a column). If a category / label repeats, it is assigned the same value which was assigned to it earlier in the column.
Consider below example:
If we have to pass this data to some ML model, we need to encode the Country column to its numeric representation by using Label Encoder. After fitting our data into Label Encoder we will see something like this:
The categorical values have been converted into numeric values.
That’s all label encoding is about. But label encoding introduces a new problem. We have used Label Encoder on the country column to convert the country names into numerical data. Since there are different numbers in the same column, the model may interpret the data as having some kind of order, 0 < 1 < 2 < 3. In reality, there is no such relation between the countries.
The model may derive a correlation such as ‘as the country number increases, the population also increases’, which clearly does not hold in this data set or in any other. To overcome this issue, we use OneHot Encoder.
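A minimal sketch of label encoding with scikit-learn's `LabelEncoder` is shown here. The list of countries is a made-up stand-in for the Country column; note that the encoder assigns integers in sorted order of the labels, not in order of appearance.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical Country column; "Japan" appears twice.
countries = ["Japan", "U.S", "India", "China", "Japan"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ lists the labels in the order of their assigned integers.
print(list(encoder.classes_))
print(list(codes))
```

Both occurrences of "Japan" receive the same code, and the resulting integers carry an order that has no real meaning for countries.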
One Hot Encoder
To avoid the above stated problem, where we start confusing our model into thinking that there is some order or relation present, we have to ‘OneHotEncode’ that particular column.
One Hot Encoding takes a column which has been label encoded and splits it into multiple columns. The numbers are replaced by 1s and 0s, depending on which row had what value originally. In our example, we’ll get four new columns, one for each country: Japan, U.S, India, and China.
For rows where the country column value was Japan, the new ‘Japan’ column will have a ‘1’ and the other three columns will have ‘0’s. Similarly, for rows where the value was U.S, the ‘U.S’ column will have a ‘1’ and the other three columns will have ‘0’s, and so on.
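The same idea can be sketched with `pandas.get_dummies`, which one-hot encodes a column of strings directly. The DataFrame below is a made-up stand-in for the example data.

```python
import pandas as pd

# Hypothetical Country column with the four countries from the example.
df = pd.DataFrame({"Country": ["Japan", "U.S", "India", "China", "Japan"]})

# One new 0/1 column per distinct country.
one_hot = pd.get_dummies(df["Country"], dtype=int)

print(one_hot)
```

Each row has exactly one 1, in the column of its original country.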
Feature Scaling
Feature Scaling is a technique to standardize / normalize the data to a common scale. It is used to handle highly varying magnitudes or units. Without scaling, many machine learning algorithms treat larger numbers as more important, regardless of their units.
For example, an algorithm comparing raw numbers would treat 3000 (meters) as greater than 5 (km), although 5 km is actually the larger distance, and could therefore give wrong predictions. To tackle this issue, we use Feature Scaling to bring all values to the same range of magnitudes.
The two most important techniques for feature scaling:
- Min-Max Normalization: This technique re-scales a feature to the range 0 to 1.
- Standardization: This technique re-scales a feature so that it has a mean of 0 and a variance of 1. Unlike min-max normalization, the result is not confined to a fixed range.
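Both techniques are available in scikit-learn as `MinMaxScaler` and `StandardScaler`. The following sketch applies them to a made-up column of distances in meters.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature (distances in meters), one row per sample.
distances_m = np.array([[3000.0], [5000.0], [1000.0], [7000.0]])

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1].
min_max = MinMaxScaler().fit_transform(distances_m)

# Standardization: (x - mean) / std, giving mean 0 and variance 1.
standardized = StandardScaler().fit_transform(distances_m)

print(min_max.ravel())
print(standardized.ravel())
```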
Train Test Split
sklearn.model_selection.train_test_split is a quick utility that can be used to split arrays or matrices (input data) into random train and test subsets.
train_test_split(*arrays, **options) accepts a few parameters:
*arrays : lists, numpy arrays, scipy-sparse matrices or pandas dataframes can be given as inputs but they should have same length or same shape (for axis=0)
test_size : If a float, it represents the proportion of the dataset to include in the test split and should be between 0.0 and 1.0; if an int, it is the absolute number of test samples. If neither test_size nor train_size is given, it is set to 0.25; if only train_size is given, it is set to the complement of train_size.
train_size : If a float, it represents the proportion of the dataset to include in the train split and should be between 0.0 and 1.0; if an int, it is the absolute number of train samples. If it is not given, it is set to the complement of test_size (0.75 when neither is provided).
random_state : Here, you can supply a seed for the random number generator so that the split is the same every time you run your notebook.
shuffle : By default, the dataset is shuffled before it is split into train and test subsets; you can turn this off by passing shuffle=False.
returns : A list of length 2 * len(arrays) containing the train and test split of each input, in order (train first, then test). Each output has the same type as the corresponding input.
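A short usage sketch of the parameters described above is given here. The arrays `X` and `y` are made up, and `test_size=0.3` and `random_state=42` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up inputs: 10 samples with 2 features each, plus 10 targets.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two inputs give four outputs: train and test parts of X, then of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With 10 samples and `test_size=0.3`, 3 samples go to the test subset and 7 to the train subset, and rows of `X` stay aligned with their targets in `y`.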
Attribute Selection
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Data Reduction
Data mining often deals with huge amounts of data, and analysing such volumes can become hard. To ease this, we use data reduction techniques, which aim to reduce data storage and analysis costs.
The various steps to perform data reduction are:
- Data Cube Aggregation : Aggregation operation is applied to data for construction of the data cube.
- Numerosity Reduction : This enables us to store the model of data instead of whole data, for example: Regression Models.
- Dimensionality Reduction :This reduces the size of data by using encoding mechanisms. It can be lossy or lossless. If after reconstruction from compressed data, original data can be retrieved, such reduction is called lossless reduction else it is called lossy reduction. The two effective methods of dimensionality reductio