
How to perform Feature Engineering in Machine Learning?

In this tutorial, you are going to learn:

 

1. What is Feature Engineering?

2. How to download the dataset?

3. How to explore the dataset?

4. How to check for null values for effective feature engineering?

5. How to handle numerical columns with null values?

6. How to handle categorical columns with null values?

7. How to fill null values in the dataset?

8. How to remove unwanted columns?

9. How to perform data normalization?

10. How to convert categorical columns to numerical?

11. How to split the dataset into training and testing?

12. How to implement a LightGBM model for classification?

 

Feature Engineering

 
Feature engineering directly influences the result of a model. A good amount of time spent on feature engineering can work wonders. It involves a few steps that can be followed to obtain the desired results. These steps are broadly classified as follows.

1. Cleaning the data.

2. Processing the data.

3. Normalizing the data.

4. Model implementation and prediction.
 

1. Import the Libraries

 
The first step is to import all the necessary libraries.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import seaborn as sns

2. Download Dataset

 

We are going to download the Titanic dataset. The dataset has 12 columns, of which “Survived” is our target column.

!wget https://raw.githubusercontent.com/mananparasher/PML-Machine-Learning-Datasets/master/titanic_dataset.csv
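If wget is not available in your environment, pandas can also read the CSV directly from the URL (a minimal alternative sketch using the same file):

import pandas as pd

# read_csv accepts a URL, so the separate download step can be skipped
url = "https://raw.githubusercontent.com/mananparasher/PML-Machine-Learning-Datasets/master/titanic_dataset.csv"
df = pd.read_csv(url)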

Once we have the Pandas DataFrame, we can use built-in methods such as:

head( ) : returns the top five rows of the DataFrame.

df=pd.read_csv("titanic_dataset.csv")
df.head(5)

3. Explore the Dataset

 

The info( ) method gives us information about the columns in our DataFrame, their data types, and the total memory consumption of the dataset. The describe( ) method summarizes basic statistics for the numerical columns.

df.info()
df.describe()

Let’s look at the data distribution for some of the columns in our DataFrame.

sns.pairplot(df[["Fare", "Pclass", "Survived"]], diag_kind="kde")

4. Checking for Null Values

 

The info( ) method also gives us the count of non-null values in each column. By comparing these counts against the total number of rows, we can see whether our dataset has null values.

df.info()
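A more direct check (a small sketch using standard pandas methods) is to count the null values per column explicitly:

# isnull() flags missing entries; sum() counts them per column
df.isnull().sum()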

4.1 Handling Numerical Columns

 

There are a lot of ways to handle null values in numerical columns, mainly:

1. Drop the rows with null values.

2. Fill the values with mean, median, or mode.

3. Use a machine learning algorithm like Linear Regression to fill the values.

4. Use an advanced approach like K-Nearest Neighbors to fill the missing values.

5. Use an imputer to fill the missing values (see the sketch after this list).
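As an illustration of option 5, here is a minimal sketch using scikit-learn’s SimpleImputer (the mean strategy is chosen only for illustration; median and most_frequent are also available):

from sklearn.impute import SimpleImputer

# Replace missing "Age" values with the column mean
imputer = SimpleImputer(strategy="mean")
df[['Age']] = imputer.fit_transform(df[['Age']])

In this tutorial, we simply fill “Age” with its mean and drop the sparse “Cabin” column: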

df['Age']=df['Age'].fillna(df['Age'].mean())
df=df.drop(columns=['Cabin'])
df.head(5)

1. We have filled the missing values in the “Age” column with the column mean.

2. Since the “Cabin” column is not important, we have removed it.

4.2 Handling Categorical Columns

 

To handle missing values in categorical columns, we have fewer options, such as:

1. Drop the rows with missing values.

2. Populate the value using the most occurring value.

3. Impute the values using KNN imputer.

4. Fill the value using “Others”. 

4.2.1 Filling Categorical Column Null Values

 

We are going to fill the missing values in the “Embarked” column with “Others”. We will use the fillna( ) method for this.

df['Embarked']=df['Embarked'].fillna('Others')
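Alternatively, option 2 from the list above would fill the gaps with the most frequent value (a sketch, not applied in this pipeline):

# mode() returns the most frequent value(s); take the first
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])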

5. Removing Unwanted Columns

 

Once we check every column, we can conclude that a few columns are unnecessary: “Name”, “Ticket”, and “PassengerId” are almost entirely unique values, so they carry little predictive signal. Removing unwanted columns helps in achieving better accuracy.

print("Name Column",len(df['Name'].unique()))
print("Ticket Column",len(df['Ticket'].unique()))
print("PassengerID Column",len(df['PassengerId'].unique()))
df=df.drop(columns=['Name','Ticket','PassengerId'])
df.head(5)

We have dropped “Name”, “Ticket” and “PassengerId” columns from our dataset.

drop columns in pandas dataframe

6. Data Normalization

 

After data cleaning and pre-processing, the next major step in feature engineering is data normalization. We cannot feed the raw values directly to our model, because features on very different scales can prevent the model from converging.

6.1 Converting Categorical Columns

 

Categorical columns have to be converted into numerical columns. There are several methods for this:

1. LabelEncoder( ) : encodes target labels with integers from 0 to n_classes-1.

2. OrdinalEncoder( ) : encodes categorical features as an integer array.

3. OneHotEncoder( ) : encodes categorical features as a one-hot numeric array.

In this example, we are going to use OrdinalEncoder.

ordinalencoder=OrdinalEncoder()
df[['Embarked','Sex']]=ordinalencoder.fit_transform(df[['Embarked','Sex']])
df[['Embarked','Sex']]
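If we had preferred one-hot encoding instead (useful when the categories have no natural order), a quick sketch with pandas looks like this (get_dummies is a pandas alternative to scikit-learn’s OneHotEncoder):

# Each category becomes its own 0/1 indicator column
df_onehot = pd.get_dummies(df, columns=['Embarked', 'Sex'])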

6.2 Normalizing Numerical Columns

 

Numerical columns have to be normalized. The common options are:

1. StandardScaler( ) : standardizes each value as (x - mean) / std, computed per column.

2. MinMaxScaler( ) : transforms the values to a given range (by default [0, 1]).

In this example, we are going to use StandardScaler.

standardscaler=StandardScaler()
df[['Pclass','Age','SibSp','Parch','Fare']]=standardscaler.fit_transform(df[['Pclass','Age','SibSp','Parch','Fare']])
df[['Pclass','Age','SibSp','Parch','Fare']]
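For comparison (a minimal sketch, not used in the rest of this tutorial), the same columns could be scaled to the [0, 1] range with MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

# Rescales each column to [0, 1] based on its observed min and max
minmaxscaler = MinMaxScaler()
scaled = minmaxscaler.fit_transform(df[['Pclass','Age','SibSp','Parch','Fare']])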

7. Splitting Data

 

Once the data is normalized, we need to split it into training and testing datasets.

y=df.pop('Survived')
X_train, X_val, y_train, y_val = train_test_split(df, y, test_size=0.05, random_state=42)
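When the target classes are imbalanced, a stratified split keeps the class proportions the same in both sets. This is a minor variation on the line above (the stratify argument is standard scikit-learn, though it is not used in the rest of this tutorial):

# stratify=y preserves the Survived class ratio in both splits
X_train, X_val, y_train, y_val = train_test_split(df, y, test_size=0.05, random_state=42, stratify=y)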

8. Model Implementation 

 

We are going to implement a LightGBM classifier for this example. The advantage of using a LightGBM model is that we can specify the categorical features with the “categorical_feature” parameter.

In this way, the model doesn’t treat the encoded categorical features as ordinary numbers.

lgbmclassifier = LGBMClassifier()
lgbmclassifier.fit(X_train,y_train,categorical_feature=['Embarked','Sex'])
predictions=lgbmclassifier.predict(X_val)
predictions
print(classification_report(y_val,predictions))

9. Results

We achieve an accuracy of 87% with this model on the validation set.

By following these feature engineering steps, we can achieve really good accuracy.

Summary

 

1. wget : To download the data.

2. info( ) : To check for null values for effective feature engineering.

3. pairplot( ) : To visualize the data distribution of the dataset.

4. mean( ) : To calculate mean of a column.

5. fillna( ) : To fill the missing values in the dataset.

6. LabelEncoder( ) : To encode target labels with integers from 0 to n_classes-1.

7. OrdinalEncoder( ) : To encode categorical features as an integer array.

8. OneHotEncoder( ) : To encode categorical features as a one-hot numeric array.

9. StandardScaler( ) : To standardize values as (x - mean) / std.

10. MinMaxScaler( ) : To scale values to a given range.

11. LGBMClassifier( ) : To implement a LightGBM Classifier model.

You can find the GitHub link here.

Comments

1. Thanks for the post. However, I don’t think you are allowed to normalize the whole dataset before splitting into train and test. You can only apply (not learn) the transformation during testing. Or am I missing something here?

    1. Hi Janne,

      We can do it both ways.

If we want to use an encoder after splitting the data, we just need to ensure we are training the encoder on the training dataset and applying it to both the train and test datasets.
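For example (a minimal sketch; the key point is to fit on the training split only and then transform both splits):

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn the scaling parameters on train only
X_val_scaled = scaler.transform(X_val)          # apply them to validation without re-fitting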
