In this tutorial, you are going to learn:
1. What is Feature Engineering?
2. How to download the dataset?
3. How to explore the dataset?
4. How to check for null values for effective feature engineering?
5. How to handle numerical columns with null values?
6. How to handle categorical columns with null values?
7. How to fill null values in the dataset?
8. How to remove unwanted columns?
9. How to perform data normalization?
10. How to convert categorical columns to numerical?
11. How to split the dataset into training and testing?
12. How to implement a LightGBM model for classification?
Feature Engineering
1. Cleaning the data.
2. Processing the data.
3. Normalizing the data.
4. Model implementation and prediction.
1. Import the Libraries
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import seaborn as sns
2. Download Dataset
We are going to download the Titanic dataset. The dataset has 12 columns, of which “Survived” is our target column.
!wget https://raw.githubusercontent.com/mananparasher/PML-Machine-Learning-Datasets/master/titanic_dataset.csv
Once we have the Pandas DataFrame, we can use inbuilt methods such as:
head(): returns the first rows of the DataFrame, five by default.
df=pd.read_csv("titanic_dataset.csv")
df.head(5)

3. Explore the Dataset
The info() method gives us information about the columns in our DataFrame: their names, data types, non-null counts, and the total memory consumption of the dataset.
df.info()

The describe() method gives summary statistics (count, mean, standard deviation, min, quartiles, max) for the numerical columns.
df.describe()

Let’s see the data distribution for some of the columns in our DataFrame. The pairplot() below plots pairwise relationships, with kernel density estimates on the diagonal (diag_kind="kde").
sns.pairplot(df[["Fare", "Pclass", "Survived"]], diag_kind="kde")

4. Checking for Null Values
The info() method also gives us the count of non-null values in each column; null values are not included in this count. Comparing each count against the total number of rows shows which columns contain null values.
df.info()

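If you want an explicit per-column count of the missing values, pandas also offers isnull(); a minimal one-liner:
# Count the null values in every column of the DataFrame
df.isnull().sum()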
4.1 Handling Numerical Columns
There are several ways to handle null values in numerical columns, mainly:
1. Drop the rows with null values.
2. Fill the values with the mean, median, or mode.
3. Use a machine learning algorithm like linear regression to predict the missing values.
4. Use an advanced approach like K-Nearest Neighbors to fill the missing values.
5. Use an imputer to fill the missing values (see the SimpleImputer sketch below).
df['Age']=df['Age'].fillna(df['Age'].mean())
df=df.drop(columns=['Cabin'])
df.head(5)

1. We have filled the missing values in the “Age” column with the column’s mean value.
2. Since the “Cabin” column is mostly empty and not important for this model, we have removed it.
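Option 5 from the list above can be done with scikit-learn’s SimpleImputer, which learns the fill value with fit() so it can be reused on new data with transform(). A minimal sketch of the same mean fill (not needed here, since fillna() already did the job):
from sklearn.impute import SimpleImputer

# strategy can be 'mean', 'median', or 'most_frequent'
imputer = SimpleImputer(strategy='mean')
df[['Age']] = imputer.fit_transform(df[['Age']])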
4.2 Handling Categorical Columns
To handle null values in categorical columns, we have fewer options, such as:
1. Drop the rows with missing values.
2. Fill the values with the most frequently occurring value (a pandas one-liner for this follows this list).
3. Impute the values using a KNN imputer.
4. Fill the values with a placeholder category such as “Others”.
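For reference, option 2, filling with the most frequent value, is a one-liner in pandas; a minimal sketch (not used in this tutorial, where we fill with “Others” instead):
# mode() returns a Series; take its first entry as the fill value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])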
4.2.1 Filling Categorical Column Null Values
We are going to fill the missing values in the “Embarked” column with “Others”. We will use the fillna() method for this.
df['Embarked']=df['Embarked'].fillna('Others')
5. Removing Unwanted Columns
Once we check every column, we can conclude that a few columns are unnecessary: “Name”, “Ticket”, and “PassengerId” are unique (or nearly unique) for every row, so they act as row identifiers rather than predictive features. Removing such unwanted columns helps in achieving better accuracy.
print("Name Column",len(df['Name'].unique())) print("Ticket Column",len(df['Ticket'].unique())) print("PassengerID Column",len(df['PassengerId'].unique()))

df=df.drop(columns=['Name','Ticket','PassengerId'])
df.head(5)
We have dropped “Name”, “Ticket” and “PassengerId” columns from our dataset.

6. Data Normalization
After data cleaning and pre-processing, the next major step in feature engineering is data normalization. We cannot feed the raw values directly to our model, because features on very different scales can make it hard for many models to converge.
6.1 Converting Categorical Columns
Categorical columns have to be converted into numerical columns. There are several methods for this:
1. LabelEncoder(): Encodes target values as integers from 0 to n_classes - 1.
2. OrdinalEncoder(): Encodes categorical feature columns as an integer array.
3. OneHotEncoder(): Encodes categorical feature columns as a one-hot numerical array (see the get_dummies sketch below).
In this example, we are going to use Ordinal Encoder.
ordinalencoder=OrdinalEncoder()
df[['Embarked','Sex']]=ordinalencoder.fit_transform(df[['Embarked','Sex']])
df[['Embarked','Sex']]

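If you prefer one-hot encoding (option 3) instead, pandas get_dummies is the simplest route; a minimal sketch (not used in this tutorial):
# Each category becomes its own 0/1 indicator column
df_onehot = pd.get_dummies(df, columns=['Embarked', 'Sex'])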
6.2 Normalizing Numerical Columns
Numerical columns have to be normalized. Common options are:
1. StandardScaler(): Transforms each value using (x - mean) / std, computed per column.
2. MinMaxScaler(): Transforms the values to a given range, [0, 1] by default (see the sketch below).
In this example, we are going to use StandardScaler.
standardscaler=StandardScaler()
df[['Pclass','Age','SibSp','Parch','Fare']]=standardscaler.fit_transform(df[['Pclass','Age','SibSp','Parch','Fare']])
df[['Pclass','Age','SibSp','Parch','Fare']]

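If the features should land in a fixed range instead, MinMaxScaler (option 2 above) follows the same fit_transform() pattern; a minimal sketch (not used in this tutorial):
from sklearn.preprocessing import MinMaxScaler

# Scales each column to [0, 1] by default
minmaxscaler = MinMaxScaler()
df[['Fare']] = minmaxscaler.fit_transform(df[['Fare']])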
7. Splitting Data
Once the data is normalized, we need to split it into training and validation datasets.
y=df.pop('Survived')
X_train, X_val, y_train, y_val = train_test_split(df, y, test_size=0.05, random_state=42)
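Note that test_size=0.05 leaves a very small validation set (with the standard 891-row Titanic data, roughly 45 rows), so the validation metrics will be noisy. train_test_split also accepts a stratify argument that preserves the class balance of “Survived” in both splits; a minimal variant:
X_train, X_val, y_train, y_val = train_test_split(df, y, test_size=0.2, stratify=y, random_state=42)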
8. Model Implementation
We are going to implement a LightGBM classifier model for this example. One advantage of a LightGBM model is that we can specify the categorical features with the “categorical_feature” parameter.
This way, the model treats these columns as categories rather than as ordered numbers.
lgbmclassifier = LGBMClassifier()
lgbmclassifier.fit(X_train,y_train,categorical_feature=['Embarked','Sex'])

predictions=lgbmclassifier.predict(X_val)
predictions

print(classification_report(y_val,predictions))

9. Results
We achieve an accuracy of 87% on the validation set with this model.
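classification_report() prints precision, recall, and F1 per class; if you only want the accuracy figure, the scikit-learn API of LGBMClassifier exposes it directly via score(). A minimal sketch:
# score() returns the mean accuracy on the given data
print(lgbmclassifier.score(X_val, y_val))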
Following the steps in feature engineering, we can achieve really good accuracy.
Summary
1. wget: To download the data.
2. info(): To check for null values for effective feature engineering.
3. pairplot(): To visualize the data distribution of the dataset.
4. mean(): To calculate the mean of a column.
5. fillna(): To fill the missing values in the dataset.
6. LabelEncoder(): To encode target values as integers from 0 to n_classes - 1.
7. OrdinalEncoder(): To encode categorical columns as an integer array.
8. OneHotEncoder(): To encode categorical columns as a one-hot numerical array.
9. StandardScaler(): To normalize values using (x - mean) / std per column.
10. MinMaxScaler(): To scale values to a given range.
11. LGBMClassifier(): To implement a LightGBM classifier model.
2 thoughts on “How to perform Feature Engineering in Machine Learning?”
Thanks for the post. However, I don’t think you are allowed to normalize the whole dataset before splitting it into train and test. You can only apply (not learn) the transformation during testing. Or am I missing something here?
Hi Janne,
We can do it both ways.
If we want to fit the encoder after splitting the data, we just need to make sure we train the encoder on the training dataset only and then apply it to both the train and test datasets.
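A minimal sketch of that leakage-free pattern (the column list is illustrative):
numeric_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
scaler = StandardScaler()
# Learn mean and std on the training split only...
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
# ...then reuse those parameters on the validation split, never refit
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])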