
Feature Importance using XGBoost


In this tutorial, you are going to learn:

 

1. How to import the XGBoost library?

2. How to import the dataset?

3. How to process the dataset for the machine learning model?

4. How to convert categorical data into numerical data?

5. How to split the data into testing and training dataset?

6. How to implement an XGBoost machine learning model?

7. How to predict output using a trained XGBoost model?

8. What is Feature Importance? 

9. How to find the most important features using the XGBoost model?

10. How to build an XGBoost model using selected features?

 

XGBoost Model

 
Gradient boosting is used for both regression and classification problems. It builds the model sequentially: each weak learner is trained to correct the errors of the learners before it, so the combined model improves with every iteration. This is achieved by optimizing a loss function.
 
XGBoost is an optimized implementation of gradient boosting. It adds regularization to the boosting objective and often gives high accuracy on tabular data.
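To make the idea of learners correcting earlier learners concrete, here is a minimal sketch of plain gradient boosting for squared error, written with NumPy only. It is not how XGBoost is implemented (XGBoost also uses second-order gradients and regularization); the toy data, the helper names fit_stump and predict_stump, and the learning rate are assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    # Try every observed value except the largest, so both sides are non-empty
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Toy regression problem (made-up data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())   # start from a constant model
losses = []
for _ in range(50):
    residual = y - prediction             # negative gradient of 1/2 * squared error
    stump = fit_stump(x, residual)        # weak learner fitted to the current errors
    prediction += learning_rate * predict_stump(stump, x)
    losses.append(np.mean((y - prediction) ** 2))

print("MSE after first round:", losses[0])
print("MSE after last round: ", losses[-1])
```

Each round fits a weak learner to what the current model still gets wrong, which is why the training error shrinks as rounds are added.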

1. Import Libraries

 
The first step is to import all the necessary libraries.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

2. Import Dataset

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00459/avila.zip
!unzip avila.zip

Once the files are extracted, we can load the data with the pandas function

read_csv( ) : To read a CSV file into a pandas DataFrame.

columns_names = ['intercolumnar_distance', 'upper_margin', 'lower_margin', 'exploitation', 'row_number', 'modular',
                 'interlinear_spacing', 'weight', 'peak_number', 'modular_ratio', 'class']

dataset = pd.read_csv('avila/avila-ts.txt', names=columns_names, na_values="?", comment='\t',
                      sep=",", skipinitialspace=True)

dataset
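As a small illustration of what the names, na_values and skipinitialspace arguments do, the sketch below reads a two-row CSV from memory instead of from disk. The column names f1 and f2 and the values are made up for this example and are not part of the Avila dataset.

```python
import io
import pandas as pd

# Two header-less rows; "?" marks a missing value, and there is a space after each comma
raw = io.StringIO("0.1, 0.2, A\n0.3, ?, B\n")

demo = pd.read_csv(raw, names=["f1", "f2", "class"], na_values="?",
                   sep=",", skipinitialspace=True)

print(demo)
print(demo.isna().sum())   # counts the missing values per column
```

Because the file has no header row, names supplies the column labels, and every "?" is read as NaN.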

3. Data Processing

 

Once we have the dataset, we need to separate the training features, X, from the target variable, y.

y=dataset[['class']]
dataset=dataset.drop(columns=['class'])
X=dataset

The drop( ) function returns a new DataFrame without the given column. The original DataFrame only changes because we assign the result back to it.
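The following sketch, on a tiny made-up DataFrame, shows that drop( ) leaves the original object untouched unless the result is reassigned.

```python
import pandas as pd

frame = pd.DataFrame({"feature": [1.0, 2.0], "class": ["A", "B"]})

# drop returns a copy without the column; frame itself is unchanged
features_only = frame.drop(columns=["class"])

print(list(frame.columns))          # still contains "class"
print(list(features_only.columns))  # "class" is gone
```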

4. Convert Categorical Columns to Numerical

 

To convert the categorical data into numerical data, we use OrdinalEncoder. OrdinalEncoder maps each distinct category in a column to its own numeric code, so a column with n distinct categories is encoded with the values 0.0 to n-1.

For example, if a column has the two values [‘a’, ‘b’] and we pass it to OrdinalEncoder, the resulting column will have the values [0.0, 1.0], where 0.0 represents ‘a’ and 1.0 represents ‘b’.

ordinalencoder = OrdinalEncoder()
ordinalencoder.fit(y)
y = ordinalencoder.transform(y)
y=y.flatten()
y
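The two-category example described above can be reproduced directly. This is a small sketch on made-up values; inverse_transform is shown only to confirm which code maps to which category.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder expects a 2-D input: one row per sample, one column per feature
column = np.array([["b"], ["a"], ["b"], ["a"]])

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(column)

print(encoder.categories_)   # categories in sorted order: 'a' then 'b'
print(encoded.ravel())       # [1. 0. 1. 0.]

# Map the codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```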

5. Split Data

 

We use scikit-learn's train_test_split( ) function to split the data into training and testing data. The “test_size” parameter sets the fraction of rows held out for testing; here 0.05 keeps 5% of the rows for validation.

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.05, random_state=42)
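To see what test_size=0.05 and random_state do in practice, here is a short sketch on a made-up array of 100 rows (the variable names are chosen for this example only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(200).reshape(100, 2)   # 100 rows, 2 made-up feature columns
labels = np.arange(100) % 2                 # made-up binary labels

f_train, f_val, l_train, l_val = train_test_split(
    features, labels, test_size=0.05, random_state=42)

print(f_train.shape, f_val.shape)   # 95 training rows, 5 validation rows

# The same random_state reproduces exactly the same split
f_train_again, f_val_again, _, _ = train_test_split(
    features, labels, test_size=0.05, random_state=42)
print(np.array_equal(f_val, f_val_again))
```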

6. Model Implementation

 

To implement an XGBoost model for classification, we use the XGBClassifier( ) class.

model = XGBClassifier()
model.fit(X_train, y_train)

7. Predict Output

 

We can predict the output with a trained model using its predict( ) method.

prediction = model.predict(X_val)
# classification_report expects the true labels first, then the predictions
print(classification_report(y_val, prediction))
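The order of the arguments matters for scikit-learn's metric functions: their signature is (y_true, y_pred), and swapping the two exchanges precision and recall. A small sketch with made-up labels shows the effect.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 1]   # made-up ground truth
y_pred = [0, 0, 1, 1]   # made-up predictions with one false positive

# Correct order: true labels first, predictions second
correct = (precision_score(y_true, y_pred), recall_score(y_true, y_pred))

# Swapped order: precision and recall trade places
swapped = (precision_score(y_pred, y_true), recall_score(y_pred, y_true))

print("correct (precision, recall):", correct)   # (0.5, 1.0)
print("swapped (precision, recall):", swapped)   # (1.0, 0.5)
```

Accuracy is symmetric and is not affected, but the per-class precision and recall columns of classification_report( ) are.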

8. Feature Importance

 

Feature importance measures how much a particular feature contributes to the model's predictions. In a trained XGBoost model, we can read it from the feature_importances_ attribute.

# Sort the feature indices from most to least important
indices = np.argsort(model.feature_importances_)[::-1]

features = list(X.columns[indices])
importances = model.feature_importances_[indices]

fig, ax = plt.subplots(figsize=(15, 5))

sns.barplot(x=features, y=importances, color="steelblue", ax=ax)
ax.set_title('Feature Importance')
ax.set(xlabel="Columns", ylabel="Importance")

Visualizing the results of feature importance shows us that “peak_number” is the most important feature and “modular_ratio” and “weight” are the least important features.

9. Model Implementation with Selected Features

 

We know the most important and the least important features in the dataset. Now we will build a new XGBoost model using only the important features.

X_new = X[['intercolumnar_distance', 'upper_margin', 'lower_margin', 'exploitation', 'row_number', 'modular',
           'interlinear_spacing', 'peak_number']]
X_train_new, X_val_new, y_train_new, y_val_new = train_test_split(X_new, y, test_size=0.05, random_state=42)
model_sel_features = XGBClassifier()
model_sel_features.fit(X_train_new, y_train_new)
prediction = model_sel_features.predict(X_val_new)
# True labels first, then the predictions
print(classification_report(y_val_new, prediction))
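Instead of typing the selected column names by hand, the selection can be derived from the importance scores. The helper select_top_features below is a hypothetical function written for this sketch, and the column names and importance values are made up.

```python
import numpy as np
import pandas as pd

def select_top_features(frame, importances, k):
    """Return the k columns with the highest importance, in their original order."""
    top = np.argsort(importances)[::-1][:k]   # indices of the k largest scores
    keep = sorted(top)                        # preserve the original column order
    return frame[frame.columns[keep]]

# Made-up data and importance scores for illustration
frame = pd.DataFrame({"f0": [1, 2], "f1": [3, 4], "f2": [5, 6], "f3": [7, 8]})
scores = np.array([0.10, 0.50, 0.05, 0.35])

selected = select_top_features(frame, scores, k=2)
print(list(selected.columns))   # ['f1', 'f3']
```

With a fitted model, the scores would come from its feature_importances_ attribute and the frame would be the training features.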

10. Results

 

In this experiment, training the model on only the important features results in better accuracy. Hence feature importance is a useful part of feature engineering.

Summary

 

1. drop( ) : To drop a column in a dataframe.

2. OrdinalEncoder( ): To convert categorical data into numerical data.

3. train_test_split( ) : To split the data into testing and training datasets.

4. XGBClassifier( ) : To implement an XGBoost machine learning model.

5. predict( ): To predict output using a trained XGBoost model.

6. feature_importances_ : To find the most important features using the XGBoost model.

7. classification_report( ) : To calculate Precision, Recall and Accuracy.

You can find the GitHub link here.
