
How to find the best categorical features in the dataset?


In this tutorial, you are going to learn:

 

1. How to import the necessary libraries?

2. How to import the dataset?

3. How to explore the dataset?

4. How to convert Categorical Columns to Numerical Columns?

5. How to select the best Categorical Features?

6. How to plot a bar graph?

 

1. Import the Libraries

The first step is to import all the necessary libraries.
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

2. Import the Dataset

 

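We use the Car Data dataset from the UCI Machine Learning Repository. The file comes in .data format, which here is plain comma-separated text without a header row, so we pass the column names ourselves when reading it into a pandas DataFrame.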

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
column_names = ['price','buying','maint','doors','persons','lug_boot','safety']  

dataset = pd.read_csv('car.data', names=column_names,
                      na_values = "?", comment='\t',
                      sep=",", skipinitialspace=True)
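To see what the read_csv( ) call does without downloading anything, here is a minimal sketch that parses a few rows held in memory. The rows in sample_text are made up for illustration and only mimic the layout of car.data; the column names are the ones used in this tutorial.

```python
import io

import pandas as pd

# Made-up rows in the same header-less, comma-separated layout as car.data.
sample_text = """vhigh,vhigh,2,2,small,low,unacc
low,low,4,4,big,high,vgood
med,low,5more,more,med,med,acc
"""

# Same column names as in the tutorial.
column_names = ['price', 'buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']

# names= supplies the header, na_values="?" turns question marks into NaN.
sample = pd.read_csv(io.StringIO(sample_text), names=column_names,
                     na_values="?", sep=",", skipinitialspace=True)

print(sample.shape)
print(sample.head())
```

Because names= is given, pandas treats the first line of the file as data rather than as a header.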

3. Explore the Dataset

 

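Our dataset has 7 columns in total, and all of them are of the object type.

The safety column is the target column. (In the UCI documentation, the last column of car.data holds the class label, that is, the overall acceptability of the car; this tutorial names it safety.)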

dataset.info()
dataset['safety'].unique()
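Besides info( ) and unique( ), two other pandas calls are handy when exploring categorical columns: nunique( ) and value_counts( ). The sketch below uses a small made-up DataFrame (toy is a hypothetical name) rather than the real dataset.

```python
import pandas as pd

# Small made-up frame with two categorical columns.
toy = pd.DataFrame({
    'lug_boot': ['small', 'big', 'med', 'small', 'big', 'med'],
    'safety': ['unacc', 'acc', 'unacc', 'good', 'vgood', 'unacc'],
})

# Number of distinct values in every column.
distinct_counts = toy.nunique()
print(distinct_counts)

# How often each target label occurs; useful for spotting class imbalance.
label_counts = toy['safety'].value_counts()
print(label_counts)
```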

Our Target column has four unique string values. We need to convert our target column into a numerical one.

def categorical_to_numerical(value):
    # Map each class label to an integer code.
    if value == 'unacc':
        return 0
    elif value == 'acc':
        return 1
    elif value == 'good':
        return 2
    else:
        # Any remaining label gets the highest code.
        return 3

dataset['safety'] = dataset['safety'].apply(categorical_to_numerical)

We have defined a function “categorical_to_numerical”, which returns a numerical value based on the string it receives, and used apply( ) to run it on every value of the target column.
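The same conversion can also be written as a dictionary lookup with map( ). This is only an alternative sketch, not the tutorial's method. It assumes the fourth label is 'vgood', as in the UCI Car Evaluation data; label_to_code and labels are illustrative names.

```python
import pandas as pd

# Assumed label set; 'vgood' is the value that falls into the else branch.
label_to_code = {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}

# Made-up target values standing in for the real column.
labels = pd.Series(['unacc', 'acc', 'vgood', 'good', 'unacc'], name='safety')

# map() looks every value up in the dictionary.
codes = labels.map(label_to_code)
print(codes.tolist())
```

One difference in behavior: a label missing from the dictionary becomes NaN with map( ), whereas the if/else function would silently return 3 for it.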

y=dataset.pop('safety')
X=dataset

4. Convert Categorical Columns into Numerical Columns

 

The input to the OrdinalEncoder is the categorical features. Each feature is converted into ordinal integers ranging from 0 to (number of categories - 1).

ordinalencoder = OrdinalEncoder()
ordinalencoder.fit(X)
X_transformed = ordinalencoder.transform(X)
X_transformed
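To check which integer was assigned to which category, the fitted encoder exposes categories_, and inverse_transform( ) maps the codes back to the original strings. A minimal sketch on a made-up two-column frame (toy is a hypothetical name):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data with the same flavour as the car columns.
toy = pd.DataFrame({
    'lug_boot': ['small', 'big', 'med', 'small'],
    'doors': ['2', '4', '5more', '2'],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# categories_ lists the categories per column; the position is the code.
for column, categories in zip(toy.columns, encoder.categories_):
    print(column, list(categories))

# inverse_transform turns the integer codes back into the original strings.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

By default the categories are sorted alphabetically, so 'big' gets 0 and 'small' gets 2. If a meaningful order matters, the categories parameter of OrdinalEncoder lets you pass the order explicitly.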

5. SelectKBest Implementation

 

The chi-square test measures how strongly each feature depends on the target column. Features with low scores are likely independent of the target and can be treated as irrelevant.

SelectKBest keeps the K features with the highest scores. In our case, the scores come from the chi-square function, and K is 3.
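To make the idea behind the chi-square statistic concrete, here is a sketch of the textbook contingency-table version, computed by hand and checked against scipy. Note that this is an illustration of the general test and not the exact computation inside scikit-learn's chi2, which treats the feature values as non-negative counts. The data and the names toy, observed, and expected are made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up feature and binary target.
toy = pd.DataFrame({
    'lug_boot': ['small', 'small', 'small', 'big', 'big', 'big', 'med', 'med'],
    'target':   [0, 0, 0, 1, 1, 0, 1, 0],
})

# Observed counts for every (category, class) combination.
observed = pd.crosstab(toy['lug_boot'], toy['target']).to_numpy()

# Expected counts if feature and target were independent.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# The same statistic from scipy, plus p-value and degrees of freedom.
chi2_scipy, p_value, dof, _ = chi2_contingency(observed, correction=False)

print(chi2_manual, chi2_scipy, p_value, dof)
```

The larger the statistic, the further the observed counts are from what independence would predict.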

selectkbest = SelectKBest(score_func=chi2, k=3)
selectkbest.fit(X_transformed, y)
# Transform the encoded features; the raw X still contains strings.
best_columns = selectkbest.transform(X_transformed)
best_columns
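The transformed array no longer carries column names. One way to recover which columns were kept is get_support( ), which returns a boolean mask over the input features. The sketch below uses synthetic data in which one column is built from the target; all names in it are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)

# One column derived from the target, two columns of pure noise.
toy = pd.DataFrame({
    'informative': target * 2,
    'noise_a': rng.integers(0, 3, size=200),
    'noise_b': rng.integers(0, 3, size=200),
})

selector = SelectKBest(score_func=chi2, k=1)
selector.fit(toy, target)

# Boolean mask: True for every column that was kept.
mask = selector.get_support()
selected_names = toy.columns[mask].tolist()
print(selected_names)
```

Applied to this tutorial, the same mask could be used to index X.columns after fitting selectkbest.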

6. Visualize with Barplot

 

We can visualize the results using a bar graph to get better insights.

# Sort the feature indices by chi-square score, highest first.
indices = np.argsort(selectkbest.scores_)[::-1]

# Column names in the same order as the sorted scores.
features = [X.columns[i] for i in indices]

fig, ax = plt.subplots(figsize=(20,10))

sns.barplot(x=features, y=selectkbest.scores_[indices],
            palette="Blues_d", ax=ax)
ax.set_title('Categorical Features Importance')

ax.set(xlabel="Columns", ylabel = "Importance")
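If a table is easier to read than a chart, the scores can also be collected in a sorted DataFrame together with their p-values. This sketch calls chi2( ) directly on a tiny hand-made example; X_toy, y_toy, and ranking are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Tiny hand-made example of already encoded features.
X_toy = pd.DataFrame({
    'a': [0, 0, 0, 2, 2, 2],   # follows the target exactly
    'b': [1, 1, 1, 1, 1, 1],   # constant, carries no information
    'c': [0, 1, 0, 1, 0, 1],   # weakly related to the target
})
y_toy = np.array([0, 0, 0, 1, 1, 1])

# chi2 returns one score and one p-value per feature.
scores, p_values = chi2(X_toy, y_toy)

ranking = (
    pd.DataFrame({'feature': X_toy.columns,
                  'chi2_score': scores,
                  'p_value': p_values})
    .sort_values('chi2_score', ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

With a fitted SelectKBest, the same values are available as scores_ and pvalues_.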

 Summary

 

1. read_csv( ) : To read a csv file into a pandas dataframe.

2. apply( ) : To apply a function to every value of a pandas column.

3. OrdinalEncoder( ) : To convert categorical columns into numerical columns.

4. SelectKBest( ) : To select the K best features in the dataset.

5. barplot( ) : To plot data in the form of a bar chart.

 You can find the Github link here.
