
Introduction to LightGBM. How to implement a LightGBM model?


In this tutorial, you are going to learn

 

1. What is Feature Engineering?

2. How to download the dataset?

3. How to explore the dataset?

4. How to process data to feed into the LightGBM model?

5. How to normalize the dataset?

6. How to split the dataset into training, validation, and testing?

7. How to create the LightGBM dataset?

8. How to specify parameters for the LightGBM model?

9. How to train a LightGBM model?

10. How to predict output using a trained LightGBM model?

11. How to find out the precision, recall, and accuracy of the model? 

12. How to save a trained LightGBM model?

 

LightGBM

 

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is much faster than the usual tree-based algorithms such as Decision Trees and Random Forests, and it has the following advantages over traditional machine learning algorithms.

1. Faster training speed with better efficiency.

2. Lower memory usage.

3. Supports GPU processing.

4. Highly scalable and efficiently handles large datasets.

Why is LightGBM a better choice?

 

In most decision tree learning algorithms, the tree grows level-wise (depth by depth) as the model is trained.


In a LightGBM model, by contrast, the tree grows leaf-wise: at each step the algorithm splits the leaf that yields the largest reduction in loss. This is why its accuracy is often better than that of other algorithms, since leaf-wise growth tends to achieve a lower loss than level-wise growth for the same number of leaves.
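Leaf-wise growth can overfit small datasets, so LightGBM exposes parameters to constrain it. Here is a minimal sketch of the two most common controls; num_leaves and max_depth are real LightGBM parameters, but the values below are purely illustrative:

# num_leaves caps the total number of leaves grown leaf-wise per tree,
# while max_depth bounds how deep any single branch may go.
params = {
    'num_leaves': 31,   # lower this on small datasets to reduce overfitting
    'max_depth': -1,    # -1 means no depth limit
}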

1. Import the Libraries

 
The first step is to import all the necessary libraries.
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import seaborn as sns

2. Download Dataset

 

We are going to download the Titanic dataset. The “Survived” column is our target; the remaining columns are the candidate features.

!wget https://raw.githubusercontent.com/mananparasher/PML-Machine-Learning-Datasets/master/titanic_dataset.csv
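The !wget command only works inside a notebook shell. If you are running plain Python, pandas can read the CSV directly from the URL instead; a minimal equivalent sketch:

# pandas accepts URLs as well as local file paths.
url = "https://raw.githubusercontent.com/mananparasher/PML-Machine-Learning-Datasets/master/titanic_dataset.csv"
df = pd.read_csv(url)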

Once we have the Pandas DataFrame, we can use built-in methods such as

head( ) : Returns the first rows of the DataFrame (the first five by default).

df=pd.read_csv("titanic_dataset.csv")
df.head(5)

3. Explore the Dataset

 

The info( ) method gives us information about the columns in our DataFrame: their data types, their non-null counts, and the total memory consumption of the dataset.

df.info()
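Since we imported seaborn earlier, we can also visualize the data distribution with pairplot( ), as the summary at the end mentions. A quick optional sketch on a few numeric columns (the column subset here is just an illustrative choice):

# Pairwise scatter plots for a few numeric columns, coloured by the target.
sns.pairplot(df[['Survived','Age','Fare','Pclass']].dropna(), hue='Survived')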

4. Data Processing

 

We use the fillna( ) method to fill in the missing values: missing ages are replaced with the mean age, and missing embarkation ports with a placeholder category.

# Fill missing ages with the column mean and missing ports with a placeholder.
df['Age']=df['Age'].fillna(df['Age'].mean())
df['Embarked']=df['Embarked'].fillna('Others')
# Drop identifier-like columns that carry little predictive signal.
df=df.drop(columns=['Cabin','Name','Ticket','PassengerId'])
df.head(5)

We have dropped the “Cabin”, “Name”, “Ticket”, and “PassengerId” columns from our dataset.

5. Data Normalization

 

After data cleaning and pre-processing, the next major step in feature engineering is data normalization. We use OrdinalEncoder( ) to encode the categorical data and StandardScaler( ) to normalize the numerical data.

# Encode the categorical columns as integers; LightGBM expects categorical
# features to have an int or category dtype rather than float.
ordinalencoder=OrdinalEncoder()
df[['Embarked','Sex']]=ordinalencoder.fit_transform(df[['Embarked','Sex']]).astype(int)

# Standardize the numerical columns to zero mean and unit variance.
standardscaler=StandardScaler()
df[['Pclass','Age','SibSp','Parch','Fare']]=standardscaler.fit_transform(df[['Pclass','Age','SibSp','Parch','Fare']])

df.head(5)

6. Splitting Data

 

Once the data is normalized, we split it into training, validation, and testing datasets.

# Separate the target column, then carve out small validation and test sets.
y=df.pop('Survived')

X_2, X_val, y_2, y_val = train_test_split(df, y, test_size=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.05, random_state=42)

7. Dataset for LightGBM

 

The Dataset( ) constructor is used to create a dataset to feed into the LightGBM model. Its main input parameters are

1. data : The feature columns.

2. label : The target column.

categorical_columns=['Embarked','Sex']

# Wrap the training and validation splits in LightGBM's Dataset format.
training_data = lgb.Dataset(X_train, label=y_train)
validation_data = lgb.Dataset(X_val, label=y_val)

training_data

8. Model Parameters

 

We can set some basic parameters for the model.

1. “num_leaves” : The maximum number of leaves each tree may grow while training.

2. “objective” : Specifies whether the classification is binary or multi-class.

3. “metric” : The evaluation metric used on the validation set (here, AUC).

param = {'num_leaves': 100, 'objective': 'binary'}
param['metric'] = 'auc'

9. Model Training 

The LightGBM model has a train( ) method to train the model. The various parameters passed to it are as follows:

1. “num_round” : The number of training rounds for the model.

2. “param” : The parameters to be passed to the model.

3. “valid_sets” : The validation dataset.

4. “categorical_feature” : The categorical features in the dataset.

num_round = 10
boostermodel = lgb.train(param, training_data, num_round,
                         valid_sets=validation_data, categorical_feature=categorical_columns)
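If you want training to stop automatically once the validation AUC stops improving, recent LightGBM releases (3.3 and later) provide an early-stopping callback. A hedged sketch, reusing the objects defined above:

# Stop training if the validation AUC has not improved for 5 rounds.
boostermodel = lgb.train(param, training_data, 100,
                         valid_sets=validation_data,
                         categorical_feature=categorical_columns,
                         callbacks=[lgb.early_stopping(stopping_rounds=5)])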

10. Model Predictions 

 

We can get the output of a trained LightGBM model using the predict( ) method. The output is prediction probabilities, so for binary classification we use the round( ) function to turn them into class labels.

predictions=boostermodel.predict(X_test)
predictions=predictions.round(0)
predictions

11. Classification Report

The classification_report( ) function gives us the precision, recall, and accuracy of the model on the test set.

# classification_report expects (y_true, y_pred) in that order.
print(classification_report(y_test,predictions))

12. Saving Model

 

We can save the model using the save_model( ) method.

boostermodel.save_model('model.txt')
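To use the saved model later, it can be loaded back with the Booster constructor; a minimal sketch:

# Reload the saved model from disk and predict exactly as before.
loadedmodel = lgb.Booster(model_file='model.txt')
predictions = loadedmodel.predict(X_test).round(0)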

Summary

1. wget : To download the data.

2. info( ) : To check for null values for effective feature engineering.

3. pairplot( ) : To visualize the data distribution of the dataset.

4. mean( ) : To calculate the mean of a column.

5. fillna( ) : To fill the missing values in the dataset.

6. OrdinalEncoder( ) : To encode the categorical columns as integers.

7. StandardScaler( ) : To normalize the numerical columns using (x - mean) / std.

8. “num_leaves” : The maximum number of leaves each tree may grow while training.

9. “objective” : Specifies whether the classification is binary or multi-class.

10. “metric” : The evaluation metric used on the validation set.

11. “num_round” : The number of training rounds for the model.

12. “param” : The parameters to be passed to the model.

13. “valid_sets” : The validation dataset.

14. “categorical_feature” : The categorical features in the dataset.

15. predict( ) : To predict the output using a trained LightGBM model.

16. round( ) : To convert prediction probabilities into 0/1 class labels.

17. save_model( ) : To save a trained LightGBM model.

You can find the GitHub link here.
