In this tutorial, you are going to learn:
1. What is Feature Engineering?
2. How to download the dataset?
3. How to explore the dataset?
4. How to process data to feed into the LightGBM model?
5. How to normalize the dataset?
6. How to split the dataset into training, validation, and testing?
7. How to create the LightGBM dataset?
8. How to specify parameters for the LightGBM model?
9. How to train a LightGBM model?
10. How to predict output using a trained LightGBM model?
11. How to find out the precision, recall, and accuracy of the model?
12. How to save a trained LightGBM model?
LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is much faster than the usual tree-based algorithms like Decision Trees, Random Forests, etc. It has the following advantages over traditional machine learning algorithms:
1. Faster training speed and higher efficiency.
2. Lower memory usage.
3. Supports GPU processing.
Why is LightGBM a better choice?
In most decision tree learning algorithms, the tree grows level-wise as the model is trained.

In a LightGBM model, by contrast, the tree grows leaf-wise: at each step it splits the leaf that gives the largest reduction in loss. This is why its accuracy is often better than that of other algorithms, since leaf-wise growth can achieve a lower loss for the same number of leaves.
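For intuition, here is a minimal sketch (with hypothetical values) of the two parameters that control this behaviour in a LightGBM parameter dictionary: num_leaves caps the number of leaves per tree, while max_depth can optionally bound how deep leaf-wise growth may go.
# Hypothetical illustration: parameters governing leaf-wise growth
params_sketch = {
    'num_leaves': 31,  # cap on leaves per tree; the main complexity control
    'max_depth': -1,   # -1 means no depth limit; set > 0 to bound depth
}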

1. Import the Libraries
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
# imported as lgb because later steps call lgb.Dataset and lgb.train
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import seaborn as sns
2. Download Dataset
We are going to download the Titanic dataset. The dataset has 12 columns, of which “Survived” is our target column.
!wget https://raw.githubusercontent.com/mananparasher/PML-Machine-Learning-Datasets/master/titanic_dataset.csv
Once we have the Pandas DataFrame, we can use inbuilt methods such as:
head() : Returns the first rows of the DataFrame (the top five below).
df=pd.read_csv("titanic_dataset.csv")
df.head(5)

3. Explore the Dataset
The info() method gives us information about the columns in our DataFrame: their data types, their non-null counts, and the total memory consumption of the dataset.
df.info()
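The summary at the end mentions pairplot(); assuming the seaborn import from step 1, a minimal sketch of how you could visualize pairwise relationships between the numeric columns at this stage:
# Pairwise scatter plots of the numeric columns, colored by the target
sns.pairplot(df, hue='Survived')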

4. Data Processing
We are using the fillna() method to fill the missing values: the column mean for the numeric “Age” column, and a placeholder category “Others” for “Embarked”.
df['Age']=df['Age'].fillna(df['Age'].mean())
df['Embarked']=df['Embarked'].fillna('Others')
df=df.drop(columns=['Cabin','Name','Ticket','PassengerId'])
df.head(5)

We have also dropped the “Cabin”, “Name”, “Ticket” and “PassengerId” columns from our dataset.
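As an optional sanity check (not part of the original steps), you can confirm that no missing values remain before normalization:
# Count remaining nulls per column; every count should now be 0
df.isnull().sum()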
5. Data Normalization
After data cleaning and pre-processing, the next major step in feature engineering is data normalization. We are using OrdinalEncoder() to encode the categorical columns and StandardScaler() to normalize the numerical columns.
ordinalencoder=OrdinalEncoder()
# LightGBM treats categorical features as integers, so cast the encoded
# float output back to int
df[['Embarked','Sex']]=ordinalencoder.fit_transform(df[['Embarked','Sex']]).astype(int)
standardscaler=StandardScaler()
df[['Pclass','Age','SibSp','Parch','Fare']]=standardscaler.fit_transform(df[['Pclass','Age','SibSp','Parch','Fare']])
df.head(5)
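If you want to see which integer each category was mapped to, the fitted encoder exposes a categories_ attribute; a small optional check:
# categories_[i] lists the categories of the i-th encoded column,
# in the order of their integer codes
ordinalencoder.categories_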

6. Splitting Data
Once the data is normalized, we need to split it into training, validation, and testing sets.
y=df.pop('Survived')
X_2, X_val, y_2, y_val = train_test_split(df, y, test_size=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.05, random_state=42)
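With two successive 5% splits, roughly 90% of the rows end up in the training set; an optional check of the resulting split sizes:
# Verify the sizes of the training, validation, and testing splits
print(len(X_train), len(X_val), len(X_test))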
7. Dataset for LightGBM
The Dataset() method is used to create a Dataset to feed into the LightGBM model. The input parameters for the method are:
1. data : The feature columns.
2. label : The target column.
categorical_columns=['Embarked','Sex']
training_data = lgb.Dataset(X_train, label=y_train)
validation_data = lgb.Dataset(X_val, label=y_val)
training_data
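One optional refinement: the LightGBM documentation recommends constructing a validation Dataset with reference=training_data, so that the validation data is binned with the same feature mappings as the training data.
# Align the validation set's feature binning with the training set
validation_data = lgb.Dataset(X_val, label=y_val, reference=training_data)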

8. Model Parameters
We can set some basic parameters for the model.
1. “num_leaves” : The maximum number of leaves a tree can grow while training.
2. “objective” : To specify whether the classification is binary or multi-class.
3. “metric” : The evaluation metric used on the validation data.
param = {'num_leaves': 100, 'objective': 'binary'}
param['metric'] = 'auc'
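The same parameters can also be written as a single dictionary, and other common knobs such as learning_rate can be added; the extra value below is purely illustrative.
param = {
    'num_leaves': 100,      # maximum leaves per tree
    'objective': 'binary',  # binary classification
    'metric': 'auc',        # evaluate AUC on the validation set
    'learning_rate': 0.1,   # illustrative; 0.1 is also LightGBM's default
}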
9. Model Training
The LightGBM train() method is used to train the model. The various parameters passed to it are as follows:
1. “num_round” : The number of training rounds for the model.
2. “param” : The parameters to be passed to the model.
3. “valid_sets” : The validation dataset.
4. “categorical_feature” : The categorical features in the dataset.
num_round = 10
boostermodel = lgb.train(param, training_data, num_round,
                         valid_sets=validation_data,
                         categorical_feature=categorical_columns)
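In recent LightGBM versions you can also stop training early when the validation AUC stops improving, via the early_stopping callback; a sketch assuming the same objects as above:
# Stop if the validation metric does not improve for 5 consecutive rounds
boostermodel = lgb.train(param, training_data, num_round,
                         valid_sets=[validation_data],
                         categorical_feature=categorical_columns,
                         callbacks=[lgb.early_stopping(stopping_rounds=5)])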

10. Model Predictions
We can predict the output of a trained LightGBM model using the predict() method. The output is prediction probabilities, so for binary classification we use the round() function to convert them into 0/1 class labels.
predictions=boostermodel.predict(X_test)
predictions=predictions.round(0)
predictions
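If you just want a single accuracy number in addition to the report in the next step, sklearn's accuracy_score works directly on the rounded predictions:
from sklearn.metrics import accuracy_score

# Fraction of test rows classified correctly
accuracy_score(y_test, predictions)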

11. Classification Report
classification_report() takes the true labels first and the predicted labels second, and reports the precision, recall, and accuracy of the model.
print(classification_report(y_test,predictions))
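A confusion matrix complements the report by showing the raw counts of correct and incorrect predictions per class:
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
confusion_matrix(y_test, predictions)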

12. Saving Model
We can save the model using the save_model() method.
boostermodel.save_model('model.txt')
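To use the saved model later, it can be loaded back with the Booster constructor's model_file argument:
# Reload the saved model and predict with it
loadedmodel = lgb.Booster(model_file='model.txt')
loadedmodel.predict(X_test)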
Summary
1. wget : To download the data.
2. info() : To check for null values for effective feature engineering.
3. pairplot() : To visualize the data distribution of the dataset.
4. mean() : To calculate the mean of a column.
5. fillna() : To fill the missing values in the dataset.
6. OrdinalEncoder() : To encode the categorical columns with integer codes.
7. StandardScaler() : To normalize the numerical columns using (x - mean) / std.
8. “num_leaves” : The maximum number of leaves a tree can grow while training.
9. “objective” : To specify whether the classification is binary or multi-class.
10. “metric” : The evaluation metric used on the validation data.
11. “num_round” : The number of training rounds for the model.
12. “param” : The parameters to be passed to the model.
13. “valid_sets” : The validation dataset.
14. “categorical_feature” : The categorical features in the dataset.
15. predict() : To predict the output using a trained LightGBM model.
16. round() : To convert prediction probabilities into 0/1 class labels.
17. save_model() : To save a trained LightGBM model.