Building Decision Tree model in python from scratch – Step by step


In the previous post, we created our first Machine Learning model using Logistic Regression to solve a classification problem, with the “Wisconsin Breast Cancer dataset” for demonstration. In this post, “Building Decision Tree model in python from scratch – Step by step”, we will use the IRIS dataset, a standard dataset that ships with the Scikit-learn library. Let’s have a quick look at the IRIS dataset.

The IRIS dataset

The IRIS dataset is a multi-class classification dataset introduced by British statistician and biologist Ronald Fisher in 1936. It has 150 observations: 50 samples for each of three species of Iris flower, “setosa”, “versicolor” and “virginica”. It is a standard, cleaned and preprocessed multivariate dataset that comes bundled with the Scikit-learn library. Each sample has four input features:

  1. Sepal length (cm)
  2. Sepal width (cm)
  3. Petal length (cm), and
  4. Petal width (cm)

The target variable defines the species of the iris flower, which can be “setosa”, “versicolor” or “virginica”. We need to create a classifier (using a Decision Tree Classifier) that predicts the species of an iris flower for unseen data from the given input features: sepal length, sepal width, petal length, and petal width.

Let’s have a look at the sample data:

Top 10 sample rows

Data preprocessing and Exploration

Now, we are going to load and analyze this dataset in python using the pandas library, a powerful and handy library for data analysis.

from sklearn import datasets #import datasets from sklearn library
import pandas as pd #import pandas under alias pd
data = datasets.load_iris() #load Iris dataset in a variable named data

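Using the above code, we have loaded the Iris dataset into a variable named “data”. It is of type <class ‘sklearn.utils.Bunch’>. Bunch is a dictionary-like object. The five keys/properties we need here are listed below (recent scikit-learn versions add a few more, such as frame and filename).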

  1. DESCR – Contains the full description of the dataset
  2. data – Contains input features data in a numpy array with shape (150, 4)
  3. feature_names – Contains the names of the features in a python list.
  4. target – Contains the target values (dependent variable values) for each of the 150 rows – shape (150, )
  5. target_names – Contains the names of the target classes in a string array

We can access these properties using syntax like data.<property_name> or data[‘<property_name>’].
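As a small illustration of the two access styles, here is a minimal sketch. It assumes scikit-learn is installed; the variable name first_species is introduced only for this example.

```python
from sklearn import datasets

data = datasets.load_iris()

# Attribute access and key access return the same underlying object
print(data.feature_names)      # python list with the four feature names
print(data["target_names"])    # string array with the three species names
print(data.data.shape, data.target.shape)

# Map the first encoded target value back to its species name
first_species = data.target_names[data.target[0]]
print(first_species)
```

The target array stores the species as integer codes, so target_names is what turns a prediction back into a readable label.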

Now, let’s create a pandas dataframe using Iris data.

df = pd.DataFrame(data.data, columns = data.feature_names) #create a dataframe df with features as column name
print(df.head()) #print top 5 rows of the dataframe

Output:

Sample dataframe rows

Let’s use the .info() method on the dataframe to get the column names, data types and non-null value counts along with memory usage. Also, use the .describe() method to get the statistical summary of each column.

df.info() #print column names, datatypes and non-null value counts (info() prints directly and returns None)
print(df.describe()) #print statistical summary of the columns

Output:

Quick info and statistical summary of the dataframe

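Now, let’s use the .isnull() method to count the missing values (None or NaN, “Not a Number”) in this dataset. The .isna() method is an alias of .isnull(), so both calls return the same counts:

print(df.isnull().sum()) #Print the number of missing values in each column
print(df.isna().sum()) #Same result, as isna() is an alias of isnull()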

Output:

Count of Null and NaN values

The dataframe does not have any null or NaN values, and all the input features in this dataset are numeric. (CART as an algorithm can use categorical input features, but scikit-learn’s decision trees expect numeric input, so categorical features would have to be encoded first.) So, this dataset satisfies the tidy data principles and can be used in a Machine Learning model as it is. Before fitting this data into the model, let’s do some EDA on this dataset.
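The checks so far can also be confirmed in one place. The following is a minimal sketch, not part of the original walkthrough: the column name species and the variables missing_total and class_counts are hypothetical names chosen for this example.

```python
from sklearn import datasets
import pandas as pd

data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Attach a readable species label to each row (hypothetical column name)
df["species"] = pd.Categorical.from_codes(data.target, categories=data.target_names)

missing_total = int(df.isna().sum().sum())    # total missing cells in the frame
class_counts = df["species"].value_counts()   # number of rows per species

print(missing_total)
print(class_counts)
```

Seeing the same count for every species is what lets us treat this as a balanced classification problem later on.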

EDA (Exploratory Data Analysis)

As all the input features of this dataset are numeric, we can draw a scatter matrix plot, which displays the pairwise relationship between every two features of the dataset (with the histogram of each feature on the diagonal). To draw a scatter matrix plot, we can use this code.

import matplotlib.pyplot as plt
_ = pd.plotting.scatter_matrix(df, c = data.target, figsize = [6, 6], s = 25, marker = 'D')
plt.show()

Output:

Scatter matrix

In the above image, we can see that petal length and petal width are highly correlated: the points in their panels fall close to a straight line.
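The visual impression can be backed with a number. Here is a minimal sketch that computes the Pearson correlation matrix with pandas; the variable names corr and petal_corr are introduced only for this example.

```python
from sklearn import datasets
import pandas as pd

data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

corr = df.corr()  # Pearson correlation between every pair of columns (the default)
petal_corr = corr.loc["petal length (cm)", "petal width (cm)"]

print(corr.round(2))
print(petal_corr)
```

A value close to 1 for the petal length and petal width pair confirms what the scatter matrix suggests.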

Now, let’s draw the histogram of each feature.

_ = df.hist(bins = 4, figsize = (6, 6))
plt.show()

Output:

Histogram of each feature

We can also apply some more EDA techniques on this dataframe (like box plot, violin plot, and strip plot) before fitting this data into our Machine Learning model. Visit this link to learn more about EDA (Exploratory Data Analysis) techniques.

Classification and Regression Tree – CART

Classification and Regression Tree, or CART, is a supervised Machine Learning algorithm used to solve classification (categorical output) and regression (continuous output) tasks. It uses a Decision Tree, which consists of a hierarchy of nodes. Each node involves either a question or a prediction. There are three types of node:

  1. Root node: It has no parent node and involves a question which gives rise to two child nodes
  2. Internal node: It has one parent node and involves a question which gives rise to two child nodes
  3. Leaf node: It has one parent node but no child nodes (because it involves no question). It is also known as the prediction node.
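The three node types can be sketched as a small data structure. This is only an illustration and not how scikit-learn stores its trees: the Node class, the predict_one function and the thresholds below are all hypothetical and written by hand, not learned from data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a binary decision tree (illustrative structure only)."""
    feature: Optional[int] = None       # index of the feature the question is about
    threshold: Optional[float] = None   # question: is x[feature] <= threshold?
    left: Optional["Node"] = None       # child followed when the answer is yes
    right: Optional["Node"] = None      # child followed when the answer is no
    prediction: Optional[str] = None    # set only on leaf nodes

    def is_leaf(self) -> bool:
        return self.prediction is not None


def predict_one(node: Node, x: Sequence[float]) -> str:
    """Walk from the given node down to a leaf and return its prediction."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Hand-written tree; thresholds are made up for illustration.
# Feature indices: 2 = petal length (cm), 3 = petal width (cm)
root = Node(                                    # root node (no parent)
    feature=2, threshold=2.5,
    left=Node(prediction="setosa"),             # leaf node
    right=Node(                                 # internal node
        feature=3, threshold=1.7,
        left=Node(prediction="versicolor"),     # leaf node
        right=Node(prediction="virginica"),     # leaf node
    ),
)

print(predict_one(root, [5.1, 3.5, 1.4, 0.2]))  # setosa
print(predict_one(root, [6.0, 2.9, 4.5, 1.5]))  # versicolor
print(predict_one(root, [6.5, 3.0, 5.8, 2.2]))  # virginica
```

Prediction is just a walk from the root to a leaf, answering one question per node.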

During training, a Decision Tree learns the questions (splits) that produce the purest possible leaves (a pure leaf is one in which a single class predominates). Decision Trees implicitly perform feature selection, and CART can handle both numerical and categorical data as input features. They also do not need data normalization/standardization (used to bring all the input features on the same scale), because each question compares a single feature against a threshold. In addition, Decision Trees can capture non-linear relationships. A typical Decision Tree model looks like this.

CART prediction model

To know more about Decision Tree models, click here.
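To make the idea of a “pure” leaf concrete, here is a simplified sketch of the Gini impurity and of how a single split threshold could be chosen with it. It is a teaching example under stated assumptions, not scikit-learn’s implementation: the function names gini, split_impurity and best_threshold are hypothetical, and the toy data is made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two children of `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; keep the lowest impurity.

    Assumes the feature has at least two distinct values.
    """
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


# Toy data (made up): one feature, two classes
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

print(gini(species))                                 # 0.5 for two balanced classes
t = best_threshold(petal_length, species)
print(t, split_impurity(petal_length, species, t))   # 3.0 0.0 (both children are pure)
```

A full tree repeats this search over every feature at every node until a stopping rule, such as a maximum depth, is reached.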

Building Decision Tree Classification model using scikit-learn

As with our previous model, we need to split the given dataset into two parts: training data and test data. The training data will be used to train the model and the test data will be used to evaluate the model performance on unseen data. We can use this code to train a Decision Tree Classifier and generate predictions on the test data.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
seed = 22 #set seed value for reproducibility
x_data = data.data #Assign input features in x_data
y_data = data.target #Assign target/dependent variable values in y_data
#Now, split the x and y data into train and test datasets.
#Use stratify = y_data to keep the same class proportions in the train and test samples as in the input dataset
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, random_state = seed, stratify = y_data, test_size = 0.30)

#Instantiate decision tree classifier
dt = DecisionTreeClassifier(criterion = 'gini', max_depth = 2,
                            min_samples_leaf = 0.10, random_state = seed)
dt.fit(x_train, y_train) #Train the model
y_pred = dt.predict(x_test) #Predict the values on test data

While instantiating the Decision Tree, we have used criterion = ‘gini’, max_depth = 2 and min_samples_leaf = 0.10 (each leaf must hold at least 10% of the training samples), which are hyperparameters (parameter values that must be set before fitting the data into the Machine Learning model). We can use GridSearchCV or RandomizedSearchCV techniques to search for good values of these parameters.
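A hyperparameter search with GridSearchCV could look like the sketch below. The candidate values in param_grid are illustrative assumptions, not tuned recommendations, and the 5-fold setting is likewise just one reasonable choice.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 22
data = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=seed, stratify=data.target, test_size=0.30
)

# Candidate values are illustrative assumptions
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4],
    "min_samples_leaf": [0.05, 0.10],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=seed),
    param_grid=param_grid,
    cv=5,                  # 5-fold cross-validation on the training data only
    scoring="accuracy",
)
grid.fit(x_train, y_train)

print(grid.best_params_)
print(grid.best_score_)              # mean cross-validated accuracy of the best combination
print(grid.score(x_test, y_test))    # accuracy of the refit best model on the test data
```

The search only ever sees the training data, so the test set stays untouched for the final evaluation.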

Evaluate the model performance

Now, let’s evaluate our model performance. As ours is a balanced classification problem (each class has 50 samples in the input dataset), we can use accuracy as the performance metric.

from sklearn.metrics import accuracy_score
print('Accuracy of the model is {0}'.format(accuracy_score(y_test, y_pred)))

Output:

Accuracy score

The accuracy of our model is approximately 93.3% (42 of the 45 test samples classified correctly), which is an acceptable score. In addition to accuracy, we can also use a confusion matrix to measure the model performance.

from sklearn.metrics import confusion_matrix
print(pd.DataFrame(confusion_matrix(y_test, y_pred), \
            index = ['Actual setosa', 'Actual versicolor', 'Actual virginica'], \
            columns = ['Pred setosa', 'Pred versicolor', 'Pred virginica']))

Output:

Confusion matrix

As we have three classes in the target, the output is a 3 x 3 matrix. The diagonal values of the matrix represent the correct predictions and the off-diagonal values represent the incorrect predictions.
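To see how the matrix is built and how other scores follow from it, here is a minimal sketch in plain python. The function confusion_counts is a hypothetical helper written for this example, and the labels are made up rather than taken from our model.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Return a matrix m where m[i][j] counts samples of actual class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels for illustration (0 = setosa, 1 = versicolor, 2 = virginica)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 1, 2]

cm = confusion_counts(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)

# Accuracy: diagonal (correct predictions) divided by all predictions
correct = sum(cm[i][i] for i in range(3))
accuracy = correct / len(y_true)

# Recall per class: diagonal over row total. Precision: diagonal over column total.
recall = [cm[i][i] / sum(cm[i]) for i in range(3)]
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(3)]

print(accuracy)
print(recall)
print(precision)
```

Rows are actual classes and columns are predicted classes, the same layout as the labelled dataframe printed above.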

We can also print the classification report (precision, recall and F1-score for each class), which is especially useful when the classes in the input dataset are imbalanced.

from sklearn.metrics import classification_report
print('Classification report:\n{0}'.format(classification_report(y_test, y_pred))) #start the report on a new line so its columns stay aligned

Output:

Classification report

Thanks for reading. Please share your inputs in the comments.


Gopal Krishna Ranjan

About Gopal Krishna Ranjan

Gopal has 8 years of industry experience in Software development. He has a head down experience in Data Science, Database, Data Warehouse, Big Data and cloud technologies and has implemented end to end solutions. He has extensively worked on SQL Server, Python, Hadoop, Hive, Spark, Azure, Machine Learning, and MSBI (SSAS, SSIS, and SSRS). He also has good experience in windows and web application development using ASP.Net and C#.
