Category: Data Science


Building Decision Tree model in python from scratch – Step by step

In the previous post, we created our first Machine Learning model using Logistic Regression to solve a classification problem. We used the “Wisconsin Breast Cancer dataset” for demonstration purposes. In this post, “Building Decision Tree model in python from scratch – Step by step”, we will use the IRIS dataset, a standard dataset that ships with the Scikit-learn library. Let’s have a quick look at the IRIS dataset.

The IRIS dataset

The IRIS dataset is a multi-class classification dataset introduced by the British statistician and biologist Ronald Fisher in 1936. It has 150 observations: 50 samples of each of three species of Iris flower, “setosa”, “versicolor” and “virginica”. It is a standard, clean, preprocessed multivariate dataset that comes bundled with the Scikit-learn library. Each sample has four input features:

  1. Sepal length (cm)
  2. Sepal width (cm)
  3. Petal length (cm)
More
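
The dataset described in this excerpt can be inspected directly. The snippet below is a minimal sketch, assuming scikit-learn is installed; it uses `load_iris` from `sklearn.datasets`, and the variable names are only illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Load the bundled IRIS dataset (no download needed).
iris = load_iris()
X, y = iris.data, iris.target

# Feature matrix: one row per flower, one column per input feature.
print("Shape:", X.shape)
print("Features:", iris.feature_names)

# Count how many samples belong to each species.
class_counts = Counter(str(iris.target_names[label]) for label in y)
print("Samples per species:", dict(class_counts))
```

The counts confirm that the three classes are perfectly balanced, which is one reason the dataset is popular for first classification experiments.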

Building first Machine Learning model using Logistic Regression in Python – Step by Step

When we start working on a Machine Learning/Data Science project, we first need to perform some data cleaning and data transformation to get a tidy dataset. Then, we need to perform some EDA (Exploratory Data Analysis) to find trends, patterns, and outliers in the given data. Once we have machine-interpretable data in place, we choose an algorithm and train the model. Then, we evaluate it on the test data. Next, we can tune the hyperparameters of the model and retrain it to get a robust model. Once the model performance is acceptable, we deploy it to make predictions. Typically, we follow these steps when creating a Machine Learning model:

In this post, “Building first Machine Learning model using Logistic Regression in Python”, we are going to create our first machine learning predictive model in a step-by-step way. We will be using the scikit-learn library … More
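
The split, train, evaluate and tune steps described in this excerpt can be sketched roughly as follows. This is a minimal illustration rather than the post's actual code: it assumes scikit-learn's bundled breast cancer dataset (`load_breast_cancer`), and the scaling step, the 80/20 split, the random seed and the grid of `C` values are all assumptions chosen for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a clean dataset and hold out 20% of it as test data (assumed split).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Choose an algorithm and train the model (features are scaled first).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out test data.
baseline_accuracy = model.score(X_test, y_test)
print("Baseline test accuracy:", round(baseline_accuracy, 3))

# Tune a hyperparameter (regularization strength C) with cross-validation,
# which also retrains the best model on the full training set.
search = GridSearchCV(model, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
tuned_accuracy = search.score(X_test, y_test)
print("Best C:", search.best_params_["logisticregression__C"])
print("Tuned test accuracy:", round(tuned_accuracy, 3))
```

Keeping the scaler and the classifier in one pipeline means the cross-validation inside `GridSearchCV` refits the scaler on each training fold, so no information from the validation fold leaks into training.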


Exploratory Data Analysis (EDA) using Python – Second step in Data Science and Machine Learning

In the previous post, “Tidy Data in Python – First Step in Data Science and Machine Learning”, we discussed the importance of tidy data and its principles. In a Machine Learning project, once we have a tidy dataset in place, it is always recommended to perform EDA (Exploratory Data Analysis) on the underlying data before fitting it into a Machine Learning model. Let’s start by understanding the importance of EDA and some basic EDA techniques that are very useful.

What is Exploratory Data Analysis (EDA)

Exploratory Data Analysis, or EDA, is the process of organizing, plotting and summarizing data to find trends, patterns, and outliers using statistical and visual methods. It takes input data in a tabular format and represents it in a graphical format, which makes it easier for humans to interpret. It is an important step in a Machine Learning/Data Science project which should be performed before … More
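
The statistical side of this definition can be sketched with pandas. The example below is only an illustration: the small table is hypothetical (it is not real Iris data and contains one deliberately bad entry), and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import pandas as pd

# Hypothetical measurements with one obviously wrong value (9.0).
df = pd.DataFrame({"petal_length_cm": [1.4, 1.5, 1.3, 1.6, 1.4, 1.5, 9.0]})
col = "petal_length_cm"

# Summarize: count, mean, standard deviation, min, quartiles, max.
summary = df[col].describe()
print(summary)

# Flag outliers with the interquartile range (IQR) rule.
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df[col] < lower) | (df[col] > upper)]
print("Outliers:")
print(outliers)
```

A box plot of the same column would show the flagged value visually as a point far beyond the whiskers.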


What is Machine learning and why is it gaining so much popularity?

Nowadays everyone seems to be talking about machine learning and its applications, but have we ever thought about how ML has suddenly become so popular? If work on AI started way back in 1950 and Machine Learning started to grow rapidly in 1990, what has suddenly given Machine Learning such a boost?

In this blog, I will give you answers to these questions but let us first have a look at what machine learning is.

We will start from the basics and understand what a program is. In simple terms, a program is a predefined set of rules or instructions. When data is fed to the computer, it processes the data using these rules. That sounds pretty cool, but then came the question: can’t a computer just be fed the data, work out the rules itself and give us the answers? This would make … More
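
The contrast drawn in this excerpt, rules written by hand versus rules derived from data, can be illustrated with a toy sketch. Everything here is hypothetical: the temperature readings, the labels, the hand-picked cut-off of 30 and the helper names are invented for the example, and the "learning" is a simple threshold search, not a full Machine Learning algorithm.

```python
# Traditional program: a human writes the rule.
def is_hot_by_rule(temp_c):
    return temp_c > 30  # hand-picked cut-off (hypothetical)


# Learning from data: the cut-off is chosen from labelled examples.
def learn_threshold(samples):
    """Return the cut-off that misclassifies the fewest (value, label) pairs."""
    values = sorted(value for value, _ in samples)
    # Candidate cut-offs are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((value > threshold) != label for value, label in samples)

    return min(candidates, key=errors)


# Hypothetical labelled data: (temperature in Celsius, "was it hot?").
data = [(18, False), (22, False), (25, False), (29, True), (33, True), (36, True)]

threshold = learn_threshold(data)
print("Learned cut-off:", threshold)
print("Rule says 29 is hot:", is_hot_by_rule(29))
print("Learned rule says 29 is hot:", 29 > threshold)
```

With this data the hand-written rule labels 29 as not hot, while the learned cut-off agrees with the examples, which is the basic idea of letting the data decide the rule.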


Tidy Data in Python – First Step in Data Science and Machine Learning

Most Data Science / Machine Learning projects follow the Pareto principle: we spend almost 80% of the time on data preparation and the remaining 20% on choosing and training the appropriate ML model. The datasets we get for building Machine Learning models are mostly messy and cannot be fitted into a model directly. We need to perform some data cleaning steps to get a dataset that can be fitted into the model, and we need to make sure that the data we feed into the model is tidy. Indeed, this is the first step in a Machine Learning / Data Science project. We may need to repeat the data cleaning process many times as we face new challenges and problems while cleaning the data. Data cleaning is one of the most important and time-consuming processes a Data Scientist performs before … More