# Category : Data Science

## Relationship between Binomial and Poisson distributions

In this post, we are going to discuss the relationship between the Binomial and Poisson distributions. The Poisson distribution is the limit of the Binomial distribution for a large number of trials n and a small probability of success p on each independent trial, with the product np held fixed. A large n combined with a very small p describes a rare event in a Binomial experiment. With this in mind, we will simulate both distributions and then plot the CDF (cumulative distribution function) of each. This will help us see the similarity between a Poisson experiment and a rare-event Binomial experiment.

In this post, we will not go into the mathematical details of the Binomial and Poisson distributions. Instead, we will use NumPy’s random module in Python to simulate these distributions using a technique called bootstrapping.
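The simulation described above can be sketched roughly as follows. This is a minimal illustration rather than the post's own code: the values of `n`, `p`, the sample size, and the seed are assumptions chosen only to make the rare-event setting visible, and the comparison is done numerically instead of with a plot.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

n, p = 10_000, 0.0005            # many trials, small success probability (illustrative values)
lam = n * p                      # matching Poisson rate: lambda = n * p
size = 100_000                   # number of simulated experiments

binomial_samples = rng.binomial(n, p, size=size)
poisson_samples = rng.poisson(lam, size=size)

# Evaluate both empirical CDFs at each integer count k and measure the largest gap.
ks = np.arange(0, 21)
binomial_cdf = np.array([(binomial_samples <= k).mean() for k in ks])
poisson_cdf = np.array([(poisson_samples <= k).mean() for k in ks])
max_gap = np.abs(binomial_cdf - poisson_cdf).max()

print(f"binomial mean = {binomial_samples.mean():.3f}")
print(f"poisson mean  = {poisson_samples.mean():.3f}")
print(f"largest gap between the two empirical CDFs = {max_gap:.4f}")
```

Passing `ks` with `binomial_cdf` and `poisson_cdf` to a step plot would give the visual comparison; the two curves should be nearly indistinguishable for values like these.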

Let’s start by understanding the …

## Convert Jupyter notebooks to PDF

JupyterLab is the next-generation web-based user interface for Jupyter notebook users. It provides a tab-based programming interface that is highly extensible and interactive, and it supports 40+ programming languages. We have already discussed how to use Jupyter notebooks for interactive data analysis with SQL Server. With Jupyter notebooks, we can keep headings, comments, code, output, and advanced charts and visuals in a single, well-organized document, which helps Data Scientists and Data Analysts deliver highly interactive presentations. If you have already installed Jupyter notebooks and want to know how to change their home directory, visit the blog “Change Jupyter Notebook startup folder on Windows and Mac OS”. Let’s discuss how we can convert Jupyter notebooks to PDF documents, either directly from the web browser or using the nbconvert command from the command prompt.
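As a rough sketch of the command-line route, the snippet below builds the `jupyter nbconvert --to pdf` command and runs it from Python. The notebook name `analysis.ipynb` and the helper `nbconvert_pdf_command` are hypothetical, and the PDF export is assumed to have a working TeX installation available on the machine.

```python
import os
import shutil
import subprocess


def nbconvert_pdf_command(notebook_path):
    """Build the command line that asks nbconvert to export a notebook as PDF."""
    return ["jupyter", "nbconvert", "--to", "pdf", notebook_path]


command = nbconvert_pdf_command("analysis.ipynb")  # hypothetical notebook name
print(" ".join(command))

# Only attempt the conversion when jupyter is on PATH and the notebook exists.
if shutil.which("jupyter") is not None and os.path.exists(command[-1]):
    result = subprocess.run(command, capture_output=True, text=True)
    print("exit code:", result.returncode)
```

The printed command is what would be typed directly at the command prompt.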

During …

## Building Decision Tree model in python from scratch – Step by step

In the previous post, we created our first Machine Learning model using Logistic Regression to solve a classification problem. We used the “Wisconsin Breast Cancer dataset” for demonstration purposes. Now, in this post, “Building Decision Tree model in python from scratch – Step by step”, we will be using the IRIS dataset, a standard dataset that comes with the scikit-learn library. Let’s have a quick look at the IRIS dataset.

### The IRIS dataset

The IRIS dataset is a multi-class classification dataset introduced by the British statistician and biologist Ronald Fisher in 1936. It has 150 observations, consisting of 50 samples from each of three species of Iris flower: “setosa”, “versicolor” and “virginica”. It is a standard, cleansed and preprocessed multivariate dataset that comes preloaded with the scikit-learn library. Each sample has four input features, which are:

1. Sepal length (cm)
2. Sepal width (cm)
3. Petal length (cm)
4. Petal width (cm)
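The description above can be checked with a short sketch that loads the dataset through scikit-learn's `load_iris` and prints its shape, feature names, and class balance. This is only an inspection snippet, not part of the decision tree itself.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # feature matrix and integer-encoded species labels

print("shape of feature matrix:", X.shape)
print("feature names:", iris.feature_names)
print("species:", list(iris.target_names))
print("samples per species:", np.bincount(y))
```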

## Building first Machine Learning model using Logistic Regression in Python – Step by Step

When we start working on a Machine Learning/Data Science project, we first need to perform some data cleaning and data transformation to get a tidy dataset. Then, we need to perform some EDA (Exploratory Data Analysis) to find trends, patterns, and outliers in the given data. Once we have machine-interpretable data in place, we choose an algorithm and train the model. Then, we evaluate it on the test data. Next, we can tune the hyperparameters of the model and retrain it to get a robust model. Once the model performance is acceptable, we deploy it to make predictions. These are the steps we typically follow when creating a Machine Learning model.
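The train, evaluate, and tune steps above can be sketched with scikit-learn as follows. This is a minimal illustration under assumptions, not the post's own code: it uses the breast cancer dataset bundled with scikit-learn, and the split ratio, the scaling step, and the grid of `C` values are choices made only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set so evaluation uses data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
baseline_accuracy = model.score(X_test, y_test)

# Tune the regularization strength C with 5-fold cross-validation on the training set.
search = GridSearchCV(
    model,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)
tuned_accuracy = search.score(X_test, y_test)

print(f"baseline test accuracy: {baseline_accuracy:.3f}")
print(f"best C: {search.best_params_['logisticregression__C']}")
print(f"tuned test accuracy: {tuned_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means it is refit on each cross-validation fold, so no information from the validation folds leaks into training.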

In this post, “Building first Machine Learning model using Logistic Regression in Python”, we are going to create our first machine learning predictive model step by step. We will be using the scikit-learn library …

## Exploratory Data Analysis (EDA) using Python – Second step in Data Science and Machine Learning

In the previous post, “Tidy Data in Python – First Step in Data Science and Machine Learning”, we discussed the importance of tidy data and its principles. In a Machine Learning project, once we have a tidy dataset in place, it is always recommended to perform EDA (Exploratory Data Analysis) on the underlying data before fitting it into a Machine Learning model. Let’s start by understanding the importance of EDA and some basic EDA techniques that are very useful.

### What is Exploratory Data Analysis (EDA)

Exploratory Data Analysis, or EDA, is the process of organizing, plotting and summarizing the data to find trends, patterns, and outliers using statistical and visual methods. It takes input data from a tabular format and represents it in a graphical format, which makes it more human-interpretable. It is an important step in a Machine Learning/Data Science project which should be performed before …
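As a small sketch of the summarizing side of EDA, the snippet below computes a few summary statistics and flags outliers using only Python's standard library. The measurements are hypothetical, and the 1.5 × IQR rule is one common convention picked for this example, not necessarily the technique the post goes on to use.

```python
import statistics

# Hypothetical measurements; the value 58.0 is planted as an obvious outlier.
values = [12.1, 11.8, 12.6, 13.0, 12.4, 11.9, 58.0, 12.2, 12.8, 12.5]

mean = statistics.mean(values)
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4)  # first and third quartiles
iqr = q3 - q1

# Flag points lying more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print(f"mean = {mean:.2f}, median = {median:.2f}")
print(f"IQR = {iqr:.2f}, fences = ({lower:.2f}, {upper:.2f})")
print("outliers:", outliers)
```

The mean is pulled well above the median by the single extreme value, which is the kind of pattern a histogram or box plot would also make visible.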

## What is Machine learning and why is it gaining so much popularity?

Nowadays everyone seems to be talking about machine learning and its applications, but have we ever asked how ML became so popular all of a sudden? If work on AI started way back in 1950 and Machine Learning started to grow rapidly in 1990, what has suddenly given Machine Learning a boost?

In this blog, I will give you answers to these questions but let us first have a look at what machine learning is.

We will start from the basics and understand what a program is. In simple terms, a program is a predefined set of rules or instructions. When data is fed to the computer, it processes the data using these rules. That sounds pretty cool, but then came the question: can’t a computer just be fed the data, work out the rules itself, and give us the answers? This would make …
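The contrast between a hand-written rule and a rule derived from data can be sketched with a toy example. Everything here is hypothetical: the data, the function names, and the single-threshold "model" are assumptions chosen only to show the idea in a few lines.

```python
# Traditional program: the rule (the threshold 5.0) is written by the programmer.
def passes_by_rule(hours_studied):
    return hours_studied >= 5.0


# Learning: the rule is derived from labeled examples instead.
def learn_threshold(examples):
    """Pick the candidate threshold that classifies the most examples correctly."""
    candidates = sorted(hours for hours, _ in examples)

    def correct_count(threshold):
        return sum((hours >= threshold) == passed for hours, passed in examples)

    return max(candidates, key=correct_count)


# Hypothetical (hours studied, passed?) observations.
examples = [
    (1.0, False), (2.0, False), (3.5, False), (4.0, True),
    (5.5, True), (6.0, True), (8.0, True),
]

learned = learn_threshold(examples)
print("hand-written threshold: 5.0")
print("threshold learned from data:", learned)
```

In the first function a person decides the rule; in the second, the data decides it, and a different set of examples would yield a different threshold.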

## Tidy Data in Python – First Step in Data Science and Machine Learning

Most Data Science / Machine Learning projects follow the Pareto principle: we spend almost 80% of the time on data preparation and the remaining 20% on choosing and training an appropriate ML model. Most of the datasets we receive for building Machine Learning models are messy and cannot be fitted into a model directly. We need to perform some data cleaning steps to get a dataset that can be fitted into the model, and we need to make sure that the data we feed into the model is tidy. Indeed, this is the first step in a Machine Learning / Data Science project. We may need to repeat the data cleaning process many times as we face new challenges and problems while cleaning the data. Data cleaning is one of the most important and time-consuming processes a Data Scientist performs before …
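One common tidying step can be sketched with pandas, assuming the usual tidy-data convention that each variable is a column and each observation is a row. The table below is hypothetical, and `melt` is just one of several reshaping tools that could be used here.

```python
import pandas as pd

# Hypothetical messy table: one column per year, so the variable "year" lives in the header.
messy = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 20],
    "2020": [12, 25],
})

# Reshape so that every row is one (country, year) observation.
tidy = (
    messy.melt(id_vars="country", var_name="year", value_name="cases")
         .astype({"year": int})
         .sort_values(["country", "year"])
         .reset_index(drop=True)
)

print(tidy)
```

After reshaping, `year` is an ordinary column that can be filtered, grouped, or plotted like any other variable.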