Category : Analytics/ML


Relationship between Binomial and Poisson distributions

In this post, we are going to discuss the Relationship between Binomial and Poisson distributions. We know that Poisson distribution is a limit of Binomial distribution for a large n (number of trials) and small p (independent probability for each trial) values. A large number of trials n with very small probability p indicates a rare event in a binomial distribution. Considering this, we will simulate these distributions and then we will create a  CDF (cumulative distributed function) plot of Binomial and Poisson distributions. It will help us to understand the similarity between a Poisson experiment and a rare event Binomial experiment.

In this post, we will not be going into the mathematical details of Binomial and Poisson distributions. However, we will be using NumPy’s random module available in Python to simulate these distributions using a technique called bootstrapping.

Relationship between Binomial and Poisson distributions

Let’s start by understanding the … More


Convert Jupyter notebooks to PDF

Jupyter lab is the next-generation web-based UI experience for Jupyter notebook users. It facilitates a tab-based programming interface that is highly extensible and interactive. It supports 40+ programming languages. We have already discussed how we can use Jupyter notebooks for interactive data analysis with SQL Server. With the help of Jupyter notebooks, we can keep headings, comments, code, output, and advanced charts and visuals in a single document in an orderly fashion. It helps Data Scientists and Data Analysts to have highly interactive presentations. In case you have already installed Jupyter notebooks and want to know how we can change the home directory for Jupyter notebooks, visit the blog “Change Jupyter Notebook startup folder on Windows and Mac OS “. Let’s discuss how we can Convert Jupyter notebooks to PDF documents directly from the web-browser or using nbconvert command from command prompt.

Convert Jupyter notebooks to PDF

During … More


Interactive Data Analysis with SQL Server using Jupyter Notebooks

In this post “Interactive Data Analysis with SQL Server using Jupyter Notebooks“, we will demonstrate how we can use Jupyter Notebooks for interactive data analysis with SQL Server. Jupyter notebooks are one of the most useful tools for any Data Scientist/Data Analyst. It supports 40+ programming languages and facilitates web-based interactive programming IDE. We can put comments, headings, codes, and output in one single document. This document maintains the context to the original data source which means we can re-execute the code whenever we need it. This feature facilitates Data scientists/Data analysts to play with the code during the presentations. Also, these notebooks are very handy in sharing and can be shared easily across the teams.

What is Jupyter Lab

Jupyter Lab is the next-generation web-based tool for Jupyter notebooks. It enables tab based programming model which is highly extensible. We can arrange multiple windows … More


Python use case – Export SQL table data to excel and CSV files – SQL Server 2017

In this post, we are going to discuss how we can export SQL Server table data to an Excel file or to a CSV file using Python’s pandas library. Prior to SQL Server 2017, we could use one of the below methods to export data from SQL Server to Excel or CSV file:

  1. Create an SSIS package to export the data from SQL Server – This option can be a good choice if we want to reuse the export process again and again. Also, if we want to put moderate/complex transformations during data export, this option can be a better choice.
  2. Use SQL Server Import/export wizard – SQL Server provides in-built data export/import wizard which can be used in case we want to export data with no/minimal transformations.
  3. OS-based copy-paste functionality – We can simply copy the query output from SQL Server using Ctrl + C option and then open
More

Read and write data to SQL Server from Spark using pyspark 1

Apache Spark is a very powerful general-purpose distributed computing framework. It provides a different kind of data abstractions like RDDs, DataFrames, and DataSets on top of the distributed collection of the data. Spark is highly scalable Big data processing engine which can run on a single cluster to thousands of clusters. To follow this exercise, we can install Spark on our local machine and can use Jupyter notebooks to write code in an interactive mode. In this post “Read and write data to SQL Server from Spark using pyspark“, we are going to demonstrate how we can use Apache Spark to read and write data to a SQL Server table.

Read SQL Server table to DataFrame using Spark SQL JDBC connector – pyspark

Spark SQL APIs can read data from any relational data source which supports JDBC driver. We can read the data of a SQL Server table … More


Install Spark on Windows (Local machine) with PySpark – Step by Step

Apache Spark is a general-purpose big data processing engine. It is a very powerful cluster computing framework which can run from a single cluster to thousands of clusters. It can run on clusters managed by Hadoop YARN, Apache Mesos, or by Spark’s standalone cluster manager itself. To read more on Spark Big data processing framework, visit this post “Big Data processing using Apache Spark – Introduction“. Here, in this post, we will learn how we can install Apache Spark on a local Windows Machine in a pseudo-distributed mode (managed by Spark’s standalone cluster manager) and run it using PySpark (Spark’s Python API).

Install Spark on Local Windows Machine

To install Apache Spark on a local Windows machine, we need to follow below steps:

Step 1 – Download and install Java JDK 8

Java JDK 8 is required as a prerequisite for the Apache Spark installation. We … More


Change Jupyter Notebook startup folder on Windows and Mac OS 1

Once we have installed the Jupyter notebook, we can start it by executing “jupyter notebook” command in the command prompt on a Windows machine or in the terminal on a Mac machine. Jupyter notebook is a very useful web-based application which can be used to write programs in many programming languages like Python, R, Scala, Julia, and etc. The notebooks created in jupyter can be shared easily with other users over email, Git, and DropBox. We can use jupyter notebooks to write code in an interactive mode which can be very handy to re-run individual snippets. It is nicely integrated with Big Data ecosystem and with cloud platforms also.

When we start the jupyter notebook server, it shows the notebooks from the current working directory from which the notebook server is started. That is why the default working directory of a Jupyter notebook server is … More


Python use case – Save each worksheet as a separate excel workbook 1

In this post “Python use case – Save each worksheet as a separate excel workbook“, we are going to learn that how we can create a separate workbook for each worksheet of a given excel file. We will be copying data, values, formatting and all other settings of the sheet in the newly created workbook. Assume that we have an excel workbook having five sheets named “Sheet1“, “Sheet2“, “Sheet3“, “Sheet4“, and “Sheet5” which is as this.

Now, we want to create 5 different excel workbooks (one for each worksheet).

We are going to use Excel application to copy data, values, formatting and other settings of each sheet as a new workbook in a python script. Below is the python code.

To execute the above python script, we can call this .py file using command prompt window as this.… More


Building Decision Tree model in python from scratch – Step by step

In previous post, we created our first Machine Learning model using Logistic Regression to solve a classification problem. We used “Wisconsin Breast Cancer dataset” for demonstration purpose. Now, in this post “Building Decision Tree model in python from scratch – Step by step”, we will be using IRIS dataset which is a standard dataset that comes with Scikit-learn library. Let’s have a quick look at IRIS dataset.

The IRIS dataset

The IRIS dataset is a multi-class classification dataset introduced by British statistician and biologist Ronald Fisher in 1936. This dataset has 150 observations which consists 50 samples of each of three species of Iris flower which are “setosa“, “versicolor” or “virginica“. It is a standard, cleansed and preprocessed multivariate dataset which comes preloaded with Scikit-learn library. Each sample has four input features which are:

  1. Sepal length (cm)
  2. Sepal width (cm)
  3. Petal length (cm)
More

Building first Machine Learning model using Logistic Regression in Python – Step by Step

When we start working on a Machine Learning/Data Science project, first we need to perform some data cleaning and data transformation to get a tidy dataset. Then, we need to perform some EDA(Exploratory Data Analysis) to find trends, patterns, and outliers in the given data. Once, we have a machine interpretable data in place, we choose an algorithm and train the model. Then, we evaluate it on the test data. Next, we can tune the hyper parameters of the model and retrain it to get a robust model. Once the model performance is acceptable, we deploy it to make predictions. Typically, we follow these steps in a Machine Learning model creation:

In this post “Building first Machine Learning model using Logistic Regression in Python“, we are going to create our first machine learning predictive model in a step by step way. We will be using scikit-learn library … More