Author: Gopal Krishna Ranjan


About Gopal Krishna Ranjan

Gopal has 8 years of industry experience in software development. He has hands-on experience in Data Science, databases, data warehousing, Big Data, and cloud technologies, and has implemented end-to-end solutions. He has worked extensively with SQL Server, Python, Hadoop, Hive, Spark, Azure, Machine Learning, and MSBI (SSAS, SSIS, and SSRS). He also has good experience in Windows and web application development using ASP.NET and C#.

Big Data processing using Apache Spark – Introduction

What is Spark

Apache Spark is an open-source, general-purpose distributed cluster computing framework and a unified computing engine for big data processing. Spark is designed for lightning-fast cluster computing: an application can run up to 100 times faster than Hadoop MapReduce by using Spark’s in-memory cluster computing, and up to 10 times faster when running on disk.

Why Spark

We can use Spark for any kind of big data processing, from SQL to streaming and machine learning, running on anything from a single machine to thousands of servers. It supports widely used programming languages such as Python, Java, Scala, and R by exposing a set of high-level API libraries. Spark can run on clusters managed by Hadoop YARN or Apache Mesos, or it can run standalone. It provides many features such as fast computational speed, multiple language support, … More
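
As a quick illustration of these high-level APIs, here is a minimal PySpark sketch (assuming the pyspark package is installed; the app name and sample rows are made up for illustration):

from pyspark.sql import SparkSession

# start (or reuse) a local Spark session
spark = SparkSession.builder.appName("IntroExample").getOrCreate()

# a tiny in-memory DataFrame standing in for a big dataset
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# transformations are lazy; the aggregation runs only when an action such as show() is called
df.groupBy("key").sum("value").show()

spark.stop()

The same code runs unchanged whether the session is local or backed by a YARN, Mesos, or standalone cluster.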


Understanding Map join in Hive

Apache Hive is a big data query engine that is used to read, transform, and write large datasets in a distributed environment. It has a SQL-like syntax that gets translated into MapReduce jobs for execution on Hadoop clusters. In the Hadoop ecosystem, we use Hive for batch processing to extract, transform, and load data into a data warehouse system or into a file system, which can be HDFS, Amazon S3, Azure Blob Storage, or Azure Data Lake. However, Hive is not meant for OLTP tasks, as it has high latency. In this post, we are going to learn about map join, which can be used to improve the performance of a Hive query. We will also discuss the parameters required to enable map join, along with its limitations.

What is Map join in Hive

The join clause in Hive is used to combine records from two tables … More
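
As a rough sketch (keeping this blog’s Python focus, and assuming a reachable HiveServer2, the PyHive package, and hypothetical emp and dept tables), enabling and hinting a map join could look like this:

from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000)  # hypothetical host
cur = conn.cursor()

# Hive settings that control automatic conversion of a common join into a map join
cur.execute("SET hive.auto.convert.join=true")
cur.execute("SET hive.mapjoin.smalltable.filesize=25000000")  # small-table threshold in bytes

# the MAPJOIN hint asks Hive to load the small table (dept) into memory,
# avoiding the shuffle/reduce phase of a common join
cur.execute("""
SELECT /*+ MAPJOIN(d) */ e.emp_name, d.dept_name
FROM emp e JOIN dept d ON e.dept_id = d.dept_id
""")
print(cur.fetchall())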


Python use case – Save each worksheet as a separate excel workbook

In this post, “Python use case – Save each worksheet as a separate excel workbook”, we are going to learn how we can create a separate workbook for each worksheet of a given Excel file. We will be copying the data, values, formatting, and all other settings of each sheet into the newly created workbook. Assume that we have an Excel workbook with five sheets named “Sheet1”, “Sheet2”, “Sheet3”, “Sheet4”, and “Sheet5”.

Now, we want to create five different Excel workbooks (one for each worksheet).

We are going to use the Excel application to copy the data, values, formatting, and other settings of each sheet into a new workbook from a Python script; the full code is given in the post.
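
As a minimal sketch of that approach (assuming Windows with Excel installed, the pywin32 package, and a hypothetical input file C:\temp\Sample.xlsx):

import os
import win32com.client  # from the pywin32 package

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
excel.DisplayAlerts = False  # suppress overwrite and file-format prompts

wb = excel.Workbooks.Open(r"C:\temp\Sample.xlsx")  # hypothetical path
for sheet in wb.Worksheets:
    # Worksheet.Copy with no arguments copies the sheet, with its data,
    # values, formatting, and settings, into a brand-new workbook
    sheet.Copy()
    new_wb = excel.ActiveWorkbook
    new_wb.SaveAs(os.path.join(r"C:\temp", sheet.Name + ".xlsx"))
    new_wb.Close()

wb.Close(SaveChanges=False)
excel.Quit()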

To execute the Python script, we can call the .py file from a command prompt window as this.… More


Building Decision Tree model in python from scratch – Step by step

In the previous post, we created our first Machine Learning model using logistic regression to solve a classification problem, using the “Wisconsin Breast Cancer dataset” for demonstration purposes. Now, in this post, “Building Decision Tree model in python from scratch – Step by step”, we will be using the IRIS dataset, a standard dataset that comes with the Scikit-learn library. Let’s have a quick look at the IRIS dataset.

The IRIS dataset

The IRIS dataset is a multi-class classification dataset introduced by the British statistician and biologist Ronald Fisher in 1936. The dataset has 150 observations, consisting of 50 samples of each of three species of the Iris flower: “setosa”, “versicolor”, and “virginica”. It is a standard, cleansed, and preprocessed multivariate dataset that comes preloaded with the Scikit-learn library. Each sample has four input features (a short model sketch follows the list):

  1. Sepal length (cm)
  2. Sepal width (cm)
  3. Petal length (cm)
More
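
As a short model sketch of fitting a decision tree to this dataset with Scikit-learn (the split ratio and tree depth below are illustrative choices, not necessarily the post’s):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# load the preprocessed IRIS dataset that ships with Scikit-learn
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# a shallow tree keeps the model easy to visualize and explain
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))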

Building first Machine Learning model using Logistic Regression in Python – Step by Step

When we start working on a Machine Learning/Data Science project, we first need to perform some data cleaning and data transformation to get a tidy dataset. Then, we perform some EDA (Exploratory Data Analysis) to find trends, patterns, and outliers in the given data. Once we have machine-interpretable data in place, we choose an algorithm and train the model. Then, we evaluate it on the test data. Next, we can tune the hyperparameters of the model and retrain it to get a more robust model. Once the model’s performance is acceptable, we deploy it to make predictions. Typically, we follow these steps when creating a Machine Learning model.

In this post, “Building first Machine Learning model using Logistic Regression in Python”, we are going to create our first machine learning predictive model in a step-by-step way. We will be using the scikit-learn library … More
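
As a minimal sketch of those steps on the “Wisconsin Breast Cancer dataset” (the split ratio and scaling step below are common defaults, not necessarily the post’s exact choices):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# load the dataset and hold out a test set for evaluation
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)

# standardize the features so the solver converges quickly
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test data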


Exploratory Data Analysis (EDA) using Python – Second step in Data Science and Machine Learning

In the previous post, “Tidy Data in Python – First Step in Data Science and Machine Learning”, we discussed the importance of tidy data and its principles. In a Machine Learning project, once we have a tidy dataset in place, it is always recommended to perform EDA (Exploratory Data Analysis) on the underlying data before fitting it to a Machine Learning model. Let’s start by understanding the importance of EDA and some basic EDA techniques which are very useful.

What is Exploratory Data Analysis (EDA)

Exploratory Data Analysis, or EDA, is the process of organizing, plotting, and summarizing data to find trends, patterns, and outliers using statistical and visual methods. It takes input data in a tabular format and represents it in a graphical format, which makes it more human-interpretable. It is an important step in a Machine Learning/Data Science project and should be performed before … More
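
A few basic EDA techniques fit in a handful of pandas and matplotlib calls; here is a small sketch, assuming a hypothetical input file data.csv:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical input file

print(df.head())          # peek at the first rows
print(df.describe())      # summary statistics for the numeric columns
print(df.isnull().sum())  # missing values per column

df.hist(figsize=(10, 8))  # distribution of every numeric column
plt.show()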


Quick guide to Bash commands for Big Data Analysis

In this post, “Quick guide to Bash commands for Big Data Analysis”, we are going to explore some basic Bash/Linux commands which are very useful in data analysis. Bash is a command line interpreter for the GNU OS (a UNIX-like free OS) which typically runs in a command line window. It accepts commands submitted by the end user, transforms them into a machine-understandable format, and sends them to the kernel. If we want to execute a batch of Bash commands in one go, we can wrap the set of commands in a text file saved with a .sh extension and then call this file in order to execute it.

Since Hadoop was developed on top of the Linux OS, mostly Linux-based machines are used in production environments. Therefore, in order to interact with Hadoop clusters, we must have a good understanding of … More
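
Keeping this blog’s Python focus, here is a small sketch of the kind of commands the post covers, invoked from Python via subprocess (the file data.csv and the pipeline are made up for illustration):

import subprocess

# count the records in the file (wc -l)
print(subprocess.run(["wc", "-l", "data.csv"],
                     capture_output=True, text=True).stdout)

# preview the first five rows (head)
print(subprocess.run(["head", "-n", "5", "data.csv"],
                     capture_output=True, text=True).stdout)

# a small pipeline, the kind we would normally save in an .sh file;
# shell=True hands the whole string to the shell so the pipes work
print(subprocess.run("cut -d',' -f1 data.csv | sort | uniq -c",
                     shell=True, capture_output=True, text=True).stdout)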


Python use case – Resampling time series data (Upsampling and downsampling) – SQL Server 2017

Resampling time series data in SQL Server using Python’s pandas library

In this post, we are going to learn how we can use the power of Python in SQL Server 2017 to resample time series data using Python’s pandas library. Sometimes we get the sample data (observations) at a higher or lower frequency than the required frequency level. In such scenarios, we need to modify the frequency of the given samples to match the frequency of the required outcome. Modifying the frequency of time series data with a T-SQL query becomes a tedious task, especially when we need to perform upsampling, since we have to generate more rows than exist in the sample dataset. Python’s pandas module has built-in capabilities for frequency conversion. With the help of the pandas resample method, we can increase or decrease the frequency of time series observations with only a few lines of … More
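
Outside SQL Server, the core pandas calls take only a couple of lines; here is a minimal sketch with made-up daily observations:

import pandas as pd

# ten hypothetical daily observations
ts = pd.Series(range(10),
               index=pd.date_range("2019-01-01", periods=10, freq="D"))

weekly = ts.resample("W").mean()   # downsampling: daily -> weekly, aggregated by mean
hourly = ts.resample("H").ffill()  # upsampling: daily -> hourly, forward-filling the new rows

print(weekly)
print(hourly.head())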


Tidy Data in Python – First Step in Data Science and Machine Learning

Most Data Science / Machine Learning projects follow the Pareto principle: we spend almost 80% of the time on data preparation and the remaining 20% on choosing and training an appropriate ML model. The datasets we get for building Machine Learning models are mostly messy and cannot be fed into the model directly. We need to perform some data cleaning steps in order to get a dataset that can then be fitted to the model, and we need to make sure that the data we input into the model is tidy data. Indeed, this is the first step in a Machine Learning / Data Science project. We may need to repeat the data cleaning process many times as we face new challenges and problems while cleaning the data. Data cleaning is one of the most important and time-consuming processes a Data Scientist performs before … More
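
As one small illustration of a typical tidying step (not necessarily the post’s own example), here is a “wide” table, where values are trapped in the column headers, reshaped into one-observation-per-row form with pandas.melt:

import pandas as pd

# a messy "wide" layout: the year is embedded in the column names
wide = pd.DataFrame({
    "country": ["A", "B"],
    "pop_2018": [10, 20],
    "pop_2019": [11, 22],
})

# melt moves the year into its own column, one observation per row
tidy = wide.melt(id_vars="country", var_name="year", value_name="population")
tidy["year"] = tidy["year"].str.replace("pop_", "").astype(int)
print(tidy)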


Conditional group by in SQL Server

In this post, we are going to learn how we can apply a conditional GROUP BY clause to a column based on the values of another column. Assume that we have a table named tbl_EmpSaleDetail which contains the sales records of each employee. Let’s have a look at the table data.

In the above table, we have these columns:

EmpName – Stores the name of the employee

SaleDate – Date of sale

SaleAmount – Amount of the sale

IsActive – Indicates whether the employee is active or not.

Now, we need this output.

In this output, we can see that all the data for the inactive employees has been aggregated into a single row labeled “–Inactive Employees Sales–” (highlighted in red), whereas the sales of the active employees are aggregated individually. Before writing the conditional GROUP BY query, let’s create the sample table … More
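
In T-SQL this is typically done with a CASE expression inside the GROUP BY; here is a hedged pandas analogue of the same idea (the rows below are made up, and only the column names follow the post):

import pandas as pd

# made-up rows mirroring the post's columns
df = pd.DataFrame({
    "EmpName":    ["Amit", "Binod", "Chris", "David"],
    "SaleAmount": [100, 200, 150, 300],
    "IsActive":   [1, 1, 0, 0],
})

# keep each active employee as their own group; fold every inactive
# employee into one shared label before aggregating
key = df["EmpName"].where(df["IsActive"] == 1, "--Inactive Employees Sales--")
print(df.groupby(key)["SaleAmount"].sum())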