Gopal Krishna Ranjan

RDD, DataFrame, and DataSet – Introduction to Spark Data Abstraction

Leave a Comment / Spark / Gopal Krishna Ranjan / May 31, 2019 / big data processing

Apache Spark is a general-purpose distributed computing engine used for Big Data processing, both batch and stream. It provides high-level APIs such as Spark SQL, Spark Streaming, MLlib, and GraphX that allow interaction with the core functionality of Apache Spark. Spark also provides several core data abstractions on top of the distributed collection of […]

Big Data processing using Apache Spark – Introduction

Leave a Comment / Spark / Gopal Krishna Ranjan / Apr 30, 2019 / Hadoop, MapReduce

What is Spark? Apache Spark is an open-source, general-purpose distributed cluster computing framework. It is a unified computing engine for big data processing, designed for fast cluster computing. An application can run up to 100 times faster than Hadoop MapReduce using Spark's in-memory cluster computing. Also,

Understanding Map join in Hive

Leave a Comment / Hadoop, Hive / Gopal Krishna Ranjan / Mar 31, 2019 / Hadoop, HiveQL, MapReduce, performance tuning, query hint

Apache Hive is a data warehouse system with a SQL-like query language (HiveQL), used to read, transform, and write large datasets in a distributed environment. Its SQL-like syntax gets translated into MapReduce jobs that execute on Hadoop clusters. In the Hadoop ecosystem, we use Hive for batch processing to extract, transform and
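
To give a rough feel for the idea behind a map join, here is a minimal sketch in plain Python. It is not Hive's implementation: it only illustrates that the smaller table is loaded into an in-memory hash table and the larger table is streamed past it, so the join can finish on the map side. The `map_join` helper and the sample tables are hypothetical.

```python
from collections import defaultdict


def map_join(small_rows, large_rows, key):
    """Inner-join two lists of dicts on `key`, hashing only the small side."""
    # Build phase: the small table is loaded fully into memory.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: stream the large table and look up each key.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data: a small dimension table and a larger fact table.
departments = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "HR"}]
employees = [
    {"emp": "A", "dept_id": 1},
    {"emp": "B", "dept_id": 2},
    {"emp": "C", "dept_id": 3},  # no matching department, dropped by the inner join
]
result = map_join(departments, employees, "dept_id")
```

Only the small side is ever held in memory, which is why this kind of join suits cases where one table is much smaller than the other.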

Python use case – Save each worksheet as a separate excel workbook

1 Comment / Python / Gopal Krishna Ranjan / Feb 28, 2019 / python, python use case, utility

In this post, “Python use case – Save each worksheet as a separate excel workbook”, we are going to learn how to create a separate workbook for each worksheet of a given Excel file. We will copy the data, values, formatting, and all other settings of each sheet into the newly created workbook.

Building Decision Tree model in python from scratch – Step by step

1 Comment / Data Science, Machine Learning, Python / Gopal Krishna Ranjan / Jan 28, 2019 / data analysis, data science - step by step, machine learning - step by step, python

In the previous post, we created our first Machine Learning model using Logistic Regression to solve a classification problem. We used the “Wisconsin Breast Cancer dataset” for demonstration purposes. Now, in this post, “Building Decision Tree model in python from scratch – Step by step”, we will be using the Iris dataset, which is a standard dataset that
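
As a small taste of what "from scratch" involves, the sketch below computes Gini impurity, one common criterion for scoring a candidate split in a decision tree. Treating Gini as the criterion is an assumption made here for illustration (the post itself may use a different one), and the function names and toy labels are made up.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two sides of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Toy labels (hypothetical): two classes, evenly mixed.
labels = ["setosa", "setosa", "versicolor", "versicolor"]
parent = gini_impurity(labels)                     # evenly mixed node
perfect = split_impurity(labels[:2], labels[2:])   # each side holds one class
mixed = split_impurity(labels[::2], labels[1::2])  # each side is still mixed
```

A tree builder would try many candidate splits and keep the one with the lowest weighted impurity; here `perfect` scores lower than `mixed`, so the first split would be preferred.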

Building first Machine Learning model using Logistic Regression in Python – Step by Step

Leave a Comment / Data Science, Machine Learning, Python / Gopal Krishna Ranjan / Dec 31, 2018 / data science - step by step, machine learning - step by step, python

This post briefly explains how to create our first machine learning predictive model using Logistic Regression in Python. When we start working on a Machine Learning project, we first perform some data wrangling and transformation to get a tidy dataset. Then, we perform some EDA to find trends, patterns, and outliers in the given dataset. Once we have machine-interpretable data
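
For readers who want to see the mechanics, here is a minimal sketch of logistic regression with one feature, fitted by plain gradient descent using only the standard library. It is an illustration under assumptions, not the post's code: the toy data, the learning rate, and the number of epochs are all made up.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every sample under the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data (hypothetical): small x values belong to class 0, large ones to class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
```

The model predicts class 1 whenever the estimated probability reaches 0.5, which corresponds to `w * x + b >= 0`.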

Exploratory Data Analysis (EDA) using Python – Second step in Data Science and Machine Learning

2 Comments / Analytics/ML, Data Analysis, Data Science, Machine Learning, Python / Gopal Krishna Ranjan / Nov 27, 2018 / data analysis, data preprocessing, data science - step by step, EDA, machine learning - step by step, python

In the previous post, “Tidy Data in Python – First Step in Data Science and Machine Learning”, we discussed the importance of tidy data and its principles. In a Machine Learning project, once we have a tidy dataset in place, it is always recommended to perform EDA (Exploratory Data Analysis) on the underlying data

Quick guide to Bash commands for Big Data Analysis

1 Comment / Big Data/Cloud / Gopal Krishna Ranjan / Oct 31, 2018 / bash command, linux

In this post, “Quick guide to Bash commands for Big Data Analysis”, we are going to explore some basic Bash/Linux commands that are very useful in data analysis. Bash is a command-line interpreter for the GNU OS (a free UNIX-like OS) that typically runs in a command-line window. It accepts the command submitted

Python use case – Resampling time series data (Upsampling and downsampling) – SQL Server 2017

Leave a Comment / Data Analysis, Machine Learning, Python, SQL Server / Gopal Krishna Ranjan / Sep 24, 2018 / data science - step by step, machine learning - step by step, pandas, python, python use case sql, sql server 2017

Resampling time series data in SQL Server using Python’s pandas library. In this post, we are going to learn how we can use the power of Python in SQL Server 2017 to resample time series data using Python’s pandas library. Sometimes, we get the sample data (observations) at a different frequency (higher or lower) than
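
To make the two directions of resampling concrete, here is a small standard-library sketch. It is not the pandas or SQL Server code from the post: it only shows the idea of downsampling (averaging readings into hourly buckets) and upsampling (forward-filling onto a finer grid). The helper names and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def downsample_hourly(samples):
    """Average (timestamp, value) pairs into one value per hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return [(hour, mean(values)) for hour, values in sorted(buckets.items())]


def upsample_forward_fill(samples, step):
    """Emit a point every `step`, repeating the last known value."""
    samples = sorted(samples)
    if not samples:
        return []
    out = []
    for (ts, value), (next_ts, _) in zip(samples, samples[1:]):
        current = ts
        while current < next_ts:
            out.append((current, value))
            current += step
    out.append(samples[-1])  # the final observation has nothing after it to fill
    return out


# Hypothetical readings taken every 30 minutes.
start = datetime(2018, 9, 24, 9, 0)
readings = [
    (start + timedelta(minutes=30 * i), v)
    for i, v in enumerate([10.0, 20.0, 30.0, 50.0])
]
hourly = downsample_hourly(readings)                                    # lower frequency
quarter_hourly = upsample_forward_fill(readings, timedelta(minutes=15))  # higher frequency
```

Downsampling needs an aggregation rule (a mean here), while upsampling needs a fill rule (forward fill here); other choices, such as sums or interpolation, follow the same pattern.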

Tidy Data in Python – First Step in Data Science and Machine Learning

1 Comment / Analytics/ML, Data Analysis, Data Science, Machine Learning, Python / Gopal Krishna Ranjan / Aug 20, 2018 / data analysis, data cleaning, data science - step by step, data wrangling, machine learning - step by step

Most Data Science / Machine Learning projects follow the Pareto principle: we spend almost 80% of the time on data preparation and the remaining 20% on choosing and training the appropriate ML model. Mostly, the datasets we get for building Machine Learning models are messy and cannot be fitted into the model
