Daily Archives: Nov 27, 2018


Exploratory Data Analysis (EDA) using Python – Second step in Data Science and Machine Learning

In the previous post, “Tidy Data in Python – First Step in Data Science and Machine Learning”, we discussed the importance of tidy data and its principles. In a Machine Learning project, once a tidy dataset is in place, it is recommended to perform Exploratory Data Analysis (EDA) on the underlying data before feeding it into a Machine Learning model. Let’s look at why EDA matters and at some basic EDA techniques that are very useful.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis, or EDA, is the process of organizing, plotting, and summarizing data to find trends, patterns, and outliers using statistical and visual methods. It takes input data in a tabular format and represents it graphically, which makes it easier for humans to interpret. It is an important step in a Machine Learning/Data Science project which should be performed before … More
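As a rough illustration of the "summarizing" and "outliers" parts of that definition, here is a minimal sketch that uses only the Python standard library. The sample values are made up for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not necessarily the technique used in the full post. In practice a library such as pandas and a plotting library would typically be used instead.

```python
import statistics

# Hypothetical sample: one numeric column taken from a tidy table.
values = [12, 15, 14, 13, 16, 15, 14, 95]

# Basic summary statistics.
mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles (default "exclusive" method) and the interquartile range.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Flag values that fall outside 1.5 * IQR of the quartiles.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]

print(f"mean={mean}, median={median}, IQR={iqr}")
print(f"outliers: {outliers}")
```

The large gap between the mean and the median is itself a hint that an extreme value is present, which the IQR check then confirms.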


Partitioning and Bucketing in Hive

In this article, we will discuss two important concepts in Hive: partitioning and bucketing. Both are used to improve query performance, and it is important to understand them so that you can apply them effectively. Let’s start with partitioning.

Partitioning in Hive

Partitioning is a technique used to improve query performance in Hive. It works by restructuring the data into subdirectories. Let us understand this concept with an example.

Suppose we have a large 10 GB file containing geographical data for a customer, and we want to extract the record for a particular country and a particular employeeId. To do so, Hive will perform a full table scan, reading all the rows and then picking only those records that satisfy the given predicate.

Now if we partition that table by country and run the query, it will not scan the … More
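The idea of restructuring data into subdirectories can be sketched in plain Python. This is only a simulation of the concept, not Hive itself: the sample rows, the function names `write_partitioned` and `read_partition`, and the file name `data.csv` are all hypothetical. The `country=<value>` directory naming mirrors the layout Hive uses for partitioned tables.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; the real table in the article is far larger.
rows = [
    {"employee_id": "1", "name": "Asha", "country": "IN"},
    {"employee_id": "2", "name": "Ravi", "country": "IN"},
    {"employee_id": "3", "name": "John", "country": "US"},
    {"employee_id": "4", "name": "Mary", "country": "US"},
    {"employee_id": "5", "name": "Liam", "country": "UK"},
]


def write_partitioned(rows, root):
    """Append each row to root/country=<value>/data.csv."""
    for row in rows:
        part_dir = root / f"country={row['country']}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with (part_dir / "data.csv").open("a", newline="") as f:
            # The partition column lives in the directory name,
            # so it is not repeated inside the file.
            csv.writer(f).writerow([row["employee_id"], row["name"]])


def read_partition(root, country, employee_id):
    """Read only the matching country directory and filter by employee_id."""
    path = root / f"country={country}" / "data.csv"
    matches, scanned = [], 0
    if not path.exists():
        return matches, scanned
    with path.open(newline="") as f:
        for emp_id, name in csv.reader(f):
            scanned += 1
            if emp_id == employee_id:
                matches.append(
                    {"employee_id": emp_id, "name": name, "country": country}
                )
    return matches, scanned


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_partitioned(rows, root)
    partition_dirs = sorted(p.name for p in root.iterdir())
    matches, rows_scanned = read_partition(root, "IN", "2")

print(partition_dirs)
print(matches, f"rows scanned: {rows_scanned} of {len(rows)}")
```

Because the lookup opens only the `country=IN` directory, it reads two rows instead of all five. That is the same saving partitioning aims for, at a much smaller scale.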