Category: Big Data/Cloud


Partitioning and Bucketing in Hive

In this article, we will discuss two important concepts in Hive: partitioning and bucketing. Both are used to improve query performance, and it is important to understand them so that you can apply them efficiently. Let’s start with partitioning.

Partitioning in Hive

Partitioning is a technique used to improve query performance in Hive. It works by restructuring a table's data into subdirectories. Let us understand this concept with an example.

Suppose we have a large 10 GB file containing geographical data for a customer. Now we want to extract a record for a particular country and a particular employeeId. To do so, Hive performs a full table scan: it reads all the rows and then picks only those records that satisfy the given predicate.

Now if we partition that table by country and run the query, it will not scan the … More
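The idea behind partition pruning can be imitated outside Hive. The following is a minimal Python sketch, not Hive itself: the sample rows, the `country=<value>` directory naming, and the helper names `write_partitioned` and `lookup` are assumptions made for illustration. It writes rows into one subdirectory per country and then answers a lookup by reading only the matching subdirectory.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows: (employee_id, name, country)
ROWS = [
    (1, "Asha", "IN"),
    (2, "Bruno", "BR"),
    (3, "Chen", "CN"),
    (4, "Divya", "IN"),
]


def write_partitioned(base: Path, rows) -> None:
    """Write each row under a country=<value> subdirectory."""
    for emp_id, name, country in rows:
        part_dir = base / f"country={country}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "data.csv", "a", newline="") as f:
            csv.writer(f).writerow([emp_id, name])


def lookup(base: Path, country: str, emp_id: int):
    """Read only the matching partition; return (matches, rows_read)."""
    part_file = base / f"country={country}" / "data.csv"
    matches, rows_read = [], 0
    if part_file.exists():
        with open(part_file, newline="") as f:
            for row in csv.reader(f):
                rows_read += 1
                if int(row[0]) == emp_id:
                    matches.append((int(row[0]), row[1], country))
    return matches, rows_read


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    write_partitioned(base, ROWS)
    result, scanned = lookup(base, "IN", 4)
    # Only the two rows stored under country=IN are read, not all four.
    print(result, scanned)
```

In this toy layout the lookup touches two of the four rows. On a real table the saving depends on how evenly the data is spread across partition values.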


Quick guide to Bash commands for Big Data Analysis

In this post, “Quick guide to Bash commands for Big Data Analysis”, we are going to explore some basic Bash/Linux commands which are very useful in data analysis. Bash is the command-line interpreter (shell) of the GNU operating system, a free UNIX-like OS, and it typically runs in a terminal window. It accepts the commands submitted by the user, interprets them, and asks the kernel to carry them out. If we want to execute a batch of Bash commands in one go, we can put them in a text file, save it with a .sh extension, and then run that file as a script.

Since Hadoop was developed primarily for Linux, production environments mostly use Linux-based machines. Therefore, in order to interact with Hadoop clusters, we must have a good understanding of … More


Handling special characters in Hive (using encoding properties)

If we read a text file containing non-English characters into a Hive table without using the appropriate text encoding, those characters may be loaded as junk symbols (such as boxes or the � symbol). To get these characters in their original form, we need to use the correct character encoding. In this post, “Handling special characters in Hive (using encoding properties)”, we are going to learn how to read special characters in Hive using the encoding properties available with the TBLPROPERTIES clause.
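The junk symbols come from a mismatch between the encoding a file was written in and the encoding used to read it. The short Python sketch below reproduces the effect outside Hive. The sample sentence is made up, and treating "ANSI" as Windows code page 1252 is an assumption; the actual file may use a different code page.

```python
# Hypothetical Spanish sample text (not taken from the original post).
text = "Señor Muñoz vive en Málaga"

# Assumption: "ANSI" here means Windows code page 1252.
ansi_bytes = text.encode("cp1252")

# Reading with the wrong encoding: bytes that are invalid in UTF-8
# are replaced by U+FFFD, the � symbol.
wrong = ansi_bytes.decode("utf-8", errors="replace")

# Reading with the encoding the file was written in recovers the text.
right = ansi_bytes.decode("cp1252")

print(wrong)
print(right)
```

The same principle applies when a table is declared: the reader has to be told which encoding the underlying bytes use.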

To demonstrate this, we will use a dummy text file which is in ANSI text encoding and contains Spanish characters. We will also use the Microsoft Azure cloud platform to create an on-demand HDInsight cluster, which makes it easy to write Hive queries. We will upload the dummy text file to Azure Data Lake Storage and then we will … More


Skip header and footer rows in Hive

In this post, “Skip header and footer rows in Hive”, we are going to learn how to ignore a few header and footer records in Hive without first loading or reading these records into another table or a temporary view. If you want to read more about Hive, visit my post “Preserve Hive metastore in Azure HDInsight”, which explains Hive QL in detail.

Skip header and footer records in Hive

We can ignore N rows at the top and bottom of a text file, without first loading that file into Hive, by using the TBLPROPERTIES clause. The TBLPROPERTIES clause provides various settings which can be adjusted as needed. In this scenario, it can be used to handle files that are generated with additional header and footer records. Let’s have a look at the below sample file:

Now assume that we are dealing with … More
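In Hive, the table properties for this are `skip.header.line.count` and `skip.footer.line.count`. Their effect can be imitated with a small Python sketch. The sample lines and the function name `skip_header_footer` below are hypothetical and are not the sample file from the post.

```python
# Hypothetical file content: two header lines, three data rows, one footer line.
SAMPLE = [
    "Report generated on 2020-01-01",
    "id,name,country",
    "1,Asha,IN",
    "2,Bruno,BR",
    "3,Chen,CN",
    "Total rows: 3",
]


def skip_header_footer(lines, header_count, footer_count):
    """Return the data rows, dropping leading header and trailing footer lines."""
    end = len(lines) - footer_count
    # If the counts consume the whole file, nothing is left to return.
    return lines[header_count:end] if end > header_count else []


data_rows = skip_header_footer(SAMPLE, header_count=2, footer_count=1)
print(data_rows)
```

Only the three data rows remain, which is what a query against a table configured with those two properties would see.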


Preserve Hive metastore in Azure HDInsight

In this blog post, “Preserve Hive metastore in Azure HDInsight”, we are going to learn how to preserve the Hive metadata while working with Azure HDInsight. Microsoft Azure HDInsight is an on-demand, managed, open-source Big Data analytics service for enterprises. We can provision a cluster on demand in a few minutes, perform the computations, and then shut it down to avoid charges, so we pay only for what we use. You can visit this link to know more about Azure HDInsight.

What is Hive?

Apache Hive provides a SQL-like query language for Big Data that serves as an abstraction over MapReduce jobs. A Hive query is converted into an equivalent MapReduce job, so there is no need to write low-level code. This greatly increases developer productivity. If you want to read more about Hive … More
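To give a feel for what that abstraction hides, here is a rough Python sketch of the map, shuffle, and reduce steps behind a hypothetical query such as `SELECT country, COUNT(*) FROM customers GROUP BY country`. The table, the rows, and the function names are assumptions for illustration; this is not the job Hive actually generates.

```python
from collections import defaultdict

# Hypothetical input rows: (country, employee_id)
records = [("IN", 1), ("BR", 2), ("IN", 4), ("CN", 3)]


def map_phase(rows):
    """Emit a (key, 1) pair for every input row."""
    for country, _emp_id in rows:
        yield country, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key to get the per-country count."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

A single line of SQL stands in for all three functions, which is the productivity gain described above.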