Category : Hive


Partitioning and Bucketing in Hive

In this article, we will discuss two important concepts “Partitioning and Bucketing” in Hive. These are used to improve query performance and it is important to understand them so that you can apply them efficiently. So let’s start with Partitioning.

Partitioning in Hive

Partitioning is a technique which is used to enhance query performance in hive. It is done by restructuring data into sub directories. Let us understand this concept with an example.

Suppose we have a large file of 10 GB having geographical data for a customer. Now we want to  extract a record for a particular country and for a particular employeId. In order to do so, It will perform a table scan to read all the rows and then pick only those records that satisfy the given predicate.

Now if we partition that table by country and run the query, it will not scan the … More


Handling special characters in Hive (using encoding properties) 1

In case we are reading a text file in a Hive table which contains non-English characters and we are not using the appropriate text encoding, these non-English characters might be loaded as junk symbols (like boxes – �). To get these characters in their original form, we need to use the correct character encoding. In this post “Handling special characters in Hive (using encoding properties)“, we are going to learn that how we can read special characters in Hive using encoding properties available with TBLPROPERTIES clause.

To demonstrate it, we will be using a dummy text file which is in ANSI text encoding format and contains Spanish characters. Also, we will be using Microsoft Azure cloud platform to instantiate an on-demand HDInsight cluster that makes it easy to write Hive queries. We will upload the dummy text file to an Azure Data Lake Storage and then we will … More


Skip header and footer rows in Hive 1

In this post “Skip header and footer rows in Hive“, we are going to learn that how we can ignore few header and footer records in Hive without loading or reading these records in another table or in a view temporarily. If you want to read more about Hive, visit my post “Preserve Hive metastore in Azure HDInsight” which explains Hive QL in detail.

Skip header and footer records in Hive

We can ignore N number of rows from top and bottom from a text file without loading that file in Hive using TBLPROPERTIES clause. The TBLPROPERTIES clause provides various features which can be set as per our need. It can be used in this scenario to handle the files which are being generated with additional header and footer records. Let’s have a look at the below sample file:

Now assume that we are dealing with … More


Preserve Hive metastore in Azure HDInsight 2

In this blog “Preserve Hive metastore in Azure HDInsight“, we are going to learn how we can preserve the hive metadata while working with the Azure HDInsight services. Microsoft Azure HDInsight is an on-demand managed Open source Big Data analytics service for the enterprises. We can provision clusters as per the demand in few minutes, perform the computations, and then we can shut it down to avoid charges. We pay as per the usage only. You can visit this link to know more about Azure HDInsight.

What is Hive?

Apache Hive is a SQL like Big Data query language which is used as an abstraction for the map reduce jobs. The Hive query seamlessly converts into an equivalent map reduce job without the need to write low-level code. This increases the productivity of a developer to a great extent. If you want to read more about Hive … More