In this article, we will discuss two important concepts “Partitioning and Bucketing” in Hive. These are used to improve query performance and it is important to understand them so that you can apply them efficiently. So let’s start with Partitioning.
Partitioning in Hive
Partitioning is a technique which is used to enhance query performance in hive. It is done by restructuring data into sub directories. Let us understand this concept with an example.
Suppose we have a large file of 10 GB having geographical data for a customer. Now we want to extract a record for a particular country and for a particular employeId. In order to do so, It will perform a table scan to read all the rows and then pick only those records that satisfy the given predicate.
Now if we partition that table by country and run the query, it will not scan the … More