Category: Hadoop


Data compression in Hive – An Introduction to Hadoop Data Compression

Data compression is a technique that encodes data so that it can be represented with fewer bits on disk, which reduces the size of data files. The Hadoop framework is designed for large-scale (Big Data) processing, which involves many data files stored on HDFS or other supported file systems. Compression is therefore very helpful in reducing storage requirements and in shrinking the amount of data transferred between mappers and reducers, a transfer that usually happens over the network. In Hadoop, data compression can be implemented through Hive or any other MapReduce component. In this post, we will discuss the data compression formats, or codecs (compressor/decompressor schemes), widely used with HiveQL.
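As a rough sketch of how this looks in practice, compression of both intermediate (map output) data and final job output can be switched on from a Hive session with a few configuration properties. The properties below are standard Hive/Hadoop settings, and Snappy is shown only as one common codec choice, not a prescription:

-- Compress intermediate data shuffled between mappers and reducers
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final output written by the job
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;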

Data compression in Hive

HiveQL supports different codec schemes that are used to compress … More
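For quick reference, the codec schemes most often used with Hive correspond to the following Hadoop codec classes; whether each one is available depends on how the cluster was built:

org.apache.hadoop.io.compress.GzipCodec     -- Gzip (DEFLATE); good ratio, not splittable
org.apache.hadoop.io.compress.BZip2Codec    -- Bzip2; splittable, but slower
org.apache.hadoop.io.compress.SnappyCodec   -- Snappy; very fast, not splittable
org.apache.hadoop.io.compress.DefaultCodec  -- DEFLATE (zlib)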


Understanding Map join in Hive

Apache Hive is a big data query engine used to read, transform, and write large datasets in a distributed environment. It offers a SQL-like syntax that is translated into MapReduce jobs for execution on a Hadoop cluster. In the Hadoop ecosystem, we use Hive for batch processing to extract, transform, and load data into a data warehouse or a file system such as HDFS, Amazon S3, Azure Blob Storage, or Azure Data Lake. However, Hive is not meant for OLTP workloads because of its high latency. In this post, we are going to learn about the map join, which can be used to improve the performance of a Hive query. We will also discuss the parameters required to enable the map join, along with its limitations.
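As a quick sketch, an automatic map join can be enabled from a Hive session with the settings below. The property names are standard Hive settings; the size threshold shown is only an illustrative value:

-- Let Hive convert a common join into a map join when one side is small enough
SET hive.auto.convert.join=true;
-- Tables smaller than this threshold (in bytes) are loaded into memory; 25 MB here is illustrative
SET hive.mapjoin.smalltable.filesize=25000000;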

What is Map join in Hive

The join clause in Hive is used to combine records from two tables … More
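For illustration, a map join can also be requested explicitly with the MAPJOIN hint; the table and column names below (orders, customers) are hypothetical, and how the hint is honoured can vary by Hive version and configuration:

-- Ask Hive to load the smaller table (customers) into memory on each mapper
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;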