Data compression is a technique that encodes the original data in such a way so that it can be represented with fewer bits on the disk. The data compression process is used to reduce the size of the data files on the disk. We know that the Hadoop framework is meant for large scale data processing (Big Data processing) which includes lots of data files stored on HDFS or supported file systems. So data compression can be very helpful in reducing storage requirements, and in reducing the amount of data to be transferred between mappers and reducers which usually occurs over the network. In Hadoop, data compression can be implemented using Hive or any other MapReduce component. In this post, we will discuss the widely used HiveQL data compression formats or codec (compressor/decompressor) schemes.
Data compression in Hive
HiveQL supports different codec schemes that are used to compress … More