Data compression encodes data so that it can be represented with fewer bits, which reduces the size of data files on disk. The Hadoop framework is meant for large-scale data processing (Big Data processing), which involves many data files stored on HDFS or other supported file systems. Data compression therefore helps in two ways: it reduces storage requirements, and it reduces the amount of data transferred between mappers and reducers, which usually happens over the network. In Hadoop, data compression can be implemented using Hive or any other MapReduce component. In this post, we will discuss the widely used Hive data compression formats, or codec (compressor/decompressor) schemes.
Data compression in Hive
Hive supports several codec schemes for compressing and decompressing data. A codec is a program that implements data compression and decompression, and Hive lets us apply one from the Hive query language. Let’s discuss some widely used Hive compression formats:
Hive data compression codecs:
GZIP compression: Gzip (GNU zip) is a compression utility based on the DEFLATE algorithm. The generated files have a .gz file extension. Hadoop provides the “org.apache.hadoop.io.compress.GzipCodec” class for gzip compression. Gzip offers a high compression ratio at the cost of high CPU utilization during compression and decompression, so it can be a good choice for data that is not accessed very frequently. The gzip format is not splittable, which can be a bottleneck for MapReduce tasks, especially when we need to process large data files. Gzip-compressed data can be read directly into a table stored as a text file: the compression is detected automatically and the files are decompressed on the fly.
BZIP2 compression: Bzip2 achieves a higher compression ratio than gzip, but its compression and decompression are slower and more CPU intensive. The generated files have a .bz2 file extension and are splittable. Because bzip2 is much slower, it is not suitable for data files that are used frequently, but it can be a better choice for data that is rarely accessed, e.g. for archiving. The Hadoop class for bzip2 is “org.apache.hadoop.io.compress.BZip2Codec”. Bzip2-compressed files can also be read directly into a table stored as a text file: the compression is detected automatically and the files are decompressed on the fly.
LZO compression: LZO provides a lower compression ratio than bzip2 and gzip in exchange for faster compression and decompression. It supports overlapping compression along with in-place decompression. The file extension is .lzo, and the generated files are not splittable by default. However, we can build an index for the compressed files after they are written, which makes them splittable. The Hadoop class for LZO compression is “com.hadoop.compression.lzo.LzopCodec”.
SNAPPY compression: Snappy was created by Google and is written in C++. It focuses on compression and decompression speed, so it provides a lower compression ratio than bzip2 and gzip. It generates files with a .snappy extension, and these files are not splittable when Snappy is used with plain text files. The Hadoop class for Snappy compression is “org.apache.hadoop.io.compress.SnappyCodec”.
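As a small illustration of the automatic decompression described for gzip and bzip2 above, the following HiveQL sketch loads a gzip-compressed file into a plain text table. The table name raw_logs and the path /tmp/weblogs/access.log.gz are hypothetical placeholders, and the sketch assumes a tab-delimited file on the local file system.

```sql
-- Hypothetical table: a single STRING column keeps the sketch simple.
CREATE TABLE raw_logs (line STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Hypothetical local path. The .gz extension is what lets Hive pick the
-- gzip codec; a .bz2 file would be loaded with the same statement.
LOAD DATA LOCAL INPATH '/tmp/weblogs/access.log.gz' INTO TABLE raw_logs;

-- The rows are decompressed transparently when the table is queried.
SELECT COUNT(*) FROM raw_logs;
```

Since gzip is not splittable, a single large .gz file loaded this way is read by one mapper, so this approach tends to suit many moderately sized files better than one very large file.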
The table below compares these widely used Hive codecs:
Compression stages in Hive
Hive supports data compression at various stages. We can compress the files at these stages:
Compressed input files: We can read compressed files into Hive tables directly, which generally improves query performance.
Compressed intermediate results: We can also enable compression for the intermediate output files that get generated during a MapReduce task.
Compressed output files: In order to save space, we can generate compressed files as output files.
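The intermediate and output stages listed above are typically switched on with session-level properties. The sketch below is one possible setup, not a recommended configuration: it assumes a Hadoop 2 or later cluster (older releases use mapred.* property names), assumes the Snappy native libraries are installed, and uses a hypothetical table some_table and a hypothetical output directory /tmp/compressed_out.

```sql
-- Intermediate results: compress data passed between the stages of a query
-- and between mappers and reducers. A fast codec such as Snappy is a common
-- choice here because this data is short-lived.
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Final output: compress the files written by the query. Gzip is used here
-- as an example of trading CPU time for a smaller footprint.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Hypothetical query: the files written to the target directory should
-- carry a .gz extension once the settings above are in effect.
INSERT OVERWRITE DIRECTORY '/tmp/compressed_out'
SELECT * FROM some_table;
```

Using different codecs for the two stages is a deliberate choice in this sketch: intermediate data is written and read once, so speed matters most, while final output may sit on disk for a long time, so the compression ratio matters more.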
Thanks for reading. Please share your thoughts in the comments section.