In this post “Skip header and footer rows in Hive“, we are going to learn that how we can ignore few header and footer records in Hive without loading or reading these records in another table or in a view temporarily. If you want to read more about Hive, visit my post “Preserve Hive metastore in Azure HDInsight” which explains Hive QL in detail.
Skip header and footer records in Hive
We can ignore N number of rows from top and bottom from a text file without loading that file in Hive using TBLPROPERTIES clause. The TBLPROPERTIES clause provides various features which can be set as per our need. It can be used in this scenario to handle the files which are being generated with additional header and footer records. Let’s have a look at the below sample file:
Now assume that we are dealing with above text file which is being generated by an automated tool which adds 3 additional rows at the very beginning of the file and 2 additional rows at the end of the file. These extra rows are added to provide additional details about the file itself.
The first 3 rows contain the below details:
- Name of the tool which generates the file,
- Date along with time on which the file was generated and
- The file number (a sequential number which resets at the start of every month) for that particular month.
Similarly, the last 2 rows contain the below details:
- The total number of rows in the file (which can be used to verify missing records) and
- The date along with time on which the file was last modified.
If you want, you can download this sample file from here and use it for testing purpose in your environment.
To create an external Hive table which ignores these extra rows and reads only the actual data into a Hive table, we are going to use Azure cloud platform with HDInsight cluster in this demo. However, you can use other cloud environments like Amazon or Google cloud platform to instantiate a Hadoop cluster and run the Hive QL on top of that to get the similar output.
Now, upload this file to an Azure storage or Azure DataLake storage or to an Amazon S3 storage if you are using Amazon AWS platform. Next, instantiate a new Hadoop cluster where we will be writing Hive QL to read this file in an external table by ignoring these additional headers and footers. In this example, we are instantiating an Azure HDInsight cluster which is reading this sample file from an Azure blob storage. Below is the Hive QL which reads the above text file without considering the additional records:
CREATE DATABASE IF NOT EXISTS testdb; USE testDB; DROP TABLE IF EXISTS testDB.sample_table; CREATE EXTERNAL TABLE IF NOT EXISTS testdb.sample_table ( log_date string, row_num int, pkt_number string, pkt_color string, pkt_size string, pkt_quality string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 'wasb://<container-name>@<storage-account-name>.blob.core.windows.net/' TBLPROPERTIES('skip.header.line.count'='3', 'skip.footer.line.count'='2');
In the above query, we are creating a database named “testdb” and then we are using it to create a table named “sample_table“. In CREATE EXTERNAL TABLE statement, we are using the TBLPROPERTIES clause with “skip.header.line.count” and “skip.footer.line.count” to exclude the unwanted headers and footers from the file. In the above output, we can see that we don’t have any unwanted row.
Here, I would also like to highlight that in an earlier version of Hive there was an issue with footer properties which has been fixed now. You can visit this JIRA board to get more detail on this.
Thanks for the reading. Please share your inputs.