Handling special characters in Hive (using encoding properties)


When a Hive table reads a text file that contains non-English characters and we are not using the appropriate text encoding, those characters may be loaded as junk symbols (such as boxes or �). To get these characters in their original form, we need to use the correct character encoding. In this post, “Handling special characters in Hive (using encoding properties)”, we are going to learn how to read special characters in Hive using the encoding properties available with the TBLPROPERTIES clause.

To demonstrate this, we will use a dummy text file that is saved in ANSI text encoding and contains Spanish characters. We will also use the Microsoft Azure cloud platform to create an on-demand HDInsight cluster, which makes it easy to write Hive queries. We will upload the dummy text file to Azure Data Lake Storage and then read it using HiveQL. Let’s have a look at the content of the text file.

Sample data


The above text file contains four columns which are as below:

  1. UserName – Name of the user,
  2. Gender – Gender of the user,
  3. Age – Age of the user,
  4. About – A brief summary of the user, which contains Spanish (special) characters

In the above text file, we can see that the data in the “About” column contains a few non-English characters. This file (ANSI encoding) can be downloaded for practice from here – Click here to download the sample file.

Let’s have a look at the below Hive query which creates a database named testDB followed by a table named tbl_user_raw inside the testDB database. Also, notice that we are not using any encoding setting in the CREATE TABLE statement while creating the table in the below script.

CREATE DATABASE IF NOT EXISTS testdb;
USE testDB;
 
DROP TABLE IF EXISTS testDB.tbl_user_raw;
CREATE EXTERNAL TABLE IF NOT EXISTS testdb.tbl_user_raw
(
    username string,
    gender string,
    age string,
    about string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION 'adl://<Data-Lake-Store>.azuredatalakestore.net/<Folder-Name>/';

When we display the data from tbl_user_raw (the Hive table without an appropriate encoding setting), it looks as below:

SELECT * FROM tbl_user_raw;

Table data with junk characters

In the above image, we can see that the non-English characters have been converted into junk characters in the tbl_user_raw table.
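To see why this happens, it helps to look at the bytes outside Hive. The following is a minimal Python sketch, not part of the original setup. It assumes that the reader treats the file as UTF-8 (Hive's default text encoding, as far as we know) and uses a made-up Spanish sentence in place of the real file content.

```python
# Hypothetical sample value; the real file's content is not reproduced here.
original = "Él vive en España y habla español"

# What an ANSI (Windows-1252) editor writes to disk: one byte per character.
raw_bytes = original.encode("cp1252")

# A reader that assumes UTF-8 cannot interpret the lone high bytes
# (0xC9 for É, 0xF1 for ñ), so each one becomes U+FFFD, the replacement character.
misread = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the file was actually written in restores the text.
correct = raw_bytes.decode("cp1252")

print(misread)
print(correct)
```

The bytes on disk are the same in both cases; only the encoding used to interpret them differs.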

Handling special characters in Hive

To read this file with these special characters in their original form, first, we need to find the original text encoding of the text file. To do this, we can simply open this file in Notepad++ editor and it will display the actual file encoding at the bottom-right corner as below:

Get text file encoding
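If Notepad++ is not at hand, a rough check can also be scripted. The sketch below is only a heuristic under our own assumptions: the function name guess_encoding and the candidate list are hypothetical, and it simply returns the first candidate that decodes the bytes without errors.

```python
def guess_encoding(data: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Return the first candidate encoding that decodes data without errors."""
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on any invalid byte
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


# Example with in-memory bytes; for a real file, read it with open(path, "rb").
print(guess_encoding("señor".encode("cp1252")))
print(guess_encoding("señor".encode("utf-8")))
```

UTF-8 is tried first because it is strict about byte sequences, whereas a single-byte code page such as Windows-1252 accepts almost any input. A match on the second candidate is therefore a hint, not proof.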


Next, we can write a query with the TBLPROPERTIES clause that defines the serialization.encoding setting, so that Hive interprets these special characters in their original form. Below, we are creating a new Hive table, tbl_user, to read the above text file with all the special characters:

DROP TABLE IF EXISTS testDB.tbl_user;
CREATE EXTERNAL TABLE IF NOT EXISTS testdb.tbl_user
(
    username string,
    gender string,
    age string,
    about string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION 'adl://<Data-Lake-Store>.azuredatalakestore.net/<Folder-Name>/'
TBLPROPERTIES('serialization.encoding'='windows-1252');

Output:

Read special characters in Hive table


In the above query, the only change is the added TBLPROPERTIES clause, which tells Hive the original encoding of the file. TBLPROPERTIES provides the “serialization.encoding” setting, which can be used to set the required character set when reading data into a Hive table.

We have added the TBLPROPERTIES('serialization.encoding'='windows-1252') line to define the actual text encoding of the file, which is ANSI. ANSI is a generic term for the default code page on a Windows system; on English and Western European systems, that code page is Windows-1252. To read more about the Windows-1252 code page, click here.
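Conceptually, the encoding setting asks the reader to decode the file's bytes from Windows-1252 rather than from UTF-8. The Python sketch below illustrates that conversion outside Hive. The helper name to_utf8 is our own, and this is an illustration of the idea, not a description of how Hive implements it internally.

```python
import codecs


def to_utf8(data: bytes, source_encoding: str = "windows-1252") -> bytes:
    """Re-encode bytes from source_encoding into UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Python resolves the name "windows-1252" to its cp1252 codec.
codec_name = codecs.lookup("windows-1252").name

# In Windows-1252, byte 0xF1 is ñ; in UTF-8 the same letter takes two bytes.
converted = to_utf8(b"ma\xf1ana")

print(codec_name)
print(converted)
```

The same letter occupies one byte in Windows-1252 and two bytes in UTF-8, which is why the reader has to be told which encoding the file uses.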

Thanks for reading. Please share your input.



Gopal Krishna Ranjan

About Gopal Krishna Ranjan

I am Gopal Krishna Ranjan, with 8 years of industry experience in software development. I have hands-on experience in database, data warehouse, big data and cloud technologies, and have implemented end-to-end database, data warehouse, big data and cloud solutions. I have worked extensively on SQL Server, Python, Hadoop, Hive, Spark, Azure, Machine Learning, and MSBI (SSAS, SSIS, and SSRS). I also have good experience in Windows and web application development using ASP.Net and C#.
