Hive

ERROR Utils: Aborting task java.io.IOException: Failed to connect to – Local Spark

In this post, we will discuss an error/warning message “java.io.IOException: Failed to connect to”. This error keeps coming when we try to execute a hive query from spark-shell using spark SQL. This error occurs when Spark tries to execute a task in local mode (pseudo-distributed mode). It is caused because of a connection exception. The […]

ERROR Utils: Aborting task java.io.IOException: Failed to connect to – Local Spark Read More »

Get HDFS file location of Hive table records as column

In this post, we will learn how we can extract the physical HDFS file location path of the Hive table as a column along with other columns of the table. We will demonstrate this using HiveQL, PySpark, and Scala. We can create the Hive tables as internal or external tables. So, if we create an

Get HDFS file location of Hive table records as column Read More »

Read and write data into Hive table from Spark using PySpark

In this post, we will learn how we can read and write the data to a Hive table from a Spark dataframe. Once we have the Hive table data being read into a dataframe, we can apply Spark transformations on that data. Finally, we can write back the data to the the Hive table. We

Read and write data into Hive table from Spark using PySpark Read More »

Sort By, Order By, Distribute By, and Cluster By in Hive

This post will briefly discuss the difference and similarity between Sort By, Order By, Distribute By, and Cluster By in hive queries. This is one of the most important questions being asked in Big data/Hadoop interviews. These Sort By, Order By, Distribute By, and Cluster By clauses are available in the hive query language and

Sort By, Order By, Distribute By, and Cluster By in Hive Read More »

Data compression in Hive – An Introduction to Hadoop Data Compression

Data compression is a technique that encodes the original data in such a way so that it can be represented with fewer bits on the disk. The data compression process is used to reduce the size of the data files on the disk. We know that the Hadoop framework is meant for large scale data

Data compression in Hive – An Introduction to Hadoop Data Compression Read More »

Understanding Map join in Hive

Apache Hive is a big data query language which is used to read, transform and write large datasets in a distributed environment. It has a SQL like syntax which gets translated into a MapReduce job in order to execute on Hadoop clusters. In Hadoop ecosystem, we use Hive for batch processing to extract, transform and

Understanding Map join in Hive Read More »

Partitioning and Bucketing in Hive

In this article, we will discuss two important concepts “Partitioning and Bucketing” in Hive. These are used to improve query performance and it is important to understand them so that you can apply them efficiently. So let’s start with Partitioning. Partitioning in Hive Partitioning is a technique which is used to enhance query performance in

Partitioning and Bucketing in Hive Read More »

Handling special characters in Hive (using encoding properties)

In case we are reading a text file in a Hive table which contains non-English characters and we are not using the appropriate text encoding, these non-English characters might be loaded as junk symbols (like boxes – �). To get these characters in their original form, we need to use the correct character encoding. In this

Handling special characters in Hive (using encoding properties) Read More »

Skip header and footer rows in Hive

In this post “Skip header and footer rows in Hive“, we are going to learn that how we can ignore few header and footer records in Hive without loading or reading these records in another table or in a view temporarily. If you want to read more about Hive, visit my post “Preserve Hive metastore in

Skip header and footer rows in Hive Read More »

Preserve Hive metastore in Azure HDInsight

In this blog “Preserve Hive metastore in Azure HDInsight“, we are going to learn how we can preserve the hive metadata while working with the Azure HDInsight services. Microsoft Azure HDInsight is an on-demand managed Open source Big Data analytics service for the enterprises. We can provision clusters as per the demand in few minutes,

Preserve Hive metastore in Azure HDInsight Read More »