In this blog “Preserve Hive metastore in Azure HDInsight“, we are going to learn how we can preserve the hive metadata while working with the Azure HDInsight services. Microsoft Azure HDInsight is an on-demand managed Open source Big Data analytics service for the enterprises. We can provision clusters as per the demand in few minutes, perform the computations, and then we can shut it down to avoid charges. We pay as per the usage only. You can visit this link to know more about Azure HDInsight.
What is Hive?
Apache Hive is a SQL like Big Data query language which is used as an abstraction for the map reduce jobs. The Hive query seamlessly converts into an equivalent map reduce job without the need to write low-level code. This increases the productivity of a developer to a great extent. If you want to read more about Hive QL, visit this link.
Hive Meta Store
A Hive metastore stores the metadata for all the objects (like tables, column names, data types, views, partitions, and etc.) created in HiveQL. Hive supports two ways to set up a metastore for an HDInsight cluster. As we cannot keep running a cluster all the time (because of the cost associated with the uptime), we should choose the metastore option as per the need during HDInsight cluster instantiation process. There are two ways we can store hive’s metastore in an HDInsight cluster:
- Default metastore
- Custom metastore
Let’s discuss both of these types in detail.
1. Default metastore
The default metastore gets created in the cluster itself. If a custom metastore is not configured during the cluster instantiation, Hive uses in cluster metastore by default. Here are some features of the default metastore:
- This can be a better option if we have a small set of objects and we don’t want to preserve their metadata.
- Default metastore is tied to the cluster’s life itself and we don’t need to pay any extra money for this storage.
- All the metadata gets deleted when we terminate/delete the cluster as it is stored on the cluster itself.
- If we opt to use default metastore, we cannot share it with other clusters.
2. Custom metastore
HDInsight allows us to choose Azure SQL DB as a custom metastore for a cluster. It separates the metadata from the cluster which is then used for computation purposes only. Here are some features of the custom metastore:
- We can use an Azure SQL DB as a metastore for the cluster.
- We can share the metastore with other clusters as well.
- We can preserve the metadata of the metastore which helps us to delete and create clusters as needed.
- The compute and metadata are put separately.
- We can backup the metastore like other SQL databases.
- We need to pay some extra cost for the Azure SQL DB in addition to the cluster cost as the storage is different than the cluster’s storage.
- The location of the Azure SQL DB should be same as the location of Azure HDInsight cluster.
Preserve Hive metastore in Azure HDInsight
To provision an Azure HDInsight cluster using “Create a resource” wizard:
- Login to Azure portal
- Click on the “New” button link located at the top left corner
- In next menu, click on “Internet of Things”
- Scroll down to the bottom, and click on “HDInsight”
Alternatively, you can click on New and search for HDInsight in the search box.
A create HDInsight wizard will appear. We can choose, “Quick create“, or “Custom” option. In this example, I am using “Quick create” option. In “Basic” step, we need to provide basic settings for the HDInsight cluster as in below image.
Next, in “Storage‘ step, to preserve the Hive metastore, we need to choose an Azure SQL Database (which resides in the same region in which we are creating the cluster) to store the Hive metadata as in below image.
At step 5, we can choose a SQL DB to store the Hive metastore.
Finally, in “Summary” step, make sure that the “Metastores” property has a value “Metastore(s) configured” and then click on “Create” button.
If the existing SQL database has been previously configured with any other cluster, we get access to all the objects created by the previous cluster.
The database uses SQL tables to store different kind of information related to the Hive metadata. It creates TBLS table to store all the table information, DBS table to store all the database related information and so on.
Let’s have a quick look at the database objects created to store Hive metastore in the SQL Database.
Thanks for reading. Please share your input in the comment section.