Hadoop

Dynamically Create Spark DataFrame Schema from Pandas DataFrame

Apache Spark has become a powerful tool for processing large-scale data in a distributed environment. One of its key components is the Spark DataFrame, which offers a higher-level abstraction over distributed data and enables efficient manipulation of large datasets. When working within […]
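
The post itself maps pandas dtypes in Python; as a rough Scala illustration of the same idea, a Spark schema can be assembled dynamically from hypothetical (column name, dtype string) pairs:

```scala
import org.apache.spark.sql.types._

// Hypothetical mapping from pandas-style dtype strings to Spark types.
def toSparkType(dtype: String): DataType = dtype match {
  case "int64"          => LongType
  case "int32"          => IntegerType
  case "float64"        => DoubleType
  case "bool"           => BooleanType
  case "datetime64[ns]" => TimestampType
  case _                => StringType // pandas "object" and anything unmapped
}

// Build a StructType dynamically from (column name, dtype) pairs,
// e.g. as reported by pandas_df.dtypes on the Python side.
def buildSchema(columns: Seq[(String, String)]): StructType =
  StructType(columns.map { case (name, dtype) =>
    StructField(name, toSparkType(dtype), nullable = true)
  })

val schema = buildSchema(Seq("id" -> "int64", "price" -> "float64", "name" -> "object"))
```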

Difference between Hadoop 1.x, Hadoop 2.x and Hadoop 3.x

We know that Apache Hadoop is a framework that allows us to perform data processing in a distributed way on very large datasets using commodity computers. That is why this framework is highly scalable and can scale up from a single machine to thousands of machines. Most importantly, Hadoop is open source and provides […]

Fill null with the next not null value – Spark Dataframe

In the previous post, we discussed how to fill a null value with the previous not-null value in a Spark DataFrame. We have also discussed how to extract the non-null values per group from a Spark DataFrame. Now, in this post, we will learn how to fill a null value with the next available not-null value […]
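
A minimal sketch of one common way to do this, using first() with ignoreNulls over a forward-looking window (the data and column names are hypothetical; runnable in spark-shell, where spark and its implicits are already in scope):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// Hypothetical data: an ordering column and a value column with gaps.
val df = Seq(
  (1, Some("a")), (2, None), (3, None), (4, Some("b")), (5, None)
).toDF("id", "value")

// Look from the current row forward and take the first non-null value.
// Note: a window without partitionBy pulls all rows into one partition.
val w = Window.orderBy("id").rowsBetween(Window.currentRow, Window.unboundedFollowing)

df.withColumn("value_filled", first("value", ignoreNulls = true).over(w)).show()
// ids 2 and 3 are filled with "b"; id 5 stays null (nothing follows it).
```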

Fill null with the previous not null value – Spark Dataframe

In the previous post, we discussed how to extract the non-null values per group from a Spark DataFrame. Now, in this post, we will learn how to fill the null values with the previous not-null value in a Spark DataFrame, using the forward-fill method (carrying the last non-null value forward). To demonstrate this with the help of an example, we will […]
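
This is the mirror image of the previous sketch: last() with ignoreNulls over a backward-looking window (again hypothetical data, runnable in spark-shell):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val df = Seq(
  (1, Some("a")), (2, None), (3, Some("b")), (4, None)
).toDF("id", "value")

// Look from the start of the frame up to the current row and take the
// last non-null value, i.e. carry the previous value forward.
val w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("value_filled", last("value", ignoreNulls = true).over(w)).show()
// id 2 is filled with "a" and id 4 with "b".
```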

ERROR Utils: Aborting task java.io.IOException: Failed to connect to – Local Spark

In this post, we will discuss the error/warning message “java.io.IOException: Failed to connect to”. This error keeps appearing when we try to execute a Hive query from spark-shell using Spark SQL. It occurs when Spark tries to execute a task in local mode (pseudo-distributed mode) and is caused by a connection exception. The […]
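
A commonly suggested workaround for this class of failure is to pin the driver's bind address; this is an assumption about the environment, not necessarily the post's exact fix:

```scala
// With spark-shell, the equivalent is:
//   spark-shell --conf spark.driver.bindAddress=127.0.0.1 \
//               --conf spark.driver.host=localhost
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-hive-query")
  .master("local[*]")
  .config("spark.driver.bindAddress", "127.0.0.1") // bind the driver locally
  .config("spark.driver.host", "localhost")        // advertise a reachable host
  .enableHiveSupport()                              // needed to run Hive queries
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```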

Get the first non-null value per group Spark dataframe

Suppose we need to get the first non-null value from each partition of a DataFrame. Specifically, we want only the first not-null value from each column, regardless of which row it comes from. That means a not-null value of column A from row 5 can be stitched together with another not-null value of column B from […]
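
One way to express this is a grouped aggregation with first() and ignoreNulls, which picks each column's first non-null value independently (hypothetical data, runnable in spark-shell; note that without an explicit ordering, "first" is not deterministic):

```scala
import org.apache.spark.sql.functions.first

// Per group, columns a and b have nulls in different rows.
val df = Seq(
  ("g1", None,      Some(10)),
  ("g1", Some("x"), None),
  ("g2", Some("y"), Some(20))
).toDF("grp", "a", "b")

df.groupBy("grp").agg(
  first("a", ignoreNulls = true).as("first_a"),
  first("b", ignoreNulls = true).as("first_b")
).show()
// g1 -> (x, 10): "x" comes from the second row, 10 from the first.
```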

Scala Option, Some, None – Exception and Null handling

In the previous post, we discussed the Try, Success, Failure exception-handling method. Now, in this post, we will discuss Scala’s Option, Some, None pattern and its usage. Scala is a high-level programming language combining object-oriented and functional programming in one place. It is a very powerful programming language that can be […]
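
A minimal sketch of the pattern (the names are hypothetical): a lookup that may not find a value returns Option[T] instead of null, and callers handle both cases explicitly:

```scala
// Map#get already returns an Option, so absence is a value, not a null.
def findUser(id: Int): Option[String] =
  Map(1 -> "alice", 2 -> "bob").get(id)

findUser(1)                       // Some("alice")
findUser(99)                      // None
findUser(99).getOrElse("unknown") // "unknown", no NullPointerException risk

findUser(2) match {
  case Some(name) => println(s"found $name")
  case None       => println("no such user")
}
```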

Scala Try, Success, Failure – Functional error handling

In this post, we will discuss Scala’s functional error-handling method using Try, Success, Failure. We know that Scala is a high-level programming language that combines both object-oriented and functional programming in one place. It runs on the JVM, so it can be mixed seamlessly with Java. Scala’s static types help to identify bugs at […]
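
A minimal sketch of the idea (hypothetical example): wrap a call that can throw in Try, then handle Success and Failure as ordinary values rather than with try/catch:

```scala
import scala.util.{Try, Success, Failure}

// toInt throws NumberFormatException on bad input; Try captures it.
def parsePort(s: String): Try[Int] = Try(s.toInt)

parsePort("8080") match {
  case Success(port) => println(s"listening on $port")
  case Failure(e)    => println(s"bad port: ${e.getMessage}")
}

// Try also composes functionally, keeping the happy path linear.
val port: Int = parsePort("not-a-number").getOrElse(8080) // 8080
```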

Get HDFS file location of Hive table records as column

In this post, we will learn how to extract the physical HDFS file path of a Hive table’s records as a column, alongside the other columns of the table. We will demonstrate this using HiveQL, PySpark, and Scala. We can create Hive tables as internal (managed) or external tables. So, if we create an […]
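
In Spark, the usual building block is the input_file_name() function; in plain HiveQL, the virtual column INPUT__FILE__NAME serves the same purpose. A small sketch, runnable in spark-shell (the table name is hypothetical):

```scala
import org.apache.spark.sql.functions.input_file_name

// Attach the HDFS path of the file each record was read from as a column.
val withPath = spark.table("mydb.sales").withColumn("hdfs_file", input_file_name())

withPath.select("hdfs_file").distinct().show(truncate = false)
```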
