Big Data/Cloud Archives - Page 2 of 5

ERROR Utils: Aborting task java.io.IOException: Failed to connect to – Local Spark

Hadoop, Hive, Spark / Gopal Krishna Ranjan / Dec 14, 2022 / big data processing, Hadoop, pyspark, scala

In this post, we discuss the error/warning message “java.io.IOException: Failed to connect to”. It keeps appearing when we execute a Hive query from spark-shell using Spark SQL, while Spark tries to run a task in local mode (pseudo-distributed mode). It is caused by a connection exception. The […]


Get the first non-null value per group Spark dataframe

Scala, Spark / Gopal Krishna Ranjan / Nov 12, 2022 / big data processing, Hadoop, scala

Suppose we need to get the first non-null value from each partition of a DataFrame. Specifically, we want only the first non-null value from each column, regardless of which row it comes from. That means a non-null value of column A from row 5 can be stitched with another non-null value of column B from
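
The idea in this excerpt can be sketched without Spark. The plain-Python example below is only an illustration of the semantics (column-wise first non-null value within each group) and is not the post's own code. The function name `first_non_null_per_group`, the sample rows, and the column names `id`, `a`, and `b` are all hypothetical.

```python
from typing import Any, Dict, Hashable, Iterable, List


def first_non_null_per_group(
    rows: Iterable[Dict[str, Any]],
    key: str,
    columns: List[str],
) -> Dict[Hashable, Dict[str, Any]]:
    """For each group (identified by `key`), keep the first non-None value
    seen in every column, independently of which row it came from."""
    result: Dict[Hashable, Dict[str, Any]] = {}
    for row in rows:
        # Start every group with all columns unset (None).
        slot = result.setdefault(row[key], {c: None for c in columns})
        for c in columns:
            # Fill a column only once, with the first non-None value encountered.
            if slot[c] is None and row.get(c) is not None:
                slot[c] = row[c]
    return result


# Hypothetical sample data: values for one group are spread across rows.
rows = [
    {"id": 1, "a": None, "b": "x"},
    {"id": 1, "a": 10, "b": None},
    {"id": 2, "a": None, "b": None},
    {"id": 1, "a": 20, "b": "y"},
    {"id": 2, "a": 7, "b": None},
]

result = first_non_null_per_group(rows, key="id", columns=["a", "b"])
print(result)
```

Here group 1 combines `a` from its second row with `b` from its first row, which is the "stitching" the excerpt describes. In PySpark, a comparable aggregation is usually written with `pyspark.sql.functions.first(col, ignorenulls=True)` inside a `groupBy(...).agg(...)`; note that without an explicit ordering, Spark does not guarantee which row counts as "first".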


Download a file from DBFS – Databricks to the local machine

1 Comment / Azure, Databricks / Gopal Krishna Ranjan / Oct 19, 2022 / azure, azure-databricks, databricks

In this post, we will learn how to download a file from DBFS (Databricks File System) to the local machine. DBFS is the file system that Databricks uses to store its files. It is a distributed file system that is mounted into a Databricks workspace and is available on Databricks clusters. To demonstrate how


Execute Scala file in Spark without creating a jar

Scala, Spark / Gopal Krishna Ranjan / Aug 17, 2022 / big data processing, Hadoop, scala

This post shows how to execute a Scala file in Spark without creating a jar file. A Scala source code file has the extension .scala. Normally, we need to package the source code into a jar file to execute an application written in Scala. We can create


Using Pandas on Spark

Python, Spark / Gopal Krishna Ranjan / Jul 31, 2022 / big data processing, pyspark, python

Pandas is one of the most popular Python libraries used by data scientists and data engineers for data wrangling and data analysis. Pandas provides DataFrames (a table-like structure that stores data in rows and columns) to deal with structured datasets. These DataFrames are very similar to Spark’s DataFrames. However, Pandas DataFrames are limited to a single


Use HDFS API to read Azure Blob files in Databricks

Azure, Databricks, Python / Gopal Krishna Ranjan / Apr 30, 2022 / python, python use case

Databricks provides a wrapper file system API named DBFS (Databricks File System) to perform any file-level operation such as read, write, move, delete, rename, etc. However, sometimes we may need to read the underlying file system objects directly without using the DBFS wrapper APIs. To do so, we can use HDFS APIs available through py4j


Create jar in IntelliJ IDEA for sbt-based Scala + Spark project

Scala, Spark / Gopal Krishna Ranjan / Mar 31, 2022 / big data processing, data analysis, scala, step by step

Like the Maven build tool, sbt is a tool that can be used to manage the project development lifecycle. It helps us build, test, and package Scala- and Java-based projects into a .jar file. This jar file can be used as a package in another application/project, or it can be simply used


Create jar in IntelliJ IDEA for Maven-based Scala + Spark project

Scala, Spark / Gopal Krishna Ranjan / Feb 28, 2022 / big data processing, data analysis, scala, step by step

In this post, we will learn how to create a jar in IntelliJ IDEA for a Maven-based Scala + Spark project. We will use the Maven build tool to create the jar file from the sample Scala project. Maven is a project management tool that can be used to manage
