Daily Archives: May 31, 2019


RDD, DataFrame, and DataSet – Introduction to Spark Data Abstraction

Apache Spark is a general-purpose distributed computing engine for Big Data processing, covering both batch and stream processing. It provides high-level APIs such as Spark SQL, Spark Streaming, MLlib, and GraphX on top of the core functionality of Apache Spark. Spark also offers several core data abstractions over distributed collections of data: RDDs, DataFrames, and DataSets. In this post, we are going to discuss these core data abstractions available in Apache Spark.

Spark Data Abstraction

A data abstraction in Spark is a logical data structure over underlying data that is distributed across different nodes of the cluster. The data abstraction APIs provide a wide range of transformation methods (like map(), filter(), etc.) which are used to express computations that run in a distributed way. Transformations are evaluated lazily, however: nothing is executed until we call an action method like show() or collect().
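To make the transformation/action split concrete, here is a minimal sketch in plain Python. It is a toy analogy, not Spark's API: the class name LazyCollection and the helper double are hypothetical, and everything runs in one process rather than on a cluster. It only mimics the behaviour described above, where map() and filter() record work to be done and collect() triggers it.

```python
from typing import Callable, Iterable, List


class LazyCollection:
    """Toy stand-in for a Spark data abstraction (hypothetical, single process)."""

    def __init__(self, data: Iterable, steps: tuple = ()):
        self._data = data
        self._steps = steps  # pending transformations, not yet executed

    def map(self, fn: Callable) -> "LazyCollection":
        # Transformation: extend the plan and return it; compute nothing.
        return LazyCollection(self._data, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyCollection":
        # Transformation: same idea, recorded as a filter step.
        return LazyCollection(self._data, self._steps + (("filter", pred),))

    def collect(self) -> List:
        # Action: run every recorded step in order and materialize the result.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records which inputs double() has actually seen


def double(x):
    calls.append(x)
    return x * 2


plan = LazyCollection(range(1, 6)).map(double).filter(lambda x: x > 4)
ran_before_action = list(calls)  # still empty: no transformation has run yet

result = plan.collect()  # the action executes the whole plan
print(ran_before_action, result)
```

In this sketch, double() is not invoked while the plan is being built; it only runs once collect() is called. Real Spark follows the same pattern, with the difference that the recorded plan is executed across the nodes of the cluster.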

Let’s have a … More