Big Data processing using Apache Spark – Introduction


What is Spark

Apache Spark is an open source, general purpose, distributed cluster computing framework. It is a unified computing engine for big data processing, designed for fast computation on clusters. Using Spark's in-memory cluster computing, an application can run up to 100 times faster than Hadoop MapReduce. Spark can also run up to 10 times faster than Hadoop MapReduce when running on disk. The actual gain depends on the workload.

Why Spark

We can use Spark for any kind of big data processing, ranging from SQL to streaming and machine learning, and running anywhere from a single machine to thousands of servers. It supports widely used programming languages such as Python, Java, Scala, and R by exposing a set of high level API libraries. Spark can run on clusters managed by Hadoop YARN or Apache Mesos, or in its own standalone mode. It provides many features, such as fast computation, multiple language support, an open source unified framework, and support for diverse data sources including HDFS, Apache HBase, Hive, S3, Cassandra, and many more. These features make Spark a primary choice for many types of big data processing.

Basic Architecture of Spark

A Spark application consists of:

  1. a driver process on the master node, and
  2. a set of executor processes on the worker nodes

Spark – Basic architecture

Driver process:

The driver process is the heart of a Spark application. It runs on the master node and maintains information about the Spark application throughout program execution. The driver accepts the user’s input, then analyzes the work, distributes it, and schedules it across the executors running on the worker nodes.

Executor process:

An executor process is responsible for running the tasks assigned to it by the driver process. It executes the given instructions and sends the results back to the driver.
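The division of work between the driver and the executors can be illustrated with a small plain-Python sketch. This is not Spark's API: the thread pool stands in for executor processes, and the function names (split_into_partitions, executor_task, run_job) are hypothetical names chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: divide the input into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def executor_task(partition):
    """Executor side: run the assigned task on a single partition."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=3):
    """Driver side: schedule one task per partition, then combine the results."""
    partitions = split_into_partitions(data, num_executors)
    # The thread pool is only a stand-in for executors on worker nodes.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(executor_task, partitions))
    return sum(partial_results)


total = run_job(list(range(1, 11)))
print(total)  # sum of the squares of 1..10
```

In a real cluster the partitions live on different machines and the driver only receives the small partial results, which is what makes this pattern scale.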

Features of Spark

Below are some important features of Spark:

  1. Fast – Spark can run up to 10 times faster on disk and up to 100 times faster in memory than an equivalent MapReduce job.
  2. In-memory processing – Spark’s in-memory engine achieves high performance by executing computations in RAM rather than on disk.
  3. Multiple language support – It supports the widely used languages Python, Java, Scala, and R.
  4. Real-time data stream processing – Using Spark Streaming, we can process real-time data streams.
  5. Fault tolerant – Spark records the lineage of each RDD as a DAG, so lost partitions can be recomputed after a failure, which protects against data loss.
  6. Interactive shell – Spark comes with an interactive shell, which is very helpful for ad hoc querying.
  7. Unified framework – Spark supports SQL and the DataFrame and Dataset APIs for structured data, GraphX for graph processing, MLlib for machine learning, and Spark Streaming for real-time data stream processing.
  8. Lazy evaluation – Spark waits until the very last moment, when an action requests a result, to execute the graph of computation instructions.
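Lazy evaluation and lineage-based recomputation can be made concrete with a toy model. The sketch below is plain Python, not Spark: LazyDataset is a hypothetical class that only records transformations and replays them when collect() is called, which loosely mirrors how an RDD keeps its lineage.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, lineage=()):
        self._source = source      # original input data
        self._lineage = lineage    # ordered (kind, function) steps

    def map(self, fn):
        # Transformation: only extends the recorded plan, computes nothing.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also just recorded.
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        rows = iter(self._source)
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


plan = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: nothing has executed yet
result = plan.collect()            # the action triggers the computation
recomputed = plan.collect()        # replaying the lineage rebuilds the result
print(calls_before_action, result, recomputed)
```

Because the result can always be rebuilt from the source data and the recorded steps, a lost result does not have to be stored redundantly. Spark applies the same idea per partition.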

Apache Spark Components

Below are the important components of Spark:

Spark components

Spark Core:

Spark Core is the general engine and the base for all Spark applications. It is responsible for task dispatching, scheduling, and parallelism. It exposes the RDD (Resilient Distributed Dataset) abstraction through language APIs such as Python, Java, and Scala.

Spark SQL:

Spark SQL provides data abstractions that support SQL queries over structured and semi-structured data. It also provides the DataFrame and Dataset APIs.

A DataFrame is an abstraction built on top of the RDD. Conceptually, it is equivalent to a table in a relational database or to a data frame in Python/R. Each row in a DataFrame is of type Row.

A Dataset is a strongly typed interface, available in Scala and Java, whose element type is defined by a class. Each row in a Dataset is an instance of the given class.

Spark Streaming:

Spark Streaming is used for real-time data stream processing. It divides the incoming stream into micro-batches and applies transformations to each micro-batch. It supports fault-tolerant stream processing through the DStream (discretized stream), which is a series of RDDs (Resilient Distributed Datasets).
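The micro-batch idea can be sketched in a few lines of plain Python. This is an illustration only, not the Spark Streaming API: the helper names (micro_batches, word_counts) are hypothetical, and the batches here are cut by element count, whereas Spark Streaming cuts them by a time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut an iterator of events into consecutive batches of batch_size."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def word_counts(batch):
    """The transformation applied independently to each micro-batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


events = ["spark streaming", "spark core", "spark sql", "graphx", "mllib spark"]
results = [word_counts(batch) for batch in micro_batches(events, batch_size=2)]
for batch_result in results:
    print(batch_result)
```

Each batch is processed like a small, bounded dataset, which is why the same transformations used for batch jobs can be reused on streams.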

Spark MLlib:

The MLlib library provides a collection of useful machine learning algorithms. Because it runs in memory, it generally performs much faster than the disk based Apache Mahout.

Spark GraphX:

GraphX is a distributed graph processing framework. It expresses graph computations through an API based on the Pregel abstraction.
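In the Pregel model, computation proceeds in supersteps: vertices send messages to their neighbors, each vertex updates its state from the messages it receives, and the job stops when no vertex changes. The single-machine sketch below follows that pattern to label connected components. It is plain Python rather than the GraphX API, and pregel_min_label is a hypothetical name used only for this illustration.

```python
def pregel_min_label(edges, max_supersteps=20):
    """Label each vertex with the smallest vertex id in its connected component."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    labels = {v: v for v in neighbors}   # initial vertex state: own id
    active = set(neighbors)              # every vertex starts active

    for _ in range(max_supersteps):
        if not active:
            break                        # no vertex changed: converged
        # Message phase: active vertices send their label to each neighbor,
        # and messages to the same vertex are merged by taking the minimum.
        inbox = {}
        for v in active:
            for n in neighbors[v]:
                inbox[n] = min(inbox.get(n, labels[v]), labels[v])
        # Compute phase: a vertex that receives a smaller label adopts it
        # and stays active for the next superstep.
        active = set()
        for v, message in inbox.items():
            if message < labels[v]:
                labels[v] = message
                active.add(v)
    return labels


edges = [(1, 2), (2, 3), (4, 5)]
labels = pregel_min_label(edges)
print(labels)
```

Vertices that end with the same label belong to the same component. In a distributed system the vertices are partitioned across executors and the messages travel over the network, but the superstep structure is the same.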

Hadoop and Spark

Spark is an alternative to Hadoop MapReduce rather than a replacement for the Hadoop ecosystem. It can act as an extension to Hadoop: it can use HDFS for distributed storage and improves big data processing by performing the computation in memory rather than on disk. We can use Hadoop and Spark together and get the best of both. Typically, we use Hadoop MapReduce for disk-based batch processing and Spark for faster in-memory and real-time workloads.

Thanks for reading. Please share your inputs in the comments.

Gopal Krishna Ranjan

About Gopal Krishna Ranjan

Gopal has 8 years of industry experience in software development. He has hands-on experience in data science, databases, data warehousing, big data, and cloud technologies, and has implemented end-to-end solutions. He has worked extensively on SQL Server, Python, Hadoop, Hive, Spark, Azure, Machine Learning, and MSBI (SSAS, SSIS, and SSRS). He also has good experience in Windows and web application development using ASP.NET and C#.
