Monthly Archives: April 2019

Big Data processing using Apache Spark – Introduction

What is Spark

Apache spark is an open source general purpose distributed cluster computing framework. It is an unified computing engine for big data processing. Spark is designed for lightning fast cluster computing especially for fast computation. An application can run up to 100 times faster than Hadoop MapReduce using Spark in-memory cluster computing. Also, Spark can run up to 10 times faster than Hadoop MapReduce when running on disk.

Why Spark

We can use Spark for any kind of big data processing ranging from SQL to streaming and machine learning running from a single machine to thousands of servers. It supports widely used programming languages like Python, Java, Scala, and R by exposing a set of high level API libraries. Spark can run on clusters managed by Hadoop YARN, Apache Mesos, or it can run standalone also. It provides many features like fast computational speed, multiple language support, … More