Category: Spark


Install Spark on Windows (Local machine) with PySpark – Step by Step

Apache Spark is a general-purpose big data processing engine. It is a very powerful cluster computing framework that can scale from a single machine to thousands of nodes. It can run on clusters managed by Hadoop YARN, Apache Mesos, or by Spark’s standalone cluster manager itself. To read more on the Spark big data processing framework, visit the post “Big Data processing using Apache Spark – Introduction”. Here, in this post, we will learn how to install Apache Spark on a local Windows machine in pseudo-distributed mode (managed by Spark’s standalone cluster manager) and run it using PySpark (Spark’s Python API).
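As a quick preview of where the installation ends up, here is a minimal PySpark sketch (not part of the original steps) that verifies a working setup. It assumes PySpark is already installed; the local[*] master runs Spark on all local cores, while a running standalone master would instead be addressed with a URL like spark://localhost:7077.

# Minimal sketch to verify a local PySpark installation (assumes
# PySpark is installed; master URL is illustrative).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-install-check")
    .master("local[*]")   # or e.g. "spark://localhost:7077" for standalone mode
    .getOrCreate()
)

print(spark.version)            # prints the installed Spark version
print(spark.range(5).count())   # trivial job; should print 5
spark.stop()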

Install Spark on Local Windows Machine

To install Apache Spark on a local Windows machine, we need to follow the steps below:

Step 1 – Download and install Java JDK 8

Java JDK 8 is required as a prerequisite for the Apache Spark installation. We … More
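Before moving on, it can help to confirm that the JDK is visible to the system. The following Python sketch is only an illustration (not from the post) and assumes the java executable is on the PATH:

# Hypothetical check that the JDK is installed and JAVA_HOME is set
# (Spark looks for JAVA_HOME; `java -version` prints to stderr on most JDKs).
import os
import subprocess

result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip())

print("JAVA_HOME =", os.environ.get("JAVA_HOME", "<not set>"))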


RDD, DataFrame, and DataSet – Introduction to Spark Data Abstraction

Apache Spark is a general-purpose distributed computing engine used for big data processing – both batch and stream processing. It provides high-level APIs like Spark SQL, Spark Streaming, MLlib, and GraphX to allow interaction with the core functionalities of Apache Spark. Spark also provides several core data abstractions on top of distributed collections of data: RDDs, DataFrames, and DataSets. In this post, we are going to discuss these core data abstractions available in Apache Spark.

Spark Data Abstraction

The data abstraction in Spark represents a logical data structure over the underlying data distributed on different nodes of the cluster. The data abstraction APIs provide a wide range of transformation methods (like map(), filter(), etc.) which are used to perform computations in a distributed way. These transformations are lazy, however: to actually execute them, we need to call an action method like show(), collect(), etc.
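For instance, here is a minimal sketch (assuming a local PySpark installation) of a lazy transformation followed by the actions that trigger it, shown for both the DataFrame and RDD abstractions:

# Sketch of transformations vs. actions (assumes PySpark is installed locally).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("abstraction-demo").master("local[*]").getOrCreate()

df = spark.range(10)                 # DataFrame with a single `id` column
evens = df.filter(df.id % 2 == 0)    # transformation: lazy, nothing runs yet

evens.show()                         # action: triggers the computation
print(evens.collect())               # action: returns rows to the driver

# The same idea with the lower-level RDD API
rdd = spark.sparkContext.parallelize(range(10))
squares = rdd.map(lambda x: x * x)   # transformation (lazy)
print(squares.collect())             # action: [0, 1, 4, 9, ...]

spark.stop()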

Let’s have a … More


Big Data processing using Apache Spark – Introduction

What is Spark

Apache Spark is an open-source, general-purpose distributed cluster computing framework. It is a unified computing engine for big data processing. Spark is designed for lightning-fast cluster computing. An application can run up to 100 times faster than Hadoop MapReduce using Spark’s in-memory cluster computing, and up to 10 times faster than Hadoop MapReduce when running on disk.

Why Spark

We can use Spark for any kind of big data processing, ranging from SQL to streaming and machine learning, running on anything from a single machine to thousands of servers. It supports widely used programming languages like Python, Java, Scala, and R by exposing a set of high-level API libraries. Spark can run on clusters managed by Hadoop YARN or Apache Mesos, or it can run in standalone mode, as sketched below. It provides many features like fast computational speed, multiple language support, … More
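To make the cluster-manager point concrete, the sketch below (illustrative only; the host names are placeholders) shows how the same PySpark application can target YARN, Mesos, a standalone cluster, or local mode just by changing the master URL:

# The same application code can run under different cluster managers;
# only the master URL changes (commented URLs use placeholder hosts).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("why-spark-demo")
    # .master("yarn")                 # Hadoop YARN cluster
    # .master("mesos://host:5050")    # Apache Mesos cluster
    # .master("spark://host:7077")    # Spark standalone cluster
    .master("local[*]")               # single machine, all cores
    .getOrCreate()
)

print(spark.range(1_000_000).count())  # distributed count; prints 1000000
spark.stop()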