Install Spark on Windows (Local machine) with PySpark – Step by Step


Apache Spark is a general-purpose big data processing engine. It is a very powerful cluster computing framework that can scale from a single node to thousands of nodes. It can run on clusters managed by Hadoop YARN, Apache Mesos, or by Spark’s standalone cluster manager itself. To read more about the Spark big data processing framework, visit the post “Big Data processing using Apache Spark – Introduction“. In this post, we will learn how to install Apache Spark on a local Windows machine in pseudo-distributed mode (managed by Spark’s standalone cluster manager) and run it using PySpark (Spark’s Python API).

Install Spark on Local Windows Machine

To install Apache Spark on a local Windows machine, we need to follow the steps below:

Step 1 – Download and install Java JDK 8

Java JDK 8 is required as a prerequisite for the Apache Spark installation. We can download JDK 8 from the official Oracle website.

JDK 8 Download

As highlighted, we need to download the 32-bit or 64-bit JDK 8 as appropriate for our system. Click on the link to start the download. Once the file has downloaded, double-click the executable binary file to start the installation process and then follow the on-screen instructions.

Step 2 – Download and install the latest Apache Spark version

Now we need to download the latest Spark build from Apache Spark’s home page. The latest available Spark version (at the time of writing) is Spark 2.4.3. The default package type, pre-built for Apache Hadoop 2.7 and later, works fine. Next, click on the “spark-2.4.3-bin-hadoop2.7.tgz” link to download the .tgz file.

Download Apache Spark

After downloading the Spark build, we need to extract the archive and copy the “spark-2.4.3-bin-hadoop2.7” folder to the Spark installation folder, for example, C:\Spark\. (Note that the downloaded .tgz file is a gzip-compressed tar archive: extracting it once produces a .tar file, which must be extracted again to get the innermost “spark-2.4.3-bin-hadoop2.7” directory at the installation path.)
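If a GUI archive tool is not at hand, the .tgz archive can also be extracted directly from Python using the standard tarfile module. This is just an optional sketch; the download path below is a placeholder and should be replaced with the actual location of the downloaded file.

import tarfile

# Placeholder path to the downloaded archive - adjust to the actual download location.
archive = r"C:\Downloads\spark-2.4.3-bin-hadoop2.7.tgz"

# tarfile handles the gzip compression and the inner tar archive in a single step
# and recreates the "spark-2.4.3-bin-hadoop2.7" folder under C:\Spark.
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(r"C:\Spark")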

Spark installation folder

Step 3 – Set the environment variables

Now, we need to set a few environment variables that are required to set up Spark on a Windows machine. Note that we need to replace “Program Files” with “Progra~1” and “Program Files (x86)” with “Progra~2” (the short 8.3 names avoid problems caused by spaces in the paths). A quick way to verify these values from Python is shown after the list below.

  1. Set SPARK_HOME = “C:\Spark\spark-2.4.3-bin-hadoop2.7”
  2. Set HADOOP_HOME = “C:\Spark\spark-2.4.3-bin-hadoop2.7”
  3. Set JAVA_HOME = “C:\Progra~1\Java\jdk1.8.0_212”
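As a quick sanity check, the variables can be printed from Python (in a session started after setting them, since environment variable changes are only visible to newly started processes). This is just a verification sketch:

import os

# Print the environment variables required by Spark; each should show the
# value configured above instead of "<not set>".
for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    print(name, "=", os.environ.get(name, "<not set>"))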

Step 4 – Update existing PATH variable

  1. Modify the PATH variable to add:
    1. C:\Progra~1\Java\jdk1.8.0_212\bin
    2. C:\Spark\spark-2.4.3-bin-hadoop2.7\bin

Note: As before, we need to replace “Program Files” with “Progra~1” and “Program Files (x86)” with “Progra~2”.
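To confirm that the new PATH entries are picked up, we can check from Python that the java and spark-submit executables are resolvable. This is only a sketch; shutil.which searches the PATH the same way the command prompt does and returns None if the executable cannot be found.

import shutil

# Both calls should print a full path; None means the PATH entry is missing
# or the session was started before PATH was updated.
print(shutil.which("java"))
print(shutil.which("spark-submit"))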

Step 5 – Download and copy winutils.exe

Next, we need to download the winutils.exe binary file from the git repository “https://github.com/steveloughran/winutils“. To download it:

  1. Open the given git link.
  2. Navigate to the hadoop-2.7.1 folder (we need to navigate to the Hadoop version folder matching the package type selected while downloading the Spark build).
  3. Go to the bin folder and download the winutils.exe binary file. This is the direct link to winutils.exe for the Hadoop 2.7 and later Spark build: “https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe“.
  4. Copy this file into the bin folder of the Spark installation folder, which is “C:\Spark\spark-2.4.3-bin-hadoop2.7\bin” in our case (a quick existence check is shown after this list).
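A quick existence check from Python, assuming HADOOP_HOME was set as in Step 3, can confirm that winutils.exe has been copied to the right place:

import os

# winutils.exe should live under %HADOOP_HOME%\bin (the Spark bin folder in our setup).
winutils_path = os.path.join(os.environ["HADOOP_HOME"], "bin", "winutils.exe")
print(winutils_path, "exists:", os.path.exists(winutils_path))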

Step 6 – Create hive temp folder

To avoid Hive-related errors, we need to create an empty directory at “C:\tmp\hive“, which Spark’s Hive support uses as a temporary scratch folder.
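The folder can be created from Windows Explorer, or with a one-liner from Python:

import os

# Create C:\tmp\hive if it does not already exist.
os.makedirs(r"C:\tmp\hive", exist_ok=True)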

Step 7 – Change winutils permission

Once we have downloaded and copied winutils.exe to the desired path and have created the required hive folder, we need to use winutils.exe to grant the appropriate permissions on that folder. To do so, open the command prompt as an administrator and execute the below commands (since the Spark bin folder was added to PATH in Step 4, winutils.exe can be invoked directly; otherwise, use its full path):

winutils.exe chmod -R 777 C:\tmp\hive
winutils.exe ls -F C:\tmp\hive

Step 8 – Download and install the latest Python version

Now, we are good to download and install the latest version of Python. Python can be downloaded from the official Python website: https://www.python.org/downloads/.

Download Python

Step 9 – pip install pyspark

Next, we need to install the pyspark package to start Spark programming using Python. To do so, open a command prompt window and execute the below command:

pip install pyspark
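Once the installation finishes, a quick import check confirms that the package is available to Python. Note that the version reported by the pip package may differ from the Spark build downloaded in Step 2 if a newer release is available on PyPI.

import pyspark

# Prints the installed PySpark version, e.g. 2.4.3.
print(pyspark.__version__)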

Step 10 – Run Spark code

Now, we can use any code editor/IDE or Python’s built-in editor (IDLE) to write and execute Spark code. Below is sample Spark code written in a Jupyter notebook:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Configure Spark to run locally with the given application name
conf = SparkConf()
conf.setMaster("local").setAppName("My app")

# Create (or reuse) a SparkContext and wrap it in a SparkSession
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

# Print the Spark version to confirm that everything is working
print("Current Spark version is : {0}".format(spark.version))

Spark sample code
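As a small follow-up, the SparkSession created above can be used to build a DataFrame from local sample data and run a simple aggregation. The data below is purely illustrative:

# Create a small DataFrame from sample data using the "spark" session above.
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()

# Run a simple aggregation: compute the average age across all rows.
print("Average age:", df.groupBy().avg("age").first()[0])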

Thanks for reading. Please share your thoughts in the comments section.


