How to Install Apache Spark on Windows | Setup PySpark in Anaconda

Install Spark on Windows 10



Apache Spark is a powerful framework for in-memory computation and parallel execution of tasks, with Scala, Python and R interfaces, and it provides an API for massive distributed processing over resilient sets of data. In a short span of time, Spark has evolved into a go-to framework for Big Data engineers, solution architects and data scientists. In this very first chapter, we will make a complete end-to-end setup of Apache Spark on a Windows 10 computer. This guide will be helpful for beginners and self-learners setting up their work environment to start learning Spark application development.

Jupyter Notebook for PySpark



Jupyter Notebook is a powerful notebook environment that enables developers to edit code, execute it, and view the results in an interactive web view. It allows you to change a piece of code and re-execute just that part in an easy and flexible way.

Steps to Setup Spark:


Here is a complete step-by-step guide on how to install PySpark on Windows 10, alongside your Anaconda and Jupyter Notebook installation.



1. Download Anaconda from the provided link and install it - anaconda-python

Clicking on the given link will open the web page shown in the above diagram; click on the download button to start downloading.


2. Install Java JDK version 8


Before we start configuring PySpark on our Windows machine, it is good to make sure that you have already installed the Java JDK (Java Development Kit) version 8. If it is not installed, you can follow the below steps to install Java JDK 8; if you already have the JDK installed on your PC, you can move on directly to the next step. Navigate to the Java official download site and accept the Oracle licence to download Java JDK 8. It is free to download; the only thing is that you need an email id to accept the Oracle licence.


Run the downloaded executable file to install the Java JDK. By default, Java will be installed in the C:\Program Files\Java\jdk1.8.0_201 path; you can also modify this path while installing. Once the JDK is successfully installed on your machine, set the following variables.

Environment variable: 
 JAVA_HOME = C:\Program Files\Java\jdk_<version>

PATH variable: 
 c:\Program Files\Java\jdk1.8.0_201\bin
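
If you prefer the command line over the System Properties dialog, the JAVA_HOME variable can also be created from a Command Prompt with setx; the JDK folder below is just the example path from above, so use the path of your own installation, and open a new prompt afterwards for the change to take effect.

 setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_201"

The bin folder still needs to be added to the PATH variable, which is easiest to do through the Environment Variables dialog.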

3. Check if JAVA is installed:


Open the Windows Command Prompt or Anaconda Prompt from the Start menu and run java -version; it prints out the installed version, showing something like the output below.
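
For example, both of the commands below can be run from the prompt; java -version prints the runtime version, and javac -version additionally confirms that the JDK compiler (not just a JRE) is available on the PATH.

java -version
javac -version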



4. Download Spark


Navigate through the given link to the Spark official site to download the Apache Spark package as a '.tgz' file onto your machine.
Extract the downloaded .tgz file using the 7-Zip application into a local directory. On my system it was extracted to the path c:\spark\spark-2.4.4-bin-hadoop2.7. There will be another compressed archive inside in the tar format; extract that '.tar' to the same path as well.



Download winutils.exe by clicking on the link Winutils for hadoop binaries. Keep in mind to choose the correct version of winutils before downloading; it should match the Hadoop version that you are using. Place the downloaded winutils.exe file into the bin folder of the Spark installation, which in my case is C:\spark\spark-2.4.4-bin-hadoop2.7\bin\winutils.exe.

Set up environment variables as shown below

Environment variable: 
    SPARK_HOME=C:\spark\spark-2.4.4-bin-hadoop2.7
    HADOOP_HOME=C:\spark\spark-2.4.4-bin-hadoop2.7

PATH variable:

    C:\spark\spark-2.4.4-bin-hadoop2.7\bin
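
As with JAVA_HOME, these variables can also be created from a Command Prompt with setx rather than the System Properties dialog; the paths below assume the same extraction directory as above. Running echo in a freshly opened prompt is a quick way to confirm they were picked up.

    setx SPARK_HOME "C:\spark\spark-2.4.4-bin-hadoop2.7"
    setx HADOOP_HOME "C:\spark\spark-2.4.4-bin-hadoop2.7"
    echo %SPARK_HOME%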

5. Check PySpark installation:


Now it's time to check whether Spark got installed properly on your machine. Open the Anaconda Prompt and type pyspark; it will open up a screen like the one shown in the picture below.
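
Once the PySpark shell is up, a quick way to confirm that jobs actually run is to type a couple of lines directly into it; the shell creates the spark session for you, and the version string you see depends on the Spark package you downloaded.

spark.version          # e.g. '2.4.4'
spark.range(5).show()  # runs a small local job and prints a 5-row DataFrame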



6. Spark with Jupyter notebook:


To gain hands-on knowledge of PySpark (Spark with Python) accompanied by Jupyter Notebook, you have to install a free Python library that finds the location of the Spark installation on your machine; the package name is findspark. Execute the below command in the Anaconda Prompt to install the findspark package on your system.

conda install -c conda-forge findspark
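
If you are not using conda, the same package is also available on PyPI and can be installed with pip instead:

pip install findspark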

7. Get started with Jupyter notebook:


Now you have almost completed your setup, and you are ready to make a big move towards your career in learning this trending Big Data technology. Open Jupyter Notebook from Anaconda Navigator as shown in the picture below.


Open a new Python 3 notebook, type in the below script and execute the cell; it returns the path where Spark is installed, as shown below.

import findspark
findspark.init()      # makes the Spark installation importable from Python

import pyspark
findspark.find()      # returns the path of the Spark installation it found

Out[]:
        'C:\spark\spark-2.4.4-bin-hadoop2.7'
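
As a final sanity check, you can create a SparkSession in the same notebook and run a tiny job. This is only a minimal sketch; the application name and the sample rows are arbitrary placeholders.

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all available CPU cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SetupCheck") \
    .getOrCreate()

# Build a small DataFrame and display it to confirm jobs can run
df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
df.show()

spark.stop()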


With that, the Spark installation and setup is complete, and you are ready to learn Spark with Python. Hope this article is helpful to you. In upcoming articles, we will start with the basic concepts of Apache Spark, how to write programs in PySpark, interview questions, and solutions to common problems faced in Spark applications.

Happy Learning!!!
