How to Configure Windows to Build a Project Having Apache Spark Code Without Installing it?

Last Updated : 02 Feb, 2021

Apache Spark is a unified analytics engine for large-scale data processing. It provides APIs for programming languages such as Java, Scala, Python, and R, and it can be set up inside common IDEs so that you can develop Spark jobs there. It also ships with higher-level libraries such as Spark SQL for SQL, GraphX for graph processing, and MLlib for machine learning.

In this article, you will see how to configure a Spark project in Scala IDE and run Apache Spark code without explicitly installing Hadoop or Spark on your system. We will go through each step in detail, including the prerequisites and the dependencies the project needs. All steps are shown in Scala IDE, but the same steps can be followed in other IDEs, including Eclipse.

Introduction

Spark is an open-source distributed big data processing framework, originally developed in 2009 at AMPLab, University of California, Berkeley.
Spark was later donated to the Apache Software Foundation, which now maintains it.
Spark is a widely used big data processing engine and is written in the Scala programming language.
Before Spark, the processing model in Hadoop was MapReduce, which is based on the Java programming language.
The Spark framework supports the Java, Scala, Python, and R programming languages.

Depending on the type of data and the functionality needed, Spark offers different APIs:

Spark Core is the basic building block of Spark.
Spark SQL handles structured and semi-structured data analysis and is based on DataFrame and Dataset.
For streaming data, Spark has the Spark Streaming APIs.
For machine learning algorithms, Spark provides MLlib, a distributed machine learning framework.
Graph data can be processed with GraphX, a distributed graph processing framework.

Spark SQL, Spark Streaming, MLlib, and GraphX are all built on Spark Core and on the concept of the RDD (Resilient Distributed Dataset). An RDD is an immutable collection of records that is split into partitions, and those partitions are distributed across the nodes of a cluster, for example the nodes of a Hadoop cluster.

Prerequisites:

Java: Make sure you have Java installed on your system.
Scala IDE/Eclipse/IntelliJ IDEA: You can use any of these, whichever you are familiar with. Scala IDE is used here for reference.
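
A quick way to confirm the Java prerequisite is to print the version that the IDE actually runs with. The sketch below is only an illustration: the class name JavaVersionCheck and the helper majorVersion are made up for this example, and it relies only on the standard java.version system property. The Spark 2.4.x line used later in this article is documented as running on Java 8, so a major version of 8 is the safe case.

```java
public class JavaVersionCheck {

    // Hypothetical helper: extracts the major version from strings
    // such as "1.8.0_281", "11.0.2" or "17".
    public static int majorVersion(String version) {
        String[] parts = version.split("[._\\-+]");
        int first = Integer.parseInt(parts[0]);
        // Before Java 9 the scheme was "1.<major>", e.g. "1.8" for Java 8.
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("Java version: " + version
                + " (major " + majorVersion(version) + ")");
    }
}
```

If the printed major version is not the one you expect, check which JRE the IDE project is configured to use.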

Step 1: Create a Maven Project

Creating a Maven project is simple. Follow the steps below.

Click on the File menu -> New -> Other.
Click on Maven Project. In this step, select the “Create a simple project (skip archetype selection)” check box, then click “Next >”.
Enter a Group Id and an Artifact Id, then click “Finish”.

With this, you have successfully created a Java project with Maven. Now, the next action is to add a dependency for Spark.

Step 2: Adding required Spark dependency into pom.xml

Open the pom.xml file in your newly created Maven project and add the Spark dependencies (spark-core and spark-sql) shown below. You can change the dependency versions to suit your project. The _2.12 suffix in the artifact IDs is the Scala version the artifacts were built for.

Note: The example uses version 2.4.0 of spark-core and spark-sql. Spark 3.0.1 is also available; choose the version that matches the Spark version on your cluster and your project requirements.

XML

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>java.spark</groupId>
  <artifactId>Spark-Learning</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>

    <dependency>
      <groupId>com.thoughtworks.paranamer</groupId>
      <artifactId>paranamer</artifactId>
      <version>2.8</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>2.4.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>2.4.0</version>
    </dependency>

  </dependencies>

</project>
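
Once Maven has resolved the dependencies, it can help to confirm that the Spark classes are really visible to the project before writing any Spark code. The following is a minimal sketch, not part of the original setup: the class name SparkClasspathCheck and the method isOnClasspath are hypothetical, and the check simply asks the class loader for org.apache.spark.SparkContext (from spark-core) and org.apache.spark.sql.SparkSession (from spark-sql).

```java
public class SparkClasspathCheck {

    // Hypothetical helper: returns true if the named class can be
    // located on the classpath. The class is not initialized.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                    SparkClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] expected = {
            "org.apache.spark.SparkContext",      // provided by spark-core
            "org.apache.spark.sql.SparkSession"   // provided by spark-sql
        };
        for (String className : expected) {
            System.out.println(className + " -> "
                    + (isOnClasspath(className) ? "found" : "MISSING"));
        }
    }
}
```

If either class is reported as missing, updating the Maven project in the IDE so that the dependencies are downloaded again is a reasonable first thing to try.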

Step 3: Writing Sample Spark code

Now you are almost done. Create a package named spark.java inside your project, and inside that package create a Java class named SparkReadCSV. Hadoop is not installed on the system, but you can simply download the winutils.exe file and set the folder that holds it as the Hadoop home directory.

Here are the steps required to do this.

Download the winutils.exe file. https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-winutils/2.4.1
Create a Hadoop\bin directory on your drive. In this example, the Hadoop\bin folder is created on the D: drive.
Copy the winutils.exe file to the D:\Hadoop\bin folder.
Lastly, in the SparkReadCSV.java file, set the Hadoop home directory to the parent of that bin folder, as follows.

System.setProperty("hadoop.home.dir", "D:\\Hadoop\\");
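
A wrong path here usually only shows up later as a confusing runtime error, so it can be worth checking that winutils.exe is where the property says it is. The sketch below is one possible way to do that, under the assumption that the file sits in a bin folder directly under the Hadoop home directory as described above. HadoopHomeCheck and hasWinutils are hypothetical names used only for this illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HadoopHomeCheck {

    // Hypothetical helper: returns true if <hadoopHome>/bin/winutils.exe
    // exists as a regular file.
    public static boolean hasWinutils(String hadoopHome) {
        Path winutils = Paths.get(hadoopHome, "bin", "winutils.exe");
        return Files.isRegularFile(winutils);
    }

    public static void main(String[] args) {
        // Same directory as used in this article; adjust to your setup.
        String hadoopHome = "D:\\Hadoop";
        if (hasWinutils(hadoopHome)) {
            System.setProperty("hadoop.home.dir", hadoopHome);
            System.out.println("hadoop.home.dir set to " + hadoopHome);
        } else {
            System.out.println("winutils.exe not found under "
                    + hadoopHome + "\\bin");
        }
    }
}
```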

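Create an employee.txt file (the code below reads it from D:\data\employee.txt) and add the dummy records below. The code reads the file with "|" as the delimiter, so the fields are separated by |.

id|name|address|salary
1|Bernard Norris|Amberloup|10172
2|Sebastian Russell|Delicias|18178
3|Uriel Webster|Faisalabad|16419
4|Clarke Huffman|Merritt|16850
5|Orson Travis|Oberursel|17435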
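
If you prefer not to create the file by hand, a small standalone program can write it. This is only a sketch: the class name CreateEmployeeFile and its methods are invented for this example, the default target path is the D:\data\employee.txt location that the reading code in this article uses, and the fields are joined with "|" because that is the delimiter the reading code expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class CreateEmployeeFile {

    // The dummy records from this article, one "|"-separated line each.
    public static List<String> sampleLines() {
        return Arrays.asList(
            "id|name|address|salary",
            "1|Bernard Norris|Amberloup|10172",
            "2|Sebastian Russell|Delicias|18178",
            "3|Uriel Webster|Faisalabad|16419",
            "4|Clarke Huffman|Merritt|16850",
            "5|Orson Travis|Oberursel|17435");
    }

    // Writes the sample lines to the target file, creating the
    // parent folder first if it does not exist yet.
    public static void write(Path target) throws IOException {
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.write(target, sampleLines(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // An optional first argument overrides the default path.
        Path target = Paths.get(
                args.length > 0 ? args[0] : "D:\\data\\employee.txt");
        write(target);
        System.out.println("Wrote " + target.toAbsolutePath());
    }
}
```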

Add the code below to the SparkReadCSV.java file. The comments in the code describe each step.

Code:

Java

package spark.java; 
  
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.SparkSession; 
  
public class SparkReadCSV { 
  
    public static void main(String[] args) 
    { 
  
        // Point hadoop.home.dir at the folder that contains
        // bin\winutils.exe.
        System.setProperty("hadoop.home.dir",
                           "D:\\Hadoop\\");

        // Create a SparkSession object to process the data.
        // builder() creates a builder for the SparkSession.
        // appName() sets the application name, which is
        // shown in the YARN/Spark web UI.
        // master() sets the Spark master URL: "local" runs
        // locally with one thread, "local[3]" runs locally
        // with 3 threads, and "yarn" runs on a YARN Hadoop
        // cluster.
        // getOrCreate() returns an existing SparkSession or
        // creates a new one.
        SparkSession spark 
            = SparkSession 
                  .builder() 
                  .appName("***** Reading CSV file.*****") 
                  .master("local[3]") 
                  .getOrCreate(); 
  
        // Read the sample file.
        // read() returns a reader that loads data as a
        // DataFrame (Dataset<Row>).
        // option("header", true) treats the first line of
        // the input as the header.
        // option("delimiter", "|") states that the fields
        // in each record are separated by |.
        // csv() accepts an input path on the local file
        // system or on the Hadoop Distributed File System.
        // Here the data is read from the local file system.
        Dataset<Row> employeeDS 
            = spark 
                  .read() 
                  .option("header", true) 
                  .option("delimiter", "|") 
                  .csv("D:\\data\\employee.txt"); 
  
        // Displaying the records. 
        employeeDS.show(); 
    } 
}
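
To make the header and delimiter options more concrete, the plain Java sketch below mimics their effect on a few lines of input: the first line supplies the column names, and every other line is split on the delimiter. It is not how Spark parses files internally, and DelimitedPreview and parse are hypothetical names used only for this illustration. Every value stays a string here, which matches what Spark does by default when no schema is given or inferred.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class DelimitedPreview {

    // Treats the first line as column names and splits every
    // following line on the given delimiter.
    public static List<Map<String, String>> parse(List<String> lines,
                                                  String delimiter) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        // Quote the delimiter so that "|" is not read as a regex operator.
        String splitter = Pattern.quote(delimiter);
        String[] header = lines.get(0).split(splitter, -1);
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(splitter, -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                // Missing trailing fields become null.
                row.put(header[i], i < values.length ? values[i] : null);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "id|name|address|salary",
            "1|Bernard Norris|Amberloup|10172",
            "2|Sebastian Russell|Delicias|18178");
        for (Map<String, String> row : parse(lines, "|")) {
            System.out.println(row);
        }
    }
}
```

If the delimiter passed to the reader does not match the one in the file, each line ends up as a single value, which is a common reason for a result with only one column.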

We have set up a Spark development environment in a few easy steps. From this starting point, you can explore Spark further by working through different use cases.
