How to Configure Windows to Build a Project Having Apache Spark Code Without Installing it?
  • Last Updated : 02 Feb, 2021

Apache Spark is a unified analytics engine for large-scale data processing. It exposes APIs for several programming languages, including Java, Scala, Python, and R, and it can be set up in common IDEs for development. It ships with libraries such as Spark SQL for structured data, GraphX for graph processing, and MLlib for machine learning.

In this article, you will see how to configure Scala IDE to run Apache Spark code without explicitly installing Hadoop and Spark on your system. We will go through each step in detail, including the required dependencies and the prerequisites. All steps are shown in Scala IDE, but the same steps also work in other IDEs such as Eclipse.

Introduction

  • Spark is an open-source distributed Big Data processing framework that was started in 2009 at AMPLab, University of California, Berkeley.
  • Spark was later donated to the Apache Software Foundation, which now maintains it.
  • Spark is a widely used Big Data processing engine, written in the Scala programming language.
  • Before Spark, the processing model in Hadoop was MapReduce (based on the Java programming language).
  • The Spark framework supports the Java, Scala, Python, and R programming languages.

Based on the type of data and the functionality required, Spark provides different APIs:

  • The basic building block of Spark is Spark Core.
  • Spark provides Spark SQL for structured and semi-structured data analysis, based on DataFrame and Dataset.
  • For streaming data, Spark has the Spark Streaming APIs.
  • To implement machine learning algorithms, Spark provides MLlib, a distributed machine learning framework.
  • Graph data can be processed with GraphX, a distributed graph processing framework.

Spark SQL, Spark Streaming, MLlib, and GraphX are built on Spark Core and on the concept of the RDD, i.e. Resilient Distributed Dataset. An RDD is an immutable collection of data split into partitions, which are distributed across the nodes of the cluster.



Prerequisites:

  1. Java: Make sure you have Java installed on your system.
  2. Scala IDE/Eclipse/IntelliJ IDEA: You can use any of these, whichever you are familiar with. This article uses Scala IDE for reference.

Step 1: Create a Maven Project

Creating a Maven project is simple. Follow the steps below to create the project.

  • Click on the File menu -> New -> Other.
  • Click on Maven Project. In this step, select the “Create a simple project (skip archetype selection)” check box, then click on “Next >”.
  • Add a Group Id and an Artifact Id, then click on “Finish”.

With this, you have successfully created a Java project with Maven. Now, the next action is to add a dependency for Spark. 

Step 2: Adding required Spark dependency into pom.xml

Open the pom.xml file in your newly created Maven project and add the Spark dependencies below (spark-core and spark-sql). You can change the versions of these dependencies to suit your project.

Note: The example uses version 2.4.0 of spark-core and spark-sql; the _2.12 suffix in the artifact IDs is the Scala version the artifacts were built for. Spark 3.0.1 is also available, so choose the version that matches the Spark version on your cluster and your project requirements.

XML






  
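<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
 <modelVersion>4.0.0</modelVersion>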
 <groupId>java.spark</groupId>
 <artifactId>Spark-Learning</artifactId>
 <version>0.0.1-SNAPSHOT</version>
  
 <dependencies>
  
 <dependency>
  <groupId>com.thoughtworks.paranamer</groupId>
  <artifactId>paranamer</artifactId>
  <version>2.8</version>
 </dependency>
  
 <dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>2.4.0</version>
 </dependency>
  
 <dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>2.4.0</version>
 </dependency>
  
</dependencies>
  
</project>

Step 3: Writing Sample Spark code

You are almost done. Create a package named spark.java inside your project, and inside it create a Java class named SparkReadCSV. Hadoop is not installed on the system, but on Windows Spark still expects Hadoop's winutils.exe helper. You can download just that file and point the Hadoop home directory at its location.

Here are the steps to do this.

  • Download the winutils.exe file. https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-winutils/2.4.1
  • Create a Hadoop\bin directory on your drive. In this example, the Hadoop\bin folder is created on the D: drive.
  • Copy the winutils.exe file to the D:\Hadoop\bin folder.
  • Lastly, in the SparkReadCSV.java file, set the Hadoop home directory to the folder that contains bin (not the bin folder itself), as follows:
System.setProperty("hadoop.home.dir", "D:\\Hadoop\\");
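
A wrong folder layout is an easy mistake to make here, so it can help to verify it before setting the property. The following is a minimal sketch, not part of the original setup: the class name HadoopHomeCheck and its two helper methods are hypothetical, and the D:\Hadoop default is assumed from the steps above. It uses only the Java standard library.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: verifies the winutils layout before setting hadoop.home.dir.
public class HadoopHomeCheck {

    // Returns true if <hadoopHome>/bin/winutils.exe exists as a regular file.
    public static boolean hasWinutils(Path hadoopHome) {
        return Files.isRegularFile(hadoopHome.resolve("bin").resolve("winutils.exe"));
    }

    // Sets hadoop.home.dir only when winutils.exe is present.
    // Returns true if the property was set, false otherwise.
    public static boolean configureHadoopHome(Path hadoopHome) {
        if (!hasWinutils(hadoopHome)) {
            return false;
        }
        System.setProperty("hadoop.home.dir", hadoopHome.toAbsolutePath().toString());
        return true;
    }

    public static void main(String[] args) {
        // Assumed default location, taken from the steps above.
        Path hadoopHome = Paths.get(args.length > 0 ? args[0] : "D:\\Hadoop");
        if (configureHadoopHome(hadoopHome)) {
            System.out.println("hadoop.home.dir = " + System.getProperty("hadoop.home.dir"));
        } else {
            System.out.println("winutils.exe not found under " + hadoopHome.resolve("bin"));
        }
    }
}
```

If you use such a check, call configureHadoopHome before creating the SparkSession, so the property is already in place when Spark starts.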

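Create an employee.txt file (the code below reads it from D:\data\employee.txt) and add the dummy records below. The fields are separated by the "|" character, which matches the delimiter used in the code.

id|name|address|salary
1|Bernard Norris|Amberloup|10172
2|Sebastian Russell|Delicias|18178
3|Uriel Webster|Faisalabad|16419
4|Clarke Huffman|Merritt|16850
5|Orson Travis|Oberursel|17435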
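
If you would rather generate the sample file from code than create it by hand, a sketch along these lines should work. The class name EmployeeFileWriter and the method writeSample are hypothetical, and the default path D:\data\employee.txt is assumed from the reader code in this article. The records are written with "|" between fields because the reader is configured with that delimiter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes the sample employee records as a pipe-delimited file.
public class EmployeeFileWriter {

    // Header line followed by the five dummy records from this article.
    public static final List<String> SAMPLE_LINES = Arrays.asList(
        "id|name|address|salary",
        "1|Bernard Norris|Amberloup|10172",
        "2|Sebastian Russell|Delicias|18178",
        "3|Uriel Webster|Faisalabad|16419",
        "4|Clarke Huffman|Merritt|16850",
        "5|Orson Travis|Oberursel|17435");

    // Creates missing parent folders, then writes the records as UTF-8 text.
    public static Path writeSample(Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, SAMPLE_LINES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location, matching the path used by the reader code.
        Path target = Paths.get(args.length > 0 ? args[0] : "D:\\data\\employee.txt");
        System.out.println("Wrote " + writeSample(target).toAbsolutePath());
    }
}
```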

Add the code below to the SparkReadCSV.java file. The comments describe each step.

Code:

Java




package spark.java;
  
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
  
public class SparkReadCSV {
  
    public static void main(String[] args)
    {
  
        // Point hadoop.home.dir at the folder that contains
        // bin\winutils.exe.
        System.setProperty("hadoop.home.dir",
                           "D:\\Hadoop\\");
  
        // Create a SparkSession object to process the data.
        // builder() starts building the SparkSession.
        // appName() sets the application name that is shown
        // in the YARN/Spark web UI.
        // master() sets the Spark master URL: "local" runs
        // locally with one worker thread, "local[3]" runs
        // locally with 3 worker threads, and "yarn" runs on
        // a YARN Hadoop cluster.
        // getOrCreate() returns an existing SparkSession or
        // creates a new one.
        SparkSession spark
            = SparkSession
                  .builder()
                  .appName("***** Reading CSV file.*****")
                  .master("local[3]")
                  .getOrCreate();
  
        // Read the sample file as a DataFrame.
        // read() returns a reader for loading data.
        // option("header", true) treats the first line of
        // the input as the header.
        // option("delimiter", "|") indicates that the fields
        // of each record are separated by "|".
        // csv() accepts an input path on either the local
        // file system or the Hadoop Distributed File System.
        // Here the data is read from the local file system.
        Dataset<Row> employeeDS
            = spark
                  .read()
                  .option("header", true)
                  .option("delimiter", "|")
                  .csv("D:\\data\\employee.txt");
  
        // Display the records.
        employeeDS.show();

        // Stop the session to release its resources.
        spark.stop();
    }
}
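
To make the header and delimiter options more concrete, here is a plain-Java illustration that uses no Spark at all. It is only a simplified sketch of the idea (the first line supplies the column names, and every later line is split on "|"), not how Spark parses CSV internally. The class name PipeDelimitedSketch and the method parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of header + delimiter handling, without Spark.
public class PipeDelimitedSketch {

    // Treats the first line as the header and maps each later line
    // to column-name -> value, splitting on the given delimiter.
    public static List<Map<String, String>> parse(List<String> lines, String delimiter) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        // Quote the delimiter so that "|" is not read as a regex operator.
        String splitter = Pattern.quote(delimiter);
        String[] header = lines.get(0).split(splitter, -1);
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(splitter, -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                // Missing trailing fields are stored as null.
                row.put(header[i], i < values.length ? values[i] : null);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "id|name|address|salary",
            "1|Bernard Norris|Amberloup|10172",
            "2|Sebastian Russell|Delicias|18178");
        for (Map<String, String> row : parse(lines, "|")) {
            System.out.println(row);
        }
    }
}
```

Unlike this sketch, Spark also handles quoting, escaping, and schema details, so treat it only as a mental model for the two options.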

We have set up a Spark development environment in a few easy steps. From this starting point, you can explore Spark further by solving different use cases.
