Apache Spark is a unified analytics engine for large-scale data processing. It provides APIs for programming languages such as Java, Scala, Python, and R, and it can be configured in an IDE to suit your requirements. It includes libraries such as Spark SQL for SQL, GraphX for graph processing, and MLlib for machine learning.
In this article, you will see how to configure Scala IDE to run Apache Spark code, without explicitly installing Hadoop and Spark on your system. We will go through each step in detail, covering the prerequisites and the required dependencies. The steps are implemented here in Scala IDE, but they can be followed in other IDEs as well; for example, the same steps apply to Eclipse IDE.
- Spark is an open-source distributed Big Data processing framework, originally developed in 2009 at the AMPLab at the University of California, Berkeley.
- Later, Spark was donated to the Apache Software Foundation, which now maintains it.
- Spark is a widely used Big Data processing engine and is written in the Scala programming language.
- Before Spark, the processing model in Hadoop was MapReduce (based on the Java programming language).
- The Spark framework supports the Java, Scala, Python, and R programming languages.
Based on the type of data and the functionality required, Spark provides different APIs, as follows:
- The basic building block of Spark is Spark Core.
- Spark provides Spark SQL for structured and semi-structured data analysis, based on DataFrame and Dataset.
- For streaming data, Spark has the Spark Streaming APIs.
- To implement machine learning algorithms, Spark provides MLlib, a distributed machine learning framework.
- Graph data can be processed effectively with GraphX, a distributed graph processing framework.
Spark SQL, Spark Streaming, MLlib, and GraphX are built on Spark Core functionality and on the concept of the RDD (Resilient Distributed Dataset). An RDD is an immutable, partitioned collection of records whose partitions are spread across the nodes of the cluster (for example, the Data Nodes of a Hadoop cluster).
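To make the RDD idea concrete, here is a minimal sketch in Java. It assumes the spark-core dependency added later in this article is on the classpath and that Spark runs in local mode. The class name RddSketch, the method doubled, and the sample numbers are illustrative only. The sketch shows a collection split into two partitions, and a map transformation that returns a new RDD while leaving the original unchanged.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {

    // map returns a new RDD; the input RDD is immutable and is left untouched.
    public static JavaRDD<Integer> doubled(JavaRDD<Integer> numbers) {
        return numbers.map(n -> n * 2);
    }

    public static void main(String[] args) {
        // Assumption: local mode with two worker threads, no cluster required.
        SparkConf conf = new SparkConf().setAppName("RddSketch").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Distribute a small in-memory list across two partitions.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);
            JavaRDD<Integer> result = doubled(numbers);

            System.out.println("Partitions: " + numbers.getNumPartitions());
            System.out.println("Original:   " + numbers.collect());
            System.out.println("Doubled:    " + result.collect());
        }
    }
}
```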
The prerequisites are:
- Java: Make sure you have Java installed on your system.
- Scala IDE/Eclipse/IntelliJ IDEA: You can use any of these, whichever you are familiar with. This article uses Scala IDE for reference.
Step 1: Create a Maven Project
Creating a Maven project is simple. Follow the steps below to create a project.
- Click on the File menu -> New -> Other.
- Click on Maven Project. In this step, select the “Create a simple project (skip archetype selection)” check box, then click “Next >”.
- Add the Group Id and Artifact Id, then click “Finish”.
With this, you have successfully created a Java project with Maven. The next step is to add the Spark dependencies.
Step 2: Adding required Spark dependency into pom.xml
Open the pom.xml file in your newly created Maven project and add the Spark dependencies (spark-core and spark-sql) to it. You can change the dependency versions according to your project's needs.
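As a sketch, the dependency section could look like the fragment below, placed inside the <project> element of pom.xml. The version 2.4.0 is the one used in this article. The Scala suffix _2.11 in the artifact IDs is an assumption; use the suffix that matches the Scala version of your Spark build.

```xml
<dependencies>
    <!-- Spark Core: RDDs, scheduling, and the basic Spark runtime -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <!-- Spark SQL: SparkSession, DataFrame, and Dataset APIs -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
</dependencies>
```

After saving pom.xml, Maven downloads the Spark libraries, so no separate Spark installation is needed for local runs.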
Note: This article uses version 2.4.0 of spark-core and spark-sql. Spark 3.0.1 is also available; choose the dependency version according to the Spark version on your cluster and your project requirements.
Step 3: Writing Sample Spark code
Now, you are almost done. Create a package named spark.java inside your project. Then, inside the newly created package, create a Java class named SparkReadCSV. Even though Hadoop is not installed on the system, you can simply download the winutils.exe file and use its location as the Hadoop home directory.
Here are the steps required to do this:
- Download the winutils.exe file: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-winutils/2.4.1
- Create a hadoop\bin directory on your drive. In this example, the hadoop\bin folder is created on the D: drive.
- Copy the winutils.exe file to the D:\hadoop\bin folder.
- Lastly, in the SparkReadCSV.java file, set the Hadoop home directory to the folder that contains bin (D:\hadoop in this example).
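The last step can be sketched as follows. Hadoop's client code reads the hadoop.home.dir system property and looks for bin\winutils.exe underneath it, so the property should point at D:\hadoop rather than at the bin folder itself. The class name HadoopHomeSetup and the method configure are hypothetical helpers for illustration; in SparkReadCSV the System.setProperty call would simply go at the start of main, before any Spark object is created.

```java
import java.io.File;

public class HadoopHomeSetup {

    // Sets hadoop.home.dir to the folder that contains bin\winutils.exe.
    // Returns true when winutils.exe is actually present under <hadoopHome>/bin.
    public static boolean configure(String hadoopHome) {
        System.setProperty("hadoop.home.dir", hadoopHome);
        return new File(new File(hadoopHome, "bin"), "winutils.exe").isFile();
    }

    public static void main(String[] args) {
        // Assumption: winutils.exe was copied to D:\hadoop\bin as in the steps above.
        boolean found = configure("D:\\hadoop");
        System.out.println("hadoop.home.dir = " + System.getProperty("hadoop.home.dir"));
        System.out.println("winutils.exe found: " + found);
    }
}
```

The file check is optional, but it gives an early hint when the path is mistyped.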
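Create an employee.txt file and add the dummy records below, one record per line. The fields are shown comma-separated here so that multi-word names stay in a single column; adjust the separator if your file uses a different one.
id,name,address,salary
1,Bernard Norris,Amberloup,10172
2,Sebastian Russell,Delicias,18178
3,Uriel Webster,Faisalabad,16419
4,Clarke Huffman,Merritt,16850
5,Orson Travis,Oberursel,17435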
Add the code that reads employee.txt to the SparkReadCSV.java file, with descriptive comments for better understanding.
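A minimal sketch of such a class is shown here; it is not necessarily the exact code of the original setup. It assumes that employee.txt is comma-separated with a header row and sits in the project's working directory, that Spark runs in local mode, and that winutils.exe is under D:\hadoop\bin. The helper method readEmployees is a hypothetical name introduced for illustration.

```java
// Add "package spark.java;" as the first line if the class lives in the package created above.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkReadCSV {

    // Reads a delimited text file with a header row into a DataFrame.
    public static Dataset<Row> readEmployees(SparkSession spark, String path) {
        return spark.read()
                .option("header", "true")       // first line holds the column names
                .option("inferSchema", "true")  // let Spark detect numeric columns such as salary
                .option("sep", ",")             // assumption: comma-separated fields
                .csv(path);
    }

    public static void main(String[] args) {
        // Assumption: winutils.exe is in D:\hadoop\bin (only needed on Windows).
        System.setProperty("hadoop.home.dir", "D:\\hadoop");

        // Local mode: Spark runs inside this JVM, using all available cores.
        SparkSession spark = SparkSession.builder()
                .appName("SparkReadCSV")
                .master("local[*]")
                .getOrCreate();

        // Assumption: employee.txt is in the working directory of the run configuration.
        Dataset<Row> employees = readEmployees(spark, "employee.txt");

        employees.printSchema();  // column names and inferred types
        employees.show();         // prints the records as a table

        spark.stop();
    }
}
```

In Scala IDE or Eclipse, the class can typically be run with Run As -> Java Application; the schema and the employee records should then appear in the console.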
We have set up a Spark development environment in a few easy steps. With this starting point, you can further explore Spark by solving different use cases.