Apache Spark with Scala – Resilient Distributed Dataset

Last Updated : 02 Sep, 2021

In the modern world, we deal with huge datasets every day, and data is growing faster than processing speeds. Computations on such large data are therefore often performed on distributed systems. A distributed system consists of a cluster of nodes (networked computers) that run processes in parallel and communicate with each other when needed.

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing. In this article, we will learn Apache Spark (version 2.x) using Scala.

Some basic concepts:

RDD (Resilient Distributed Dataset) – An immutable distributed collection of objects. The dataset is the main part of an RDD, and it is divided into logical partitions that can be processed on different nodes of the cluster.
SparkSession – The entry point to programming Spark with the Dataset and DataFrame API.
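
To make the immutability point concrete, here is a minimal sketch, assuming a local Spark 2.x setup with the Spark libraries on the classpath (for example in spark-shell). The names demoSession, demoContext, numbers and doubled are illustrative and not part of this article's main example. It shows that a transformation such as map returns a new RDD and leaves the original one untouched.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session for this sketch; local[2] runs Spark in-process with 2 threads
val demoSession = SparkSession.builder()
  .appName("RDD immutability sketch")
  .master("local[2]")
  .getOrCreate()
val demoContext = demoSession.sparkContext

// An RDD split into 2 logical partitions
val numbers = demoContext.parallelize(Seq(1, 2, 3, 4, 5), 2)

// map does not modify `numbers`; it describes a new RDD derived from it
val doubled = numbers.map(_ * 2)

// collect() brings the elements back to the driver as a local Array
val originalValues = numbers.collect().toList
val doubledValues = doubled.collect().toList

println(s"Original RDD : ${originalValues.mkString(", ")}")
println(s"Doubled RDD  : ${doubledValues.mkString(", ")}")
println(s"Partitions   : ${numbers.getNumPartitions}")
```

Because an RDD cannot be changed in place, every transformation produces a new RDD that can be used alongside the one it was derived from.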

We will be using Scala IDE only for demonstration purposes. Spark has no separate compiler; the code below only needs the Spark libraries on the classpath, for example in an IDE project that depends on Spark or in the spark-shell REPL.
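
One possible way to get the Spark libraries onto the classpath is an sbt build. The fragment below is only a sketch of a build.sbt file: the project name is hypothetical, and the Scala and Spark versions are assumptions chosen to match the 2.x line used in this article, so adjust them to your installation.

```scala
// build.sbt (illustrative; name and versions are assumptions, not from this article)
name := "spark-rdd-demo"

// Spark 2.4.x is published for Scala 2.11 and 2.12
scalaVersion := "2.12.15"

// spark-sql brings in SparkSession and, transitively, spark-core with the RDD API
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8"
```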

Let’s create our first RDD in Spark.

Scala

// Importing SparkSession
import org.apache.spark.sql.SparkSession
 
// Creating SparkSession object
val sparkSession = SparkSession.builder()
                   .appName("My First Spark Application")
                   .master("local").getOrCreate()
 
// Loading sparkContext
val sparkContext = sparkSession.sparkContext
 
// Local collection that will back the RDD
val intArray = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

// parallelize distributes the collection as an RDD. Its optional
// second argument specifies the number of partitions.
// Here we are using 3 partitions.
val intRDD = sparkContext.parallelize(intArray, 3)
 
// Printing number of partitions
println(s"Number of partitions in intRDD : ${intRDD.partitions.size}")
 
// Printing first element of RDD
println(s"First element in intRDD : ${intRDD.first}")
 
// Creating a string from the RDD
// take(n) fetches the first n elements of the RDD
// and returns them as an Array.
// We then convert the Array to a string using
// the mkString function in Scala.
val strFromRDD = intRDD.take(intRDD.count.toInt).mkString(", ")
println(s"String from intRDD : ${strFromRDD}")
 
// Printing contents of RDD
// collect function is used to retrieve all the data in an RDD.
println("Printing intRDD: ")
intRDD.collect().foreach(println)

Output:

Number of partitions in intRDD : 3
First element in intRDD : 1
String from intRDD : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Printing intRDD: 
1
2
3
4
5
6
7
8
9
10
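
To see how parallelize spread the ten elements over the three partitions, the sketch below uses glom, which groups the elements of each partition into one array. It assumes the same local Spark 2.x setup as above; the names layoutSession, sampleRDD, layout and total are illustrative. The exact split is decided by Spark, so the sketch prints whatever layout it finds instead of assuming one.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session for this sketch (reuses an existing one if present)
val layoutSession = SparkSession.builder()
  .appName("RDD partition layout sketch")
  .master("local")
  .getOrCreate()

// Same data as intRDD above: the numbers 1 to 10 in 3 partitions
val sampleRDD = layoutSession.sparkContext.parallelize(1 to 10, 3)

// glom() turns each partition into a single Array,
// so collect() returns one Array per partition
val layout: Array[Array[Int]] = sampleRDD.glom().collect()

layout.zipWithIndex.foreach { case (elements, index) =>
  println(s"Partition $index : ${elements.mkString(", ")}")
}

// reduce is an action: it combines the elements of all partitions
val total = sampleRDD.reduce(_ + _)
println(s"Sum of all elements : $total")
```

On a real cluster the partitions can be processed on different nodes; in local mode everything runs inside a single JVM, but the partitioning logic is the same.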
