How to create Spark session in Scala?

Scala stands for "scalable language". It was created by Martin Odersky and released in 2003. It is an object-oriented language that also supports the functional programming paradigm. Everything in Scala is an object; for example, values like 1 and 2 can invoke methods such as toString(). Scala is statically typed, although unlike other statically typed languages such as C, C++, or Java, it rarely requires explicit type annotations: the compiler infers types, and all type checking happens at compile time. Static typing allows safe systems to be built by default. Smart built-in checks and actionable error messages, combined with thread-safe data structures and collections, prevent many tricky bugs before the program first runs.
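As a minimal sketch of this (the names here are purely illustrative), the compiler infers types and then verifies them before the program ever runs:

object TypeInferenceDemo {
  def main(args: Array[String]): Unit = {
    val count = 42            // no annotation needed: Int is inferred
    val greeting = "hello"    // inferred as String
    println(count.toString)   // even a literal like 42 is an object with methods
    // greeting = "bye"       // would not compile: vals are immutable,
                              // and the error surfaces at compile time
  }
}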

Understanding Spark

The official definition of Spark on its website is "Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters".

Let us dive deeper and understand how we interact with Spark from a Scala program.

Understanding SparkSession

The SparkSession class is the entry point into all functionality in Spark. It was introduced in Spark 2.0. It serves as a bridge to access all of Spark's core features, encompassing RDDs, DataFrames, and Datasets, offering a cohesive interface for handling structured data processing. When developing a Spark SQL application, it is typically one of the initial objects you instantiate.

Let us now see how to create one.

Creating SparkSession

Method 1 - Using the builder API

The SparkSession object can be created using the builder API as follows.

import org.apache.spark.sql.SparkSession

object CreateSparkSession {
  def main(args: Array[String]): Unit = {
    // Build (or reuse) a session running locally with a single worker thread
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("CreatingSparkSession")
      .getOrCreate()
    println(spark)   // prints the SparkSession object reference
  }
}


Output: The SparkSession object


Above, we used the builder method available on the SparkSession companion object (a Scala companion object, not a regular instance of the class) to create a SparkSession object.
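With the session in hand, it serves as the entry point for structured data processing. As a small illustration (the sample data here is made up), we can turn a local collection into a DataFrame through the session:

// Inside main, after getOrCreate(); `spark` is the session built above
import spark.implicits._                      // enables toDF on local collections
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.show()                                     // displays the DataFrame as a table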

We can also access the SparkContext and SQLContext from this SparkSession object. Let's see how to extract them:

println(spark.sparkContext)
println(spark.sqlContext)


Output: Accessing the SparkContext and SQLContext


Here we accessed the SparkContext and SQLContext objects held inside the SparkSession.
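The extracted SparkContext also gives access to the lower-level RDD API. A minimal sketch (the numbers are illustrative):

// `spark` is the session created earlier
val sc = spark.sparkContext
val rdd = sc.parallelize(Seq(1, 2, 3))   // distributes a local collection as an RDD
println(rdd.count())                     // prints 3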

Adding configuration to the SparkSession object

We can also add configuration options to the SparkSession object to change its behaviour according to our needs. For this, we call the config method on the builder. Let us see how to provide configuration while creating a SparkSession.

val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("CreatingSparkSession")
      // sets the directory where Spark SQL stores managed tables
      .config("spark.sql.warehouse.dir", "<path>/spark-warehouse")
      .getOrCreate()
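Once the session is created, the same setting can be read back at runtime through the session's conf interface, which is a quick way to confirm the option was applied:

// Returns the value supplied to config() at build time
println(spark.conf.get("spark.sql.warehouse.dir"))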

Method 2 - From an existing SparkSession

We can create a new SparkSession from an existing one using the newSession method. Let us see how to do this.

import org.apache.spark.sql.SparkSession

object CreateSparkSession {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("CreatingSparkSession")
      .getOrCreate()
    // newSession returns a fresh session that shares the underlying SparkContext
    val spark_2 = spark.newSession()
    println(spark_2)
  }
}


Output: The new SparkSession object


Here we created a new SparkSession object from an existing one. We can create as many sessions as we want this way; however, there must already be an existing SparkSession to call newSession on.
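Note that sessions created with newSession share the same underlying SparkContext while keeping their own SQL configuration and temporary views. A small sketch of this isolation (the view name is illustrative):

// Both sessions run on the same SparkContext
println(spark.sparkContext eq spark_2.sparkContext)        // true

// SQL configuration is per-session
spark.conf.set("spark.sql.shuffle.partitions", "50")
println(spark_2.conf.get("spark.sql.shuffle.partitions"))  // unchanged default (200)

// Temporary views are also per-session
spark.range(5).createOrReplaceTempView("numbers")
// spark_2.sql("SELECT * FROM numbers")   // would fail: the view is not visible here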
