Apache Spark with Scala – Resilient Distributed Dataset
In the modern world, we deal with huge datasets every day, and data is growing even faster than processing speeds. Computations on such large datasets are therefore often carried out on distributed systems. A distributed system consists of a cluster of nodes (networked computers) that run processes in parallel and communicate with each other when needed.
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing. In this article, we will be learning Apache Spark (version 2.x) using Scala.
Some basic concepts:
- RDD (Resilient Distributed Dataset) – an immutable, distributed collection of objects. The dataset is divided into logical partitions, which can be computed in parallel on different nodes of the cluster.
- SparkSession – the entry point to programming Spark with the Dataset and DataFrame API (a minimal construction sketch follows this list).
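As a quick illustration of the second point, a SparkSession is typically obtained through its builder. The application name and the "local[*]" master URL below are placeholder values for local experimentation, not anything prescribed by this article:

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; "local[*]" runs Spark on all local cores
val spark = SparkSession.builder()
  .appName("FirstSparkApp")   // arbitrary application name
  .master("local[*]")         // placeholder master URL for local experimentation
  .getOrCreate()

// RDDs are created through the underlying SparkContext
val sc = spark.sparkContext
```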
We will be using the Scala IDE only for demonstration purposes. A working Spark installation is required to run the code below.
Let’s create our first RDD in Spark.
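Below is a minimal sketch of such a program. It assumes local mode with three cores (local[3]), which is what produces the three partitions reported in the output; the object and variable names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object IntRDDExample {
  def main(args: Array[String]): Unit = {
    // Start a SparkSession; local[3] runs Spark locally with 3 cores,
    // so parallelize() splits the data into 3 partitions by default
    val spark = SparkSession.builder()
      .appName("IntRDDExample")
      .master("local[3]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD from a local Scala collection of integers 1 to 10
    val intRDD = sc.parallelize(1 to 10)

    // Inspect the RDD
    println("Number of partitions in intRDD : " + intRDD.getNumPartitions)
    println("First element in intRDD : " + intRDD.first())
    println("String from intRDD : " + intRDD.collect().mkString(", "))

    // Collect the elements back to the driver and print them on one line
    print("Printing intRDD: ")
    intRDD.collect().foreach(x => print(x + " "))
    println()

    spark.stop()
  }
}
```

Note that collect() brings the whole RDD back to the driver, so it should only be used on small datasets like this one.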
Output:
Number of partitions in intRDD : 3
First element in intRDD : 1
String from intRDD : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Printing intRDD: 1 2 3 4 5 6 7 8 9 10