
How to Create a DataFrame from Scala's List of Iterables?

In Scala, working with large datasets is made easier with Apache Spark, a powerful framework for distributed computing. One of Spark's core abstractions is the DataFrame, which organizes data into named, typed columns for efficient processing. In this article, we'll explore how to create DataFrames from Scala lists of iterables using Apache Spark's DataFrame API.

Understanding DataFrames and Lists

A DataFrame is like a table in a database, letting us work with structured data easily. A Scala List, on the other hand, is an ordered in-memory collection; a list of iterables is simply a list whose elements are themselves collections (such as List, Seq, or Vector), each one holding the values of a single row.
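For example, a list of iterables in Scala might look like the following, where each inner collection holds the values of one row (the names and values here are purely illustrative):

// A List of Iterables: each inner Seq is one row of mixed-type values,
// so the element type is Any
val rows: List[Iterable[Any]] = List(
  Seq("Alice", 30, "New York"),
  Seq("Bob", 25, "Los Angeles")
)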

Creating DataFrames from Lists

There are two simple ways to turn lists into DataFrames:

1. Using RDDs: parallelize the list into an RDD of Row objects and pair it with an explicit schema.

2. Using the toDF method: build the DataFrame directly from a list of tuples, letting Spark infer the schema.

Let's see both methods in action.

Method 1: Using RDDs

// Import necessary libraries
import org.apache.spark.sql.{SparkSession, DataFrame, Row}
import org.apache.spark.sql.types._

// Create SparkSession
val spark = SparkSession.builder().appName("DataFrameFromList").getOrCreate()

// Define schema for DataFrame
val schema = StructType(Seq(
  StructField("Name", StringType, nullable = false),
  StructField("Age", IntegerType, nullable = false),
  StructField("Location", StringType, nullable = false)
))

// Sample data: List of Lists
val data = List(
  List("Alice", 30, "New York"),
  List("Bob", 25, "Los Angeles"),
  List("Charlie", 35, "Chicago")
)

// Convert List of Lists to RDD of Rows
val rowsRDD = spark.sparkContext.parallelize(data).map(Row.fromSeq)

// Create DataFrame
val df: DataFrame = spark.createDataFrame(rowsRDD, schema)

// Show DataFrame
df.show()
+-------+---+-----------+
|   Name|Age|   Location|
+-------+---+-----------+
|  Alice| 30|   New York|
|    Bob| 25|Los Angeles|
|Charlie| 35|    Chicago|
+-------+---+-----------+

Explanation:

  1. We import necessary libraries.
  2. A SparkSession is created.
  3. We define the DataFrame schema, specifying data types for each column.
  4. Sample data is represented as a List of Iterables, where each inner list represents a row.
  5. We use spark.sparkContext.parallelize to create an RDD from the list.
  6. The map function on the RDD converts each inner list to a Row object using Row.fromSeq (see the sketch after this list for arbitrary Iterables).
  7. Finally, spark.createDataFrame creates the DataFrame from the RDD and schema.
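
The same approach extends to any list of iterables, not just a list of lists. Row.fromSeq expects a Seq, so a generic Iterable just needs a toSeq call first. Here is a minimal sketch reusing the spark session and schema defined above (the extra rows are illustrative):

// Rows may arrive as different Iterable types, e.g. a Seq and a Vector
val mixed: List[Iterable[Any]] = List(
  Seq("Dave", 40, "Boston"),
  Vector("Eve", 28, "Seattle")
)

// Convert each Iterable to a Seq before building the Row
val mixedRDD = spark.sparkContext.parallelize(mixed).map(it => Row.fromSeq(it.toSeq))
val mixedDF = spark.createDataFrame(mixedRDD, schema)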

Method 2: Using the toDF method

// Import necessary libraries
import org.apache.spark.sql.{SparkSession, DataFrame}

// Create SparkSession
val spark = SparkSession.builder().appName("DataFrameFromList").getOrCreate()

// Import implicits to enable the toDF method on local collections
import spark.implicits._

// Sample data: List of tuples. toDF needs one tuple element per column;
// a List of Lists would encode as a single array column instead.
val data = List(
  ("Alice", 30, "New York"),
  ("Bob", 25, "Los Angeles"),
  ("Charlie", 35, "Chicago")
)

// Create DataFrame using the toDF method, naming each column
val df = data.toDF("Name", "Age", "Location")

// Show DataFrame
df.show()
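
As a quick sanity check, printSchema shows the types toDF inferred from the tuples; note that the Int-backed Age column comes out non-nullable. It should print something like:

df.printSchema()
// root
//  |-- Name: string (nullable = true)
//  |-- Age: integer (nullable = false)
//  |-- Location: string (nullable = true)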

Explanation:

  1. After the SparkSession is created, import spark.implicits._ brings the toDF method into scope for local collections.
  2. Each tuple in the list becomes one row, and each tuple element becomes one column; toDF assigns the supplied column names.
  3. Spark infers the column types from the tuple's element types (String, Int, String), so no explicit schema is needed.
  4. The resulting DataFrame is identical to the one produced by Method 1.

Conclusion:

Creating DataFrames from lists of data in Scala is straightforward with Apache Spark's DataFrame API. Choose the RDD approach when you need an explicit schema or your rows arrive as generic Iterables, and the toDF method when a list of tuples with inferred types is enough; either way, you can quickly organize and process your data for analysis.
