
DataFrame to JSON Array in Spark in Python

Last Updated : 05 Feb, 2023

In this article, we are going to see how to convert a data frame to a JSON array using PySpark in Python.

In Apache Spark, a data frame is a distributed collection of data organized into named columns. It is similar to a spreadsheet or a SQL table, with rows and columns. You can use a data frame to store and manipulate tabular data in a distributed environment. DataFrames are designed to be expressive, efficient, and flexible, and they are a key component of Spark's structured APIs, provided by the Spark SQL module.

What is a JSON array?

A JSON (JavaScript Object Notation) array is a data structure that consists of an ordered list of values. It is often used to transmit data between a server and a web application, or between two different applications. JSON arrays are written in a syntax similar to that of JavaScript arrays, with square brackets containing a list of values separated by commas.
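For example, here is a small, Spark-independent illustration of a JSON array being parsed with Python's standard json module:

import json

# A JSON array: square brackets enclosing comma-separated values
text = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'

# json.loads() parses the array into a Python list of dictionaries
records = json.loads(text)
print(records[0]["name"])  # Alice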

Methods to convert a DataFrame to a JSON array in PySpark:

  • Using the toJSON() method
  • Using the toPandas() method
  • Using the write.json() method

Method 1: Using the toJSON() method

The toJSON() method in PySpark converts a Spark data frame into an RDD of strings, with each string holding the JSON representation of one row. Its only (optional) argument, use_unicode, controls the encoding of those strings. Calling collect() on the result brings the JSON strings back to the driver as a Python list, which serves as our JSON array.

Stepwise implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Create a Spark session using the getOrCreate() function.

spark = SparkSession.builder.appName("MyApp").getOrCreate()

Step 3: Create a data frame with sample data.

df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
                           ["id", "name", "age"])

Step 4: Print the data frame.

df.show()

Step 5: Use the toJSON() method to convert the data frame to a JSON array.

json_array = df.toJSON().collect()
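Note that collect() brings every row back to the driver, which can be expensive for a large data frame. As an optional sanity check, you can pull just a few rows instead with take():

# Inspect a couple of JSON strings without collecting everything
print(df.toJSON().take(2))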

Step 6: Finally, print the JSON array.

print("JSON array:",json_array)

Example:

In this example, we have created a data frame with three columns: id, name, and age, and converted it to a JSON array using the toJSON() method. In the output, the data frame as well as the JSON array are printed.

Python3

# Importing SparkSession
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Create a DataFrame
df = spark.createDataFrame([(1, "Alice", 10),
                            (2, "Bob", 20),
                            (3, "Charlie", 30)],
                           ["id", "name", "age"])

print("Dataframe: ")
df.show()

# Convert the DataFrame to a JSON array
# (a list with one JSON string per row)
json_array = df.toJSON().collect()

# Print the JSON array
print("JSON array:", json_array)


Output:

Dataframe: 
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 10|
|  2|    Bob| 20|
|  3|Charlie| 30|
+---+-------+---+

JSON array:
[
  '{"id":1,"name":"Alice","age":10}',
  '{"id":2,"name":"Bob","age":20}',
  '{"id":3,"name":"Charlie","age":30}'
]
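Keep in mind that toJSON() produces one JSON string per row, so json_array is a Python list of strings rather than a single JSON document. If you need one JSON array string instead, a minimal sketch using the standard json module is:

import json

# Parse each row string and re-serialize the whole list as one JSON array
combined = json.dumps([json.loads(row) for row in json_array])
print(combined)
# [{"id": 1, "name": "Alice", "age": 10}, {"id": 2, "name": "Bob", "age": 20}, ...]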

Method 2: Using the toPandas() method

In this method, we will convert the Spark data frame, which has three columns: id, name, and age, to a pandas data frame, and then convert it to a JSON string using the pandas to_json() method.

Stepwise implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Create a Spark session using the getOrCreate() function.

spark = SparkSession.builder.appName("MyApp").getOrCreate()

Step 3: Create a data frame with sample data.

df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
                           ["id", "name", "age"])

Step 4: Print the data frame

df.show()

Step 5: Convert the data frame to a pandas data frame

pandas_df = df.toPandas()

Step 6: Convert the pandas data frame to a JSON array (to_json() returns the whole array as a single string).

json_data = pandas_df.to_json(orient='records')
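The orient parameter controls the JSON layout: 'records' produces a list of row objects, while the data frame default, 'columns', nests values under each column name. A quick comparison (output shapes shown in the comments):

# 'records': [{"id":1,"name":"Alice","age":10}, ...]
print(pandas_df.to_json(orient='records'))

# 'columns' (default): {"id":{"0":1,...},"name":{"0":"Alice",...},...}
print(pandas_df.to_json(orient='columns'))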

Step 7: Finally, print the JSON array

print("JSON array:",json_data)

Example:

Python3

# Importing SparkSession
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Create a DataFrame
df = spark.createDataFrame([(1, "Joseph", 10),
                            (2, "Jack", 20),
                            (3, "Elon", 30)],
                           ["id", "name", "age"])

print("Dataframe: ")
df.show()

# Convert dataframe to a Pandas dataframe
pandas_df = df.toPandas()

# Convert Pandas dataframe to a JSON string
json_data = pandas_df.to_json(orient='records')

# Print the JSON array
print("JSON array:", json_data)


Output:

Dataframe: 
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1| Joseph| 10|
|  2|   Jack| 20|
|  3|   Elon| 30|
+---+-------+---+

JSON array: [{"id":1,"name":"Joseph","age":10},{"id":2,"name":"Jack","age":20},{"id":3,"name":"Elon","age":30}]
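Because to_json() returns the whole array as one string, you can also ask pandas to pretty-print it; the indent argument should be available on reasonably recent pandas versions (1.0 and later):

# Pretty-print the JSON array with two-space indentation
print(pandas_df.to_json(orient='records', indent=2))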

Method 3: Using the write.json() method

In this method, we will use write.json() to save the data frame as JSON. Note that this creates a directory called data.json containing a set of part files with the JSON data, one per partition, rather than a single file. If we want the JSON data written as a single file, we can first reduce the data frame to one partition with coalesce(1) and then call write.json(), which produces a directory containing a single part file.

Stepwise implementation:

Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Create a Spark session using the getOrCreate() function.

spark = SparkSession.builder.appName("MyApp").getOrCreate()

Step 3: Create a data frame with sample data.

df = spark.createDataFrame([(1, "Alice", 10),
                            (2, "Bob", 20),
                            (3, "Charlie", 30)], 
                            ["id", "name", "age"])

Step 4: Use write.json() to create a JSON directory

df.write.json('data.json')

Step 5: Finally, write the data frame through a single partition so that the JSON data ends up in one file.

df.coalesce(1).write.json('data_merged.json')
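One caveat: by default, write.json() raises an error if the target path already exists, so re-running the snippet will fail. A common workaround is to set the save mode to overwrite:

# Overwrite the output directory if it already exists
df.coalesce(1).write.mode("overwrite").json('data_merged.json')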

Example:

In this example, we created a data frame with three columns: id, name, and age. The code below first writes the data frame to a directory named data.json, which may contain multiple JSON part files, and then writes it again through a single partition. The single-file output is finally stored in a directory named data_merged.json.

Python3

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Create a DataFrame
df = spark.createDataFrame([(1, "Donald", 51),
                            (2, "Riya", 23),
                            (3, "Vani", 22)],
                           ["id", "name", "age"])

# Write the dataframe to a directory of JSON part files
df.write.json('data.json')

# Write through a single partition so one JSON part file is produced
df.coalesce(1).write.json('data_merged.json')

# Printing data frame
df.show()


Output (the part file inside data_merged.json; Spark writes one JSON object per line, i.e., JSON Lines, rather than a bracketed array):

{"id":1,"name":"Donald","age":51}
{"id":2,"name":"Riya","age":23}
{"id":3,"name":"Vani","age":22}

