
Remove all columns where the entire column is null in PySpark DataFrame

Last Updated : 07 Nov, 2022

In this article, we'll learn how to drop columns from a PySpark DataFrame when every value in the column is null.

Creating a Spark DataFrame with null columns:

We create the DataFrame with the pyspark.sql.SparkSession.createDataFrame() method.

Syntax:

pyspark.sql.SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Parameters:

  • data: An RDD of any kind of SQL data representation (e.g. Row, tuple, int, boolean, etc.), or a list, or a pandas.DataFrame.
  • schema: A datatype string or a list of column names; default is None.
  • samplingRatio: The sample ratio of rows used for inferring the schema.
  • verifySchema: Verify the data types of every row against the schema. Enabled by default.

Returns: DataFrame
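
As a quick, self-contained sketch of how these parameters can be used (an illustration only, separate from the article's main example below), the schema can also be given as a plain list of column names so that Spark infers the data types from the rows:

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('My App').getOrCreate()

# Hypothetical sample rows; column names come from the schema list
# and the column types are inferred from the data.
inferred_df = spark.createDataFrame(
    data=[("Alice", 1), ("Bob", 2)],
    schema=["name", "id"],
    samplingRatio=None,   # default: let Spark decide how many rows to sample
    verifySchema=True     # default: check each row against the schema
)
inferred_df.printSchema()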

Python3

from pyspark.sql import SparkSession
import pyspark.sql.types as T


# Create (or reuse) a SparkSession.
spark = SparkSession.builder.appName('My App').getOrCreate()

# Sample data in which the middlename column is null in every row.
actor_data = [
    ("James", None, "Bond", "M", 6000),
    ("Michael", None, None, "M", 4000),
    ("Robert", None, "Pattinson", "M", 4000),
    ("Natalie", None, "Portman", "F", 4000),
    ("Julia", None, "Roberts", "F", 1000)
]

# Explicit schema so the all-null column keeps a string type.
actor_schema = T.StructType([
    T.StructField("firstname", T.StringType(), True),
    T.StructField("middlename", T.StringType(), True),
    T.StructField("lastname", T.StringType(), True),
    T.StructField("gender", T.StringType(), True),
    T.StructField("salary", T.IntegerType(), True)
])

df = spark.createDataFrame(data=actor_data, schema=actor_schema)
df.show(truncate=False)


Output:

+---------+----------+---------+------+------+
|firstname|middlename|lastname |gender|salary|
+---------+----------+---------+------+------+
|James    |null      |Bond     |M     |6000  |
|Michael  |null      |null     |M     |4000  |
|Robert   |null      |Pattinson|M     |4000  |
|Natalie  |null      |Portman  |F     |4000  |
|Julia    |null      |Roberts  |F     |1000  |
+---------+----------+---------+------+------+

Remove all columns where the entire column is null in PySpark DataFrame

Here we want to drop every column in which all the values are null. The approach is to count the nulls in each column, compare that count with the total number of rows, and drop any column where the two match. In this DataFrame the middlename column is entirely null, so it is the one that should be dropped.

Python3

import pyspark.sql.functions as F
  
# Count the nulls present in each column
# and collect the result as a dictionary.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).collect()[0].asDict()
print(null_counts)

# Total number of rows in the DataFrame.
df_size = df.count()

# Collect the names of the columns whose null
# count equals the number of rows, i.e. the
# columns that are entirely null.
to_drop = [k for k, v in null_counts.items() if v == df_size]
print(to_drop)
  
# Drop all the columns present in that list.
output_df = df.drop(*to_drop)
  
output_df.show(truncate=False)


Output:

{'firstname': 0, 'middlename': 5, 'lastname': 1, 'gender': 0, 'salary': 0}
['middlename']
+---------+---------+------+------+
|firstname|lastname |gender|salary|
+---------+---------+------+------+
|James    |Bond     |M     |6000  |
|Michael  |null     |M     |4000  |
|Robert   |Pattinson|M     |4000  |
|Natalie  |Portman  |F     |4000  |
|Julia    |Roberts  |F     |1000  |
+---------+---------+------+------+
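
The same result can be reached with a single aggregation: F.count() counts only non-null values, so any column whose non-null count is zero must be entirely null. The snippet below is a variant sketch of that idea, reusing the df created above:

Python3

import pyspark.sql.functions as F

# F.count() ignores nulls, so a result of 0 means
# the column contains nothing but nulls.
non_null_counts = df.select(
    [F.count(F.col(c)).alias(c) for c in df.columns]
).collect()[0].asDict()

all_null_cols = [c for c, n in non_null_counts.items() if n == 0]
output_df = df.drop(*all_null_cols)
output_df.show(truncate=False)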

