Select columns in PySpark dataframe
In this article, we will learn how to select columns in PySpark dataframe.
In PySpark we select columns using the select() function, which accepts single or multiple columns in several different formats.
Syntax: dataframe_name.select( column_names )
Note: We specify the path to the Spark directory with findspark.init() so that our program can find the local Apache Spark installation. Skip this line if you are running the program in the cloud. For example, if the Spark folder is in the C drive under the name spark, the call would look like findspark.init('c:/spark'). Omitting the path can lead to a py4j.protocol.Py4JError when running the program locally.
Example 1: Select single or multiple columns
We can select single or multiple columns with the select() function by passing the particular column names. Here we are using a custom dataset, so we need to specify its schema when creating it.
Note: There are several ways to pass column names to the select() function. Above we used the plain string “column_name”. Other ways include (all shown with reference to the above code):
- We can use the col() function from the pyspark.sql.functions module to specify the particular columns
- We can use attribute access on the dataframe, i.e. dataframe.column_name
- We can use dictionary-style indexing, i.e. dataframe["column_name"]
Note: All of the above methods yield the same output.
Example 2: Select columns using indexing
Indexing provides an easy way of accessing columns inside a dataframe. Indexing starts at 0, so a dataframe with n columns numbers them from 0 (the first column) through n-1 (the last). We can use df.columns to get the list of all column names and index into it to pass the required columns to the select() function. As before, we are using a custom dataset, so we need to specify its schema when creating it.
Example 3: Access nested columns of a dataframe
A dataframe may contain nested columns: for example, a column named “Marks” might have sub-columns for internal and external marks, or a “name” column might have separate sub-columns for the first, middle, and last names. To access nested columns with the select() function, we specify the sub-column together with its parent column. Here again we are using a custom dataset, so we need to specify its schema when creating it.
Printing the schema shows that the dataset has a name column with sub-columns firstname and lastname. After performing the select operation, these two sub-columns appear as top-level columns in the output.