PySpark – Read CSV file into DataFrame
In this article, we will see how to read CSV files into a DataFrame using PySpark and Python.
Read CSV File into DataFrame
Here we read a single CSV file into a PySpark DataFrame using spark.read.csv and then convert it to a pandas DataFrame using .toPandas().
Here, we first passed the path of our CSV file, authors.csv. Second, we passed the delimiter used in the CSV file, which here is a comma ','. Next, we set inferSchema to True, which makes Spark take an extra pass over the file to infer each column's data type instead of reading every column as a string. Finally, we converted the PySpark DataFrame to a pandas DataFrame df using the toPandas() method.
Read Multiple CSV Files
To read multiple CSV files, we pass a Python list of the file paths, each given as a string.
Here, we read authors.csv and book_author.csv, both located in the current working directory, with a comma ',' as the delimiter and the first row as the header. The rows of all listed files are loaded into a single DataFrame.
Read All CSV Files in Directory
To read all CSV files in a directory, we use the * wildcard in the path so that it matches every CSV file in that directory.
This reads all the CSV files present in the current working directory, with a comma ',' as the delimiter and the first row as the header.