Logistic Regression using PySpark Python
Last Updated: 21 Mar, 2023
In this tutorial series, we are going to cover Logistic Regression using PySpark. Logistic Regression is one of the basic ways to perform classification (don't be confused by the word "regression"); it is a classification method that predicts a discrete label. Some examples of classification are spam detection, predicting customer churn, and, as in this tutorial, predicting whether a Titanic passenger survived.
Loading Dataframe
We will be using the Titanic dataset, which has the columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked. We have to predict whether a passenger will survive or not using the Logistic Regression machine learning model. To get started, open a new notebook and follow the steps in the code below:
Python3
from pyspark.sql import SparkSession

# Start a Spark session and load the Titanic CSV file
spark = SparkSession.builder.appName('Titanic').getOrCreate()
df = spark.read.csv('Titanic.csv', inferSchema=True, header=True)
df.show()
Output:
Showing the data
df.printSchema()
Schema of the data
df.columns
Columns in the data
Removing NULL Values and Unneeded Columns
The next step is to drop the columns we do not need and remove the rows containing null values, as shown in the above picture. The columns PassengerId, Name, Ticket, and Cabin are not required to train and test the model, so we select only the remaining columns and then drop any rows with nulls.
Python3
# Keep only the columns needed for training and testing
rm_columns = df.select(['Survived', 'Pclass', 'Sex', 'Age',
                        'SibSp', 'Parch', 'Fare', 'Embarked'])

# Drop any rows that contain null values
result = rm_columns.na.drop()
result.show()
Output:
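The na.drop() call keeps only the rows in which every selected column has a value. A rough plain-Python sketch of that filtering logic (an illustration with toy data, not PySpark code):

```python
# Toy rows mimicking a few of the selected Titanic columns;
# None marks a missing value
rows = [
    {"Survived": 0, "Age": 22.0, "Embarked": "S"},
    {"Survived": 1, "Age": None, "Embarked": "C"},   # dropped: Age is null
    {"Survived": 1, "Age": 26.0, "Embarked": None},  # dropped: Embarked is null
]

# Equivalent of DataFrame.na.drop(): keep rows with no missing values
complete_rows = [r for r in rows if all(v is not None for v in r.values())]
print(len(complete_rows))  # 1 row survives the filter
```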
Converting String Columns to Numeric Columns
The next task is to convert the string columns (Sex and Embarked) to numeric columns, because VectorAssembler cannot vectorize string data. StringIndexer maps each category to a numeric index, and OneHotEncoder then turns that index into a one-hot vector.
Python3
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder

# Index the string columns, then one-hot encode the indices
sexIdx = StringIndexer(inputCol='Sex', outputCol='SexIndex')
sexEncode = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

embarkIdx = StringIndexer(inputCol='Embarked', outputCol='EmbarkIndex')
embarkEncode = OneHotEncoder(inputCol='EmbarkIndex', outputCol='EmbarkVec')

# Combine all feature columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'Age',
                                       'SibSp', 'Parch', 'Fare',
                                       'EmbarkVec'],
                            outputCol='features')
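To make the two encoding stages concrete, here is a small plain-Python illustration (not PySpark) of what indexing followed by one-hot encoding does to a string column. Spark's StringIndexer orders categories by descending frequency by default, breaking ties alphabetically; the full one-hot vector is shown here for clarity, although Spark's OneHotEncoder drops the last category by default:

```python
from collections import Counter

# Toy 'Embarked' column with three categories
embarked = ["S", "C", "S", "Q", "C"]

# StringIndexer-style step: categories by descending frequency, ties alphabetical
freq = Counter(embarked)
categories = sorted(freq, key=lambda c: (-freq[c], c))
index = {c: i for i, c in enumerate(categories)}

# OneHotEncoder-style step: index -> one-hot vector
def one_hot(i, size):
    vec = [0] * size
    vec[i] = 1
    return vec

encoded = [one_hot(index[c], len(categories)) for c in embarked]
print(index)       # {'C': 0, 'S': 1, 'Q': 2}
print(encoded[0])  # first value 'S' -> [0, 1, 0]
```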
Now we need a Pipeline to stack the stages one after another, and we import and instantiate the Logistic Regression model.
Python3
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

log_reg = LogisticRegression(featuresCol='features', labelCol='Survived')

# Stages run in order: index, encode, assemble features, then fit the model
pipe = Pipeline(stages=[sexIdx, embarkIdx,
                        sexEncode, embarkEncode,
                        assembler, log_reg])
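A Pipeline simply applies its stages in order, each one transforming the output of the previous stage. Conceptually (a plain-Python sketch where simple functions stand in for Spark transformers):

```python
# Conceptual sketch of a Pipeline: each stage transforms the previous output.
# These functions are stand-ins for Spark transformers, not real Spark stages.
def index_stage(data):     return data + ["indexed"]
def encode_stage(data):    return data + ["encoded"]
def assemble_stage(data):  return data + ["assembled"]

stages = [index_stage, encode_stage, assemble_stage]

def run_pipeline(data, stages):
    for stage in stages:
        data = stage(data)
    return data

print(run_pipeline(["raw"], stages))
# ['raw', 'indexed', 'encoded', 'assembled']
```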
After pipelining the tasks, we will split the data into training data and testing data to train and test the model.
Python3
# Split the cleaned data 70/30 into training and testing sets
train_data, test_data = result.randomSplit([0.7, 0.3])

# Fit the whole pipeline on the training data, then transform the test data
fit_model = pipe.fit(train_data)
results = fit_model.transform(test_data)
results.show()
Output:
Model evaluation using ROC-AUC
The transform step adds the extra columns rawPrediction, probability, and prediction to the results. After getting the results, we will compute the AUC (Area Under the ROC Curve), which measures how well the model separates the two classes. For this, we will use BinaryClassificationEvaluator as shown:
Python3
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate using the area under the ROC curve.
# Note: the default rawPredictionCol is 'rawPrediction', which gives a
# finer-grained ROC than the hard 0/1 labels in 'prediction'.
res = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                    labelCol='Survived')
ROC_AUC = res.evaluate(results)
Output:
Note: In general, an AUC value above 0.7 is considered good, but it’s important to compare the value to the expected performance of the problem and the data to determine if it’s actually good.
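To see what the AUC measures, here is a small plain-Python computation on toy labels and scores (an illustration, not part of the Titanic pipeline). It uses the fact that AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, counting ties as half:

```python
# Toy labels (1 = positive) and model scores
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# AUC = P(score of random positive > score of random negative), ties count 0.5
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 5 of 6 positive/negative pairs are ranked correctly: ~0.833
```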