Logistic Regression using PySpark Python
Last Updated: 21 Mar, 2023
In this tutorial series, we are going to cover Logistic Regression using PySpark. Logistic Regression is one of the basic ways to perform classification (don't be confused by the word "regression"); it is a classification method that predicts a discrete label. Some examples of classification are spam detection, predicting customer churn, and, as in this tutorial, predicting whether a Titanic passenger survived.
Loading Dataframe
We will be using the Titanic dataset, which has the columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked. We have to predict whether a passenger will survive or not using the Logistic Regression machine learning model. To get started, open a new notebook and follow the steps in the code below:
Python3
from pyspark.sql import SparkSession

# Start a Spark session and load the Titanic CSV file
spark = SparkSession.builder.appName('Titanic').getOrCreate()
df = spark.read.csv('Titanic.csv', inferSchema=True, header=True)
df.show()
Output:
Showing the data
df.printSchema()
Schema of the data
df.columns
Columns in the data
Removing NULL Values and Unneeded Columns
The next step is to drop the columns we do not need and remove the rows containing null values, as shown in the above picture. The columns PassengerId, Name, Ticket, and Cabin are not required to train and test the model, so we select only the remaining columns and then drop any rows with nulls.
Python3
# Keep only the columns needed for training and testing
rm_columns = df.select(['Survived', 'Pclass', 'Sex', 'Age',
                        'SibSp', 'Parch', 'Fare', 'Embarked'])

# Drop any rows that contain null values
result = rm_columns.na.drop()
result.show()
Output:
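The na.drop() call keeps only the rows in which every selected column has a value. A rough plain-Python sketch of that filtering logic (an illustration with toy data, not PySpark code):

```python
# Toy rows mimicking a few of the selected Titanic columns;
# None marks a missing value
rows = [
    {"Survived": 0, "Age": 22.0, "Embarked": "S"},
    {"Survived": 1, "Age": None, "Embarked": "C"},   # dropped: Age is null
    {"Survived": 1, "Age": 26.0, "Embarked": None},  # dropped: Embarked is null
]

# Equivalent of DataFrame.na.drop(): keep rows with no missing values
complete_rows = [r for r in rows if all(v is not None for v in r.values())]
print(len(complete_rows))  # 1 row survives the filter
```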
Converting String Columns to Numeric Columns
The next task is to convert the string columns (Sex and Embarked) to numeric columns, because VectorAssembler cannot vectorize string data. StringIndexer maps each category to a numeric index, and OneHotEncoder then turns that index into a one-hot vector.
Python3
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder

# Index the string columns, then one-hot encode the indices
sexIdx = StringIndexer(inputCol='Sex', outputCol='SexIndex')
sexEncode = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

embarkIdx = StringIndexer(inputCol='Embarked', outputCol='EmbarkIndex')
embarkEncode = OneHotEncoder(inputCol='EmbarkIndex', outputCol='EmbarkVec')

# Combine all feature columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'Age',
                                       'SibSp', 'Parch', 'Fare',
                                       'EmbarkVec'],
                            outputCol='features')
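To make the two encoding stages concrete, here is a small plain-Python illustration (not PySpark) of what indexing followed by one-hot encoding does to a string column. Spark's StringIndexer orders categories by descending frequency by default, breaking ties alphabetically; the full one-hot vector is shown here for clarity, although Spark's OneHotEncoder drops the last category by default:

```python
from collections import Counter

# Toy 'Embarked' column with three categories
embarked = ["S", "C", "S", "Q", "C"]

# StringIndexer-style step: categories by descending frequency, ties alphabetical
freq = Counter(embarked)
categories = sorted(freq, key=lambda c: (-freq[c], c))
index = {c: i for i, c in enumerate(categories)}

# OneHotEncoder-style step: index -> one-hot vector
def one_hot(i, size):
    vec = [0] * size
    vec[i] = 1
    return vec

encoded = [one_hot(index[c], len(categories)) for c in embarked]
print(index)       # {'C': 0, 'S': 1, 'Q': 2}
print(encoded[0])  # first value 'S' -> [0, 1, 0]
```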
Now we need a Pipeline to stack the stages one after another, and we import and instantiate the Logistic Regression model.
Python3
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

log_reg = LogisticRegression(featuresCol='features', labelCol='Survived')

# Stages run in order: index, encode, assemble features, then fit the model
pipe = Pipeline(stages=[sexIdx, embarkIdx,
                        sexEncode, embarkEncode,
                        assembler, log_reg])
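A Pipeline simply applies its stages in order, each one transforming the output of the previous stage. Conceptually (a plain-Python sketch where simple functions stand in for Spark transformers):

```python
# Conceptual sketch of a Pipeline: each stage transforms the previous output.
# These functions are stand-ins for Spark transformers, not real Spark stages.
def index_stage(data):     return data + ["indexed"]
def encode_stage(data):    return data + ["encoded"]
def assemble_stage(data):  return data + ["assembled"]

stages = [index_stage, encode_stage, assemble_stage]

def run_pipeline(data, stages):
    for stage in stages:
        data = stage(data)
    return data

print(run_pipeline(["raw"], stages))
# ['raw', 'indexed', 'encoded', 'assembled']
```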
After pipelining the tasks, we will split the data into training data and testing data to train and test the model.
Python3
# Split the cleaned data 70/30 into training and testing sets
train_data, test_data = result.randomSplit([0.7, 0.3])

# Fit the whole pipeline on the training data, then transform the test data
fit_model = pipe.fit(train_data)
results = fit_model.transform(test_data)
results.show()
Output:
Model evaluation using ROC-AUC
The transform step adds the extra columns rawPrediction, probability, and prediction to the results. After getting the results, we will compute the AUC (Area Under the ROC Curve), which measures how well the model separates the two classes. For this, we will use BinaryClassificationEvaluator as shown:
Python3
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate using the area under the ROC curve.
# Note: the default rawPredictionCol is 'rawPrediction', which gives a
# finer-grained ROC than the hard 0/1 labels in 'prediction'.
res = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                    labelCol='Survived')
ROC_AUC = res.evaluate(results)
Output:
Note: In general, an AUC value above 0.7 is considered good, but it’s important to compare the value to the expected performance of the problem and the data to determine if it’s actually good.
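To see what the AUC measures, here is a small plain-Python computation on toy labels and scores (an illustration, not part of the Titanic pipeline). It uses the fact that AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, counting ties as half:

```python
# Toy labels (1 = positive) and model scores
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# AUC = P(score of random positive > score of random negative), ties count 0.5
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 5 of 6 positive/negative pairs are ranked correctly: ~0.833
```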