
Sarcasm Detection Using Neural Networks

Last Updated : 20 Mar, 2024

Sarcasm is the use of words to insult someone, to express irritation or aggression, or to make something funny.
It is a way of conveying negative feelings through positive or humorous words. In recent times, Twitter has become one of the largest social platforms where people share their thoughts, feelings, and actions as tweets, which can sometimes be sarcastic. To analyze and detect such sarcasm, we will build a machine learning model that uses a neural network for sarcasm detection.

Sarcasm Detection Using Neural Network

In this article, we will build a sarcasm detector using a neural network that labels each headline as sarcastic or non-sarcastic, which makes this a binary classification problem. Let's start with the data analysis (EDA) and then build the classification model.

Steps to be followed:

  • Importing libraries
  • Loading datasets
  • Data Preprocessing
  • Text cleaning
  • Splitting train and test datasets
  • Model building
  • Prediction
  • Conclusion

Dataset features:

  • article_link – the link to the article
  • headline – the headline of the article
  • is_sarcastic – 1 if the headline is sarcastic, 0 otherwise

Importing/Installing libraries

Python libraries make it simple to manage data and perform both common and complex tasks, often with a single line of code.

  • Numpy – an array-processing package that provides high-performance multidimensional array objects and tools to work with them.
  • Pandas – a library for working with labeled data. It is used to analyze and manipulate data.
  • Matplotlib/Seaborn – Matplotlib is a Python library for data visualization. It helps create visual representations of large amounts of data that are easy to read and understand, with plot types such as bar charts, scatter plots, and line graphs. Seaborn builds on Matplotlib with a higher-level statistical plotting interface.
  • NLTK – the Natural Language Toolkit is a large Python library that supports a variety of text-processing tasks such as stemming, lemmatization, tagging, parsing, and tokenization.

Let’s start the implementation of the model.

Importing libraries

Python3




# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud

# importing libraries for model building and evaluation
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from sklearn.metrics import classification_report, confusion_matrix


  • Natural Language Toolkit (nltk): used to remove stop words, lemmatize sentences, and perform many other text-processing functions.
  • Regular expressions (re): help remove symbols that carry no meaning for this task, such as @, commas, hashtags, and *.
  • Tokenizer: splits text into individual units, such as words or subwords, which makes it easier to analyze and process text for language-related tasks.

Loading Datasets

Let's now load the dataset into a pandas data frame, examine its first five rows, and look at the data info.

Python3




# loading the datasets
data1 = pd.read_csv('Sarcasm data1.csv')
data2 = pd.read_csv('Sarcasm data2.csv')
data = pd.concat([data1, data2], ignore_index=True)
  
# Display the concatenated data
print(data.head())


Output:

                                        article_link  \
0  https://www.huffingtonpost.com/entry/versace-b...   
1  https://www.huffingtonpost.com/entry/roseanne-...   
2  https://local.theonion.com/mom-starting-to-fea...   
3  https://politics.theonion.com/boehner-just-wan...   
4  https://www.huffingtonpost.com/entry/jk-rowlin...   
                                            headline  is_sarcastic  
0  former versace store clerk sues over secret 'b...             0  
1  the 'roseanne' revival catches up to our thorn...             0  
2  mom starting to fear son's web series closest ...             1  
3  boehner just wants wife to listen, not come up...             1  
4  j.k. rowling wishes snape happy birthday in th...             0  

Data Assessment:

Let's examine the basic information about the data frame.

Python3




data.info()


Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55328 entries, 0 to 55327
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   article_link  55328 non-null  object
 1   headline      55328 non-null  object
 2   is_sarcastic  55328 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.3+ MB
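The info output confirms there are no missing values in any column. As an optional extra check (not part of the original walkthrough), we could also look for duplicate rows; a minimal sketch, assuming the data frame loaded above:

Python3

# extra checks (optional): the info above already shows there are no nulls
print(data.isnull().sum())
print('Duplicate rows:', data.duplicated().sum())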

Data Preprocessing:

Data preprocessing is the process of preparing data for analysis by cleaning, manipulating, and combining it. Because raw data can be inaccurate or poorly formatted, this is a crucial step for producing correct results with better quality and reliability.

Python3




# Check whether the dataset labels are balanced

# Count the number of headlines per is_sarcastic class
plt.figure(figsize=(10, 5))
sns.countplot(x='is_sarcastic', data=data, palette="Set1").set_title(
    "Countplot of headlines")
plt.show()


Output:

Countplot of headlines by label (checking whether the dataset is balanced)
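If exact counts are preferred over a plot, the same balance check can be done numerically; a small sketch, assuming the data frame above:

Python3

# numeric view of the label balance
print(data['is_sarcastic'].value_counts())
print(data['is_sarcastic'].value_counts(normalize=True))  # proportions per class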

Data Cleaning:

Now let's start with data cleaning. We will convert the text to lowercase, remove text in square brackets, strip punctuation, drop words containing digits, and remove stopwords.

Python3




#downloading the stopwords corpus list 
nltk.download('stopwords')
stopwords_list = stopwords.words('english')


In the code above, we download the stopwords corpus from the NLTK library and load the list of English stopwords.

Python3




def clean_text(sentences):
    # convert text to lowercase
    text = sentences.lower()
    # remove text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # remove punctuation
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    # remove words containing digits
    text = re.sub(r'\w*\d\w*', '', text)
    # drop stopwords and join the remaining words back together
    text = ' '.join([word for word in text.split()
                     if word not in stopwords_list])
    return text


print(data['headline'].iloc[1])
clean_text(data['headline'].iloc[1])


Output:

the 'roseanne' revival catches up to our thorny political mood, for better and worse
roseanne revival catches thorny political mood better worse

In this code, the function clean_text takes a sentence as input, applies the series of text-cleaning steps described above, and returns the cleaned text.
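To see the cleaner on something other than a dataset headline, you can pass it any string; the sentence below is a made-up example:

Python3

# trying the cleaner on a made-up example sentence
sample = "Oh great, ANOTHER Monday morning [sigh]... just what I needed in 2024!"
print(clean_text(sample))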

Python3




# new column to store the cleaned text
data['cleaned_headline'] = data['headline'].map(clean_text)


Here, we store the text produced by the cleaning step in a new column of the data frame.

Wordcloud

Most Frequent Word in Sarcastic Headlines

Python3




# Combine all sarcastic cleaned headlines into a single text
Sarcastic_text = ' '.join(
    data['cleaned_headline'][data['is_sarcastic'] == 1].tolist())

# Create a WordCloud object with specified width, height, and background color
wordcloud = WordCloud(width=800, height=400,
                      background_color='black').generate(Sarcastic_text)

# Display the WordCloud without axes
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Sarcastic')
plt.show()


Output:

Most frequent words in sarcastic headlines (word cloud)

Interpretation: this word cloud highlights the most common words in sarcastic headlines; the size of each word reflects how frequently it appears in the text.
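The word cloud is qualitative; if exact counts are wanted, a Counter over the same text gives them. A short sketch, assuming Sarcastic_text built above:

Python3

# numeric complement to the word cloud: the ten most frequent words
from collections import Counter

sarcastic_counts = Counter(Sarcastic_text.split())
print(sarcastic_counts.most_common(10))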

Most Frequent Word in Non-Sarcastic Headlines

Python3




# Combine all non-sarcastic cleaned headlines into a single text
Non_Sarcastic_text = ' '.join(
    data['cleaned_headline'][data['is_sarcastic'] == 0].tolist())
  
# Create a WordCloud object with specified width, height, and background color
wordcloud = WordCloud(width=800, height=400,
                      background_color='black').generate(Non_Sarcastic_text)
  
# Display the WordCloud without axes
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Not Sarcastic')
plt.show()


Output:

Most frequent words in non-sarcastic headlines (word cloud)

Interpretation: this word cloud shows the most common words in the non-sarcastic headlines.

Splitting Train and test datasets

Python3




# converting the columns into lists
text = data['cleaned_headline'].tolist()
label = data['is_sarcastic'].tolist()


We convert the cleaned headlines into a list of texts and the is_sarcastic column into a list of labels, where 1 means sarcastic and 0 means non-sarcastic, with one label per headline.

Python3




# train :test : validation = 80:10:10
train_portion = .8
  
# Set the train size using training_portion arg
train_size = int(len(text) * train_portion)
  
# Training dataset
train_text = text[:train_size]
train_label = label[:train_size]
# Validations dataset
valid_size = train_size+int((len(text)-train_size)/2)
val_text = text[train_size:valid_size]
val_label = label[train_size:valid_size]
# Testing dataset
test_text = text[valid_size:]
test_label = label[valid_size:]
  
# Check
print('Training data :', len(train_text), len(train_label))
print('Validations data :', len(val_text), len(val_label))
print('Testing data :', len(test_text), len(test_label))


Output:

Training data : 44262 44262
Validations data : 5533 5533
Testing data : 5533 5533

Here we divide the dataset into training, validation, and testing subsets according to an 80:10:10 split ratio. The code computes the size of each subset, extracts the corresponding texts and labels, and prints the sizes.
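The split above is purely positional, so it relies on the rows already being shuffled. A possible alternative (not what this article uses) is a shuffled, stratified split with scikit-learn; the _alt variable names below are placeholders so they do not clash with the article's variables:

Python3

# Alternative approach (not used in this article): a shuffled, stratified split
from sklearn.model_selection import train_test_split

train_text_alt, temp_text, train_label_alt, temp_label = train_test_split(
    text, label, test_size=0.2, stratify=label, random_state=42)
val_text_alt, test_text_alt, val_label_alt, test_label_alt = train_test_split(
    temp_text, temp_label, test_size=0.5, stratify=temp_label, random_state=42)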

Now we will set the parameters for tokenizing the text and training the model.

Python3




# Set parameters
# Maximum number of unique words kept by the tokenizer
vocab_size = 10000

# Embedding dimension
embedding_dim = 200

# Maximum length of a sentence
max_length = 60

# pad_sequences argument
padding_type = 'post'

# Token used for unknown (out-of-vocabulary) words
oov_tok = '<OOV>'

# Tokenizing and padding
# Create a tokenizer with a specified vocabulary size and out-of-vocabulary token
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
# Fit the tokenizer on the training text data to create the word-to-index mapping
tokenizer.fit_on_texts(train_text)


We set the parameters for text processing and tokenization. The tokenizer builds a word-to-index mapping from the training text, limited to the specified vocabulary size and using an out-of-vocabulary token for unseen words. This mapping is then used to convert the text into sequences.
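To make the out-of-vocabulary behaviour concrete, here is a toy sketch on two made-up sentences, kept separate from the article's tokenizer:

Python3

# Toy illustration on two made-up sentences (separate from the article's tokenizer)
toy_tok = Tokenizer(num_words=50, oov_token='<OOV>')
toy_tok.fit_on_texts(['dog bites man', 'man bites dog again'])
print(toy_tok.word_index)
print(toy_tok.texts_to_sequences(['dog bites cat']))  # 'cat' becomes the <OOV> index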

Let’s understand some terms:

  • Embedding_dim: the dimensionality of the continuous vector space in which words are represented as dense embeddings, capturing semantic relationships for natural language processing tasks.
  • Padding: the process of adding special tokens or values to sequences of varying length so that they all have the same length.
  • Pad_sequences: a function provided by libraries such as Keras in TensorFlow that pads sequences to a common length and lets you specify whether the padding is added at the beginning or the end of each sequence (see the short sketch after this list).
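A toy illustration of post-padding with made-up index sequences:

Python3

# Toy illustration of post-padding to a fixed length of 5
demo_sequences = [[3, 7, 2], [4, 9, 1, 6, 8, 2, 5]]
print(pad_sequences(demo_sequences, padding='post', maxlen=5))
# the short sequence gets trailing zeros; the long one is truncated from the front (the default)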

Python3




# Get the word index from the tokenizer
word_index = tokenizer.word_index
  
#Printing the word_index
word_index


Output:

{'<OOV>': 1,
 'new': 2,
 'man': 3,
 'trump': 4,
 'us': 5,
 'report': 6,...}

This code prints the word-to-index mapping produced by the tokenizer, which assigns a unique index to each word based on its frequency and order of appearance in the training text.

Python3




# Convert training text to sequences of word indices
tokenizer.texts_to_sequences(train_text[:5])


Output:

[[204, 1, 740, 2953, 2235, 245, 35, 1958, 2515, 8487],
 [8488, 3234, 2422, 8489, 289, 3081, 151, 853],
 [62, 597, 719, 1063, 1840, 496, 4199, 116, 9676],
 [1320, 130, 337, 1439, 219, 3385, 1, 703],
 [4779, 5143, 679, 1, 454, 490, 1137, 33]]

Using the tokenizer's word index defined earlier, this converts the text into sequences of word indices. The result is a list of lists, where each sublist holds the indices of the words in one headline.

Python3




# Tokenize and pad the training text data
# Convert training text to sequences of word indices
train_indices = tokenizer.texts_to_sequences(train_text)
# Pad sequences to a fixed length
train_padded = pad_sequences(train_indices,
                             padding=padding_type,
                             maxlen=max_length)


The code above converts the training text into sequences of word indices and then pads those sequences so that every input has the same length.

Python3




# Convert validation text to sequences of word indices
val_indices = tokenizer.texts_to_sequences(val_text)
# Pad sequences to a fixed length
validation_padded = pad_sequences(val_indices,
                                  padding=padding_type,
                                  maxlen=max_length)
  
# Convert test text to sequences of word indices
test_indices = tokenizer.texts_to_sequences(test_text)
# Pad sequences to a fixed length
test_padded = pad_sequences(test_indices,
                            padding=padding_type,
                            maxlen=max_length)
  
# Check
print('Training vector :', train_padded.shape)
print('Validations vector :', validation_padded.shape)
print('Testing vector :', test_padded.shape)


Output:

Training vector : (44262, 60)
Validations vector : (5533, 60)
Testing vector : (5533, 60)

Here the validation and test texts are tokenized into sequences of word indices with the same tokenizer and padded to a fixed length, producing the data vectors used for training, validation, and testing.

Python3




# Decode the sample training vector
tokenizer.sequences_to_texts([train_padded[0].tolist()])


Output:

['former <OOV> store clerk sues secret black code minority shoppers <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV>']

In the code above, the sequence of word indices in the first training vector is converted back into text using the reverse mapping, which lets us verify and check the quality of the preprocessing.

Python3




# Prepare labels for model
training_labels_final = np.array(train_label)
validation_labels_final = np.array(val_label)
testing_labels_final = np.array(test_label)
  
  
# Check shapes
print('Training vector :', training_labels_final.shape)
print('Validations vector :', validation_labels_final.shape)
print('Testing vector :', testing_labels_final.shape)


Output:

Training vector : (44262,)
Validations vector : (5533,)
Testing vector : (5533,)

Here we prepare the labels for the training, validation, and test sets by converting them into NumPy arrays, and we print the array shapes to verify the dimensions.

Development and Evaluation of the Model

We will implement a Sequential model consisting of the following layers:

  • Embedding layer: creates word embeddings for the input text by transforming word indices into dense vectors, which helps the model capture relationships between words.
  • GlobalMaxPool1D layer: applies global max pooling, selecting the highest value from each feature map of the preceding layer and thereby extracting the key details from the embedded sequences.
  • Dropout layers: randomly drop a fraction of the neuron outputs during training to introduce regularization and reduce overfitting.
  • Dense layers: fully connected layers that process the pooled features, applying non-linear transformations that improve the model's capacity to capture complex patterns.
  • Output layer: a single neuron with a sigmoid activation function that produces the predicted probability of a headline being sarcastic.

Python3




# Import the TensorFlow library
import tensorflow as tf
  
# Define a sequential neural network model
model = tf.keras.Sequential([
    # Embedding layer for creating word embeddings
    tf.keras.layers.Embedding(
        vocab_size, embedding_dim, input_length=max_length),
  
    # GlobalMaxPooling layer to extract relevant features
    tf.keras.layers.GlobalMaxPool1D(),
  
    # First Dense layer with 40 neurons and ReLU activation
    tf.keras.layers.Dense(40, activation='relu'),
  
    # Dropout layer to prevent overfitting
    tf.keras.layers.Dropout(0.5),
  
    # Second Dense layer with 20 neurons and ReLU activation
    tf.keras.layers.Dense(20, activation='relu'),
  
    # Dropout layer to prevent overfitting
    tf.keras.layers.Dropout(0.5),
  
    # Third Dense layer with 10 neurons and ReLU activation
    tf.keras.layers.Dense(10, activation='relu'),
  
    # Dropout layer to prevent overfitting
    tf.keras.layers.Dropout(0.2),
  
    # Final Dense layer with 1 neuron and sigmoid activation for binary classification
    tf.keras.layers.Dense(1, activation='sigmoid')
])
  
model.summary()


Output:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 60, 200)           2000000   
                                                                 
 global_max_pooling1d (Globa  (None, 200)              0         
 lMaxPooling1D)                                                  
                                                                 
 dense (Dense)               (None, 40)                8040      
                                                                 
 dropout (Dropout)           (None, 40)                0         
                                                                 
 dense_1 (Dense)             (None, 20)                820       
                                                                 
 dropout_1 (Dropout)         (None, 20)                0         
                                                                 
 dense_2 (Dense)             (None, 10)                210       
                                                                 
 dropout_2 (Dropout)         (None, 10)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 11        
                                                                 
=================================================================
Total params: 2,009,081
Trainable params: 2,009,081
Non-trainable params: 0

model.summary() prints a summary of the model architecture, giving an overview of each layer and its parameter count.
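The parameter counts in the summary can be re-derived by hand from the layer sizes defined above; a short check:

Python3

# Re-deriving the parameter counts shown in the summary
emb_params    = vocab_size * embedding_dim  # 10000 * 200 = 2,000,000
dense1_params = embedding_dim * 40 + 40     # weights + biases = 8,040
dense2_params = 40 * 20 + 20                # 820
dense3_params = 20 * 10 + 10                # 210
output_params = 10 * 1 + 1                  # 11
print(emb_params + dense1_params + dense2_params
      + dense3_params + output_params)      # 2,009,081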

Python3




# Compile the model with specified loss function, optimizer, and evaluation metrics
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])


We then compile the model to configure its learning process.

Let's look at some important parameters:

  • binary_crossentropy: the loss function for binary classification tasks. It measures the difference between the actual labels and the predicted probabilities (a toy calculation is sketched after this list).
  • Adam optimizer: adjusts the model's weights during training so that the loss converges toward a better solution.
  • Accuracy metric: during training, the model's accuracy is computed and monitored as a performance metric.
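As a toy check of what binary cross-entropy computes, the sketch below evaluates the formula on three made-up predictions:

Python3

# Toy check of binary cross-entropy: loss = -mean(y*log(p) + (1-y)*log(1-p))
import numpy as np

y_true = np.array([1, 0, 1])
y_prob = np.array([0.9, 0.2, 0.6])  # made-up predicted probabilities
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(bce)  # roughly 0.28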

Python3




# Set the number of training epochs
num_epochs = 5
  
# Fit the model to the training data and validate on the validation data
history = model.fit(
    train_padded, training_labels_final,
    epochs=num_epochs,
    validation_data=(validation_padded, validation_labels_final)
)


Output:

Epoch 1/5
1384/1384 [==============================] - 22s 15ms/step - loss: 0.4852 - accuracy: 0.7619 - val_loss: 0.2764 - val_accuracy: 0.8865
Epoch 2/5
1384/1384 [==============================] - 20s 15ms/step - loss: 0.2277 - accuracy: 0.9191 - val_loss: 0.1649 - val_accuracy: 0.9344
Epoch 3/5
1384/1384 [==============================] - 22s 16ms/step - loss: 0.1086 - accuracy: 0.9652 - val_loss: 0.1257 - val_accuracy: 0.9642
Epoch 4/5
1384/1384 [==============================] - 22s 16ms/step - loss: 0.0454 - accuracy: 0.9860 - val_loss: 0.1103 - val_accuracy: 0.9790
Epoch 5/5
1384/1384 [==============================] - 21s 15ms/step - loss: 0.0258 - accuracy: 0.9929 - val_loss: 0.1110 - val_accuracy: 0.9814

Here we set the number of training epochs and then call model.fit() to train the model on the training data while evaluating its performance on the validation data after each epoch.
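The log shows validation loss barely improving between the last two epochs, so an early-stopping callback is a possible refinement. A minimal sketch (not part of the run above; re-fitting continues training from the current weights):

Python3

# Optional refinement (not used above): stop when validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=2, restore_best_weights=True)

history = model.fit(
    train_padded, training_labels_final,
    epochs=20,
    validation_data=(validation_padded, validation_labels_final),
    callbacks=[early_stop]
)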

Python3




# Create a figure with subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot training and validation loss
ax1.plot(history.history['loss'], label='Training Loss')
ax1.plot(history.history['val_loss'], label='Validation Loss', color='orange')
ax1.set_title('Training and Validation Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()

# Plot training and validation accuracy
ax2.plot(history.history['accuracy'], label='Training Accuracy')
ax2.plot(history.history['val_accuracy'], label='Validation Accuracy', color='orange')
ax2.set_title('Training and Validation Accuracy')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.legend()

# Adjust layout
plt.tight_layout()

# Show the plots
plt.show()


Output:

Training and validation loss and accuracy per epoch

Python3




# evaluating the loss and accuracy of the model on the test set
loss, accuracy = model.evaluate(test_padded, testing_labels_final)
print(f'Accuracy on test dataset: {round(accuracy*100, 2)}%')


Output:

173/173 [==============================] - 0s 880us/step - loss: 0.0840 - accuracy: 0.9839
Accuracy on test dataset: 98.39%

This code evaluates the model on the test dataset, computing its loss and accuracy and printing the accuracy as a percentage.

Python3




# making predictions with the model

pred_prob = model.predict(test_padded)
pred_label = [1 if prob >= 0.5 else 0 for prob in pred_prob]
pred_label[:5]


Output:

173/173 [==============================] - 0s 743us/step
[1, 1, 1, 1, 0]

Here we use the model to make predictions on the test data. It computes the predicted probabilities and converts them into predicted labels using a threshold of 0.5.

Confusion Matrix:

Python3




# Compute confusion matrix using the predicted classes and true labels
conf_matrix = confusion_matrix(testing_labels_final, pred_label)
  
# Plot the confusion matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Sarcastic', 'Sarcastic'],
            yticklabels=['Not Sarcastic', 'Sarcastic'])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()


Output:

Confusion matrix of predicted vs. true labels

Classification Report

Python3




# Print Classification Report
print("\nClassification Report:")
print(classification_report(testing_labels_final, pred_label,
                            target_names=['Not Sarcastic', 'Sarcastic']))


Output:


Classification Report:
               precision    recall  f1-score   support

Not Sarcastic       0.97      0.99      0.98      2916
    Sarcastic       0.99      0.97      0.98      2617

     accuracy                           0.98      5533
    macro avg       0.98      0.98      0.98      5533
 weighted avg       0.98      0.98      0.98      5533
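The per-class numbers in the report can be reproduced by hand from the confusion matrix computed earlier; a short sketch, assuming conf_matrix from above with label order [0, 1]:

Python3

# Reproducing the 'Sarcastic' row of the report from the confusion matrix
tn, fp, fn, tp = conf_matrix.ravel()  # order for labels [0, 1]
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))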

Predicting if headline is sarcastic or not

Let’s see how the model will predict if the headline is sarcastic or non-sarcastic.

Python3




while True:
    # Prompt the user to input a headline or exit
    user_input = input(
        "Enter a headline for prediction (or type 'exit' to quit): ")
  
    # Check if the user wants to exit
    if user_input.lower() == 'exit':
        break
  
    # Clean and preprocess user input
    cleaned_input = clean_text(user_input)
    tokenized_input = tokenizer.texts_to_sequences(
        [cleaned_input])  # Tokenize the cleaned text
    padded_input = pad_sequences(
        tokenized_input, maxlen=max_length, padding=padding_type)  # Pad the tokenized text
  
    # Predict sarcasm
    prediction = model.predict(padded_input)
  
    # Print the prediction result
    if prediction >= 0.5:
        print(f"Headline: {user_input}\nPrediction: Sarcastic")
    else:
        print(f"Headline: {user_input}\nPrediction: Not Sarcastic")


Output:

Enter a headline for prediction (or type 'exit' to quit): Hello, what's happenning now?
1/1 [==============================] - 0s 12ms/step
Headline: Hello, what's happenning now?
Prediction: Not Sarcastic
Enter a headline for prediction (or type 'exit' to quit): exit

This code creates an interactive loop that repeatedly prompts the user to enter a headline, cleans and tokenizes it, and prints the model's prediction. If the user types "exit", the loop terminates.
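If you want to reuse the model outside this session, one option is to save both the trained model and the fitted tokenizer; the file names below are placeholders:

Python3

# Optional: persist the trained model and tokenizer for later reuse
# (file names below are placeholders)
import pickle

model.save('sarcasm_model.keras')
with open('sarcasm_tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)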

Conclusion:

Finally, this journey through sarcasm detection with a neural network highlights how far natural language processing has come. By combining text preprocessing, tokenization, and model training, we tapped the potential of machine learning to understand the complexities of sarcasm in textual data. The trained model, built from layers of interconnected neurons, achieved a high degree of accuracy in distinguishing sarcastic from non-sarcastic headlines. The interactive prediction loop further demonstrates the model's practical relevance by letting users check whether their text carries sarcastic undertones.


