Sentiment Classification Using BERT

BERT, which stands for Bidirectional Encoder Representations from Transformers, was proposed by researchers at Google AI Language in 2018. Although its main aim was to improve the understanding of the meaning of queries in Google Search, BERT has become one of the most important and versatile architectures for natural language tasks, generating state-of-the-art results on sentence-pair classification, question answering, and more. For more details on the architecture, please refer to this article.

Architecture:

One of the most important features of BERT is its adaptability to different NLP tasks with state-of-the-art accuracy (similar to the transfer learning we use in computer vision). For that, the paper also proposed architectures for different tasks. In this post, we will be using the BERT architecture for single-sentence classification, specifically the architecture used for the CoLA (Corpus of Linguistic Acceptability) binary classification task. In the previous post about BERT, we discussed the architecture in detail, but let's recap some of its important details:

BERT single sentence classification task

BERT was proposed in two versions:

  • BERT (BASE): 12 layers of encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
  • BERT (LARGE): 24 layers of encoder stack with 16 bidirectional self-attention heads and 1024 hidden units.

For the TensorFlow implementation, Google has provided two variants of both BERT BASE and BERT LARGE: Uncased and Cased. In the uncased version, letters are lowercased before WordPiece tokenization.



Implementation:

  • First, we need to clone BERT's GitHub repository to make the setup easier.

Code:

! git clone https://github.com/google-research/bert.git



Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.20 KiB | 584.00 KiB/s, done.
Resolving deltas: 100% (185/185), done.
  • Now, we need to download the BERT BASE model using the following link and unzip it into the working directory (or the desired location).

Code:


# Download the pre-trained BERT BASE model and unzip it into the working directory
! wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
! unzip uncased_L-12_H-768_A-12.zip



Archive:  uncased_L-12_H-768_A-12.zip
   creating: uncased_L-12_H-768_A-12/
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: uncased_L-12_H-768_A-12/vocab.txt  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: uncased_L-12_H-768_A-12/bert_config.json  
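As a quick sanity check (optional, assuming the archive was extracted into the working directory as above), we can open bert_config.json and confirm the BERT BASE hyperparameters mentioned earlier:

import json

# Read the downloaded model's configuration file
with open("uncased_L-12_H-768_A-12/bert_config.json") as f:
    bert_config = json.load(f)

# BERT BASE: 12 encoder layers, 12 attention heads, 768 hidden units
print(bert_config["num_hidden_layers"],
      bert_config["num_attention_heads"],
      bert_config["hidden_size"])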
  • We will be using the TensorFlow 1.x version. In Google Colab, there is a magic command, %tensorflow_version, that can switch between versions.

Code:


%tensorflow_version 1.x



TensorFlow 1.x selected.
  • Now, we will import the modules necessary for running this project: NumPy, pandas, scikit-learn, and Keras from TensorFlow. These are already preinstalled in Colab; make sure to install them in your own environment.

Code:


import os
import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
import csv
from sklearn import metrics

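With the repository cloned and the uncased model unzipped, we can optionally take a quick look at how the WordPiece tokenizer splits a review before it is fed to BERT. This is a minimal sketch; the sample sentence is made up, and the paths are the ones used in the earlier steps:

import sys
sys.path.append('bert')  # make the cloned repository importable
import tokenization

# Uncased model: do_lower_case=True lowercases the text before WordPiece tokenization
tokenizer = tokenization.FullTokenizer(
    vocab_file='uncased_L-12_H-768_A-12/vocab.txt', do_lower_case=True)
print(tokenizer.tokenize("This movie was surprisingly GOOD!"))
# e.g. ['this', 'movie', 'was', 'surprisingly', 'good', '!']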


  • Now we will load the IMDB sentiment dataset and do some preprocessing before training. The loading functions below follow this tutorial for the IMDB dataset from TensorFlow Hub.

Code:


# Load reviews from a directory into a DataFrame with 'sentence' and 'sentiment' columns
# (the sentiment score is parsed from each file name)
def load_directory_data(directory):
  data = {}
  data["sentence"] = []
  data["sentiment"] = []
  for file_path in os.listdir(directory):
    with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
      data["sentence"].append(f.read())
      data["sentiment"].append(re.match("\d+_(\d+)\.txt", file_path).group(1))
  return pd.DataFrame.from_dict(data)
  
# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
  pos_df = load_directory_data(os.path.join(directory, "pos"))
  neg_df = load_directory_data(os.path.join(directory, "neg"))
  pos_df["polarity"] = 1
  neg_df["polarity"] = 0
  return pd.concat([pos_df, neg_df]).sample(frac = 1).reset_index(drop = True)
  
# Download and process the dataset files.
def download_and_load_datasets(force_download = False):
  dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz",
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
      extract=True)
    
  train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                       "aclImdb", "train"))
  test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                      "aclImdb", "test"))
    
  return train_df, test_df
train, test = download_and_load_datasets()
train.shape, test.shape



Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 8s 0us/step
((25000, 3), (25000, 3))
  • This dataset contains 50k reviews: 25k each for training and testing. We will sample 5k reviews from each of the train and test sets. Both the train and test sets contain 3 columns, listed below.

Code:


# sample 5k datapoints for both train and test
train = train.sample(5000)
test = test.sample(5000)
# List columns of train and test data
train.columns, test.columns



(Index(['sentence', 'sentiment', 'polarity'], dtype='object'),
 Index(['sentence', 'sentiment', 'polarity'], dtype='object'))
  • Now, we need to convert the data into the specific format required by the BERT model for training and prediction; for that, we will use a pandas DataFrame. Below are the columns required in the BERT train and test format:
    • guid: an ID for the row. Required for both train and test data.
    • label: a value of 0 or 1 depending on negative or positive sentiment (train data only).
    • alpha: a dummy column for text classification, but expected by BERT during training.
    • text: the review text of the data point that needs to be classified. Obviously required for both training and test data.

Code:




# Convert training data into the BERT format
train_bert = pd.DataFrame({
  'guid': range(len(train)),
  'label': train['polarity'],
  'alpha': ['a']*train.shape[0],
  'text': train['sentence'].replace(r'\n', ' ', regex = True)
})
  
train_bert.head()
print("-----")
# Convert test data into the BERT format
bert_test = pd.DataFrame({
  'guid': range(len(test)),
  'text': test['sentence'].replace(r'\n', ' ', regex = True)
})
bert_test.head()



guid    label    alpha    text
14930    0    1    a    William Hurt may not be an American matinee id...
1445    1    1    a    Rock solid giallo from a master filmmaker of t...
16943    2    1    a    This movie surprised me. Some things were "cli...
6391    3    1    a    This film may seem dated today, but remember t...
4526    4    0    a    The Twilight Zone has achieved a certain mytho...
-----
guid    text
20010    0    One of Alfred Hitchcock's three greatest films...
16132    1    Hitchcock once gave an interview where he said...
24947    2    I had nothing to do before going out one night...
5471    3    tell you what that was excellent. Dylan Moran ...
21075    4    I watched this show until my puberty but still...
  • Now, we split the data into three parts, train, dev, and test, and save them as TSV files in a folder (here "IMDB_dataset" inside the cloned bert folder). This is because run_classifier.py requires the dataset in TSV format.

Code:


# Split the data into train and validation sets
bert_train, bert_val = train_test_split(train_bert, test_size = 0.1)
# Create the folder and save the train, validation and test files to it
os.makedirs('bert/IMDB_dataset', exist_ok = True)
bert_train.to_csv('bert/IMDB_dataset/train.tsv', sep ='\t', index = False, header = False)
bert_val.to_csv('bert/IMDB_dataset/dev.tsv', sep ='\t', index = False, header = False)
bert_test.to_csv('bert/IMDB_dataset/test.tsv', sep ='\t', index = False, header = True)
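As a quick check (optional; the paths are the ones used above), we can peek at the saved files to confirm they are tab-separated, with no header for train/dev and a header row for test:

# Peek at the first rows of the generated TSV files
print(pd.read_csv('bert/IMDB_dataset/train.tsv', sep='\t', header=None, nrows=2))
print(pd.read_csv('bert/IMDB_dataset/test.tsv', sep='\t', nrows=2))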



  • In this step, we train the model using the following command. For executing bash commands in Colab, we put a ! sign in front of the command. The run_classifier.py file trains the model with the given arguments. Due to time and resource constraints, we will run it for only 3 epochs.

Code:


# Most of the arguments here are self-explanatory, but a few need explanation:
# task_name: we need to perform binary classification, which is why we use cola (discussed above)
# vocab_file: a vocab file (vocab.txt) to map WordPiece tokens to word ids
# init_checkpoint: a TensorFlow checkpoint; here we use the downloaded BERT checkpoint
# max_seq_length: caps the maximum number of tokens for each review
# bert_config_file: the file that contains the hyperparameter settings
! python bert/run_classifier.py \
  --task_name=cola \
  --do_train=true \
  --do_eval=true \
  --data_dir=/content/bert/IMDB_dataset \
  --vocab_file=/content/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=/content/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=/content/uncased_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=64 \
  --train_batch_size=8 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/content/bert_output/ \
  --do_lower_case=True \
  --save_checkpoints_steps=10000



# Last few lines
INFO:tensorflow:***** Eval results *****
I0713 06:06:28.966619 139722620139392 run_classifier.py:923] ***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.796
I0713 06:06:28.966814 139722620139392 run_classifier.py:925]   eval_accuracy = 0.796
INFO:tensorflow:  eval_loss = 0.95403963
I0713 06:06:28.967138 139722620139392 run_classifier.py:925]   eval_loss = 0.95403963
INFO:tensorflow:  global_step = 1687
I0713 06:06:28.967317 139722620139392 run_classifier.py:925]   global_step = 1687
INFO:tensorflow:  loss = 0.95741796
I0713 06:06:28.967507 139722620139392 run_classifier.py:925]   loss = 0.95741796
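run_classifier.py also writes these evaluation metrics to eval_results.txt inside the output directory, so they can be re-read later without scrolling through the logs (the path follows from the --output_dir used above):

! cat /content/bert_output/eval_results.txt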
  • Now we will use the test data to evaluate our model with the following bash script. This script saves the predictions into a TSV file.

Code:


# Predict with BERT on test.tsv
# Here we use the saved training checkpoint as the initial model
! python bert/run_classifier.py \
  --task_name=cola \
  --do_predict=true \
  --data_dir=/content/bert/IMDB_dataset \
  --vocab_file=/content/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=/content/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=/content/bert_output/model.ckpt-0 \
  --max_seq_length=128 \
  --output_dir=/content/bert_output/



INFO:tensorflow:Restoring parameters from /content/bert_output/model.ckpt-1687
I0713 06:08:22.372014 140390020667264 saver.py:1284] Restoring parameters from /content/bert_output/model.ckpt-1687
INFO:tensorflow:Running local_init_op.
I0713 06:08:23.801442 140390020667264 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0713 06:08:23.859703 140390020667264 session_manager.py:502] Done running local_init_op.
2020-07-13 06:08:24.453814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
INFO:tensorflow:prediction_loop marked as finished
I0713 06:10:02.280455 140390020667264 error_handling.py:101] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0713 06:10:02.280870 140390020667264 error_handling.py:101] prediction_loop marked as finished
  • The code below takes the maximum prediction for each row of the test data and stores it in a list.

Code:


# Read the predicted probabilities and keep the index of the maximum for each row
import csv
label_results = []
with open('/content/bert_output/test_results.tsv') as file:
    rows = csv.reader(file, delimiter ="\t")
    for row in rows:
      data_1 = [float(i) for i in row]
      label_results.append(data_1.index(max(data_1)))



  • The code below calculates accuracy and F1-score.

Code:


print("Accuracy", metrics.accuracy_score(test['polarity'], label_results))
print("F1-Score", metrics.f1_score(test['polarity'], label_results))



Accuracy 0.8548
F1-Score 0.8496894409937888
  • We have achieved about 85% accuracy and F1-score on the IMDB reviews dataset while training BERT (BASE) for just 3 epochs, which is quite a good result. Training for more epochs would likely improve the accuracy further.
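For a fuller picture of the errors, we can also compute a confusion matrix from the same predictions; a minimal sketch using scikit-learn (assuming test['polarity'] and label_results from the steps above):

# Rows are true labels (0 = negative, 1 = positive), columns are predicted labels
print(metrics.confusion_matrix(test['polarity'], label_results))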
