**Prerequisite:** BERT Model

**SpanBERT vs BERT**

SpanBERT is an improvement on the BERT model that provides better prediction of spans of text. Unlike BERT, SpanBERT performs the following steps:

i) masking random contiguous spans of text, rather than random individual tokens;

ii) training the model to predict the entire masked span using only the tokens at the start and end boundary of the span (known as the *Span Boundary Objective*).

This is the key difference in the masking scheme: *BERT* randomly masks individual tokens in a sequence, whereas *SpanBERT* masks random contiguous spans of text.
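The contiguous-span masking in step (i) can be sketched in plain Python. This is a minimal illustration with a fixed span length (the function name and details are our own, not from the SpanBERT codebase; the real model samples span lengths from a distribution and masks about 15% of tokens):

```python
import random

def mask_random_span(tokens, span_len=4, mask_token="[MASK]"):
    """Mask one contiguous span of `span_len` tokens, SpanBERT-style.
    A fixed span length keeps the sketch simple."""
    start = random.randrange(0, len(tokens) - span_len + 1)
    end = start + span_len                      # span covers tokens[start:end]
    masked = tokens[:start] + [mask_token] * span_len + tokens[end:]
    return masked, start, end

tokens = "super bowl 50 was an american football game to determine the champion".split()
masked, s, e = mask_random_span(tokens)
# The boundary tokens just outside the span (tokens[s-1] and tokens[e],
# when they exist) are what the Span Boundary Objective conditions on.
```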

Another difference is the training objective. *BERT* was trained on two objectives (two loss functions):

- **Masked Language Modeling (MLM)** — predicting the masked tokens at the output.
- **Next Sentence Prediction (NSP)** — predicting whether two sequences of text followed each other.

*SpanBERT*, in contrast, drops NSP and is trained on Masked Language Modeling together with the Span Boundary Objective; both contribute to the loss function.

**SpanBERT: Implementation**

To implement SpanBERT, we build a replica of the BERT model but make certain changes so that it performs better than the original BERT model. It has been observed that BERT performs much better when trained on Masked Language Modelling alone rather than together with Next Sentence Prediction. Hence, while building the replica, we discard NSP and train the model on a single-sequence baseline (one contiguous segment of text instead of a pair of segments), thereby improving its prediction accuracy.

**SpanBERT: Intuition**

Fig 1 shows the training of the SpanBERT model. In the given sentence, the span of words '*a football championship tournament*' (x_5 to x_8) is masked. The Span Boundary Objective is defined by the boundary tokens x_4 and x_9, highlighted in blue, which are used to predict each token in the masked span.

In Fig 1, the whole sequence is passed through the encoder block, and we get predictions for the masked tokens (x_5 to x_8) as output.

For example, if we were to predict the token x_6 (i.e. 'football'), the equivalent loss, as shown in Eqn (1), is:

L(football) = L_MLM(football) + L_SBO(football) = -log P(football | x_6) - log P(football | x_4, x_9, p_2)

This loss is the summation of the MLM and SBO losses.

The MLM loss is the negative log-likelihood, or in simpler terms, the chance of x_6 being 'football' given the encoder output at that position.

The SBO loss depends on three parameters:

- x_4 — the start (left boundary) of the span
- x_9 — the end (right boundary) of the span
- p_2 — the position of x_6 ('football') relative to the starting point x_4

Given these three parameters, we see how good the model is at predicting the token 'football'.
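As a numeric sketch of how the two terms combine (the probabilities below are made up purely for illustration, not model outputs):

```python
import math

def token_span_loss(p_mlm, p_sbo):
    """Combined SpanBERT loss for one masked token:
    negative log-likelihood under the MLM head plus the
    negative log-likelihood under the SBO head."""
    return -math.log(p_mlm) - math.log(p_sbo)

# Hypothetical model confidences for predicting 'football' at x_6:
p_mlm = 0.70   # P(football | encoder output at position 6)
p_sbo = 0.55   # P(football | x_4, x_9, p_2)
loss = token_span_loss(p_mlm, p_sbo)
# The loss shrinks toward 0 as both probabilities approach 1.
```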

Using the above two loss functions, the BERT-style model is pre-trained, and the result is called SpanBERT.

**Span Boundary Objective:**

Here, the encoder outputs a vector encoding for each token in the sequence, represented as (x_1, ..., x_n). The masked span of tokens is represented by (x_s, ..., x_e), where x_s denotes the start and x_e denotes the end of the masked span. The SBO function is represented as:

y_i = f(x_{s-1}, x_{e+1}, p_{i-s+1})

where p_1, p_2, ... are relative positions with respect to the left boundary token x_{s-1}.

The SBO function *f* is a 2-layer feed-forward network with **GeLU** activation. Its input is the concatenation:

h_0 = [x_{s-1}; x_{e+1}; p_{i-s+1}]

where:

- h_0 = input representation
- x_{s-1} = encoding of the starting boundary token
- x_{e+1} = encoding of the ending boundary token
- p_{i-s+1} = positional embedding of the target token

We pass h_0 through the first hidden layer with weight W_1:

h_1 = LayerNorm(GeLU(W_1 h_0))

where:

- GeLU (Gaussian Error Linear Units) = non-linear activation function
- h_1 = first hidden representation
- W_1 = weight of the first hidden layer
- LayerNorm = a normalization technique that normalizes across the features of each example, independently of the batch

Now, we pass h_1 through the second layer with weight W_2 to get the output y_i:

y_i = LayerNorm(GeLU(W_2 h_1))

where:

- y_i = vector representation for token x_i
- W_2 = weight of the second hidden layer
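The 2-layer SBO network above can be sketched in pure Python. This is a toy illustration with tiny dimensions and random weights (a real implementation would use a tensor library and learned parameters):

```python
import math
import random

random.seed(0)

def matvec(W, v):
    """Matrix-vector product for plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gelu(v):
    """GeLU activation (tanh approximation)."""
    return [0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
            for x in v]

def layer_norm(v, eps=1e-5):
    """Normalize across the features of one example."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def sbo(x_prev, x_next, pos_emb, W1, W2):
    """y_i = LayerNorm(GeLU(W2 · LayerNorm(GeLU(W1 · h0)))),
    where h0 = [x_{s-1}; x_{e+1}; p_{i-s+1}] (concatenation)."""
    h0 = x_prev + x_next + pos_emb
    h1 = layer_norm(gelu(matvec(W1, h0)))
    return layer_norm(gelu(matvec(W2, h1)))

d = 4                                                 # toy hidden size
rand_vec = lambda n: [random.uniform(-1, 1) for _ in range(n)]
rand_mat = lambda r, c: [rand_vec(c) for _ in range(r)]

x4, x9, p2 = rand_vec(d), rand_vec(d), rand_vec(d)    # boundary encodings + position
W1, W2 = rand_mat(d, 3 * d), rand_mat(d, d)
y6 = sbo(x4, x9, p2, W1, W2)   # representation used to predict the masked token
```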

To generalize, the SpanBERT loss for a particular token x_i in a span of words is calculated as:

L(x_i) = L_MLM(x_i) + L_SBO(x_i) = -log P(x_i | X_i) - log P(x_i | y_i)

where:

- X_i = final encoder representation of token x_i
- x_i = token from the original sequence
- y_i = output obtained by passing the boundary representations through the 2-layer feed-forward network

This was a basic intuition and understanding of the SpanBERT model and how it predicts a span of words instead of individual tokens, making it more powerful than the BERT model.

