Introducing SBERT-WK, Which Combines The Output Of BERT To Create A Sentence Vector

Natural Language Processing

3 main points
✔️ Demonstrate that the embedded representation of BERT captures different information at each layer 
✔️ Propose a method for integrating information from each layer to construct a sentence vector
✔️ Improved accuracy on key tasks with the proposed method

SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models
written by Bin Wang, C.-C. Jay Kuo
(Submitted on 16 Feb 2020 (v1), last revised 1 Jun 2020 (this version, v2))

Comments: Accepted at arXiv
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)


In recent years, natural language processing has seen great success with models pre-trained on huge corpora. A prime example is BERT (Bidirectional Encoder Representations from Transformers). BERT is a multi-layer Transformer model that obtains representations useful for downstream tasks by solving pre-training tasks on large amounts of unlabeled text. The two pre-training tasks proposed in the original paper are:

  • Masked LM (MLM): 15% of the tokens in the text are selected as prediction targets, and the model is trained to predict the original tokens. Each selected token is processed as follows:
    • 80% are replaced with the special token [MASK]
    • 10% are replaced with a random token
    • 10% are left unchanged
  • Next Sentence Prediction (NSP): given two sentences as input, the special token [CLS], which BERT's tokenization places at the beginning of the input, is used to predict whether the second sentence actually follows the first.
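The MLM corruption procedure above can be sketched in a few lines of Python. This is a simplified illustration, not BERT's actual implementation; the vocabulary here is a toy stand-in:

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "sat", "mat", "the", "on"]  # toy vocabulary for illustration

def mask_tokens(tokens, select_prob=0.15, seed=0):
    """Sketch of BERT's MLM corruption: select ~15% of tokens as prediction
    targets, then replace 80% of them with [MASK], 10% with a random token,
    and leave 10% unchanged. Returns (corrupted tokens, target labels)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)          # 80%: mask out
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)           # 10%: keep as-is
        else:
            corrupted.append(tok)
            labels.append(None)  # not a prediction target
    return corrupted, labels
```

Keeping 10% of the selected tokens unchanged (and randomizing another 10%) prevents the model from only learning to handle [MASK] tokens, which never appear at fine-tuning time.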

The vector representation of words obtained by BERT is called a contextualized word embedding: it accounts for polysemy in a way that existing word embeddings such as Word2Vec, GloVe, and fastText fail to capture.

For example, a polysemous word such as "bank" (a financial institution or the bank of a river) could only be represented by a single vector in Word2Vec, whereas BERT assigns it different vectors depending on the context in which it appears.
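As a toy illustration of this limitation, a static embedding table necessarily returns the same vector for "bank" in both contexts. The vectors below are made up for illustration and are not actual Word2Vec outputs:

```python
import numpy as np

# Hypothetical static word vectors: one fixed vector per surface form.
static_emb = {
    "bank":  np.array([0.9, 0.1]),
    "money": np.array([1.0, 0.0]),
    "river": np.array([0.0, 1.0]),
}

# "bank" appears in two very different contexts...
sent_a = ["money", "bank"]   # financial sense
sent_b = ["river", "bank"]   # riverbank sense

# ...but a static lookup gives the identical vector both times.
vec_a = static_emb["bank"]
vec_b = static_emb["bank"]
assert np.array_equal(vec_a, vec_b)  # no context sensitivity at all

# A contextual model like BERT would instead produce different vectors for
# "bank" in sent_a and sent_b, because each token's representation is
# computed from the whole input sequence.
```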

In the original paper, BERT was pre-trained with the above tasks on BooksCorpus (800 million words) and English Wikipedia (2.5 billion words), achieving significant accuracy improvements over prior work on a number of tasks.

For a more detailed explanation, please see a separate article on BERT on this site.

RoBERTa, proposed in the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach, revisits the above BERT training procedure and incorporates various techniques to improve robustness: it uses larger data and batch sizes and removes next sentence prediction from pre-training, improving performance on various tasks.

There is also XLNet, proposed in the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding, which introduces a new training method to overcome shortcomings of BERT. A detailed explanation of this model can be found in an article on this site.

Incidentally, in natural language processing we sometimes want embedded representations of whole sentences. A sentence embedding can be used, for example, to recommend articles similar to a given article, or to infer the author's sentiment polarity from a sentence. Existing methods built sentence vectors by simply summing or averaging Word2Vec word vectors, but as mentioned earlier, this does not handle word ambiguity well. Moreover, using BERT's [CLS] vector or the average of its output vectors as a sentence vector does not perform well either, so an effective method for generating sentence vectors was needed.
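The two naive pooling strategies mentioned above (using [CLS], or averaging token vectors) can be sketched with NumPy. The matrix below is random data standing in for BERT's final-layer output (shape: tokens × hidden size), not real model output:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8))  # 6 tokens, hidden size 8; [CLS] is row 0

# Strategy 1: use the [CLS] token's vector as the sentence embedding.
cls_vec = hidden[0]

# Strategy 2: mean-pool all token vectors into a single sentence vector.
mean_vec = hidden.mean(axis=0)

# Both strategies yield one fixed-size vector per sentence...
assert cls_vec.shape == mean_vec.shape == (8,)
# ...but, as noted in the text, neither works well as-is without
# further training geared toward sentence-level similarity.
```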

To address this problem, SBERT (Sentence-BERT), proposed in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, provides a way to generate sentence vectors with BERT: by further training a pre-trained BERT, it produces useful sentence vector representations.
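Concretely, SBERT trains a siamese network: both sentences are encoded by the same BERT, pooled into vectors u and v, and (for the classification objective) the concatenation (u, v, |u − v|) is fed to a softmax classifier. A sketch of that feature construction, with random vectors standing in for the pooled BERT outputs:

```python
import numpy as np

def sbert_pair_features(u, v):
    """Classification-objective features from the SBERT paper:
    concatenate u, v, and their element-wise absolute difference."""
    return np.concatenate([u, v, np.abs(u - v)])

rng = np.random.default_rng(0)
u = rng.normal(size=16)  # pooled BERT output for sentence 1 (random stand-in)
v = rng.normal(size=16)  # pooled BERT output for sentence 2 (random stand-in)

feats = sbert_pair_features(u, v)
assert feats.shape == (48,)  # 3 x hidden size

# At inference time, SBERT instead compares the sentence vectors directly,
# e.g. with cosine similarity:
cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
assert -1.0 <= cos <= 1.0
```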

In this article, we introduce SBERT-WK, a model that further improves on SBERT by making good use of the information captured across BERT's layers. As a prelude, we first give an overview of the SBERT paper, and then explain the SBERT-WK paper. All figures and tables are taken from the paper.
