BERT, as we mentioned earlier, is something of an MVP of NLP. A big part of this comes down to BERT's ability to **embed the meaning of words into dense vectors.**

We call them **dense vectors** because *every value inside the vector has a purpose for holding that value*. This is in contrast to *sparse* vectors, such as **one-hot encoded vectors**, where the vast majority of values are **0**.

BERT is very good at generating these dense vectors, and each encoder layer (there are several) outputs a set of them.

For **BERT base**, each of these will be a vector containing **768 values.** Those **768 values** hold our numerical representation of a single **token**, which we can *use as contextual word embeddings.*

Because there is one vector representing each token ( *output by each encoder* ), we are actually looking at a tensor of size *768 by the number of tokens*.
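To make this concrete, here is a minimal sketch assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (both are illustrative choices, not requirements). Each layer returns a tensor of shape *(batch, number of tokens, 768)*:

```python
# A minimal sketch, assuming the Hugging Face `transformers` library and the
# `bert-base-uncased` checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

# Any short example sentence will do here.
tokens = tokenizer("hello world", return_tensors='pt')

with torch.no_grad():
    outputs = model(**tokens)

# One tensor per layer (the embedding layer plus 12 encoder layers for BERT base),
# each of shape (batch, number of tokens, 768).
for i, hidden_state in enumerate(outputs.hidden_states):
    print(i, hidden_state.shape)  # e.g. torch.Size([1, 4, 768])
```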

**We can take these tensors** and transform them to **create semantic representations** of the **input sequence**. We can then take our *similarity metrics and calculate the similarity between different sequences* (as shown in the sketch at the end of this section).

The *easiest* and most *commonly extracted* tensor is the **last_hidden_state** tensor, conveniently output by the **BERT model**.

Of course, this is a fairly large tensor, at **512×768**, and we *need a vector to apply our similarity measures to.*
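Continuing the sketch above, if we pad (or truncate) the input to BERT's 512-token limit, `last_hidden_state` comes out at exactly this shape:

```python
# Continuing the sketch above: pad/truncate the input to BERT's 512-token limit
# so last_hidden_state takes the 512x768 shape discussed here.
tokens = tokenizer(
    "hello world",
    padding='max_length',
    max_length=512,
    truncation=True,
    return_tensors='pt',
)

with torch.no_grad():
    outputs = model(**tokens)

print(outputs.last_hidden_state.shape)  # torch.Size([1, 512, 768])
```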

To do this, we need to convert our **last_hidden_state** tensor into a **vector of 768 values.**
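One common way to do this (an assumption here, not the only option) is mean pooling: average the 768-dimensional token vectors, using the attention mask so that padding tokens are ignored. The sketch below reuses the `tokenizer` and `model` loaded earlier and compares two example sentences with cosine similarity:

```python
# A mean-pooling sketch (one common choice, not the only way to collapse the
# tensor), reusing the tokenizer and model loaded earlier.
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state, attention_mask):
    # Expand the attention mask to (batch, seq_len, 768) so that padded
    # positions contribute nothing to the average.
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts  # shape: (batch, 768)

def encode(text):
    # Tokenize, run BERT, and pool the token vectors into one 768-value vector.
    tokens = tokenizer(text, padding='max_length', max_length=512,
                       truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**tokens)
    return mean_pool(outputs.last_hidden_state, tokens['attention_mask'])

vec_a = encode("the cat sat quietly on the warm windowsill")
vec_b = encode("a kitten rested on the sunny ledge")

print(vec_a.shape)                        # torch.Size([1, 768])
print(F.cosine_similarity(vec_a, vec_b))  # one similarity score per pair
```

Averaging only over real tokens (via the attention mask) matters because most of the 512 positions are padding and would otherwise pull every sentence vector toward the same point.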