How to use BERT?

To apply BERT to a task on our own dataset, we do not need to train it from scratch: pre-trained weights are available, and we can fine-tune them on our data. The transformers library provides these weights, which can be loaded with the following code.

from transformers import BertModel
bert = BertModel.from_pretrained('bert-base-uncased')

Here, "bert" holds the pre-trained weights of BERT-Base. We must also use the same tokenizer and the same token-to-index vocabulary with which the model was pre-trained. The tokenizer can be loaded with the code given below.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("What's going on?")
Output: ['what', "'", 's', 'going', 'on', '?']
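To illustrate why the tokenizer and the vocabulary must match the pre-trained model, here is a minimal sketch of greedy longest-match-first sub-word splitting in the style of WordPiece. The function wordpiece and the tiny toy_vocab are hypothetical and exist only for this example; the real BertTokenizer ships its own vocabulary of roughly 30,000 entries and handles casing and punctuation as well.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one lowercase word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word maps to the unknown token
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for this sketch, not BERT's real vocabulary).
toy_vocab = {"[UNK]": 0, "what": 1, "going": 2, "on": 3, "?": 4, "play": 5, "##ing": 6}

tokens = wordpiece("playing", toy_vocab)
ids = [toy_vocab[t] for t in tokens]  # the model only ever sees these indices
print(tokens, ids)
```

Because the model's embedding table is addressed by these indices, a different vocabulary would assign different indices to the same text and the pre-trained weights would no longer line up with the input.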

Let's fine-tune the pre-trained BERT model for sentiment classification. The model only needs one addition: a linear layer on top of the output hidden state of the [CLS] token.

import torch.nn as nn

class BERTSentiment(nn.Module):
    def __init__(self, bert, output_dim):
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']
        self.out = nn.Linear(embedding_dim, output_dim)

    def forward(self, text):
        # text = [batch size, sent len]
        # index 1 of the BERT output is the pooled [CLS] representation
        embedded = self.bert(text)[1]
        # embedded = [batch size, emb dim]
        output = self.out(embedded)
        # output = [batch size, out dim]
        return output

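model = BERTSentiment(bert, OUTPUT_DIM)  # OUTPUT_DIM (number of sentiment classes) is an assumed name; the argument is cut off in the source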
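The forward pass above reads element [1] of the BERT output, which is the pooled vector computed from the first ([CLS]) position. The sketch below checks the tensor shapes of that design with a small stand-in encoder, so it runs without downloading any weights. StubEncoder, its sizes, and the two-class head are assumptions made for illustration; only the tuple layout (per-token hidden states, pooled vector) mirrors BertModel.

```python
import torch
import torch.nn as nn


class StubEncoder(nn.Module):
    """Hypothetical stand-in for BertModel, small enough to run offline.

    Like BertModel, it returns (per-token hidden states, pooled first-token vector).
    """

    def __init__(self, vocab_size: int = 100, hidden_size: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.pool = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor):
        hidden = self.embed(input_ids)                # [batch size, sent len, hidden size]
        pooled = torch.tanh(self.pool(hidden[:, 0]))  # [batch size, hidden size], first position only
        return hidden, pooled


torch.manual_seed(0)
encoder = StubEncoder()
head = nn.Linear(16, 2)  # linear classifier with 2 assumed classes

input_ids = torch.randint(0, 100, (4, 10))  # batch of 4 sequences, 10 token ids each
hidden, pooled = encoder(input_ids)
logits = head(pooled)  # one score per class for every sequence

print(hidden.shape, pooled.shape, logits.shape)
```

The classifier sees a single vector per sequence regardless of sentence length, which is why one linear layer is enough as a head.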
We can then train this model by defining a loss function and an optimizer.

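import torch
from transformers import AdamW  # deprecated in recent releases; torch.optim.AdamW (no correct_bias argument) replaces it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-6, correct_bias=False)
criterion = nn.CrossEntropyLoss().to(device)
max_grad_norm = 1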

def train(model, iterator, optimizer, criterion, scheduler):
    epoch_loss = 0
    epoch_acc = 0
    model.train()  # enable dropout

    for batch in iterator:
        optimizer.zero_grad()  # clear gradients first
        torch.cuda.empty_cache()  # releases all unoccupied cached memory
        text = batch.text
        label = batch.label
        predictions = model(text)
        loss = criterion(predictions, label)
        acc = categorical_accuracy(predictions, label)
        loss.backward()  # backpropagate the loss
        # optional: clip the gradients of the model parameters
        # torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()  # update the weights
        scheduler.step()  # update the learning rate
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()  # disable dropout

    with torch.no_grad():  # no gradients are needed for evaluation
        for batch in iterator:
            text = batch.text
            label = batch.label
            predictions = model(text)
            loss = criterion(predictions, label)
            acc = categorical_accuracy(predictions, label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
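Both train() and evaluate() call categorical_accuracy, which this section does not define. A minimal sketch of such a helper is given here, assuming that predictions holds one row of class scores per example and that the labels are integer class indices; the author's actual helper may differ.

```python
import torch


def categorical_accuracy(preds: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Fraction of examples whose highest-scoring class equals the label.

    preds: [batch size, out dim] class scores, y: [batch size] integer labels.
    Returns a 0-dim tensor so that callers can use .item() on it.
    """
    top_pred = preds.argmax(dim=1)             # predicted class per example
    correct = (top_pred == y).float().sum()    # number of correct predictions
    return correct / y.shape[0]


# Small hand-made batch: predicted classes are 0, 1, 0, 1 against labels 0, 1, 1, 1.
scores = torch.tensor([[2.0, 0.1], [0.3, 1.5], [1.0, 0.2], [0.1, 0.9]])
labels = torch.tensor([0, 1, 1, 1])
acc = categorical_accuracy(scores, labels)
print(acc.item())
```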

We can then use the train() and evaluate() functions to train the model and to validate it after every epoch.
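The loop that follows also calls an epoch_time helper that is not defined in this section. One plausible version, given here only as an assumption about what the helper does, converts two time.time() readings into whole minutes and seconds.

```python
import time


def epoch_time(start_time: float, end_time: float):
    """Split the elapsed time between two timestamps into (minutes, seconds)."""
    elapsed = end_time - start_time
    elapsed_mins = int(elapsed / 60)
    elapsed_secs = int(elapsed - elapsed_mins * 60)
    return elapsed_mins, elapsed_secs


start = time.time()
mins, secs = epoch_time(start, start + 125.4)  # pretend the epoch took 125.4 seconds
print(f"{mins}m {secs}s")
```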

import math
import time

train_data_len = 25000
warmup_percent = 0.2
total_steps = math.ceil(N_EPOCHS * train_data_len * 1. / BATCH_SIZE)
warmup_steps = int(total_steps * warmup_percent)
scheduler = get_scheduler(optimizer, warmup_steps)
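The get_scheduler helper is not shown in this section. Since it receives only the optimizer and the number of warm-up steps, one reasonable reading is a schedule that raises the learning rate linearly during warm-up and then holds it constant. The sketch below works through the step arithmetic and that multiplier in plain Python. N_EPOCHS = 5 and BATCH_SIZE = 32 are assumed values chosen for illustration, and warmup_factor is a hypothetical name.

```python
import math

# Assumed values for illustration only; the article does not state them.
N_EPOCHS = 5
BATCH_SIZE = 32

train_data_len = 25000
warmup_percent = 0.2

total_steps = math.ceil(N_EPOCHS * train_data_len / BATCH_SIZE)  # optimizer updates over all epochs
warmup_steps = int(total_steps * warmup_percent)                 # first 20% of the updates


def warmup_factor(step: int, warmup_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step.

    Grows linearly from 0 to 1 during warm-up, then stays at 1.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0


base_lr = 2e-5
for step in (0, warmup_steps // 2, warmup_steps, total_steps - 1):
    print(step, base_lr * warmup_factor(step, warmup_steps))
```

In PyTorch, a function of this shape can be wrapped in a lambda and passed as lr_lambda to torch.optim.lr_scheduler.LambdaLR; the transformers library also provides get_constant_schedule_with_warmup(optimizer, num_warmup_steps), which follows the same schedule.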

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, scheduler)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc