BERT Fine-Tuning with PyTorch
BERT was pre-trained by masking 15% of the input tokens and training the model to predict them. An additional objective was next sentence prediction: deciding whether one sentence follows another. Let’s look at examples of these two tasks:
Masked Language Modeling (Masked LM)
The objective of this task is to guess the masked tokens. Let’s look at an example, and try to not make it harder than it has to be:
That’s [mask] she [mask] -> That’s what she said
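As a quick side illustration (assuming the transformers library from the setup below is installed; the model name and sentence are chosen purely for demonstration), the pretrained masked-LM head can be queried with the fill-mask pipeline:
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-cased')
# bert-base-cased uses [MASK] (case sensitive) as its mask token
for prediction in fill_mask("That's [MASK] she said."):
    print(prediction['token_str'], round(prediction['score'], 3))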
Next Sentence Prediction (NSP)
Given a pair of sentences, the task is to predict whether or not the second follows the first (binary classification). Let’s continue with the example:
Input = [CLS] That’s [mask] she [mask]. [SEP] Hahaha, nice! [SEP]
Label = IsNext
Input = [CLS] That’s [mask] she [mask]. [SEP] Dwight, you ignorant [mask]! [SEP]
Label = NotNext
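To make NSP concrete, here is a minimal side sketch (separate from the fine-tuning pipeline below) that scores a sentence pair with the pretrained NSP head; BertForNextSentencePrediction and the example sentences are used purely for illustration:
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

nsp_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
nsp_model = BertForNextSentencePrediction.from_pretrained('bert-base-cased')

# Passing two texts makes the tokenizer build [CLS] sentence A [SEP] sentence B [SEP]
encoding = nsp_tokenizer("That's what she said.", "Hahaha, nice!", return_tensors='pt')
logits = nsp_model(**encoding)[0]  # shape [1, 2]; index 0 = IsNext, index 1 = NotNext
print(torch.softmax(logits, dim=1))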
For setup and execution we’ll use the Transformers library by Hugging Face, and we need to download the dataset, which is a multiclass classification problem.
!pip install bert
!gdown --id 1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
!gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
!pip install -qq transformers
Collecting bert
Downloading https://files.pythonhosted.org/packages/e8/e6/55ed98ef52b168a38192da1aff7265c640f214009790220664ee3b4cb52a/bert-2.2.0.tar.gz
Collecting erlastic
Downloading https://files.pythonhosted.org/packages/f3/30/f40d99fe35c38c2e0415b1e746c89569f2483e64ef65d054b9f0f382f234/erlastic-2.0.0.tar.gz
Building wheels for collected packages: bert, erlastic
Building wheel for bert (setup.py) ... done
Created wheel for bert: filename=bert-2.2.0-cp36-none-any.whl size=3756 sha256=a56274395506b492e10227c6cbc4cdc238a19238fc6e97daf789e106d0b1e6b0
Stored in directory: /root/.cache/pip/wheels/fe/71/b7/941459453bd38e5d97a8c886361dee19325e9933c9cf88ad46
Building wheel for erlastic (setup.py) ... done
Created wheel for erlastic: filename=erlastic-2.0.0-cp36-none-any.whl size=6786 sha256=4470ebfece84675e16f856dc773be9885ae85216732e1086cdc0e64a36134a75
Stored in directory: /root/.cache/pip/wheels/02/62/46/93c713a5f061aeeb4f16eb6bf5ee798816e6ddda70faa78e69
Successfully built bert erlastic
Installing collected packages: erlastic, bert
Successfully installed bert-2.2.0 erlastic-2.0.0
Downloading...
From: https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
To: /content/apps.csv
100% 134k/134k [00:00<00:00, 40.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
To: /content/reviews.csv
7.17MB [00:00, 22.8MB/s]
Building wheel for sacremoses (setup.py) ... done
Import all the necessary libraries:
import pandas as pd
import numpy as np
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import torch.nn.functional as F
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap
data=pd.read_csv('reviews.csv')
Set a random seed for reproducibility and select the GPU (if available), since we will be working with torch tensors. Then map the review scores to integer sentiment labels.
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
def convert_sentiment(rating):
    rating = int(rating)
    if rating <= 2:
        return 0    # negative
    elif rating == 3:
        return 1    # neutral
    else:
        return 2    # positive

data['sentiment'] = data['score'].apply(convert_sentiment)
class_names = ['negative', 'neutral', 'positive']
"""### Preprocessing starts using the BertBodel"""
PRE_TRAINED_MODEL_NAME='bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
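To get a feel for what the tokenizer produces before we build the dataset, here is a small illustrative example (the sample sentence is arbitrary):
sample_txt = 'Best app ever, works like a charm.'  # arbitrary example
tokens = tokenizer.tokenize(sample_txt)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)       # WordPiece tokens
print(token_ids)    # corresponding vocabulary ids
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token)  # special tokens that encode_plus adds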
token_lens = []
for txt in data.content:
    tokens = tokenizer.encode(txt, max_length=512, truncation=True)
    token_lens.append(len(tokens))

sns.distplot(token_lens)  # distplot is deprecated in newer seaborn; histplot(token_lens, kde=True) is the modern equivalent
plt.xlim([0, 256]);
plt.xlabel('Token count');
Based on the token-length distribution above, most reviews fit comfortably within 160 tokens, so we set:
MAX_LEN = 160
Define a PyTorch Dataset that serves each review in the format the BERT model expects.
# Build the PyTorch dataset:
class GPReviewDataset(Dataset):
    def __init__(self, reviews, targets, tokenizer, max_len):
        self.reviews = reviews
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, item):
        review = str(self.reviews[item])
        target = self.targets[item]
        encoding = self.tokenizer.encode_plus(
            review,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',  # pad_to_max_length is deprecated in newer transformers
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'review_text': review,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }
Split the data into train, validation, and test sets:
data_train, data_test = train_test_split(data,test_size=0.1,random_state=RANDOM_SEED)
data_val, data_test = train_test_split(data_test,test_size=0.5,random_state=RANDOM_SEED)
data_train.shape, data_val.shape, data_test.shape
((14171, 12), (787, 12), (788, 12))
# Create the data loaders:
def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = GPReviewDataset(
        reviews=df.content.to_numpy(),
        targets=df.sentiment.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(ds, batch_size=batch_size, num_workers=4)
BATCH_SIZE = 16
train_data_loader = create_data_loader(data_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(data_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(data_test, tokenizer, MAX_LEN, BATCH_SIZE)
data = next(iter(train_data_loader))
data.keys()
dict_keys(['review_text', 'input_ids', 'attention_mask', 'targets'])
print(data['input_ids'].shape)
print(data['attention_mask'].shape)
print(data['targets'].shape)
torch.Size([16, 160])
torch.Size([16, 160])
torch.Size([16])
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
# This is the raw BERT model; outputs[0] is the last hidden state, outputs[1] the pooled [CLS] output
outputs = bert_model(input_ids=data['input_ids'], attention_mask=data['attention_mask'])
last_hidden_state, pooled_output = outputs[0], outputs[1]  # index-based access works for both tuple and ModelOutput returns
last_hidden_state.shape
torch.Size([16, 160, 768])
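The pooled output is the [CLS] representation passed through BERT’s pooler layer; its shape is [batch_size, hidden_size]:
pooled_output.shape  # torch.Size([16, 768]) for bert-base-cased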
Fine-tune the BERT model by adding a dropout layer and a linear layer on top of the pooled output for sentiment classification.
# Create a classifier on top of the BERT model:
class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        pooled_output = outputs[1]  # pooler output; index-based access works across transformers versions
        output = self.drop(pooled_output)
        return self.out(output)
Move the model and a batch of inputs to the GPU:
# Now move it to GPU:
model = SentimentClassifier(len(class_names))
model = model.to(device)
input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)
print(input_ids.shape) # batch size x seq length
print(attention_mask.shape) # batch size x seq length
torch.Size([16, 160])
torch.Size([16, 160])
# To get predicted probabilities from the (not yet trained) model, apply softmax to the outputs:
F.softmax(model(input_ids, attention_mask), dim=1)
tensor([[0.4167, 0.1629, 0.4204],
[0.5268, 0.1343, 0.3389],
[0.3120, 0.1510, 0.5370],
[0.5324, 0.1180, 0.3496],
[0.6053, 0.1680, 0.2267],
[0.3334, 0.1828, 0.4838],
[0.4221, 0.1138, 0.4641],
[0.3167, 0.2619, 0.4215],
[0.3325, 0.1465, 0.5211],
[0.2637, 0.1996, 0.5367],
[0.4092, 0.1467, 0.4441],
[0.3765, 0.2127, 0.4107],
[0.5731, 0.0732, 0.3537],
[0.4752, 0.1724, 0.3525],
[0.3837, 0.1831, 0.4332],
[0.3463, 0.2110, 0.4427]], device='cuda:0', grad_fn=<SoftmaxBackward>)
Training the Model:
The BERT authors recommend the following fine-tuning hyperparameters:
Batch size: 16, 32 | Learning rate (Adam): 5e-5, 3e-5, 2e-5 | Number of epochs: 2, 3, 4
EPOCHS = 10
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)  # AdamW from transformers; deprecated in recent releases in favor of torch.optim.AdamW
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(optimizer,num_warmup_steps=0,num_training_steps=total_steps)
loss_fn = nn.CrossEntropyLoss().to(device)
Train the model: the features are the input_ids and attention_mask, and the labels are the sentiment classes.
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_example):
    model = model.train()
    losses = []
    correct_predictions = 0
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        _, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradients to avoid exploding gradients
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return correct_predictions.double() / n_example, np.mean(losses)
# Evaluate the model:
def eval_model(model, data_loader, loss_fn, device, n_example):
    model = model.eval()
    losses = []
    correct_predictions = 0
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    return correct_predictions.double() / n_example, np.mean(losses)
%%time
history = defaultdict(list)
best_accuracy = 0
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 30)
    train_acc, train_loss = train_epoch(model, train_data_loader, loss_fn, optimizer, device, scheduler, len(data_train))
    print(f'Train loss {train_loss} accuracy {train_acc}')
    val_acc, val_loss = eval_model(model, val_data_loader, loss_fn, device, len(data_val))
    print(f'Val loss {val_loss} accuracy {val_acc}')
    print()
    history['train_acc'].append(train_acc.item())  # store plain floats so the history can be plotted later
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc.item())
    history['val_loss'].append(val_loss)
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_model_state.bin')
        best_accuracy = val_acc
Epoch 1/10
------------------------------
Train loss 0.3928224016485513 accuracy 0.8523745677792676
Val loss 0.4960121926665306 accuracy 0.8335451080050826
Epoch 2/10
------------------------------
Train loss 0.22725200758191585 accuracy 0.9281631500952651
Val loss 0.6242331060767173 accuracy 0.8678526048284626
Epoch 3/10
------------------------------
Train loss 0.16011956546447212 accuracy 0.9540611107190742
Val loss 0.7498525661230088 accuracy 0.8691232528589581
Epoch 4/10
------------------------------
Train loss 0.10572175838843695 accuracy 0.9696563404135206
Val loss 0.7286872366070747 accuracy 0.878017789072427
Epoch 5/10
------------------------------
Train loss 0.0774897876580332 accuracy 0.9776303718862466
Val loss 0.7949933139979839 accuracy 0.8805590851334181
Epoch 6/10
------------------------------
Train loss 0.06398470333144574 accuracy 0.9810175710959002
Val loss 0.8431818175315857 accuracy 0.8818297331639137
Epoch 7/10
------------------------------
Train loss 0.04880920779927051 accuracy 0.9849693035071626
Val loss 0.8389958699047565 accuracy 0.8856416772554003
Epoch 8/10
------------------------------
Train loss 0.041375226511474926 accuracy 0.9868746030625927
Val loss 0.884228421896696 accuracy 0.8805590851334181
Epoch 9/10
------------------------------
Train loss 0.03628475189158529 accuracy 0.9872274363135982
Val loss 0.8891580128669738 accuracy 0.8843710292249047
Epoch 10/10
------------------------------
Train loss 0.032766298771442735 accuracy 0.9885682026674194
Val loss 0.8891580128669738 accuracy 0.8843710292249047
CPU times: user 28min 52s, sys: 13min 17s, total: 42min 10s
Wall time: 42min 30s
Model Evaluation on the test set:
test_acc, _ = eval_model(model,test_data_loader,loss_fn,device,len(data_test))
test_acc.item()
0.8883248730964466
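The classification_report imported earlier can give a per-class breakdown on the test set. A minimal sketch (the helper get_predictions below is not part of the original notebook):
def get_predictions(model, data_loader):
    model = model.eval()
    predictions, real_values = [], []
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            real_values.extend(d["targets"].tolist())
    return real_values, predictions

y_true, y_pred = get_predictions(model, test_data_loader)
print(classification_report(y_true, y_pred, target_names=class_names))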
plt.plot(history['train_acc'], label='train accuracy')
plt.plot(history['val_acc'], label='validation accuracy')
plt.title('Training history')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.ylim([0, 1]);
Predicting on Raw Text
review_text = "The food was not good at all not even the drinks were good"
encoded_review = tokenizer.encode_plus(
    review_text,
    max_length=MAX_LEN,
    truncation=True,
    add_special_tokens=True,
    return_token_type_ids=False,
    padding='max_length',  # pad_to_max_length is deprecated in newer transformers
    return_attention_mask=True,
    return_tensors='pt',
)
input_ids = encoded_review['input_ids'].to(device)
attention_mask = encoded_review['attention_mask'].to(device)
output = model(input_ids, attention_mask)
_, prediction = torch.max(output, dim=1)
print(f'Review text: {review_text}')
print(f'Sentiment : {class_names[prediction]}')
Review text: The food was not good at all not even the drinks were good
Sentiment : negative
End Note:
You can adapt and fine-tune the BERT model to your own requirements, for example for sequence-to-sequence tasks. You can also build the fine-tuning layer with TensorFlow/Keras. The Transformers library by Hugging Face additionally ships several predefined models that are worth exploring, as shown below.
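For example, the predefined BertForSequenceClassification head bundles BERT, dropout, and a linear classifier. A short sketch of how it could stand in for the custom SentimentClassifier above (for illustration only, reusing the encoded review from the prediction step):
from transformers import BertForSequenceClassification

seq_model = BertForSequenceClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME,
    num_labels=len(class_names)
).to(device)

outputs = seq_model(input_ids=input_ids, attention_mask=attention_mask)
print(torch.argmax(outputs[0], dim=1))  # predicted class index; outputs[0] holds the logits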