Fine-Tuning BERT for Text Classification | by Shaw Talebi | Oct, 2024

We’ll start by importing a few handy libraries.

from datasets import load_dataset, DatasetDict, Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
import evaluate
import numpy as np
from transformers import DataCollatorWithPadding

Next, we’ll load the training dataset. It consists of 3,000 text-label pairs with a 70–15–15 train-test-validation split. The data are originally from here (open database license).

dataset_dict = load_dataset("shawhin/phishing-site-classification")
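
As a quick check (added here for illustration, not part of the original walkthrough), we can print the loaded object to confirm the three splits and peek at one example.

# optional check: split sizes and one training example
print(dataset_dict)              # DatasetDict with train/test/validation splits
print(dataset_dict["train"][0])  # a single text-label pair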

The Transformers library makes it super easy to load and adapt pre-trained models. Here’s what that looks like for the BERT model.

# define pre-trained model path
model_path = "google-bert/bert-base-uncased"

# load model tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# load model with binary classification head
id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}
model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                           num_labels=2,
                                                           id2label=id2label,
                                                           label2id=label2id)
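
As an optional sanity check (an added sketch), we can confirm the label mapping was stored in the model config.

# optional check: label mapping stored in the model config
print(model.config.id2label)    # {0: 'Safe', 1: 'Not Safe'}
print(model.config.num_labels)  # 2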

When we load a model like this, all the parameters will be set as trainable by default. However, training all 110M parameters will be computationally costly and potentially unnecessary.

Instead, we can freeze most of the model parameters and only train the model’s final layer and classification head.

# freeze all base model parameters
for name, param in model.base_model.named_parameters():
    param.requires_grad = False

# unfreeze base model pooling layers
for name, param in model.base_model.named_parameters():
    if "pooler" in name:
        param.requires_grad = True
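
To verify the freeze worked as intended, here is a small check I'm adding (not in the original code) that counts how many parameters remain trainable.

# optional check: only the pooler and classification head should remain trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")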

Next, we need to preprocess our data. This consists of two key operations: tokenizing the URLs (i.e., converting them into sequences of integer token IDs) and truncating any sequences longer than the model’s maximum context length.

# define text preprocessing
def preprocess_function(examples):
    # return tokenized text with truncation
    return tokenizer(examples["text"], truncation=True)

# preprocess all datasets
tokenized_data = dataset_dict.map(preprocess_function, batched=True)
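
To see what the tokenizer actually produces, here is an illustrative call on a made-up URL (not taken from the dataset).

# illustrative: tokenize a single made-up URL
sample = tokenizer("http://example.com/login", truncation=True)
print(sample["input_ids"])                                    # integer token IDs
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))   # subword tokens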

Another important step is creating a data collator that will dynamically pad token sequences in a batch during training so they have the same length. We can do this in one line of code.

# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
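
To see the dynamic padding in action, here is a small added sketch that passes a few tokenized examples through the collator; the input_ids and attention_mask columns are assumed from the preprocessing step above.

# illustrative: the collator pads unequal-length sequences to a common length
features = [tokenized_data["train"][i] for i in range(4)]
features = [{k: f[k] for k in ("input_ids", "attention_mask")} for f in features]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (4, length of the longest sequence in the batch)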

As a final step before training, we can define a function to compute a set of metrics to help us monitor training progress. Here, we will consider model accuracy and AUC.

# load metrics
accuracy = evaluate.load("accuracy")
auc_score = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    # get predictions
    predictions, labels = eval_pred

    # apply softmax to get probabilities
    probabilities = np.exp(predictions) / np.exp(predictions).sum(-1, keepdims=True)
    # use probabilities of the positive class for ROC AUC
    positive_class_probs = probabilities[:, 1]
    # compute auc
    auc = np.round(auc_score.compute(prediction_scores=positive_class_probs,
                                     references=labels)['roc_auc'], 3)

    # predict most probable class
    predicted_classes = np.argmax(predictions, axis=1)
    # compute accuracy
    acc = np.round(accuracy.compute(predictions=predicted_classes,
                                    references=labels)['accuracy'], 3)

    return {"Accuracy": acc, "AUC": auc}
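
Before training, we can sanity-check this function on a couple of made-up logit rows (an added illustration; the values below are arbitrary).

# quick check with made-up logits: one clearly-safe and one clearly-unsafe example
dummy_logits = np.array([[2.0, -1.0], [-0.5, 1.5]])
dummy_labels = np.array([0, 1])
print(compute_metrics((dummy_logits, dummy_labels)))
# expected: {'Accuracy': 1.0, 'AUC': 1.0} for this toy input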

Now, we are ready to fine-tune our model. We start by defining hyperparameters and other training arguments.

# hyperparameters
lr = 2e-4
batch_size = 8
num_epochs = 10

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

Then, we pass our training arguments into the Trainer class and train the model.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

The training results are shown below. We can see that the training and validation loss are monotonically decreasing while the accuracy and AUC increase with each epoch.

Training results. Image by author.

As a final test, we can evaluate the performance of the model on the independent validation data, i.e., data not used for training or setting hyperparameters.

# apply model to validation dataset
predictions = trainer.predict(tokenized_data["validation"])

# extract the logits and labels from the predictions object
logits = predictions.predictions
labels = predictions.label_ids

# use our compute_metrics function
metrics = compute_metrics((logits, labels))
print(metrics)

# >> {'Accuracy': 0.889, 'AUC': 0.946}
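
With training done, the fine-tuned model can classify new URLs directly. Here is a minimal sketch; the URL below is a made-up input, and id2label comes from the mapping defined earlier.

# illustrative: classify a new, made-up URL with the fine-tuned model
import torch

url = "http://suspicious-login.example.com"  # made-up input
inputs = tokenizer(url, truncation=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
print(id2label[logits.argmax(dim=-1).item()])  # "Safe" or "Not Safe"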

Bonus: Although a 110M parameter model is tiny compared to modern language models, we can reduce its computational requirements using model compression techniques. I cover how to reduce the model’s memory footprint by 7X in the article below.

Fine-tuning pre-trained models is a powerful paradigm for developing better models at a lower cost than training them from scratch. Here, we saw how to do this with BERT using the Hugging Face Transformers library.

While the example code was for URL classification, it can be readily adapted to other text classification tasks.
