Fine Tune Pre-Trained Models with Hugging Face

In Natural Language Processing (NLP), pre-trained models are a standard starting point. Having already learned from vast amounts of text data, they can be adapted to a wide range of NLP tasks. Hugging Face’s Transformers library, widely used by NLP practitioners, simplifies the process of fine-tuning these models for specific applications.

Fine-tuning entails refining a pre-trained model by training it further on domain-specific data. This process allows the model to adapt and excel in tasks like text classification, sentiment analysis, and more. With its user-friendly interfaces and comprehensive functionalities, Hugging Face empowers us to effectively fine-tune these models, bridging the gap between general language understanding and specific tasks.

This tutorial covers the practical side of fine-tuning pre-trained models with Hugging Face: preparing datasets, choosing hyperparameters, visualizing training progress, and applying advanced training techniques. By the end, you’ll be well-equipped to fine-tune pre-trained models for your own NLP tasks.

Section 1: Prepare Your Dataset

Data Exploration

Before preprocessing, it’s essential to understand your dataset’s structure. As an example, let’s explore the Yelp Reviews dataset, which consists of review texts and their star ratings. The label field is an integer from 0 to 4 that corresponds to 1 to 5 stars, which is why the review below that mentions “2 stars” has a rating of 1. Let’s begin by checking the dataset’s size and looking at a few examples:

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

# Display the number of training and test examples
print("Number of training examples:", len(dataset["train"]))
print("Number of test examples:", len(dataset["test"]))

# Print a few training examples
for i in range(3):
    print("Example", i, ":", dataset["train"][i]["text"])
    print("Rating:", dataset["train"][i]["label"])

Number of training examples: 650000
Number of test examples: 50000
Example 0 : dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.
Rating: 4
Example 1 : Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.
Rating: 1
Example 2 : Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doesn't judge and asks all the right questions. Very thorough and wants to be kept in the loop on every aspect of your medical health and your life.
Rating: 3

This exploration gives you a bird’s-eye view of your dataset: its size, the kind of text it contains, and how labels are encoded. To identify potential class imbalances, also count how often each label occurs.
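As a minimal sketch of checking the label distribution, the snippet below counts labels with the standard library. The `labels` list is a small made-up sample; in practice you would pass the label column of your own dataset (for example `dataset["train"]["label"]`), and `label_distribution` is a helper name introduced here for illustration.

```python
from collections import Counter

# Hypothetical labels standing in for a real label column (0 to 4 = 1 to 5 stars).
labels = [4, 1, 3, 4, 0, 2, 4, 1, 3, 4]

def label_distribution(labels):
    """Return {label: fraction of examples}, ordered by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

distribution = label_distribution(labels)
for label, share in distribution.items():
    print(f"label {label}: {share:.0%}")
```

If one label dominates, accuracy alone can be misleading, which is one reason to report precision, recall, and F1 as well.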

Tokenization Strategies

Tokenization lies at the core of NLP tasks: it transforms text input into a sequence of tokens that machine learning models can process. Hugging Face provides tokenizer classes that simplify this crucial step. Tokenizers are available for various models and come in two flavors: a full Python implementation and a “Fast” implementation backed by Hugging Face’s Rust-based Tokenizers library, which offers improved performance, especially for batched tokenization.

Preparing Inputs

The main classes for tokenization are PreTrainedTokenizer and PreTrainedTokenizerFast, which are base classes for all tokenizers. These classes provide common methods for encoding string inputs into model-ready inputs. They handle tokenizing (splitting text into sub-word token strings), converting token strings to IDs and vice versa, and encoding/decoding (tokenizing and converting to integers). Additionally, they facilitate the management of special tokens like mask, beginning-of-sentence, etc., ensuring they are not split during tokenization.

Special Tokens

Hugging Face tokenizers expose important special tokens such as [MASK], [CLS], and [SEP] through attributes like mask_token, cls_token, and sep_token. You can also add new special tokens to the vocabulary; when you do, resize the model’s embedding matrix to match (for example with model.resize_token_embeddings(len(tokenizer))). These tokens are used for various purposes such as marking the start or end of sentences, padding sequences, and masking tokens for language modeling tasks.

Tokenizing Text

The encode method is the basic entry point for tokenization. It tokenizes the input text and returns a list of token IDs, with options for adding special tokens, padding, truncation, and controlling the maximum length. For instance, setting add_special_tokens=True (the default) ensures that the model’s special tokens are added to the beginning and end of the sequence; for BERT these are [CLS] and [SEP], which appear as IDs 101 and 102 in the output below. Truncation and padding are controlled by the max_length, padding, and truncation parameters.

from transformers import AutoTokenizer

text = "Hello, how are you?"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode(text, add_special_tokens=True)
print(encoded)

[101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

Decoding Tokens

The decode method performs the inverse operation of tokenization. It takes a sequence of token IDs and converts it back to a human-readable text string. Special tokens can be skipped during decoding using the skip_special_tokens parameter.

decoded_text = tokenizer.decode(encoded, skip_special_tokens=True)

Batch Encoding

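For efficiency in processing multiple inputs, the batch_encode_plus method tokenizes a batch of sequences at once. It returns a BatchEncoding object that contains model inputs such as input_ids and attention_mask. Recent versions of the library favor calling the tokenizer directly, as in tokenizer(texts, padding=True, truncation=True), which returns the same kind of object.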

texts = ["Hello, how are you?", "I'm doing well, thank you!"]
batch_encoding = tokenizer.batch_encode_plus(texts, add_special_tokens=True, padding=True, truncation=True)
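To make the effect of padding=True concrete, here is a rough pure-Python sketch of what padding a batch involves. It is not the library’s implementation: `pad_batch` is a hypothetical helper, right-side padding and a pad ID of 0 are assumptions that match BERT’s defaults, and the second sequence is made up for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad each list of token IDs to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The first sequence reuses the IDs from the encode example; the second is illustrative.
batch = pad_batch([
    [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
    [101, 7592, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The attention mask is what lets the model distinguish real tokens from padding, so both lists should always be passed to the model together.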

Custom Tokenization

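Hugging Face tokenizers also accept input that has already been split into words, which is common in tasks like Named Entity Recognition where labels are attached to individual words. Set is_split_into_words=True so that the tokenizer treats each list element as one word and applies sub-word splitting only within it.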

tokens = ["Hello", ",", "how", "are", "you", "?"]
encoded = tokenizer.encode(tokens, is_split_into_words=True)

Hugging Face’s Tokenizer class simplifies this process by providing an array of methods and options to tokenize, encode, decode, and handle special tokens effectively. By understanding and utilizing these strategies, you can seamlessly integrate tokenization into your NLP workflows.

Section 2: Fine-Tune a Pretrained Model with the Trainer Class

Now let’s dive deeper into training with the Trainer class. This section covers hyperparameter tuning, visualization options, and advanced training techniques.

Hyperparameter Tuning

Hyperparameters strongly affect how well and how quickly a model trains. The TrainingArguments class exposes the essential ones:

  • learning_rate: The initial learning rate, which controls the optimizer’s step size (default 5e-5).
  • num_train_epochs: The number of full passes over the training set (default 3).
  • per_device_train_batch_size: The batch size per device (GPU or CPU) during training (default 8).

Experiment with different values and compare the results on a validation set to find the combination that performs best.
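One simple way to organize such experiments is a grid search. The sketch below only enumerates the combinations to try; the search space is a hypothetical example rather than a recommendation, and `grid` is a helper name introduced here.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not recommendations.
search_space = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "num_train_epochs": [2, 3],
    "per_device_train_batch_size": [8, 16],
}

def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

configs = list(grid(search_space))
print(len(configs), "configurations, for example:", configs[0])
```

Because the keys match TrainingArguments parameter names, each dictionary could be unpacked into the constructor alongside an output directory, with one training run per configuration.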

Visualizing Training Progress

Visualizing training in real time makes it easier to track progress. The Trainer class can log its metrics to TensorBoard:

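from transformers import TrainingArguments, Trainer

# Illustrative values (assumptions); adjust the paths to your project.
training_args = TrainingArguments(
    output_dir="./results",    # assumed checkpoint directory
    logging_dir="./logs",      # assumed TensorBoard log directory
    report_to="tensorboard",
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)  # model and train_dataset assumed defined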

Visualizing training progress with TensorBoard helps you track metrics, identify trends, and make informed decisions. To view the dashboard, run tensorboard --logdir followed by the path of your logging directory.

Advanced Training Techniques

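Go beyond the basics with techniques like gradient accumulation (summing gradients over several batches before each optimizer step, which simulates a larger batch), mixed precision training (using 16-bit floats to reduce memory use and training time), and learning rate schedules: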

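# Experiment with gradient accumulation and mixed precision training
# (both can also be passed directly to the TrainingArguments constructor)
training_args.gradient_accumulation_steps = 4
training_args.fp16 = True  # typically requires a CUDA GPU

# Implement a learning rate schedule
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# The exponent is an assumption: 5e-5 is a common fine-tuning learning rate
optimizer = AdamW(model.parameters(), lr=5e-5)

# num_train_steps is assumed to be the total number of optimizer steps:
# (batches per epoch // gradient_accumulation_steps) * number of epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_train_steps)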

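Note that the Trainer creates its own optimizer and scheduler by default; to use custom ones, pass them as a tuple through its optimizers argument. Used well, these techniques can help you achieve faster convergence and better results in complex training scenarios.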
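To see what a linear schedule with warmup does to the learning rate, here is a small stand-alone sketch of the multiplier it applies at each step. It mirrors the idea rather than reproducing the library code; `linear_schedule_multiplier` is a name chosen here, and the base learning rate, warmup length, and step count are assumed values.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5  # assumed base learning rate
for step in (0, 50, 100, 550, 1000):
    factor = linear_schedule_multiplier(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, base_lr * factor)
```

With num_warmup_steps=0, as in the snippet above, the warmup branch is never taken and the learning rate simply decays linearly from its initial value to zero.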

Section 3: Fine-Tuning a Pretrained Model in Native PyTorch

Training in native PyTorch gives you direct control over the training loop. This section covers gradient clipping, learning rate schedulers, and custom evaluation functions, which also deepens your understanding of the underlying mechanisms.

Gradient Clipping

Gradient clipping guards against exploding gradients by rescaling all gradients whenever their combined norm exceeds a predefined threshold. Call it after loss.backward() and before optimizer.step():

from torch.nn.utils import clip_grad_norm_

max_grad_norm = 1.0
clip_grad_norm_(model.parameters(), max_grad_norm)

Gradient clipping helps protect your training from numerical instability and often leads to smoother convergence.
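The arithmetic behind norm-based clipping is short enough to write out by hand. The following is an illustrative sketch on a flat list of numbers, not PyTorch’s implementation; `clip_by_global_norm` and the small `eps` constant are assumptions made for this example.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        # Norm exceeds the threshold: shrink every gradient by the same factor.
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Because every value is scaled by the same factor, the direction of the update is preserved and only its size is limited.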

Learning Rate Schedulers

Fine-tune your model’s learning rate dynamically. PyTorch offers a suite of learning rate schedulers, including StepLR, ReduceLROnPlateau, and CosineAnnealingLR:

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

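With these settings, StepLR multiplies the learning rate by 0.1 after every 10 calls to scheduler.step(), which is typically called once per epoch after the optimizer update. A well-chosen schedule can improve model performance and accelerate convergence.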
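The step decay rule can be expressed as a one-line formula. The sketch below computes the learning rate you would expect at a given epoch for the step_size and gamma used above; `step_lr` is a helper name introduced here and the initial learning rate is an assumed value.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under step decay."""
    return base_lr * gamma ** (epoch // step_size)

base_lr = 1e-3  # assumed initial learning rate
for epoch in (0, 9, 10, 19, 20):
    print(epoch, step_lr(base_lr, epoch))
```

The rate stays constant within each block of 10 epochs and drops by a factor of 10 at epochs 10, 20, and so on.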

Custom Evaluation Functions

Elevate your evaluation metrics by incorporating precision, recall, and F1-score alongside accuracy:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_custom_metrics(eval_pred):
    # eval_pred holds the model's raw logits and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # "weighted" averages per-class scores by each class's number of true examples
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="weighted")
    return {"accuracy": (predictions == labels).mean(), "precision": precision, "recall": recall, "f1": f1}

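Custom evaluation functions let you assess model performance from more than one angle. This function can be called directly on a (logits, labels) pair in a native PyTorch evaluation loop, and it also has the signature the Trainer expects for its compute_metrics argument.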
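To show what the "weighted" average means, here is a hand-rolled sketch of the same calculation on a tiny made-up example. It is meant for understanding rather than as a replacement for scikit-learn; `weighted_precision_recall_f1` is a hypothetical helper and the labels and predictions are invented.

```python
def weighted_precision_recall_f1(labels, predictions):
    """Support-weighted precision, recall, and F1 over the classes present in labels."""
    classes = sorted(set(labels))
    total = len(labels)
    precision = recall = f1 = 0.0
    for c in classes:
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        predicted = sum(1 for p in predictions if p == c)
        support = sum(1 for y in labels if y == c)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support  # support > 0 because c comes from labels
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / total  # each class counts in proportion to its true examples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

labels = [0, 0, 1, 1, 1, 2]
predictions = [0, 1, 1, 1, 2, 2]
precision, recall, f1 = weighted_precision_recall_f1(labels, predictions)
accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(accuracy, precision, recall, f1)
```

Support-weighted recall always equals accuracy, so precision and F1 are the values that add new information here.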

Section 4: Additional Resources

As you’ve delved deep into Hugging Face’s capabilities, keep expanding your horizons with these additional resources:


Congratulations! You’ve completed an extensive journey into the world of advanced NLP using the Hugging Face library. By mastering data preparation, advanced training techniques, and native PyTorch approaches, you’re equipped to tackle complex NLP challenges with confidence. Remember, continuous learning and exploration in this dynamic field will lead you to ever greater heights. Happy fine-tuning and experimenting!

Arman Asgharpoor Golroudbari
Space-AI Researcher

My research interests revolve around planetary rovers and spacecraft vision-based navigation.