Fine-tuning Custom Language Models with Hugging Face and Unsloth

deep learning
huggingface
unsloth
LLM
finetuning
Author

Favian Hatje

Published

April 21, 2024

Fine-Tuning Large Language Models with Custom Data Using Hugging Face and Unsloth on a Single GPU

This image was generated using DALL-E and the following prompt:
A sloth running through a jungle, carrying a huggingface emoji in its arms. The emoji has hearts as eyes. Motionblur in the background. Flaming foot steps. 3D rendering. 16:9 aspect ratio.

Why Unsloth?

Unsloth is a relatively new library focused on fast, memory-efficient fine-tuning with a simple API. Under the hood it relies on quantization and LoRA adapters, builds on top of the Hugging Face ecosystem, and supports models such as Mistral, Gemma, and Llama. If your preferred model is supported, Unsloth is currently one of the best options available for fine-tuning large language models.

Installation

Begin by setting up a dedicated environment. To install Unsloth, follow the instructions provided on their GitHub page.
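As a quick sanity check after installation (our own suggestion, not part of the official instructions), confirm that the library imports cleanly and that PyTorch sees your GPU:

import torch
from unsloth import FastLanguageModel  # raises ImportError if the installation failed

print(torch.cuda.is_available())      # should print True on a working GPU setup
print(torch.cuda.get_device_name(0))  # e.g. the NVIDIA GeForce RTX 3090 used in this post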

Dataset Creation

Convert your raw conversation or text data into a list of lists of dictionaries, as described in the Unsloth wiki:

[
    [{"from": "human", "value": "Hi there!"},
     {"from": "gpt", "value": "Hi how can I help?"},
     {"from": "human", "value": "What is 2+2?"}],
    [{"from": "human", "value": "What's your name?"},
     {"from": "gpt", "value": "I'm Daniel!"},
     {"from": "human", "value": "Ok! Nice!"},
     {"from": "gpt", "value": "What can I do for you?"},
     {"from": "human", "value": "Oh nothing :)"},],
]
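If your raw data is not already in this shape, the conversion is usually just a small loop. Here is a minimal sketch, assuming a hypothetical list of question/answer pairs as the starting point:

# Hypothetical raw data: plain question/answer pairs
qa_pairs = [
    ("What is 2+2?", "2+2 equals 4."),
    ("What's your name?", "I'm Daniel!"),
]

# Turn each pair into a two-message conversation in the expected format
conversations = [
    [{"from": "human", "value": question},
     {"from": "gpt", "value": answer}]
    for question, answer in qa_pairs
]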

Next, determine the format required for parsing the data before tokenization. Begin by loading the tokenizer alongside the model. For this tutorial, we will use the google/gemma-1.1-2b-it model. Although 2.51 billion parameters may sound like a lot, it is a relatively small model by LLM standards.

from unsloth import FastLanguageModel

base_model = "google/gemma-1.1-2b-it"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model,
    load_in_4bit=True, # Load the model in 4-bit mode
)
==((====))==  Unsloth: Fast Gemma patching release 2024.4
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.669 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
tokenizer.chat_template
"{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"

The chat template is a Jinja template. Because Jinja passes whitespace and newlines straight through into the rendered prompt, the template must be kept free of unnecessary whitespace and newlines, which is why it is stored as a single hard-to-read line.

Below is the template presented in a more readable format:

{{ bos_token }}
{% if messages[0]['role'] == 'system' %}
    {{ raise_exception('System role not supported') }}
{% endif %}
{% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}
    {% if (message['role'] == 'assistant') %}
        {% set role = 'model' %}
    {% else %}
        {% set role = message['role'] %}
    {% endif %}
    {{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}
{% endfor %}
{% if add_generation_prompt %}
    {{'<start_of_turn>model\n'}}
{% endif %}

Note that each message is represented as a dictionary with a from and a value field. In the data, the from values are labeled gpt and human, whereas the chat template expects the roles assistant and user. The simplest fix is to adjust the data itself (a sketch of that follows below). Alternatively, the keys and roles can be remapped when attaching the chat template, using the mapping argument of Unsloth's get_chat_template:

from unsloth.chat_templates import get_chat_template

# Map the template's expected keys and roles to the ones used in our data
mapping = {
    "role" : "from",     # Change 'role' to 'from'
    "content" : "value", # Change 'content' to 'value'
    "user" : "human",    # Default and not relevant here
    "assistant" : "gpt"  # Default and not relevant here
}

tokenizer = get_chat_template(
    tokenizer,
    chat_template = (
        tokenizer.chat_template, # we are not changing anything here,
        tokenizer.eos_token),    # just passing in the default values
    mapping = mapping,
    map_eos_token = False,
)
data = [
    # Conversation 1
    [{"from": "human", "value": "Hi there!"},
     {"from": "gpt", "value": "Hi how can I help?"},
     {"from": "human", "value": "What is 2+2?"}],
    # Conversation 2
    [{"from": "human", "value": "What's your name?"},
     {"from": "gpt", "value": "I'm Daniel!"},
     {"from": "human", "value": "Ok! Nice!"},
     {"from": "gpt", "value": "What can I do for you?"},
     {"from": "human", "value": "Oh nothing :)"},],
]
tokenizer.apply_chat_template(data[0], tokenize=False)
'<bos><start_of_turn>human\nHi there!<end_of_turn>\n<start_of_turn>model\nHi how can I help?<end_of_turn>\n<start_of_turn>human\nWhat is 2+2?<end_of_turn>\n'

Now that both the data and tokenizer are set up, you can proceed to create a Hugging Face dataset:

from datasets import Dataset, load_dataset

# Hugging Face expects data in the following format
data = {"samples": data}
dataset = Dataset.from_dict(data)

# Optionally, create training and testing splits, 
# and push the dataset to the Hugging Face Hub:

# dataset = dataset.train_test_split(test_size=0.1)
# dataset.push_to_hub("new_custom_dataset")

# To load the dataset from the hub:
# dataset = load_dataset("your_huggingface_name/new_custom_dataset")

Preprocessing the Dataset for Training

Finally, preprocess the dataset for training by mapping each entry through a function that applies the chat template using the tokenizer. This step prepares the data without converting it into tokens yet.

# Lastly, we preprocess the dataset for training
dataset = dataset.map(
    lambda x: {
        "preprocessed": tokenizer.apply_chat_template(
            x["samples"], 
            tokenize=False,              # Avoid converting text into tokens at this stage
            add_generation_prompt=False, # Only set this to True when formatting a prompt for inference
            add_special_tokens=False     # May be necessary depending on model specifics
        )
    }
)

Now our dataset has two fields: samples, which contains the raw conversations, and preprocessed, where the chat template has been applied and the special tokens (<bos>, <start_of_turn>, <end_of_turn>) have been inserted:

dataset["samples"][0]
[{'from': 'human', 'value': 'Hi there!'},
 {'from': 'gpt', 'value': 'Hi how can I help?'},
 {'from': 'human', 'value': 'What is 2+2?'}]
dataset["preprocessed"][0]
'<bos><start_of_turn>human\nHi there!<end_of_turn>\n<start_of_turn>model\nHi how can I help?<end_of_turn>\n<start_of_turn>human\nWhat is 2+2?<end_of_turn>\n'

Ensure that the preprocessed output is formatted precisely as you and the model require. If there is any uncertainty about the format, consult the respective paper or documentation associated with the model to verify the expected data structure and formatting details. This is crucial because any discrepancies in format can lead to a degradation in model performance.
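A quick, informal check (our own suggestion, adapt it to your model's format) is to assert that the expected special tokens are present and to round-trip one sample through the tokenizer:

sample = dataset["preprocessed"][0]
assert sample.startswith(tokenizer.bos_token)                     # '<bos>' for Gemma
assert "<start_of_turn>" in sample and "<end_of_turn>" in sample  # Gemma's turn markers

# Round-trip: tokenize and decode to see exactly what the model will consume
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
print(tokenizer.decode(ids))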

PEFT Training

To reduce memory usage and speed up fine-tuning, we use a technique known as QLoRA: training LoRA (Low-Rank Adaptation) adapters on top of a quantized base model. Unsloth manages the quantization and LoRA parameters for us, streamlining the process.

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Rank of the lora adapters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0.1, # Supports any, but = 0 is optimized
    bias = "none",      # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = 2048,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.4 patched 18 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "preprocessed",
    max_seq_length = 2048,
    tokenizer = tokenizer,
    args = TrainingArguments(
        # Adjust the parameters to your needs
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 1,
        warmup_steps = 1,       
        max_steps = 5,          
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,      
        save_steps = 100,       
        output_dir = "new_model_name",
        optim = "adamw_8bit",
        report_to="tensorboard",
        learning_rate = 1e-3,
    ),
)
trainer.train()
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 1
\        /    Total batch size = 8 | Total steps = 5
 "-____-"     Number of trainable parameters = 19,611,648
[5/5 00:00, Epoch 5/5]
Step Training Loss
1 16.625000
2 16.625000
3 9.937500
4 7.281200
5 5.968800

TrainOutput(global_step=5, training_loss=11.2875, metrics={'train_runtime': 1.4908, 'train_samples_per_second': 26.831, 'train_steps_per_second': 3.354, 'total_flos': 7880446279680.0, 'train_loss': 11.2875, 'epoch': 5.0})

Once the model is trained, you can merge the LoRA adapters into the base model to finalize your newly fine-tuned model.

model.save_pretrained_merged("new_model", tokenizer, save_method = "merged_16bit",)
# alternatively we can push it to the huggingface hub
# model.push_to_hub_merged("your hf_name/new_model", tokenizer, save_method = "merged_16bit", token = "")
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 18.43 out of 31.25 RAM for saving.
100%|██████████| 18/18 [00:00<00:00, 128.34it/s]
Unsloth: Saving tokenizer...
 Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
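Both the merged model and the tokenizer can now be loaded with plain Hugging Face transformers, without Unsloth. Below is a minimal sketch of doing exactly that and generating a reply, assuming the model was saved to the "new_model" directory as above and using the same turn format that the chat template produced during preprocessing:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("new_model")
model = AutoModelForCausalLM.from_pretrained("new_model", device_map="auto")

# Same turn format as the preprocessed training data; Gemma's tokenizer
# prepends the <bos> token automatically.
prompt = "<start_of_turn>human\nWhat is 2+2?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))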

And there you have it: a large language model fine-tuned on our custom dataset. Unsloth also provides its own inference solution if you prefer to stay within its ecosystem. Enjoy exploring the capabilities of your fine-tuned model!