Notebook

This notebook is based on the example provided by Google on how to fine-tune the Gemma-7B model, as found in this example notebook.

此筆記乃基於 Google 所提供如何微調 Gemma-7B 模型的範例筆記所寫成。

How to Fine-Tune Large Language Models (LLMs) Efficiently and Cost-Effectively?¶

In recent years, Large Language Models (LLM) have attracted significant attention due to their ability to solve a wide range of problems and their outstanding performance. These models are typically trained using massive datasets and huge numbers of parameters. Many big-tech companies offer pre-trained models, called base or foundational models, but to utilize them in specific domains, fine-tuning is required. Although ChatGPT offers online model fine-tuning features, users may prefer to fine-tune models in a local environment for privacy or customization reasons.

Fine-tuning large models primarily falls into the following two methods:

Full Parameters Fine-Tuning: Adjusts all parameters of the pre-trained large model. However, due to the massive number of parameters in large models, full parameters fine-tuning requires substantial computational resources, making it impractical for many users.
Parameter-Efficient Fine-Tuning (PEFT): A notable example of PEFT is the Low-Rank Adaptation (LoRA) proposed by the Microsoft team. The concept of LoRA involves freezing the original pre-trained model parameters and fine-tuning a small number of additional parameters (which can be viewed as plugins or patches). Since the original pre-trained model remains unchanged, the training cost is significantly reduced, achieving efficient transfer learning.

如何以較少成本有效地微調大型語言模型 (Large Language Models, LLMs)?¶

近年來，大型語言模型（LLM）因可解決廣泛問題及其優秀表現而受矚目。這些模型通常利用大量數據集和龐大的模型參數進行訓練。許多大型科技公司提供預訓練好的基礎模型（Base/Foundational Model），但如要用於特定領域，則需要透過微調（Fine-Tuning）來調整模型。儘管ChatGPT提供了線上微調模型功能，但出於隱私或自定義需求等各種原因，用戶或希望在本地環境中對模型進行微調。

微調大型模型主要分為以下兩種方法：

全量微調 (Full Parameters Fine-Tuning)：在預訓練的大型模型基礎上調整所有參數，但由於大型模型參數過多，全量微調需要大量的計算資源，對許多用戶來說不切實際。
參數高效微調 (Parameter-Efficient Fine-Tuning, PEFT)：PEFT的一個著名例子是Microsoft團隊提出的Low-Rank Adaptation（LoRA）。LoRA的概念是通過凍結原始的預訓練模型參數，並搭配數據微調少部分額外參數（可以視為插件或補丁）。由於原始預訓練模型保持不變，因此訓練成本大幅降低，實現高效的遷移學習(Transfer Learning)。

參考文獻 / References¶

Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," In ICLR 2021.

Tutorial: Fine-Tuning Large Models with Hugging Face¶

In this tutorial, we will use Google Gemma as our foundational model to demonstrate how to fine-tune models using Hugging Face. Although Google Gemma is publicly available, specific conditions must be accepted for its use. Please obtain permission and a Token here, and then store this Token in the environment variable ["HF_TOKEN"].

教學：使用 Hugging Face 微調大型模型¶

在本教學中，我們將採用Google Gemma作為基礎模型來展示如何使用Hugging Face進行微調。Google Gemma雖然公開，但需接受特定條件才能使用。請在此獲取許可和Token，然後將這個Token儲存在 ["HF_TOKEN"] 環境變數中。

In [ ]:

import os
# from google.colab import userdata
# os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["HF_TOKEN"] = "API_TOKEN"

Common Libraries used During fine-tuning with Hugging Face¶

transformers: The core of Hugging Face, facilitating the use of state-of-the-art pre-trained models.
datasets: Provides common datasets.
bitsandbytes: Offers quantization functionality, helping to reduce model memory usage and improve computational efficiency.
accelerate: Speeds up model computation.
trl and peft: Transformer Reinforcement Learning & Parameter-Efficient Fine-Tuning. Offer efficient model fine-tuning capabilities.

在使用 Hugging Face 進行大型模型微調過程中，以下是幾個常用且重要的程式庫：¶

transformers: Hugging Face的核心，可方便使用最先進的預訓練模型
datasets: 提供常用數據集
bitsandbytes: 提供量化（quantization）功能，幫助減少模型的內存使用量並提升運行效率。
accelerate: 加速模型計算。
trl 和 peft: Transformer Reinforcement Learning & Parameter-Efficient Fine-Tuning. 提供高效的模型微調功能

In [ ]:

# !pip3 install -q -U transformers==4.38.1
# !pip3 install -q -U datasets==2.17.0
# !pip3 install -q -U bitsandbytes==0.42.0
# !pip3 install -q -U accelerate==0.27.1
# !pip3 install -q -U trl==0.7.10
# !pip3 install -q -U peft==0.8.2

What is Quantization?¶

Quantization reduces the model size and speeds up computation by converting model parameters into a lower precision format. Proper quantization can maintain model performance with minimal impact while saving memory. However, excessive quantization may lead to a decrease in model performance. In essence, quantization stores model parameters in fewer bits, reducing computational complexity and memory space (most computations involve matrix multiplication, essentially multiplication).

bnb_config: Optional, quantization configuration options.
tokenizer: Converts text into numbers, a format that the model can understand.
model's device_map: Specifies the device for running the model, where 0 represents GPU 0.

何謂量化 (Quantization)？¶

量化將模型參數轉換成低精度格式來減少模型大小並加速運算。適當的量化可以保持模型性能不受太大影響同時節省記憶體。但過度量化可能會導致模型性能下降。簡而言之，量化用更少位元去儲存模型參數，可減少運算複雜性和記憶空間（大部份是運算是矩陣乘法，本質上是乘法）。

bnb_config: 可選，量化的配置選項
tokenizer: 將文本轉換成數字，即模型可以理解的格式
model 的 device_map: 指定模型運行的裝置，0代表GPU 0

In [ ]:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GemmaTokenizer

model_id = "google/gemma-7b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Demonstrating How to Generate Text with a Model¶

Use tokenizer to convert the input text into a tensor and place it on the GPU for acceleration.
return_tensors="pt" specifies that the return type is a PyTorch tensor (other options include 'tf' for a TensorFlow tensor or 'np' for a numpy array).
max_new_tokens sets the maximum number of tokens to generate. The default is 20, adjust as needed.
outputs[0] retrieves the generated text's tensor. Use tokenizer.decode to convert it back into text format.
skip_special_tokens=True omits special tokens (such as [CLS], [SEP], etc.) during the conversion process.

The generated text is the completion of the unfinished input text"Quote: Imagination is more".

以下展示如何使用模型生成文字¶

使用tokenizer將輸入的文字轉換為tensor，放在GPU上加速。
return_tensors="pt"指定返回的是PyTorch tensor（其他選項包括'tf'返回TensorFlow tensor或'np'返回numpy array）。
max_new_tokens生成最大字數。默認為20，按需調整。
outputs[0]獲得生成文字的tensor。使用tokenizer.decode將它轉換回文字格式
skip_special_tokens=True 轉換過程中不顯示特殊字符（例如[CLS]、[SEP]等）。

生成結果是補全了未完成的輸入文字 "Quote: Imagination is more"。

In [ ]:

text = "Quote: Imagination is more"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Imagination is more important than knowledge.

In [ ]:

os.environ["WANDB_DISABLED"] = "true"

Fine-Tuning Example¶

Suppose we want the model to not only complete unfinished quotes but also to append the author's name. First, we need to fine-tune the model such that it learns generating text in a specific format. Here, we use SFTTrainer and set up LoraConfig:

r=8: A key parameter in LoRA, referring to the rank of the low-rank matrix. In simple terms, LoRA needs to fine-tune/train linear layers with dimensions d*r and r*d. The larger it is, the more model parameters, see the paper for details.
target_modules: Specifies which modules of the Transformer to apply LoRA to, see Attention Is All You Need(2017) for details.
task_type: Specifies the task type, CAUSAL_LM stands for Causal Language Model, usually used for tasks predicting the following words based on the previous context. Others include FEATURE_EXTRACTION, QUESTION_ANS, SEQ_2_SEQ_LM, SEQ_CLS, and TOKEN_CLS.

下面是如何微調模型的例子¶

假設我們並不只想模型補充未完結的名言，而是希望模型更能夠加上作者名字。首先，我們使需要微調的模型去讓其學習生成特定格式的文字。這裡我們使用SFTTrainer，並設定 LoraConfig:

r=8：LoRA中較關鍵的參數，指的是低秩矩陣的秩（rank）。簡易言之，LoRA需要微調／訓練 d*r 和 r*d 大小的線性層。愈大則模型參數愈多，詳見論文。
target_modules：指定Transformer哪些模塊應用LoRA，詳見Attention Is All You Need(2017)。
task_type：指定任務類型，CAUSAL_LM表示因果語言模型（Causal Language Model），通常用於根據前文預測接下來詞語的任務。其他包括FEATURE_EXTRACTION, QUESTION_ANS, SEQ_2_SEQ_LM, SEQ_CLS and TOKEN_CLS.

In [ ]:

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

Loading Data¶

We will use the "Abirate/english_quotes" dataset provided by Hugging Face as an example. The data contains a large number of English quotes ["quote"] and their speakers ["author"].

載入數據¶

我們將使用Hugging Face提供的"Abirate/english_quotes"為例。數據包含大量英文名言["quote"]及其講者["author"]。

In [ ]:

from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

Setting the Output Format¶

In the configuration of SFTTrainer, formatting_func allows you to customize the format of the output data, which should be the format you want the model to learn to generate text in. For example, if we want the model's output format to be a quote followed by the author's name, we can process the quote and author data into the following format:

Example: "Quote: Be yourself; everyone else is already taken.\nAuthor: Oscar Wilde"

Thus, during fine-tuning, the data will be trained in this format, enabling the model to learn and mimic this specific output format.

設定輸出格式¶

在SFTTrainer的設定中，formatting_func允許你自定義輸出資料的格式，並應是你希望模型學習生成文字的格式。舉例來說，如果我們希望模型輸出的格式是一段引用加上作者名字，我們可以將quote和author的資料處理成以下格式：

例如："Quote: Be yourself; everyone else is already taken.\nAuthor: Oscar Wilde"

即微調模型時，資料會以此格式的數據去訓練，使模型學習並模仿這種特定的輸出格式。

In [ ]:

import transformers
from trl import SFTTrainer

def formatting_func(example):
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"
    return [text]

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
C:\Users\sitma\AppData\Local\Programs\Python\Python310\lib\site-packages\trl\trainer\sft_trainer.py:245: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
  warnings.warn(
C:\Users\sitma\AppData\Local\Programs\Python\Python310\lib\site-packages\trl\trainer\sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
C:\Users\sitma\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(

[ 2/10 : < :, Epoch 1/10]

Step	Training Loss

Out[ ]:

TrainOutput(global_step=10, training_loss=0.5222328573465347, metrics={'train_runtime': 26.3684, 'train_samples_per_second': 1.517, 'train_steps_per_second': 0.379, 'total_flos': 21555767439360.0, 'train_loss': 0.5222328573465347, 'epoch': 6.67})

Results¶

Although there are repetition issues, the format indeed meets our expectations.

結果¶

雖然有重覆的問題，但格式的確符合我們的預期。

In [ ]:

text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Imagination is more important than knowledge
Author: Albert Einstein
Author: Albert Einstein
Author: Albert Einstein