This notebook is based on the example provided by Google on how to fine-tune the Gemma-7B model, as found in this example notebook.
In recent years, Large Language Models (LLM) have attracted significant attention due to their ability to solve a wide range of problems and their outstanding performance. These models are typically trained using massive datasets and huge numbers of parameters. Many big-tech companies offer pre-trained models, called base or foundational models, but to utilize them in specific domains, fine-tuning is required. Although ChatGPT offers online model fine-tuning features, users may prefer to fine-tune models in a local environment for privacy or customization reasons.
Fine-tuning large models primarily falls into the following two methods:
近年來,大型語言模型(LLM)因可解決廣泛問題及其優秀表現而受矚目。這些模型通常利用大量數據集和龐大的模型參數進行訓練。許多大型科技公司提供預訓練好的基礎模型(Base/Foundational Model),但如要用於特定領域,則需要透過微調(Fine-Tuning)來調整模型。儘管ChatGPT提供了線上微調模型功能,但出於隱私或自定義需求等各種原因,用戶或希望在本地環境中對模型進行微調。
微調大型模型主要分為以下兩種方法:
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," In ICLR 2021.
In this tutorial, we will use Google Gemma as our foundational model to demonstrate how to fine-tune models using Hugging Face. Although Google Gemma is publicly available, specific conditions must be accepted for its use. Please obtain permission and a Token here, and then store this Token in the environment variable ["HF_TOKEN"]
.
在本教學中,我們將採用Google Gemma作為基礎模型來展示如何使用Hugging Face進行微調。Google Gemma雖然公開,但需接受特定條件才能使用。請在此獲取許可和Token,然後將這個Token儲存在 ["HF_TOKEN"]
環境變數中。
import os
# from google.colab import userdata
# os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["HF_TOKEN"] = "API_TOKEN"
transformers
: The core of Hugging Face, facilitating the use of state-of-the-art pre-trained models.datasets
: Provides common datasets.bitsandbytes
: Offers quantization functionality, helping to reduce model memory usage and improve computational efficiency.accelerate
: Speeds up model computation.trl
and peft
: Transformer Reinforcement Learning & Parameter-Efficient Fine-Tuning. Offer efficient model fine-tuning capabilities.transformers
: Hugging Face的核心,可方便使用最先進的預訓練模型datasets
: 提供常用數據集bitsandbytes
: 提供量化(quantization)功能,幫助減少模型的內存使用量並提升運行效率。accelerate
: 加速模型計算。trl
和 peft
: Transformer Reinforcement Learning & Parameter-Efficient Fine-Tuning. 提供高效的模型微調功能# !pip3 install -q -U transformers==4.38.1
# !pip3 install -q -U datasets==2.17.0
# !pip3 install -q -U bitsandbytes==0.42.0
# !pip3 install -q -U accelerate==0.27.1
# !pip3 install -q -U trl==0.7.10
# !pip3 install -q -U peft==0.8.2
Quantization reduces the model size and speeds up computation by converting model parameters into a lower precision format. Proper quantization can maintain model performance with minimal impact while saving memory. However, excessive quantization may lead to a decrease in model performance. In essence, quantization stores model parameters in fewer bits, reducing computational complexity and memory space (most computations involve matrix multiplication, essentially multiplication).
bnb_config
: Optional, quantization configuration options.tokenizer
: Converts text into numbers, a format that the model can understand.model
's device_map
: Specifies the device for running the model, where 0
represents GPU 0.量化將模型參數轉換成低精度格式來減少模型大小並加速運算。適當的量化可以保持模型性能不受太大影響同時節省記憶體。但過度量化可能會導致模型性能下降。簡而言之,量化用更少位元去儲存模型參數,可減少運算複雜性和記憶空間(大部份是運算是矩陣乘法,本質上是乘法)。
bnb_config
: 可選,量化的配置選項tokenizer
: 將文本轉換成數字,即模型可以理解的格式model
的 device_map
: 指定模型運行的裝置,0
代表GPU 0import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GemmaTokenizer
model_id = "google/gemma-7b"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
tokenizer
to convert the input text into a tensor
and place it on the GPU for acceleration.return_tensors="pt"
specifies that the return type is a PyTorch tensor
(other options include 'tf'
for a TensorFlow tensor
or 'np'
for a numpy array
).max_new_tokens
sets the maximum number of tokens to generate. The default is 20, adjust as needed.outputs[0]
retrieves the generated text's tensor
. Use tokenizer.decode
to convert it back into text format.skip_special_tokens=True
omits special tokens (such as [CLS]
, [SEP]
, etc.) during the conversion process.The generated text is the completion of the unfinished input text"Quote: Imagination is more"
.
tokenizer
將輸入的文字轉換為tensor
,放在GPU上加速。return_tensors="pt"
指定返回的是PyTorch tensor
(其他選項包括'tf'
返回TensorFlow tensor
或'np'
返回numpy array
)。max_new_tokens
生成最大字數。默認為20,按需調整。outputs[0]
獲得生成文字的tensor
。使用tokenizer.decode
將它轉換回文字格式skip_special_tokens=True
轉換過程中不顯示特殊字符(例如[CLS]
、[SEP]
等)。生成結果是補全了未完成的輸入文字 "Quote: Imagination is more"
。
text = "Quote: Imagination is more"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quote: Imagination is more important than knowledge.
os.environ["WANDB_DISABLED"] = "true"
Suppose we want the model to not only complete unfinished quotes but also to append the author's name. First, we need to fine-tune the model such that it learns generating text in a specific format. Here, we use SFTTrainer
and set up LoraConfig
:
r=8
: A key parameter in LoRA, referring to the rank of the low-rank matrix. In simple terms, LoRA needs to fine-tune/train linear layers with dimensions d*r
and r*d
. The larger it is, the more model parameters, see the paper for details.target_modules
: Specifies which modules of the Transformer
to apply LoRA to, see Attention Is All You Need(2017) for details.task_type
: Specifies the task type, CAUSAL_LM
stands for Causal Language Model, usually used for tasks predicting the following words based on the previous context. Others include FEATURE_EXTRACTION
, QUESTION_ANS
, SEQ_2_SEQ_LM
, SEQ_CLS
, and TOKEN_CLS
.假設我們並不只想模型補充未完結的名言,而是希望模型更能夠加上作者名字。首先,我們使需要微調的模型去讓其學習生成特定格式的文字。這裡我們使用SFTTrainer
,並設定 LoraConfig
:
r=8
:LoRA中較關鍵的參數,指的是低秩矩陣的秩(rank)。簡易言之,LoRA需要微調/訓練 d*r
和 r*d
大小的線性層。愈大則模型參數愈多,詳見論文。target_modules
:指定Transformer
哪些模塊應用LoRA,詳見Attention Is All You Need(2017)。task_type
:指定任務類型,CAUSAL_LM
表示因果語言模型(Causal Language Model),通常用於根據前文預測接下來詞語的任務。其他包括FEATURE_EXTRACTION
, QUESTION_ANS
, SEQ_2_SEQ_LM
, SEQ_CLS and TOKEN_CLS
.from peft import LoraConfig
lora_config = LoraConfig(
r=8,
target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
task_type="CAUSAL_LM",
)
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
In the configuration of SFTTrainer
, formatting_func
allows you to customize the format of the output data, which should be the format you want the model to learn to generate text in. For example, if we want the model's output format to be a quote followed by the author's name, we can process the quote
and author
data into the following format:
Example:
"Quote: Be yourself; everyone else is already taken.\nAuthor: Oscar Wilde"
Thus, during fine-tuning, the data will be trained in this format, enabling the model to learn and mimic this specific output format.
在SFTTrainer
的設定中,formatting_func
允許你自定義輸出資料的格式,並應是你希望模型學習生成文字的格式。舉例來說,如果我們希望模型輸出的格式是一段引用加上作者名字,我們可以將quote
和author
的資料處理成以下格式:
例如:
"Quote: Be yourself; everyone else is already taken.\nAuthor: Oscar Wilde"
即微調模型時,資料會以此格式的數據去訓練,使模型學習並模仿這種特定的輸出格式。
import transformers
from trl import SFTTrainer
def formatting_func(example):
text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"
return [text]
trainer = SFTTrainer(
model=model,
train_dataset=data["train"],
args=transformers.TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=2,
max_steps=10,
learning_rate=2e-4,
fp16=True,
logging_steps=1,
output_dir="outputs",
optim="paged_adamw_8bit"
),
peft_config=lora_config,
formatting_func=formatting_func,
)
trainer.train()
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none). C:\Users\sitma\AppData\Local\Programs\Python\Python310\lib\site-packages\trl\trainer\sft_trainer.py:245: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024 warnings.warn( C:\Users\sitma\AppData\Local\Programs\Python\Python310\lib\site-packages\trl\trainer\sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code. warnings.warn( C:\Users\sitma\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True) warnings.warn(
Step | Training Loss |
---|
TrainOutput(global_step=10, training_loss=0.5222328573465347, metrics={'train_runtime': 26.3684, 'train_samples_per_second': 1.517, 'train_steps_per_second': 0.379, 'total_flos': 21555767439360.0, 'train_loss': 0.5222328573465347, 'epoch': 6.67})
text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quote: Imagination is more important than knowledge Author: Albert Einstein Author: Albert Einstein Author: Albert Einstein