This notebook makes it easy to try out finetuning Falcon-RW-1B with Axolotl + QLoRA on RunPod.
If you run into any issues, you're welcome to report them here.
To run this notebook on RunPod, use this RunPod template to deploy a pod with a GPU.
%cd axolotl
/workspace/axolotl
!accelerate config --config_file configs/accelerate/default_config.yaml default
Setting ds_accelerator to cuda (auto detect)
accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml
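If you want to sanity-check what accelerate will use (number of processes, mixed precision, etc.), you can print the generated config file at the path shown above:

!cat /root/.cache/huggingface/accelerate/default_config.yaml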
!wget https://raw.githubusercontent.com/utensil/axolotl/falcon-7b-qlora/examples/falcon/config-7b-qlora.yml
--2023-06-05 14:10:49--  https://raw.githubusercontent.com/utensil/axolotl/falcon-7b-qlora/examples/falcon/config-7b-qlora.yml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2238 (2.2K) [text/plain]
Saving to: ‘config-7b-qlora.yml’

config-7b-qlora.yml 100%[===================>]   2.19K  --.-KB/s    in 0s

2023-06-05 14:10:49 (30.8 MB/s) - ‘config-7b-qlora.yml’ saved [2238/2238]
!cp config-7b-qlora.yml config-1b-qlora.yml
!sed -i -e 's/falcon-7b/falcon-rw-1b/g' -e 's/wandb_project: falcon-qlora/wandb_project: /g' config-1b-qlora.yml
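The sed call above swaps the 7B base model for tiiuae/falcon-rw-1b and blanks out wandb_project. If you'd like to confirm exactly what changed, a quick diff of the two files works too (assuming diff is available in the image):

!diff config-7b-qlora.yml config-1b-qlora.yml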
!cat config-1b-qlora.yml
# 1b: tiiuae/falcon-rw-1b
# 40b: tiiuae/falcon-40b
base_model: tiiuae/falcon-rw-1b
base_model_config: tiiuae/falcon-rw-1b
# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-rw-1b/tree/main
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
# enable 4bit for QLoRA
load_in_4bit: true
gptq: false
strict: false
push_dataset_to_hub:
datasets:
  - path: QingyiSi/Alpaca-CoT
    data_files:
      - Chain-of-Thought/formatted_cot_data/gsm8k_train.json
    type: "alpaca:chat"
dataset_prepared_path: last_run_prepared
val_set_size: 0.01
# enable QLoRA
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:
# hyperparameters from QLoRA paper Appendix B.2
# "We find hyperparameters to be largely robust across datasets"
lora_r: 64
lora_alpha: 16
# 0.1 for models up to 13B
# 0.05 for 33B and 65B models
lora_dropout: 0.05
# add LoRA modules on all linear layers of the base model
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: ./qlora-out
# QLoRA paper Table 9
# - 16 for 7b & 13b
# - 32 for 33b, 64 for 64b
# Max size tested on A6000
# - 7b: 40
# - 40b: 4
# decrease if OOM, increase for max VRAM utilization
micro_batch_size: 1
gradient_accumulation_steps: 2
num_epochs: 3
# Optimizer for QLoRA
optimizer: paged_adamw_32bit
torchdistx_path:
lr_scheduler: cosine
# QLoRA paper Table 9
# - 2e-4 for 7b & 13b
# - 1e-4 for 33b & 64b
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true
# stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 3
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 5
save_steps: 10
debug:
deepspeed:
weight_decay: 0.000001
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: ">>ABSTRACT<<"
  eos_token: "<|endoftext|>"
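For orientation, the load_in_4bit / adapter: qlora / lora_* settings above roughly correspond to the following transformers + peft setup. This is just a hand-written sketch of what Axolotl does internally, not code you need to run in this notebook:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit base model (load_in_4bit: true), computing in bf16 (bf16: true)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-rw-1b",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
# lora_target_linear: true means LoRA on all linear layers; the training log below
# reports these as ['dense_h_to_4h', 'dense', 'dense_4h_to_h', 'query_key_value']
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()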
%env WANDB_MODE=offline
env: WANDB_MODE=offline
This skips some extra setup steps; you can also choose to log in to your W&B account before training, as shown below.
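If you do want the run synced to Weights & Biases, log in before launching training instead of setting offline mode (this assumes you have a W&B account and API key handy):

import wandb
wandb.login()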
!accelerate launch scripts/finetune.py config-1b-qlora.yml
Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /root/miniconda3/envs/py3.9/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/libbitsandbytes_cuda118.so
/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
  warn(msg)
/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//matplotlib_inline.backend_inline')}
  warn(msg)
/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDvnE4umHheXhWsDJbbukYvvyc47/mC4z8syS93btA72T90WDrQagOy5O+DrhdXOvr5i/JwsTlAImy57eLRrtRFOrQq73jyi7Dzo0tvrAiNLVgX2q2dFLoplRyXDXiVYLPmPieMWQOeUCLeSb8FC5zzllcocZwjMXpxScDerZqnlAR0ccpSkGyKIod4ZMkn/29A/C5kHEb/wT8cOAq+MWJ/2okZZgbiR0AMV4DynAkrtcx9JnJnTs9chiMyH+dyCS42Ai24sHWJBkQo6TfxXkyKo9GOpu3Y2WLgrHyaot9Lk5mA1mujyIWdlReD2nvjeCQKjl3KW3xZ73m4nD97MydWSWoJfEWlr+VZvk8EWsZk3CYLZCIBLdod6xXJJ0DD0pvTIq11c8VB7XkgVjapuU/sC8M6HFzHW/NBeE+xX/txPkZkIGqrnxeQ0AtBXdN9ukyNGhGzTkPYJNliiYpY0dCvVuz/BJ2FawFTQGnD1EHOenUCRajREFGCbKoYZqi40j8= utensil@Utensils-MacBook-Pro.local')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/miniconda3/envs/py3.9/lib/python3.9/site-packages/bitsandbytes-0.39.0-py3.9.egg/bitsandbytes/libbitsandbytes_cuda118.so...
Setting ds_accelerator to cuda (auto detect)
WARNING:root:`trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
INFO:root:loading tokenizer... tiiuae/falcon-rw-1b
Downloading (…)okenizer_config.json: 100%|█████| 255/255 [00:00<00:00, 48.1kB/s]
Downloading (…)olve/main/vocab.json: 100%|███| 798k/798k [00:00<00:00, 54.2MB/s]
Downloading (…)olve/main/merges.txt: 100%|███| 456k/456k [00:00<00:00, 11.4MB/s]
Downloading (…)/main/tokenizer.json: 100%|█| 2.11M/2.11M [00:00<00:00, 51.1MB/s]
Downloading (…)cial_tokens_map.json: 100%|████| 99.0/99.0 [00:00<00:00, 200kB/s]
Using pad_token, but it is not set yet.
INFO:root:Unable to find prepared dataset in last_run_prepared/0ecc5b78e3ce4254b22e749b093712b4
INFO:root:Loading raw datasets...
Downloading readme: 100%|██████████████████| 8.26k/8.26k [00:00<00:00, 23.9MB/s]
Downloading and preparing dataset json/QingyiSi--Alpaca-CoT to /root/.cache/huggingface/datasets/QingyiSi___json/QingyiSi--Alpaca-CoT-2953efcfeb19f105/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]
Downloading data:   0%|          | 0.00/4.45M [00:00<?, ?B/s]
Downloading data: 100%|████████████████████| 4.45M/4.45M [00:00<00:00, 25.2MB/s]
Downloading data files: 100%|█████████████████████| 1/1 [00:00<00:00, 2.18it/s]
Extracting data files: 100%|████████████████████| 1/1 [00:00<00:00, 2031.14it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/QingyiSi___json/QingyiSi--Alpaca-CoT-2953efcfeb19f105/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 830.88it/s]
INFO:root:tokenizing, merging, and shuffling master dataset
INFO:root:Saving merged prepared dataset to disk... last_run_prepared/0ecc5b78e3ce4254b22e749b093712b4
INFO:root:loading model and peft_config...
Downloading (…)lve/main/config.json: 100%|██████| 665/665 [00:00<00:00, 452kB/s]
Downloading (…)/configuration_RW.py: 100%|█| 2.61k/2.61k [00:00<00:00, 3.79MB/s]
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-rw-1b:
- configuration_RW.py
Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)main/modelling_RW.py: 100%|█| 47.5k/47.5k [00:00<00:00, 48.6MB/s]
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-rw-1b:
- modelling_RW.py
Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading pytorch_model.bin: 100%|████████| 2.62G/2.62G [00:23<00:00, 113MB/s]
Downloading (…)neration_config.json: 100%|█████| 111/111 [00:00<00:00, 28.3kB/s]
INFO:root:converting PEFT model w/ prepare_model_for_int8_training
/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/peft/utils/other.py:76: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
INFO:root:found linear modules: ['dense_h_to_4h', 'dense', 'dense_4h_to_h', 'query_key_value']
trainable params: 50331648 || all params: 757911552 || trainable%: 6.6408339953630895
INFO:root:Compiling torch model
INFO:root:Pre-saving adapter config to ./qlora-out
INFO:root:Starting trainer...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
wandb: Tracking run with wandb version 0.15.3
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  0%|          | 0/11097 [00:00<?, ?it/s]
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 1.8589, 'learning_rate': 2e-05, 'epoch': 0.0}
{'loss': 1.6525, 'learning_rate': 4e-05, 'epoch': 0.0}
{'loss': 1.6585, 'learning_rate': 6e-05, 'epoch': 0.0}
{'loss': 1.8112, 'learning_rate': 8e-05, 'epoch': 0.0}
{'loss': 1.7954, 'learning_rate': 0.0001, 'epoch': 0.0}
  0%|          | 5/11097 [00:02<1:25:34,  2.16it/s]
  0%|          | 0/75 [00:00<?, ?it/s]
 97%|█████████████████████████████████████████▊ | 73/75 [00:03<00:00, 19.84it/s]
{'eval_loss': 1.9517160654067993, 'eval_runtime': 3.8321, 'eval_samples_per_second': 19.572, 'eval_steps_per_second': 19.572, 'epoch': 0.0}
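Once training finishes (or after any save_steps checkpoint), the LoRA adapter is written to ./qlora-out. At the time of writing, Axolotl's finetune script could load it for a quick interactive test with a command roughly like the one below; the exact flags may have changed since, so check the Axolotl README:

!accelerate launch scripts/finetune.py config-1b-qlora.yml --inference --lora_model_dir="./qlora-out"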