This is a tutorial on how to use a Large Language Model (LLM) with Transformers.jl.
using Transformers, CUDA
After loading the package, we need to set up the GPU. Currently multi-GPU is not supported. If your machine has multiple GPU devices, we can use CUDA.devices()
to get the list of all devices and use CUDA.device!(device_number)
to specify the device we want to run our model on.
CUDA.devices()
CUDA.DeviceIterator() for 8 devices:
0. NVIDIA A100 80GB PCIe
1. NVIDIA A100 80GB PCIe
2. NVIDIA A100-PCIE-40GB
3. Tesla V100-PCIE-32GB
4. Tesla V100-PCIE-32GB
5. Tesla V100S-PCIE-32GB
6. Tesla V100-PCIE-32GB
7. Tesla V100-PCIE-32GB
CUDA.device!(0)
CuDevice(0): NVIDIA A100 80GB PCIe
For demonstration, we disable scalar indexing on the GPU so that we can make sure all GPU calls are handled without performance issues. By calling enable_gpu
, we get a todevice
function provided by Transformers.jl that will move data/models to the GPU device.
CUDA.allowscalar(false)
enable_gpu(true)
todevice (generic function with 1 method)
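As a quick check, todevice should now move plain Julia arrays (and Flux models) onto the selected CUDA device. A minimal sketch, assuming a working GPU; the array values are arbitrary:
x = todevice(rand(Float32, 3, 3))  # a random array moved to the GPU
typeof(x)                          # should be a CuArray when the GPU is enabled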
In this tutorial, we show how to use dolly-v2-12b (https://huggingface.co/databricks/dolly-v2-12b) in Julia. Dolly is an instruction-following large language model trained on the Databricks machine learning platform that is licensed for commercial use. It's based on the EleutherAI pythia model family and fine-tuned exclusively on databricks-dolly-15k, a new, high-quality human-generated instruction-following dataset crowdsourced among Databricks employees. They provide 3 model sizes: dolly-v2-3b, dolly-v2-7b, and dolly-v2-12b. More information can be found in Databricks' blog post.
The process should also work for other causal LM based models. With Transformers.jl, we can get the tokenizer and model by using the hgf""
string macro or HuggingFace.load_tokenizer
/HuggingFace.load_model
. The required files, such as the model weights, will be downloaded and managed automatically.
using Transformers.HuggingFace
textenc = hgf"databricks/dolly-v2-12b:tokenizer"
model = todevice(hgf"databricks/dolly-v2-12b:ForCausalLM") # move to gpu with `todevice` (or `Flux.gpu`)
HGFGPTNeoXForCausalLM(
  HGFGPTNeoXModel(
    CompositeEmbedding(
      token = Embed(5120, 50280),                              # 257_433_600 parameters
    ),
    Chain(
      Transformer<36>(
        ParallelPreNorm2TransformerBlock(
          SelfAttention(
            CausalGPTNeoXRoPEMultiheadQKVAttenOp(base = 10000.0, dim = 32, head = 40, p = nothing),
            GPTNeoXSplit(40, Dense(W = (5120, 15360), b = true)),  # 78_658_560 parameters
            Dense(W = (5120, 5120), b = true),                 # 26_219_520 parameters
          ),
          LayerNorm(5120, ϵ = 1.0e-5),                         # 10_240 parameters
          Chain(
            Dense(σ = NNlib.gelu, W = (5120, 20480), b = true), # 104_878_080 parameters
            Dense(W = (20480, 5120), b = true),                # 104_862_720 parameters
          ),
          LayerNorm(5120, ϵ = 1.0e-5),                         # 10_240 parameters
        ),
      ),                                                       # Total: 432 arrays, 11_327_016_960 parameters, 72.859 KiB.
      LayerNorm(5120, ϵ = 1.0e-5),                             # 10_240 parameters
    ),
  ),
  Branch{(:logit,) = (:hidden_state,)}(
    EmbedDecoder(Embed(5120, 50280)),                          # 257_433_600 parameters
  ),
)                                                              # Total: 436 arrays, 11_841_894_400 parameters, 93.758 KiB.
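If the full 12B model does not fit in your GPU memory, the same macro form works for the smaller checkpoints mentioned above. A minimal sketch using the dolly-v2-3b variant (the variable names here are only illustrative; the rest of the tutorial stays the same):
textenc_small = hgf"databricks/dolly-v2-3b:tokenizer"                 # smaller Dolly checkpoint
model_small = todevice(hgf"databricks/dolly-v2-3b:ForCausalLM")       # fits on a smaller GPU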
We define some helper functions for the text generation. Here we are doing simple greedy decoding. It can be replaced with other decoding algorithms like beam search. The k
in top_k_sample
decides the number of possible choices at each generation step. The default k = 1
is simply argmax
.
using Flux
using StatsBase
function temp_softmax(logits; temperature = 1.2)
    # scale the logits by the temperature before normalizing to probabilities
    return softmax(logits ./ temperature)
end

function top_k_sample(probs; k = 1)
    sorted = sort(probs, rev = true)                            # probabilities in descending order
    indexes = partialsortperm(probs, 1:k, rev = true)           # indices of the k largest probabilities
    index = sample(indexes, ProbabilityWeights(sorted[1:k]), 1) # sample one of them, weighted by probability
    return index
end
top_k_sample (generic function with 1 method)
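As a quick sanity check, we can run the helpers on a small, made-up logits vector (the numbers below are purely illustrative and not from the model):
toy_logits = [2.0, 1.0, 0.5, -1.0]
toy_probs = temp_softmax(toy_logits)   # temperature-scaled probabilities, summing to 1
top_k_sample(toy_probs; k = 1)         # k = 1 always picks the argmax, here index 1
top_k_sample(toy_probs; k = 3)         # samples one of the 3 most probable indices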
The main generation loop is defined as follows:

1. The context string is preprocessed and encoded with the tokenizer textenc. The encode function returns a NamedTuple where .token is the one-hot representation of our context tokens.
2. At each step, the context tokens are fed to the model. The model also returns a NamedTuple where .logit is the predictions of our model. We then apply the greedy decoding scheme to get the prediction of the next token. The token will be appended to the end of the context tokens. The iteration stops if we exceed the maximum generation length or the predicted token is an end token.
3. After the loop, the decode function converts the one-hots back to text and also performs some post-processing to get the final list of strings.

using Transformers.TextEncoders
function generate_text(textenc, model, context = ""; max_length = 512, k = 1, temperature = 1.2, ends = textenc.endsym)
    encoded = encode(textenc, context).token      # one-hot representation of the context tokens
    ids = encoded.onehots                         # underlying vector of one-hots; pushing to it extends `encoded`
    ends_id = lookup(textenc.vocab, ends)
    for i in 1:max_length
        input = (; token = encoded) |> todevice
        outputs = model(input)
        logits = @view outputs.logit[:, end, 1]   # logits for the last position
        probs = temp_softmax(logits; temperature)
        new_id = top_k_sample(collect(probs); k)[1]
        push!(ids, new_id)                        # append the predicted token to the context
        new_id == ends_id && break                # stop once the end token is generated
    end
    return decode(textenc, encoded)
end
generate_text (generic function with 2 methods)
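generate_text can already be called directly on a raw prompt. A small sketch (the prompt and keyword values are only illustrative; the actual output depends on the model and the sampling settings):
pieces = generate_text(textenc, model, "Julia is a programming language that"; max_length = 64, k = 2)
join(pieces)   # the decoded tokens joined back into a single string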
We use the same prompt format as Dolly, as defined in instruct_pipeline.py:
function generate(textenc, model, instruction; max_length = 512, k = 1, temperature = 1.2)
prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
$instruction
### Response:
"""
text_token = generate_text(textenc, model, prompt; max_length, k, temperature, ends = "### End")
gen_text = join(text_token)
println(gen_text)
end
generate (generic function with 1 method)
generate(textenc, model, "Explain to me the difference between nuclear fission and fusion.")
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain to me the difference between nuclear fission and fusion.

### Response:
Nuclear fission and fusion are both methods by which the nucleus of an atom splits and combines, releasing energy in the process. In nuclear fission, the nucleus is split into two or more smaller pieces. This releases a lot of energy in the form of heat and light, but the pieces are often unstable and will decay into smaller pieces over time. Nuclear fusion occurs when two or more nuclei combine to form a larger nucleus. This process releases less energy than nuclear fission but is more stable, and the energy can be captured and released more efficiently.

### End
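The keyword arguments of generate are passed through to the sampling helpers, so we can trade greedy decoding for more diverse, stochastic generation. A sketch (the instruction text is arbitrary and the output will vary from run to run):
generate(textenc, model, "Write a haiku about the Julia programming language."; k = 4, temperature = 0.8)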