This is a tutorial on how to use a Large Language Model (LLM) with Transformers.jl.
using Transformers, CUDA
After loading the package, we need to set up the GPU. Currently multi-GPU is not supported. If your machine has multiple GPU devices, we can use CUDA.devices()
to get the list of all devices and use CUDA.device!(device_number)
to specify the device we want to run our model on.
CUDA.devices()
CUDA.DeviceIterator() for 8 devices:
0. NVIDIA A100 80GB PCIe
1. NVIDIA A100 80GB PCIe
2. NVIDIA A100-PCIE-40GB
3. Tesla V100-PCIE-32GB
4. Tesla V100-PCIE-32GB
5. Tesla V100S-PCIE-32GB
6. Tesla V100-PCIE-32GB
7. Tesla V100-PCIE-32GB
CUDA.device!(0)
CuDevice(0): NVIDIA A100 80GB PCIe
For demonstration, we disable scalar indexing on the GPU so that we can make sure all GPU calls are handled without performance issues. By calling enable_gpu
, we get a todevice
function provided by Transformers.jl that will move data/models to the GPU device.
CUDA.allowscalar(false)
enable_gpu(true)
todevice (generic function with 1 method)
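As a quick check, todevice should now move plain Julia arrays (and Flux models) onto the selected CUDA device. A minimal sketch, assuming a working GPU; the array values are arbitrary:
x = todevice(rand(Float32, 3, 3))  # a random array moved to the GPU
typeof(x)                          # should be a CuArray when the GPU is enabled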
In this tutorial, we show how to use dolly-v2-12b (https://huggingface.co/databricks/dolly-v2-12b) in Julia. Dolly is an instruction-following large language model trained on the Databricks machine learning platform that is licensed for commercial use. It's based on the EleutherAI pythia model family and fine-tuned exclusively on databricks-dolly-15k, a new, high-quality human-generated instruction-following dataset crowdsourced among Databricks employees. They provide 3 model sizes: dolly-v2-3b, dolly-v2-7b, and dolly-v2-12b. More information can be found in Databricks' blog post.
The process should also work for other causal LM based models. With Transformers.jl, we can get the tokenizer and model by using the hgf""
string macro or HuggingFace.load_tokenizer
/HuggingFace.load_model
. The required files, such as the model weights, will be downloaded and managed automatically.
using Transformers.HuggingFace
textenc = hgf"databricks/dolly-v2-12b:tokenizer"
model = todevice(hgf"databricks/dolly-v2-12b:ForCausalLM") # move to gpu with `todevice` (or `Flux.gpu`)
HGFGPTNeoXForCausalLM(
  HGFGPTNeoXModel(
    CompositeEmbedding(
      token = Embed(5120, 50280),                              # 257_433_600 parameters
    ),
    Chain(
      Transformer<36>(
        ParallelPreNorm2TransformerBlock(
          SelfAttention(
            CausalGPTNeoXRoPEMultiheadQKVAttenOp(base = 10000.0, dim = 32, head = 40, p = nothing),
            GPTNeoXSplit(40, Dense(W = (5120, 15360), b = true)),  # 78_658_560 parameters
            Dense(W = (5120, 5120), b = true),                 # 26_219_520 parameters
          ),
          LayerNorm(5120, ϵ = 1.0e-5),                         # 10_240 parameters
          Chain(
            Dense(σ = NNlib.gelu, W = (5120, 20480), b = true), # 104_878_080 parameters
            Dense(W = (20480, 5120), b = true),                # 104_862_720 parameters
          ),
          LayerNorm(5120, ϵ = 1.0e-5),                         # 10_240 parameters
        ),
      ),                                                       # Total: 432 arrays, 11_327_016_960 parameters, 72.859 KiB.
      LayerNorm(5120, ϵ = 1.0e-5),                             # 10_240 parameters
    ),
  ),
  Branch{(:logit,) = (:hidden_state,)}(
    EmbedDecoder(Embed(5120, 50280)),                          # 257_433_600 parameters
  ),
)                                                              # Total: 436 arrays, 11_841_894_400 parameters, 93.758 KiB.
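If the full 12B model does not fit in your GPU memory, the same macro form works for the smaller checkpoints mentioned above. A minimal sketch using the dolly-v2-3b variant (the variable names here are only illustrative; the rest of the tutorial stays the same):
textenc_small = hgf"databricks/dolly-v2-3b:tokenizer"                 # smaller Dolly checkpoint
model_small = todevice(hgf"databricks/dolly-v2-3b:ForCausalLM")       # fits on a smaller GPU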
We define some helper functions for the text generation. Here we are doing simple greedy decoding. It can be replaced with other decoding algorithms like beam search. The k
in top_k_sample
decides the number of possible choices at each generation step. The default k = 1
is simply argmax
.
using Flux
using StatsBase
function temp_softmax(logits; temperature = 1.2)
    # scale the logits by the temperature before normalizing to probabilities
    return softmax(logits ./ temperature)
end

function top_k_sample(probs; k = 1)
    sorted = sort(probs, rev = true)                            # probabilities in descending order
    indexes = partialsortperm(probs, 1:k, rev = true)           # indices of the k largest probabilities
    index = sample(indexes, ProbabilityWeights(sorted[1:k]), 1) # sample one of them, weighted by probability
    return index
end
top_k_sample (generic function with 1 method)
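As a quick sanity check, we can run the helpers on a small, made-up logits vector (the numbers below are purely illustrative and not from the model):
toy_logits = [2.0, 1.0, 0.5, -1.0]
toy_probs = temp_softmax(toy_logits)   # temperature-scaled probabilities, summing to 1
top_k_sample(toy_probs; k = 1)         # k = 1 always picks the argmax, here index 1
top_k_sample(toy_probs; k = 3)         # samples one of the 3 most probable indices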
The main generation loop is defined as follows:

1. The context string is preprocessed and encoded with the tokenizer textenc. The encode function returns a NamedTuple where .token is the one-hot representation of our context tokens.
2. At each step, the context tokens are fed to the model. The model also returns a NamedTuple where .logit is the predictions of our model. We then apply the greedy decoding scheme to get the prediction of the next token. The token will be appended to the end of the context tokens. The iteration stops if we exceed the maximum generation length or the predicted token is an end token.
3. After the loop, the decode function converts the one-hots back to text and also performs some post-processing to get the final list of strings.

using Transformers.TextEncoders
function generate_text(textenc, model, context = ""; max_length = 512, k = 1, temperature = 1.2, ends = textenc.endsym)
    encoded = encode(textenc, context).token      # one-hot representation of the context tokens
    ids = encoded.onehots                         # underlying vector of one-hots; pushing to it extends `encoded`
    ends_id = lookup(textenc.vocab, ends)
    for i in 1:max_length
        input = (; token = encoded) |> todevice
        outputs = model(input)
        logits = @view outputs.logit[:, end, 1]   # logits for the last position
        probs = temp_softmax(logits; temperature)
        new_id = top_k_sample(collect(probs); k)[1]
        push!(ids, new_id)                        # append the predicted token to the context
        new_id == ends_id && break                # stop once the end token is generated
    end
    return decode(textenc, encoded)
end
generate_text (generic function with 2 methods)
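generate_text can already be called directly on a raw prompt. A small sketch (the prompt and keyword values are only illustrative; the actual output depends on the model and the sampling settings):
pieces = generate_text(textenc, model, "Julia is a programming language that"; max_length = 64, k = 2)
join(pieces)   # the decoded tokens joined back into a single string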
We use the same prompt format as Dolly, as defined in instruct_pipeline.py:
function generate(textenc, model, instruction; max_length = 512, k = 1, temperature = 1.2)
prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
$instruction
### Response:
"""
text_token = generate_text(textenc, model, prompt; max_length, k, temperature, ends = "### End")
gen_text = join(text_token)
println(gen_text)
end
generate (generic function with 1 method)
generate(textenc, model, "Explain to me the difference between nuclear fission and fusion.")
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain to me the difference between nuclear fission and fusion.

### Response:
Nuclear fission and fusion are both methods by which the nucleus of an atom splits and combines, releasing energy in the process. In nuclear fission, the nucleus is split into two or more smaller pieces. This releases a lot of energy in the form of heat and light, but the pieces are often unstable and will decay into smaller pieces over time. Nuclear fusion occurs when two or more nuclei combine to form a larger nucleus. This process releases less energy than nuclear fission but is more stable, and the energy can be captured and released more efficiently.

### End
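The keyword arguments of generate are passed through to the sampling helpers, so we can trade greedy decoding for more diverse, stochastic generation. A sketch (the instruction text is arbitrary and the output will vary from run to run):
generate(textenc, model, "Write a haiku about the Julia programming language."; k = 4, temperature = 0.8)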