Subword Tokenization¶

Implementation of byte-pair encoding (BPE) from "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., ACL 2016)

In [1]:

import re, collections

# Count how often each pair of adjacent symbols occurs, weighted by word frequency
def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

# Merge every occurrence of the given pair (the most frequent one) in the vocabulary
def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')  
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# A simplified vocabulary, which you would normally collect from a real-world text corpus
vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2,
'n e w e s t </w>':6, 'w i d e s t </w>' : 3}

num_merges = 5
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)
print(vocab)

('e', 's')
('es', 't')
('est', '</w>')
('l', 'o')
('lo', 'w')
{'low </w>': 5, 'low e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
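
The merges learned above can be replayed, in the same order, to segment a word that never appeared in the training vocabulary. The following is a minimal sketch and not part of the reference code from the paper: `apply_merges` is a hypothetical helper name, and the merge list is copied from the printed output above.

```python
def apply_merges(word, merges):
    """Segment `word` by replaying BPE merges in the order they were learned."""
    # Start from single characters plus the end-of-word marker
    symbols = list(word) + ['</w>']
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Join each adjacent occurrence of the pair (left, right)
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Merge list copied from the printed output above
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
print(apply_merges('lowest', merges))  # ['low', 'est</w>']
```

Although 'lowest' is not in the toy vocabulary, it is split into two known subword units, which is the behaviour that makes BPE useful for rare words.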

Using a Pre-trained Tokenizer¶

In [2]:

import tiktoken
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [3]:

token_ids = enc.encode('Prüfungsvorleistung')
token_ids

Out[3]:

[3617, 2448, 79706, 3576, 269, 273, 84314]

In [4]:

[enc.decode([i]) for i in token_ids]

Out[4]:

['Pr', 'ü', 'fung', 'sv', 'or', 'le', 'istung']

Phenotyping with LLMs¶

We will show how to use ChatGPT through the OpenAI API for zero-shot and few-shot smoking status classification, which is a kind of phenotyping task. Note: if you want to run the notebook yourself, make sure to provide an API key: https://github.com/openai/openai-python#usage

In [5]:

from openai import OpenAI
import os

client = OpenAI()

In [6]:

# Helper function to send messages to OpenAI API (ChatGPT model)
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        messages=messages,
        model=model, # use the model passed in by the caller
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content.replace('```', '')

Zero-Shot Inference¶

In [7]:

text = "Social History: No alcohol use and quit tobacco greater than 25 years ago with a 10-pack year smoking history."

Prompt 1¶

Describes the task

In [8]:

prompt1 = "What is the smoking status of the person described in this clinical note? ```{}```"

In [9]:

prompt1.format(text)

Out[9]:

'What is the smoking status of the person described in this clinical note? ```Social History: No alcohol use and quit tobacco greater than 25 years ago with a 10-pack year smoking history.```'

In [10]:

get_completion(prompt1.format(text))

Out[10]:

'The smoking status of the person described in this clinical note is that they quit tobacco greater than 25 years ago.'

Prompt 2¶

Describes the task and valid response options (for classification)

In [11]:

prompt2 = ("What is the smoking status of the person described in this clinical note?"
" The valid options are: smoker, non-smoker, ex-smoker "
" Input: ```{}```")

In [12]:

get_completion(prompt2.format(text))

Out[12]:

'The smoking status of the person described in this clinical note is "ex-smoker".'

Prompt 3¶

Describes the task, valid response options, and output format (JSON)

In [13]:

prompt3 = ("What is the smoking status of the person described in this clinical note?"
" The valid options are: current smoker, non-smoker, ex-smoker "
" Please return the answer as a JSON of the format {{ label : <label> }} without any explanations."
" Input: ```{}```")

In [14]:

response = get_completion(prompt3.format(text))
response

Out[14]:

'{"label": "ex-smoker"}'

In [15]:

import json
json.loads(response)

Out[15]:

{'label': 'ex-smoker'}
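
Because the model's reply is free text, `json.loads` can fail, or the reply can contain a label outside the allowed set. The following is a minimal sketch of a defensive parser; `parse_label` and `VALID_LABELS` are hypothetical names introduced here, and the label set is the one listed in Prompt 3.

```python
import json

# Label set taken from Prompt 3 (assumed to be the complete set of valid answers)
VALID_LABELS = {"current smoker", "non-smoker", "ex-smoker"}

def parse_label(response):
    """Return the label from a JSON reply, or None if the reply is malformed or unexpected."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    label = parsed.get("label")
    # Accept only string labels from the allowed set
    if isinstance(label, str) and label in VALID_LABELS:
        return label
    return None

print(parse_label('{"label": "ex-smoker"}'))        # ex-smoker
print(parse_label('The patient is an ex-smoker.'))  # None
```

Returning `None` instead of raising keeps a batch run going; the rejected replies can be collected and inspected afterwards.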

Prompt 4¶

In addition to the previous prompt, we ask the model to perform an annotation task, i.e., to retrieve a piece of text from the input that justifies the classification.

In [16]:

prompt4 = ("What is the smoking status of the person described in this clinical note?"
" The valid options are: current smoker, non-smoker, ex-smoker "
" Please return the answer as a JSON of the format {{ label : <label>, evidence: <keyphrase> }} without any explanations. "
" 'evidence' should contain the shortest possible substring from the input that can be used to justify the label."
" Input: ```{}```")

In [17]:

get_completion(prompt4.format(text))

Out[17]:

'{"label": "ex-smoker", "evidence": "quit tobacco greater than 25 years ago"}'
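
The prompt asks for a substring of the input, but nothing guarantees that the model copies it verbatim. A minimal sketch of a check follows; `evidence_is_grounded` is a hypothetical helper, and `note_text` and `reply` simply repeat the input and the output shown above.

```python
import json

def evidence_is_grounded(reply, source_text, key="evidence"):
    """Check that the evidence string in a JSON reply occurs verbatim in the source text."""
    evidence = json.loads(reply).get(key, "")
    # An empty string is a substring of every text, so reject it explicitly
    return isinstance(evidence, str) and bool(evidence) and evidence in source_text

note_text = ("Social History: No alcohol use and quit tobacco greater than 25 years ago "
             "with a 10-pack year smoking history.")
reply = '{"label": "ex-smoker", "evidence": "quit tobacco greater than 25 years ago"}'
print(evidence_is_grounded(reply, note_text))  # True
```

The `key` parameter is there because the few-shot prompt later in this notebook uses the field name `keyphrase` rather than `evidence`.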

In-Context Learning¶

Instead of describing the task in detail, we can provide a few training examples directly in the prompt.

In [18]:

prompt_few_shot = ('Your task is to determine the smoking status of the person described in a clinical note.\n'
'Please return the answer as a JSON of the format {{ label : <label>, evidence: <keyphrase> }} without any explanations.\n'
'Here are some examples:\n'
'Input: ```Smoker until 1999``` Output: ```{{ "label" : "ex-smoker", "keyphrase": "Smoker until 1999"}}```\n'
'Input: ```… SOCIAL HISTORY: Widowed since 1972, no tobacco, no alcohol, lives alone.``` Output: ```{{ "label" : "non-smoker", "keyphrase": "no tobacco"}}```\n'
'Input: ```He is a heavy smoker and drinks 2–3 shots per day at times.``` Output: ```{{ "label" : "current smoker", "keyphrase": "heavy smoker"}}```\n'
'Input: ```{}``` Output: ')

In [19]:

print(prompt_few_shot.format(text))

Your task is to determine the smoking status of the person described in a clinical note.
Please return the answer as a JSON of the format { label : <label>, evidence: <keyphrase> } without any explanations.
Here are some examples:
Input: ```Smoker until 1999``` Output: ```{ "label" : "ex-smoker", "keyphrase": "Smoker until 1999"}```
Input: ```… SOCIAL HISTORY: Widowed since 1972, no tobacco, no alcohol, lives alone.``` Output: ```{ "label" : "non-smoker", "keyphrase": "no tobacco"}```
Input: ```He is a heavy smoker and drinks 2–3 shots per day at times.``` Output: ```{ "label" : "current smoker", "keyphrase": "heavy smoker"}```
Input: ```Social History: No alcohol use and quit tobacco greater than 25 years ago with a 10-pack year smoking history.``` Output:

In [20]:

response = get_completion(prompt_few_shot.format(text))
response

Out[20]:

'{ "label" : "ex-smoker", "keyphrase": "quit tobacco greater than 25 years ago" }'
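
Writing the examples by hand becomes error-prone as their number grows. The following is a minimal sketch that assembles a comparable prompt from a list of examples; `build_few_shot_prompt` is a hypothetical helper, and, as an assumption that departs from the prompt above, the instruction line names the second field `keyphrase` so that it matches the examples.

```python
import json

FENCE = "`" * 3  # triple backtick used to delimit inputs and outputs in the prompt

def build_few_shot_prompt(examples, note):
    """Assemble a few-shot prompt from (input, label, keyphrase) triples."""
    lines = [
        "Your task is to determine the smoking status of the person described in a clinical note.",
        "Please return the answer as a JSON of the format "
        "{ label : <label>, keyphrase: <keyphrase> } without any explanations.",
        "Here are some examples:",
    ]
    for example_input, label, keyphrase in examples:
        # ensure_ascii=False keeps characters such as umlauts readable in the prompt
        answer = json.dumps({"label": label, "keyphrase": keyphrase}, ensure_ascii=False)
        lines.append(f"Input: {FENCE}{example_input}{FENCE} Output: {FENCE}{answer}{FENCE}")
    # The final input is left without an output for the model to complete
    lines.append(f"Input: {FENCE}{note}{FENCE} Output: ")
    return "\n".join(lines)

examples = [
    ("Smoker until 1999", "ex-smoker", "Smoker until 1999"),
    ("He is a heavy smoker and drinks 2-3 shots per day at times.", "current smoker", "heavy smoker"),
]
print(build_few_shot_prompt(examples, "No tobacco, no alcohol."))
```

Since the note is inserted with an f-string rather than `str.format`, the literal braces in the instruction line do not need to be doubled.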

A More Complicated Example¶

We will use a German-language case report from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7988258/

In [21]:

german_text_long = """
Fallschilderung
Anamnese. Der 67-jährige Patient ist bekannt in Ihrer allgemeininternistischen Praxis.
Letzte Vorstellung vor 2 Monaten: Erstdiagnose Magenkarzinom, Beginn neoadjuvante Chemotherapie (FLOT-Protokoll mit 5‑Fluorouracil, Folinsäure, Oxaliplatin und Docetaxel) in potenziell kurativem Setting im lokalen Klinikum
Heutige Vorstellung: seit etwa 10 Tagen zunehmende Dyspnoe (erst unter Belastung, mittlerweile auch in Ruhe, Orthopnoe nachts).
Gelegentlich etwas trockener Husten, zweimalig Temperatur von 37,7 °C in den letzten 10 Tagen.
In 3 Tagen steht der nächste Chemotherapiezyklus an. Der Patient legt einen Arztbrief vor.
Inhalt: stationäre Behandlung vor 3 Wochen aufgrund einer Pneumonie (linksseitig)

Vorerkrankungen. Linksherzinsuffizienz („heart failure with preserved ejection fraction“ [HFpEF], linksventrikuläre Ejektionsfraktion 50 %), chronisch-obstruktive Lungenerkrankung im Stadium I nach Global Initiative for Chronic Obstructive Lung Disease (GOLD), Risikoklasse A, florider Nikotinabusus (kumulativ 20 Packungsjahre), Magenkarzinom Stadium IIA
Körperliche Untersuchung. Auskultation: beidseits vesikuläres Atemgeräusch mit gering verlängertem Exspirium und basaler Dämpfung links, keine Rasselgeräusche. Perkussion: sonorer Klopfschall bis auf links basal – hier hyposonor, Lungengrenze links nicht atemverschieblich. Herztöne rhythmisch, rein und normofrequent. Knöchelödeme"""

In [22]:

print(prompt_few_shot.format(german_text_long))

Your task is to determine the smoking status of the person described in a clinical note.
Please return the answer as a JSON of the format { label : <label>, evidence: <keyphrase> } without any explanations.
Here are some examples:
Input: ```Smoker until 1999``` Output: ```{ "label" : "ex-smoker", "keyphrase": "Smoker until 1999"}```
Input: ```… SOCIAL HISTORY: Widowed since 1972, no tobacco, no alcohol, lives alone.``` Output: ```{ "label" : "non-smoker", "keyphrase": "no tobacco"}```
Input: ```He is a heavy smoker and drinks 2–3 shots per day at times.``` Output: ```{ "label" : "current smoker", "keyphrase": "heavy smoker"}```
Input: ```
Fallschilderung
Anamnese. Der 67-jährige Patient ist bekannt in Ihrer allgemeininternistischen Praxis.
Letzte Vorstellung vor 2 Monaten: Erstdiagnose Magenkarzinom, Beginn neoadjuvante Chemotherapie (FLOT-Protokoll mit 5‑Fluorouracil, Folinsäure, Oxaliplatin und Docetaxel) in potenziell kurativem Setting im lokalen Klinikum
Heutige Vorstellung: seit etwa 10 Tagen zunehmende Dyspnoe (erst unter Belastung, mittlerweile auch in Ruhe, Orthopnoe nachts).
Gelegentlich etwas trockener Husten, zweimalig Temperatur von 37,7 °C in den letzten 10 Tagen.
In 3 Tagen steht der nächste Chemotherapiezyklus an. Der Patient legt einen Arztbrief vor.
Inhalt: stationäre Behandlung vor 3 Wochen aufgrund einer Pneumonie (linksseitig)

Vorerkrankungen. Linksherzinsuffizienz („heart failure with preserved ejection fraction“ [HFpEF], linksventrikuläre Ejektionsfraktion 50 %), chronisch-obstruktive Lungenerkrankung im Stadium I nach Global Initiative for Chronic Obstructive Lung Disease (GOLD), Risikoklasse A, florider Nikotinabusus (kumulativ 20 Packungsjahre), Magenkarzinom Stadium IIA
Körperliche Untersuchung. Auskultation: beidseits vesikuläres Atemgeräusch mit gering verlängertem Exspirium und basaler Dämpfung links, keine Rasselgeräusche. Perkussion: sonorer Klopfschall bis auf links basal – hier hyposonor, Lungengrenze links nicht atemverschieblich. Herztöne rhythmisch, rein und normofrequent. Knöchelödeme``` Output:

In [23]:

get_completion(prompt_few_shot.format(german_text_long))

Out[23]:

'{"label": "current smoker", "keyphrase": "florider Nikotinabusus"}'

In [24]:

get_completion(prompt_few_shot.format("Der Patient litt unter Kopfschmerzen."))

Out[24]:

'{ "label" : "unknown", "keyphrase": ""}'
