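This example shows a simplistic regular expression for matching TNM expressions. It is problematic in several ways: as the calls below show, it rejects prefixed forms (pT1N0M1) and space-separated forms (T1 N0 M1), while accepting implausible values such as T8N9M9. A more realistic solution can be found here: https://github.com/hpi-dhc/onco-nlp/blob/master/onconlp/classification/rulebased_tnm.py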
import re
tnm_pattern = r"T\d+[a-zA-Z]*N\d+[a-zA-Z]*M\d+[a-zA-Z]*"
def check_valid(text):
    print("valid" if re.match(tnm_pattern, text) else "not valid")
check_valid('T1N0M1')
valid
check_valid('T1aN2M0')
valid
check_valid('T123')
not valid
check_valid('pT1N0M1')
not valid
check_valid('T1')
not valid
check_valid('T8N9M9')
valid
check_valid('T1 N0 M1')
not valid
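The failure cases above can be addressed with a somewhat more tolerant pattern. The following is only a minimal sketch, not the rule set used in the linked onco-nlp project: the name `TNM_SKETCH`, the helper `is_tnm`, the prefix letters, and the value ranges for T, N, and M are all assumptions chosen for illustration. It uses `re.fullmatch` so that trailing text is not silently accepted, which `re.match` would allow.

```python
import re

# Hypothetical, more tolerant variant of tnm_pattern (illustration only).
TNM_SKETCH = re.compile(
    r"""
    [cpyra]*        # optional prefix letters (assumed set: c, p, y, r, a)
    T[0-4][a-z]*    # T category, assumed range 0-4, optional lowercase suffix
    \s*             # optional whitespace between components
    N[0-3][a-z]*    # N category, assumed range 0-3
    \s*
    M[01][a-z]*     # M category, assumed to be 0 or 1
    """,
    re.VERBOSE,
)


def is_tnm(text):
    """Return True if the whole string matches the sketch pattern."""
    return TNM_SKETCH.fullmatch(text) is not None


for example in ["T1N0M1", "pT1N0M1", "T1 N0 M1", "T8N9M9", "T1N0M1 and more"]:
    print(example, "->", "valid" if is_tnm(example) else "not valid")
```

Even this variant ignores categories such as TX or Tis and multi-character suffixes, so it should be read as a starting point rather than a complete TNM grammar.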
Here, we are using the spaCy library with scispaCy models for domain-specific entity extraction. We also use scispaCy's entity linker to map entities to the MeSH vocabulary for normalization.
# Note: on some systems, installing scispaCy fails due to build errors of nmslib. This can usually be circumvented by installing a pre-built nmslib version from conda
#!conda install nmslib
!pip install -q scispacy==0.5.1
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
import spacy
from scispacy.linking import EntityLinker
nlp = spacy.load('en_core_sci_sm')
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "mesh", "k" : 5})
/Users/phlobo/miniconda3/envs/dm4dh/lib/python3.11/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator TfidfTransformer from version 0.22.2.post1 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
/Users/phlobo/miniconda3/envs/dm4dh/lib/python3.11/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator TfidfVectorizer from version 0.22.2.post1 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
<scispacy.linking.EntityLinker at 0x1664cbe50>
text = "The patient underwent a CT scan in April. It did not reveal any abnormalities."
doc = nlp(text)
Boundary detection / sentence splitting
for s in doc.sents:
    print(s)
The patient underwent a CT scan in April.
It did not reveal any abnormalities.
sentence = list(doc.sents)[0]
Tokenization
for token in sentence:
    print(token)
The
patient
underwent
a
CT
scan
in
April
.
Part-of-speech tagging
for token in sentence:
    print(token, token.pos_)
The DET
patient NOUN
underwent VERB
a DET
CT PROPN
scan NOUN
in ADP
April PROPN
. PUNCT
Noun chunking
for chunk in sentence.noun_chunks:
    print(chunk)
The patient
a CT scan
Dependency parsing
from spacy import displacy
displacy.render(sentence, style="dep", jupyter=True, options={'distance' : 100})
Entity extraction
for e in sentence.ents:
    print('Entity:', e)
Entity: patient
Entity: CT scan
Entity normalization / linking
from IPython.display import display_markdown
linker = nlp.get_pipe("scispacy_linker")
for e in sentence.ents:
    display_markdown(f'__Entity: {e}__', raw=True)
    # Each candidate is a (concept ID, score) pair from the linker
    for entity_id, prob in e._.kb_ents:
        mesh_term = linker.kb.cui_to_entity[entity_id]
        print('Probability:', prob)
        print(mesh_term)
Entity: patient
Probability: 0.8386321067810059
CUI: D019727, Name: Proxy
Definition: A person authorized to decide or act for another person, for example, a person having durable power of attorney.
TUI(s):
Aliases: (total: 2): Patient Agent, Proxy
Probability: 0.7973071336746216
CUI: D010361, Name: Patients
Definition: Individuals participating in the health care system for the purpose of receiving therapeutic, diagnostic, or preventive procedures.
TUI(s):
Aliases: (total: 2): Patients, Clients
Probability: 0.7851048707962036
CUI: D005791, Name: Patient Care
Definition: Care rendered by non-professionals.
TUI(s):
Aliases: (total: 2): Informal care, Patient Care
Probability: 0.7439237833023071
CUI: D000070659, Name: Patient Comfort
Definition: Patient care intended to prevent or relieve suffering in conditions that ensure optimal quality living.
TUI(s):
Aliases: (total: 2): Comfort Care, Patient Comfort
Probability: 0.7175934910774231
CUI: D064406, Name: Patient Harm
Definition: A measure of PATIENT SAFETY considering errors or mistakes which result in harm to the patient. They include errors in the administration of drugs and other medications (MEDICATION ERRORS), errors in the performance of procedures or the use of other types of therapy, in the use of equipment, and in the interpretation of laboratory findings and preventable accidents involving patients.
TUI(s):
Aliases: (total: 1): Patient Harm
Entity: CT scan
Probability: 0.8230447173118591
CUI: D000072098, Name: Single Photon Emission Computed Tomography Computed Tomography
Definition: An imaging technique using a device which combines TOMOGRAPHY, EMISSION-COMPUTED, SINGLE-PHOTON and TOMOGRAPHY, X-RAY COMPUTED in the same session.
TUI(s):
Aliases: (total: 5): CT SPECT Scan, Single Photon Emission Computed Tomography Computed Tomography, CT SPECT, SPECT CT Scan, SPECT CT
Probability: 0.8186503648757935
CUI: D000072078, Name: Positron Emission Tomography Computed Tomography
Definition: An imaging technique that combines a POSITRON-EMISSION TOMOGRAPHY (PET) scanner and a CT X RAY scanner. This establishes a precise anatomic localization in the same session.
TUI(s):
Aliases: (total: 7): PET-CT Scan, PET-CT, CT PET Scan, Positron Emission Tomography Computed Tomography, PET CT Scan, Positron Emission Tomography-Computed Tomography, CT PET
Probability: 0.7265672087669373
CUI: D056973, Name: Four-Dimensional Computed Tomography
Definition: Three-dimensional computed tomographic imaging with the added dimension of time, to follow motion during imaging.
TUI(s):
Aliases: (total: 8): 4D CT Scan, 4D CT, Four-Dimensional CT, Four-Dimensional CT Scan, Four-Dimensional Computed Tomography, 4D Computed Tomography, 4D CAT Scan, Four-Dimensional CAT Scan
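The linker returns several scored candidates per mention, so downstream code usually has to decide which ones to keep. The following is a minimal sketch of such a filter, working on plain (concept ID, score) pairs copied from the output for "patient" above so that it runs without the model. The helper name `filter_candidates` and the cutoff of 0.8 are assumptions made for illustration, not part of scispaCy.

```python
def filter_candidates(kb_ents, min_score=0.8):
    """Keep (concept_id, score) pairs with score >= min_score, best first."""
    kept = [(cid, score) for cid, score in kb_ents if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Candidates for the mention "patient", copied from the output above.
patient_candidates = [
    ("D019727", 0.8386321067810059),     # Proxy
    ("D010361", 0.7973071336746216),     # Patients
    ("D005791", 0.7851048707962036),     # Patient Care
    ("D000070659", 0.7439237833023071),  # Patient Comfort
    ("D064406", 0.7175934910774231),     # Patient Harm
]

for concept_id, score in filter_candidates(patient_candidates):
    print(concept_id, round(score, 3))
```

A score cutoff alone does not guarantee the intended concept: for "patient", the top-ranked candidate in the output above is Proxy (D019727), ahead of Patients (D010361).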
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz
text = """Dual MAPK pathway inhibition with BRAF and MEK inhibitors in BRAF(V600E)-mutant NSCLC
might improve efficacy over BRAF inhibitor monotherapy based on observations in BRAF(V600)-mutant melanoma"""
Specialized model for biological entities
bionlp = spacy.load('en_ner_bionlp13cg_md')
biodoc = bionlp(text)
for e in biodoc.ents:
    print('Entity:', e, ', Label:', e.label_)
Entity: MAPK , Label: GENE_OR_GENE_PRODUCT
Entity: BRAF , Label: GENE_OR_GENE_PRODUCT
Entity: MEK , Label: GENE_OR_GENE_PRODUCT
Entity: BRAF(V600E)-mutant NSCLC , Label: CANCER
Entity: BRAF , Label: GENE_OR_GENE_PRODUCT
Entity: melanoma , Label: CELL
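For a quick overview, the extracted entities can be grouped by label. The sketch below works on plain (text, label) pairs copied from the output above, so it runs without the model; with a real document one would presumably build the pairs from `e.text` and `e.label_`. The variable names are hypothetical.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above.
extracted = [
    ("MAPK", "GENE_OR_GENE_PRODUCT"),
    ("BRAF", "GENE_OR_GENE_PRODUCT"),
    ("MEK", "GENE_OR_GENE_PRODUCT"),
    ("BRAF(V600E)-mutant NSCLC", "CANCER"),
    ("BRAF", "GENE_OR_GENE_PRODUCT"),
    ("melanoma", "CELL"),
]

# Collect entity texts per label, preserving document order.
by_label = defaultdict(list)
for entity_text, label in extracted:
    by_label[label].append(entity_text)

for label, mentions in by_label.items():
    print(label, "->", mentions)
```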
displacy.render(biodoc, style='ent', jupyter=True)