Named Entity Recognition in French biomedical text

by Andrés Soto Villaverde

LinkedIn profile

Named-Entity Recognition (NER) is a subtask of information extraction that seeks to locate named entity mentions in unstructured text and classify them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In this notebook, we tackle the problem of identifying entities in French biomedical texts taken from the QUAERO French Medical Corpus, developed as a resource for Named Entity Recognition and as a gold standard set of normalized entities for French biomedical text. A selection of MEDLINE titles and EMEA documents was manually annotated, following the concepts in the Unified Medical Language System (UMLS).

In this corpus, ten types of clinical entities were annotated: Anatomy, Chemical and Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology and Procedures. Their labels are, respectively: ANAT, CHEM, DEVI, DISO, GEOG, LIVB, OBJC, PHEN, PHYS, PROC.
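
For quick reference, the ten labels can be kept in a small lookup table. This is only a convenience sketch: the name ENTITY_LABELS is hypothetical and is not used by the functions in this notebook.

```python
# Hypothetical lookup table: corpus label -> entity type, as listed above.
ENTITY_LABELS = {
    'ANAT': 'Anatomy',
    'CHEM': 'Chemical and Drugs',
    'DEVI': 'Devices',
    'DISO': 'Disorders',
    'GEOG': 'Geographic Areas',
    'LIVB': 'Living Beings',
    'OBJC': 'Objects',
    'PHEN': 'Phenomena',
    'PHYS': 'Physiology',
    'PROC': 'Procedures',
}

print(ENTITY_LABELS['DISO'])  # Disorders
```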

For this notebook, we will only use the MEDLINE texts. MEDLINE is the U.S. National Library of Medicine® (NLM) premier bibliographic database that contains more than 25 million references to journal articles in life sciences with a concentration on biomedicine.

Let's show a sample annotation for a MEDLINE text:

Sample MEDLINE title 1

*Chirurgie de la communication interauriculaire du type " sinus venosus " .*

Sample MEDLINE title 1 annotations

T1 PROC 0 9 Chirurgie

T2 DISO 16 46 communication interauriculaire

This means that the text between characters 0 and 9 is assigned the label PROC (= procedure). The token that corresponds to this text is “Chirurgie”. The second annotation covers the text between characters 16 and 46 (the tokens “communication interauriculaire”) and is assigned the label DISO (= disorder).
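
The offsets behave like Python slice indexes (the end offset is exclusive), which can be checked directly on the sample title. The variable names below are chosen only for this illustration.

```python
title = 'Chirurgie de la communication interauriculaire du type " sinus venosus " .'

# (label, start, end) triples copied from the sample annotations above
annotations = [('PROC', 0, 9), ('DISO', 16, 46)]

for label, start, end in annotations:
    # the end offset is exclusive, exactly like a Python slice
    print(label, repr(title[start:end]))
```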

Therefore, we want to train a classifier able to extract those text segments and assign them the correct label. We will use Conditional Random Fields (CRFs), a class of statistical modeling methods used for structured prediction that belongs to the sequence modeling family. Whereas a discrete classifier predicts a label for a single sample without considering "neighboring" samples, a CRF can take context into account. CRFs are used to encode known relationships between observations and construct consistent interpretations, and they are often used for labeling or parsing sequential data.
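
To give an idea of what "taking context into account" can look like in practice, here is a minimal sketch of a per-token feature function that also looks at the neighboring tokens. The function name token_features and the chosen features are assumptions made for illustration; they are not taken from this notebook.

```python
def token_features(tokens, i):
    """Describe token i together with its immediate neighbors (illustrative only)."""
    feats = {
        'word.lower': tokens[i].lower(),
        'word.istitle': tokens[i].istitle(),
    }
    # context features: previous and next token, or sentence-boundary markers
    if i > 0:
        feats['prev.lower'] = tokens[i - 1].lower()
    else:
        feats['BOS'] = True
    if i < len(tokens) - 1:
        feats['next.lower'] = tokens[i + 1].lower()
    else:
        feats['EOS'] = True
    return feats

tokens = ['Chirurgie', 'de', 'la', 'communication', 'interauriculaire']
print(token_features(tokens, 0))
```

A sequence model would receive one such dictionary per token, so the label predicted for a word can depend on the words around it.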

The corpus contains three subdirectories: train, test and dev. For this notebook, we will use only the first one. It contains 1670 files, including 4 files with configuration and statistics. The remaining files are of two types: .TXT files, which contain the text of the sentences, and annotation files (.ann), which contain information about the text segments, their types, etc., as we explained above.
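
Since every sentence should come as a .txt/.ann pair, the 1666 remaining files correspond to 833 sentences. A quick sanity check of the pairing could look like the sketch below; paired_names is a hypothetical helper, and the file listing is made up for the example.

```python
def paired_names(filenames):
    """Return the sorted base names that have both a .txt and a .ann file."""
    txt = {f[:-4] for f in filenames if f.endswith('.txt')}
    ann = {f[:-4] for f in filenames if f.endswith('.ann')}
    return sorted(txt & ann)

# made-up listing; in the notebook the names would come from os.listdir(path_train)
names = ['14448.txt', '14448.ann', '14449.txt', '14449.ann', 'stats.conf']
print(paired_names(names))   # ['14448', '14449']

# expected number of sentence pairs in the train set
print((1670 - 4) // 2)       # 833
```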

We will divide this presentation in two sections:

  1. the preprocessing section, which I will explain below, and
  2. the train and testing section which will be explained in another notebook, which you can read following this link.

Below is a list of functions used to preprocess the data.

In [1]:
# path to the data train set 
path_train = "C:/Users/Andres/Jupyter Notebooks/Shedd test/Named Entity Recognition in French biomedical text/data/MEDLINE_train"

The following function reads the file named 'file' with extension 'ext' from the directory 'path' and returns its content as a list of lines.

In [2]:
import os

def read_file(path,file,ext):
    # build the path in a portable way instead of hard-coding '\\'
    f = os.path.join(path, file+ext)
    with open(f, 'rt', encoding='utf-8') as myfile:
        data = myfile.readlines()
    return data
In [3]:
# To read a file and obtain its content
data = read_file(path_train,"14448",".txt")
["L' OMS planifie pour l' Europe l' application du processus des soins infirmiers . Compte rendu de la session du groupe technique d' experts en soins infirmiers et obstétricaux du Bureau régional de l' Europe de l' OMS , Nottingham , 14 - 17 décembre 1976\n"]
In [4]:
import pprint

data = read_file(path_train,"14448",".ann")
pprint.pprint(data)
['T1\tGEOG 24 30\tEurope\n',
 '#1\tAnnotatorNotes T1\tC0015176\n',
 'T2\tPHEN 49 58\tprocessus\n',
 '#2\tAnnotatorNotes T2\tC1522240\n',
 'T3\tPROC 63 79\tsoins infirmiers\n',
 '#3\tAnnotatorNotes T3\tC0028682\n',
 'T4\tLIVB 69 79\tinfirmiers\n',
 '#4\tAnnotatorNotes T4\tC0028676\n',
 'T5\tPROC 143 159\tsoins infirmiers\n',
 '#5\tAnnotatorNotes T5\tC0028682\n',
 'T6\tLIVB 149 159\tinfirmiers\n',
 '#6\tAnnotatorNotes T6\tC0028676\n',
 'T7\tGEOG 201 207\tEurope\n',
 '#7\tAnnotatorNotes T7\tC0015176\n']

Observe that the first file read was "14448.txt" while the second one was "14448.ann".

To transform the text contained in a .ANN file into a dictionary, we use the following function to process its lines. The function ignores the comments (i.e. lines beginning with '#'). It splits each of the other lines into 3 parts separated by the TAB character:

  1. the line id, which is used as the dictionary key for this segment
  2. the label part (i.e. a label and two integers)
  3. the text segment indicated by the label part

Furthermore, it checks whether the label part contains ';'. In that case, it inserts ' ' before and after the ';'. Then, the label part is split into its elements separated by ' ', and the '\n' character at the end of the text is removed. The function returns a dictionary with this information.
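
To make the handling of ';' concrete, here is a sketch of the same steps applied to a single line. The helper parse_ann_line is hypothetical and exists only for this illustration; the second line is one of the non-continuous annotations discussed later in this notebook, assuming the fields are TAB-separated as in the file shown above.

```python
def parse_ann_line(line):
    """Split one annotation line into (id, label fields, text)."""
    ann_id, label_part, text = line.rstrip('\n').split('\t')
    # put spaces around ';' so that it becomes a field of its own
    fields = label_part.replace(';', ' ; ').split(' ')
    return ann_id, fields, text

print(parse_ann_line('T1\tGEOG 24 30\tEurope\n'))
print(parse_ann_line('T4\tDISO 39 48;53 54\thépatites B\n'))
```

A continuous segment yields exactly three label fields, while a non-continuous one yields more, which is how the later functions tell them apart.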

In [5]:
def ann_text2dict(lines):
    d = {}
    for l in lines:
        if not l.startswith('#'):      # skip comment lines
            t = l.split('\t')          # id, label part, text
            if ';' in t[1]:
                t[1] = t[1].replace(';',' ; ')
            d[t[0]] = {
                'label':t[1].split(' '),
                'text':t[2].replace('\n','')
            }
    return d
In [6]:
d = ann_text2dict(data)
{'T1': {'label': ['GEOG', '24', '30'], 'text': 'Europe'},
 'T2': {'label': ['PHEN', '49', '58'], 'text': 'processus'},
 'T3': {'label': ['PROC', '63', '79'], 'text': 'soins infirmiers'},
 'T4': {'label': ['LIVB', '69', '79'], 'text': 'infirmiers'},
 'T5': {'label': ['PROC', '143', '159'], 'text': 'soins infirmiers'},
 'T6': {'label': ['LIVB', '149', '159'], 'text': 'infirmiers'},
 'T7': {'label': ['GEOG', '201', '207'], 'text': 'Europe'}}

The function collect_files collects the names of the files with extension .TXT and .ANN contained in the indicated path. The file names without the extension are stored in two independent lists, and each list is saved in a pickle file whose name combines the set name and the extension name. We also include two functions to save and load pickle files easily. Pickle is used for serializing and de-serializing Python object structures, also called marshalling or flattening. Serialization refers to the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network. Later on, this byte stream can be retrieved and de-serialized back to a Python object.
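
As a small illustration of the round trip, the sketch below serializes a list in memory with pickle.dumps and rebuilds it with pickle.loads. The sample list is only an example; the notebook itself writes to .pkl files.

```python
import pickle

names = ['14448', '14449']

# serialize to a byte stream in memory, then rebuild the object from it
blob = pickle.dumps(names)
restored = pickle.loads(blob)

print(type(blob).__name__)   # bytes
print(restored == names)     # True
```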

In [7]:
import os
import pickle

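def save_pickle(data,file):
    # the 'with' block closes the file even if pickling fails
    with open(file+".pkl", "wb") as pick_file:
        pickle.dump(data, pick_file)

def load_pickle(file):
    with open(file+".pkl", "rb") as pick_file:
        data = pickle.load(pick_file)
    return data

def collect_files(path,set):
    dirs = sorted(os.listdir(path)) # sorted so that both lists follow the same order
    ltxt = []
    lann = []
    for f in dirs:
        if f.endswith('.txt'):
            f = f.replace('.txt','')
            ltxt.append(f)
        elif f.endswith('.ann'):
            f = f.replace('.ann','')
            lann.append(f)
    # one pickle per extension; '_ann' matches the 'train_ann' file loaded later,
    # '_txt' is assumed by analogy
    save_pickle(ltxt, set + '_txt')
    save_pickle(lann, set + '_ann')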
In [8]:
collect_files(path_train,'train')

The function ann_files2dict processes all the .ANN files contained in path, transforming the text contained in each file into a dictionary. The dictionaries are collected into a list and saved into a pickle file.

In [9]:
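def ann_files2dict(pic_file,path,set):
    lann = load_pickle(pic_file)
    lnew = []
    for ann in lann:
        data = read_file(path, ann, ".ann")
        dic = ann_text2dict(data)
        lnew.append(dic)
    # saved under the name loaded by the later cells (set + '_ann_dics')
    save_pickle(lnew, set + '_ann_dics')
    return lnew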
In [10]:
lnew = ann_files2dict('train_ann',path_train,'train')

print("# of ann files",len(lnew))
# of ann files 833

There is one situation that we have not mentioned yet: more than one label can be assigned to the same token (the annotations overlap). In this case, we will choose only one of them and discard the others. For example, let’s assume that we have the following text:

Prévalence des marqueurs des virus des hépatites A , B , C à La Réunion ( Hôpital sud et prison de Saint Pierre ).

With the following annotations :

T1 CHEM 15 24 marqueurs

T2 LIVB 29 34 virus

T3 DISO 39 50 hépatites A

T4 DISO 39 48;53 54 hépatites B

T5 DISO 39 48;57 58 hépatites C

T6 GEOG 61 71 La Réunion

T7 LIVB 29 48;57 58 virus des hépatites C

T8 LIVB 29 48;53 54 virus des hépatites B

T9 LIVB 29 50 virus des hépatites A

You can see that:

  • annotation T2 identifies the word 'virus' (characters 29-34) as a Living Being (LIVB),
  • annotation T9 identifies the segment 'virus des hépatites A' (characters 29-50) as a Living Being (LIVB),
  • annotation T8 identifies the segment 'virus des hépatites B' (characters 29-48 and 53-54) as a Living Being (LIVB), and
  • annotation T7 identifies the segment 'virus des hépatites C' (characters 29-48 and 57-58) as a Living Being (LIVB)

In those cases, we will discard the annotation T2, which is contained in the others, and keep T7, T8 and T9.

How can we detect those situations? Looking at the indexes of each text segment, we can see that the text segment between characters 29-34 (the word 'virus') is contained in the other three segments.

Let's define a function to verify whether range r1 is contained in range r2.

In [11]:
def is_subrange(r1,r2):
    [a1,b1] = r1
    [a2,b2] = r2
    # True if [a1,b1] is a subrange of [a2,b2], False otherwise
    return int(a2) <= int(a1) and int(b1) <= int(b2)
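
The containment test can be tried on the offsets of the example above. The sketch below uses a standalone helper with a hypothetical name, span_within, so that it runs on its own; the offsets are strings, as they are in the annotation dictionaries.

```python
def span_within(inner, outer):
    """True if the (start, end) span `inner` lies inside `outer`."""
    return int(outer[0]) <= int(inner[0]) and int(inner[1]) <= int(outer[1])

# T2 'virus' (29-34) against T9 'virus des hépatites A' (29-50)
print(span_within(('29', '34'), ('29', '50')))  # True
# the other way round, T9 is not contained in T2
print(span_within(('29', '50'), ('29', '34')))  # False
```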

Using the following function 'contained', we check whether a segment identified by its key (e.g. T2 in the previous example) is contained in any other segment of the annotation. The function returns True or False accordingly.

In [12]:
def contained(key, ann_dic):
    piv = ann_dic[key]['label']
    for k in ann_dic.keys():
        if not k == key:
            lab_field = ann_dic[k]['label']
            # only continuous segments (label, start, end) are compared
            if len(lab_field) == 3:
                if len(piv) == 3:
                    if is_subrange([piv[1],piv[2]],
                                   [lab_field[1],lab_field[2]]):
                        return True
    return False

NOTE: observe that the function only considers simple segments (i.e. continuous segments), not the complex ones with more than one range. Later we will talk more about that.
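
For illustration only, the sketch below shows how the ranges of a complex segment could be read from its label fields. The helper label_fragments is hypothetical and is not used anywhere in this notebook, which simply skips such segments.

```python
def label_fragments(label):
    """Turn ['DISO', '39', '48', ';', '53', '54'] into [(39, 48), (53, 54)]."""
    numbers = [int(x) for x in label[1:] if x != ';']
    # consecutive pairs of numbers are (start, end) fragments
    return list(zip(numbers[0::2], numbers[1::2]))

txt = 'Prévalence des marqueurs des virus des hépatites A , B , C à La Réunion'
frags = label_fragments(['DISO', '39', '48', ';', '53', '54'])
print(frags)                                  # [(39, 48), (53, 54)]
print(' '.join(txt[a:b] for a, b in frags))   # hépatites B
```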

The following function checks, for each key in an annotation dictionary, whether its range is contained in some other range. If so, the contained range is removed.

In [13]:
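def remove_contained(ann_dic):
    lrem = []
    # first collect the keys to remove, then delete them,
    # so the dictionary is not modified while iterating over it
    for k in ann_dic.keys():
        if contained(k,ann_dic):
            lrem.append(k)
    for i in lrem:
        del ann_dic[i]
    return ann_dic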

Let's try it with the annotation dictionary d obtained previously:

In [14]:
{'T1': {'label': ['GEOG', '24', '30'], 'text': 'Europe'},
 'T2': {'label': ['PHEN', '49', '58'], 'text': 'processus'},
 'T3': {'label': ['PROC', '63', '79'], 'text': 'soins infirmiers'},
 'T4': {'label': ['LIVB', '69', '79'], 'text': 'infirmiers'},
 'T5': {'label': ['PROC', '143', '159'], 'text': 'soins infirmiers'},
 'T6': {'label': ['LIVB', '149', '159'], 'text': 'infirmiers'},
 'T7': {'label': ['GEOG', '201', '207'], 'text': 'Europe'}}
In [15]:
d1 = remove_contained(d)
{'T1': {'label': ['GEOG', '24', '30'], 'text': 'Europe'},
 'T2': {'label': ['PHEN', '49', '58'], 'text': 'processus'},
 'T3': {'label': ['PROC', '63', '79'], 'text': 'soins infirmiers'},
 'T5': {'label': ['PROC', '143', '159'], 'text': 'soins infirmiers'},
 'T7': {'label': ['GEOG', '201', '207'], 'text': 'Europe'}}

We can see that segments T4 and T6 were removed because T4 was contained in T3 and T6 was contained in T5.

The following functions count the non-continuous segments and the total number of segments: cont_ncont does so for a single annotation dictionary 'ann_dic', and count_non_continuous does so for all the dictionaries contained in the train set.

In [16]:
def cont_ncont(ann_dic):
    nnc = 0 # number of non continuous segments
    ntot = 0 # total number of segments
    for k in ann_dic.keys():
        piv = ann_dic[k]['label']
        if not len(piv) == 3: # more than (label, start, end)
            nnc += 1
        ntot += 1
    return [nnc,ntot]

def count_non_continuous(set):
    ldics = load_pickle(set + '_ann_dics')
    cnc = 0 # count of non continuous segments
    ctot = 0 # count of total segments
    for i in range(len(ldics)):
        [nnc,ntot] = cont_ncont(ldics[i])
        cnc += nnc
        ctot += ntot
    print("set",set)
    print("Number of non continuous segments",cnc,'%',(cnc/ctot)*100)
    print("Total number of segments",ctot)
In [17]:
count_non_continuous('train')
set train
Number of non continuous segments 13 % 0.43420173680694724
Total number of segments 2994

This shows that the number of non-continuous segments is very low (about 0.43% of the total), which is why we decided to ignore them in this version.

The information that we need to train the classifier is contained in two independent structures: the .TXT files and the annotation dictionaries. Let's now combine and simplify them.

The function simple_dic will be used to simplify the dictionary structure.

In [18]:
def simple_dic(ann_dic):
    lista = []
    for t in ann_dic.keys():
        pt = ann_dic[t]
        dic = {
            'label' : pt['label'][0],   # entity label
            'range' : pt['label'][1:],  # character offsets
            'text' : pt['text']
        }
        lista.append(dic)
    return lista
In [19]:
sdic = simple_dic(d1)
[{'label': 'GEOG', 'range': ['24', '30'], 'text': 'Europe'},
 {'label': 'PHEN', 'range': ['49', '58'], 'text': 'processus'},
 {'label': 'PROC', 'range': ['63', '79'], 'text': 'soins infirmiers'},
 {'label': 'PROC', 'range': ['143', '159'], 'text': 'soins infirmiers'},
 {'label': 'GEOG', 'range': ['201', '207'], 'text': 'Europe'}]

The function mix_txt_ann combines the 'ann' structure and the corresponding sentence text from the .TXT file into a single dictionary with two keys: 'txt' and 'ann_dic'. The resulting dictionaries are stored in a list and saved into a pickle file called 'train_txt_ann'.

In [20]:
def mix_txt_ann(pic_file,path,set):
    ltxt = load_pickle(pic_file)
    ldics = load_pickle(set + '_ann_dics') # loaded once, outside the loop
    lnew = []
    for i in range(len(ltxt)):
        data = read_file(path, ltxt[i], ".txt")
        ann_dic = remove_contained(ldics[i])
        ndic ={
            'ann_dic': simple_dic(ann_dic),
            'txt': data
        }
        lnew.append(ndic)
    save_pickle(lnew, set + '_txt_ann')
    return lnew
In [21]:
set = 'train'
lista = load_pickle(set+'_txt_ann')
{'ann_dic': [{'label': 'PROC', 'range': ['0', '10'], 'text': 'Traitement'},
             {'label': 'DISO',
              'range': ['15', '36'],
              'text': 'métastases hépatiques'},
             {'label': 'DISO',
              'range': ['41', '60'],
              'text': 'cancers colorectaux'}],
 'txt': ['Traitement des métastases hépatiques des cancers colorectaux : '
         "jusqu' où aller ?\n"]}

We can simplify this structure further by converting the list of dictionaries in the annotation part into a list of tuples. We also need to tag every segment of the text. Some segments are already tagged, but others are not; we will tag the latter as 'NONE', indicating that none of the other tags applies.

The function ldic2ltup converts a list of dictionaries into a list of tuples, while the function complete_segments tags the remaining untagged segments with 'NONE'.
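
The gap-filling idea can be sketched compactly on one sentence. This is a simplified variant, not the notebook's implementation: the helper fill_gaps is hypothetical, it assumes the spans are sorted and do not overlap, and it strips whitespace from the gaps instead of adjusting the offsets by one character.

```python
def fill_gaps(txt, spans):
    """Return (label, text) pairs covering txt; untagged stretches get 'NONE'.

    spans: list of (start, end, label), sorted by start and non-overlapping.
    """
    out = []
    pos = 0
    for start, end, label in spans:
        gap = txt[pos:start].strip()
        if gap:                      # untagged text before this segment
            out.append(('NONE', gap))
        out.append((label, txt[start:end]))
        pos = end
    tail = txt[pos:].strip()
    if tail:                         # untagged text after the last segment
        out.append(('NONE', tail))
    return out

txt = "Traitement des métastases hépatiques des cancers colorectaux : jusqu' où aller ?\n"
spans = [(0, 10, 'PROC'), (15, 36, 'DISO'), (41, 60, 'DISO')]
for pair in fill_gaps(txt, spans):
    print(pair)
```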

In [22]:
def ldic2ltup(i,listai):
    ann_dic = listai['ann_dic']
    txt = listai['txt'][0]
    ltup = []
    for dic in ann_dic:
        etiq = dic['label']   # segment label
        rango = dic['range']  # [start, end] for a continuous segment
        if len(rango) < 3:
            a = int(rango[0])
            b = int(rango[1])
            ltup.append((a, b, etiq, txt[a:b]))
        # else: non continuous segment (several ranges), ignored in this version
    return ltup
In [23]:
def complete_segments(set):
    lista = load_pickle(set + '_txt_ann')
    newl = []
    for i in range(len(lista)):
        txt = lista[i]['txt'][0] # a list with one elem
        ltup = ldic2ltup(i, lista[i])
        lt1 = []
        if len(ltup) == 0:
            # no annotated segment: the whole text is tagged 'NONE'
            lt1.append((0,len(txt),'NONE',txt))
        else:
            if ltup[0][0] > 0: # untagged text before the first segment
                a = 0
                b = ltup[0][0]-1
                lt1.append((a,b,'NONE',txt[a:b]))
            for j in range(len(ltup)-1):
                lt1.append(ltup[j]) # current segment
                if ltup[j][1]+1 != ltup[j+1][0]: # non consecutive: fill the gap
                    a = ltup[j][1]+1
                    b = ltup[j+1][0]-1
                    lt1.append((a,b,'NONE',txt[a:b]))
            lt1.append(ltup[-1]) # last segment
            if ltup[-1][1] < len(txt): # untagged text after the last segment
                a = ltup[-1][1]+1
                b = len(txt)-1
                lt1.append((a,b,'NONE',txt[a:b]))
        newl.append( {
            'ann_dic' : lt1,
            'txt' : txt
        } )
    save_pickle(newl,set + '_txt_ann2') # saved once, after the loop
In [24]:
set = 'train'
new1 = load_pickle(set + '_txt_ann2')
{'ann_dic': [(0, 10, 'PROC', 'Traitement'),
             (11, 14, 'NONE', 'des'),
             (15, 36, 'DISO', 'métastases hépatiques'),
             (37, 40, 'NONE', 'des'),
             (41, 60, 'DISO', 'cancers colorectaux'),
             (61, 80, 'NONE', ": jusqu' où aller ?")],
 'txt': "Traitement des métastases hépatiques des cancers colorectaux : jusqu' "
        'où aller ?\n'}

With the help of the function ldic2ltok_lab, we tokenize each text segment and tag each token with the tag of the segment it belongs to.
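
The effect of the tokenization pattern can be previewed with the standard re module. This sketch assumes that re.findall with the same pattern yields the same tokens as the RegexpTokenizer used in the next cell; the helper tokenize_and_tag is hypothetical.

```python
import re

# same pattern as the one passed to RegexpTokenizer in the next cell
TOKEN_RE = re.compile(r"\w'|\w+|[^\w\s]")

def tokenize_and_tag(segment_text, label):
    """Tokenize one segment and give every token the segment's label."""
    return [(tok, label) for tok in TOKEN_RE.findall(segment_text)]

print(tokenize_and_tag('métastases hépatiques', 'DISO'))
print(tokenize_and_tag("l' application d' experts", 'NONE'))
```

The first alternative of the pattern keeps a one-letter elision such as "l'" or "d'" together as a single token, while punctuation marks become tokens of their own.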

In [25]:
from nltk import RegexpTokenizer

def ldic2ltok_lab(lsent):
    ls_tok_lab = []
    toknizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
    for sent in lsent:
        ltup = sent['ann_dic']
        lt_tok_lab = []
        for (a, b, lab, txt) in ltup: # one segment
            lts = toknizer.tokenize(txt)
            ltoks = [(t, lab) for t in lts]
            lt_tok_lab.extend(ltoks) # tagged tokens of this segment
        ls_tok_lab.append(lt_tok_lab) # one list of (token, label) per sentence
    return ls_tok_lab
In [26]:
ls_tok_lab = ldic2ltok_lab(new1)
In [27]:
save_pickle(ls_tok_lab,set + '_txt_ann3')

This concludes the preprocessing section. You can access the training and testing section through the following link. Thank you very much for reading; I hope you found the explanation interesting.
