#!/usr/bin/env python
# coding: utf-8

# # Named Entity Recognition for Danish

# :::{note}
# This section, "Working in Languages Beyond English," is co-authored with Quinn Dombrowski, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English.
# :::

# In this lesson, we're going to learn about a text analysis method called *Named Entity Recognition* (NER) as applied to Danish. This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

# ---

# ## Dataset

# The example text for Danish is *Evangelines Genvordigheder: Til Kvinder med rødt Haar* by Elinor Glyn [from Project Gutenberg](http://www.gutenberg.org/ebooks/33632).

# **Here's a preview of spaCy's NER tagging of *Evangelines Genvordigheder: Til Kvinder med rødt Haar*.**
#
# If you compare the results to the [English example](Named-Entity-Recognition), you'll notice that the Danish NER model is much worse at recognizing entities, and it is especially bad at distinguishing different kinds of entities, like ORG vs LOC vs PER. You need a lot of examples to train a model to distinguish different entity types; currently, English is the only model that does a decent job of it.
#
# You can read more about the [data sources used to train Danish](https://spacy.io/models/da) on the spaCy model page.

# In[19]:


displacy.render(document, style="ent")


# ---

# ## NER with spaCy

# If you've already used the pre-processing notebook for this language, you can skip the steps for installing spaCy and downloading the language model.

# ### Install spaCy

# In[ ]:


get_ipython().system('pip install -U spacy')


# ### Import Libraries

# We're going to import `spacy` and `displacy`, a special spaCy module for visualization.

# In[16]:


import spacy
from spacy import displacy
from collections import Counter

import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400


# We're also going to import the `Counter` module for counting people, places, and things, and the `pandas` library for organizing and displaying data (we're also changing the pandas default maximum row and column width display settings).

# ### Download Language Model

# Next we need to download the medium Danish-language model (`da_core_news_md`), which will be processing and making predictions about our texts. You can read more about the [data sources used to train Danish](https://spacy.io/models/da) on the spaCy model page.

# In[1]:


get_ipython().system('python -m spacy download da_core_news_md')


# ### Load Language Model

# Once the model is downloaded, we need to load it. There are two ways to load a spaCy language model.

# **1.** We can import the model as a module and then load it from the module.

# In[17]:


import da_core_news_md
nlp = da_core_news_md.load()


# **2.** We can load the model by name.

# In[4]:


#nlp = spacy.load('da_core_news_md')


# If you just downloaded the model for the first time, it's advisable to use Option 1, which lets you use the model immediately. Otherwise, you'll likely need to restart your Jupyter kernel (which you can do by clicking Kernel -> Restart Kernel... in the Jupyter Lab menu).
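# As an optional sanity check (not required for the rest of this lesson), you can confirm which model and which pipeline components were loaded by inspecting the `nlp` object:

# In[ ]:


# Print the loaded model's language, name, and version, plus its pipeline components
print(nlp.meta['lang'], nlp.meta['name'], nlp.meta['version'])
print(nlp.pipe_names)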
# ## Process Document

# We first need to process our `document` with the loaded NLP model. Most of the heavy NLP lifting is done in this line of code.
#
# After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.
#
# In the cell below, we open and read the example document. Then we run `nlp()` on the text and create our `document`.

# In[18]:


filepath = '../texts/da.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)


# ## Get Named Entities

# All the named entities in our `document` can be found in the `document.ents` property. If we check out `document.ents`, we can see all the entities from the example document.

# In[4]:


document.ents


# Each of the named entities in `document.ents` contains [more information about itself](https://spacy.io/usage/linguistic-features#accessing), which we can access by iterating through `document.ents` with a simple `for` loop.
#
# For each `named_entity` in `document.ents`, we will extract the `named_entity` and its corresponding `named_entity.label_`.

# In[5]:


for named_entity in document.ents:
    print(named_entity, named_entity.label_)


# To extract just the named entities that have been identified as `PER` (person), we can add a simple `if` statement into the mix:

# In[6]:


for named_entity in document.ents:
    if named_entity.label_ == "PER":
        print(named_entity)


# ## NER with Long Texts or Many Texts

# Processing a very long text in a single call can be slow and memory-hungry, so here we split the text into 80 smaller chunks and then process the chunks as a batch with `nlp.pipe()`, which is faster than calling `nlp()` on each chunk individually.

# In[20]:


import math

number_of_chunks = 80

chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)


# In[21]:


chunked_documents = list(nlp.pipe(text_chunks))


# ## Get People

# To extract and count the people, we will use an `if` statement that will pull out words only if their "ent" label matches "PER."

# In[22]:


people = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "PER":
            people.append(named_entity.text)

people_tally = Counter(people)

df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
df
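# The people cell above and the places cell below differ only in the entity label they check for, so if you find yourself repeating this pattern, you could wrap it in a small helper function. The sketch below is our own generalization (the name `count_entities` isn't part of spaCy and isn't used elsewhere in this lesson); it assumes the `chunked_documents` and `Counter` objects defined above.

# In[ ]:


def count_entities(documents, label):
    """Tally every named entity with a given label across a list of spaCy documents."""
    entities = []
    for document in documents:
        for named_entity in document.ents:
            if named_entity.label_ == label:
                entities.append(named_entity.text)
    return Counter(entities)


# For example, `count_entities(chunked_documents, "PER")` should produce the same tally as the cell above, and `count_entities(chunked_documents, "LOC")` the same tally of places as the next section.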
# ## Get Places

# To extract and count places, we can follow the same model as above, except we will change our `if` statement to check for "ent" labels that match "LOC."

# In[23]:


places = []

for document in chunked_documents:
    for named_entity in document.ents:
        if named_entity.label_ == "LOC":
            places.append(named_entity.text)

places_tally = Counter(places)

df = pd.DataFrame(places_tally.most_common(), columns=['place', 'count'])
df


# ## Get NER in Context

# To see how an entity is used in the text, we can define a function that finds every sentence containing a given keyword that spaCy has tagged as a named entity, then displays those sentences with the entity bolded and its label shown above it.

# In[10]:


from IPython.display import Markdown, display
import re

def get_ner_in_context(keyword, document, desired_ner_labels=False):

    # If no labels are specified, match against all possible NER labels
    if not desired_ner_labels:
        desired_ner_labels = list(nlp.get_pipe('ner').labels)

    # Iterate through all the sentences in the document
    for sentence in document.sents:
        # Process each sentence on its own
        sentence_doc = nlp(sentence.text)
        for named_entity in sentence_doc.ents:
            # Check whether the keyword appears in the entity (ignoring capitalization) and the entity has a desired label
            if keyword.lower() in named_entity.text.lower() and named_entity.label_ in desired_ner_labels:
                # Use the regex library to replace line breaks and to bold the entity, again ignoring capitalization
                sentence_text = re.sub('\n', ' ', sentence.text)
                sentence_text = re.sub(re.escape(named_entity.text), f"**{named_entity.text}**", sentence_text, flags=re.IGNORECASE)

                print('---')
                display(Markdown(f"**{named_entity.label_}**"))
                display(Markdown(sentence_text))


# In[13]:


for document in chunked_documents:
    get_ner_in_context('Paris', document)
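# The `get_ner_in_context()` function also accepts an optional list of entity labels if you only want matches of a certain type. The cell below is just an illustration (the keyword "Evangeline," the heroine named in the title, is our own example): it looks for sentences where "Evangeline" appears inside an entity tagged `PER`, and it may return few or no matches depending on what the Danish model actually tags.

# In[ ]:


# Only show matches where the entity containing "Evangeline" was labeled PER
for document in chunked_documents:
    get_ner_in_context('Evangeline', document, desired_ner_labels=['PER'])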