#!/usr/bin/env python
# coding: utf-8

# # Part-of-Speech Tagging for Russian
#

# **Note**
#
# This section, "Working in Languages Beyond English," is co-authored with Quinn Dombrowski, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English.
# In this lesson, we're going to learn about the textual analysis methods *part-of-speech tagging* and *keyword extraction* for Russian-language texts. These methods will help us computationally parse sentences and better understand words in context.
#
# ---
#
# ## spaCy and Natural Language Processing (NLP)
#
# To computationally identify parts of speech, we're going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.
#
# To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. If you've used the preprocessing or named entity recognition notebooks for this language, you can skip the steps for installing spaCy and downloading the language model.

# ## Install spaCy

# To use spaCy, we first need to install the library.
#
# Russian models are only available starting in spaCy 3.0.
#
# If you run into errors because spaCy 2.x is installed, you can run `!pip uninstall spacy -y` first, then run the cell below.

# In[ ]:


get_ipython().system('pip install -U "spacy>=3.0"')


# ## Import Libraries

# Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.

# In[ ]:


import spacy
from spacy import displacy
from collections import Counter

import pandas as pd
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)


# We're also going to import the `Counter` module for counting nouns, verbs, adjectives, etc., and the `pandas` library for organizing and displaying data (we're also changing the pandas default maximum row and column width display settings).

# ## Download Language Model

# Next we need to download the Russian-language model (`ru_core_news_md`), which will process and make predictions about our texts. You can read more about the corpus it was trained on at the [spaCy model page](https://spacy.io/models/ru). You can download the `ru_core_news_md` model by running the cell below:

# In[ ]:


get_ipython().system('python -m spacy download ru_core_news_md')


# *Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages), including German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.*
#
# *spaCy offers language and tokenization support for other languages via external dependencies — such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean.*

# ## Load Language Model

# Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.

# In[3]:


nlp = spacy.load('ru_core_news_md')


# ## Create a Processed spaCy Document

# Whenever we use spaCy, our first step will be to create a processed spaCy `document` with the loaded NLP model `nlp()`. Most of the heavy NLP lifting is done in this line of code. After processing, the `document` object will contain tons of juicy language data — named entities, sentence boundaries, parts of speech — and the rest of our work will be devoted to accessing this information.

# In[4]:


filepath = '../texts/other-languages/ru.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)


# ## spaCy Part-of-Speech Tagging

# The tags that spaCy uses for part-of-speech are based on work done by [Universal Dependencies](https://universaldependencies.org/), an effort to create a set of part-of-speech tags that work across many different languages.
# Texts from various languages are annotated using this common set of tags, and contributed to a common repository that can be used to train models like spaCy.
#
# The Universal Dependencies page has information about the annotated corpora available for each language; it's worth looking into the corpora that were annotated for your language.
#
# | POS   | Description               | Examples                                       |
# |:-----:|:-------------------------:|:----------------------------------------------:|
# | ADJ   | adjective                 | big, old, green, incomprehensible, first       |
# | ADP   | adposition                | in, to, during                                 |
# | ADV   | adverb                    | very, tomorrow, down, where, there             |
# | AUX   | auxiliary                 | is, has (done), will (do), should (do)         |
# | CONJ  | conjunction               | and, or, but                                   |
# | CCONJ | coordinating conjunction  | and, or, but                                   |
# | DET   | determiner                | a, an, the                                     |
# | INTJ  | interjection              | psst, ouch, bravo, hello                       |
# | NOUN  | noun                      | girl, cat, tree, air, beauty                   |
# | NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV         |
# | PART  | particle                  | ’s, not                                        |
# | PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody  |
# | PROPN | proper noun               | Mary, John, London, NATO, HBO                  |
# | PUNCT | punctuation               | ., (, ), ?                                     |
# | SCONJ | subordinating conjunction | if, while, that                                |
# | SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝               |
# | VERB  | verb                      | run, runs, running, eat, ate, eating           |
# | X     | other                     | sfpksdpsxmsa                                   |
# | SPACE | space                     |                                                |
#
# Above is a POS chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy's POS tagging in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) on our sample `document` with the `style=` parameter set to "dep" (short for dependency parsing):
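# Below is a minimal sketch of that `displacy` call (the `sentence_spans` variable name is our own, not from the original lesson): we render only the first sentence of the document, since a dependency diagram for the entire text would be unreadable.

# In[ ]:


#Render the dependency parse of the first sentence only, to keep the diagram readable
sentence_spans = list(document.sents)
displacy.render(sentence_spans[0], style="dep")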
# ## Get Part-Of-Speech Tags

# To get part-of-speech tags for every word in a document, we have to iterate through all the tokens in the document and pull out the `.lemma_` attribute for each token, which gives us the un-inflected version of the word. We'll also pull out the `.pos_` attribute for each token. We can get even finer-grained dependency information with the attribute `.dep_`.

# In[5]:


for token in document:
    print(token.lemma_, token.pos_, token.dep_)
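# As a quick extension of the loop above (a sketch that is not part of the original lesson; the `pos_tally` name is our own), we can tally how often each part-of-speech label occurs in the document, reusing the `Counter` module imported earlier:

# In[ ]:


#Count how many tokens received each POS label across the whole document
pos_tally = Counter(token.pos_ for token in document)
pos_tally.most_common()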

# ## Practicing with the example text

# When working with languages that have inflection, we typically use `token.lemma_` instead of `token.text`, which is what you'll find in the English examples. This is important when we're counting, so that differently inflected forms of a word (e.g., masculine vs. feminine or singular vs. plural) aren't counted as if they were different words.

# In[6]:


filepath = "../texts/other-languages/ru.txt"
document = nlp(open(filepath, encoding="utf-8").read())


# ## Get Adjectives

# | POS | Description | Examples                                 |
# |:---:|:-----------:|:----------------------------------------:|
# | ADJ | adjective   | big, old, green, incomprehensible, first |
#
# To extract and count the adjectives in the example text, we will follow the same model as above, except we'll add an `if` statement that will pull out words only if their POS label matches "ADJ."

# **Python Review**

# While we demonstrate how to extract parts of speech in the sections below, we're also going to reinforce some integral Python skills. Notice how we use `for` loops and `if` statements to `.append()` specific words to a list. Then we count the words in the list and make a pandas DataFrame from the list.
# Here we make a list of the adjectives identified in the example text:

# In[7]:


adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.lemma_)


# In[8]:


adjs


# Then we count the unique adjectives in this list with the `Counter()` module:

# In[9]:


adjs_tally = Counter(adjs)


# In[10]:


adjs_tally.most_common()


# Then we make a DataFrame from this list:

# In[11]:


df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]


# ## Get Nouns

# | POS  | Description | Examples                     |
# |:----:|:-----------:|:----------------------------:|
# | NOUN | noun        | girl, cat, tree, air, beauty |
#
# To extract and count nouns, we can follow the same model as above, except we will change our `if` statement to check for POS labels that match "NOUN".

# In[12]:


nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.lemma_)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]


# ## Get Verbs

# | POS  | Description | Examples                             |
# |:----:|:-----------:|:------------------------------------:|
# | VERB | verb        | run, runs, running, eat, ate, eating |
#
# To extract and count verbs, we can follow a similar model to the examples above. This time, however, we're going to make our code even more economical and efficient (while still changing our `if` statement to match the POS label "VERB").
#

# **Python Review**

# We can use a [*list comprehension*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Python/More-Lists-Loops.html#List-Comprehensions) to get our list of verbs in a single line of code! Closely examine the first line of code below:
# In[13]:


verbs = [token.lemma_ for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]


# # Keyword Extraction

# ## Get Sentences with Keyword

# spaCy can also identify sentences in a document. To access sentences, we can iterate through `document.sents` and pull out the `.text` of each sentence.
#
# We can use spaCy's sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below. Note that the function assumes that the keyword provided will be exactly the same as it appears in the text (a lemma-based variant is sketched after the example below).
#
# With the function `find_sentences_with_keyword()`, we will iterate through `document.sents` and pull out any sentence that contains a particular "keyword." Then we will display these sentences with the keywords bolded.

# In[14]:


import re
from IPython.display import Markdown, display


# In[15]:


def find_sentences_with_keyword(keyword, document):
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(f"{keyword}", f"**{keyword}**", sentence, flags=re.IGNORECASE)
            display(Markdown(sentence))


# In[16]:


find_sentences_with_keyword(keyword="хороший", document=document)
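# Because Russian is highly inflected, an exact surface-form match will miss other forms of the same word. Below is a minimal sketch of a lemma-based variant (the function name `find_sentences_with_lemma` is our own and not part of the original lesson): it lemmatizes the keyword with the same `nlp` model and compares it against each token's `.lemma_`, so that inflected forms still count as matches.

# In[ ]:


def find_sentences_with_lemma(keyword, document):
    #Lemmatize the keyword itself by running it through the same spaCy model
    keyword_lemma = nlp(keyword)[0].lemma_.lower()
    #Iterate through the sentences and collect the lemma of every token in each sentence
    for sentence in document.sents:
        sentence_lemmas = [token.lemma_.lower() for token in sentence]
        if keyword_lemma in sentence_lemmas:
            #Replace linebreaks so the sentence displays on a single line
            display(Markdown(re.sub('\n', ' ', sentence.text)))


# In[ ]:


find_sentences_with_lemma("хороший", document)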
# ## Get Keyword in Context

# We can also find out about a keyword's more immediate context — its neighboring words to the left and right — and we can fine-tune our search with POS tagging.
#
# To do so, we will first create a list of what's called *ngrams*. "Ngrams" are any sequence of *n* tokens in a text. They're an important concept in computational linguistics and NLP. (Have you ever played with [Google's *Ngram* Viewer](https://books.google.com/ngrams)?)
#
# Below we're going to make a list of *bigrams*, that is, all the two-word combinations from the sample text. We're going to use these bigrams to find the neighboring words that appear alongside particular keywords.

# In[17]:


#Make a list of tokens and POS labels from document if the token is a word
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]


# In[18]:


#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
    return ngrams


# In[19]:


bigrams = get_bigrams(tokens_and_labels)


# Let's take a peek at the bigrams:

# In[20]:


bigrams[5:20]


# Now that we have our list of bigrams, we're going to make a function `get_neighbor_words()`. This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the `pos_label` parameter.

# In[21]:


def get_neighbor_words(keyword, bigrams, pos_label = None):
    neighbor_words = []
    keyword = keyword.lower()
    for bigram in bigrams:
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]
        #Check to see if keyword is in the bigram
        if keyword in words:
            for word, label in bigram:
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label == None:
                        neighbor_words.append(word.lower())
    return Counter(neighbor_words).most_common()


# In[25]:


get_neighbor_words("сад", bigrams)


# In[24]:


get_neighbor_words("сад", bigrams, pos_label='VERB')


# ## Your Turn!

# Try out `find_sentences_with_keyword()` and `get_neighbor_words()` with your own keywords of interest.

# In[ ]:


find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)


# In[ ]:


get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)
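# As one more extension to try (a sketch, not part of the original lesson; the `trigrams` variable name is our own): because `get_bigrams()` takes a `number_consecutive_words` parameter, the same function can produce longer ngrams, such as all the three-word combinations in the text.

# In[ ]:


#Reuse get_bigrams() to build trigrams instead of bigrams
trigrams = get_bigrams(tokens_and_labels, number_consecutive_words=3)
trigrams[5:20]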