This notebook is focused on words. For a specific language, you can:

- fetch sentences containing a specific word,
- check how many sentences contain words from a list,
- run a word frequency analysis over the whole set of sentences,
- list words that appear a certain number of times.

Before experimenting with any of the options described above, you must first run the cells under the Read sentences section.

If you're new to Jupyter, please click on Cell > Run All from the top menu to see what the notebook does. While a cell is running, it shows In [*], which becomes In [n] (n is a number) when its execution is finished. To run a specific cell, click in it and press Shift + Enter, or click the Run button in the toolbar. Note that some cells, such as those that define a function, will not have output, but still need to be executed.

In any case, to be able to use the notebook correctly, please run the two following cells first.

In [ ]:
import pandas as pd
import nltk
import csv
import tarfile
import string
from collections import Counter
from nltk import RegexpTokenizer
nltk.download('stopwords')
nltk.download('punkt')
In [ ]:
pd.set_option('display.max_colwidth', None)  # To display the full content of each column (-1 is deprecated in recent pandas)
# pd.set_option('display.max_rows', None)  # To display ALL rows of the dataframe (otherwise you can decide the max number)

Read sentences

Reading all sentences takes a long time, so let's split the process into two steps. You only need to run the two following cells once.

In [ ]:
!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2
def read_sentences_file():
    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        sentences = pd.read_csv(tar.extractfile(csv_path), 
                sep='\t', 
                header=None, 
                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],
                quoting=csv.QUOTE_NONE)
        print(f"{len(sentences):,} sentences fetched.")
        return sentences
In [ ]:
all_sentences = read_sentences_file()

Now, you can fetch sentences of a specific language using the following cells. If you want to change your target language, you can start again from here.

Note that by default, we get rid of the ISO (that is, the ISO 639 three-letter language code), Date added, Date last modified, and Username columns.
If you need to keep any of these columns, comment out the corresponding del line in the next cell by adding a # at its beginning.

Now run the following cell.

In [ ]:
def sentences_of_language(sentences, language):
    # .copy() avoids pandas' SettingWithCopyWarning when deleting columns below
    target_sentences = sentences[sentences['ISO'] == language].copy()
    del target_sentences['Date added']
    del target_sentences['Date last modified']
    del target_sentences['ISO']
    del target_sentences['Username']
    target_sentences = target_sentences.set_index("sentenceID")
    print(f"{len(target_sentences):,} sentences fetched.")
    return target_sentences
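
For instance, if you wanted to keep the Username column, the function would look like this (a sketch; the name sentences_of_language_with_username is just for illustration):

In [ ]:
# Hypothetical variant: identical to sentences_of_language, but keeps the Username column
def sentences_of_language_with_username(sentences, language):
    target_sentences = sentences[sentences['ISO'] == language].copy()
    del target_sentences['Date added']
    del target_sentences['Date last modified']
    del target_sentences['ISO']
    # del target_sentences['Username']  # commented out, so the column is kept
    target_sentences = target_sentences.set_index("sentenceID")
    print(f"{len(target_sentences):,} sentences fetched.")
    return target_sentences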

Choose your target language as a 3-letter ISO code (cmn, fra, jpn, eng, etc.), and run the next one.

In [ ]:
language = 'fra'  # <-- Modify this value
sentences = sentences_of_language(all_sentences, language)

Now, the variable sentences contains the sentences of the language you specified. Wanna check? The following cell displays five random sentences in your set, just for a quick check.

In [ ]:
sentences.sample(5)

By default, you can see two columns: sentenceID and Text. sentenceID is the same as on Tatoeba, so you can easily access that sentence page there.

To get only the text of the sentence with a specific sentenceID, use the syntax sentences.loc[<sentenceID>].Text, as in the following cell.

Note: if <sentenceID> is not in your sentences set, you will get an error.

In [ ]:
sentences.loc[1115].Text
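
If you would rather avoid the error, you can check whether the ID is in the index first. A minimal sketch (the variable name sid is just for illustration):

In [ ]:
sid = 1115  # <-- Modify this value
if sid in sentences.index:
    print(sentences.loc[sid].Text)
else:
    print(f"Sentence {sid} is not in this set.")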

Sentences containing a specific word

As its name indicates, you can use this section to fetch sentences containing a specific word.

Run the following cell (you don't have to modify it).

In [ ]:
def get_sentences(word, sentences):
    # Match the word as typed, capitalized, upper-cased, and lower-cased
    variants = [word, word.capitalize(), word.upper(), word.lower()]
    frames = [sentences[sentences['Text'].str.contains(v)] for v in variants]
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    return pd.concat(frames).drop_duplicates()
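
Note that pandas can also match case-insensitively in a single call. A more concise sketch (slightly broader than the function above, since it matches any capitalization, e.g. sKis):

In [ ]:
# Sketch: case-insensitive matching using str.contains(case=False)
def get_sentences_any_case(word, sentences):
    return sentences[sentences['Text'].str.contains(word, case=False)]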

Choose the word you want to search, run the cell, and all sentences (from your current sentences set) containing your word will be displayed.

Matches include your word in several capitalizations (as typed, capitalized, upper-cased, and lower-cased) AND any longer word containing it.

For example, if you look for beauty, sentences containing Beauty will also match.
If you look for sho, sentences containing shoot, short, should and every word containing sho will also match. That is useful in some cases, but annoying in some others :)

In [ ]:
word = "skis"  # <-- Modify this value
get_sentences(word, sentences)
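
If the substring behavior gets in your way, one option is to match on word boundaries with a regular expression. A sketch (note that \b may behave differently around apostrophes and in non-Latin scripts):

In [ ]:
# Sketch: match the word only when surrounded by word boundaries
word = "sho"  # <-- Modify this value
sentences[sentences['Text'].str.contains(rf"\b{word}\b")]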

If you only want the sentences containing the word exactly as typed (that is, a case-sensitive match), you can use the following cell.

In [ ]:
word = "exemple"  # <-- sentences matching this word EXACTLY will be fetched
sentences[sentences['Text'].str.contains(word)]
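
Be aware that str.contains interprets its argument as a regular expression by default, so a word containing characters like . or ? may not match what you expect. Passing regex=False searches for the literal string instead:

In [ ]:
# Literal (non-regex) matching: special characters in the word are taken at face value
word = "exemple"  # <-- Modify this value
sentences[sentences['Text'].str.contains(word, regex=False)]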

Checking how many sentences contain words from a list

Suppose that you want to check how many sentences contain a specific word. You could use get_sentences above and count the results. However, if you have several words in mind, and you only want to know how many sentences contain them, you can use this section.

/!\ Currently, the matching is case-sensitive (skis will not match Skis), and it is still a substring match, so longer words containing your word are counted too /!\

Run the following cell (you don't have to modify it).

In [ ]:
def how_many_sentences(word_list, sentences):
    for w in word_list:
        nb_sentences = len(sentences[sentences['Text'].str.contains(w)])
        print(f"{w}\t\t{nb_sentences}")
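
If you prefer a table over printed lines, the same counts can be collected into a pandas Series. A small sketch (the name count_sentences is just for illustration):

In [ ]:
# Sketch: same counts as how_many_sentences, returned as a pandas Series
def count_sentences(word_list, sentences):
    return pd.Series({w: sentences['Text'].str.contains(w).sum() for w in word_list},
                     name='sentences')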

Then, set word_list to the words you are interested in.
Do not forget the brackets and the quotes. word_list format should be word_list = ["word1", "word2", ..., "wordn"]

In [ ]:
word_list = ["manger", "skis", "mirage", "oasis"]  # <-- Modify these values
how_many_sentences(word_list, sentences)

Now, suppose that you only want to check the words from your list that appear in at most n sentences.

Run the following cell (you don't have to modify it)

In [ ]:
def how_many_sentences_under_threshold(word_list, threshold, sentences):
    for w in word_list:
        nb_occurrences = len(sentences[sentences['Text'].str.contains(w)])
        if nb_occurrences <= threshold:
            print(f"{w}\t\t{nb_occurrences}")

Write your own list of words (just like before) and set n to the threshold you want.
For example, if n is set to 10, only words that appear in 10 sentences or fewer will be returned, along with the number of sentences in which they appear.

In [ ]:
word_list = ["manger", "skis", "mirage", "oasis"]  # <-- Modify these values
n = 10  # <-- Modify this threshold
how_many_sentences_under_threshold(word_list, n, sentences)

Word analysis

This section is a bit more complex and may take some configuring to get exactly what you want. Basically, it runs an analysis of word frequency over your (current) set of sentences.

First, we need to ignore some symbols, like punctuation. Run the following cell to see some standard symbols to ignore.

In [ ]:
string.punctuation

You should add punctuation specific to your target language to additional_punctuation in the cell below. The quoting is a little tricky: enclose a single quote in double quotes, and a double quote in single quotes.

Run the following cell; if you get an error, it is probably because one of your enclosing quotation marks is wrong.

In [ ]:
additional_punctuation = ['``', "''", '...', '’', '«', '»']
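
For example, if your target language were Japanese, the list might look something like this (a sketch, certainly not exhaustive):

In [ ]:
# Sketch for Japanese: full-width punctuation and quotation brackets
additional_punctuation = ['。', '、', '「', '」', '・', '！', '？']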

The following cell will display the list of words that will be ignored: common stop words PLUS all the punctuation symbols defined above.
If you're not happy with this list, you can limit it to punctuation only by removing the nltk.corpus.stopwords.words(language) line, or extend it by appending another list to useless_words.

This list of stop words uses the stopwords corpus of the nltk package. Note that only a limited number of languages are available; currently:
arabic, azerbaijani, danish, dutch, english, finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovenian, spanish, swedish, tajik, turkish
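
You can also ask your installed copy of the corpus which languages it supports:

In [ ]:
# Languages available in the local nltk stopwords corpus
sorted(nltk.corpus.stopwords.fileids())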

Run the cell and see what words will be ignored!

In [ ]:
language = "french"  # <-- Modify this value
useless_words = nltk.corpus.stopwords.words(language)
useless_words += list(string.punctuation)
useless_words += additional_punctuation
useless_words

All right! A little bit more preparation :)

Run the following cell (you don't have to modify it). It creates a list of all non-ignored words present in your (current) set of sentences, so it may take some time!

In [ ]:
# List of all the texts in sentences['Text']
texts = list(sentences['Text'])
# Tokenize every text with the default tokenizer
all_words = [word for text in texts for word in nltk.word_tokenize(text)]
# "Raw" number of words
print(f'{len(all_words):,} tokens')

This is probably the hardest cell to configure correctly. We want to get rid of language-specific symbols that glue words together and hinder the analysis, for example the apostrophe in French. The best option is to comment out the tokenizer you don't need and use one specific to your language.

/!\ We hope to provide correct tokenizers for several languages out of the box in the future, but that is unfortunately not the case for now /!\

Run the following cell after setting a correct tokenizer. It will display the number of words after filtering (removing digits, splitting at apostrophes, etc.).

In [ ]:
# Using a RegexpTokenizer to improve the tokenization of French sentences.
# We want to split at apostrophes.
tokenizer = RegexpTokenizer(r"\w'|\w+|[^\w\s]")
filtered_words = [word.lower() for text in texts for word in tokenizer.tokenize(text) if word.lower() not in useless_words]
# Filter out numbers written with digits
filtered_words = [word for word in filtered_words if not word.isdigit()]
# Number of filtered words
print(f'{len(filtered_words):,} filtered words')
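
To check that the tokenizer behaves as expected, you can compare it with the default one on a short sample (a sketch; the sample sentence is just an illustration):

In [ ]:
sample = "J'aime l'exemple."
print(nltk.word_tokenize(sample))   # default tokenizer
print(tokenizer.tokenize(sample))   # regexp tokenizer defined above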

The following cell simply prints the number of unique words.

In [ ]:
# Number of unique words
print(f'{len(set(filtered_words)):,} unique words')

Finally over!

The following cell gives you the most common words along with how many times each appears! If you know how to use Counter, you can probably extract more information :)

In [ ]:
word_counter = Counter(filtered_words)
most_common_words = word_counter.most_common()
most_common_words
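
For instance, Counter also lets you ask for just the top n words, or for the count of one specific word (0 if it is absent):

In [ ]:
print(word_counter.most_common(10))  # the ten most common words only
print(word_counter['exemple'])       # occurrences of one specific word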

By running the following cell, you can see a frequency-by-word-rank graph. The counts of the most common words were given by the cell above, and the least common words most likely appear only once :)

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
sorted_word_counts = sorted(list(word_counter.values()), reverse=True)

plt.loglog(sorted_word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank")

The following cell puts the words and their occurrence counts into a dataframe, which is easier to work with and nicer to look at. Make sure to run it if you want to go through the following section!

In [ ]:
df = pd.DataFrame.from_dict(word_counter, orient='index', columns=['count'])
df.index.name = 'word'
df = df.sort_values(by='count', ascending=False)
df
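
If you want to keep these counts for later use outside the notebook, you can export the dataframe to a CSV file (the filename here is just an example):

In [ ]:
# Save the word counts to a CSV file (hypothetical filename)
df.to_csv('word_counts.csv')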

Words that appear a certain number of times

After the complexity of the previous section, this section will be easy :) Mainly, we play with the dataframe we created above: slice it, display it, from the top, from the bottom, ...

The following cell gives you the list of words that appear only once. You don't have to modify it, unless you prefer to see words that appear not once but twice, or any other number of times.

In [ ]:
unique_words = df[df['count'] == 1]  # <-- You may modify the 1 here. You can also use >, <, >= and <= signs instead of ==
unique_words

Use df.head(n) for the n most used words.
Use df.tail(n) for the n least used words.
Use df[m:n] for the words between the m-th and n-th most used.

For example, you can use this to go through words that are used only once and quickly spot typos or erroneous words. First list them with df.tail(n), then use sentences[sentences['Text'].str.contains(word)] on each word you fetched to quickly check the sentence containing it.

In [ ]:
# First ten elements
df.head(10)
In [ ]:
# From 11th to 20th
df[10:20]
In [ ]:
# Last ten elements
df.tail(10)
In [ ]:
# From the 15th-to-last up to (but not including) the 10th-to-last
df[len(df)-15:len(df)-10]

The following cell displays the n least used words (15 by default) along with the sentences that contain them. Note, however, that this is a simplistic approach that may not return exactly what you want: if the word is cat, it will also match Cat, cats, and so on.

You need to have run the cell containing the definition of the get_sentences function for this one to work. Otherwise, you'll get the error
NameError: name 'get_sentences' is not defined

In [ ]:
n = 15  # <-- Modify this
target_slice = most_common_words[-n:]  # <-- The n least common words; modify this slice if you need
check_list = [t[0] for t in target_slice]
for word in check_list:
    print(word)
    display(get_sentences(word, sentences))