This notebook is focused on words. For a specific language, you can:
Before experimenting with any of the options described above, it is necessary to set and execute the cells under the Read sentences section.
If you're new to Jupyter, please click on Cell > Run All from the top menu to see what the notebook does. Cells that are running show In[*], which becomes In[n] when their execution is finished (n is a number). To run a specific cell, click in it and press Shift + Enter, or click the Run button of the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed.
In any case, to be able to use the notebook correctly, please run the two following cells first.
import pandas as pd
import nltk
import csv
import tarfile
import string
from collections import Counter
from nltk import RegexpTokenizer
nltk.download('stopwords')
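nltk.download('punkt')
nltk.download('punkt_tab') # Assumption: recent NLTK releases need this resource for word_tokenize; older ones only report that it is unknown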
pd.set_option('display.max_colwidth', None) # To display full content of the column (None replaces the deprecated -1)
# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)
Reading all sentences takes a long time, so let's split the process into two steps. You only need to run the two following cells once.
!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2
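The `!cat` line above relies on a Unix shell. If `cat` is not available on your system, the same concatenation can be sketched with the Python standard library. This is only a sketch: the function name `join_parts` is hypothetical, and it assumes the part files sort correctly by name.

```python
import glob
import shutil

def join_parts(pattern, output_path):
    """Concatenate all files matching `pattern`, in sorted order, into `output_path`."""
    part_paths = sorted(glob.glob(pattern))
    with open(output_path, "wb") as out_file:
        for part_path in part_paths:
            with open(part_path, "rb") as part_file:
                # Stream each part into the output without loading it fully in memory
                shutil.copyfileobj(part_file, out_file)
    return part_paths

# Hypothetical usage, mirroring the shell command above:
# join_parts("sentences_detailed.tar.bz2.part*", "sentences_detailed.tar.bz2")
```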
def read_sentences_file():
    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        sentences = pd.read_csv(tar.extractfile(csv_path),
                                sep='\t',
                                header=None,
                                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],
                                quoting=csv.QUOTE_NONE)
    print(f"{len(sentences):,} sentences fetched.")
    return sentences
all_sentences = read_sentences_file()
Now, you can fetch sentences of a specific language using the following cells. If you want to change your target language, you can start again from here.
Note that by default, we get rid of the ISO (that is, ISO 639 three-letter language code), Date added, Date last modified, and Username columns.
If you need any of these columns, you can keep them by commenting out the corresponding del lines of the next cell (add a # at the beginning of each line).
Run the following cell.
def sentences_of_language(sentences, language):
    # .copy() avoids pandas' SettingWithCopyWarning when deleting columns below
    target_sentences = sentences[sentences['ISO'] == language].copy()
    del target_sentences['Date added']
    del target_sentences['Date last modified']
    del target_sentences['ISO']
    del target_sentences['Username']
    target_sentences = target_sentences.set_index("sentenceID")
    print(f"{len(target_sentences):,} sentences fetched.")
    return target_sentences
Choose your target language as a 3-letter ISO code (cmn, fra, jpn, eng, etc.), and run the next cell.
language = 'fra' # <-- Modify this value
sentences = sentences_of_language(all_sentences, language)
Now, the variable sentences contains the sentences of the language you specified. Wanna check? The following cell displays five random sentences from your set.
sentences.sample(5)
By default, you can see two columns: sentenceID and Text. sentenceID is the same as on Tatoeba, so you can easily access that sentence page there.
To only get the text of the sentence with a specific sentenceID, use the syntax sentences.loc[<sentenceID>].Text as in the following cell.
Note: if <sentenceID> is not in your sentences set, you will get an error.
sentences.loc[1115].Text
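If you would rather get a default value than an error when the ID is missing, a small helper can check the index first. This is a minimal sketch: `get_text` and the toy dataframe are hypothetical names for illustration, and it assumes each sentenceID appears only once in the index.

```python
import pandas as pd

def get_text(sentences, sentence_id, default=None):
    """Return the Text of `sentence_id`, or `default` if the ID is not in the index."""
    if sentence_id in sentences.index:
        return sentences.loc[sentence_id, "Text"]
    return default

# Toy data standing in for the real `sentences` dataframe
toy = pd.DataFrame({"Text": ["Bonjour.", "Merci."]},
                   index=pd.Index([1115, 2000], name="sentenceID"))

print(get_text(toy, 1115))  # Bonjour.
print(get_text(toy, 42))    # None
```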
As its name indicates, you can use this section to fetch sentences containing a specific word.
Run the following cell (you don't have to modify it).
def get_sentences(word, sentences):
    # Case-insensitive match on the Text column.
    # This replaces the chained DataFrame.append calls (append was removed in pandas 2.0).
    return sentences[sentences['Text'].str.contains(word, case=False)]
Choose the word you want to search, run the cell, and all sentences (from your current sentences set) containing your word will be displayed.
The occurrences that will match are your word with any capitalization AND any words containing it.
For example, if you look for beauty, sentences containing Beauty will also match.
If you look for sho, sentences containing shoot, short, should and every word containing sho will also match. That is useful in some cases, but annoying in some others :)
word = "skis" # <-- Modify this value
get_sentences(word, sentences)
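When matching inside longer words is the annoying case, one option is to wrap the word in regular-expression word boundaries. The sketch below is not part of the original notebook: `get_sentences_whole_word` is a hypothetical helper, and `\b` only helps for languages that separate words with spaces or punctuation (it will not work as expected for Chinese or Japanese).

```python
import re
import pandas as pd

def get_sentences_whole_word(word, sentences):
    """Keep rows whose Text contains `word` as a whole word, ignoring case."""
    # re.escape protects against special regex characters in the word
    pattern = r"\b" + re.escape(word) + r"\b"
    mask = sentences["Text"].str.contains(pattern, case=False, regex=True)
    return sentences[mask]

# Toy data standing in for the real `sentences` dataframe
toy = pd.DataFrame({"Text": ["Sho is here.", "Do not shoot!", "A short walk."]})
result = get_sentences_whole_word("sho", toy)
print(result)  # only the "Sho is here." row remains
```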
If you want to check (some of) the sentences containing your word exactly as you typed it (that is, the matching is case-sensitive, but the word can still be part of a longer word), you can use the following cell.
word = "exemple" # <-- sentences containing this exact (case-sensitive) string will be fetched
sentences[sentences['Text'].str.contains(word)]
Suppose that you want to check how many sentences contain a specific word. You could use get_sentences above and count the results. However, if you have several words in mind, and you only want to know how many sentences contain them, you can use this section.
/!\ Currently, only sentences matching exactly your word will be counted (no uppercase, no capitalization, etc.) /!\
Run the following cell (you don't have to modify it).
def how_many_sentences(word_list, sentences):
    for w in word_list:
        print(w + "\t\t" + str(len(sentences[sentences['Text'].str.contains(w)])))
        # if len(sentences[sentences['Text'].str.contains(w)]) <= 10:
        #     print(w + "\t\t" + str(len(sentences[sentences['Text'].str.contains(w)])))
Then, replace word_list by the words you are interested in.
Do not forget the brackets and the quotes. The word_list format should be word_list = ["word1", "word2", ..., "wordn"]
word_list = ["manger", "skis", "mirage", "oasis"] # <-- Modify these values
how_many_sentences(word_list, sentences)
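If you want to reuse the counts instead of only printing them, or count regardless of capitalization, a variant that returns a dictionary may help. This is a sketch under assumptions: `count_sentences` is a hypothetical name, and it treats each word as a plain substring (`regex=False`), which differs slightly from the cell above.

```python
import pandas as pd

def count_sentences(word_list, sentences, case=True):
    """Map each word to the number of sentences whose Text contains it."""
    return {
        w: int(sentences["Text"].str.contains(w, case=case, regex=False).sum())
        for w in word_list
    }

# Toy data standing in for the real `sentences` dataframe
toy = pd.DataFrame({"Text": ["Oasis en vue.", "Une oasis.", "Un mirage."]})

print(count_sentences(["oasis", "mirage"], toy))              # {'oasis': 1, 'mirage': 1}
print(count_sentences(["oasis", "mirage"], toy, case=False))  # {'oasis': 2, 'mirage': 1}
```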
Now, suppose that you only want to check the words from your list that appear in at most n sentences.
Run the following cell (you don't have to modify it).
def how_many_sentences_under_threshold(word_list, threshold, sentences):
    for w in word_list:
        nb_occurrences = len(sentences[sentences['Text'].str.contains(w)])
        # Words appearing in `threshold` sentences or fewer are reported
        if nb_occurrences <= threshold:
            print(w + "\t\t" + str(nb_occurrences))
Write your own list of words (just like before) and set n to the number of sentences you want to use as a threshold.
For example, if n is set to 10, only words that appear in 10 sentences or fewer will be returned, along with the number of sentences in which they appear.
words_list = ["manger", "skis", "mirage", "oasis"] # <-- Modify these values
n = 10 # <-- Modify this threshold
how_many_sentences_under_threshold(words_list, n, sentences)
This section is a little bit complex and may be challenging to configure as you like. Basically it runs an analysis of word frequency in your (current) set of sentences.
First, we need to ignore some symbols, like punctuation. Run the following cell to see some standard symbols to ignore.
string.punctuation
You should add punctuation specific to your target language to additional_punctuation in the cell below. The format is a little bit complicated: enclose single quotes in double quotes, and double quotes in single quotes.
Run the following cell. If an error occurs, it is probably because one of your enclosing quotation marks is not correct.
additional_punctuation = ['``', "''", '...', '’', '«', '»'] # repeated entries removed; they had no effect
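The quoting rule can be easier to see on a tiny example. The snippet below is only an illustration with hypothetical variable names; it shows both ways of writing a quote character as a Python string.

```python
# A single quote goes inside double quotes, and a double quote inside single quotes
single_quote = "'"
double_quote = '"'

# Alternatively, a backslash escapes the quote that would otherwise end the string
escaped_single = '\''
escaped_double = "\""

# Hypothetical extra symbols for a target language
extra_symbols = [single_quote, double_quote, "«", "»", "…"]
print(extra_symbols)
```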
The following cell will display a list of words that will be ignored. Those are common stop words PLUS all the punctuation symbols defined above.
If you're not happy with this list, you can limit it to only punctuation by removing nltk.corpus.stopwords.words(), or extend it by adding another list to useless_words.
This list of stop words uses the stopwords corpus of the nltk package. Note that a limited number of languages are available. Currently available are arabic, azerbaijani, danish, dutch, english, finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovenian, spanish, swedish, tajik, turkish.
Run the cell and see what words will be ignored!
language = "french" # <-- Modify this value
useless_words = nltk.corpus.stopwords.words(language)
useless_words += list(string.punctuation)
useless_words += additional_punctuation
useless_words
All right! A little bit more preparation :)
Run the following cell (you don't have to modify it). It creates a list of all tokens (words and punctuation, before any filtering) present in your (current) set of sentences, so it may take some time!
# List of sentence texts taken from sentences['Text']
texts = sentences['Text'].tolist()
all_words = [word for text in texts for word in nltk.word_tokenize(text)]
# "Raw" number of words
print(f'{len(all_words):,} tokens')
This is probably the hardest cell to configure correctly. We want to get rid of language-specific symbols that get between words and hinder the analysis, for example the apostrophe in French. The best option is to comment out the tokenizer you don't need (the toknizer variable in the cell below) and use one specific to your language.
/!\ We hope to provide a correct tokenizer for several languages out of the box in the future, but that is unfortunately not the case for now /!\
Run the following cell after setting toknizer to a correct tokenizer. It will display the number of words after filtering (removing digits, splitting at apostrophes, etc.)
# Using a RegexpTokenizer to improve tokenizing of French sentences.
# We want to split at apostrophes.
toknizer = RegexpTokenizer(r"''\w'|\w+|[^\w\s]''")
useless_set = set(useless_words) # Set lookups are much faster than list lookups on a large corpus
filtered_words = [word.lower() for text in texts for word in toknizer.tokenize(text) if word.lower() not in useless_set]
# Filter numbers written with digits
filtered_words = [word for word in filtered_words if not word.isdigit()]
# Number of filtered words
print(f'{len(filtered_words):,} filtered words')
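To see what the pattern above does to a French sentence without loading the whole corpus, it can be tried on a single string. The sketch below uses `re.findall` from the standard library as a stand-in, on the assumption that `RegexpTokenizer` in its default mode returns the pattern's matches in the same way; `tokenize` is a hypothetical helper name.

```python
import re

# Same pattern string as in the cell above
PATTERN = r"''\w'|\w+|[^\w\s]''"

def tokenize(text):
    """Return every match of PATTERN in `text`, from left to right."""
    return re.findall(PATTERN, text)

tokens = tokenize("L'homme n'est pas là.")
print(tokens)  # ['L', 'homme', 'n', 'est', 'pas', 'là']
```

On this input, the apostrophes and the final period match none of the alternatives, so they are skipped and only the word parts remain.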
The following cell simply prints the number of unique words.
# Number of unique words
print(f'{len(set(filtered_words)):,} unique words')
Finally over!
The following cell gives you the most common words along with how many times they appear! If you know how to use Counter, you can probably get more information :)
word_counter = Counter(filtered_words)
most_common_words = word_counter.most_common()
most_common_words
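For readers who have not used `Counter` before, here are a few of the things it can do, shown on a toy word list rather than on `filtered_words`. The variable names are hypothetical.

```python
from collections import Counter

# Toy word list standing in for `filtered_words`
toy_counter = Counter(["chat", "chien", "chat", "oiseau", "chat", "chien"])

top_two = toy_counter.most_common(2)       # the two most frequent words with their counts
chat_count = toy_counter["chat"]           # count for a single word
missing_count = toy_counter["poisson"]     # words never seen count as 0, no error is raised
total_tokens = sum(toy_counter.values())   # total number of words counted
seen_once = [w for w, c in toy_counter.items() if c == 1]  # words that appear exactly once

print(top_two)        # [('chat', 3), ('chien', 2)]
print(chat_count)     # 3
print(missing_count)  # 0
print(total_tokens)   # 6
print(seen_once)      # ['oiseau']
```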
By running the following cell, you can see a graph of frequency by word rank (both axes use a logarithmic scale). The number of times the most common words appear is given by the cell above, and it's very likely that the least common words appear only once :)
%matplotlib inline
import matplotlib.pyplot as plt
sorted_word_counts = sorted(list(word_counter.values()), reverse=True)
plt.loglog(sorted_word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank")
The following cell puts the list of common words and their occurrence count into a dataframe that is easier to use, and more beautiful to see. Make sure to run it if you want to go to the following section!
df = pd.DataFrame.from_dict(word_counter, orient='index')
df = df.rename(columns={'index':'word', 0:'count'})
df = df.sort_values(by='count', ascending=False)
df
After the complexity of the previous section, this section will be easy :) Mainly, we play with the dataframe we created above: slice it, display it, from the top, from the bottom, ...
The following cell gives you the list of words that appear only once. You don't have to modify it except if you prefer to see words that appear not once but twice, or any number of times.
unique_words = df[df['count'] == 1] # <-- You may modify the 1 here. You can also use >, <, >= and <= signs instead of ==
unique_words
Use df.head(n) for the n most used words.
Use df.tail(n) for the n least used words.
You can use df[m:n] for the words between the m-th and n-th most used.
For example, you can use this to go through words that are used only once to quickly find typos or erroneous words. First check the words that are used only once with df.tail(n), then use sentences[sentences['Text'].str.contains(word)] with the words you fetched. That way, you can quickly check the sentences containing each word.
# First ten elements
df.head(10)
# From 11th to 20th
df[10:20]
# Last ten elements
df.tail(10)
# From the 15th-to-last to the 11th-to-last element (5 rows)
df[len(df)-15:len(df)-10]
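The slices above select rows by position. If you prefer to make that explicit, `.iloc` does the same thing and also accepts negative positions, which avoids the `len(df)` arithmetic. A small sketch on a toy dataframe (the names are hypothetical):

```python
import pandas as pd

# Toy frequency table standing in for `df`, already sorted by count
toy_df = pd.DataFrame({"count": [9, 7, 5, 3, 1]},
                      index=["a", "b", "c", "d", "e"])

second_and_third = toy_df.iloc[1:3]   # rows at positions 1 and 2
third_to_last = toy_df.iloc[-3:-1]    # from the 3rd-to-last up to the 2nd-to-last row

print(second_and_third)
print(third_to_last)
```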
The following cell displays the 15 least used words along with the sentences that contain them. Notice however that it is a simplistic approach that may not return exactly what you want. If the word is cat, this will return Cat, cats, and so on.
You have to run the cell containing the definition of the get_sentences function if you want the cell to work. Otherwise, you'll get an error:
NameError: name 'get_sentences' is not defined
n = 15 # <-- Modify this
target_slice = most_common_words[len(df)-n:len(df)] # <-- Modify this if you need
check_list = [t[0] for t in target_slice]
for word in check_list:
    print(word)
    display(get_sentences(word, sentences))