In this notebook, you can check sentences that lack final punctuation and sentences that do not start with a capital letter.
First of all, please run the cells under the Read sentences section.
If you're new to Jupyter, click Cell > Run All in the top menu to see what the notebook does. Cells that are running show In[*], which becomes In[n] (where n is a number) once their execution is finished. To run a specific cell, click in it and press Shift + Enter, or click the Run button in the top menu. Note that some cells, such as those that define a function, produce no output but still need to be executed.
In any case, to be able to use the notebook correctly, please run the two following cells first.
import pandas as pd
import csv
import tarfile
pd.set_option('display.max_colwidth', None) # To display full content of the column (None replaces the deprecated -1)
# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)
Reading all sentences takes a long time so let's split the process into two steps. You only need to run the two following cells once.
!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2
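The cat command above relies on a Unix-like shell. Where it is not available, the same concatenation can be sketched in Python with the standard library. The helper name join_parts is hypothetical, and sorting the part names is an assumption meant to mirror the order in which the shell expands part*.

```python
import glob
import shutil


def join_parts(pattern, output_path):
    """Concatenate every file matching `pattern`, in sorted order, into `output_path`."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"No file matches {pattern!r}")
    with open(output_path, 'wb') as out:
        for part in parts:
            with open(part, 'rb') as chunk:
                shutil.copyfileobj(chunk, out)  # Stream the bytes without loading a whole part
    return parts


# Usage, equivalent to the shell cell above:
# join_parts('sentences_detailed.tar.bz2.part*', 'sentences_detailed.tar.bz2')
```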
def read_sentences_file():
    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        sentences = pd.read_csv(tar.extractfile(csv_path),
                                sep='\t',
                                header=None,
                                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],
                                quoting=csv.QUOTE_NONE)
    print(f"{len(sentences):,} sentences fetched.")
    return sentences
all_sentences = read_sentences_file()
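The quoting=csv.QUOTE_NONE argument in read_sentences_file is presumably there because sentence text can contain double quotes, which a CSV parser would otherwise interpret as field delimiters. The sketch below illustrates the effect with the standard csv module rather than pandas, on a made-up line that holds only the first three columns.

```python
import csv
import io

# Made-up tab-separated line: sentence ID, ISO code, text.
sample = '1\teng\t"Hello," she said.\n'

# Default quoting: a field that starts with " is parsed as a quoted field,
# so the quote characters are removed from the text.
default_row = next(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: quote characters are treated as ordinary text and kept.
raw_row = next(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))

print(default_row[2])
print(raw_row[2])
```

Only the second row keeps the sentence exactly as written, which matters when the goal is to inspect the first and last characters of each sentence.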
Now, you can fetch sentences of a specific language using the following cells. If you want to change your target language, you can start again from here.
Note that by default, we get rid of the ISO (that is, the ISO 639 three-letter language code), Date added, Date last modified, and Username columns.
If you need any of these columns, comment out the matching del lines in the next cell by adding a # at the beginning of each one.
Then run the following cell.
def sentences_of_language(sentences, language):
    target_sentences = sentences[sentences['ISO'] == language]
    del target_sentences['Date added']
    del target_sentences['Date last modified']
    del target_sentences['ISO']
    del target_sentences['Username']
    target_sentences = target_sentences.set_index("sentenceID")
    print(f"{len(target_sentences):,} sentences fetched.")
    return target_sentences
Choose your target language as a 3-letter ISO code (cmn, fra, jpn, eng, etc.), and run the next cell.
language = 'fra'
sentences = sentences_of_language(all_sentences, language)
Now, the variable sentences contains the sentences of the language you specified. Wanna check? The following cell displays five random sentences in your set, just for a quick check.
sentences.sample(5)
To check the sentences that do not have final punctuation, we first need to define correct final punctuation signs.
You can use one of the lists provided in the cell below or define your own list of all the correct final punctuation marks for your target language. If you define it yourself, be careful to use the same format.
Make sure that one variable contains all the marks you need, then run the cell below.
# Punctuation I expect to find at the end of French sentences
french_end_punctuation = ('!', '?', '.', '»', '…')
# Punctuation I expect to find at the end of English sentences
english_end_punctuation = ('!', '?', '.', '"', '…')
# Punctuation I expect to find at the end of Japanese sentences
japanese_end_punctuation = ('!', '！', '?', '？', '。', '」')
# Punctuation I expect to find at the end of German sentences
german_end_punctuation = ('!', '?', '.', '“', '…')
# Punctuation I expect to find at the end of Esperanto sentences
esperanto_end_punctuation = ('!', '?', '.', '"', '…')
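If you switch languages often, the per-language variables above can also be gathered in a lookup keyed by ISO code. The following is only a sketch: the names END_PUNCTUATION_BY_LANGUAGE, DEFAULT_END_PUNCTUATION, and end_punctuation_for are hypothetical and not part of the notebook, and the fallback tuple is an assumption. The "same format" mentioned earlier means a tuple of strings, which is what Python's str.endswith accepts when several endings are allowed.

```python
# Hypothetical lookup table: ISO code -> tuple of accepted sentence-final characters.
# The two entries reuse the French and English tuples from the cell above.
END_PUNCTUATION_BY_LANGUAGE = {
    'fra': ('!', '?', '.', '»', '…'),
    'eng': ('!', '?', '.', '"', '…'),
}

# Fallback for languages without a dedicated entry (an assumption, adjust as needed).
DEFAULT_END_PUNCTUATION = ('!', '?', '.', '…')


def end_punctuation_for(language):
    """Return the tuple of accepted final characters for an ISO code."""
    return END_PUNCTUATION_BY_LANGUAGE.get(language, DEFAULT_END_PUNCTUATION)


print(end_punctuation_for('fra'))
```

With such a table, the end_punctuation variable could be set from the language variable instead of being edited by hand.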
Set end_punctuation to the list you need and run the following cell.
end_punctuation = french_end_punctuation # Replace by the one you need
no_punc = sentences[~sentences['Text'].str.endswith(end_punctuation)]
print(f'{len(no_punc):,} sentences found.')
no_punc contains the list of sentences that do not end with any of the characters you specified earlier. Note that if a sentence seems to end correctly, it is probably because there is a space after the punctuation mark.
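To see why a trailing space makes a sentence show up here, the check can be reproduced on a single string. This is a minimal sketch in plain Python rather than pandas. The helper lacks_end_punctuation and its ignore_trailing_space flag are hypothetical names introduced for illustration.

```python
def lacks_end_punctuation(text, end_punctuation, ignore_trailing_space=False):
    """Return True when `text` does not end with one of the accepted characters."""
    if ignore_trailing_space:
        text = text.rstrip()  # Drop whitespace at the end before checking
    return not text.endswith(end_punctuation)


marks = ('!', '?', '.', '»', '…')
with_space = 'Bonjour. '

print(lacks_end_punctuation(with_space, marks))                              # True: the space hides the period
print(lacks_end_punctuation(with_space, marks, ignore_trailing_space=True))  # False
```

Whether such sentences should be reported is a matter of choice: a trailing space is itself something you may want to find and fix, so the notebook's stricter check is kept as is.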
no_punc
You may have noticed that if no_punc is too long, only the first and last 30 rows are displayed. It is better to avoid displaying its entire content at once, but you can still explore it in slices. no_punc[n:m] gives you the rows from position n up to, but not including, position m (positions are counted from 0).
no_punc[15:50]
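Rather than typing each slice by hand, the slices can be generated one page at a time. The sketch below uses a plain list as a stand-in for no_punc. The helper name pages is hypothetical, and it is assumed to behave the same on no_punc because it relies only on len() and the [n:m] slicing shown above.

```python
def pages(rows, size):
    """Yield consecutive slices of `rows`, each holding at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Stand-in for no_punc: any object that supports len() and [n:m] slicing.
rows = [f'sentence {i}' for i in range(7)]

for page in pages(rows, 3):
    print(page)
```

The last page is simply shorter when the number of rows is not a multiple of the page size.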
If you want to check sentences that do not start with a capital letter, simply run the following cells. You do not need to modify anything this time.
no_capital = sentences[[x[0].islower() for x in sentences['Text']]]
print(f'{len(no_capital):,} sentences found.')
no_capital contains the sentences of your current set that start with a lowercase letter. Run the following cell to display them.
Note: You may find sentences that look like they start with a capital I (i). That is a font issue. The sentence actually starts with a lowercase l (L).
no_capital
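When the font makes it hard to tell a capital I from a lowercase l, the standard unicodedata module can name the character for you. This is a small sketch. The helper describe_first_character is a hypothetical name, and it assumes the text is a non-empty string.

```python
import unicodedata


def describe_first_character(text):
    """Return the first character of `text` with its official Unicode name."""
    first = text[0]
    return first, unicodedata.name(first, 'UNKNOWN')


print(describe_first_character('lundi est un jour.'))  # ('l', 'LATIN SMALL LETTER L')
print(describe_first_character('Il pleut.'))           # ('I', 'LATIN CAPITAL LETTER I')
```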