This notebook is focused on users' sentences and audio contributions. You can

Before experimenting with any of the options described above, it is necessary to set and execute the cells under the Read sentences section.

If you're new to Jupyter, please click on Cell > Run All from the top menu to see what the notebook does. You should see that cells that are running have an In[*] that will become In[n] when their execution is finished (n is a number). To run a specific cell, click in it and press Shift + Enter or click the Run button in the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed.

In any case, to be able to use the notebook correctly, please run the two following cells first.

In [ ]:
import pandas as pd
import csv
import tarfile
In [ ]:
pd.set_option('display.max_colwidth', None) # To display the full content of the column (None replaces the deprecated -1)
# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)

Read sentences

Reading all sentences takes a long time so let's split the process into two steps. You only need to run the two following cells once.

In [ ]:
!cat sentences_detailed.tar.bz2.part* > sentences_detailed.tar.bz2
def read_sentences_file():
    with tarfile.open('./sentences_detailed.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        sentences = pd.read_csv(tar.extractfile(csv_path), 
                sep='\t', 
                header=None, 
                names=['sentenceID', 'ISO', 'Text', 'Username', 'Date added', 'Date last modified'],
                quoting=csv.QUOTE_NONE)
        print(f"{len(sentences):,} sentences fetched.")
        return sentences
In [ ]:
all_sentences = read_sentences_file()
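
Side note: the !cat line above relies on a Unix-like shell. If that command is not available in your environment (for example on Windows), the parts can probably be joined in pure Python instead. The following is only a minimal sketch: join_parts is a hypothetical helper that is not part of the original notebook, and it assumes the part files sort correctly by name.

```python
import glob
import shutil

def join_parts(pattern, output_path):
    """Concatenate all files matching `pattern`, in sorted order, into `output_path`.

    Hypothetical helper (not part of the original notebook).
    """
    # Sort by name so that the parts are written in a stable order.
    parts = sorted(p for p in glob.glob(pattern) if p != output_path)
    with open(output_path, 'wb') as out:
        for part in parts:
            with open(part, 'rb') as src:
                shutil.copyfileobj(src, out)  # Stream the bytes, no full load in memory
    return parts

# Possible use with the files of this notebook:
# join_parts('sentences_detailed.tar.bz2.part*', 'sentences_detailed.tar.bz2')
```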

Now, you can fetch sentences of a specific language using the following cells. If you want to change your target language, you can start again from here.

Note that by default, we get rid of the ISO (that is, ISO 639 three-letter language code), Date added, and Date last modified columns.
If you need any of these columns, you can keep them by commenting out the corresponding del lines of the next cell (add a # at the beginning of each of those lines).

Then run the following cell.

In [ ]:
def sentences_of_language(sentences, language):
    # Work on a copy so that the deletions below do not touch (or warn about) the full set
    target_sentences = sentences[sentences['ISO'] == language].copy()
    del target_sentences['Date added']
    del target_sentences['Date last modified']
    del target_sentences['ISO']
    target_sentences = target_sentences.set_index("sentenceID")
    # Unknown users have their names set to <unknown>
    target_sentences.loc[target_sentences['Username'] == r'\N', 'Username'] = '<unknown>'
    print(f"{len(target_sentences):,} sentences fetched.")
    return target_sentences

Choose your target language as a 3-letter ISO code (cmn, fra, jpn, eng, etc.), and run the next one.

In [ ]:
language = 'eng'  # <-- Modify this value
sentences = sentences_of_language(all_sentences, language)

Now, the variable sentences contains the sentences of the language you specified. Wanna check? The following cell displays five random sentences in your set, just for a quick check.

In [ ]:
sentences.sample(5)

By default, you can see three columns: sentenceID, Text, and Username. sentenceID is the same as on Tatoeba, so you can easily access that sentence page there.
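
If you want to jump from a sentenceID to its page, a small helper can build the link. This is only a sketch: tatoeba_url is a hypothetical name, and the URL pattern used here is an assumption about how Tatoeba's sentence pages are addressed, so adjust it if the site uses a different layout.

```python
def tatoeba_url(sentence_id):
    """Build the (assumed) URL of a sentence page from its ID."""
    # ASSUMPTION: sentence pages live under /en/sentences/show/<id>.
    return f"https://tatoeba.org/en/sentences/show/{int(sentence_id)}"

# Example with an arbitrary ID; with the notebook's data you could instead
# loop over sentences.sample(5).index.
print(tatoeba_url(1276))
```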

Filter sentences by their owner

As its name indicates, you can use this section to fetch the sentences that belong (or do not belong) to given users.

Run the following cell (you don't have to modify it).

In [ ]:
def get_sentences_of_users(sentences, users_included, users_excluded):
    if not users_included:  # Empty list: include everybody
        target = sentences[~sentences['Username'].isin(users_excluded)]
    else:
        target = sentences[sentences['Username'].isin(users_included) & ~sentences['Username'].isin(users_excluded)]
    print(f"{len(target):,} sentences fetched.")
    return target.sort_values(by='Username') # Modify this to change the sort order

You can specify the users to include or exclude in the next cell. Users to include go into the users_to_include variable, and users to exclude go into the users_to_exclude variable. The syntax is a bit special here. If you want to do an "include" search, set users_to_exclude = []. If you want to do an "exclude" search, set users_to_include = []. For example:

  • users_to_include = ['user1', 'user2'], users_to_exclude = [] will fetch every sentence belonging to user1 or user2.
  • users_to_include = [], users_to_exclude = ['userA', 'userB'] will fetch every sentence that does not belong to userA or userB.

By running the cell, you will fetch the sentences in the language you set above.

In [ ]:
users_to_include = ['AlanF_US', 'CK']  # <-- Modify these values
users_to_exclude = []  # <-- Modify these values
user_sentences = get_sentences_of_users(sentences, users_to_include, users_to_exclude)
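
To see how the include / exclude lists behave, here is a minimal sketch on a made-up table. The data, the toy variable, and the filter_by_owner helper are all hypothetical and only mirror the idea of the function above; they are not the notebook's own code.

```python
import pandas as pd

# Made-up sentences, indexed by sentenceID like the real set
toy = pd.DataFrame(
    {
        'Text': ['Hello.', 'Good morning.', 'Thanks!', 'See you.'],
        'Username': ['user1', 'user2', 'userA', 'user1'],
    },
    index=pd.Index([1, 2, 3, 4], name='sentenceID'),
)

def filter_by_owner(df, included, excluded):
    """Keep rows whose owner is in `included` (if non-empty) and not in `excluded`."""
    keep = ~df['Username'].isin(excluded)
    if included:  # An empty list means "include everybody"
        keep &= df['Username'].isin(included)
    return df[keep]

include_only = filter_by_owner(toy, ['user1', 'user2'], [])   # "include" search
exclude_only = filter_by_owner(toy, [], ['userA', 'user1'])   # "exclude" search
print(include_only)
print(exclude_only)
```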

The following cell displays a small sample of the sentences you fetched, just to verify that everything looks fine.

In [ ]:
user_sentences.sample(10)

Audio

In this section, you can filter your set of sentences to fetch only the ones having audio (or the ones that don't). Note that you need to have prepared the user_sentences variable in the previous section.

Run the following cell (you don't have to modify it).

In [ ]:
def read_audio_file():
    with tarfile.open('./sentences_with_audio.tar.bz2', 'r:*') as tar:
        csv_path = tar.getnames()[0]
        sentences = pd.read_csv(tar.extractfile(csv_path), 
                sep='\t', 
                header=None, 
                names=['sentenceID', 'Username', 'License', 'Attribution URL'], 
                quoting=csv.QUOTE_NONE)
    del sentences['License']
    del sentences['Attribution URL']
    return sentences

sentences_with_audio = read_audio_file()
print(f'{len(sentences_with_audio):,} sentences have audio.')
audio_ids = sentences_with_audio['sentenceID'].values

Now, you should have all sentences having audio inside the sentences_with_audio variable. You can quickly check whether sentences_with_audio looks OK by running the following cell.

In [ ]:
sentences_with_audio.sample(10)

The following section allows you to check which sentences of your current user_sentences set have / do not have audio. That's where you'll need to make sure to have run the cells of one of the previous sections.

Note that user_sentences was filtered by the sentence owner, not the audio contributor.

Run the following cell (you don't have to modify it).

In [ ]:
def subset_with_audio(sentences, audio_ids):
    # Use the sentences argument (not the global user_sentences) so the function works on any set
    target = sentences[sentences.index.isin(audio_ids)]
    print(f"{len(target):,} sentences with audio fetched.")
    return target

def subset_without_audio(sentences, audio_ids):
    target = sentences[~sentences.index.isin(audio_ids)]
    print(f"{len(target):,} sentences without audio fetched.")
    return target

user_sentences_with_audio = subset_with_audio(user_sentences, audio_ids)
user_sentences_without_audio = subset_without_audio(user_sentences, audio_ids)
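
The split above relies on matching the sentenceID index against the list of IDs that have audio. A minimal sketch on made-up data (the toy_sentences and toy_audio_ids names and their values are hypothetical) may make this clearer:

```python
import pandas as pd

# Made-up sentences, indexed by sentenceID like the real set
toy_sentences = pd.DataFrame(
    {'Text': ['One.', 'Two.', 'Three.'], 'Username': ['user1', 'user2', 'user1']},
    index=pd.Index([10, 20, 30], name='sentenceID'),
)
toy_audio_ids = [20, 30, 99]  # 99 has audio but is not part of the toy set

# Boolean mask: True where the sentenceID appears in the audio list
has_audio = toy_sentences.index.isin(toy_audio_ids)
with_audio = toy_sentences[has_audio]
without_audio = toy_sentences[~has_audio]
print(with_audio)
print(without_audio)
```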

Now, user_sentences_with_audio and user_sentences_without_audio contain the sentences with / without audio belonging to your current set. You can fetch them and play with them.
That way, you can extract the sentences you want to record in the way you want: in order, at random, containing specific words, etc. (You may need to have a look at other notebooks, or get your hands dirty to achieve what you want, though.)
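
As an illustration of the "containing specific words" case, here is a minimal sketch on made-up data. It assumes a language where words are separated by spaces or punctuation (the \b word boundary is of little use for languages such as cmn or jpn), and the toy and word names are hypothetical.

```python
import re
import pandas as pd

toy = pd.DataFrame(
    {'Text': ['The cat sleeps.', 'Concatenate the files.', 'I like cats.', 'CAT scans are common.']},
    index=pd.Index([1, 2, 3, 4], name='sentenceID'),
)

word = 'cat'  # <-- Hypothetical search word
# \b keeps whole-word matches only; re.escape protects special characters in the word
pattern = rf'\b{re.escape(word)}\b'
matches = toy[toy['Text'].str.contains(pattern, case=False, regex=True)]
print(matches)
```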

Notice also that taking sentences at random may require a bit more management if you want to do it often. Being random, it may happen that you get the same sentences several times. To avoid that, you will need to keep track of the sentences you have already taken, for instance by using files, copy-pasting directly, etc.
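
One possible way to avoid getting the same sentences twice is to remember the IDs you have already drawn and exclude them from the next draw. The following is only a sketch on made-up data: sample_unseen and seen are hypothetical names, and the set of IDs lives in memory here, so you would have to save it to a file yourself to keep it between sessions.

```python
import pandas as pd

def sample_unseen(df, n, seen_ids):
    """Sample up to n rows whose index is not in seen_ids, then record them."""
    remaining = df[~df.index.isin(list(seen_ids))]
    if remaining.empty:
        return remaining  # Nothing left to draw
    picked = remaining.sample(min(n, len(remaining)))
    seen_ids.update(picked.index)  # Remember what has been drawn
    return picked

# Made-up set of 10 sentences
toy = pd.DataFrame(
    {'Text': [f'Sentence {i}.' for i in range(10)]},
    index=pd.Index(range(100, 110), name='sentenceID'),
)

seen = set()
first = sample_unseen(toy, 4, seen)
second = sample_unseen(toy, 4, seen)
third = sample_unseen(toy, 4, seen)  # Only 2 sentences are left at this point
```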

As a side note, by default, a maximum of 60 rows will be displayed (the first 30 and the last 30). If you want to display more, you can use

pd.set_option('display.max_rows', n)

where n is the number of rows you want to display at maximum.
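
If you only want a larger limit for a single display, pandas also offers pd.option_context, which restores the previous setting afterwards. A minimal sketch on made-up data (the toy frame and the value 200 are arbitrary choices for the illustration):

```python
import pandas as pd

toy = pd.DataFrame({'Text': [f'Sentence {i}.' for i in range(200)]})

before = pd.get_option('display.max_rows')
with pd.option_context('display.max_rows', 200):
    inside = pd.get_option('display.max_rows')
    rendered = repr(toy)  # In a notebook cell, you would call display(toy) here instead
after = pd.get_option('display.max_rows')  # Back to the previous limit
```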

Slicing is often better than displaying everything, but in this particular case, we may need to display one or two hundred sentences so...

Just for an illustration, here is how to take the first 50 sentences of user_sentences_without_audio.

In [ ]:
user_sentences_without_audio[:50]

50 sentences at random:

In [ ]:
user_sentences_without_audio.sample(50)

More filtering, limiting to one user

Suppose you created a set with sentences belonging to several users, and created the user_sentences_without_audio above. Now, if you want to filter the set to only one user, of course you can go back to the Filter sentences by their owner section and do everything again. However, there is a simpler way!

The sentences without audio owned by a specific user can be fetched in the following manner. Here, we include only the sentences belonging to the user CK.

In [ ]:
username = 'CK'  # <-- Modify this value
one_user_no_audio = user_sentences_without_audio[user_sentences_without_audio['Username'] == username]
print(f'{one_user_no_audio.shape[0]} sentences without audio belonging to {username}.')

Of course, you can fetch the one_user_no_audio set the same way as previously. For example, for sample_size random sentences, run the following.

In [ ]:
sample_size = 50  # <-- Modify this value
if sample_size > len(one_user_no_audio):
    print(f'Warning: Your specified sample size ({sample_size}) is larger than '
          f'the number of items that match ({len(one_user_no_audio)}).\n'
          'Printing all items that match.')
    sample_size = len(one_user_no_audio)
one_user_no_audio.sample(sample_size)

For more advanced ways of fetching sentences, check the other notebooks and add your own code below!

In [ ]: