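This notebook is focused on users. You can look up the languages of a specific user, the users or the native speakers of a specific language, the native speakers of one language who also speak another, and the date of the last contribution of some users.

Before experimenting with any of the options described above, you need to run the cells under the Languages of users section.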

If you're new to Jupyter, please click on Cell > Run All from the top menu to see what the notebook does. You should see that cells that are running have an In[*] that will become In[n] when their execution is finished (n is a number). To run a specific cell, click in it and press Shift + Enter or click the Run button of the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed.

In any case, to be able to use the notebook correctly, please run the two following cells first.

In [ ]:
import pandas as pd
import csv
import tarfile

import requests as req
from bs4 import BeautifulSoup

In [ ]:
pd.set_option('display.max_colwidth', None) # To display full content of the column (None replaces the deprecated -1)
# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)

Languages of users

Run the two following cells (you don't have to modify them).

Note that by default, we remove unknown users.

In [ ]:
def read_user_languages():
    data = pd.read_csv('./user_languages.csv', 
            sep='\t', 
            header=None, 
            names=['Language', 'Level', 'Username', 'Details'],
            quoting=csv.QUOTE_NONE)
    # The next two lines remove unknown users
    data = data[data['Username'] != r'\N']
    data = data.dropna(subset=['Username'])
    return data.fillna('')

user_infos = read_user_languages()
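# Each row is one (user, language) pair, so rows and distinct users are counted separately
print(f"{len(user_infos):,} entries found for {user_infos['Username'].nunique():,} users.")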
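
To see what the loading step does without the real file, here is a minimal sketch that applies the same read and filter logic to a small in-memory sample. The sample rows and the helper name read_user_languages_from are made up for illustration; only the column layout and the \N marker for unknown values come from the cell above.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout:
# Language, Level, Username, Details. \N marks an unknown value.
SAMPLE = (
    "fra\t5\talice\t\n"
    "eng\t3\talice\tLearning at school\n"
    "jpn\t\\N\tbob\t\n"
    "eng\t5\t\\N\t\n"
)

def read_user_languages_from(buffer):
    """Same steps as read_user_languages, but reading from any file-like object."""
    data = pd.read_csv(buffer,
                       sep='\t',
                       header=None,
                       names=['Language', 'Level', 'Username', 'Details'],
                       quoting=csv.QUOTE_NONE)
    # Drop rows whose username is unknown (\N) or missing
    data = data[data['Username'] != r'\N']
    data = data.dropna(subset=['Username'])
    # Replace remaining missing cells (e.g. empty Details) by empty strings
    return data.fillna('')

sample_infos = read_user_languages_from(io.StringIO(SAMPLE))
print(sample_infos)
```

The last sample row has an unknown username and is dropped, while the row with an unknown Level is kept.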

The cell below displays 10 random rows, just to give you an overview of the structure of the data.

In [ ]:
user_infos.sample(10)

Languages of a specific user

Run the following cell (you don't have to modify it).

In [ ]:
def languages_of_user(username, user_frame):
    return user_frame[user_frame['Username'] == username].sort_values(by='Level', ascending=False)

Replace the value of username with the username you want to check, and run the following cell. The results are displayed in descending order of Level.

In [ ]:
username = 'nimfeo'  # <-- Modify this value
languages_of_user(username, user_infos)

Users of a specific language

Run the following cell. You don't have to modify it unless you want to sort the final results by an attribute other than Username.

In [ ]:
def users_of_language(iso, user_frame):
    frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username')  # <-- Modify by=<value> to change the sort order
    print(f"{len(frame):,} users found.")
    return frame

Specify your target language as a 3-letter ISO code (cmn, fra, jpn, eng, etc.) and run the next cell to obtain the list of all its speakers.

In [ ]:
language = 'fra'  # <-- Modify this value
users_of_language(language, user_infos)
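
If you would rather not edit the function each time you want another sort order, one possible variant takes the sort column as a parameter. This is only a sketch: the name users_of_language_sorted, its sort_by and ascending parameters, and the toy frame are assumptions made for the example.

```python
import pandas as pd

def users_of_language_sorted(iso, user_frame, sort_by='Username', ascending=True):
    """Rows of user_frame for language iso, sorted by the chosen column."""
    frame = user_frame[user_frame['Language'] == iso]
    return frame.sort_values(by=sort_by, ascending=ascending)

# Toy data with the same columns as user_infos (made up for illustration)
toy = pd.DataFrame({
    'Language': ['fra', 'fra', 'eng', 'fra'],
    'Level': ['5', '2', '4', '3'],
    'Username': ['carol', 'alice', 'alice', 'bob'],
    'Details': ['', '', '', ''],
})

by_name = users_of_language_sorted('fra', toy)
by_level = users_of_language_sorted('fra', toy, sort_by='Level', ascending=False)
print(by_level)
```

With the real data you would pass user_infos instead of toy. Note that Level is compared as text here, which works for single-digit levels.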

Natives of a specific language

Run the following cell (you don't have to modify it).

In [ ]:
def natives_of_language(iso, user_frame):
    frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username')  # <-- Modify by=<value> to change the sort order
    frame = frame[frame['Level'] == '5']
    print(f"{len(frame):,} native users found.")
    return frame

Specify your target language as a 3-letter ISO code (cmn, fra, jpn, eng, etc.) and run the following cell to obtain the list of all its native speakers.

In [ ]:
language = 'fra'  # <-- Modify this value
natives_of_language(language, user_infos)

Natives of X speaking Y

Run the following cell (you don't have to modify it).

In [ ]:
def natives_speaking_other(main_language, other_language, user_frame):
    native_frame = user_frame[user_frame['Language'] == main_language]
    native_users = native_frame[native_frame['Level'] == '5'].Username.tolist()
    second_frame = user_frame[user_frame['Language'] == other_language]
    second_users = second_frame.Username.tolist()
    target_users = list(set(native_users).intersection(second_users))
    result = user_frame[user_frame['Username'].isin(target_users) & user_frame['Language'].isin([main_language, other_language])].sort_values(by='Username')
    print(f'{len(result) // 2:,} users found.')
    return result

The following cell will fetch users who are natives in main_language but also speak other_language.
Specify your target languages as 3-letter ISO codes (cmn, fra, jpn, eng, etc.) and run the cell.

In [ ]:
main_language = 'fra'  # <-- Modify this value
other_language = 'eng'  # <-- Modify this value
natives_speaking_other(main_language, other_language, user_infos)
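
The function above returns two rows per user, one for each language. As an alternative sketch, the same selection can be written as a join on Username, which gives a single row per user with both levels side by side. The name natives_speaking_other_wide and the toy frame are hypothetical, and the sketch assumes that each user has at most one row per language.

```python
import pandas as pd

def natives_speaking_other_wide(main_language, other_language, user_frame):
    """One row per user who is native (Level '5') in main_language and lists other_language."""
    natives = user_frame[(user_frame['Language'] == main_language)
                         & (user_frame['Level'] == '5')]
    others = user_frame[user_frame['Language'] == other_language]
    # Inner join: keeps only the usernames present in both selections
    merged = natives.merge(others, on='Username', suffixes=('_main', '_other'))
    columns = ['Username', 'Level_main', 'Level_other']
    return merged[columns].sort_values(by='Username').reset_index(drop=True)

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'Language': ['fra', 'eng', 'fra', 'fra', 'eng', 'eng', 'fra'],
    'Level':    ['5',   '3',   '5',   '4',   '5',   '2',   '5'],
    'Username': ['alice', 'alice', 'bob', 'carol', 'carol', 'dave', 'dave'],
})

wide = natives_speaking_other_wide('fra', 'eng', toy)
print(wide)
```

Here bob is left out because he lists no English, and carol because her French is not at level 5.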

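You can get the list of usernames from any frame you built by appending .Username.tolist(). A list may be easier to export and work with. Try it for the natives of X speaking Y you built above. That frame holds two rows per user (one for each language), so .Username.unique().tolist() is used here to list each username only once:

In [ ]:
natives_speaking_other(main_language, other_language, user_infos).Username.unique().tolist()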
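
As an illustration of the export mentioned above, the sketch below writes a list of usernames to a text file, one name per line, after removing duplicates. The helper name save_usernames, the sample names and the output location are assumptions made for the example; with the real data you would pass the list obtained from your frame.

```python
import tempfile
from pathlib import Path

def save_usernames(usernames, path):
    """Write the distinct usernames to path, one per line, in alphabetical order."""
    unique_names = sorted(set(usernames))
    Path(path).write_text('\n'.join(unique_names) + '\n', encoding='utf-8')
    return unique_names

# Hypothetical list, as it could come out of .Username.tolist()
names = ['bob', 'alice', 'bob', 'alice']

# Written to the system's temporary folder here; pick any path you like
output_path = Path(tempfile.gettempdir()) / 'usernames.txt'
saved = save_usernames(names, output_path)
print(saved)
```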

Other user information

Date of the last contribution of some users

Run the following cell (you don't have to modify it).

In [ ]:
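def last_contribution_date(users):
    dates = []
    for user in users:
        # Fetch the contribution log page of the user (one request per user)
        response = req.get(f"https://tatoeba.org/eng/contributions/of_user/{user}", timeout=30)
        soup = BeautifulSoup(response.text, features="html.parser")
        logs = soup.find(id="logs")
        # The element with id "logs" or its first <p> may be missing from the page
        p_tag = logs.find("p") if logs is not None else None
        if p_tag is None:
            dates.append("Unknown")
        else:
            text = p_tag.text
            # Keep what follows the first ", " of the entry text
            dates.append(text[text.find(',') + 2:])
    frame = {'Username': users, 'Last contribution': dates}
    return pd.DataFrame(frame)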

To check the date of the last sentence contributed by specific users, replace the values in users with the usernames you're interested in.
Don't forget the brackets. For example, for only one user, write users = ['username'].

Note: The execution of the cell may take some time, especially if many usernames are given.

In [ ]:
users = ['Ergulis', 'Abdellah', 'TRANG']  # <-- Modify these values
last_contribution_date(users)
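
The function above keeps whatever follows the first comma of the log entry. The sketch below isolates that step to show how it behaves, including when the text contains no comma. The sample strings and the helper name text_after_first_comma are made up for illustration; the actual wording of the entries on the site is not assumed here.

```python
def text_after_first_comma(text):
    """Return the part of text after its first comma, or the whole text if there is none."""
    comma = text.find(',')
    if comma == -1:
        # No comma: return the text unchanged instead of slicing it
        return text.strip()
    return text[comma + 1:].strip()

with_comma = "some label, 2020-01-01 12:00"   # hypothetical entry text
without_comma = "no comma here"                # hypothetical entry text

print(text_after_first_comma(with_comma))
print(text_after_first_comma(without_comma))

# For comparison, the plain slice: find() returns -1 when there is no comma,
# so the slice starts at index 1 and drops the first character.
plain_slice = without_comma[without_comma.find(',') + 2:]
print(plain_slice)
```

The helper is slightly more defensive than the plain slice, which is fine as long as every entry contains a comma followed by a space.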