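This notebook is focused on users. You can:
- list the languages declared by a given user;
- list all the users of a given language;
- list the native speakers of a given language;
- list the native speakers of one language who also speak another;
- check the date of the last contribution of given users.
Before experimenting with any of the options described above, you need to run the cells under the Languages of users section.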
If you're new to Jupyter, please click on Cell > Run All from the top menu to see what the notebook does. You should see that cells that are running have an In[*] that will become In[n] when their execution is finished (n is a number). To run a specific cell, click in it and press Shift + Enter or click the Run button of the top menu. Note that some cells, such as those that define a function, will not have output, but still need to be executed.
In any case, to be able to use the notebook correctly, please run the two following cells first.
import pandas as pd
import csv
import tarfile
import requests as req
from bs4 import BeautifulSoup
pd.set_option('display.max_colwidth', None) # To display the full content of each column (None replaces the deprecated -1)
# pd.set_option('display.max_rows', None) # To display ALL rows of the dataframe (otherwise you can decide the max number)
Run the two following cells (you don't have to modify them).
Note that by default, we remove unknown users.
def read_user_languages():
    data = pd.read_csv('./user_languages.csv',
                       sep='\t',
                       header=None,
                       names=['Language', 'Level', 'Username', 'Details'],
                       quoting=csv.QUOTE_NONE)
    # The next two lines remove unknown users
    data = data[data['Username'] != r'\N']
    data = data.dropna(subset=['Username'])
    return data.fillna('')
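user_infos = read_user_languages()
# Each row is one (user, language) pair, so this counts rows rather than distinct users
print(f"{len(user_infos):,} rows loaded (one row per user and language).")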
The cell below displays 10 random rows, just to give you an overview of the structure of the data.
user_infos.sample(10)
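To see what the unknown-user filter does without loading the full export, here is a minimal sketch on a made-up four-line sample. The rows and usernames are invented for illustration; only the column layout and the two filtering steps follow read_user_languages above, and it is an assumption that a literal \N marks a missing value in the file.

```python
import csv
import io

import pandas as pd

# Invented sample in the same tab-separated layout:
# Language, Level, Username, Details. A literal \N stands for a missing value.
sample = (
    "fra\t5\talice\t\n"
    "eng\t3\talice\tLearning\n"
    "deu\t\\N\t\\N\t\n"
    "jpn\t2\tbob\t\\N\n"
)

data = pd.read_csv(io.StringIO(sample),
                   sep="\t",
                   header=None,
                   names=["Language", "Level", "Username", "Details"],
                   quoting=csv.QUOTE_NONE)

# Same two filters as read_user_languages(): drop the \N marker, then real NaN usernames.
known = data[data["Username"] != r"\N"].dropna(subset=["Username"]).fillna("")

print(known)
```

The row whose username is \N is dropped, while empty Details cells become empty strings instead of NaN.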
Run the following cell (you don't have to modify it).
def languages_of_user(username, user_frame):
    return user_frame[user_frame['Username'] == username].sort_values(by='Level', ascending=False)
Replace username by the username you want to check, and run the following cell. The results are displayed by descending Level order.
username = 'nimfeo' # <-- Modify this value
languages_of_user(username, user_infos)
Run the following cell. You don't have to modify it unless you want to sort the final results by an attribute other than Username.
def users_of_language(iso, user_frame):
    frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username') # <-- Modify by=<value> to change the sort order
    print(f"{len(frame):,} users found.")
    return frame
Specify your target language as a 3-letter ISO code (cmn, fra, jpn, eng, etc.) and run the next cell to obtain the list of all its speakers.
language = 'fra' # <-- Modify this value
users_of_language(language, user_infos)
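As an illustration of changing the by=<value> argument mentioned above, the sketch below sorts a small invented frame by Level (highest first) and then by Username to break ties. The rows are made up; only the column names come from the notebook. Level is stored as text, so this ordering relies on levels being single characters.

```python
import pandas as pd

# Invented rows using the notebook's column names.
frame = pd.DataFrame({
    "Language": ["fra", "fra", "fra"],
    "Level": ["2", "5", "5"],
    "Username": ["carol", "bob", "alice"],
    "Details": ["", "", ""],
})

# Highest level first; usernames in alphabetical order within the same level.
by_level = frame.sort_values(by=["Level", "Username"], ascending=[False, True])

print(by_level["Username"].tolist())
```

Passing a list to by and a matching list to ascending lets each column have its own direction.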
Run the following cell (you don't have to modify it).
def natives_of_language(iso, user_frame):
    frame = user_frame[user_frame['Language'] == iso].sort_values(by='Username') # <-- Modify by=<value> to change the sort order
    frame = frame[frame['Level'] == '5']
    print(f"{len(frame):,} native users found.")
    return frame
Specify your target language as a 3-letter ISO code (cmn, fra, jpn, eng, etc.) and run the following cell to obtain the list of all its native speakers.
language = 'fra' # <-- Modify this value
natives_of_language(language, user_infos)
Run the following cell (you don't have to modify it).
def natives_speaking_other(main_language, other_language, user_frame):
    native_frame = user_frame[user_frame['Language'] == main_language]
    native_users = native_frame[native_frame['Level'] == '5'].Username.tolist()
    second_frame = user_frame[user_frame['Language'] == other_language]
    second_users = second_frame.Username.tolist()
    # Users who are native in main_language and also list other_language
    target_users = list(set(native_users).intersection(second_users))
    result = user_frame[user_frame['Username'].isin(target_users) & user_frame['Language'].isin([main_language, other_language])].sort_values(by='Username')
    # Count distinct users rather than assuming exactly two rows per user
    print(f"{result['Username'].nunique():,} users found.")
    return result
The following cell will fetch users who are natives in main_language but also speak other_language.
Specify your target languages as 3-letter ISO codes (cmn, fra, jpn, eng, etc.) and run the cell.
main_language = 'fra' # <-- Modify this value
other_language = 'eng' # <-- Modify this value
natives_speaking_other(main_language, other_language, user_infos)
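The core of natives_speaking_other is a set intersection: the natives of one language, intersected with everyone who lists the other language. The sketch below shows that step alone on invented (language, level, username) tuples, using plain Python sets instead of a dataframe; the data and variable names are made up for illustration.

```python
# Invented rows: (language, level, username). Level "5" means native, as in the notebook.
rows = [
    ("fra", "5", "alice"),
    ("eng", "3", "alice"),
    ("fra", "5", "bob"),
    ("eng", "5", "carol"),
    ("fra", "2", "carol"),
]

# Natives of the main language.
natives = {user for lang, level, user in rows if lang == "fra" and level == "5"}
# Anyone who lists the other language, at any level.
speakers = {user for lang, level, user in rows if lang == "eng"}

# Users present in both sets.
both = sorted(natives & speakers)
print(both)
```

Here bob is native in fra but does not list eng, and carol lists both but is not native in fra, so only alice remains.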
You can get the list of usernames from any frame you built by appending .Username.tolist(). A list may be easier to export and work with. Try it for the natives of X speaking Y you built above (that frame holds two rows per user, so .unique() is added to keep each username once):
natives_speaking_other(main_language, other_language, user_infos).Username.unique().tolist()
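Once you have a list of usernames, one simple way to export it is a text file with one username per line. This is a minimal sketch: the usernames are invented, and the file name and its location in the system temporary directory are arbitrary choices, not something the notebook prescribes.

```python
import tempfile
from pathlib import Path

# Invented list standing in for the result of a .Username.tolist() call.
usernames = ["alice", "bob", "alice"]

# Remove repeated names and sort them for a stable output.
unique_usernames = sorted(set(usernames))

# Hypothetical output file in the system temporary directory.
output_path = Path(tempfile.gettempdir()) / "usernames.txt"
output_path.write_text("\n".join(unique_usernames) + "\n", encoding="utf-8")

print(output_path.read_text(encoding="utf-8"))
```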
Run the following cell (you don't have to modify it).
def last_contribution_date(users):
    dates = []
    for user in users:
        response = req.get(f"https://tatoeba.org/eng/contributions/of_user/{user}", timeout=30)
        soup = BeautifulSoup(response.text, features="html.parser")
        logs = soup.find(id="logs")
        # The "logs" block may be missing (unknown user, changed page layout)
        p_tag = logs.find("p") if logs is not None else None
        if p_tag is None:
            dates.append("Unknown")
        else:
            text = p_tag.text
            # Keep what follows the first comma and the space after it
            dates.append(text[text.find(',') + 2:])
    frame = {"Username": users, "Last contribution": dates}
    return pd.DataFrame(frame)
To check the date of the last sentence contributed by specific users, replace the values in users with the usernames you're interested in.
Don't forget the brackets. For example, for only one user, write users = ['username'].
Note: The execution of the cell may take some time, especially if many usernames are given.
users = ['Ergulis', 'Abdellah', 'TRANG'] # <-- Modify these values
last_contribution_date(users)
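The slice text[text.find(',') + 2:] used in last_contribution_date keeps everything after the first comma and the space that follows it. The sketch below isolates that step in a hypothetical helper and applies it to invented strings, since the exact wording of the log line on the page is not shown here.

```python
def text_after_first_comma(text):
    """Return what follows the first comma and one more character, as in the slice above."""
    return text[text.find(",") + 2:]


# Invented log line; the real page text may be worded differently.
with_comma = "alice, 2 days ago"
print(text_after_first_comma(with_comma))

# Without a comma, find() returns -1, so the slice starts at index 1
# and silently drops only the first character.
without_comma = "no comma here"
print(text_after_first_comma(without_comma))
```

If the page layout ever changes so that the line has no comma, the result would be the original text minus its first character rather than an error, which is worth keeping in mind when reading the Last contribution column.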