This example shows how to solve a typical side-by-side (SbS) task with Crowdom.
In our example, we ask workers which of two transcripts better matches the given audio recording.
You may want to study the image classification example first, because it contains more detailed comments on the overall process.
%pip install crowdom
%pip install pyyaml
%pip install markdown2
from datetime import timedelta
import logging.config
import os
import pandas as pd
import markdown2
import yaml
from crowdom import base, datasource, client, objects, pricing, params as labeling_params
with open('logging.yaml') as f:
    logging.config.dictConfig(yaml.full_load(f.read()))
from IPython.display import clear_output, display
toloka_client = client.create_toloka_client(token=os.getenv('TOLOKA_TOKEN') or input('Enter your token: '))
clear_output()
lang = 'EN'
instruction = {}
for worker_lang in ['EN', 'RU']:
    with open(f'instruction_{worker_lang}.md') as f:
        instruction[worker_lang] = markdown2.markdown(f.read())
task_spec = base.TaskSpec(
    id='audio-transcript-sbs',
    function=base.SbSFunction(inputs=(objects.Text,), hints=(objects.Audio,)),
    name={
        'EN': 'Audio transcripts comparison',
        'RU': 'Сравнение расшифровок аудиозаписей',
    },
    description={
        'EN': 'From the two transcripts, choose the more suitable for given audio recording',
        'RU': 'Из двух расшифровок выберите более подходящую аудиозаписи',
    },
    instruction=instruction,
)
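The task spec above passes three per-language dictionaries (`name`, `description`, `instruction`). Before defining the task, it may be worth checking that each of them covers every worker language. The sketch below is only an illustration: `missing_translations` is a hypothetical helper, not part of Crowdom, and the instruction text is a placeholder with `RU` left out on purpose to show a detected gap.

```python
from typing import Dict, List


def missing_translations(fields: Dict[str, Dict[str, str]], langs: List[str]) -> Dict[str, List[str]]:
    """Return, for each field, the language codes that have no non-empty text.

    Hypothetical helper for illustration; Crowdom does not provide it.
    """
    gaps = {}
    for field, texts in fields.items():
        absent = [lang for lang in langs if not texts.get(lang, '').strip()]
        if absent:
            gaps[field] = absent
    return gaps


# Same shape as the dictionaries passed to the task spec.
fields = {
    'name': {
        'EN': 'Audio transcripts comparison',
        'RU': 'Сравнение расшифровок аудиозаписей',
    },
    'description': {
        'EN': 'From the two transcripts, choose the more suitable for given audio recording',
        'RU': 'Из двух расшифровок выберите более подходящую аудиозаписи',
    },
    # Placeholder HTML; the RU entry is omitted deliberately.
    'instruction': {'EN': '<p>placeholder instruction</p>'},
}

gaps = missing_translations(fields, ['EN', 'RU'])
print(gaps)
```

An empty result means every field has text for every listed language.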
task_spec_en = client.PreparedTaskSpec(task_spec, lang)
client.define_task(task_spec_en, toloka_client)
task_duration_hint = timedelta(seconds=10)  # audio clips are about 1-5 seconds each
input_objects = datasource.read_tasks('tasks.json', task_spec_en.task_mapping)
control_objects = datasource.read_tasks('control_tasks.json', task_spec_en.task_mapping, has_solutions=True)
client.TaskPreview(input_objects[0], task_spec=task_spec_en).display_link()
params_form = labeling_params.get_interface(task_spec_en, task_duration_hint, toloka_client)
params = params_form.get_params()
artifacts = client.launch_sbs(
    task_spec_en,
    params,
    input_objects,
    control_objects,
    toloka_client,
    interactive=True,
)
2022-10-31 12:44:52,918 - crowdom.client.launch:_launch:193 - INFO: - classification has started
results = artifacts.results
with pd.option_context('max_colwidth', 100):
    display(results.predict())
| | audio_hint | text_a | text_b | result | confidence | overlap |
|---|---|---|---|---|---|---|
| 0 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p257_026.wav | is was accurate | is this accurate | b | 1.0 | 1 |
| 1 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_300.wav | what has altered | has altered | a | 1.0 | 1 |
| 2 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p257_120.wav | does this | does this mean | b | 1.0 | 1 |
| 3 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_044.wav | we will push | we will pull | a | 1.0 | 1 |
| 4 | https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_405.wav | we have not received a letter from the danish | we have not yet received a letter from the irish | b | 1.0 | 1 |
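As a rough illustration of how such a table could be summarized, the sketch below tallies how often each side won and picks the preferred transcript per row. It uses only the five rows shown above, copied by hand, and assumes that `a` means `text_a` was preferred and `b` means `text_b`. The `MIN_CONFIDENCE` threshold is a hypothetical value, not a Crowdom setting.

```python
from collections import Counter

# Rows copied from the results table above: (text_a, text_b, result, confidence).
rows = [
    ('is was accurate', 'is this accurate', 'b', 1.0),
    ('what has altered', 'has altered', 'a', 1.0),
    ('does this', 'does this mean', 'b', 1.0),
    ('we will push', 'we will pull', 'a', 1.0),
    ('we have not received a letter from the danish',
     'we have not yet received a letter from the irish', 'b', 1.0),
]

MIN_CONFIDENCE = 0.9  # hypothetical cut-off for trusting a verdict

# Keep only verdicts at or above the threshold.
confident = [row for row in rows if row[3] >= MIN_CONFIDENCE]

# Count wins per side and compute the share of comparisons won by side "b".
wins = Counter(row[2] for row in confident)
win_rate_b = wins['b'] / len(confident)

# Assumption: result "a" selects text_a, result "b" selects text_b.
chosen = [row[0] if row[2] == 'a' else row[1] for row in confident]

print(dict(wins), win_rate_b)
```

With an overlap of 1, each verdict comes from a single worker, so a summary like this is only as reliable as those individual answers.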