This example demonstrates Crowdom's data labeling workflow for annotation tasks.
In annotation tasks, unlike classification tasks, the number of possible solutions is "unlimited" (compared to a fixed set of classification labels).
Quality control therefore works differently – instead of control tasks as in classification, we ask other workers to check the received solutions (annotations).
As an example of the annotation workflow, we chose an audio transcription task: workers write down the words they hear in each audio.
If this is your first time with the Crowdom workflow structure, visit the image classification workflow example first.
%pip install crowdom
from datetime import timedelta
import os
import pandas as pd
import toloka.client as toloka
from crowdom import base, datasource, client, objects, pricing, params as labeling_params
import yaml
import logging.config
with open('logging.yaml') as f:
    logging.config.dictConfig(yaml.full_load(f.read()))
from IPython.display import clear_output, display
token = os.getenv('TOLOKA_TOKEN') or input('Enter your token: ')
clear_output()
toloka_client = client.create_toloka_client(token=token)  # production environment
# or, while experimenting, use the sandbox environment:
toloka_client = client.create_toloka_client(token=token, environment=toloka.TolokaClient.Environment.SANDBOX)
We are dealing with an annotation task – we transcribe Audio into Text:
annotation_function = base.AnnotationFunction(
    inputs=(objects.Audio,),
    outputs=(objects.Text,),
)
example_url = 'https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_299.wav'
example_audio = (objects.Audio(url=example_url),)
client.TaskPreview(example_audio, task_function=annotation_function, lang='EN').display_link()
instruction = {
    'RU': 'Запишите звучащие на аудио слова, без знаков препинания и заглавных букв.',
    'EN': 'Transcribe the audio, without any punctuation or capitalization.'}
task_spec = base.TaskSpec(
    id='audio-transcription',
    function=annotation_function,
    name={'EN': 'Audio transcription', 'RU': 'Расшифровка аудио'},
    description={'EN': 'Transcribe short audios', 'RU': 'Расшифровка коротких аудио'},
    instruction=instruction)
Workers will see your task for the EN language in their task feed like this, depending on where they are doing the tasks:
(task feed screenshots: Browser | Mobile app)
Language of your data:
lang = 'EN'
Localized version of the annotation task spec:
task_spec_en = client.AnnotationTaskSpec(task_spec, lang)
Expected file format is a JSON list, each object having keys from name, with values typed according to type. As in the image classification example, for media types such as Audio we expect URLs.
datasource.file_format(task_spec_en.task_mapping)
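The exact schema is what the call above prints; as a rough illustration only (the key name and file contents below are assumptions, not the authoritative format), a tasks file could look like this:
import json
# Hypothetical example of tasks.json contents: one entry per audio, keyed by the
# input object name with a URL value. Verify key names against the
# datasource.file_format() output above before preparing your own file.
sample_tasks = [
    {'audio': 'https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_299.wav'},
]
with open('tasks.example.json', 'w') as f:
    json.dump(sample_tasks, f, ensure_ascii=False, indent=2)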
input_objects = datasource.read_tasks('tasks.json', task_spec_en.task_mapping)
control_objects = None  # remains None if you have no reference labeling
In addition to the source data, a reference labeling is expected in this file. For our task, the reference labeling is the correct transcription, located in the text field.
datasource.file_format(task_spec_en.task_mapping, has_solutions=True)
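As before, the authoritative schema is what the call above prints; a hypothetical control tasks file (key names and the transcription below are illustrative assumptions) might look like:
import json
# Hypothetical example of control_tasks.json contents: the source audio plus the
# reference transcription in the text field. Verify key names against the
# datasource.file_format(..., has_solutions=True) output above.
sample_control_tasks = [
    {
        'audio': 'https://tlk.s3.yandex.net/ext_dataset/noisy_speech/noisy_tested_wav/p232_299.wav',
        'text': 'example reference transcription',
    },
]
with open('control_tasks.example.json', 'w') as f:
    json.dump(sample_control_tasks, f, ensure_ascii=False, indent=2)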
control_objects = datasource.read_tasks(
    'control_tasks.json',
    task_spec_en.task_mapping,
    has_solutions=True,
)
Define the task duration hint:
# audios are 3-10 seconds each, and workers need time to transcribe them
task_duration_hint = timedelta(seconds=20)
Define the estimated task duration hint for experts:
task_duration_hint = timedelta(seconds=30)
from crowdom import experts, project
scenario = project.Scenario.EXPERT_LABELING_OF_TASKS
experts_task_spec = client.AnnotationTaskSpec(task_spec, lang, scenario)
if control_objects:
    objects = control_objects
    experts_task_spec = experts_task_spec.check
else:
    objects = input_objects
avg_price_per_hour = None
avg_price_per_hour = 3.5 # USD
pricing_options = pricing.get_expert_pricing_options(
    task_duration_hint, experts_task_spec.task_mapping, avg_price_per_hour)
pricing_config = pricing.choose_default_expert_option(pricing_options, avg_price_per_hour)
client.define_task(experts_task_spec, toloka_client)
raw_feedback = client.launch_experts(
    experts_task_spec,
    client.ExpertParams(
        task_duration_hint=task_duration_hint,
        pricing_config=pricing_config,
    ),
    objects[:10],
    experts.ExpertCase.TASK_VERIFICATION,
    toloka_client,
    interactive=True)
worker_id_to_name = {'fd060a4d57b00f9bba4421fe4c7c22f3': 'bob'} # {'< hex 32-digit id >': '< username >'}
feedback = client.ExpertLabelingResults(raw_feedback, experts_task_spec, worker_id_to_name)
feedback_df = feedback.get_results()
with pd.option_context("max_colwidth", 100):
    display(feedback_df)
task_duration_hint = feedback_df['duration'].mean().to_pytimedelta() # with reference labeling
# task_duration_hint = timedelta(seconds=experts_proposed_value) # without reference labeling
task_duration_hint
During the annotation process, as a quality control measure, we show the gathered annotations to other workers and ask them to evaluate them – we refer to this process as the annotation check. This process, however, needs its own quality control measures – so we can create control objects and a training for the annotation check, as well as a training for the main annotation process.
control_objects, _ = feedback.get_correct_objects(client.ExpertLabelingApplication.CONTROL_TASKS)
training_objects, comments = feedback.get_correct_objects(application=client.ExpertLabelingApplication.TRAINING)
training_config = pricing.choose_default_training_option(
    pricing.get_training_options(task_duration_hint, len(training_objects), training_time=timedelta(minutes=2)))
client.define_task(task_spec_en, toloka_client)
client.create_training(
    task_spec_en,
    training_objects,
    comments,
    toloka_client,
    training_config)
check_training_objects, check_comments = feedback.get_correct_objects(application=client.ExpertLabelingApplication.ANNOTATION_CHECK_TRAINING)
training_config = pricing.choose_default_training_option(
    pricing.get_training_options(task_duration_hint, len(check_training_objects), training_time=timedelta(minutes=2)))
client.define_task(task_spec_en, toloka_client)
client.create_training(
    task_spec_en.check,
    check_training_objects,
    check_comments,
    toloka_client,
    training_config)
You can skip any customization in this section and use the default options, which we consider suitable for a wide range of typical tasks, or tune the parameters to your liking.
For general information about labeling efficiency optimization and about customization for classification tasks, see the image classification example.
The annotation labeling process consists of two distinct subprocesses – the annotation and check steps. You can interactively customize parameters for each of these steps independently.
Most of the parameters for the annotation step are the same as for classification. There is a new addition – Assignment check sample.
With this option enabled, only a portion of the tasks from each assignment is checked – you can change this number with the Max tasks to check option. If enough of these tasks were done correctly, the whole assignment is finalized – all tasks from it are considered checked, and no more checks are created for them. You can change the required share of correctly done tasks with the Accuracy threshold option.
The Assignment check sample can reduce the cost and time of the labeling process, but low check coverage cannot guarantee high quality for unchecked solutions.
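To make the rule concrete, here is a small illustrative sketch (not the Crowdom API) of the finalization decision, using the Max tasks to check = 15 and Accuracy threshold = 0.85 values configured later in this example:
# Illustrative sketch of the assignment check sample rule, not Crowdom internals:
# an assignment is finalized once the checked portion of its tasks is accurate enough.
def assignment_finalized(correct_checked: int, checked: int, accuracy_threshold: float = 0.85) -> bool:
    return checked > 0 and correct_checked / checked >= accuracy_threshold
assert assignment_finalized(correct_checked=13, checked=15)      # 13/15 >= 0.85: finalized
assert not assignment_finalized(correct_checked=12, checked=15)  # 12/15 < 0.85: not finalized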
You can specify different task_duration_hint values for the main process and the check step if they require significantly different time to complete.
params_form = labeling_params.get_annotation_interface(
    task_spec=task_spec_en,
    check_task_duration_hint=task_duration_hint,
    annotation_task_duration_hint=task_duration_hint,
    toloka_client=toloka_client)
check_params, annotation_params = params_form.get_params()
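For illustration only (not used further in this example): if checking a transcription took, say, half the time of producing one, you could build the form with distinct hints; the 1/2 ratio below is an assumption:
# Hedged sketch: same call as above, but with a shorter hint for the check step.
params_form_split = labeling_params.get_annotation_interface(
    task_spec=task_spec_en,
    check_task_duration_hint=task_duration_hint / 2,  # assumed: checks ~2x faster than annotation
    annotation_task_duration_hint=task_duration_hint,
    toloka_client=toloka_client)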
You can define your own pricing config for the labeling. However, you can only specify real_tasks_count and assignment_price for it – we cannot use control tasks directly for labeling quality control.
from crowdom import classification, classification_loop, control, evaluation, worker
pricing_config = pricing.PoolPricingConfig(assignment_price=0.05, real_tasks_count=20, control_tasks_count=0)
assert pricing_config.control_tasks_count == 0
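A quick sanity check on what this config implies per annotation attempt (using the values above):
# Implied cost of a single annotation attempt, before overlap and checks are taken into account.
price_per_task = pricing_config.assignment_price / pricing_config.real_tasks_count
price_per_task  # 0.05 / 20 = 0.0025 USD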
Define quality and control params:
assignment_check_sample = evaluation.AssignmentCheckSample(
    max_tasks_to_check=15,
    assignment_accuracy_finalization_threshold=0.85,
)
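With 20 real tasks per assignment (from the pricing config above), this sample checks at most 15 of them; a quick calculation of the maximum check coverage:
# At most this share of tasks in each assignment will be checked.
max_check_coverage = assignment_check_sample.max_tasks_to_check / pricing_config.real_tasks_count
max_check_coverage  # 15 / 20 = 0.75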
You can specify a custom overlap; the minimum number of attempts for annotation should always be 1:
correct_done_task_ratio_for_acceptance = 0.5
control_params = control.Control(
    rules=control.RuleBuilder().add_static_reward(
        threshold=correct_done_task_ratio_for_acceptance).add_speed_control(
        # if a worker completes the assignment in less than 10% of the expected time, we reject it,
        # assuming fraud, scripts, or random clicking; specify 0 to disable this control option
        ratio_rand=.1,
        # if a worker completes the assignment in less than 30% of the expected time, we block them
        # for a while, suspecting poor performance; specify 0 to disable this control option
        ratio_poor=.3,
    ).build())
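For intuition, assuming the expected assignment time is roughly the task duration hint times the number of real tasks (a simplification), the speed-control ratios translate into absolute times like this:
# Rough absolute thresholds implied by the speed-control ratios above
# (assumes expected assignment time = hint * real task count).
expected_assignment_time = task_duration_hint * pricing_config.real_tasks_count
rejected_if_faster_than = expected_assignment_time * 0.1  # ratio_rand
blocked_if_faster_than = expected_assignment_time * 0.3   # ratio_poor
expected_assignment_time, rejected_if_faster_than, blocked_if_faster_than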
annotation_params = client.AnnotationParams(
    task_duration_hint=task_duration_hint,
    pricing_config=pricing_config,
    overlap=classification_loop.DynamicOverlap(min_overlap=1, max_overlap=3, confidence=0.85),
    control=control_params,
    assignment_check_sample=assignment_check_sample,
    worker_filter=worker.WorkerFilter(
        filters=[
            worker.WorkerFilter.Params(
                langs={worker.LanguageRequirement(lang=lang)},
                regions=worker.lang_to_default_regions.get(lang, {}),
                age_range=(18, None),
            ),
        ],
        training_score=None,
    ),
)
assert isinstance(annotation_params.overlap, classification_loop.DynamicOverlap)
client.define_task(task_spec_en, toloka_client)
assert control_objects, 'No control objects supplied'
assert isinstance(control_objects[0], tuple)
try:
    # control objects may already be in the check task format (task + solution, evaluation)
    task_spec_en.check.task_mapping.validate_objects(control_objects[0][0])
except Exception:
    # otherwise, convert (task, solution) pairs into check control objects marked as correct
    control_objects = [(task + solution, (base.BinaryEvaluation(ok=True),)) for (task, solution) in control_objects]
artifacts = client.launch_annotation(
    task_spec_en,
    annotation_params,
    check_params,
    input_objects,
    control_objects,
    toloka_client)
results = artifacts.results
Ground truth (most probable option):
with pd.option_context("max_colwidth", 100):
    display(results.predict())
All gathered annotations with respective confidence values:
with pd.option_context("max_colwidth", 100):
    display(results.predict_proba())
Detailed information about each annotation and each check for it:
with pd.option_context('max_colwidth', 150), pd.option_context('display.max_rows', 100):
    display(results.worker_labels())
Quality verification closely resembles task verification with reference labeling, but it differs slightly in options. You can run verification on a random sample of labeled objects:
import random
scenario = project.Scenario.EXPERT_LABELING_OF_SOLVED_TASKS
experts_task_spec = client.AnnotationTaskSpec(task_spec, lang, scenario)
sample_size = min(20, int(0.1 * len(input_objects)))
objects = random.sample(client.select_control_tasks(input_objects, results.raw, min_confidence=.0), sample_size)
client.define_task(experts_task_spec, toloka_client)
raw_feedback = client.launch_experts(
    experts_task_spec,
    client.ExpertParams(
        task_duration_hint=task_duration_hint,
        pricing_config=pricing_config,
    ),
    objects,
    experts.ExpertCase.LABELING_QUALITY_VERIFICATION,
    toloka_client,
    interactive=True)
test_results = client.ExpertLabelingResults(raw_feedback, experts_task_spec)
test_results.get_accuracy()