!pip install transformers[sentencepiece] datasets
Collecting transformers[sentencepiece]
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting datasets
  Downloading datasets-2.0.0-py3-none-any.whl (325 kB)
  ...
Installing collected packages: urllib3, multidict, frozenlist, yarl, pyyaml, asynctest, async-timeout, aiosignal, tokenizers, sacremoses, huggingface-hub, fsspec, aiohttp, xxhash, transformers, sentencepiece, responses, datasets
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed aiohttp-3.8.1 aiosignal-1.2.0 async-timeout-4.0.2 asynctest-0.13.0 datasets-2.0.0 frozenlist-1.3.0 fsspec-2022.3.0 huggingface-hub-0.5.1 multidict-6.0.2 pyyaml-6.0 responses-0.18.0 sacremoses-0.0.49 sentencepiece-0.1.96 tokenizers-0.11.6 transformers-4.18.0 urllib3-1.25.11 xxhash-3.0.0 yarl-1.7.2
from datasets import list_datasets
list_datasets()
['assin', 'ar_res_reviews', 'ambig_qa', 'bianet', 'ag_news', 'ajgt_twitter_ar', 'aeslc', 'bc2gm_corpus', 'air_dialogue', 'acronym_identification', 'afrikaans_ner_corpus', 'allegro_reviews', 'ade_corpus_v2', 'adversarial_qa', 'alt', 'billsum', 'amazon_polarity', 'amttl', 'ascent_kb', 'big_patent', ...]
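Since `list_datasets()` returns plain Python strings, ordinary list and string operations apply. A minimal sketch, shown on a small hand-picked sample of the names above rather than a fresh (and slow) live query:

```python
# A few identifiers copied from the output above; the real call
# returns thousands of names in the same string format.
sample = ['imdb', 'ag_news', 'amazon_polarity', 'squad', 'squad_v2', 'glue']

# Membership check: is the dataset we want available?
print('imdb' in sample)  # True

# Filter for all SQuAD variants in the sample.
squad_like = [name for name in sample if name.startswith('squad')]
print(squad_like)  # ['squad', 'squad_v2']
```

The same comprehension works on the full list, which is handy for locating community datasets whose identifiers include a user prefix like `GEM/` or `DDSC/`.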
from datasets import load_dataset
imdb = load_dataset("imdb")
imdb
Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...
Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
imdb['train'][0]
{'label': 0, 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.'}
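The `label` field is a class index rather than a string. A minimal sketch of how such an index maps to a class name; the names below match the IMDB `ClassLabel` convention (0 for negative, 1 for positive), but treat the exact names as an assumption unless you confirm them via `imdb['train'].features`:

```python
# Assumed class names for the IMDB label column: index 0 = negative review.
label_names = ['neg', 'pos']

example_label = 0  # the label of the first training example above
print(label_names[example_label])  # 'neg'
```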
imdb['test'][:3]
{'label': [0, 0, 0], 'text': ['I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as they have to always say "Gene Roddenberry\'s Earth..." otherwise people would not continue watching. Roddenberry\'s ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.', "Worth the entertainment value of a rental, especially if you like action movies. This one features the usual car chases, fights with the great Van Damme kick style, shooting battles with the 40 shell load shotgun, and even terrorist style bombs. All of this is entertaining and competently handled but there is nothing that really blows you away if you've seen your share before.<br /><br />The plot is made interesting by the inclusion of a rabbit, which is clever but hardly profound. 
Many of the characters are heavily stereotyped -- the angry veterans, the terrified illegal aliens, the crooked cops, the indifferent feds, the bitchy tough lady station head, the crooked politician, the fat federale who looks like he was typecast as the Mexican in a Hollywood movie from the 1940s. All passably acted but again nothing special.<br /><br />I thought the main villains were pretty well done and fairly well acted. By the end of the movie you certainly knew who the good guys were and weren't. There was an emotional lift as the really bad ones got their just deserts. Very simplistic, but then you weren't expecting Hamlet, right? The only thing I found really annoying was the constant cuts to VDs daughter during the last fight scene.<br /><br />Not bad. Not good. Passable 4.", "its a totally average film with a few semi-alright action sequences that make the plot seem a little better and remind the viewer of the classic van dam films. parts of the plot don't make sense and seem to be added in to use up time. the end plot is that of a very basic type that doesn't leave the viewer guessing and any twists are obvious from the beginning. the end scene with the flask backs don't make sense as they are added in and seem to have little relevance to the history of van dam's character. not really worth watching again, bit disappointed in the end production, even though it is apparent it was shot on a low budget certain shots and sections in the film are of poor directed quality"]}
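Note the shape of the result: slicing a split, as in `imdb['test'][:3]`, returns one column-oriented dict of parallel lists, not a list of row dicts. Converting between the two layouts is plain Python; a sketch on a tiny stand-in batch:

```python
# Stand-in for a column-oriented slice like imdb['test'][:3].
batch = {'label': [0, 0, 0], 'text': ['a', 'b', 'c']}

# Transpose columns into per-example row dicts.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
print(rows[0])  # {'label': 0, 'text': 'a'}
```

Column orientation is deliberate: it lets the library hand back whole Arrow columns at once instead of materializing one Python object per example.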
imdb['train'] = imdb['train'].shuffle(seed=1).select(range(2000))
imdb['train']
Dataset({
    features: ['text', 'label'],
    num_rows: 2000
})
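Passing `seed=1` makes the shuffle deterministic: the same seed always produces the same permutation, so the 2,000-example subsample is reproducible across runs. The same idea sketched with the stdlib:

```python
import random

def shuffled(items, seed):
    """Return a new list shuffled with a fixed seed."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out

a = shuffled(range(10), seed=1)
b = shuffled(range(10), seed=1)
print(a == b)  # True: identical seed, identical permutation
```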
imdb_train_validation = imdb['train'].train_test_split(train_size=0.8)
imdb_train_validation
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 400
    })
})
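The row counts follow directly from `train_size=0.8` applied to the 2,000 sampled examples:

```python
# 80/20 split of the 2,000-example subsample.
n_total = 2000
n_train = int(n_total * 0.8)
n_valid = n_total - n_train
print(n_train, n_valid)  # 1600 400
```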
imdb_train_validation['test']
Dataset({ features: ['text', 'label'], num_rows: 400 })
imdb_train_validation['validation'] = imdb_train_validation.pop('test')
imdb_train_validation
DatasetDict({ train: Dataset({ features: ['text', 'label'], num_rows: 1600 }) validation: Dataset({ features: ['text', 'label'], num_rows: 400 }) })
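`train_test_split` shuffles before splitting, and (like `shuffle` above) it accepts a `seed` argument when the split needs to be reproducible. Under the hood the idea is just a seeded permutation of row indices — a minimal numpy sketch on toy data, not the imdb set:

```python
import numpy as np

def seeded_split(n_rows, train_size=0.8, seed=1):
    """Return (train_idx, valid_idx) via a seeded permutation of row indices."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(n_rows)
    cut = int(n_rows * train_size)
    return perm[:cut], perm[cut:]

train_idx, valid_idx = seeded_split(2000)
print(len(train_idx), len(valid_idx))  # 1600 400
```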
imdb.update(imdb_train_validation)
imdb
DatasetDict({ train: Dataset({ features: ['text', 'label'], num_rows: 1600 }) test: Dataset({ features: ['text', 'label'], num_rows: 25000 }) unsupervised: Dataset({ features: ['text', 'label'], num_rows: 50000 }) validation: Dataset({ features: ['text', 'label'], num_rows: 400 }) })
imdb['test'] = imdb['test'].shuffle(seed=1).select(range(400))
imdb['test']
Dataset({ features: ['text', 'label'], num_rows: 400 })
imdb['unsupervised'][:3]
{'label': [-1, -1, -1], 'text': ['This is just a precious little diamond. The play, the script are excellent. I cant compare this movie with anything else, maybe except the movie "Leon" wonderfully played by Jean Reno and Natalie Portman. But... What can I say about this one? This is the best movie Anne Parillaud has ever played in (See please "Frankie Starlight", she\'s speaking English there) to see what I mean. The story of young punk girl Nikita, taken into the depraved world of the secret government forces has been exceptionally over used by Americans. Never mind the "Point of no return" and especially the "La femme Nikita" TV series. They cannot compare the original believe me! Trash these videos. Buy this one, do not rent it, BUY it. BTW beware of the subtitles of the LA company which "translate" the US release. What a disgrace! If you cant understand French, get a dubbed version. But you\'ll regret later :)', 'When I say this is my favourite film of all time, that comment is not to be taken lightly. I probably watch far too many films than is healthy for me, and have loved quite a few of them. I first saw "La Femme Nikita" nearly ten years ago, and it still manages to be my absolute favourite. Why?<br /><br />This is more than an incredibly stylish and sexy thriller. Luc Besson\'s great flair for impeccable direction, fashion, and appropriate usage of music makes this a very watchable film. But it is Anne Parillaud\'s perfect rendering of a complex character who transforms from a heartless killer into a compassionate, vibrant young woman that makes this film beautiful. I can\'t keep my eyes off of her when she is on screen.<br /><br />I have seen several of Luc Besson\'s films including "Subway", "The Professional", and the irritating "Fifth Element", and "Nikita" is without a doubt, far superior to any of these. Although this film has tragic elements, it is ultimately extremely hopeful. 
It is the story of a person who is cruel and merciless, who ultimately comes to realize her own humanity and her own personal power. That, to me is extremely inspiring. If there is hope for Nikita, there is hope for all of us.', 'I saw this movie because I am a huge fan of the TV series of the same name starring Roy Dupuis and Pet Wilson. The movie was really good and I saw how the TV show is based on the movie. A few episodes of the TV series came directly from the movie and their similarity was amazing. To keep things short, any fan of the movie has to watch the series and any fan of the series must see the original Nikita.']}
imdb.pop('unsupervised')
imdb
DatasetDict({ train: Dataset({ features: ['text', 'label'], num_rows: 1600 }) test: Dataset({ features: ['text', 'label'], num_rows: 400 }) validation: Dataset({ features: ['text', 'label'], num_rows: 400 }) })
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', 250)
imdb.set_format('pandas')
df = imdb['train'][:]
df.sample(frac=1, random_state=1).head(10)
 | text | label |
---|---|---|
75 | I couldn't believe how lame & pointless this was. Basically there is nothing to laugh at in the movie, hardly any scenes to get you interested in the rest of the movie. This movie pulled in some huge stars but they were all wasted in my opinion. ... | 0 |
1284 | This film is about a family trying to come to terms with the death of the mother/wife by moving to Genova, Italy.<br /><br />The plot of "Genova" sounds promising, but unfortunately it is empty and without focus. The film only consists of a colle... | 0 |
408 | There is only one problem with this website, you can't give a negative rating. Additionally a mate rated this as a D grade movie. I say he was being too nice. A piece of wood could show more emotion that the actors in this movie, and the money us... | 0 |
1282 | It's interesting at first. A naive park ranger (Colin Firth) marries a pretty, mysterious woman (Lisa Zane) he's only known for a short time. They seem to be happy, then she disappears without warning. He searches for her and, after a few dead en... | 0 |
1447 | I remember watching this is its original airing in 1962 as a five or six year old and REALLY enjoying this. I recently had the opportunity to watch it again, for the first time since then, as it was aired on "Walt Disney Presents" on the Disney C... | 1 |
1144 | I have read the short story by Norman Maclean, and the movie did justice to Norman Maclean's writing. My husband tends to reread it occasionally, and I myself have read it over and scenes of the movie keeps coming to mind. We have videos of many ... | 1 |
1381 | Few movies have dashed expectations and upset me as much as Fire has. The movie is pretentious garbage. It does not achieve anything at an artistic level. The only thing it managed to receive is a ban in India. If only it was because of the poor ... | 0 |
181 | I recently saw this film and enjoyed it very much. it gives a insight to indie movie making and how much work is really involved when you have a low budget yet need a name actor/actress to get people, any people to come see it and give the movie ... | 1 |
1183 | William H. Macy is at his most sympathetic and compelling here as a hit-man and loving father who wants to step out of the family business without angering his overbearing parents. Treads much of the same territory as TV's "The Sopranos" in terms... | 1 |
1103 | In the year 1985 (my birth year) Steven Spielberg directed an emotionally strong and unforgettable story of a young African-American girl Celie (Debut role for Whoopi Goldberg) whose life is followed through rough times. The story begins from the... | 1 |
df.loc[0, 'text']
"Hey HULU.com is playing the Elvira late night horror show on their site and this movie is their under the Name Monsteroid, good fun to watch Elvira comment on this Crappy movie ....Have Fun with bad movies. Anyways this movie really has very little value other than to see how bad the 70's were for horror flicks Bad Effects, Bad Dialog, just bad movie making. Avoid this unless you want to laugh at it. While you are at HULU check out the other movies that are their right now there is 10 episodes and some are pretty decent movies with good plots and production and you can watch a lot of them in 480p as long as you have a decent speed connection."
df['text'] = df.text.str.replace('<br />', '')
df.loc[0, 'text']
"Hey HULU.com is playing the Elvira late night horror show on their site and this movie is their under the Name Monsteroid, good fun to watch Elvira comment on this Crappy movie ....Have Fun with bad movies. Anyways this movie really has very little value other than to see how bad the 70's were for horror flicks Bad Effects, Bad Dialog, just bad movie making. Avoid this unless you want to laugh at it. While you are at HULU check out the other movies that are their right now there is 10 episodes and some are pretty decent movies with good plots and production and you can watch a lot of them in 480p as long as you have a decent speed connection."
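One caveat about the cell above: replacing `<br />` with the empty string glues two words together whenever the tag sat between them. A slightly safer variation (not what the output above reflects) collapses runs of tags into a single space:

```python
import re

review = "Great cast.<br /><br />Terrible script."

fused = review.replace('<br />', '')          # words on either side get fused
spaced = re.sub(r'(?:<br />)+', ' ', review)  # one space per run of tags
print(fused)   # Great cast.Terrible script.
print(spaced)  # Great cast. Terrible script.
```

The pandas equivalent would be `df['text'].str.replace(r'(?:<br />)+', ' ', regex=True)`.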
df.label.value_counts()
0 823 1 777 Name: label, dtype: int64
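The 823/777 counts show the 1600-row training sample is mildly imbalanced. A useful number to derive from them is the majority-class baseline — the accuracy of always predicting 0 — which any trained classifier should beat. A sketch that rebuilds the counts from those two figures:

```python
from collections import Counter

# Stand-in for df['label']; rebuilt from the counts printed above.
labels = [0] * 823 + [1] * 777

counts = Counter(labels)
baseline = max(counts.values()) / len(labels)  # always-predict-majority accuracy
print(counts[0], counts[1])   # 823 777
print(round(baseline, 3))     # 0.514
```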
df["Words per review"] = df["text"].str.split().apply(len)
df.boxplot("Words per review", by="label", grid=False, showfliers=False,
color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()
/usr/local/lib/python3.7/dist-packages/matplotlib/cbook/__init__.py:1376: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. X = np.atleast_1d(X.T if isinstance(X, np.ndarray) else np.asarray(X))
# 0 is negative
# 1 is positive
df[df.text.str.len() < 200]
 | text | label | Words per review |
---|---|---|---|
262 | My favorite part of this film was the old man's attempt to cure his neighbor's ills by putting the strong medicine in his bath. There is more than a sense of family, there is a sense of community. | 1 | 38 |
406 | A movie best summed up by the scene where a victim simulates disembowelment by pulling some poor animal's intestines out from under her T-shirt. Too terrible for words. | 0 | 28 |
455 | Brilliant execution in displaying once and for all, this time in the venue of politics, of how "good intentions do actually pave the road to hell". Excellent! | 1 | 27 |
536 | Before Dogma 95: when Lars used movies as art, not just a story. A beautiful painting about love and death. This is one of my favorite movies of all time. The color... The music... Just perfect. | 1 | 36 |
989 | Allison Dean's performance is what stands out in my mind watching this film. She balances out the melancholy tone of the film with an iridescent energy. I would like to see more of her. | 1 | 34 |
1309 | This is actually one of my favorite films, I would recommend that EVERYONE watches it. There is some great acting in it and it shows that not all "good" films are American.... | 1 | 32 |
1340 | "Foxes" is a great film. The four young actresses Jodie Foster, Cherie Currie, Marilyn Kagan and Kandice Stroh are wonderful. The song "On the radio" by Donna Summer is lovely. A great film. ***** | 1 | 34 |
imdb.reset_format()
from transformers import AutoTokenizer
checkpoint = "distilbert-base-cased"
#checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(batch):
return tokenizer(batch["text"], padding=True, truncation=True)
imdb_encoded = imdb.map(tokenize_function, batched=True, batch_size=None)
imdb_encoded
DatasetDict({ train: Dataset({ features: ['text', 'label', 'input_ids', 'attention_mask'], num_rows: 1600 }) test: Dataset({ features: ['text', 'label', 'input_ids', 'attention_mask'], num_rows: 400 }) validation: Dataset({ features: ['text', 'label', 'input_ids', 'attention_mask'], num_rows: 400 }) })
print(imdb_encoded['train'][0])
{'text': "Hey HULU.com is playing the Elvira late night horror show on their site and this movie is their under the Name Monsteroid, good fun to watch Elvira comment on this Crappy movie ....Have Fun with bad movies. Anyways this movie really has very little value other than to see how bad the 70's were for horror flicks Bad Effects, Bad Dialog, just bad movie making. Avoid this unless you want to laugh at it. While you are at HULU check out the other movies that are their right now there is 10 episodes and some are pretty decent movies with good plots and production and you can watch a lot of them in 480p as long as you have a decent speed connection.", 'label': 0, 'input_ids': [101, 4403, 145, 2591, 2162, 2591, 119, 3254, 1110, 1773, 1103, 2896, 25740, 1161, 1523, 1480, 5367, 1437, 1113, 1147, 1751, 1105, 1142, 2523, 1110, 1147, 1223, 1103, 10208, 11701, 7874, 117, 1363, 4106, 1106, 2824, 2896, 25740, 1161, 7368, 1113, 1142, 140, 14543, 5005, 2523, 119, 119, 119, 119, 4373, 16068, 1114, 2213, 5558, 119, 10756, 1116, 1142, 2523, 1541, 1144, 1304, 1376, 2860, 1168, 1190, 1106, 1267, 1293, 2213, 1103, 3102, 112, 188, 1127, 1111, 5367, 22302, 1116, 6304, 23009, 117, 6304, 12120, 20151, 117, 1198, 2213, 2523, 1543, 119, 138, 6005, 2386, 1142, 4895, 1128, 1328, 1106, 4046, 1120, 1122, 119, 1799, 1128, 1132, 1120, 145, 2591, 2162, 2591, 4031, 1149, 1103, 1168, 5558, 1115, 1132, 1147, 1268, 1208, 1175, 1110, 1275, 3426, 1105, 1199, 1132, 2785, 11858, 5558, 1114, 1363, 15836, 1105, 1707, 1105, 1128, 1169, 2824, 170, 1974, 1104, 1172, 1107, 18478, 1643, 1112, 1263, 1112, 1128, 1138, 170, 11858, 2420, 3797, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
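In the encoded example above, the trailing `input_ids` are 0 and `attention_mask` drops to 0 at the same position: with `padding=True` the tokenizer pads every sequence to the longest one in the batch, and the mask marks which positions are real tokens. A toy re-implementation of just that padding step (assuming the ids are already produced):

```python
def pad_batch(id_sequences, pad_id=0):
    """Pad each id sequence to the batch max length; mask real tokens with 1."""
    max_len = max(len(ids) for ids in id_sequences)
    input_ids, attention_mask = [], []
    for ids in id_sequences:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 4403, 102], [101, 102]])
print(batch["input_ids"])       # [[101, 4403, 102], [101, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```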
import transformers
import re
[x for x in dir(transformers) if re.search(r'^AutoModel', x)]
['AutoModel', 'AutoModelForAudioClassification', 'AutoModelForAudioFrameClassification', 'AutoModelForAudioXVector', 'AutoModelForCTC', 'AutoModelForCausalLM', 'AutoModelForImageClassification', 'AutoModelForImageSegmentation', 'AutoModelForInstanceSegmentation', 'AutoModelForMaskedImageModeling', 'AutoModelForMaskedLM', 'AutoModelForMultipleChoice', 'AutoModelForNextSentencePrediction', 'AutoModelForObjectDetection', 'AutoModelForPreTraining', 'AutoModelForQuestionAnswering', 'AutoModelForSemanticSegmentation', 'AutoModelForSeq2SeqLM', 'AutoModelForSequenceClassification', 'AutoModelForSpeechSeq2Seq', 'AutoModelForTableQuestionAnswering', 'AutoModelForTokenClassification', 'AutoModelForVision2Seq', 'AutoModelWithLMHead']
import torch
from transformers import AutoModelForSequenceClassification
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_labels = 2
model = (AutoModelForSequenceClassification
.from_pretrained(checkpoint, num_labels=num_labels)
.to(device))
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias'] - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from datasets import DatasetDict
tiny_imdb = DatasetDict()
tiny_imdb['train'] = imdb['train'].shuffle(seed=1).select(range(50))
tiny_imdb['validation'] = imdb['validation'].shuffle(seed=1).select(range(10))
tiny_imdb['test'] = imdb['test'].shuffle(seed=1).select(range(10))
tiny_imdb_encoded = tiny_imdb.map(tokenize_function, batched=True, batch_size=None)
tiny_imdb_encoded
DatasetDict({ train: Dataset({ features: ['text', 'label', 'input_ids', 'attention_mask'], num_rows: 50 }) validation: Dataset({ features: ['text', 'label', 'input_ids', 'attention_mask'], num_rows: 10 }) test: Dataset({ features: ['text', 'label', 'input_ids', 'attention_mask'], num_rows: 10 }) })
from transformers import Trainer, TrainingArguments
batch_size = 8
logging_steps = len(tiny_imdb_encoded["train"]) // batch_size
model_name = f"{checkpoint}-finetuned-tiny-imdb"
training_args = TrainingArguments(output_dir=model_name,
num_train_epochs=2,
learning_rate=2e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
weight_decay=0.01,
evaluation_strategy="epoch",
disable_tqdm=False,
logging_steps=logging_steps,
log_level="error",
optim='adamw_torch'
)
training_args
TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, debug=[], deepspeed=None, disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=IntervalStrategy.EPOCH, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=<HUB_TOKEN>, ignore_data_skip=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=40, log_level_replica=-1, log_on_each_node=True, logging_dir=distilbert-base-cased-finetuned-tiny-imdb/runs/Apr11_11-10-15_d484f04cfea1, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=6, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=2, optim=OptimizerNames.ADAMW_TORCH, output_dir=distilbert-base-cased-finetuned-tiny-imdb, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=distilbert-base-cased-finetuned-tiny-imdb, save_on_each_node=False, save_steps=500, save_strategy=IntervalStrategy.STEPS, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, tf32=None, 
tpu_metrics_debug=False, tpu_num_cores=None, use_legacy_prediction_loop=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.01, xpu_backend=None, )
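The dump above shows `logging_steps=6`, which is just `len(train) // batch_size = 50 // 8`. A quick arithmetic check of the schedule these arguments imply:

```python
import math

# Sizes from the tiny run above.
train_size, batch_size, epochs = 50, 8, 2

steps_per_epoch = math.ceil(train_size / batch_size)  # last partial batch counts
total_steps = steps_per_epoch * epochs
logging_steps = train_size // batch_size              # matches the dump above

print(steps_per_epoch, total_steps, logging_steps)  # 7 14 6
```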
from transformers import Trainer
torch.cuda.empty_cache()
trainer = Trainer(model=model,
args=training_args,
train_dataset=tiny_imdb_encoded["train"],
eval_dataset=tiny_imdb_encoded["validation"],
tokenizer=tokenizer)
trainer.train();
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 0.698100 | 0.673002 |
2 | 0.690900 | 0.675429 |
preds = trainer.predict(tiny_imdb_encoded['test'])
preds
PredictionOutput(predictions=array([[-0.06958597, 0.08243475], [-0.07667492, 0.11690364], [-0.05978067, 0.05852588], [-0.05062508, 0.09085844], [-0.07219092, 0.10617454], [-0.08734367, 0.11028455], [-0.06684104, 0.08281732], [-0.07786269, 0.10676245], [-0.06891385, 0.09251334], [-0.08195043, 0.10885128]], dtype=float32), label_ids=array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0]), metrics={'test_loss': 0.7379708886146545, 'test_runtime': 0.375, 'test_samples_per_second': 26.664, 'test_steps_per_second': 5.333})
preds.predictions.shape
(10, 2)
preds.predictions.argmax(axis=-1)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
preds.label_ids
array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0])
from sklearn.metrics import accuracy_score
accuracy_score(preds.label_ids, preds.predictions.argmax(axis=-1))
0.3
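`argmax(axis=-1)` turns each row of logits into a predicted class, and accuracy is just the fraction of matches against the labels. The same computation without scikit-learn, using the predictions and labels printed above (the untrained-ish model predicted class 1 for every example):

```python
import numpy as np

# Copied from the preds output above.
predicted = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
labels    = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 0])

accuracy = (predicted == labels).mean()  # fraction of matching positions
print(accuracy)  # 0.3
```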
def get_accuracy(preds):
    predictions = preds.predictions.argmax(axis=-1)
    labels = preds.label_ids
    accuracy = accuracy_score(labels, predictions)
    return {'accuracy': accuracy}
from transformers import Trainer
torch.cuda.empty_cache()
trainer = Trainer(model=model,
compute_metrics=get_accuracy,
args=training_args,
train_dataset=tiny_imdb_encoded["train"],
eval_dataset=tiny_imdb_encoded["validation"],
tokenizer=tokenizer)
trainer.train();
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | 0.657800 | 0.671555 | 0.600000 |
2 | 0.658600 | 0.675788 | 0.600000 |
batch_size = 16
logging_steps = len(imdb_encoded["train"]) // batch_size
model_name = f"{checkpoint}-finetuned-imdb"
training_args = TrainingArguments(output_dir=model_name,
num_train_epochs=2,
learning_rate=2e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
weight_decay=0.01,
evaluation_strategy="epoch",
disable_tqdm=False,
logging_steps=logging_steps,
log_level="error",
optim='adamw_torch'
)
from transformers import Trainer
torch.cuda.empty_cache()
trainer = Trainer(model=model,
args=training_args,
compute_metrics=get_accuracy,
train_dataset=imdb_encoded["train"],
eval_dataset=imdb_encoded["validation"],
tokenizer=tokenizer)
trainer.train();
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | 0.461600 | 0.355774 | 0.862500 |
2 | 0.247800 | 0.343082 | 0.872500 |
trainer.evaluate()
{'epoch': 2.0, 'eval_accuracy': 0.895, 'eval_loss': 0.3591071367263794, 'eval_runtime': 13.6299, 'eval_samples_per_second': 29.347, 'eval_steps_per_second': 3.668}
trainer.save_model()
model_name
'distilbert-base-cased-finetuned-imdb'
from transformers import pipeline
classifier = pipeline('text-classification', model=model_name)
classifier('This is not my idea of fun')
[{'label': 'LABEL_0', 'score': 0.9525713324546814}]
classifier('This was beyond incredible')
[{'label': 'LABEL_1', 'score': 0.8722493052482605}]
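The pipeline reports the generic names `LABEL_0`/`LABEL_1` because the fine-tuned config was never given human-readable label names (per the comments earlier, 0 is negative and 1 is positive). One lightweight option is to rename them in post-processing — a sketch with a hypothetical `readable` helper (setting `id2label` on the model config before saving is the other route):

```python
# Mapping assumed from the comments above: 0 = negative, 1 = positive.
ID2LABEL = {'LABEL_0': 'negative', 'LABEL_1': 'positive'}

def readable(outputs):
    """Rename generic pipeline labels using ID2LABEL, keeping scores as-is."""
    return [{'label': ID2LABEL[o['label']], 'score': o['score']} for o in outputs]

result = readable([{'label': 'LABEL_0', 'score': 0.9525713324546814}])
print(result)  # [{'label': 'negative', 'score': 0.9525713324546814}]
```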