Fake News Detection with Hugging Face¶

Hugging Face is an open-source and platform provider of machine learning technologies. You can use install their package to access some interesting pre-built models to use them directly or to fine-tune (retrain it on your dataset leveraging the prior knowledge coming with the first training), then host your trained models on the platform, so that you may use them later on other devices and apps.

Please, go to the website and sign-in to access all the features of the platform.

The Hugging face models are Deep Learning based, so will need a lot of computational GPU power to train them. Please use Colab to do it, or your other GPU cloud provider, or a local machine having NVIDIA GPU.

Find below a simple example, with just 10 epochs of fine-tuning`.

Read more about the fine-tuning concept : here

Installation¶

In [123]:

# !pip install zipfile

!pip install transformers
!pip install datasets
!pip install --upgrade accelerate
!pip install sentencepiece

Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.30.2)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.12.2)
Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.16.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.22.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (23.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2022.10.31)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.27.1)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.13.3)
Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.3.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.65.0)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (2023.6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (4.6.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2023.5.7)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4)
Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.13.1)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.22.4)
Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (9.0.0)
Requirement already satisfied: dill<0.3.7,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.6)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.27.1)
Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.65.0)
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.2.0)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.14)
Requirement already satisfied: fsspec[http]>=2021.11.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.8.4)
Requirement already satisfied: huggingface-hub<1.0.0,>=0.11.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.16.2)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.1.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.0.12)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.4)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0.0,>=0.11.0->datasets) (3.12.2)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0.0,>=0.11.0->datasets) (4.6.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2023.5.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.4)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2022.7.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)
Requirement already satisfied: accelerate in /usr/local/lib/python3.10/dist-packages (0.20.3)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate) (1.22.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (23.1)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate) (5.9.5)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate) (6.0)
Requirement already satisfied: torch>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (2.0.1+cu118)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->accelerate) (3.12.2)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->accelerate) (4.6.3)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->accelerate) (1.11.1)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->accelerate) (3.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->accelerate) (3.1.2)
Requirement already satisfied: triton==2.0.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->accelerate) (2.0.0)
Requirement already satisfied: cmake in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.6.0->accelerate) (3.25.2)
Requirement already satisfied: lit in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch>=1.6.0->accelerate) (16.0.6)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.6.0->accelerate) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.6.0->accelerate) (1.3.0)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (0.1.99)

Importing Libraries¶

In [124]:

import huggingface_hub # Importing the huggingface_hub library for model sharing and versioning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import transformers
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import os

from datasets import DatasetDict, Dataset
from sklearn.metrics import mean_squared_error, classification_report

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
from transformers import TrainingArguments, Trainer
from google.colab import drive
# import zipfile
import torch

Load Dataset and Delete Null Values¶

In [125]:

real_url = "https://raw.githubusercontent.com/KaiDMML/FakeNewsNet/master/dataset/gossipcop_real.csv"
fake_url = "https://raw.githubusercontent.com/KaiDMML/FakeNewsNet/master/dataset/gossipcop_fake.csv"
# fake_url = "https://raw.githubusercontent.com/KaiDMML/FakeNewsNet/master/dataset/politifact_fake.csv"
# real_url = "https://raw.githubusercontent.com/KaiDMML/FakeNewsNet/master/dataset/politifact_real.csv"

# Read the csv file from the url
fake = pd.read_csv(fake_url)
real = pd.read_csv(real_url)

# A way to delete rows with empty or null values
fake = fake[~fake.isna().any(axis=1)]
real = real[~real.isna().any(axis=1)]

In [126]:

fake["label"] = 1
real["label"] = 0

<ipython-input-126-89cb7829fca8>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fake["label"] = 1

In [127]:

df = pd.concat([fake, real], axis =0 )
df.head(10)

Out[127]:

	id	news_url	title	tweet_ids	label
0	gossipcop-2493749932	www.dailymail.co.uk/tvshowbiz/article-5874213/...	Did Miley Cyrus and Liam Hemsworth secretly ge...	284329075902926848\t284332744559968256\t284335...	1
1	gossipcop-4580247171	hollywoodlife.com/2018/05/05/paris-jackson-car...	Paris Jackson & Cara Delevingne Enjoy Night Ou...	992895508267130880\t992897935418503169\t992899...	1
2	gossipcop-941805037	variety.com/2017/biz/news/tax-march-donald-tru...	Celebrities Join Tax March in Protest of Donal...	853359353532829696\t853359576543920128\t853359...	1
3	gossipcop-2547891536	www.dailymail.co.uk/femail/article-3499192/Do-...	Cindy Crawford's daughter Kaia Gerber wears a ...	988821905196158981\t988824206556172288\t988825...	1
4	gossipcop-5476631226	variety.com/2018/film/news/list-2018-oscar-nom...	Full List of 2018 Oscar Nominations – Variety	955792793632432131\t955795063925301249\t955798...	1
5	gossipcop-5189580095	www.townandcountrymag.com/society/tradition/a1...	Here's What Really Happened When JFK Jr. Met P...	890253005299351552\t890401381814870016\t890491...	1
6	gossipcop-9588339534	www.foxnews.com/entertainment/2016/12/16/bigge...	Biggest celebrity scandals of 2016	683226380742557696\t748604615503929345\t748604...	1
7	gossipcop-8753274298	www.eonline.com/news/958257/caitlyn-jenner-add...	Caitlyn Jenner Addresses Rumored Romance With ...	1026891446081728512\t1026891745219543043\t1026...	1
8	gossipcop-8105333868	www.inquisitr.com/3871816/taylor-swift-reporte...	Taylor Swift Reportedly Reacts To Tom Hiddlest...	818928533569437697\t819100640878202880\t819174...	1
9	gossipcop-2803748870	www.huffingtonpost.com/entry/kate-mckinnon-the...	For The Love Of God, Why Can't Anyone Write Ka...	816030248190046212\t816030859484626947\t816049...	1

Splitting the dataset¶

In [128]:

# Split the train data => {train, eval}  train 80%, test 20%
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [129]:

# get the first 5 rows of the train set to make sure it looks right
train.head()

Out[129]:

	id	news_url	title	tweet_ids
2211	gossipcop-901009	https://www.englishbaby.com/vocab/word/5457	What does "cultural event" mean?	947425653900660737\t947425752525561857\t947425...
16164	gossipcop-882599	https://people.com/country/jessie-james-decker...	Jessie James Decker Says NFL Star Husband Eric...	912398822659178497\t912483537206616065
11087	gossipcop-854980	http://celebrityinsider.org/candace-cameron-bu...	Candace Cameron Bure Discusses Her ‘Addiction’...	865720564496859136\t865720884484374528\t865720...
5830	gossipcop-842466	https://www.thesun.co.uk/tvandshowbiz/7955772/...	David and Victoria Beckham barely speak at fas...	851464446614663168\t851467744243585026\t851468...
10598	gossipcop-890551	https://www.vanityfair.com/hollywood/2017/10/s...	Stranger Things: Noah Schnapp on the Character...	925429690042523648\t925429880153624578\t925430...

In [130]:

# check datatypes of the train set, object can mean text or string
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16516 entries, 2211 to 2430
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         16516 non-null  object
 1   news_url   16516 non-null  object
 2   title      16516 non-null  object
 3   tweet_ids  16516 non-null  object
 4   label      16516 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 774.2+ KB

In [131]:

# get the first 5 rows of the eval or test set
eval.head()

Out[131]:

	id	news_url	title	tweet_ids	label
14713	gossipcop-910751	http://wstale.com/celebrities/5-unexpected-val...	5 Unexpected Valentine’s Day Outfit Ideas—No L...	958827824584056837\t958828582318391296\t958828...	0
3531	gossipcop-9408067324	www.inquisitr.com/4279012/katie-holmes-jamie-f...	Katie Holmes, Jamie Foxx Spending Millions To ...	869600536131227648\t869600543882268672\t869611...	1
4070	gossipcop-900528	http://time.com/money/5084724/golden-globes-20...	How to Watch the 2018 Golden Globes for Free	948650012367564800	0
3665	gossipcop-899642	https://deadline.com/2018/09/the-marvelous-mrs...	‘The Marvelous Mrs. Maisel’ Season 2 To Defini...	940611199519117312\t940615533967396864\t940615...	0
10362	gossipcop-907483	https://www.instyle.com/news/chrissy-teigen-to...	Chrissy Teigen Poses Topless to Show Off the S...	954401657084895233\t954402855338807296\t954403...	0

In [132]:

eval.label.unique()

Out[132]:

array([0, 1])

In [133]:

print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (16516, 5), eval is (4129, 5)

In [134]:

# 90 true, 10 fake, 70, 30
# 40, 60 good, 55, 45 is good
# Checking if our df is well balanced
label_size = [df['label'].sum(),len(df['label'])-df['label'].sum()]
plt.pie(label_size,explode=[0.1,0.1],colors=['firebrick','navy'],startangle=90,shadow=True,labels=['Fake','True'],autopct='%1.1f%%')

Out[134]:

([<matplotlib.patches.Wedge at 0x7fe432b91780>,
  <matplotlib.patches.Wedge at 0x7fe432b93ee0>],
 [Text(-0.8138650525648107, 0.8818297319855336, 'Fake'),
  Text(0.8138650938462391, -0.8818296938857598, 'True')],
 [Text(-0.4747546139961395, 0.5144006769915612, '23.7%'),
  Text(0.4747546380769727, -0.5144006547666932, '76.3%')])

In [153]:

# # Get the minority class
# minority_class = df[df["label"] == 1]

# # Upsample the minority class (1 which is fake)
# minority_upsampled = resample(minority_class,
#                              replace=True,
#                              n_samples=df["label"].value_counts()[0],
#                              random_state=123)

# # Combine the upsampled minority class with the majority class
# df_upsampled = pd.concat([df[df["label"] == 0], minority_upsampled])

In [154]:

# Get the majority class
majority_class = df[df["label"] == 0]

# downsample the majority class (0 which is real)
majority_downsampled = resample(majority_class,
                             replace=True,
                             n_samples=df["label"].value_counts()[1],
                             random_state=123)

# Combine the downsampled majority class with the majority class
df_downsampled = pd.concat([df[df["label"] == 1], majority_downsampled])

In [155]:

# Checking if our df_resampled is well balanced
label_size = [df_resampled['label'].sum(),len(df_resampled['label'])-df_resampled['label'].sum()]
plt.pie(label_size,explode=[0.1,0.1],colors=['firebrick','navy'],startangle=90,shadow=True,labels=['Fake','True'],autopct='%1.1f%%')

Out[155]:

([<matplotlib.patches.Wedge at 0x7fe432475420>,
  <matplotlib.patches.Wedge at 0x7fe432477700>],
 [Text(-1.2000000000000002, 1.469576158976824e-16, 'Fake'),
  Text(1.2000000000000002, -2.939152317953648e-16, 'True')],
 [Text(-0.7, 8.572527594031472e-17, '50.0%'),
  Text(0.7, -1.7145055188062944e-16, '50.0%')])

In [156]:

# Split the train data => {train, eval}  train 80%, test 20%
train, eval = train_test_split(df_resampled, test_size=0.2, random_state=42, stratify=df_resampled['label'])

Creating a pytorch dataset¶

In [157]:

# transformers library allows you to use pytorch or tensorflow to save your dataset
# pytorch dataset looks like a dictoinary
# using this rep works well with the transformers library

# Create a pytorch dataset to ensure consistency in our data handling

# Create a train and eval datasets using the specified columns from the DataFrame
train_dataset = Dataset.from_pandas(train[['tweet_ids', 'news_url', 'title', 'label']])
eval_dataset = Dataset.from_pandas(eval[['tweet_ids', 'news_url', 'title', 'label']])

# Combine the train and eval datasets into a DatasetDict
dataset = DatasetDict({'train': train_dataset, 'eval': eval_dataset})

# Remove the '__index_level_0__' column from the dataset
dataset = dataset.remove_columns('__index_level_0__')
dataset

Out[157]:

DatasetDict({
    train: Dataset({
        features: ['tweet_ids', 'news_url', 'title', 'label'],
        num_rows: 25195
    })
    eval: Dataset({
        features: ['tweet_ids', 'news_url', 'title', 'label'],
        num_rows: 6299
    })
})

Preprocessing and tokenization¶

In [158]:

# define helper functions

# funtion to replace usernames and links with placeholders.
def preprocess(text):
  # "@user my name is john"
  # "http my name is john"
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# no need for encoding: Fake=1, True=0 bcuz the target variable called label is already encoded

In [159]:

# Define the apply_preprocess function
def apply_preprocess(dataset, column='title'):
    return dataset.map(lambda example: {column: preprocess(example[column])},
                       remove_columns=[column])

# Apply the preprocess function to the 'title' column in both 'train' and 'eval' datasets
dataset['train'] = apply_preprocess(dataset['train'])
dataset['eval'] = apply_preprocess(dataset['eval'])

Map:   0%|          | 0/25195 [00:00<?, ? examples/s]

Map:   0%|          | 0/6299 [00:00<?, ? examples/s]

Tokenization¶

In [160]:

# Plot histogram of the number of words in train data 'text'
seq_len = [len(text.split()) for text in df['title']]

pd.Series(seq_len).hist(bins = 40,color='firebrick')
plt.xlabel('Number of Words')
plt.ylabel('Number of texts')

Out[160]:

Text(0, 0.5, 'Number of texts')

In [161]:

# define the tokenizer
tokenizer = AutoTokenizer.from_pretrained("jy46604790/Fake-News-Bert-Detect")

def tokenize_data(example):
    return tokenizer(example['title'], padding='max_length', # compress all sentences to maximum of 30 words which is the max_length
                     truncation=True, # cut the sentenced to 30_words
                     max_length=40 # increasing the max length doesn't guarantee a better score
                     )

# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)

# Transform	labels and remove the useless columns or columns that are not tokenized
remove_columns = ['tweet_ids', 'news_url', 'title']
dataset = dataset.map(remove_columns=remove_columns)
dataset

Map:   0%|          | 0/25195 [00:00<?, ? examples/s]

Map:   0%|          | 0/6299 [00:00<?, ? examples/s]

Map:   0%|          | 0/25195 [00:00<?, ? examples/s]

Map:   0%|          | 0/6299 [00:00<?, ? examples/s]

Out[161]:

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25195
    })
    eval: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 6299
    })
})

Trianing¶

In [162]:

# Loading a pretrain model for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("jy46604790/Fake-News-Bert-Detect")

In [163]:

# Configure the trianing parameters like `num_train_epochs`:
# the number of time the model will repeat the training loop over the dataset
training_args = TrainingArguments("test_trainer",
                                  num_train_epochs=10, # epoch is ow many times you repeat training
                                  load_best_model_at_end=True,
                                  save_strategy='epoch',
                                  evaluation_strategy='epoch',
                                  logging_strategy='epoch',
                                  per_device_train_batch_size=32, # smaller batches take longer to train
                                  )

In [164]:

# set up the optimizer with the PyTorch implementation of AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5) # I specified the optimizer to avoid a warning message

In [165]:

train_dataset = dataset['train'].shuffle(seed=24)
eval_dataset = dataset['eval'].shuffle(seed=24) # scatter the dataset 24 times randomly

In [166]:

def compute_metrics(eval_pred):      # specify the evaluation metric
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "rmse": mean_squared_error(labels, predictions, squared=False),
        "classification_report": classification_report(labels, predictions)
    }

In [167]:

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [168]:

trainer.train()  # rmse 0 to 1 closer to 0 means better performance.

/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

[7880/7880 35:39, Epoch 10/10]

Epoch	Training Loss	Validation Loss	Rmse	Classification Report
1	0.439900	0.366631	0.393631	precision recall f1-score support 0 0.93 0.75 0.83 3150 1 0.79 0.94 0.86 3149 accuracy 0.85 6299 macro avg 0.86 0.85 0.84 6299 weighted avg 0.86 0.85 0.84 6299
2	0.261600	0.289143	0.326625	precision recall f1-score support 0 0.96 0.83 0.89 3150 1 0.85 0.96 0.90 3149 accuracy 0.89 6299 macro avg 0.90 0.89 0.89 6299 weighted avg 0.90 0.89 0.89 6299
3	0.176700	0.203825	0.257604	precision recall f1-score support 0 0.94 0.93 0.93 3150 1 0.93 0.94 0.93 3149 accuracy 0.93 6299 macro avg 0.93 0.93 0.93 6299 weighted avg 0.93 0.93 0.93 6299
4	0.134400	0.202296	0.249782	precision recall f1-score support 0 0.94 0.93 0.94 3150 1 0.93 0.95 0.94 3149 accuracy 0.94 6299 macro avg 0.94 0.94 0.94 6299 weighted avg 0.94 0.94 0.94 6299
5	0.100000	0.295655	0.281459	precision recall f1-score support 0 0.97 0.86 0.92 3150 1 0.88 0.98 0.93 3149 accuracy 0.92 6299 macro avg 0.93 0.92 0.92 6299 weighted avg 0.93 0.92 0.92 6299
6	0.084600	0.241749	0.250416	precision recall f1-score support 0 0.97 0.91 0.94 3150 1 0.91 0.97 0.94 3149 accuracy 0.94 6299 macro avg 0.94 0.94 0.94 6299 weighted avg 0.94 0.94 0.94 6299
7	0.062200	0.262496	0.240059	precision recall f1-score support 0 0.97 0.91 0.94 3150 1 0.92 0.97 0.94 3149 accuracy 0.94 6299 macro avg 0.94 0.94 0.94 6299 weighted avg 0.94 0.94 0.94 6299
8	0.052400	0.318661	0.251050	precision recall f1-score support 0 0.97 0.90 0.93 3150 1 0.91 0.98 0.94 3149 accuracy 0.94 6299 macro avg 0.94 0.94 0.94 6299 weighted avg 0.94 0.94 0.94 6299
9	0.039800	0.335541	0.254192	precision recall f1-score support 0 0.98 0.89 0.93 3150 1 0.90 0.98 0.94 3149 accuracy 0.94 6299 macro avg 0.94 0.94 0.94 6299 weighted avg 0.94 0.94 0.94 6299
10	0.030500	0.308559	0.240389	precision recall f1-score support 0 0.97 0.91 0.94 3150 1 0.91 0.98 0.94 3149 accuracy 0.94 6299 macro avg 0.94 0.94 0.94 6299 weighted avg 0.94 0.94 0.94 6299

Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.93      0.75      0.83      3150
           1       0.79      0.94      0.86      3149

    accuracy                           0.85      6299
   macro avg       0.86      0.85      0.84      6299
weighted avg       0.86      0.85      0.84      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.96      0.83      0.89      3150
           1       0.85      0.96      0.90      3149

    accuracy                           0.89      6299
   macro avg       0.90      0.89      0.89      6299
weighted avg       0.90      0.89      0.89      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.94      0.93      0.93      3150
           1       0.93      0.94      0.93      3149

    accuracy                           0.93      6299
   macro avg       0.93      0.93      0.93      6299
weighted avg       0.93      0.93      0.93      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.94      0.93      0.94      3150
           1       0.93      0.95      0.94      3149

    accuracy                           0.94      6299
   macro avg       0.94      0.94      0.94      6299
weighted avg       0.94      0.94      0.94      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.97      0.86      0.92      3150
           1       0.88      0.98      0.93      3149

    accuracy                           0.92      6299
   macro avg       0.93      0.92      0.92      6299
weighted avg       0.93      0.92      0.92      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.97      0.91      0.94      3150
           1       0.91      0.97      0.94      3149

    accuracy                           0.94      6299
   macro avg       0.94      0.94      0.94      6299
weighted avg       0.94      0.94      0.94      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.97      0.91      0.94      3150
           1       0.92      0.97      0.94      3149

    accuracy                           0.94      6299
   macro avg       0.94      0.94      0.94      6299
weighted avg       0.94      0.94      0.94      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.97      0.90      0.93      3150
           1       0.91      0.98      0.94      3149

    accuracy                           0.94      6299
   macro avg       0.94      0.94      0.94      6299
weighted avg       0.94      0.94      0.94      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.98      0.89      0.93      3150
           1       0.90      0.98      0.94      3149

    accuracy                           0.94      6299
   macro avg       0.94      0.94      0.94      6299
weighted avg       0.94      0.94      0.94      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.97      0.91      0.94      3150
           1       0.91      0.98      0.94      3149

    accuracy                           0.94      6299
   macro avg       0.94      0.94      0.94      6299
weighted avg       0.94      0.94      0.94      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

Out[168]:

TrainOutput(global_step=7880, training_loss=0.13820908251147584, metrics={'train_runtime': 2139.7739, 'train_samples_per_second': 117.746, 'train_steps_per_second': 3.683, 'total_flos': 5178971124840000.0, 'train_loss': 0.13820908251147584, 'epoch': 10.0})

Don't worry the above issue, it is a KeyboardInterrupt that means I stopped the training to avoid taking a long time to finish.

In [169]:

# Launch the final evaluation
trainer.evaluate()  # eval loss is the performance cost of finetuning (0 to 1) 0.5 and above is not suitable.

[788/788 00:16]

Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.94      0.93      0.94      3150
           1       0.93      0.95      0.94      3149

    accuracy                           0.94      6299
   macro avg       0.94      0.94      0.94      6299
weighted avg       0.94      0.94      0.94      6299
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

Out[169]:

{'eval_loss': 0.20229628682136536,
 'eval_rmse': 0.2497816159996159,
 'eval_classification_report': '              precision    recall  f1-score   support\n\n           0       0.94      0.93      0.94      3150\n           1       0.93      0.95      0.94      3149\n\n    accuracy                           0.94      6299\n   macro avg       0.94      0.94      0.94      6299\nweighted avg       0.94      0.94      0.94      6299\n',
 'eval_runtime': 16.1801,
 'eval_samples_per_second': 389.306,
 'eval_steps_per_second': 48.702,
 'epoch': 10.0}

Pushing to HuggingFace¶

Some checkpoints of the model are automatically saved locally in test_trainer/ during the training.

You may also upload the model on the Hugging Face Platform... Read more

In [170]:

huggingface_hub.notebook_login()
# login to the Hugging Face Hub with your token

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [171]:

# # Push model and tokenizer to HugginFace
model.push_to_hub("ikoghoemmanuell/finetuned_fake_news_bert") # (username/model_name)
tokenizer.push_to_hub("ikoghoemmanuell/finetuned_fake_news_bert")

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 2>:2                                                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:803 in push_to_hub             │
│                                                                                                  │
│    800 │   │   else:                                                                             │
│    801 │   │   │   working_dir = repo_id.split("/")[-1]                                          │
│    802 │   │                                                                                     │
│ ❱  803 │   │   repo_id = self._create_repo(                                                      │
│    804 │   │   │   repo_id, private=private, use_auth_token=use_auth_token, repo_url=repo_url,   │
│    805 │   │   )                                                                                 │
│    806                                                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:661 in _create_repo            │
│                                                                                                  │
│    658 │   │   │   │   │   repo_id = repo_id.split("/")[-1]                                      │
│    659 │   │   │   │   repo_id = f"{organization}/{repo_id}"                                     │
│    660 │   │                                                                                     │
│ ❱  661 │   │   url = create_repo(repo_id=repo_id, token=use_auth_token, private=private, exist_  │
│    662 │   │                                                                                     │
│    663 │   │   # If the namespace is not there, add it or `upload_file` will complain            │
│    664 │   │   if "/" not in repo_id and url != f"{HUGGINGFACE_CO_RESOLVE_ENDPOINT}/{repo_id}":  │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py:118 in _inner_fn    │
│                                                                                                  │
│   115 │   │   if check_use_auth_token:                                                           │
│   116 │   │   │   kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=ha   │
│   117 │   │                                                                                      │
│ ❱ 118 │   │   return fn(*args, **kwargs)                                                         │
│   119 │                                                                                          │
│   120 │   return _inner_fn  # type: ignore                                                       │
│   121                                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/huggingface_hub/hf_api.py:2304 in create_repo            │
│                                                                                                  │
│   2301 │   │   │   # Testing purposes only.                                                      │
│   2302 │   │   │   # See https://github.com/huggingface/huggingface_hub/pull/733/files#r8206044  │
│   2303 │   │   │   json["lfsmultipartthresh"] = self._lfsmultipartthresh  # type: ignore         │
│ ❱ 2304 │   │   headers = self._build_hf_headers(token=token, is_write_action=True)               │
│   2305 │   │   r = get_session().post(path, headers=headers, json=json)                          │
│   2306 │   │                                                                                     │
│   2307 │   │   try:                                                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/huggingface_hub/hf_api.py:5008 in _build_hf_headers      │
│                                                                                                  │
│   5005 │   │   if token is None:                                                                 │
│   5006 │   │   │   # Cannot do `token = token or self.token` as token can be `False`.            │
│   5007 │   │   │   token = self.token                                                            │
│ ❱ 5008 │   │   return build_hf_headers(                                                          │
│   5009 │   │   │   token=token,                                                                  │
│   5010 │   │   │   is_write_action=is_write_action,                                              │
│   5011 │   │   │   library_name=library_name or self.library_name,                               │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py:118 in _inner_fn    │
│                                                                                                  │
│   115 │   │   if check_use_auth_token:                                                           │
│   116 │   │   │   kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=ha   │
│   117 │   │                                                                                      │
│ ❱ 118 │   │   return fn(*args, **kwargs)                                                         │
│   119 │                                                                                          │
│   120 │   return _inner_fn  # type: ignore                                                       │
│   121                                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_headers.py:122 in                 │
│ build_hf_headers                                                                                 │
│                                                                                                  │
│   119 │   """                                                                                    │
│   120 │   # Get auth token to send                                                               │
│   121 │   token_to_send = get_token_to_send(token)                                               │
│ ❱ 122 │   _validate_token_to_send(token_to_send, is_write_action=is_write_action)                │
│   123 │                                                                                          │
│   124 │   # Combine headers                                                                      │
│   125 │   headers = {                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_headers.py:172 in                 │
│ _validate_token_to_send                                                                          │
│                                                                                                  │
│   169 def _validate_token_to_send(token: Optional[str], is_write_action: bool) -> None:          │
│   170 │   if is_write_action:                                                                    │
│   171 │   │   if token is None:                                                                  │
│ ❱ 172 │   │   │   raise ValueError(                                                              │
│   173 │   │   │   │   "Token is required (write-access action) but no token found. You need"     │
│   174 │   │   │   │   " to provide a token or be logged in to Hugging Face with"                 │
│   175 │   │   │   │   " `huggingface-cli login` or `huggingface_hub.login`. See"                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Token is required (write-access action) but no token found. You need to provide a token or be logged in
to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See 
https://huggingface.co/settings/tokens.