Notebook

Note: Uncomment and run the following cells if you work on Google Colab :) Don't forget to change your runtime type to GPU!

In [ ]:

# !git clone https://github.com/kstathou/vector_engine

In [ ]:

# cd vector_engine

In [ ]:

# pip install -r requirements.txt

Let's begin!¶

In [ ]:

%load_ext autoreload

In [ ]:

%autoreload 2
# Used to import data from S3.
import pandas as pd
import s3fs

# Used to create the dense document vectors.
import torch
from sentence_transformers import SentenceTransformer

# Used to create and store the Faiss index.
import faiss
import numpy as np
import pickle
from pathlib import Path

# Used to do vector searches and display the results.
from vector_engine.utils import vector_search, id2details

Stored and processed data in s3

In [ ]:

# Use pandas to read files from S3 buckets!
df = pd.read_csv('s3://vector-search-blog/misinformation_papers.csv')

In [ ]:

df.head(3)

Out[ ]:

	original_title	abstract	year	citations	id	is_EN
0	When Corrections Fail: The Persistence of Poli...	An extensive literature addresses citizen igno...	2010	901	2132553681	1
1	A postmodern Pandora's box: anti-vaccination m...	The Internet plays a large role in disseminati...	2010	440	2117485795	1
2	Spread of (Mis)Information in Social Networks	We provide a model to investigate the tension ...	2010	278	2120015072	1

In [ ]:

print(f"Misinformation, disinformation and fake news papers: {df.id.unique().shape[0]}")

Misinformation, disinformation and fake news papers: 8430

The Sentence Transformers library offers pretrained transformers that produce SOTA sentence embeddings. Checkout this spreadsheet with all the available models.

In this tutorial, we will use the distilbert-base-nli-stsb-mean-tokens model which has the best performance on Semantic Textual Similarity tasks among the DistilBERT versions. Moreover, although it's slightly worse than BERT, it is quite faster thanks to having a smaller size.

I use the same model in Orion's semantic search engine!

In [ ]:

# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Check if GPU is available and use it
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)

100%|██████████| 245M/245M [00:14<00:00, 16.6MB/s]

cuda:0

In [ ]:

# Convert abstracts to vectors
embeddings = model.encode(df.abstract.to_list(), show_progress_bar=True)

HBox(children=(FloatProgress(value=0.0, description='Batches', max=264.0, style=ProgressStyle(description_widt…

In [ ]:

print(f'Shape of the vectorised abstract: {embeddings[0].shape}')

Shape of the vectorised abstract: (768,)

Vector similarity search with Faiss¶

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, even ones that do not fit in RAM.

Faiss is built around the Index object which contains, and sometimes preprocesses, the searchable vectors. Faiss has a large collection of indexes. You can even create composite indexes. Faiss handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s.

Note: Faiss uses only 32-bit floating point matrices. This means that you will have to change the data type of the input before building the index.

To learn more about Faiss, you can read their paper on arXiv.

Here, we will the IndexFlatL2 index:

It's a simple index that performs a brute-force L2 distance search
It scales linearly. It will work fine with our data but you might want to try faster indexes if you work will millions of vectors.

To create an index with the misinformation abstract vectors, we will:

Change the data type of the abstract vectors to float32.
Build an index and pass it the dimension of the vectors it will operate on.
Pass the index to IndexIDMap, an object that enables us to provide a custom list of IDs for the indexed vectors.
Add the abstract vectors and their ID mapping to the index. In our case, we will map vectors to their paper IDs from MAG.

In [ ]:

# Step 1: Change data type
embeddings = np.array([embedding for embedding in embeddings]).astype("float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, df.id.values)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 8430

Searching the index¶

The index we built will perform a k-nearest-neighbour search. We have to provide the number of neighbours to be returned.

Let's query the index with an abstract from our dataset and retrieve the 10 most relevant documents. The first one must be our query!

In [ ]:

# Paper abstract
df.iloc[5415, 1]

Out[ ]:

"We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and provide a differential assessment on the evolution of the discourse on a global scale for each platform and their users. We fit information spreading with epidemic models characterizing the basic reproduction number [Formula: see text] for each social media platform. Moreover, we identify information spreading from questionable sources, finding different volumes of misinformation in each platform. However, information from both reliable and questionable sources do not present different spreading patterns. Finally, we provide platform-dependent numerical estimates of rumors' amplification."

In [ ]:

# Retrieve the 10 nearest neighbours
D, I = index.search(np.array([embeddings[5415]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 1.267284631729126, 62.72160339355469, 63.670326232910156, 64.58393859863281, 67.47344970703125, 67.96402740478516, 69.47564697265625, 72.5633544921875, 74.62230682373047]

MAG paper IDs: [3092618151, 3011345566, 3012936764, 3055557295, 3011186656, 3044429417, 3092128270, 3024620668, 3047284882, 3048848247]

In [ ]:

# Fetch the paper titles based on their index
id2details(df, I, 'original_title')

Out[ ]:

[['The COVID-19 social media infodemic.'],
 ['The COVID-19 Social Media Infodemic'],
 ['Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset'],
 ['Covid-19 infodemic reveals new tipping point epidemiology and a revised R formula.'],
 ['Coronavirus Goes Viral: Quantifying the COVID-19 Misinformation Epidemic on Twitter'],
 ['Effects of misinformation on COVID-19 individual responses and recommendations for resilience of disastrous consequences of misinformation'],
 ['Analysis of online misinformation during the peak of the COVID-19 pandemics in Italy'],
 ['Quantifying COVID-19 Content in the Online Health Opinion War Using Machine Learning'],
 ['Global Infodemiology of COVID-19: Analysis of Google Web Searches and Instagram Hashtags.'],
 ['COVID-19-Related Infodemic and Its Impact on Public Health: A Global Social Media Analysis.']]

In [ ]:

# Fetch the paper abstracts based on their index
id2details(df, I, 'abstract')

Out[ ]:

[["We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and provide a differential assessment on the evolution of the discourse on a global scale for each platform and their users. We fit information spreading with epidemic models characterizing the basic reproduction number [Formula: see text] for each social media platform. Moreover, we identify information spreading from questionable sources, finding different volumes of misinformation in each platform. However, information from both reliable and questionable sources do not present different spreading patterns. Finally, we provide platform-dependent numerical estimates of rumors' amplification."],
["We address the diffusion of information about the COVID-19 with a massive data analysis on Twitter, Instagram, YouTube, Reddit and Gab. We analyze engagement and interest in the COVID-19 topic and provide a differential assessment on the evolution of the discourse on a global scale for each platform and their users. We fit information spreading with epidemic models characterizing the basic reproduction numbers $R_0$ for each social media platform. Moreover, we characterize information spreading from questionable sources, finding different volumes of misinformation in each platform. However, information from both reliable and questionable sources do not present different spreading patterns. Finally, we provide platform-dependent numerical estimates of rumors' amplification."],
['The objective of this work is to explore popular discourse about the COVID-19 pandemic and policies implemented to manage it. Using Natural Language Processing, Text Mining, and Network Analysis to analyze corpus of tweets that relate to the COVID-19 pandemic, we identify common responses to the pandemic and how these responses differ across time. Moreover, insights as to how information and misinformation were transmitted via Twitter, starting at the early stages of this pandemic, are presented. Finally, this work introduces a dataset of tweets collected from all over the world, in multiple languages, dating back to January 22nd, when the total cases of reported COVID-19 were below 600 worldwide. The insights presented in this work could help inform decision makers in the face of future pandemics, and the dataset introduced can be used to acquire valuable knowledge to help mitigate the COVID-19 pandemic.'],
["Many governments have managed to control their COVID-19 outbreak with a simple message: keep the effective '$R$ number' $R<1$ to prevent widespread contagion and flatten the curve. This raises the question whether a similar policy could control dangerous online 'infodemics' of information, misinformation and disinformation. Here we show, using multi-platform data from the COVID-19 infodemic, that its online spreading instead encompasses a different dynamical regime where communities and users within and across independent platforms, sporadically form temporary active links on similar timescales to the viral spreading. This allows material that might have died out, to evolve and even mutate. This has enabled niche networks that were already successfully spreading hate and anti-vaccination material, to rapidly become global super-spreaders of narratives featuring fake COVID-19 treatments, anti-Asian sentiment and conspiracy theories. We derive new tools that incorporate these coupled social-viral dynamics, including an online $R$, to help prevent infodemic spreading at all scales: from spreading across platforms (e.g. Facebook, 4Chan) to spreading within a given subpopulation, or community, or topic. By accounting for similar social and viral timescales, the same mathematical theory also offers a quantitative description of other unconventional infection profiles such as rumors spreading in financial markets and colds spreading in schools."],
['Background Since the beginning of the coronavirus disease 2019 (COVID-19) epidemic, misinformation has been spreading\xa0uninhibited\xa0over traditional and social media at a rapid pace. We sought to analyze the magnitude of misinformation that is being spread on Twitter\xa0(Twitter, Inc., San Francisco, CA) regarding the coronavirus epidemic.\xa0 Materials and methods We conducted a search on Twitter using 14 different trending hashtags and keywords related to the COVID-19 epidemic. We then summarized and assessed individual tweets for misinformation in comparison to verified and peer-reviewed resources. Descriptive statistics were used to compare\xa0terms and hashtags, and to identify individual tweets and account characteristics. Results The study included 673 tweets. Most tweets were posted by informal individuals/groups (66%), and 129 (19.2%) belonged to verified Twitter accounts. The majority of included tweets contained serious content (91.2%); 548 tweets (81.4%) included genuine information pertaining to the COVID-19 epidemic. Around 70% of the tweets tackled medical/public health information, while the others were pertaining to sociopolitical and financial factors. In total, 153 tweets (24.8%) included misinformation, and 107 (17.4%) included unverifiable information regarding the COVID-19 epidemic. The rate of misinformation was higher among informal individual/group accounts (33.8%, p: <0.001). Tweets from unverified Twitter accounts contained more misinformation (31.0% vs 12.6% for verified accounts, p: <0.001). Tweets from healthcare/public health accounts had the lowest rate of unverifiable information (12.3%, p: 0.04). The number of likes and retweets per tweet was not associated with a difference in either false or unverifiable content. The keyword "COVID-19" had the lowest rate of misinformation and unverifiable information, while the keywords "#2019_ncov" and "Corona" were associated with the highest amount of misinformation and unverifiable content respectively. Conclusions Medical misinformation and unverifiable content pertaining to the global COVID-19 epidemic are being propagated at an alarming rate on social media. We provide an early quantification of the magnitude of misinformation spread and highlight the importance of early interventions in order to curb this phenomenon that endangers public safety at a time when awareness and appropriate preventive actions are paramount.'],
['Abstract The proliferation of misinformation on social media platforms is faster than the spread of Corona Virus Diseases (COVID-19) and it can generate hefty deleterious consequences on health amid a disaster like COVID-19. Drawing upon research on the stimulus-response theory (hypodermic needle theory) and the resilience theory, this study tested a conceptual framework considering general misinformation belief, conspiracy belief, and religious misinformation belief as the stimulus; and credibility evaluations as resilience strategy; and their effects on COVID-19 individual responses. Using a self-administered online survey during the COVID-19 pandemic, the study obtained 483 useable responses and after test, finds that all-inclusive, the propagation of misinformation on social media undermines the COVID-19 individual responses. Particularly, credibility evaluation of misinformation strongly predicts the COVID-19 individual responses with positive influences and religious misinformation beliefs as well as conspiracy beliefs and general misinformation beliefs come next and influence negatively. The findings and general recommendations will help the public, in general, to be cautious about misinformation, and the respective authority of a country, in particular, for initiating proper safety measures about disastrous misinformation to protect the public health from being exploited.'],
['During the Covid-19 pandemics, we also experience another dangerous pandemics based on misinformation. Narratives disconnected from fact-checking on the origin and cure of the disease intertwined with pre-existing political fights. We collect a database on Twitter posts and analyse the topology of the networks of retweeters (users broadcasting again the same elementary piece of information, or tweet) and validate its structure with methods of statistical physics of networks. Furthermore, by using commonly available fact checking software, we assess the reputation of the pieces of news exchanged. By using a combination of theoretical and practical weapons, we are able to track down the flow of misinformation in a snapshot of the Twitter ecosystem. Thanks to the presence of verified users, we can also assign a polarization to the network nodes (users) and see the impact of low-quality information producers and spreaders in the Twitter ecosystem.'],
['A huge amount of potentially dangerous COVID-19 misinformation is appearing online. Here we use machine learning to quantify COVID-19 content among online opponents of establishment health guidance, in particular vaccinations (“anti-vax”). We find that the anti-vax community is developing a less focused debate around COVID-19 than its counterpart, the pro-vaccination (“pro-vax”) community. However, the anti-vax community exhibits a broader range of “flavors” of COVID-19 topics, and hence can appeal to a broader cross-section of individuals seeking COVID-19 guidance online, e.g. individuals wary of a mandatory fast-tracked COVID-19 vaccine or those seeking alternative remedies. Hence the anti-vax community looks better positioned to attract fresh support going forward than the pro-vax community. This is concerning since a widespread lack of adoption of a COVID-19 vaccine will mean the world falls short of providing herd immunity, leaving countries open to future COVID-19 resurgences. We provide a mechanistic model that interprets these results and could help in assessing the likely efficacy of intervention strategies. Our approach is scalable and hence tackles the urgent problem facing social media platforms of having to analyze huge volumes of online health misinformation and disinformation.'],
['BACKGROUND: Although "infodemiological" methods have been used in research on coronavirus disease (COVID-19), an examination of the extent of infodemic moniker (misinformation) use on the internet remains limited. OBJECTIVE: The aim of this paper is to investigate internet search behaviors related to COVID-19 and examine the circulation of infodemic monikers through two platforms-Google and Instagram-during the current global pandemic. METHODS: We have defined infodemic moniker as a term, query, hashtag, or phrase that generates or feeds fake news, misinterpretations, or discriminatory phenomena. Using Google Trends and Instagram hashtags, we explored internet search activities and behaviors related to the COVID-19 pandemic from February 20, 2020, to May 6, 2020. We investigated the names used to identify the virus, health and risk perception, life during the lockdown, and information related to the adoption of COVID-19 infodemic monikers. We computed the average peak volume with a 95% CI for the monikers. RESULTS: The top six COVID-19-related terms searched in Google were "coronavirus," "corona," "COVID," "virus," "corona virus," and "COVID-19." Countries with a higher number of COVID-19 cases had a higher number of COVID-19 queries on Google. The monikers "coronavirus ozone," "coronavirus laboratory," "coronavirus 5G," "coronavirus conspiracy," and "coronavirus bill gates" were widely circulated on the internet. Searches on "tips and cures" for COVID-19 spiked in relation to the US president speculating about a "miracle cure" and suggesting an injection of disinfectant to treat the virus. Around two thirds (n=48,700,000, 66.1%) of Instagram users used the hashtags "COVID-19" and "coronavirus" to disperse virus-related information. CONCLUSIONS: Globally, there is a growing interest in COVID-19, and numerous infodemic monikers continue to circulate on the internet. Based on our findings, we hope to encourage mass media regulators and health organizers to be vigilant and diminish the use and circulation of these infodemic monikers to decrease the spread of misinformation.'],
['Infodemics, often including rumors, stigma, and conspiracy theories, have been common during the COVID-19 pandemic. Monitoring social media data has been identified as the best method for tracking rumors in real time and as a possible way to dispel misinformation and reduce stigma. However, the detection, assessment, and response to rumors, stigma, and conspiracy theories in real time are a challenge. Therefore, we followed and examined COVID-19-related rumors, stigma, and conspiracy theories circulating on online platforms, including fact-checking agency websites, Facebook, Twitter, and online newspapers, and their impacts on public health. Information was extracted between December 31, 2019 and April 5, 2020, and descriptively analyzed. We performed a content analysis of the news articles to compare and contrast data collected from other sources. We identified 2,311 reports of rumors, stigma, and conspiracy theories in 25 languages from 87 countries. Claims were related to illness, transmission and mortality (24%), control measures (21%), treatment and cure (19%), cause of disease including the origin (15%), violence (1%), and miscellaneous (20%). Of the 2,276 reports for which text ratings were available, 1,856 claims were false (82%). Misinformation fueled by rumors, stigma, and conspiracy theories can have potentially serious implications on the individual and community if prioritized over evidence-based guidelines. Health agencies must track misinformation associated with the COVID-19 in real time, and engage local communities and government stakeholders to debunk misinformation.']]

Putting all together¶

So far, we've built a Faiss index using the misinformation abstract vectors we encoded with a sentence-DistilBERT model. That's helpful but in a real case scenario, we would have to work with unseen data. To query the index with an unseen query and retrieve its most relevant documents, we would have to do the following:

Encode the query with the same sentence-DistilBERT model we used for the rest of the abstract vectors.
Change its data type to float32.
Search the index with the encoded query.

Here, we will use the introduction of an article published on HKS Misinformation Review.

In [ ]:

user_query = """
WhatsApp was alleged to have been widely used to spread misinformation and propaganda 
during the 2018 elections in Brazil and the 2019 elections in India. Due to the 
private encrypted nature of the messages on WhatsApp, it is hard to track the dissemination 
of misinformation at scale. In this work, using public WhatsApp data from Brazil and India, we 
observe that misinformation has been largely shared on WhatsApp public groups even after they 
were already fact-checked by popular fact-checking agencies. This represents a significant portion 
of misinformation spread in both Brazil and India in the groups analyzed. We posit that such 
misinformation content could be prevented if WhatsApp had a means to flag already fact-checked 
content. To this end, we propose an architecture that could be implemented by WhatsApp to counter 
such misinformation. Our proposal respects the current end-to-end encryption architecture on WhatsApp, 
thus protecting users’ privacy while providing an approach to detect the misinformation that benefits 
from fact-checking efforts.
"""

In [ ]:

# For convenience, I've wrapped all steps in the vector_search function.
# It takes four arguments: 
# A query, the sentence-level transformer, the Faiss index and the number of requested results
D, I = vector_search([user_query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [7.384466171264648, 57.3224983215332, 57.3224983215332, 71.48453521728516, 72.06803131103516, 79.13472747802734, 86.0128173828125, 89.91024780273438, 90.76014709472656, 90.76422119140625]

MAG paper IDs: [3047438096, 3021927925, 3037966274, 2889959140, 2791045616, 2943077655, 3014380170, 2967434249, 3028584171, 2990343632]

In [ ]:

# Fetching the paper titles based on their index
id2details(df, I, 'original_title')

Out[ ]:

[['Can WhatsApp Benefit from Debunked Fact-Checked Stories to Reduce Misinformation?'],
 ['A Dataset of Fact-Checked Images Shared on WhatsApp During the Brazilian and Indian Elections'],
 ['A Dataset of Fact-Checked Images Shared on WhatsApp During the Brazilian and Indian Elections'],
 ['A System for Monitoring Public Political Groups in WhatsApp'],
 ['Politics of Fake News: How WhatsApp Became a Potent Propaganda Tool in India'],
 ['Characterizing Attention Cascades in WhatsApp Groups'],
 ['OS IMPACTOS JURÍDICOS E SOCIAIS DAS FAKE NEWS EM TERRITÓRIO BRASILEIRO'],
 ['Fake News and Social Media: Indian Perspective'],
 ['Images and Misinformation in Political Groups: Evidence from WhatsApp in India'],
 ['Can WhatsApp Counter Misinformation by Limiting Message Forwarding']]

In [ ]:

# Define project base directory
# Change the index from 1 to 0 if you run this on Google Colab
project_dir = Path('notebooks').resolve().parents[1]
print(project_dir)

# Serialise index and store it as a pickle
with open(f"{project_dir}/models/faiss_index.pickle", "wb") as h:
    pickle.dump(faiss.serialize_index(index), h)

/content/vector_engine