In this notebook, we plot a choropleth map showing which regions of the world are most biased towards posting tweets with a negative sentiment.
The dataset used for this project is publicly available on Kaggle. Here is the link: https://www.kaggle.com/datasets/vivekchary/sentiment-with-16-million-tweets-with-locations
If you are running this in a Jupyter Notebook, download the CSV file and place it in your working environment before running the snippets.
Import the necessary Python libraries
import pandas as pd
import matplotlib.pyplot as plt
import folium
Load dataset into a pandas dataframe
df = pd.read_csv('sentiment140_with_location.csv', header=None, names=['Sentiment Target', 'Tweet ID', 'Date', 'Query Flag', 'User', 'Text', 'Location'], encoding='latin1')
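A quick sanity check after loading can catch a wrong column order or unexpected labels early. The sketch below uses a tiny in-memory CSV with made-up rows (not taken from the dataset) in place of the real file, and assumes the Sentiment140 convention of 0 for negative and 4 for positive labels.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for sentiment140_with_location.csv
sample_csv = io.StringIO(
    '0,1001,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,"so tired today",Canada\n'
    '4,1002,Mon Apr 06 22:20:01 PDT 2009,NO_QUERY,user_b,"great morning",Canada\n'
    '0,1003,Mon Apr 06 22:21:13 PDT 2009,NO_QUERY,user_c,"missed my bus",Brazil\n'
)
columns = ['Sentiment Target', 'Tweet ID', 'Date', 'Query Flag', 'User', 'Text', 'Location']
sample_df = pd.read_csv(sample_csv, header=None, names=columns)

# Inspect the label distribution and count rows with a missing location
label_counts = sample_df['Sentiment Target'].value_counts().to_dict()
missing_locations = int(sample_df['Location'].isna().sum())
print(label_counts)
print(missing_locations)
```

Running the same two checks on `df` shows which sentiment labels are actually present and how many tweets have no usable location.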
Filter the dataframe to only include tweets with negative sentiment
negative_tweets = df[df['Sentiment Target'] == 0]
Group the negative tweets by location to get a count of negative tweets for each location
negative_tweets_count = negative_tweets.groupby(['Location']).size().reset_index(name='count')
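To see what the grouping step produces, here is a minimal sketch on a hypothetical five-row dataframe. The rows and the name `toy_negative` are invented for illustration. The sort at the end is an optional extra.

```python
import pandas as pd

# Hypothetical negative tweets, one row per tweet
toy_negative = pd.DataFrame({
    'Location': ['Canada', 'Brazil', 'Canada', 'India', 'Canada'],
    'Text': ['a', 'b', 'c', 'd', 'e'],
})

# size() counts the rows in each group; reset_index turns the result
# back into a dataframe with a named 'count' column
toy_counts = toy_negative.groupby(['Location']).size().reset_index(name='count')

# Optional: put the locations with the most negative tweets first
toy_counts = toy_counts.sort_values('count', ascending=False).reset_index(drop=True)
print(toy_counts)
```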
Next, let's create a map to visualise the data.
chloro_map = folium.Map()
We need GeoJSON data of the countries in the world.
GeoJSON is a data format for representing geographical features, such as country borders.
In our case, the Folium repository already hosts GeoJSON data that we can use.
# Set up the URL of the world countries data
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
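It helps to know what the GeoJSON looks like before joining data to it. The sketch below uses a hand-written, heavily simplified FeatureCollection with two made-up countries, and a hypothetical helper `resolve_key` that follows a dotted path such as `feature.properties.name`. It only illustrates the idea behind such a path and is not Folium's actual implementation.

```python
# Hand-written, simplified GeoJSON with hypothetical countries and shapes
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "AAA",
            "properties": {"name": "Country A"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]]],
            },
        },
        {
            "type": "Feature",
            "id": "BBB",
            "properties": {"name": "Country B"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[2, 0], [2, 1], [3, 1], [3, 0], [2, 0]]],
            },
        },
    ],
}


def resolve_key(feature, key_on):
    """Follow a dotted path such as 'feature.properties.name' inside one feature."""
    value = feature
    for part in key_on.split('.')[1:]:  # skip the leading 'feature'
        value = value[part]
    return value


names = [resolve_key(f, 'feature.properties.name') for f in toy_geojson['features']]
ids = [resolve_key(f, 'feature.id') for f in toy_geojson['features']]
print(names)
print(ids)
```

Whichever path is used as the key, the values in the data column need to match it exactly for a country to be coloured.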
Here, we populate our map with the necessary data
# Add the choropleth layer to our base map
folium.Choropleth(
    # The GeoJSON data describing the world's country borders
    geo_data=country_shapes,
    name='Negative Tweet Bias Choropleth',
    data=negative_tweets_count,
    # columns takes a list of two names: the key column (country name) and the value column
    columns=['Location', 'count'],
    # Path inside each GeoJSON feature that is matched against the key column
    key_on='feature.properties.name',
    fill_color='PuRd',
    # Colour used for countries that have no matching data
    nan_fill_color='white'
).add_to(chloro_map)
<folium.features.Choropleth at 0x7f6a3e5c5960>
Display the map
chloro_map
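A white country on the map can mean either that it has no tweets in the data or that its name in the `Location` column does not exactly match the name in the GeoJSON. A minimal sketch for spotting such mismatches follows. Both collections are hypothetical stand-ins: in practice they would come from `df['Location']` and from the `properties.name` values of the GeoJSON features.

```python
# Hypothetical country names as they might appear in the GeoJSON properties
geojson_names = {'United States of America', 'Canada', 'Brazil'}

# Hypothetical Location values as they might appear in the tweet data
location_values = ['United States', 'Canada', 'Brazil', 'Canada']

# Locations that would not be matched to any country shape
unmatched = sorted(set(location_values) - geojson_names)
print(unmatched)
```

Any names that show up here could be renamed in the dataframe before plotting so that they line up with the GeoJSON.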
The areas of the map with the darkest shade of pink are the ones that produce the highest number of tweets with a perceived negative sentiment.
Because this dataset is relatively small compared to the total number of tweets available, the results may be somewhat skewed.
Looking at the absolute number of negative tweets can be misleading, since most tweets come from a handful of countries.
A better way to determine countries with a higher bias towards posting negative tweets is to plot the fraction of negative tweets by country.
To do this we are going to need to prepare our data a little bit differently:
First we are going to aggregate the tweets by country
country_counts = df.groupby('Location').size().reset_index(name='Total Count')
Next, we aggregate the negative sentiment tweets by country
neg_country_counts = negative_tweets.groupby('Location').size().reset_index(name='Negative Count')
We can then merge the two dataframes on the country column
country_counts = country_counts.merge(neg_country_counts, on='Location', how='left')
We can proceed to calculate the fraction of negative tweets by country and store it in a new column in our new dataframe
country_counts['Negative Fraction'] = country_counts['Negative Count'] / country_counts['Total Count']
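One detail worth checking: with `how='left'`, a country that has tweets but no negative ones ends up with `NaN` in `Negative Count`, so its fraction is also `NaN` rather than 0. The sketch below shows one possible way to handle this on hypothetical toy data, assuming that "no negative tweets" should be treated as a fraction of 0.

```python
import pandas as pd

# Hypothetical totals and negative counts; India has no negative tweets
toy_total = pd.DataFrame({'Location': ['Canada', 'Brazil', 'India'],
                          'Total Count': [4, 2, 3]})
toy_negative = pd.DataFrame({'Location': ['Canada', 'Brazil'],
                             'Negative Count': [1, 2]})

toy = toy_total.merge(toy_negative, on='Location', how='left')

# The left merge leaves NaN for India; treat that as zero negative tweets
toy['Negative Count'] = toy['Negative Count'].fillna(0).astype(int)
toy['Negative Fraction'] = toy['Negative Count'] / toy['Total Count']
print(toy)
```

Without the `fillna` step, such a country would be drawn with the `nan_fill_color` instead of the lightest shade.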
# Display the columns of our new dataframe
country_counts.columns
Index(['Location', 'Total Count', 'Negative Count', 'Negative Fraction'], dtype='object')
Next, we can add a choropleth layer to a new base map to highlight countries with a higher proportion of negative tweets
chloro_map_fractions = folium.Map()
# Add the choropleth layer to our base map
folium.Choropleth(
    # The GeoJSON data describing the world's country borders
    geo_data=country_shapes,
    name='Negative Tweet Bias Choropleth',
    data=country_counts,
    # columns takes a list of two names: the key column (country name) and the value column
    columns=['Location', 'Negative Fraction'],
    # Path inside each GeoJSON feature that is matched against the key column
    key_on='feature.properties.name',
    fill_color='PuRd',
    # Colour used for countries that have no matching data
    nan_fill_color='white'
).add_to(chloro_map_fractions)
<folium.features.Choropleth at 0x7f6a3d161270>
Display the map
# Display the choropleth map
chloro_map_fractions
The data here is not perfect and only comes from a relatively small dataset. This could explain why some countries you'd expect to show higher bias appear to have a lower proportion.
Another issue with fractions is that the denominator should be reasonably large for the fraction to be meaningful. For example, if a country has only 5 tweets in the dataset and one of them is marked as negative, its negative fraction is 0.2. Plotted as shown above, that country would look identical to one that has 1,000,000 tweets of which 200,000 are marked negative.
A proposed solution for this is to skip a country if the total number of tweets from the country is below some arbitrary threshold.
It is up to you to determine this threshold depending on your goals with your specific project.
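The thresholding idea can be sketched as follows. The data is hypothetical and reuses the 5-tweet and 1,000,000-tweet countries from the example above. The names `MIN_TWEETS` and `reliable` and the cut-off of 1,000 are arbitrary choices for illustration, not recommended values.

```python
import pandas as pd

MIN_TWEETS = 1000  # hypothetical threshold; choose one that suits your project

# Hypothetical per-country counts
toy_counts = pd.DataFrame({
    'Location': ['Country A', 'Country B', 'Country C'],
    'Total Count': [5, 1_000_000, 2_500],
    'Negative Count': [1, 200_000, 1_000],
})
toy_counts['Negative Fraction'] = toy_counts['Negative Count'] / toy_counts['Total Count']

# Country A and Country B share the same fraction (0.2), but only
# Country B has enough tweets for that fraction to be trustworthy
reliable = toy_counts[toy_counts['Total Count'] >= MIN_TWEETS].reset_index(drop=True)
print(reliable)
```

Passing the filtered dataframe as `data` to `folium.Choropleth` should then leave the skipped countries in the `nan_fill_color`, so they read as "not enough data" rather than as a low or high fraction.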
The choropleth map above is generated from pre-labelled data. The sentiment of each tweet in the original dataset, Sentiment140, was generated by running the dataset through a pretrained model that could detect whether a tweet was positive, neutral or negative.
The next steps for this experiment would involve:
Dr. Kishore Papinei: for the recommendations that led to the revised section, which displays the proportion of negative tweets by country