Data rarely comes in its most usable form. Careful data wrangling and exploratory data analysis are what separate a trustworthy analysis from garbage in, garbage out.
In this project, we will wrangle, analyze and visualize the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that does exactly what it says: it rates dogs, with captions that are reliably over the top. All you have to do is send them a dog picture via direct message (or any dog, come to think of it), and they'll rate it out of 10. The funny thing, though? The ratings are almost always greater than 10. Why? Because "they're good dogs Brent". WeRateDogs has over 4 million followers and has received international media coverage.
Specifically, we intend to gather the data from several sources, assess it visually and programmatically, clean the issues we find, and then analyze and visualize the result.
The following libraries will be useful during wrangling, analysis and visualization:
# Import useful libraries
import math
import time
import config   # Local module holding our Twitter API credentials
import numpy as np
import pandas as pd
import os
import requests
import tweepy
import json
from PIL import Image
from io import BytesIO
from IPython.display import display
# Visualization libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
We will also create a `Color` class to help us pretty-print formatted outputs:
class Color:
blue = '\033[94m'
green = '\033[92m'
red = '\033[91m'
bold = '\033[1m'
underline = '\033[4m'
end = '\033[0m'
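For example, the attributes can be combined to print bold green text (a quick illustration; any ANSI-capable terminal or notebook will render it):

print(Color.bold + Color.green + 'Task Complete!' + Color.end)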
We will be gathering data from three different sources:
- Enhanced Twitter archive data, compiled by @dog_rates and shared with Udacity. This archive contains basic tweet data for all 5,000+ of their tweets as of August 2017. Udacity provided us with this file, so we will treat it as a file on hand.
- An image predictions TSV file, compiled by running every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. We will download this file programmatically from Udacity's servers using the requests library.
- Additional data from the Twitter API: we will gather each tweet's retweet count, favorite ("like") count and hashtags from the Twitter API using the Tweepy library.
# Read the twitter archive data provided
wrd_archive = pd.read_csv('./twitter-archive-enhanced.csv')
wrd_archive.head(2)
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
# Programmatically download the image predictions
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
file_name = url.split('/')[-1]
response = requests.get(url)
# Write the url response to a file locally
start = time.time()
with open(file_name, 'wb') as f:
f.write(response.content)
print(Color.green+'Process completed in {} seconds'
.format(time.time()-start) + Color.end
)
Process completed in 0.0026407241821289062 seconds
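Since we move straight on to reading the file, a quick sanity check is worthwhile. A minimal sketch, reusing the `response` and `file_name` variables from the cell above (os was imported earlier):

# Sanity check (sketch): the request succeeded and the file is non-empty
assert response.status_code == 200, 'Download failed'
assert os.path.getsize(file_name) > 0, 'Downloaded file is empty'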
# Read the image predictions into a dataframe
img_predictions = pd.read_csv('./image-predictions.tsv', sep='\t')
img_predictions.head(2)
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
# Configure and create an API object to gather twitter data
consumer_key = config.API_KEY
consumer_secret = config.API_KEY_SECRET
access_token = config.ACCESS_TOKEN
access_secret = config.ACCESS_TOKEN_SECRET
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
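Before kicking off a long extraction run, it can help to confirm that the credentials actually work. A small sketch using Tweepy's verify_credentials (Tweepy 3.x, matching the wait_on_rate_limit_notify argument above):

# Sketch: fail fast if authentication is rejected
try:
    me = api.verify_credentials()
    print('Authenticated as @{}'.format(me.screen_name))
except tweepy.TweepError as e:
    print('Authentication failed:', e)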
# --- Pull tweet information using the ids in wrd_archive
# Extract the tweet ids from the wrd dataframe
tweet_ids = wrd_archive['tweet_id']
# Initialize variables to monitor runtime activity
success, failure, counter = (0, 0, 0)
failed_attempts = {}
# Loop over each tweet id and collect the information
print(Color.bold+'COMMENCING JSON EXTRACTION TASK'+Color.end+'\n'+'-'*70)
start_time = time.time()
with open('tweet_json.txt', 'w') as file:
print('Pulling json data for the first 200 tweets...')
for tweet_id in tweet_ids:
# After every 200 tweets, print a summary to the user
if (success % 200 == 0) and (counter > 0):
print(Color.bold + Color.green + 'Sub-task Complete!'+ Color.end)
print('Successful pulls: {} || failed pulls: {} || Pulls pending: {}'
.format(success, failure, tweet_ids.size - counter)
)
print('\nPulling json data for the next 200 tweets...')
try:
tweet_info = api.get_status(tweet_id, tweet_mode='extended')
json.dump(tweet_info._json, file)
file.write('\n')
success+=1
except Exception as e:
failed_attempts[tweet_id] = e
failure+=1
pass
finally:
counter+=1
# Print feedback on entire execution process
duration = (time.time() - start_time)/60
failed = len(failed_attempts.keys())
print(Color.bold + Color.green +'Task Completed!\n'+ Color.end + '-'*70)
print(Color.bold +'DISPLAYING RUNTIME SUMMARY'+ Color.end)
print('The entire process took: {} minutes'.format(round(duration, 2)))
if (failed > 0):
print(Color.bold + Color.red +
'Could not pull information for '+ str(failed) + ' tweet ids:'+
Color.end)
print(pd.Series(failed_attempts))
else:
print(Color.bold + Color.green +'No failed attempts'+ Color.end)
COMMENCING JSON EXTRACTION TASK
----------------------------------------------------------------------
Pulling json data for the first 200 tweets...
Sub-task Complete!
Successful pulls: 200 || failed pulls: 9 || Pulls pending: 2147

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 400 || failed pulls: 18 || Pulls pending: 1938

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 600 || failed pulls: 20 || Pulls pending: 1736

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 800 || failed pulls: 24 || Pulls pending: 1532

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 1000 || failed pulls: 28 || Pulls pending: 1328

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 1200 || failed pulls: 28 || Pulls pending: 1128

Pulling json data for the next 200 tweets...
Rate limit reached. Sleeping for: 291
Sub-task Complete!
Successful pulls: 1400 || failed pulls: 28 || Pulls pending: 928

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 1600 || failed pulls: 28 || Pulls pending: 728

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 1800 || failed pulls: 29 || Pulls pending: 527

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 2000 || failed pulls: 29 || Pulls pending: 327

Pulling json data for the next 200 tweets...
Rate limit reached. Sleeping for: 296
Sub-task Complete!
Successful pulls: 2200 || failed pulls: 29 || Pulls pending: 127

Pulling json data for the next 200 tweets...
Task Completed!
----------------------------------------------------------------------
DISPLAYING RUNTIME SUMMARY
The entire process took: 37.3 minutes
Could not pull information for 29 tweet ids:
888202515573088257    [{'code': 144, 'message': 'No status found wit...
873697596434513921    [{'code': 144, 'message': 'No status found wit...
872668790621863937    [{'code': 144, 'message': 'No status found wit...
872261713294495745    [{'code': 144, 'message': 'No status found wit...
869988702071779329    [{'code': 144, 'message': 'No status found wit...
866816280283807744    [{'code': 144, 'message': 'No status found wit...
861769973181624320    [{'code': 144, 'message': 'No status found wit...
856602993587888130    [{'code': 144, 'message': 'No status found wit...
856330835276025856    [{'code': 34, 'message': 'Sorry, that page doe...
851953902622658560    [{'code': 144, 'message': 'No status found wit...
851861385021730816    [{'code': 144, 'message': 'No status found wit...
845459076796616705    [{'code': 144, 'message': 'No status found wit...
844704788403113984    [{'code': 144, 'message': 'No status found wit...
842892208864923648    [{'code': 144, 'message': 'No status found wit...
837366284874571778    [{'code': 144, 'message': 'No status found wit...
837012587749474308    [{'code': 144, 'message': 'No status found wit...
829374341691346946    [{'code': 144, 'message': 'No status found wit...
827228250799742977    [{'code': 144, 'message': 'No status found wit...
812747805718642688    [{'code': 144, 'message': 'No status found wit...
802247111496568832    [{'code': 144, 'message': 'No status found wit...
779123168116150273    [{'code': 144, 'message': 'No status found wit...
775096608509886464    [{'code': 144, 'message': 'No status found wit...
771004394259247104    [{'code': 179, 'message': 'Sorry, you are not ...
770743923962707968    [{'code': 144, 'message': 'No status found wit...
766864461642756096    [{'code': 144, 'message': 'No status found wit...
759923798737051648    [{'code': 144, 'message': 'No status found wit...
759566828574212096    [{'code': 144, 'message': 'No status found wit...
754011816964026368    [{'code': 144, 'message': 'No status found wit...
680055455951884288    [{'code': 144, 'message': 'No status found wit...
dtype: object
# Extract the information we want from the json file
json_tweet_details = []
with open('tweet_json.txt', 'r', encoding='UTF-8') as file:
for line in file:
json_text = json.loads(line)
# Extract the tweet_id, likes and retweet count
tweet_id = json_text['id_str']
retweets = json_text['retweet_count']
likes = json_text['favorite_count']
# Extract the hashtag from the json file
hashtags_info = json_text['entities']['hashtags']
if len(hashtags_info) !=0:
hashtags = ['#'+item['text'] for item in hashtags_info]
else:
hashtags = 'None'
# Assign these values into our list
json_tweet_details.append({
'tweet_id': tweet_id,
'hashtag': hashtags,
'retweets': retweets,
'likes': likes}
)
# Read all extracted data into a Pandas dataframe
json_tweet_info = pd.DataFrame(json_tweet_details)
json_tweet_info.head(2)
tweet_id | hashtag | retweets | likes | |
---|---|---|---|---|
0 | 892420643555336193 | None | 7018 | 33839 |
1 | 892177421306343426 | None | 5303 | 29355 |
Visual assessment of the `wrd_archive` dataframe in the Jupyter notebook:

wrd_archive.sample(20)
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
957 | 751538714308972544 | NaN | NaN | 2016-07-08 22:09:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Max. She has one ear that's always sli... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/751538714... | 10 | 10 | Max | None | None | None | None |
563 | 802572683846291456 | NaN | NaN | 2016-11-26 18:00:13 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Winnie. She's h*ckin ferocious. Dandel... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/802572683... | 12 | 10 | Winnie | None | None | None | None |
42 | 884247878851493888 | NaN | NaN | 2017-07-10 03:08:17 +0000 | <a href="http://twitter.com/download/iphone" r... | OMG HE DIDN'T MEAN TO HE WAS JUST TRYING A LIT... | NaN | NaN | NaN | https://twitter.com/kaijohnson_19/status/88396... | 13 | 10 | None | None | None | None | None |
18 | 888554962724278272 | NaN | NaN | 2017-07-22 00:23:06 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Ralphus. He's powering up. Attempting ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/888554962... | 13 | 10 | Ralphus | None | None | None | None |
1556 | 688828561667567616 | NaN | NaN | 2016-01-17 21:01:41 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Brad. His car probably has a spoi... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/688828561... | 9 | 10 | Brad | None | None | None | None |
1550 | 689154315265683456 | NaN | NaN | 2016-01-18 18:36:07 +0000 | <a href="http://twitter.com/download/iphone" r... | We normally don't rate birds but I feel bad co... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/689154315... | 9 | 10 | None | None | None | None | None |
1950 | 673688752737402881 | NaN | NaN | 2015-12-07 02:21:29 +0000 | <a href="http://twitter.com/download/iphone" r... | Meet Larry. He doesn't know how to shoe. 9/10 ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/673688752... | 9 | 10 | Larry | None | None | None | None |
1194 | 717428917016076293 | NaN | NaN | 2016-04-05 19:09:17 +0000 | <a href="http://vine.co" rel="nofollow">Vine -... | This is Skittle. He's trying to communicate. 1... | NaN | NaN | NaN | https://vine.co/v/iIhEU2lVqxz | 11 | 10 | Skittle | None | None | None | None |
1271 | 709409458133323776 | NaN | NaN | 2016-03-14 16:02:49 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Billy. He sensed a squirrel. 8/10 damn... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/709409458... | 8 | 10 | Billy | None | None | None | None |
257 | 843856843873095681 | NaN | NaN | 2017-03-20 16:08:44 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Sadie and Daisy. They do all thei... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/843856843... | 12 | 10 | Sadie | None | None | None | None |
1934 | 674014384960745472 | NaN | NaN | 2015-12-07 23:55:26 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Aiden. His eyes are magical. Love... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/674014384... | 11 | 10 | Aiden | None | None | None | None |
2009 | 672254177670729728 | NaN | NaN | 2015-12-03 03:21:00 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Rolf. He's having the time of his life... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/672254177... | 11 | 10 | Rolf | None | None | pupper | None |
1422 | 698178924120031232 | NaN | NaN | 2016-02-12 16:16:41 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Lily. She accidentally dropped all her... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/698178924... | 10 | 10 | Lily | None | None | None | None |
1486 | 693109034023534592 | NaN | NaN | 2016-01-29 16:30:45 +0000 | <a href="http://twitter.com/download/iphone" r... | "Thank you friend that was a swell petting" 11... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/693109034... | 11 | 10 | None | None | None | None | None |
615 | 796563435802726400 | NaN | NaN | 2016-11-10 04:01:37 +0000 | <a href="http://twitter.com/download/iphone" r... | RT @dog_rates: I want to finally rate this ico... | 7.809316e+17 | 4.196984e+09 | 2016-09-28 00:46:20 +0000 | https://twitter.com/dog_rates/status/780931614... | 13 | 10 | None | None | None | None | puppo |
674 | 789599242079838210 | NaN | NaN | 2016-10-21 22:48:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Brownie. She's wearing a Halloween the... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/789599242... | 12 | 10 | Brownie | None | None | None | None |
232 | 847962785489326080 | NaN | NaN | 2017-04-01 00:04:17 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Georgie. He's very shy. Only puppears ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/847962785... | 10 | 10 | Georgie | None | None | None | None |
2206 | 668631377374486528 | NaN | NaN | 2015-11-23 03:25:17 +0000 | <a href="http://twitter.com/download/iphone" r... | Meet Zeek. He is a grey Cumulonimbus. Zeek is ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/668631377... | 5 | 10 | Zeek | None | None | None | None |
1798 | 677228873407442944 | NaN | NaN | 2015-12-16 20:48:40 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Chuq. He just wants to fit in. 11... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/677228873... | 11 | 10 | Chuq | None | None | None | None |
441 | 819711362133872643 | NaN | NaN | 2017-01-13 01:03:12 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Howie. He just bloomed. 11/10 revoluti... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/819711362... | 11 | 10 | Howie | None | None | None | None |
Quality Issues

- Some records are retweets or replies. Some may contain ratings, but they are not the original tweets. The information to identify them can be found in the following columns: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`.
- Unexpected ratings in the `rating_numerator` and `rating_denominator` columns. Examples are rating numerators as high as 666 and denominators as low as 0.
- Unusual dog names such as "a", "an" and "not" in the `name` column.
Tidiness Issues

- The various stages of dog life (`doggo`, `pupper`, `puppo` and `floofer`) should be contained in one column.
- Long and unnecessary links in the `source` column (text is embedded in HTML tags). All we need is the actual text.
- Unwanted columns present: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`.
- `rating_numerator` and `rating_denominator` can be reduced into one column.
Visual assessment of the `img_predictions` dataframe in the Jupyter notebook, including additional visual assessments in Google Sheets:

img_predictions.sample(20)
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1946 | 862457590147678208 | https://pbs.twimg.com/media/C_gQmaTUMAAPYSS.jpg | 1 | home_theater | 0.496348 | False | studio_couch | 0.167256 | False | barber_chair | 0.052625 | False |
901 | 700002074055016451 | https://pbs.twimg.com/media/CbboKP4WIAAw8xq.jpg | 1 | Chihuahua | 0.369488 | True | schipperke | 0.243367 | True | pug | 0.161614 | True |
114 | 667924896115245057 | https://pbs.twimg.com/media/CUTyJpHWcAATl0O.jpg | 1 | Labrador_retriever | 0.209051 | True | hog | 0.203980 | False | Newfoundland | 0.165914 | True |
133 | 668480044826800133 | https://pbs.twimg.com/media/CUbrDWOWcAEyMdM.jpg | 1 | Arctic_fox | 0.119243 | False | Labrador_retriever | 0.099965 | True | pug | 0.086717 | True |
108 | 667878741721415682 | https://pbs.twimg.com/media/CUTILFiWcAE8Rle.jpg | 1 | seat_belt | 0.200373 | False | miniature_pinscher | 0.106003 | True | schipperke | 0.104733 | True |
1593 | 798694562394996736 | https://pbs.twimg.com/media/Cbs3DOAXIAAp3Bd.jpg | 1 | Chihuahua | 0.615163 | True | Pembroke | 0.159509 | True | basenji | 0.084466 | True |
1809 | 832757312314028032 | https://pbs.twimg.com/media/C46MWnFVYAUg1RK.jpg | 2 | Cardigan | 0.160888 | True | Staffordshire_bullterrier | 0.159441 | True | Boston_bull | 0.154368 | True |
270 | 670822709593571328 | https://pbs.twimg.com/media/CU89schWIAIHQmA.jpg | 1 | web_site | 0.993887 | False | Chihuahua | 0.001252 | True | menu | 0.000599 | False |
2069 | 891087950875897856 | https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg | 1 | Chesapeake_Bay_retriever | 0.425595 | True | Irish_terrier | 0.116317 | True | Indian_elephant | 0.076902 | False |
2058 | 888917238123831296 | https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg | 1 | golden_retriever | 0.714719 | True | Tibetan_mastiff | 0.120184 | True | Labrador_retriever | 0.105506 | True |
1980 | 871032628920680449 | https://pbs.twimg.com/media/DBaHi3YXgAE6knM.jpg | 1 | kelpie | 0.398053 | True | macaque | 0.068955 | False | dingo | 0.050602 | False |
1636 | 806242860592926720 | https://pbs.twimg.com/media/Ct72q9jWcAAhlnw.jpg | 2 | Cardigan | 0.593858 | True | Shetland_sheepdog | 0.130611 | True | Pembroke | 0.100842 | True |
743 | 687476254459715584 | https://pbs.twimg.com/media/CYpoAZTWEAA6vDs.jpg | 1 | wood_rabbit | 0.702725 | False | Angora | 0.190659 | False | hare | 0.105072 | False |
1959 | 865718153858494464 | https://pbs.twimg.com/media/DAOmEZiXYAAcv2S.jpg | 1 | golden_retriever | 0.673664 | True | kuvasz | 0.157523 | True | Labrador_retriever | 0.126073 | True |
1259 | 748699167502000129 | https://pbs.twimg.com/media/CmPp5pOXgAAD_SG.jpg | 1 | Pembroke | 0.849029 | True | Cardigan | 0.083629 | True | kelpie | 0.024394 | True |
1641 | 807106840509214720 | https://pbs.twimg.com/ext_tw_video_thumb/80710... | 1 | Chihuahua | 0.505370 | True | Pomeranian | 0.120358 | True | toy_terrier | 0.077008 | True |
522 | 676582956622721024 | https://pbs.twimg.com/media/CWO0m8tUwAAB901.jpg | 1 | seat_belt | 0.790028 | False | Boston_bull | 0.196307 | True | French_bulldog | 0.012429 | True |
1433 | 773547596996571136 | https://pbs.twimg.com/media/Crwxb5yWgAAX5P_.jpg | 1 | Norwegian_elkhound | 0.372202 | True | Chesapeake_Bay_retriever | 0.137187 | True | malamute | 0.071436 | True |
1622 | 803380650405482500 | https://pbs.twimg.com/media/CyYub2kWEAEYdaq.jpg | 1 | bookcase | 0.890601 | False | entertainment_center | 0.019287 | False | file | 0.009490 | False |
172 | 669000397445533696 | https://pbs.twimg.com/media/CUjETvDVAAI8LIy.jpg | 1 | Pembroke | 0.822940 | True | Cardigan | 0.177035 | True | basenji | 0.000023 | True |
Quality Issues

- The predictions in columns `p1`, `p2` and `p3` are not uniformly formatted. Some names are lowercase, some are uppercase and some are titlecase.
- The predictions also have words separated by underscores instead of spaces (a possible fix is sketched below).
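One possible fix during cleaning (a sketch, not applied here) is to replace the underscores and title-case every prediction:

# Sketch: make the breed strings uniform across p1-p3
for col in ('p1', 'p2', 'p3'):
    img_predictions[col] = (img_predictions[col]
                            .str.replace('_', ' ', regex=False)
                            .str.title())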
Tidiness Issues

- From `p1`, `p2` and `p3`, we only need the most confident prediction that corresponds to an actual dog breed (see the sketch below).
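One way to implement this later (a sketch, assuming the column layout shown above): walk the predictions in descending order of confidence and keep the first one flagged as a dog breed.

# Sketch: pick the most confident prediction that is an actual dog breed
def best_dog_breed(row):
    for p, is_dog in (('p1', 'p1_dog'), ('p2', 'p2_dog'), ('p3', 'p3_dog')):
        if row[is_dog]:
            return row[p]
    return None

# Usage: img_predictions.apply(best_dog_breed, axis=1)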
Visual assessment of the `json_tweet_info` dataframe in the Jupyter notebook:

json_tweet_info.sample(20, random_state=4)
tweet_id | hashtag | retweets | likes | |
---|---|---|---|---|
779 | 772581559778025472 | None | 1588 | 6120 |
767 | 773985732834758656 | None | 3590 | 10174 |
1163 | 717841801130979328 | None | 545 | 2280 |
1660 | 681523177663676416 | None | 5217 | 13203 |
261 | 840696689258311684 | None | 892 | 11510 |
1310 | 705066031337840642 | None | 556 | 2021 |
604 | 795464331001561088 | None | 22013 | 46862 |
1679 | 680801747103793152 | None | 743 | 2193 |
740 | 778286810187399168 | None | 3065 | 9751 |
821 | 766313316352462849 | None | 1737 | 6357 |
120 | 868622495443632128 | None | 4481 | 23664 |
297 | 835246439529840640 | None | 63 | 1992 |
245 | 843604394117681152 | None | 2486 | 15745 |
1889 | 674271431610523648 | None | 645 | 1397 |
772 | 773336787167145985 | None | 4681 | 0 |
853 | 760656994973933572 | None | 1753 | 6190 |
1780 | 676864501615042560 | None | 629 | 1903 |
1854 | 674805413498527744 | None | 319 | 766 |
169 | 857746408056729600 | None | 9366 | 30861 |
665 | 788150585577050112 | None | 1203 | 5807 |
- Everything looks fine for now.
# Examine general dataframe information
wrd_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2356 non-null   int64
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object
 4   source                      2356 non-null   object
 5   text                        2356 non-null   object
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object
 9   expanded_urls               2297 non-null   object
 10  rating_numerator            2356 non-null   int64
 11  rating_denominator          2356 non-null   int64
 12  name                        2356 non-null   object
 13  doggo                       2356 non-null   object
 14  floofer                     2356 non-null   object
 15  pupper                      2356 non-null   object
 16  puppo                       2356 non-null   object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
Notes

- `tweet_id` is stored as int instead of string/object type.
- 181 records are retweets and 78 records are replies. We don't need these records in our analysis.
- The `timestamp` column is stored as string/object type rather than as a Pandas datetime object.
- The `expanded_urls` column has some null records.
Let's zoom into the records where `expanded_urls` is null:
print(Color.blue+'Computing null entries for records with missing expanded urls..\n'
+ Color.end)
print(wrd_archive[wrd_archive['expanded_urls'].isnull()].isnull().sum())
Computing null entries for records with missing expanded urls..
tweet_id 0
in_reply_to_status_id 4
in_reply_to_user_id 4
timestamp 0
source 0
text 0
retweeted_status_id 58
retweeted_status_user_id 58
retweeted_status_timestamp 58
expanded_urls 59
rating_numerator 0
rating_denominator 0
name 0
doggo 0
floofer 0
pupper 0
puppo 0
dtype: int64
Tweets with missing `expanded_urls` are mostly retweets or replies. We don't need these records in our analysis.
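We can confirm this with a quick check (a sketch) of how many of these records are neither retweets nor replies:

# Count missing-url records that are neither retweets nor replies
missing = wrd_archive[wrd_archive['expanded_urls'].isnull()]
neither = missing[missing['retweeted_status_id'].isnull() &
                  missing['in_reply_to_status_id'].isnull()]
print('{} of {} records are neither retweets nor replies'
      .format(neither.shape[0], missing.shape[0]))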
# Check the archive for duplicate records
duplicates = wrd_archive.duplicated().sum()
print(Color.green +
      'wrd_archive has {} duplicate records'.format(duplicates) +
      Color.end)
wrd_archive has 0 duplicate records
# Examine the unique values in the source column
print(Color.bold+ Color.blue+
'Examining unique values in the source column\n' +
Color.end)
for i, item in enumerate(wrd_archive['source'].unique()):
print(i, ': ', item)
Examining unique values in the source column
0 : <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
1 : <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
2 : <a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>
3 : <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
- We only want the information between the opening and closing anchor tags, signalling the tweet source.
Let's verify whether the links in the `text` and `expanded_urls` columns are different:
# Set Pandas column width to allow longer text displays
pd.set_option("display.max_colwidth",150)
# Examine the text column and expanded_url columns
wrd_archive[['text', 'expanded_urls']].sample(5)
text | expanded_urls | |
---|---|---|
1414 | This is Cuddles. He's not entirely sure how doors work. 10/10 I believe in you Cuddles https://t.co/rKjK88D05Z | https://twitter.com/dog_rates/status/698710712454139905/photo/1 |
2351 | Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq | https://twitter.com/dog_rates/status/666049248165822465/photo/1 |
275 | I didn't even have to intervene. Took him 4 minutes to realize his error. 10/10 for Kevin https://t.co/2gclc1MNr7 | https://twitter.com/dog_rates/status/840696689258311684/photo/1 |
2146 | This is a spotted Lipitor Rumpelstiltskin named Alphred. He can't wait for the Turkey. 10/10 would pet really well https://t.co/6GUGO7azNX | https://twitter.com/dog_rates/status/669923323644657664/photo/1 |
766 | "Yep... just as I suspected. You're not flossing." 12/10 and 11/10 for the pup not flossing https://t.co/SuXcI9B7pQ | https://twitter.com/dog_rates/status/777684233540206592/photo/1 |
After testing each link, one discovers that, for each record, both the `text` and `expanded_urls` links lead to the same tweet. Some records also have multiple expanded urls separated by commas, all leading to the same tweet. As a result, we can make the following notes:

- The `text` column contains both the tweet text and the tweet url.
- The same tweet url is already present in the `expanded_urls` column.
# Examine the distribution of ratings in the dataset
wrd_archive[['rating_numerator', 'rating_denominator']].describe()
rating_numerator | rating_denominator | |
---|---|---|
count | 2356.000000 | 2356.000000 |
mean | 13.126486 | 10.455433 |
std | 45.876648 | 6.745237 |
min | 0.000000 | 0.000000 |
25% | 10.000000 | 10.000000 |
50% | 11.000000 | 10.000000 |
75% | 12.000000 | 10.000000 |
max | 1776.000000 | 170.000000 |
# Examine the unique values in rating numerator and denominator
print(Color.bold+Color.blue+'Unique rating numerators'+Color.end)
print(wrd_archive['rating_numerator'].unique())
print(Color.bold+Color.blue+'\nUnique rating denominators'+Color.end)
print(wrd_archive['rating_denominator'].unique())
Unique rating numerators
[  13   12   14    5   17   11   10  420  666    6   15  182  960    0   75    7   84    9   24    8    1   27    3    4  165 1776  204   50   99   80   45   60   44  143  121   20   26    2  144   88]

Unique rating denominators
[ 10   0  15  70   7  11 150 170  20  50  90  80  40 130 110  16 120   2]
- Though WeRateDogs posts can have numerators higher than 10, they almost always have denominators of 10. Numerators as high as 1776 and denominators as low as 0 prompt us to inspect the dataframe further:
# Assess instances where rating numerators > 15 and denominators are !=10
rating_check_df = (wrd_archive[(wrd_archive['rating_numerator'] > 15) | (wrd_archive['rating_denominator']!=10)])
# filter out the retweets
rating_check_df = (rating_check_df[rating_check_df['retweeted_status_id'].isnull()])
# filter out the replies
rating_check_df = (rating_check_df[rating_check_df['in_reply_to_status_id'].isnull()])
# Finally examine the text and the ratings
print(Color.red + Color.bold+
'{} records found!'.format(rating_check_df.shape[0])+
Color.end)
rating_check_df[['text', 'rating_numerator', 'rating_denominator']]
22 records found!
text | rating_numerator | rating_denominator | |
---|---|---|---|
433 | The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd | 84 | 70 |
516 | Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t ... | 24 | 7 |
695 | This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS | 75 | 10 |
763 | This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://... | 27 | 10 |
902 | Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE | 165 | 150 |
979 | This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh | 1776 | 10 |
1068 | After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ | 9 | 11 |
1120 | Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv | 204 | 170 |
1165 | Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a | 4 | 20 |
1202 | This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq | 50 | 50 |
1228 | Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 | 99 | 90 |
1254 | Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 | 80 | 80 |
1274 | From left to right:\nCletus, Jerome, Alejandro, Burp, & Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK | 45 | 50 |
1351 | Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa | 60 | 50 |
1433 | Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ | 44 | 40 |
1635 | Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 | 121 | 110 |
1662 | This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5 | 7 | 11 |
1712 | Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD | 26 | 10 |
1779 | IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq | 144 | 120 |
1843 | Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw | 88 | 80 |
2074 | After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY | 420 | 10 |
2335 | This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv | 1 | 2 |
- Some ratings were erroneously pulled from the original tweet, especially when dates (e.g. 24/7 and 9/11) or decimal ratings (e.g. 11.27/10 and 9.75/10) appear in the tweet text.
- Some high ratings are addressed to groups of dogs, for example 165/150, 84/70 and 88/80.
- A few ratings are deliberately extreme, like 1776/10 and 420/10.
During visual assessment, we identified some unusual dog names like "a" and "an". These names were only a few characters long, so we will examine the entire name column for names with four characters or fewer. We will probably find a lot of invalid names in this group:
print(Color.bold + Color.blue +
'Specially examine names with four string characters or less..\n'
+ Color.end)
# Examine the name column further especially names with 4 characters or less
print(wrd_archive.name[wrd_archive.name.apply(lambda x: len(x)<=4)].unique())
Specially examine names with four string characters or less..
['None' 'Jax' 'Zoey' 'Koda' 'Ted' 'Jim' 'Zeke' 'such' 'Maya' 'Earl' 'Lola'
'Yogi' 'Noah' 'Gus' 'Alfy' 'Koko' 'Rey' 'Gary' 'a' 'Jack' 'Emmy' 'Beau'
'Aja' 'Cash' 'Coco' 'Jed' 'Kody' 'Dawn' 'Cody' 'Lili' 'Dave' 'Burt'
'Carl' 'Thor' 'Luna' 'Arya' 'Iggy' 'Kyle' 'Leo' 'Odin' 'Tuck' 'Hank'
'Ken' 'Max' 'Odie' 'Arlo' 'Lucy' 'Ava' 'Rory' 'Eli' 'Ash' 'Tobi' 'not'
'Kuyu' 'Pete' 'Kyro' 'Loki' 'Mia' 'one' 'Mutt' 'Bear' 'Kona' 'Phil' 'Ike'
'Mo' 'Toby' 'Nala' 'Gabe' 'Luca' 'Finn' 'Anna' 'Bo' 'Tom' 'Dido' 'Levi'
'Alf' 'Sky' 'Tyr' 'Mary' 'Moe' 'Halo' 'Sam' 'Ito' 'Milo' 'Cali' 'Duke'
'Chef' 'Doc' 'Sobe' 'Iroh' 'Ruby' 'Mack' 'Juno' 'Lily' 'Newt' 'Nida'
'BeBe' 'mad' 'Dale' 'Hero' 'Godi' 'Dash' 'Bell' 'Jay' 'Mya' 'an' 'Huck'
'very' 'O' 'Blue' 'Fizz' 'Chip' 'Grey' 'Al' 'just' 'Lou' 'Tito' 'Brat'
'Tove' 'my' 'Kota' 'Eve' 'Rose' 'Theo' 'Fido' 'Emma' 'Gert' 'Dex' 'Ace'
'Fred' 'Zoe' 'Blu' 'his' 'Cora' 'Abby' 'Geno' 'Beya' 'Kilo' 'Doug' 'Aqua'
'Axel' 'Remy' 'this' 'Ziva' 'Puff' 'all' 'Ivar' 'Sid' 'Otis' 'Suki'
'Ebby' 'Link' 'Ozzy' 'old' 'Zeus' 'Nico' 'Siba' 'Kanu' 'Opie' 'Kane'
'Sora' 'Lacy' 'Olaf' 'Kara' 'Zara' 'Bode' 'Rudy' 'Fiji' 'Rilo' 'Yoda'
'Chet' 'Kaia' 'Eazy' 'CeCe' 'Ole' 'Berb' 'Bob' 'Kobe' 'Lolo' 'Eriq' 'the'
'Durg' 'Fynn' 'Ferg' 'Trip' 'Brad' 'Opal' 'Marq' 'Mona' 'Birf' 'Oreo'
'Jeph' 'Obi' 'Tino' 'Lupe' 'Lulu' 'Taco' 'Joey' 'Kreg' 'Todo' 'Tess' 'by'
'Mike' 'Evy' 'Tug' 'Izzy' 'Chuq' 'Karl' 'Herm' 'Bert' 'Zuzu' 'Jeb' 'life'
'Acro' 'Obie' 'Dot' 'Mac' 'Ed' 'Taz' 'Jazz' 'Rolf' 'Cal' 'Tuco' 'Mojo'
'Mark' 'JD' 'Pip' 'Jett' 'Amy' 'Sage' 'Andy' 'Creg' 'Gin' 'Bloo' 'Edd'
'Herb' 'Liam' 'Ben' 'Skye' 'Dug' 'Kirk' 'Ralf' 'Chaz' 'Bobb' 'Hanz'
'Zeek' 'Maks' 'Jo' 'DayZ' 'Ron' 'Erik' 'Stu' 'Kial' 'Dook' 'Hall' 'Fwed'
'Keet']
- Again we notice more unusual names like the, my, by, his, all, mad, life, very, old, this, just, etc. All these unusual names are formatted in lowercase, while the viable names are properly capitalized.

We can use this criterion to query the entire `name` column, searching for records with improper name capitalization:
# Check the entire dataframe for improper capitalizations of dog names
mask = wrd_archive['name'].str.match(r"[A-Z].?")
invalid_names = wrd_archive[~mask]['name'].value_counts()
print(Color.red + Color.bold +
'There are {} records with invalid names\n'.format(invalid_names.sum())+
Color.end)
print(invalid_names)
There are 109 records with invalid names
a 55
the 8
an 7
very 5
just 4
quite 4
one 4
getting 2
actually 2
mad 2
not 2
old 1
life 1
officially 1
light 1
by 1
infuriating 1
such 1
all 1
unacceptable 1
this 1
his 1
my 1
incredibly 1
space 1
Name: name, dtype: int64
- None of the improperly capitalized entries in the `name` column are valid dog names.
- These entries constitute 109 records in total.
# Examine the dog stage columns
for dog_stage in wrd_archive.columns[-4:]:
print(Color.bold + Color.blue +
'\nValue counts for {} column'.format(dog_stage) +
Color.end)
print(wrd_archive[dog_stage].value_counts())
Value counts for doggo column
None     2259
doggo      97
Name: doggo, dtype: int64

Value counts for floofer column
None       2346
floofer      10
Name: floofer, dtype: int64

Value counts for pupper column
None      2099
pupper     257
Name: pupper, dtype: int64

Value counts for puppo column
None     2326
puppo      30
Name: puppo, dtype: int64
- Aside from the fact that we have to tidy these columns up into one, everything looks good.
# Examine a quick summary of the dataframe
img_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2075 non-null   int64
 1   jpg_url   2075 non-null   object
 2   img_num   2075 non-null   int64
 3   p1        2075 non-null   object
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool
 6   p2        2075 non-null   object
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool
 9   p3        2075 non-null   object
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
- There are 2075 records here, 281 fewer than in the WeRateDogs archive data.
- `tweet_id` is stored with the wrong datatype: it should be a string/object type.
- We won't be needing the `img_num` column.
# Check the dataframe for duplicate records
duplicates = img_predictions.duplicated().sum()
print(Color.green +
      'img_predictions has {} duplicate records'.format(duplicates) +
      Color.end)
img_predictions has 0 duplicate records
# Compute descriptive statistics for the numeric columns
img_predictions.describe()
tweet_id | img_num | p1_conf | p2_conf | p3_conf | |
---|---|---|---|---|---|
count | 2.075000e+03 | 2075.000000 | 2075.000000 | 2.075000e+03 | 2.075000e+03 |
mean | 7.384514e+17 | 1.203855 | 0.594548 | 1.345886e-01 | 6.032417e-02 |
std | 6.785203e+16 | 0.561875 | 0.271174 | 1.006657e-01 | 5.090593e-02 |
min | 6.660209e+17 | 1.000000 | 0.044333 | 1.011300e-08 | 1.740170e-10 |
25% | 6.764835e+17 | 1.000000 | 0.364412 | 5.388625e-02 | 1.622240e-02 |
50% | 7.119988e+17 | 1.000000 | 0.588230 | 1.181810e-01 | 4.944380e-02 |
75% | 7.932034e+17 | 1.000000 | 0.843855 | 1.955655e-01 | 9.180755e-02 |
max | 8.924206e+17 | 4.000000 | 1.000000 | 4.880140e-01 | 2.734190e-01 |
- Everything looks okay here; confidence levels range from 0 to 1 across all columns.
# Examine the p1, p2 and p3 columns
for prediction in ('p1', 'p2', 'p3'):
print(Color.bold + Color.blue +
'\n10 Random entries and counts from {} column\n'.format(prediction)+
Color.end)
print(img_predictions[prediction].value_counts().sample(10))
10 Random entries and counts from p1 column
hyena             2
mud_turtle        1
prison            3
teapot            1
Saint_Bernard     7
giant_panda       1
English_setter    8
porcupine         5
coffee_mug        1
Leonberg          3
Name: p1, dtype: int64

10 Random entries and counts from p2 column
cloak                              1
Bernese_mountain_dog               1
toaster                            1
Brittany_spaniel                   8
rotisserie                         2
American_Staffordshire_terrier    21
printer                            1
mud_turtle                         1
gibbon                             2
hyena                              1
Name: p2, dtype: int64

10 Random entries and counts from p3 column
feather_boa                  2
lakeside                     2
lampshade                    1
quilt                        4
bull_mastiff                20
Chesapeake_Bay_retriever    27
loggerhead                   1
wombat                       1
space_shuttle                1
French_loaf                  2
Name: p3, dtype: int64
- It seems that not all the predictions in our `img_predictions` dataframe correspond to actual dog breeds.
Let's investigate this case further, and check for situations where none of the predictions detected a dog breed:
# Check for situations where the three predictions were not dog breeds
mask = (~img_predictions.p1_dog) & (~img_predictions.p2_dog) & (~img_predictions.p3_dog)
no_dog_predicted = img_predictions[mask]
print(Color.red + Color.bold +
'{} records found with no dogs detected!\n'.format(no_dog_predicted.shape[0])+
Color.end)
print(Color.green+ 'Printing the first five records...' +Color.end)
no_dog_predicted.head(5)
324 records found with no dogs detected!
Printing the first five records...
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 666051853826850816 | https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg | 1 | box_turtle | 0.933012 | False | mud_turtle | 0.045885 | False | terrapin | 0.017885 | False |
17 | 666104133288665088 | https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg | 1 | hen | 0.965932 | False | cock | 0.033919 | False | partridge | 0.000052 | False |
18 | 666268910803644416 | https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg | 1 | desktop_computer | 0.086502 | False | desk | 0.085547 | False | bookcase | 0.079480 | False |
21 | 666293911632134144 | https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg | 1 | three-toed_sloth | 0.914671 | False | otter | 0.015250 | False | great_grey_owl | 0.013207 | False |
25 | 666362758909284353 | https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg | 1 | guinea_pig | 0.996496 | False | skunk | 0.002402 | False | hamster | 0.000461 | False |
print(Color.green + Color.bold + 'Collecting two image samples for viewing..' + Color.end)
# Explore some of the images to crosscheck the predictions
for url in no_dog_predicted['jpg_url'].sample(2, random_state=12):
response = requests.get(url)
img = Image.open(BytesIO(response.content))
display(img)
Collecting two image samples for viewing..
- In 324 cases, none of the predictions `p1`, `p2` and `p3` detected a dog breed.
- The pulled images show that some of the tweets were not actually about dogs. This may explain why the algorithms didn't detect a dog in the first place.
- Further examination also shows some instances where the neural network gave false negative responses (a dog was present, but was detected as absent).
# Examine a summary of Json_tweet_info
json_tweet_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2327 non-null   object
 1   hashtag   2327 non-null   object
 2   retweets  2327 non-null   int64
 3   likes     2327 non-null   int64
dtypes: int64(2), object(2)
memory usage: 72.8+ KB
- 2327 records are present, 29 fewer than in the WeRateDogs archive data.
- The majority of these missing records were caused by Tweepy errors (probably from deleted tweets) during the gathering process.
# Check the unique records in the dataframe columns.
for col in json_tweet_info.columns[1:]:
print(Color.bold + Color.blue +
'\nValue counts for {} column\n'.format(col) + Color.end)
print(json_tweet_info[col].value_counts())
Value counts for hashtag column

None                               2300
[#BarkWeek]                           9
[#PrideMonth]                         3
[#WKCDogShow]                         1
[#notallpuppers]                      1
[#LoveTwitter]                        1
[#FinalFur]                           1
[#ImWithThor]                         1
[#WomensMarch]                        1
[#BellLetsTalk]                       1
[#GoodDogs]                           1
[#K9VeteransDay]                      1
[#ScienceMarch]                       1
[#dogsatpollingstations]              1
[#PrideMonthPuppo, #PrideMonth]       1
[#Canada150]                          1
[#BATP]                               1
[#NoDaysOff, #swole]                  1
Name: hashtag, dtype: int64

Value counts for retweets column

1019    7
552     6
50      6
409     5
406     5
       ..
1616    1
1656    1
3632    1
2719    1
706     1
Name: retweets, Length: 1637, dtype: int64

Value counts for likes column

0        160
2264       4
2205       4
659        3
3053       3
        ...
3864       1
3940       1
13424      1
5088       1
2293       1
Name: likes, Length: 1959, dtype: int64
- When present, hashtags are stored as lists instead of as Python strings (a possible normalization is sketched below).
- Some tweets are associated with multiple hashtags.
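If we wanted uniform string values, one option (a sketch, not applied here) is to join each list into a single comma-separated string:

# Sketch: flatten list-valued hashtags into comma-separated strings
json_tweet_info['hashtag'] = json_tweet_info['hashtag'].apply(
    lambda h: ', '.join(h) if isinstance(h, list) else h)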
Finally, let's check for columns that may be common across the three dataframes:
# Check for the common columns across the three dataframes.
np.intersect1d(np.intersect1d(wrd_archive.columns, img_predictions.columns),
json_tweet_info.columns)
array(['tweet_id'], dtype=object)
- The `tweet_id` column is the only common column across the three datasets.
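This means the three tables can eventually be joined on `tweet_id`. A sketch of that merge (assuming all three `tweet_id` columns have first been converted to strings, and using the cleaned copies created in the next section):

master = (archive_clean
          .merge(predictions_clean, on='tweet_id', how='inner')
          .merge(json_clean, on='tweet_id', how='inner'))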
The section below summarizes the findings from both visual and programmatic assessment of the datasets.

Quality Issues

`wrd_archive`:
- `tweet_id` is stored with the wrong datatype. Should be a string/object type.
- The `timestamp` column is stored as a string/object type rather than the Pandas datetime type.
- Null records in the `expanded_urls` column, the majority being retweeted posts and replies.
- Unexpected ratings in the `rating_numerator` and `rating_denominator` columns, with numerators as high as 1776 and denominators as low as 0.
- The `expanded_urls` column sometimes contains more than one link, separated by commas, all leading to the same page.

`img_predictions`:
- `tweet_id` is stored with the wrong datatype. Should be a string/object type.
- The predictions in columns `p1`, `p2` and `p3` are not uniformly formatted. Some entries are lowercase, some are uppercase and some are titlecase.
- 281 fewer records than in `wrd_archive`.
- Not all the predictions in `p1`, `p2`, and `p3` are dog breeds.

`json_tweet_info`:
- 29 fewer records than in `wrd_archive`; the majority caused by Tweepy errors during the gathering process.

Tidiness Issues

- Unwanted columns present: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`.
- The `text` column contains both tweet url and tweet text.
- Long and unnecessary links in the `source` column, with text embedded within HTML anchor tags.
- `rating_numerator` and `rating_denominator` can be reduced to one column.
- The `p1_dog`, `p2_dog` and `p3_dog` columns can be used to select the appropriate predictions to be used, then removed from our dataframe.
- Unneeded `img_num` column.
- All three dataframes can be merged into one master dataframe built around `wrd_archive`.

Before cleaning, we will create individual copies of the three dataframes:
# Create copies of the original dataframes
archive_clean = wrd_archive.copy()
predictions_clean = img_predictions.copy()
json_clean = json_tweet_info.copy()
# Filter out retweets and replies using a boolean mask
retweet_reply_mask = (archive_clean.retweeted_status_id.notnull() |
archive_clean.in_reply_to_status_id.notnull())
archive_clean = archive_clean[~retweet_reply_mask]
# Verify the absence of entries for the retweet and reply columns
assert archive_clean.retweeted_status_id.isnull().all()
assert archive_clean.in_reply_to_status_id.isnull().all()
print(Color.green + Color.bold +
'archive_clean has reduced to {:,} records.'.format(archive_clean.shape[0])+
Color.end)
archive_clean has reduced to 2,097 records.
# Convert tweet_ids to string datatype
archive_clean['tweet_id'] = archive_clean['tweet_id'].astype(str)
# Convert timestamp to a pandas datetime object
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])
archive_clean[['tweet_id', 'timestamp']].dtypes
tweet_id                  object
timestamp    datetime64[ns, UTC]
dtype: object
# Create a boolean mask to identify the unusual names
unusual_names_mask = archive_clean['name'].str.match(r"[a-z].?")
# Identify each unique unusual name from the name column
unusual_names = archive_clean['name'][unusual_names_mask].unique()
# Replace all unusual names with None
archive_clean['name'] = archive_clean['name'].apply(lambda n: 'None' if n in unusual_names else n)
# Verify if there are any improper names still present
assert archive_clean['name'].str.match(r"[a-z].?").sum() == 0
Null records in the `expanded_urls` column, the majority being retweets and replies. The `expanded_urls` column sometimes contains more than one link, separated by commas, all leading to the same page.

- Drop the `expanded_urls` column since the urls are already present in the tweet text.
- Having multiple links leading to the same page is redundant. We will split the text column into two columns later.
# Drop the expanded urls from archive clean
archive_clean.drop(columns='expanded_urls', inplace = True)
# Check if the expanded urls column is now absent from the dataframe
assert 'expanded_urls' not in archive_clean.columns
Unexpected ratings in the `rating_numerator` and `rating_denominator` columns, with numerators as high as 1776 and denominators as low as 0.

The fact that the rating numerators are greater than the denominators does not need to be cleaned; this unique rating system is a big part of the popularity of WeRateDogs. However, we will:

- Remove the records with the overly high ratings of 420/10 and 1776/10.
- Remove the record with a rating of 24/7. This is a date, not an actual rating; the right rating is absent from the text.
- Programmatically extract the right ratings from the text to replace the wrong ones.
- Convert high ratings allocated to dog groups to a scale of 10. This will be done later, when tidying up the dataframe.
# Filter out records with unwanted ratings: 420/10, 1776/10 and 24/7
for num, denum in zip([420, 1776, 24], [10, 10, 7]):
mask = (archive_clean['rating_numerator'] == num) & (archive_clean['rating_denominator'] == denum)
archive_clean = archive_clean[~mask]
# Isolate unusual ratings: numerator > 15 and denominator not equal to 10
unusual_rating_mask = (archive_clean['rating_numerator'] > 15) | (archive_clean['rating_denominator']!=10)
unusual_ratings = archive_clean[unusual_rating_mask].copy()
# Replace the numerator and denominators with the right values, if present in the tweet text
pattern = r"([0-9\.]+/[0-9]+)"
unusual_ratings[['rating_numerator', 'rating_denominator']] = (unusual_ratings['text']
.str.findall(pattern)
.str[-1]
.str.split('/', expand=True)
)
# Streamline the result down to the relevant columns
cleaned_ratings = unusual_ratings[['text', 'rating_numerator', 'rating_denominator']]
# Update the ratings in archive clean with the cleaned ratings
archive_clean.update(cleaned_ratings)
# Verify the removal of the unwanted ratings.
for rating_num in [420, 24, 1776]:
assert rating_num not in archive_clean.rating_numerator.unique()
# Verify the records with unusual ratings
archive_clean[unusual_rating_mask][['text', 'rating_numerator', 'rating_denominator']]
text | rating_numerator | rating_denominator | |
---|---|---|---|
433 | The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd | 84 | 70 |
695 | This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS | 9.75 | 10 |
763 | This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://... | 11.27 | 10 |
902 | Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE | 165 | 150 |
1068 | After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ | 14 | 10 |
1120 | Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv | 204 | 170 |
1165 | Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a | 13 | 10 |
1202 | This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq | 11 | 10 |
1228 | Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 | 99 | 90 |
1254 | Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 | 80 | 80 |
1274 | From left to right:\nCletus, Jerome, Alejandro, Burp, & Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK | 45 | 50 |
1351 | Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa | 60 | 50 |
1433 | Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ | 44 | 40 |
1635 | Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 | 121 | 110 |
1662 | This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5 | 10 | 10 |
1712 | Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD | 11.26 | 10 |
1779 | IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq | 144 | 120 |
1843 | Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw | 88 | 80 |
2335 | This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv | 9 | 10 |
# Drop unwanted columns from archive clean
unwanted_cols = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp']
archive_clean.drop(columns=unwanted_cols, inplace=True)
# Verify that the unwanted columns have been dropped
for col in unwanted_cols:
assert col not in archive_clean.columns
# Create a pattern to extract urls
pattern = r"(http.+)"
# Extract urls into a tweet url column
archive_clean['tweet_url'] = archive_clean['text'].str.extract(pattern)
# Account for records where tweet text does not contain a url
archive_clean['tweet_url'].fillna('None', inplace = True)
# Remove urls from the text column
archive_clean['text'] = archive_clean['text'].str.replace(pattern, '', regex=True)
archive_clean[['text', 'tweet_url']].sample(5)
text | tweet_url | |
---|---|---|
1582 | This is Baxter. He looks like a fun dog. Prefers action shots. 11/10 the last one is impeccable | https://t.co/LHcH1yhhIb |
1871 | When you're presenting a group project and the 4th guy tells the teacher that he did all the work. 10/10 | https://t.co/f50mbB4UWS |
7 | When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 | https://t.co/v0nONBcwxq |
1830 | This is Kenneth. He's stuck in a bubble. 10/10 hang in there Kenneth | https://t.co/uQt37xlYMJ |
1459 | This may be the greatest video I've ever been sent. 4/10 for Charles the puppy, 13/10 overall. (Vid by @stevenxx_) | https://t.co/uaJmNgXR2P |
# Create a pattern to extract info between the <a></a> tags
pattern = r">(.+)<"
# Extract information using the defined pattern
archive_clean['source'] = archive_clean['source'].str.extract(pattern)
# Verify the extraction process
archive_clean.source.value_counts()
Twitter for iPhone 1962 Vine - Make a Scene 91 Twitter Web Client 31 TweetDeck 10 Name: source, dtype: int64
`rating_numerator` and `rating_denominator` can be reduced to one column.

- Convert all ratings to a denominator scale of 10 using the expression: $rating = \frac{rating\,numerator}{rating\,denominator} \times 10$. With this expression, a rating of 120/100 becomes 12/10 and a rating of 55/60 becomes 9.17/10.
- Once the ratings are standardized, reduce the ratings to a single column called `rating`.
- Drop the `rating_numerator` and `rating_denominator` columns.
# Use the expression to calculate a single rating value
rating = 10 * (archive_clean['rating_numerator'].astype(float) / archive_clean['rating_denominator'].astype(float))
# Allocate the values into a new column in archive_clean
archive_clean['rating'] = rating
# Drop the rating numerator and denominator columns
archive_clean.drop(columns=['rating_numerator', 'rating_denominator'], inplace=True)
# verify the removal of the dropped columns
for col in 'rating_numerator', 'rating_denominator':
assert col not in archive_clean.columns
# Check how standardized ratings are now distributed in the dataframe
archive_clean.rating.describe().to_frame()
rating | |
---|---|
count | 2094.000000 |
mean | 10.610926 |
std | 2.147757 |
min | 0.000000 |
25% | 10.000000 |
50% | 11.000000 |
75% | 12.000000 |
max | 14.000000 |
- The ratings now appear standardized. However, it seems there are record(s) with ratings of 0. We should investigate this further:
# Verify the rating in the tweet text where the rating is equal to zero
print(Color.bold + Color.green +
'Verifying the text in records where rating is zero...\n'+
Color.end)
print(archive_clean.loc[archive_clean.rating==0, 'text'])
Verifying the text in records where rating is zero...
315 When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10
Name: text, dtype: object
- The 0/10 is a genuine rating present in the tweet text, so we will leave it as is.

Next, we tidy up the dog stage columns:

- Check and correct for conflicting dog stages, if present.
- Store all the dog stages in a single column called `stage`.
- Drop the columns `doggo`, `pupper`, `puppo`, and `floofer`.
- Set the `stage` column to a categorical type.
# Isolate the dog stage columns into a dataframe
stage_df = archive_clean[['doggo', 'pupper', 'puppo', 'floofer']]
# Check if there are situations where multiple stages co-exist
print(Color.bold + Color.green +
'Checking for the existence of multiple dog stages\n'+
Color.end)
stage = stage_df.sum(axis=1)
stage.value_counts()
Checking for the existence of multiple dog stages
NoneNoneNoneNone 1758 NonepupperNoneNone 221 doggoNoneNoneNone 72 NoneNonepuppoNone 23 NoneNoneNonefloofer 9 doggopupperNoneNone 9 doggoNonepuppoNone 1 doggoNoneNonefloofer 1 dtype: int64
- It seems that some records actually present with multiple dog stages. The raw concatenations above are hard to read, so we will trim off the extra `None` substrings.
# Remove 'None' from each entry, unless the string is made up only of Nones
stage = stage.apply(lambda x: x.replace('None', '') if x.replace('None', '') != '' else 'None')
stage.value_counts()
None 1758 pupper 221 doggo 72 puppo 23 floofer 9 doggopupper 9 doggopuppo 1 doggofloofer 1 dtype: int64
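As an aside, the same trimming can be done without `apply`, using pandas' vectorized string methods. A minimal sketch over the same `stage` series:
# Strip the 'None' substrings, then restore 'None' for rows that contained nothing else
stage_alt = stage.str.replace('None', '', regex=False).replace('', 'None')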
- 11 records have dogs classified as a mix of doggo and some other stages. To be sure this was not done in error, we can examine the tweet text in detail. These records are few, so we can manually identify and correct them.
# Assign the dog stages into a column in archive_clean
archive_clean['stage'] = stage
# Identify and isolate records where dogs were assigned multiple stages
multiple_stages = ['doggopupper', 'doggopuppo', 'doggofloofer']
multiple_stage_mask = archive_clean.stage.apply(lambda x: x in multiple_stages)
# Examine these occurrences
archive_clean[multiple_stage_mask][['tweet_id', 'tweet_url', 'text', 'stage']]
tweet_id | tweet_url | text | stage | |
---|---|---|---|---|
191 | 855851453814013952 | https://t.co/cMhq16isel | Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for | doggopuppo |
200 | 854010172552949760 | https://t.co/TXdT3tmuYk | At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs | doggofloofer |
460 | 817777686764523521 | https://t.co/m7isZrOBX7 | This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodoggo) | doggopupper |
531 | 808106460588765185 | https://t.co/ANBpEYHaho | Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time | doggopupper |
575 | 801115127852503040 | https://t.co/55Dqe0SJNj | This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine | doggopupper |
705 | 785639753186217984 | https://t.co/f2wmLZTPHd | This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously | doggopupper |
733 | 781308096455073793 | https://t.co/WQvcPEpH2u | Pupper butt 1, Doggo 0. Both 12/10 | doggopupper |
889 | 759793422261743616 | https://t.co/MYwR4DQKll | Meet Maggie & Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time | doggopupper |
956 | 751583847268179968 | https://t.co/u2c9c7qSg8 | Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho | doggopupper |
1063 | 741067306818797568 | https://t.co/o5J479bZUC | This is just downright precious af. 12/10 for both pupper and doggo | doggopupper |
1113 | 733109485275860992 | https://t.co/pG2inLaOda | Like father (doggo), like son (pupper). Both 12/10 | doggopupper |
After examining the tweet ids, the tweet text and the tweet urls, we can observe the following:
- Tweets with id: 808106460588765185, 781308096455073793, 759793422261743616, 741067306818797568, and 733109485275860992 are actually about two dogs, a doggo and a pupper, hence the doggopupper classification. We will leave them as they are.
- Some dogs were erroneously categorized, but the appropriate dog stage is in the tweet text:
- [855851453814013952](https://t.co/cMhq16isel) should be puppo.
- [854010172552949760](https://t.co/TXdT3tmuYk) should be floofer.
- [817777686764523521](https://t.co/m7isZrOBX7) should be pupper.
- [801115127852503040](https://t.co/55Dqe0SJNj) should be pupper.
- [751583847268179968](https://t.co/u2c9c7qSg8) should be doggo.
- 785639753186217984 is not about a dog. The tweet is actually about a hedgehog. We will remove this record.
# Remove the record about a hedgehog: tweet_id 785639753186217984.
archive_clean = archive_clean.query("tweet_id != '785639753186217984'")
# Correct the erroneously categorized records
correction_dict = {
'855851453814013952': 'puppo',
'854010172552949760': 'floofer',
'817777686764523521': 'pupper',
'801115127852503040': 'pupper',
'751583847268179968': 'doggo'
}
for tweet_id_value, corrected_stage in correction_dict.items():
    archive_clean.loc[archive_clean['tweet_id'] == tweet_id_value, 'stage'] = corrected_stage
# Drop the columns puppo, doggo, floofer and pupper
archive_clean.drop(columns=['doggo', 'pupper', 'puppo', 'floofer'], inplace=True)
# Convert the stage column to categorical type
archive_clean.stage = archive_clean.stage.astype('category')
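As an aside, the correction loop above can also be written as a single vectorized assignment. A sketch, where `map` supplies the corrected stages and the boolean mask limits the assignment to the affected tweet ids:
# Equivalent one-step correction using a boolean mask and Series.map
mask = archive_clean['tweet_id'].isin(correction_dict.keys())
archive_clean.loc[mask, 'stage'] = archive_clean['tweet_id'].map(correction_dict)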
# Check if the record with tweet_id 785639753186217984 has been dropped
assert '785639753186217984' not in archive_clean.tweet_id.values
# Check that the unwanted columns have been dropped
assert not archive_clean.columns.isin(['doggo', 'pupper', 'puppo', 'floofer']).any()
# Verify the datatype in the stage column
assert archive_clean.stage.dtypes == 'category'
# Verify the values in the stage column
archive_clean.stage.value_counts()
None 1758 pupper 223 doggo 73 puppo 24 floofer 10 doggopupper 5 Name: stage, dtype: int64
Note: One more thing! Let's format the dog stage entries to title case, then give `doggopupper` a more befitting value.
# Format dog stage entries to title case, giving 'doggopupper' a more befitting label
# (rename_categories keeps the column's categorical dtype intact)
archive_clean.stage = archive_clean.stage.cat.rename_categories(
    lambda x: x.title() if x != 'doggopupper' else 'Doggo with Pupper')
archive_clean.stage.value_counts()
None 1758 Pupper 223 Doggo 73 Puppo 24 Floofer 10 Doggo with Pupper 5 Name: stage, dtype: int64
Finally, let's reset the dataframe index and preview our cleaning results:
# Reset the indices of the archive clean dataframe
archive_clean = archive_clean.reset_index(drop=True)
# Preview cleaning results
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2093 entries, 0 to 2092
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   tweet_id   2093 non-null   object
 1   timestamp  2093 non-null   datetime64[ns, UTC]
 2   source     2093 non-null   object
 3   text       2093 non-null   object
 4   name       2093 non-null   object
 5   tweet_url  2093 non-null   object
 6   rating     2093 non-null   float64
 7   stage      2093 non-null   category
dtypes: category(1), datetime64[ns, UTC](1), float64(1), object(5)
memory usage: 116.8+ KB
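One more cheap guarantee worth adding here (a sketch, not part of the original checks): tweet ids should be unique in the cleaned archive, something the merges later in this notebook implicitly rely on.
# Ensure no tweet appears twice in the cleaned archive
assert not archive_clean.tweet_id.duplicated().any()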
Note: Some of the tidiness issues in `img_predictions` overlap with quality issues. Solving the tidiness issues first can make dealing with the quality issues easier later.
- The prediction columns (`p1`, `p2`, `p3`) and their respective confidence columns (`p1_conf`, `p2_conf`, `p3_conf`) can be reduced into two columns containing `prediction` and `confidence` variables.
- The `p1_dog`, `p2_dog` and `p3_dog` columns can be used to select the appropriate predictions.
- The `img_num` column is not needed and can be dropped.
column.Part A
- Iterate through each row of
predictions_clean
and extract the best prediction and confidence values.- Assign these values into new columns named
breed
andconfidence
.
Part B
- Drop all unwanted columns:
p1
,p2
,p3
,p1_conf
,p2_conf
,p3_conf
,p1_dog
,p2_dog
,p3_dog
andimg_num
.
# Create a list to store the best prediction and confidence values
prediction_list = []
# Define a function to perform the extraction process
def extract_breed_info(row):
    """
    Extract the best prediction and confidence value from the passed row.
    Params:
        row: a row from the dataframe of interest.
    Output:
        Appends a dictionary containing the breed and confidence to prediction_list.
        Returns a short status string.
    """
    if row.p1_dog:
        prediction_list.append({'breed': row.p1, 'confidence': row.p1_conf})
    elif row.p2_dog:
        prediction_list.append({'breed': row.p2, 'confidence': row.p2_conf})
    elif row.p3_dog:
        prediction_list.append({'breed': row.p3, 'confidence': row.p3_conf})
    else:
        prediction_list.append({'breed': 'Unknown', 'confidence': 0})
    return 'Info extracted to prediction list'
# Run the extraction process
predictions_clean.apply(extract_breed_info, axis=1)
0 Info extracted to prediction list 1 Info extracted to prediction list 2 Info extracted to prediction list 3 Info extracted to prediction list 4 Info extracted to prediction list ... 2070 Info extracted to prediction list 2071 Info extracted to prediction list 2072 Info extracted to prediction list 2073 Info extracted to prediction list 2074 Info extracted to prediction list Length: 2075, dtype: object
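For what it's worth, the same best-prediction selection can be vectorized with numpy's `select`, avoiding the row-wise `apply` and the module-level list. A sketch using the already-imported numpy:
# Conditions are evaluated in order, mirroring the if/elif chain above
conditions = [predictions_clean.p1_dog, predictions_clean.p2_dog, predictions_clean.p3_dog]
breeds = [predictions_clean.p1, predictions_clean.p2, predictions_clean.p3]
confidences = [predictions_clean.p1_conf, predictions_clean.p2_conf, predictions_clean.p3_conf]
predictions_clean['breed'] = np.select(conditions, breeds, default='Unknown')
predictions_clean['confidence'] = np.select(conditions, confidences, default=0)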
# Assign the values in prediction list into new columns in predictions_clean
predictions_clean[['breed', 'confidence']] = pd.DataFrame(prediction_list)
# Round confidence to three decimal places
predictions_clean.confidence = round(predictions_clean.confidence, 3)
# Verify the extraction process
predictions_clean.iloc[:, 3:].sample(5)
p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | breed | confidence | |
---|---|---|---|---|---|---|---|---|---|---|---|
1391 | beagle | 0.451697 | True | basset | 0.197513 | True | bloodhound | 0.072699 | True | beagle | 0.452 |
98 | fire_engine | 0.883493 | False | tow_truck | 0.074734 | False | jeep | 0.012773 | False | Unknown | 0.000 |
24 | malamute | 0.336874 | True | Siberian_husky | 0.147655 | True | Eskimo_dog | 0.093412 | True | malamute | 0.337 |
2009 | basset | 0.320420 | True | collie | 0.215975 | True | Appenzeller | 0.128507 | True | basset | 0.320 |
175 | Chihuahua | 0.803528 | True | Pomeranian | 0.053871 | True | chow | 0.032257 | True | Chihuahua | 0.804 |
# Create a list of unwanted columns
unwanted_columns = ['p1','p2', 'p3','p1_conf', 'p2_conf', 'p3_conf','p1_dog', 'p2_dog', 'p3_dog', 'img_num']
# Drop all unwanted columns
predictions_clean.drop(columns=unwanted_columns, inplace=True)
# Check that the unwanted columns have been dropped
assert not predictions_clean.columns.isin(unwanted_columns).any()
# Convert tweet_ids to string datatype
predictions_clean['tweet_id'] = predictions_clean['tweet_id'].astype(str)
# Verify the datatype for tweet_id
assert predictions_clean['tweet_id'].dtypes == 'O'
`p1`, `p2` and `p3` entries are not uniformly formatted: some are lowercase, some are uppercase and some are titlecase.
- Perform the cleaning on the `breed` column instead, since `p1`, `p2` and `p3` have been removed.
- Replace underscores with spaces and format all entries to titlecase.
# Remove all underscores and format the breed text to titlecase.
predictions_clean.breed = predictions_clean.breed.str.replace('_', ' ').str.title()
predictions_clean.breed.unique()
array(['Welsh Springer Spaniel', 'Redbone', 'German Shepherd', 'Rhodesian Ridgeback', 'Miniature Pinscher', 'Bernese Mountain Dog', 'Unknown', 'Chow', 'Golden Retriever', 'Miniature Poodle', 'Gordon Setter', 'Walker Hound', 'Pug', 'Bloodhound', 'Lhasa', 'English Setter', 'Italian Greyhound', 'Maltese Dog', 'Newfoundland', 'Malamute', 'Soft-Coated Wheaten Terrier', 'Chihuahua', 'Black-And-Tan Coonhound', 'Toy Terrier', 'Blenheim Spaniel', 'Pembroke', 'Irish Terrier', 'Chesapeake Bay Retriever', 'Curly-Coated Retriever', 'Dalmatian', 'Ibizan Hound', 'Border Collie', 'Labrador Retriever', 'Miniature Schnauzer', 'Airedale', 'Rottweiler', 'West Highland White Terrier', 'Toy Poodle', 'Giant Schnauzer', 'Vizsla', 'Siberian Husky', 'Papillon', 'Saint Bernard', 'Tibetan Terrier', 'Borzoi', 'Beagle', 'Yorkshire Terrier', 'Pomeranian', 'Kuvasz', 'Flat-Coated Retriever', 'Norwegian Elkhound', 'Boxer', 'Eskimo Dog', 'Standard Poodle', 'Staffordshire Bullterrier', 'Basenji', 'Lakeland Terrier', 'American Staffordshire Terrier', 'Shih-Tzu', 'Groenendael', 'French Bulldog', 'Pekinese', 'Komondor', 'Malinois', 'Kelpie', 'Brittany Spaniel', 'Cocker Spaniel', 'Basset', 'English Springer', 'Cardigan', 'Brabancon Griffon', 'German Short-Haired Pointer', 'Shetland Sheepdog', 'Cairn', 'Whippet', 'Sussex Spaniel', 'Dandie Dinmont', 'Norwich Terrier', 'Keeshond', 'Norfolk Terrier', 'Old English Sheepdog', 'Samoyed', 'Scottish Deerhound', 'Doberman', 'Irish Wolfhound', 'Great Pyrenees', 'Schipperke', 'Bull Mastiff', 'Collie', 'Greater Swiss Mountain Dog', 'Standard Schnauzer', 'Irish Water Spaniel', 'Boston Bull', 'Japanese Spaniel', 'Bedlington Terrier', 'Entlebucher', 'Bluetick', 'Irish Setter', 'Leonberg', 'Mexican Hairless', 'Weimaraner', 'Great Dane', 'Tibetan Mastiff', 'Scotch Terrier', 'Australian Terrier', 'Briard', 'Appenzeller', 'Border Terrier', 'Wire-Haired Fox Terrier', 'Saluki', 'Silky Terrier', 'Afghan Hound', 'Clumber', 'Bouvier Des Flandres'], dtype=object)
Note: We will not clean entries containing dashes (`-`), since their use is grammatically correct in this case.
Not all predictions in `p1`, `p2`, and `p3` are dog breeds.
- This has been addressed in the process of tidying up the data.
- Predictions that are not dog breeds have been assigned a value of Unknown.
# Verify the presence of records tagged unknown
predictions_clean.breed.value_counts().head()
Unknown 324 Golden Retriever 173 Labrador Retriever 113 Pembroke 96 Chihuahua 95 Name: breed, dtype: int64
The image predictions data contains fewer records than `wrd_archive`.
- Account for this by merging `archive_clean` and `predictions_clean` with an inner join. This way, only records common to both dataframes will be retained.
- We will do this after cleaning the `json_clean` dataframe.
- Transform each hashtag from the tweet into a distinct record using the `df.explode()` method.
# Transform each element in the hashtag list to a distinct row in the dataframe
json_clean = json_clean.explode('hashtag')
json_clean.hashtag.value_counts()
None 2300 #BarkWeek 9 #PrideMonth 4 #BellLetsTalk 1 #NoDaysOff 1 #notallpuppers 1 #LoveTwitter 1 #FinalFur 1 #ImWithThor 1 #WomensMarch 1 #GoodDogs 1 #WKCDogShow 1 #K9VeteransDay 1 #ScienceMarch 1 #dogsatpollingstations 1 #PrideMonthPuppo 1 #Canada150 1 #BATP 1 #swole 1 Name: hashtag, dtype: int64
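To illustrate what `explode` does, here is a toy example on hypothetical data (not from the archive):
# Each list element becomes its own row; the original index is repeated
toy = pd.DataFrame({'tweet_id': ['1', '2'], 'hashtag': [['#a', '#b'], ['#c']]})
print(toy.explode('hashtag'))
#   tweet_id hashtag
# 0        1      #a
# 0        1      #b
# 1        2      #c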
The engagement data in `json_clean` (retweets, likes and hashtags) describes the same observational unit as the tweets in `wrd_archive`.
- We can resolve this by merging the `json_clean` dataframe into `archive_clean`.
It is better to create a master dataset by merging all the cleaned dataframes with inner joins. In addition to the issue listed above, this merge also addresses the pending issue of unequal record counts across the three dataframes.
# reset dataframe indices for json_clean and predictions_clean
json_clean = json_clean.reset_index(drop=True)
predictions_clean = predictions_clean.reset_index(drop=True)
# Merge archive clean and prediction clean into master df
master_df = pd.merge(archive_clean, predictions_clean, on='tweet_id', how='inner')
# Merge json clean into master df
master_df = pd.merge(master_df, json_clean, on='tweet_id', how='inner')
# Check results
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1961 entries, 0 to 1960
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   tweet_id    1961 non-null   object
 1   timestamp   1961 non-null   datetime64[ns, UTC]
 2   source      1961 non-null   object
 3   text        1961 non-null   object
 4   name        1961 non-null   object
 5   tweet_url   1961 non-null   object
 6   rating      1961 non-null   float64
 7   stage       1961 non-null   category
 8   jpg_url     1961 non-null   object
 9   breed       1961 non-null   object
 10  confidence  1961 non-null   float64
 11  hashtag     1961 non-null   object
 12  retweets    1961 non-null   int64
 13  likes       1961 non-null   int64
dtypes: category(1), datetime64[ns, UTC](1), float64(2), int64(2), object(8)
memory usage: 216.6+ KB
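Since inner joins silently drop rows and, with duplicate keys, can multiply them, the merge cell above could be guarded with pandas' `validate` argument. A sketch: the `'one_to_one'` check assumes unique tweet ids on both sides, which holds for `archive_clean` and `predictions_clean`; the exploded `json_clean` can legitimately repeat tweet ids, so `'one_to_many'` is the most we can assert there.
# Raise a MergeError if the key assumptions are violated
master_df = pd.merge(archive_clean, predictions_clean, on='tweet_id', how='inner', validate='one_to_one')
master_df = pd.merge(master_df, json_clean, on='tweet_id', how='inner', validate='one_to_many')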
- We can reorder the columns in a more intuitive pattern.
- We should assign more descriptive names to columns like `name`, `breed`, `stage`, and `jpg_url`.
Let's add these finishing touches to our master dataframe:
# Order the columns in master df
column_order = ['tweet_id', 'timestamp', 'name', 'breed', 'confidence', 'stage', 'rating',
'hashtag', 'retweets', 'likes', 'jpg_url', 'tweet_url', 'text']
master_df = master_df[column_order]
# Give some columns descriptive names
master_df.rename(
columns={
'name': 'dog_name',
'breed': 'dog_breed',
'stage': 'dog_stage',
'jpg_url': 'image'
}, inplace=True)
# Preview master dataframe information
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1961 entries, 0 to 1960
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   tweet_id    1961 non-null   object
 1   timestamp   1961 non-null   datetime64[ns, UTC]
 2   dog_name    1961 non-null   object
 3   dog_breed   1961 non-null   object
 4   confidence  1961 non-null   float64
 5   dog_stage   1961 non-null   category
 6   rating      1961 non-null   float64
 7   hashtag     1961 non-null   object
 8   retweets    1961 non-null   int64
 9   likes       1961 non-null   int64
 10  image       1961 non-null   object
 11  tweet_url   1961 non-null   object
 12  text        1961 non-null   object
dtypes: category(1), datetime64[ns, UTC](1), float64(2), int64(2), object(7)
memory usage: 201.3+ KB
master_df.head(1)
tweet_id | timestamp | dog_name | dog_breed | confidence | dog_stage | rating | hashtag | retweets | likes | image | tweet_url | text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | 2017-08-01 16:23:56+00:00 | Phineas | Unknown | 0.0 | None | 13.0 | None | 7018 | 33839 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | https://t.co/MgUWQ76dJU | This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 |
Let's store our cleaned master dataframe locally to a file named `twitter_archive_master.csv`:
# Store master_df locally
master_df.to_csv('./twitter_archive_master.csv', index=False, encoding='utf-8')
# Verify storage process
print(Color.green + Color.bold +
'Printing csv file list in local directory...'+
Color.end)
!ls -lh *.csv
Printing csv file list in local directory...
-rw-r--r--@ 1 israelogunmola staff 894K Jun 1 19:56 twitter-archive-enhanced.csv
-rw-r--r-- 1 israelogunmola staff 514K Jun 15 15:53 twitter_archive_master.csv
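As a quick sanity check (a sketch, not part of the original pipeline), we can read the file back and confirm the records survived the round trip. Note that dtypes such as category and timezone-aware datetimes do not round-trip through CSV:
# Read the stored file back and compare record counts and column order
reloaded = pd.read_csv('./twitter_archive_master.csv')
assert reloaded.shape[0] == master_df.shape[0]
assert list(reloaded.columns) == list(master_df.columns)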
Our analysis will focus on exploring the data to understand the following:
- User engagement will be assessed in terms of retweets and likes.
- Filter out records for which we don't have a dog breed.
- Consider the top 20 breeds in terms of the number of tweets posted.
- Visualize this information using a bar graph.
# Remove records with unknown breeds
known_breeds_df = master_df.query("dog_breed != 'Unknown'")
# Isolate the 20 most popular breeds and their tweet counts
popular_20 = known_breeds_df.dog_breed.value_counts().head(20)
# Create a folder to store visualizations locally
folder = 'images'
if not os.path.exists(folder):
os.makedirs(folder)
# Create a visual
fig = px.bar(x=popular_20.index, y=popular_20.values, text=popular_20.values, height=600, width=1200, template='plotly_white')
fig.update_xaxes(tickangle=90, linecolor='grey')
fig.update_yaxes(showgrid=False, showticklabels=False)
fig.update_traces(width=0.6, textposition='outside', marker_color='royalblue')
fig.update_layout(
yaxis_title='No of Tweets',
xaxis_title='Dog Breed',
title='20 Most Popular Breeds<br><sup>Popular breeds ranked by number of tweets between Nov 2015 and Jul 2017.</sup>',
paper_bgcolor='rgb(248, 248, 255)',
plot_bgcolor='rgb(248, 248, 255)',
font_family='Arial'
)
# Store graph locally
fig.write_image('images/fig1.svg')
# Display graph
fig.show('svg')
- The Golden Retriever is the most popular breed, with a total of 155 tweets. The Labrador Retriever follows with 103 tweets. Together, the two retrievers account for 258 tweets.
- Other notable breeds include the Pembroke, Chihuahua and Pug, occupying 3rd to 5th place respectively.
- Filter out records for which we don't have a dog breed. We already have this information in `known_breeds_df`.
- Filter out breeds with fewer than 10 tweets, since we are also considering popularity.
- Compute average ratings for each breed and select the top 20.
- Visualize the results.
# Identify the number of breeds with at least 10 tweets
breed_tweet_count = known_breeds_df.dog_breed.value_counts()
print(Color.red + Color.bold +
'There are {} breeds with at least 10 tweets'.format((breed_tweet_count >= 10).sum())+
Color.end)
There are 53 breeds with at least 10 tweets
# Get the names of breeds with at least 10 tweets
wanted_breeds = breed_tweet_count[breed_tweet_count >=10].index
# Select the right breeds from the dataframe
wanted_breed_ratings = master_df[master_df.dog_breed.isin(wanted_breeds)][['dog_breed', 'rating']]
# Compute average rating per breed, then select the top 20
top_20_rated = wanted_breed_ratings.groupby('dog_breed').mean().sort_values(by='rating').tail(20)
# --- Visualize results ---
# Create a color map to identify breeds in the popular 20 list
color_map = top_20_rated.reset_index()['dog_breed'].isin(popular_20.index)
# Create plot area and add traces
fig = go.Figure()
fig.add_trace(go.Bar(y=top_20_rated.index, x=top_20_rated.rating, orientation='h'))
fig.add_trace(go.Scatter(y=top_20_rated.index, x=top_20_rated.rating+0.016, mode='markers', marker_size=10))
# --- Set trace properties ---
# Set trace properties for bar plot
fig.update_traces(width=0.2, selector=dict(type="bar"))
# Set trace properties for scatter plot
fig.update_traces(textposition='middle right', selector=dict(type="scatter"),
marker_line_color='black', marker_line_width=1)
#Set trace properties common to both plots
fig.update_traces(marker=dict(color=color_map.astype(int), colorscale=[[0, '#7F7F7F'], [1, 'royalBlue']],
opacity=0.9))
# Update axes and plot layout
fig.update_yaxes(ticksuffix=' ', tickfont=dict(size=12))
fig.update_xaxes(showgrid=True, gridcolor='#ddd', showticklabels=True, range=[10, 12], tickfont_color='grey')
fig.update_layout(xaxis_title='Rating', yaxis_title='', font_family='Arial',
width=700, height=650, margin=dict(t=80, b=70),
template='plotly_white', showlegend=False,
title = 'Top 20 Breeds by Average Rating<br><sup>The blue bars represent breeds '+
'that are also present on the popular 20 list.</sup>',
paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)'
)
# Add annotations
fig.add_vline(x=10, annotation=dict(text='<b>Common denominator</b>'), annotation_position='top', annotation_font_color='IndianRed')
# Store graph locally
fig.write_image('images/fig2.svg')
fig.show('svg')
- The Samoyed, Golden Retriever, Great Pyrenees, Pembroke and Chow are the top five breeds in terms of average rating.
- 13 of the top rated breeds (13/20) are also present on the most popular list. It appears that these breeds do well in both rating and popularity.
- Identify and filter out records where a dog stage was not mentioned.
- Isolate only the proper dog stages (doggo, puppo and pupper). Floofer isn't a proper dog stage, since a dog at any of the other stages can be a floofer too.
- Compute the average retweets, likes and ratings by dog stage.
- Remove records with two stages in one tweet (e.g. a doggo and a pupper), since it is hard to tell which dog users engaged with.
- Melt the resulting dataframe for ease of plotting with Plotly.
- Visualize the results.
# First estimate the fraction of total records where dog stage was mentioned
stage_counts = master_df.dog_stage.value_counts()
print(Color.green + Color.bold +
'Only {} records mentioned the stage of the dog'.format(stage_counts.drop('None').sum())+
Color.end)
Only 302 records mentioned the stage of the dog
# Compute the mean retweets, likes and ratings for each dog stage
stage_aggregates = master_df.groupby('dog_stage')[['retweets', 'likes', 'rating']].mean()
# Remove records with None, Floofer and Doggo with Pupper
stage_aggregates.drop(index = ['None', 'Floofer', 'Doggo with Pupper'], inplace=True)
# Melt the dataframe for plotting ease
stage_aggregates = stage_aggregates.reset_index().melt(id_vars='dog_stage', var_name='criteria', value_name='mean')
stage_aggregates
dog_stage | criteria | mean | |
---|---|---|---|
0 | Doggo | retweets | 5901.269841 |
1 | Pupper | retweets | 1930.334975 |
2 | Puppo | retweets | 5703.666667 |
3 | Doggo | likes | 17403.317460 |
4 | Pupper | likes | 6282.990148 |
5 | Puppo | likes | 20418.166667 |
6 | Doggo | rating | 11.761905 |
7 | Pupper | rating | 10.656502 |
8 | Puppo | rating | 12.083333 |
# --- Visualize results ---
# Create main plot area
fig = px.bar(stage_aggregates, y='dog_stage', x = 'mean', orientation='h', facet_col='criteria', template='plotly_white',
height=300, width=1200, facet_col_spacing=0.04)
# Customize traces and annotations
fig.update_traces(width=0.7, marker_color=['grey', 'grey', '#DC2912'], opacity=0.7)
fig.for_each_annotation(lambda a: a.update(text= a.text.split("=")[-1].title() + ' on average'))
# Update axes and plot layout
fig.update_xaxes(matches=None, showline=True, linewidth=1, linecolor='grey', mirror=True,
titlefont_size=12, tickfont_color='grey')
fig.update_yaxes(ticksuffix=' ', showline=True, linewidth=1, linecolor='grey', mirror=True)
fig.update_layout(yaxis_title='Dog stage', xaxis_title='', xaxis2_title='', xaxis3_title='',
paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)',
title='People Love and Rate Puppos; Doggos Gather Retweets!<br>'+
'<sup>User retweets, likes and ratings compared across various dog stages.</sup>',
title_x=0.5, font_family='Arial',
margin=dict(t=100, b=70))
# Store graph locally
fig.write_image('images/fig3.svg')
fig.show('svg')
- Puppos seem to be the people's favorite, leading in average likes (over 20,000) and ratings (12.1). Doggo tweets also show good engagement in terms of likes (about 17,400) and ratings (11.8).
- Doggos enjoy marginally more retweets (about 5,900 on average) than Puppos (about 5,700).
- Puppers gather considerably the fewest retweets, likes and ratings.
- Identify the number of records that had hashtags included in the tweet. This helps us understand whether we have enough data to draw conclusions.
- Compute the average retweets, likes and ratings by hashtag use.
# Select only relevant columns from the master dataframe
hashtag_df = master_df[['hashtag', 'likes', 'retweets', 'rating']].copy()
# Create a new column to show if hashtags are present in each record
hashtag_df['has_hashtag'] = hashtag_df.hashtag.apply(lambda x: x!='None')
# Print information about how many records include hashtags
print(Color.green+'Printing the number of records based on hashtag use...'+Color.end)
print(hashtag_df.has_hashtag.value_counts(), '\n')
print(Color.green+'Printing the percentage of records based on hashtag use...'+Color.end)
(hashtag_df.has_hashtag.value_counts(normalize=True).round(3)*100).astype(str)+'%'
Printing the number of records based on hashtag use... False 1937 True 24 Name: has_hashtag, dtype: int64 Printing the percentage of records based on hashtag use...
False 98.8% True 1.2% Name: has_hashtag, dtype: object
# Aggregate ratings, retweets and likes by hashtag use
hashtag_df.groupby('has_hashtag')[['retweets', 'likes', 'rating']].mean().astype(int)
retweets | likes | rating | |
---|---|---|---|
has_hashtag | |||
False | 2215 | 7600 | 10 |
True | 5655 | 20709 | 12 |
- Only a very small share of records, 1.2% of the total tweets, actually used hashtags.
- Ratings, retweets and likes seem higher on average when hashtags are used. However, we cannot confidently draw this conclusion, considering that there are far fewer records for tweets with hashtags.
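One way to probe that caveat (a sketch, assuming scipy is available; it is not imported elsewhere in this notebook) is a nonparametric test comparing likes for tweets with and without hashtags. With only 24 hashtag records, a non-significant result would simply reflect low power:
from scipy import stats
# Split likes by hashtag use and compare the two samples
with_tags = hashtag_df.loc[hashtag_df.has_hashtag, 'likes']
without_tags = hashtag_df.loc[~hashtag_df.has_hashtag, 'likes']
stat, p_value = stats.mannwhitneyu(with_tags, without_tags, alternative='two-sided')
print('Mann-Whitney U p-value:', p_value)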
A better alternative to this question is to try to understand the hashtags that generated the highest user engagements (retweets and likes) when used.
- Isolate only records where hashtags are used.
- Aggregate retweets and likes based on each unique hashtag.
- Visualize the results.
# Isolate posts with hashtags
hashtag_present = hashtag_df.query('has_hashtag == True')
# Aggregate ratings, retweets and likes by each unique hashtag
hashtag_aggregates = hashtag_present.groupby('hashtag')[['retweets', 'likes']].mean().astype(int)
# --- Visualize results ---
# Create plot area
fig = px.scatter(hashtag_aggregates, y=hashtag_aggregates.index, x='retweets', size='likes',
text=hashtag_aggregates.index+'<br><sub>'+'Retweets: '+
hashtag_aggregates.retweets.astype(str)+', Likes: '+hashtag_aggregates.likes.astype(str)+'</sub>',
title='Which Hashtag generated the highest user engagement?<br>'+
'<sup>Increasing retweets from left to right. Likes are represented by dot size.</sup>')
# Format traces and axes
fig.update_traces(textposition='bottom center', opacity=1, marker_line_color='black', marker_line_width=1)
fig.update_yaxes(showticklabels=False, gridcolor='#ddd')
fig.update_xaxes(showticklabels=False, gridcolor='#ddd')
# Update layout and annotations
fig.update_layout(height=850, width=1400, template='plotly_white', xaxis_title='', yaxis_title='',
paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)',
font_family='Arial', font_size=14, margin=dict(t=70, b=70), title_x=0.5)
fig.add_hline(y=13.2, annotation=dict(text=' Increasing average retweets ->'), annotation_position='top right')
fig.add_hline(y=-1, annotation=dict(text=' Increasing average retweets ->'), annotation_position='top right')
# Store graph locally
fig.write_image('images/fig4.svg')
fig.show('svg')
- #WomensMarch and #ScienceMarch gathered the highest share of interactions. On the other hand, hashtags such as #swole and #NoDaysOff gained the fewest interactions.
- The two leading hashtags are tied to widespread events, rallies (#ScienceMarch) and protests (#WomensMarch) held worldwide. This could explain the high number of interactions recorded with their use.
- Make a copy of the master dataframe and set the timestamp as the new dataframe index.
- Resample the timestamps by month, counting tweets and averaging retweets and likes in the process.
- Visualize the results.
# Make a copy of the master dataframe, setting timestamp as the index
master_df_copy = master_df.set_index('timestamp')
# For each item to investigate, resample the dataframe by month
tweet_count = master_df_copy.tweet_id.resample('1m').count()
retweets= master_df_copy.retweets.resample('1m').mean()
likes = master_df_copy.likes.resample('1m').mean()
# Initialize figure with subplots
fig = make_subplots(rows=1, cols=2, horizontal_spacing=0.12,
specs=[[{"type": "scatter"}, {"secondary_y": True}]])
# Add trace for tweet count
fig.add_trace(
go.Scatter(x=tweet_count.index, y=tweet_count.values, name='count of tweets',
marker_color='IndianRed'),row=1, col=1
)
# Add traces for retweets and likes
# Retweets
fig.add_trace(
go.Scatter(x=retweets.index, y=retweets.values, name="retweets", marker_color='MediumSlateBlue'),
secondary_y=False, row=1,col=2
)
# Likes
fig.add_trace(
go.Scatter(x=likes.index, y=likes.values, name="likes", marker_color='green'),
secondary_y=True, row=1, col=2
)
# Update traces and axes properties
fig.update_traces(line_width=3, opacity=0.8)
fig.update_yaxes(title_text="Average<b> retweets</b>", secondary_y=False, gridcolor='#ddd',
tickfont_color='grey', row=1, col=2)
fig.update_yaxes(title_text="Average <b>likes</b>", secondary_y=True, showgrid=False,
tickfont_color='grey', row=1, col=2)
fig.update_yaxes(title_text="<b>Tweet</b> count", gridcolor='#ddd', tickfont_color='grey', row=1, col=1)
fig.update_xaxes(title_text="Timestamp", gridcolor='#ddd', linecolor='black', tickfont_color='grey')
# Update layout and annotations
fig.update_layout(height=500, width=1200, template='plotly_white', showlegend=False, font_family='Arial',
paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)',
title='How have Tweet count, Likes and Retweets varied over the time period?<br>'+
'<sup>Trends in original tweets, retweets and likes compared between Nov 2015 and Jul 2017.</sup>')
fig.add_annotation(x='2016-09', y=3200, text='<b>Retweets</b>', showarrow=False,
textangle=-45, font_color='MediumSlateBlue', row=1, col=2)
fig.add_annotation(x='2016-12', y=2400, text='<b>Likes</b>', showarrow=False,
textangle=-45, font_color='green', row=1, col=2)
fig.add_annotation(x='2016-12', y=80, text='<b>Original tweet count</b>', showarrow=False,
textangle=0, font_color='IndianRed', row=1, col=1)
# Store graph locally
fig.write_image('images/fig5.svg')
fig.show('svg')
- The number of original tweets posted on the account has been declining overall. Taken alone, this could lead one to believe that the account was gradually becoming less successful over time.
- The rising trend in retweets and likes, however, tells a different story: although the number of tweets has been declining, the account has been gaining more and more user interaction, moving from under 1,000 average retweets and 5,000 likes per month in late 2015 to over 6,000 retweets and 30,000 likes by mid-2017.
- This could be because, in its early stages, an account may need to create more tweets to gain popularity. As time progresses and people become familiar with the account, they start to like and retweet content for others to see. This can create a cycle of success, gradually reducing the number of posts needed to drive engagement.
- There is also an interesting pattern in retweets and likes: they appear to fluctuate in the same direction (when retweets increase, likes increase, and vice versa). We can investigate this further by examining the correlation between the two variables.
- Sample 1000 records from the dataframe, then evaluate the relationship between both variables using a scatter plot.
# Calculate the correlation coefficient for retweet-like relationship
correlation = master_df.retweets.corr(master_df.likes).round(2)
# Define a function that helps format scatterplots
def format_scatter(f):
    '''
    Updates a plotly scatter plot's axes, traces and layout with predefined formatting.
    Params:
        f (figure object): A plotly figure object (scatterplot)
    Output:
        None
    '''
    f.update_xaxes(tickfont_color='grey', gridcolor='#ddd')
    f.update_yaxes(tickfont_color='grey', gridcolor='#ddd')
    f.update_traces(marker_line_color='black', marker_line_width=1, marker_color='royalblue', opacity=0.7)
    f.update_layout(height=500, width=600, template='plotly_white', font_family='Arial',
                    paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)')
# Create a plotly scatter plot object
fig = px.scatter(master_df.sample(1000, random_state=1), x='retweets', y='likes', trendline='ols')
# Format the scatter plot with predefined function
format_scatter(fig)
# Update plot title
fig.update_layout(title='Is there a relationship between Retweets and Likes?<br>'+
'<sup>A plot of retweets and likes for all WeRateDogs original posts.</sup>')
# Add necessary annotations
fig.add_vrect(x0=0, x1=20000, y0=0.05, y1=0.30, opacity=0.6, line_width=2)
fig.add_annotation(x=44000, y=25000,
text='<b> A strong positive correlation masked by' +
'<br>the majority of tweets having under<br>20k retweets and 50k likes.</b>',
showarrow=False, textangle=0, font_color='royalblue')
fig.add_annotation(x=15000, y=95000, text='<b>'+'r='+str(correlation)+'</b>',
showarrow=False, font_color='royalblue')
# Store graph locally
fig.write_image('images/fig6.svg')
fig.show('svg')
- The association is hard to see for most points on the scatterplot above. Despite the strong positive correlation, outliers (tweets with a very high number of both retweets and likes) cause the points to be concentrated at the bottom left of the chart.
We can zoom in and examine this relationship better by plotting a scatterplot with both the x (retweets) and y (likes) axes on a log scale:
# Regenerate the scatterplot, this time taking the log values of both axes
fig = px.scatter(master_df.sample(1000, random_state=1), x='retweets', y='likes', log_x=True, log_y=True)
# Format the scatter plot with predefined function
format_scatter(fig)
# Update the plot layout and add necessary annotations
fig.update_layout(title='A clearer association between Retweets and Likes<br>'+
'<sup>The log plot zooms into the relationship between retweets and likes for WeRateDogs posts.</sup>',
xaxis_title= 'retweets (log scale)', yaxis_title='likes (log scale)')
fig.add_annotation(x=math.log10(70), y=math.log10(2000), text='<b>'+'r='+str(correlation)+'</b>',
showarrow=False, textangle=0, font_color='royalblue')
# Store graph locally
fig.write_image('images/fig7.svg')
fig.show('svg')
- Retweets and likes show a strong positive correlation, which is especially clear on a log scale.
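As an extra robustness check (a sketch, not part of the original analysis), a rank-based correlation is insensitive to the heavy right tail seen above and should tell the same story:
# Spearman correlation ranks the values first, taming the outliers
spearman = master_df.retweets.corr(master_df.likes, method='spearman')
print('Spearman correlation:', round(spearman, 2))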
- Isolate records for each unique dog stage from the master dataframe.
- Filter out Floofers (since this isn't an actual dog stage), Doggo with Pupper (it's hard to tell which one people liked in the tweet), and records where a dog stage wasn't specified.
- Create a leaderboard system that sorts dogs based on tweet engagements (retweets and likes), then ratings.
- Identify the leading dogs in each group and display their images as visual outputs.
# Loop through the three proper dog stages
# Do not consider 'None', 'Floofer' and 'Doggo with Pupper'
for stage in ['Doggo', 'Puppo', 'Pupper']:
    # Isolate each dog stage into its own dataframe
    stage_df = master_df.query("dog_stage == @stage")
    # Sort each tweet based on retweets >> likes >> ratings
    top_dog = stage_df.sort_values(by=['retweets', 'likes', 'rating'], ascending=False).head(1)
    # Pull full profile info for the most favored dog
    top_dog_image = top_dog.image.values[0]
    top_dog_name = top_dog.dog_name.values[0].replace('None', 'Wish we knew')
    top_dog_breed = top_dog.dog_breed.values[0]
    top_dog_retweets = str(top_dog.retweets.values[0])
    top_dog_likes = str(top_dog.likes.values[0])
    top_dog_rating = str(top_dog.rating.values[0])
    top_dog_text = top_dog.text.values[0]
    # Print dog profile as output
    print(Color.underline + Color.green +
          Color.bold + "People's favorite " + stage + Color.end)
    print(Color.blue + 'Name: ' + top_dog_name +
          '\nBreed: ' + top_dog_breed +
          '\nRetweets: ' + top_dog_retweets +
          '\nLikes: ' + top_dog_likes +
          '\nRating: ' + top_dog_rating + Color.end)
    print(Color.bold + 'Tweet: ' + top_dog_text + Color.end)
    # Pull dog image from the web then display it
    response = requests.get(top_dog_image)
    img = Image.open(BytesIO(response.content))
    display(img)
    # Write dog image locally
    folder = 'images'
    filename = 'favored_' + stage.replace(' ', '_') + '.jpg'
    img.save(os.path.join(folder, filename))
    # Output demarcator
    print('-' * 138)
People's favorite Doggo
Name: Wish we knew
Breed: Labrador Retriever
Retweets: 70826
Likes: 145013
Rating: 13.0
Tweet: Here's a doggo realizing you can stand in a pool. 13/10 enlightened af (vid by Tina Conrad)
------------------------------------------------------------------------------------------------------------------------------------------
People's favorite Puppo
Name: Wish we knew
Breed: Lakeland Terrier
Retweets: 39970
Likes: 124209
Rating: 13.0
Tweet: Here's a super supportive puppo participating in the Toronto #WomensMarch today. 13/10
------------------------------------------------------------------------------------------------------------------------------------------
People's favorite Pupper
Name: Jamesy
Breed: French Bulldog
Retweets: 30247
Likes: 108985
Rating: 13.0
Tweet: This is Jamesy. He gives a kiss to every other pupper he sees on his walk. 13/10 such passion, much tender
------------------------------------------------------------------------------------------------------------------------------------------
- The people's favorite Doggo is a Labrador Retriever swimming in a pool. We do not know its name, but it gathered 70,826 retweets, 145,013 likes, and a rating of 13.
- For the Puppos, it's a Lakeland Terrier. We couldn't get its name either, but this puppo participated in the Toronto #WomensMarch, earning 39,970 retweets, 124,209 likes, and a rating of 13 in the process.
- A French Bulldog named Jamesy won it all for the Puppers. People seem to love that he gives kisses to other dogs. He earned 30,247 retweets, 108,985 likes and a rating of 13 for being so tender.
Real-life data rarely comes clean. In the course of this project, WeRateDogs Twitter data was collected in fragments from different sources. Each piece of data was assessed for quality and tidiness, then cleaned. After wrangling, the datasets were combined into a single dataframe in preparation for further analysis.
Further analysis involved exploring the data and building visualizations. These explorations led to the following insights: