Data rarely comes in its most usable form. Careful data wrangling and exploratory data analysis are what separate a trustworthy analysis from garbage in, garbage out.
In this project, we will wrangle, analyze and visualize the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that does exactly what it says: it rates dogs, with captions that are reliably over the top. All you have to do is send them a dog picture via direct message (or any dog, come to think of it), and they'll rate it out of 10. The funny thing, though? The ratings are almost always greater than 10. Why? Because "they're good dogs Brent". WeRateDogs has over 4 million followers and has received international media coverage.
Specifically, we intend to gather the data from several sources, assess it visually and programmatically, clean the issues we find, and then analyze and visualize the result.
The following libraries will be useful during wrangling, analysis and visualization:
# Import useful libraries
import math
import time
import config   # Local module holding our Twitter API credentials
import numpy as np
import pandas as pd
import os
import requests
import tweepy
import json
from PIL import Image
from io import BytesIO
from IPython.display import display
# Visualization libraries
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
We will also create a `Color` class to help us pretty-print formatted outputs:
class Color:
blue = '\033[94m'
green = '\033[92m'
red = '\033[91m'
bold = '\033[1m'
underline = '\033[4m'
end = '\033[0m'
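For example, the attributes can be combined to print bold green text (a quick illustration; any ANSI-capable terminal or notebook will render it):

print(Color.bold + Color.green + 'Task Complete!' + Color.end)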
We will be gathering data from three different sources:
- Enhanced Twitter archive data, compiled by @dog_rates and shared with Udacity. This archive contains basic tweet data for all 5,000+ of their tweets as of August 2017. Udacity provided us with this file, so we will treat it as a file on hand.
- An image predictions TSV file, compiled by running every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. We will download this file programmatically from Udacity's servers using the requests library.
- Additional data from the Twitter API: we will gather each tweet's retweet count, favorite ("like") count and hashtags from the Twitter API using the Tweepy library.
# Read the twitter archive data provided
wrd_archive = pd.read_csv('./twitter-archive-enhanced.csv')
wrd_archive.head(2)
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
# Programmatically download the image predictions
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
file_name = url.split('/')[-1]
response = requests.get(url)
# Write the url response to a file locally
start = time.time()
with open(file_name, 'wb') as f:
f.write(response.content)
print(Color.green+'Process completed in {} seconds'
.format(time.time()-start) + Color.end
)
Process completed in 0.0026407241821289062 seconds
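Since we move straight on to reading the file, a quick sanity check is worthwhile. A minimal sketch, reusing the `response` and `file_name` variables from the cell above (os was imported earlier):

# Sanity check (sketch): the request succeeded and the file is non-empty
assert response.status_code == 200, 'Download failed'
assert os.path.getsize(file_name) > 0, 'Downloaded file is empty'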
# Read the image predictions into a dataframe
img_predictions = pd.read_csv('./image-predictions.tsv', sep='\t')
img_predictions.head(2)
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
# Configure and create an API object to gather twitter data
consumer_key = config.API_KEY
consumer_secret = config.API_KEY_SECRET
access_token = config.ACCESS_TOKEN
access_secret = config.ACCESS_TOKEN_SECRET
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
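Before kicking off a long extraction run, it can help to confirm that the credentials actually work. A small sketch using Tweepy's verify_credentials (Tweepy 3.x, matching the wait_on_rate_limit_notify argument above):

# Sketch: fail fast if authentication is rejected
try:
    me = api.verify_credentials()
    print('Authenticated as @{}'.format(me.screen_name))
except tweepy.TweepError as e:
    print('Authentication failed:', e)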
# --- Pull tweet information using the ids in wrd_archive
# Extract the tweet ids from the wrd dataframe
tweet_ids = wrd_archive['tweet_id']
# Initialize variables to monitor runtime activity
success, failure, counter = (0, 0, 0)
failed_attempts = {}
# Loop over each tweet id and collect the information
print(Color.bold+'COMMENCING JSON EXTRACTION TASK'+Color.end+'\n'+'-'*70)
start_time = time.time()
with open('tweet_json.txt', 'w') as file:
print('Pulling json data for the first 200 tweets...')
for tweet_id in tweet_ids:
# After every 200 tweets, print a summary to the user
if (success % 200 == 0) and (counter > 0):
print(Color.bold + Color.green + 'Sub-task Complete!'+ Color.end)
print('Successful pulls: {} || failed pulls: {} || Pulls pending: {}'
.format(success, failure, tweet_ids.size - counter)
)
print('\nPulling json data for the next 200 tweets...')
try:
tweet_info = api.get_status(tweet_id, tweet_mode='extended')
json.dump(tweet_info._json, file)
file.write('\n')
success+=1
except Exception as e:
failed_attempts[tweet_id] = e
failure+=1
pass
finally:
counter+=1
# Print feedback on entire execution process
duration = (time.time() - start_time)/60
failed = len(failed_attempts.keys())
print(Color.bold + Color.green +'Task Completed!\n'+ Color.end + '-'*70)
print(Color.bold +'DISPLAYING RUNTIME SUMMARY'+ Color.end)
print('The entire process took: {} minutes'.format(round(duration, 2)))
if (failed > 0):
print(Color.bold + Color.red +
'Could not pull information for '+ str(failed) + ' tweet ids:'+
Color.end)
print(pd.Series(failed_attempts))
else:
print(Color.bold + Color.green +'No failed attempts'+ Color.end)
COMMENCING JSON EXTRACTION TASK
----------------------------------------------------------------------
Pulling json data for the first 200 tweets...
Sub-task Complete!
Successful pulls: 200 || failed pulls: 9 || Pulls pending: 2147

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 400 || failed pulls: 18 || Pulls pending: 1938

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 600 || failed pulls: 20 || Pulls pending: 1736

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 800 || failed pulls: 24 || Pulls pending: 1532

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 1000 || failed pulls: 28 || Pulls pending: 1328

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 1200 || failed pulls: 28 || Pulls pending: 1128

Pulling json data for the next 200 tweets...
Rate limit reached. Sleeping for: 291
Sub-task Complete!
Successful pulls: 1400 || failed pulls: 28 || Pulls pending: 928

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 1600 || failed pulls: 28 || Pulls pending: 728

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 1800 || failed pulls: 29 || Pulls pending: 527

Pulling json data for the next 200 tweets...
Sub-task Complete!
Successful pulls: 2000 || failed pulls: 29 || Pulls pending: 327

Pulling json data for the next 200 tweets...
Rate limit reached. Sleeping for: 296
Sub-task Complete!
Successful pulls: 2200 || failed pulls: 29 || Pulls pending: 127

Pulling json data for the next 200 tweets...
Task Completed!
----------------------------------------------------------------------
DISPLAYING RUNTIME SUMMARY
The entire process took: 37.3 minutes
Could not pull information for 29 tweet ids:
888202515573088257    [{'code': 144, 'message': 'No status found wit...
873697596434513921    [{'code': 144, 'message': 'No status found wit...
872668790621863937    [{'code': 144, 'message': 'No status found wit...
872261713294495745    [{'code': 144, 'message': 'No status found wit...
869988702071779329    [{'code': 144, 'message': 'No status found wit...
866816280283807744    [{'code': 144, 'message': 'No status found wit...
861769973181624320    [{'code': 144, 'message': 'No status found wit...
856602993587888130    [{'code': 144, 'message': 'No status found wit...
856330835276025856    [{'code': 34, 'message': 'Sorry, that page doe...
851953902622658560    [{'code': 144, 'message': 'No status found wit...
851861385021730816    [{'code': 144, 'message': 'No status found wit...
845459076796616705    [{'code': 144, 'message': 'No status found wit...
844704788403113984    [{'code': 144, 'message': 'No status found wit...
842892208864923648    [{'code': 144, 'message': 'No status found wit...
837366284874571778    [{'code': 144, 'message': 'No status found wit...
837012587749474308    [{'code': 144, 'message': 'No status found wit...
829374341691346946    [{'code': 144, 'message': 'No status found wit...
827228250799742977    [{'code': 144, 'message': 'No status found wit...
812747805718642688    [{'code': 144, 'message': 'No status found wit...
802247111496568832    [{'code': 144, 'message': 'No status found wit...
779123168116150273    [{'code': 144, 'message': 'No status found wit...
775096608509886464    [{'code': 144, 'message': 'No status found wit...
771004394259247104    [{'code': 179, 'message': 'Sorry, you are not ...
770743923962707968    [{'code': 144, 'message': 'No status found wit...
766864461642756096    [{'code': 144, 'message': 'No status found wit...
759923798737051648    [{'code': 144, 'message': 'No status found wit...
759566828574212096    [{'code': 144, 'message': 'No status found wit...
754011816964026368    [{'code': 144, 'message': 'No status found wit...
680055455951884288    [{'code': 144, 'message': 'No status found wit...
dtype: object
# Extract the information we want from the json file
json_tweet_details = []
with open('tweet_json.txt', 'r', encoding='UTF-8') as file:
for line in file:
json_text = json.loads(line)
# Extract the tweet_id, likes and retweet count
tweet_id = json_text['id_str']
retweets = json_text['retweet_count']
likes = json_text['favorite_count']
# Extract the hashtag from the json file
hashtags_info = json_text['entities']['hashtags']
if len(hashtags_info) !=0:
hashtags = ['#'+item['text'] for item in hashtags_info]
else:
hashtags = 'None'
# Assign these values into our list
json_tweet_details.append({
'tweet_id': tweet_id,
'hashtag': hashtags,
'retweets': retweets,
'likes': likes}
)
# Read all extracted data into a Pandas dataframe
json_tweet_info = pd.DataFrame(json_tweet_details)
json_tweet_info.head(2)
tweet_id | hashtag | retweets | likes | |
---|---|---|---|---|
0 | 892420643555336193 | None | 7018 | 33839 |
1 | 892177421306343426 | None | 5303 | 29355 |
Visual assessment of the `wrd_archive` dataframe in the Jupyter notebook:

wrd_archive.sample(20)
tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
957 | 751538714308972544 | NaN | NaN | 2016-07-08 22:09:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Max. She has one ear that's always sli... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/751538714... | 10 | 10 | Max | None | None | None | None |
563 | 802572683846291456 | NaN | NaN | 2016-11-26 18:00:13 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Winnie. She's h*ckin ferocious. Dandel... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/802572683... | 12 | 10 | Winnie | None | None | None | None |
42 | 884247878851493888 | NaN | NaN | 2017-07-10 03:08:17 +0000 | <a href="http://twitter.com/download/iphone" r... | OMG HE DIDN'T MEAN TO HE WAS JUST TRYING A LIT... | NaN | NaN | NaN | https://twitter.com/kaijohnson_19/status/88396... | 13 | 10 | None | None | None | None | None |
18 | 888554962724278272 | NaN | NaN | 2017-07-22 00:23:06 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Ralphus. He's powering up. Attempting ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/888554962... | 13 | 10 | Ralphus | None | None | None | None |
1556 | 688828561667567616 | NaN | NaN | 2016-01-17 21:01:41 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Brad. His car probably has a spoi... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/688828561... | 9 | 10 | Brad | None | None | None | None |
1550 | 689154315265683456 | NaN | NaN | 2016-01-18 18:36:07 +0000 | <a href="http://twitter.com/download/iphone" r... | We normally don't rate birds but I feel bad co... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/689154315... | 9 | 10 | None | None | None | None | None |
1950 | 673688752737402881 | NaN | NaN | 2015-12-07 02:21:29 +0000 | <a href="http://twitter.com/download/iphone" r... | Meet Larry. He doesn't know how to shoe. 9/10 ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/673688752... | 9 | 10 | Larry | None | None | None | None |
1194 | 717428917016076293 | NaN | NaN | 2016-04-05 19:09:17 +0000 | <a href="http://vine.co" rel="nofollow">Vine -... | This is Skittle. He's trying to communicate. 1... | NaN | NaN | NaN | https://vine.co/v/iIhEU2lVqxz | 11 | 10 | Skittle | None | None | None | None |
1271 | 709409458133323776 | NaN | NaN | 2016-03-14 16:02:49 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Billy. He sensed a squirrel. 8/10 damn... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/709409458... | 8 | 10 | Billy | None | None | None | None |
257 | 843856843873095681 | NaN | NaN | 2017-03-20 16:08:44 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Sadie and Daisy. They do all thei... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/843856843... | 12 | 10 | Sadie | None | None | None | None |
1934 | 674014384960745472 | NaN | NaN | 2015-12-07 23:55:26 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Aiden. His eyes are magical. Love... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/674014384... | 11 | 10 | Aiden | None | None | None | None |
2009 | 672254177670729728 | NaN | NaN | 2015-12-03 03:21:00 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Rolf. He's having the time of his life... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/672254177... | 11 | 10 | Rolf | None | None | pupper | None |
1422 | 698178924120031232 | NaN | NaN | 2016-02-12 16:16:41 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Lily. She accidentally dropped all her... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/698178924... | 10 | 10 | Lily | None | None | None | None |
1486 | 693109034023534592 | NaN | NaN | 2016-01-29 16:30:45 +0000 | <a href="http://twitter.com/download/iphone" r... | "Thank you friend that was a swell petting" 11... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/693109034... | 11 | 10 | None | None | None | None | None |
615 | 796563435802726400 | NaN | NaN | 2016-11-10 04:01:37 +0000 | <a href="http://twitter.com/download/iphone" r... | RT @dog_rates: I want to finally rate this ico... | 7.809316e+17 | 4.196984e+09 | 2016-09-28 00:46:20 +0000 | https://twitter.com/dog_rates/status/780931614... | 13 | 10 | None | None | None | None | puppo |
674 | 789599242079838210 | NaN | NaN | 2016-10-21 22:48:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Brownie. She's wearing a Halloween the... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/789599242... | 12 | 10 | Brownie | None | None | None | None |
232 | 847962785489326080 | NaN | NaN | 2017-04-01 00:04:17 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Georgie. He's very shy. Only puppears ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/847962785... | 10 | 10 | Georgie | None | None | None | None |
2206 | 668631377374486528 | NaN | NaN | 2015-11-23 03:25:17 +0000 | <a href="http://twitter.com/download/iphone" r... | Meet Zeek. He is a grey Cumulonimbus. Zeek is ... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/668631377... | 5 | 10 | Zeek | None | None | None | None |
1798 | 677228873407442944 | NaN | NaN | 2015-12-16 20:48:40 +0000 | <a href="http://twitter.com/download/iphone" r... | Say hello to Chuq. He just wants to fit in. 11... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/677228873... | 11 | 10 | Chuq | None | None | None | None |
441 | 819711362133872643 | NaN | NaN | 2017-01-13 01:03:12 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Howie. He just bloomed. 11/10 revoluti... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/819711362... | 11 | 10 | Howie | None | None | None | None |
Quality Issues

- Some records are retweets or replies. Some may contain ratings, but they are not the original tweets. The information to identify them can be found in the following columns: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`.
- Unexpected ratings in the `rating_numerator` and `rating_denominator` columns. Examples are rating numerators as high as 666 and denominators as low as 0.
- Unusual dog names such as "a", "an" and "not" in the `name` column.
Tidiness Issues

- The various stages of dog life (`doggo`, `pupper`, `puppo` and `floofer`) should be contained in one column.
- Long and unnecessary links in the `source` column (text is embedded in HTML tags). All we need is the actual text.
- Unwanted columns present: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`.
- `rating_numerator` and `rating_denominator` can be reduced into one column.
Visual assessment of the `img_predictions` dataframe in the Jupyter notebook, including additional visual assessments in Google Sheets:

img_predictions.sample(20)
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1946 | 862457590147678208 | https://pbs.twimg.com/media/C_gQmaTUMAAPYSS.jpg | 1 | home_theater | 0.496348 | False | studio_couch | 0.167256 | False | barber_chair | 0.052625 | False |
901 | 700002074055016451 | https://pbs.twimg.com/media/CbboKP4WIAAw8xq.jpg | 1 | Chihuahua | 0.369488 | True | schipperke | 0.243367 | True | pug | 0.161614 | True |
114 | 667924896115245057 | https://pbs.twimg.com/media/CUTyJpHWcAATl0O.jpg | 1 | Labrador_retriever | 0.209051 | True | hog | 0.203980 | False | Newfoundland | 0.165914 | True |
133 | 668480044826800133 | https://pbs.twimg.com/media/CUbrDWOWcAEyMdM.jpg | 1 | Arctic_fox | 0.119243 | False | Labrador_retriever | 0.099965 | True | pug | 0.086717 | True |
108 | 667878741721415682 | https://pbs.twimg.com/media/CUTILFiWcAE8Rle.jpg | 1 | seat_belt | 0.200373 | False | miniature_pinscher | 0.106003 | True | schipperke | 0.104733 | True |
1593 | 798694562394996736 | https://pbs.twimg.com/media/Cbs3DOAXIAAp3Bd.jpg | 1 | Chihuahua | 0.615163 | True | Pembroke | 0.159509 | True | basenji | 0.084466 | True |
1809 | 832757312314028032 | https://pbs.twimg.com/media/C46MWnFVYAUg1RK.jpg | 2 | Cardigan | 0.160888 | True | Staffordshire_bullterrier | 0.159441 | True | Boston_bull | 0.154368 | True |
270 | 670822709593571328 | https://pbs.twimg.com/media/CU89schWIAIHQmA.jpg | 1 | web_site | 0.993887 | False | Chihuahua | 0.001252 | True | menu | 0.000599 | False |
2069 | 891087950875897856 | https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg | 1 | Chesapeake_Bay_retriever | 0.425595 | True | Irish_terrier | 0.116317 | True | Indian_elephant | 0.076902 | False |
2058 | 888917238123831296 | https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg | 1 | golden_retriever | 0.714719 | True | Tibetan_mastiff | 0.120184 | True | Labrador_retriever | 0.105506 | True |
1980 | 871032628920680449 | https://pbs.twimg.com/media/DBaHi3YXgAE6knM.jpg | 1 | kelpie | 0.398053 | True | macaque | 0.068955 | False | dingo | 0.050602 | False |
1636 | 806242860592926720 | https://pbs.twimg.com/media/Ct72q9jWcAAhlnw.jpg | 2 | Cardigan | 0.593858 | True | Shetland_sheepdog | 0.130611 | True | Pembroke | 0.100842 | True |
743 | 687476254459715584 | https://pbs.twimg.com/media/CYpoAZTWEAA6vDs.jpg | 1 | wood_rabbit | 0.702725 | False | Angora | 0.190659 | False | hare | 0.105072 | False |
1959 | 865718153858494464 | https://pbs.twimg.com/media/DAOmEZiXYAAcv2S.jpg | 1 | golden_retriever | 0.673664 | True | kuvasz | 0.157523 | True | Labrador_retriever | 0.126073 | True |
1259 | 748699167502000129 | https://pbs.twimg.com/media/CmPp5pOXgAAD_SG.jpg | 1 | Pembroke | 0.849029 | True | Cardigan | 0.083629 | True | kelpie | 0.024394 | True |
1641 | 807106840509214720 | https://pbs.twimg.com/ext_tw_video_thumb/80710... | 1 | Chihuahua | 0.505370 | True | Pomeranian | 0.120358 | True | toy_terrier | 0.077008 | True |
522 | 676582956622721024 | https://pbs.twimg.com/media/CWO0m8tUwAAB901.jpg | 1 | seat_belt | 0.790028 | False | Boston_bull | 0.196307 | True | French_bulldog | 0.012429 | True |
1433 | 773547596996571136 | https://pbs.twimg.com/media/Crwxb5yWgAAX5P_.jpg | 1 | Norwegian_elkhound | 0.372202 | True | Chesapeake_Bay_retriever | 0.137187 | True | malamute | 0.071436 | True |
1622 | 803380650405482500 | https://pbs.twimg.com/media/CyYub2kWEAEYdaq.jpg | 1 | bookcase | 0.890601 | False | entertainment_center | 0.019287 | False | file | 0.009490 | False |
172 | 669000397445533696 | https://pbs.twimg.com/media/CUjETvDVAAI8LIy.jpg | 1 | Pembroke | 0.822940 | True | Cardigan | 0.177035 | True | basenji | 0.000023 | True |
Quality Issues

- The predictions in columns `p1`, `p2` and `p3` are not uniformly formatted. Some names are lowercase, some are uppercase and some are titlecase.
- The predictions also have words separated by underscores instead of spaces (a possible fix is sketched below).
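One possible fix during cleaning (a sketch, not applied here) is to replace the underscores and title-case every prediction:

# Sketch: make the breed strings uniform across p1-p3
for col in ('p1', 'p2', 'p3'):
    img_predictions[col] = (img_predictions[col]
                            .str.replace('_', ' ', regex=False)
                            .str.title())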
Tidiness Issues

- From `p1`, `p2` and `p3`, we only need the most confident prediction that corresponds to an actual dog breed (see the sketch below).
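One way to implement this later (a sketch, assuming the column layout shown above): walk the predictions in descending order of confidence and keep the first one flagged as a dog breed.

# Sketch: pick the most confident prediction that is an actual dog breed
def best_dog_breed(row):
    for p, is_dog in (('p1', 'p1_dog'), ('p2', 'p2_dog'), ('p3', 'p3_dog')):
        if row[is_dog]:
            return row[p]
    return None

# Usage: img_predictions.apply(best_dog_breed, axis=1)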
Visual assessment of the `json_tweet_info` dataframe in the Jupyter notebook:

json_tweet_info.sample(20, random_state=4)
tweet_id | hashtag | retweets | likes | |
---|---|---|---|---|
779 | 772581559778025472 | None | 1588 | 6120 |
767 | 773985732834758656 | None | 3590 | 10174 |
1163 | 717841801130979328 | None | 545 | 2280 |
1660 | 681523177663676416 | None | 5217 | 13203 |
261 | 840696689258311684 | None | 892 | 11510 |
1310 | 705066031337840642 | None | 556 | 2021 |
604 | 795464331001561088 | None | 22013 | 46862 |
1679 | 680801747103793152 | None | 743 | 2193 |
740 | 778286810187399168 | None | 3065 | 9751 |
821 | 766313316352462849 | None | 1737 | 6357 |
120 | 868622495443632128 | None | 4481 | 23664 |
297 | 835246439529840640 | None | 63 | 1992 |
245 | 843604394117681152 | None | 2486 | 15745 |
1889 | 674271431610523648 | None | 645 | 1397 |
772 | 773336787167145985 | None | 4681 | 0 |
853 | 760656994973933572 | None | 1753 | 6190 |
1780 | 676864501615042560 | None | 629 | 1903 |
1854 | 674805413498527744 | None | 319 | 766 |
169 | 857746408056729600 | None | 9366 | 30861 |
665 | 788150585577050112 | None | 1203 | 5807 |
- Everything looks fine for now.
# Examine general dataframe information
wrd_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2356 non-null   int64
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object
 4   source                      2356 non-null   object
 5   text                        2356 non-null   object
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object
 9   expanded_urls               2297 non-null   object
 10  rating_numerator            2356 non-null   int64
 11  rating_denominator          2356 non-null   int64
 12  name                        2356 non-null   object
 13  doggo                       2356 non-null   object
 14  floofer                     2356 non-null   object
 15  pupper                      2356 non-null   object
 16  puppo                       2356 non-null   object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
Notes

- `tweet_id` is stored as int instead of string/object type.
- 181 records are retweets and 78 records are replies. We don't need these records in our analysis.
- The `timestamp` column is stored as string/object type rather than as a Pandas datetime object.
- The `expanded_urls` column has some null records.
Let's zoom into the records where `expanded_urls` is null:
print(Color.blue+'Computing null entries for records with missing expanded urls..\n'
+ Color.end)
print(wrd_archive[wrd_archive['expanded_urls'].isnull()].isnull().sum())
Computing null entries for records with missing expanded urls..
tweet_id 0
in_reply_to_status_id 4
in_reply_to_user_id 4
timestamp 0
source 0
text 0
retweeted_status_id 58
retweeted_status_user_id 58
retweeted_status_timestamp 58
expanded_urls 59
rating_numerator 0
rating_denominator 0
name 0
doggo 0
floofer 0
pupper 0
puppo 0
dtype: int64
Tweets with missing `expanded_urls` are mostly retweets or replies. We don't need these records in our analysis.
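We can confirm this with a quick check (a sketch) of how many of these records are neither retweets nor replies:

# Count missing-url records that are neither retweets nor replies
missing = wrd_archive[wrd_archive['expanded_urls'].isnull()]
neither = missing[missing['retweeted_status_id'].isnull() &
                  missing['in_reply_to_status_id'].isnull()]
print('{} of {} records are neither retweets nor replies'
      .format(neither.shape[0], missing.shape[0]))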
# Check the archive for duplicate records
duplicates = wrd_archive.duplicated().sum()
print(Color.green +
      'wrd_archive has {} duplicate records'.format(duplicates) +
      Color.end)
wrd_archive has 0 duplicate records
# Examine the unique values in the source column
print(Color.bold+ Color.blue+
'Examining unique values in the source column\n' +
Color.end)
for i, item in enumerate(wrd_archive['source'].unique()):
print(i, ': ', item)
Examining unique values in the source column
0 : <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
1 : <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
2 : <a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>
3 : <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
- We only want the information between the opening and closing anchor tags, signalling the tweet source.
Let's verify whether the links in the `text` and `expanded_urls` columns are different:
# Set Pandas column width to allow longer text displays
pd.set_option("display.max_colwidth",150)
# Examine the text column and expanded_url columns
wrd_archive[['text', 'expanded_urls']].sample(5)
text | expanded_urls | |
---|---|---|
1414 | This is Cuddles. He's not entirely sure how doors work. 10/10 I believe in you Cuddles https://t.co/rKjK88D05Z | https://twitter.com/dog_rates/status/698710712454139905/photo/1 |
2351 | Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq | https://twitter.com/dog_rates/status/666049248165822465/photo/1 |
275 | I didn't even have to intervene. Took him 4 minutes to realize his error. 10/10 for Kevin https://t.co/2gclc1MNr7 | https://twitter.com/dog_rates/status/840696689258311684/photo/1 |
2146 | This is a spotted Lipitor Rumpelstiltskin named Alphred. He can't wait for the Turkey. 10/10 would pet really well https://t.co/6GUGO7azNX | https://twitter.com/dog_rates/status/669923323644657664/photo/1 |
766 | "Yep... just as I suspected. You're not flossing." 12/10 and 11/10 for the pup not flossing https://t.co/SuXcI9B7pQ | https://twitter.com/dog_rates/status/777684233540206592/photo/1 |
After testing each link, one discovers that, for each record, both the `text` and `expanded_urls` links lead to the same tweet. Some records also have multiple expanded urls separated by commas, all leading to the same tweet. As a result, we can make the following notes:

- The `text` column contains both the tweet text and the tweet url.
- The same tweet url is already present in the `expanded_urls` column.
# Examine the distribution of ratings in the dataset
wrd_archive[['rating_numerator', 'rating_denominator']].describe()
rating_numerator | rating_denominator | |
---|---|---|
count | 2356.000000 | 2356.000000 |
mean | 13.126486 | 10.455433 |
std | 45.876648 | 6.745237 |
min | 0.000000 | 0.000000 |
25% | 10.000000 | 10.000000 |
50% | 11.000000 | 10.000000 |
75% | 12.000000 | 10.000000 |
max | 1776.000000 | 170.000000 |
# Examine the unique values in rating numerator and denominator
print(Color.bold+Color.blue+'Unique rating numerators'+Color.end)
print(wrd_archive['rating_numerator'].unique())
print(Color.bold+Color.blue+'\nUnique rating denominators'+Color.end)
print(wrd_archive['rating_denominator'].unique())
Unique rating numerators
[  13   12   14    5   17   11   10  420  666    6   15  182  960    0   75    7   84    9   24    8    1   27    3    4  165 1776  204   50   99   80   45   60   44  143  121   20   26    2  144   88]

Unique rating denominators
[ 10   0  15  70   7  11 150 170  20  50  90  80  40 130 110  16 120   2]
- Though WeRateDogs posts can have numerators higher than 10, they almost always have denominators of 10. Numerators as high as 1776 and denominators as low as 0 prompt us to inspect the dataframe further:
# Assess instances where rating numerators > 15 and denominators are !=10
rating_check_df = (wrd_archive[(wrd_archive['rating_numerator'] > 15) | (wrd_archive['rating_denominator']!=10)])
# filter out the retweets
rating_check_df = (rating_check_df[rating_check_df['retweeted_status_id'].isnull()])
# filter out the replies
rating_check_df = (rating_check_df[rating_check_df['in_reply_to_status_id'].isnull()])
# Finally examine the text and the ratings
print(Color.red + Color.bold+
'{} records found!'.format(rating_check_df.shape[0])+
Color.end)
rating_check_df[['text', 'rating_numerator', 'rating_denominator']]
22 records found!
text | rating_numerator | rating_denominator | |
---|---|---|---|
433 | The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd | 84 | 70 |
516 | Meet Sam. She smiles 24/7 & secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t ... | 24 | 7 |
695 | This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS | 75 | 10 |
763 | This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://... | 27 | 10 |
902 | Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE | 165 | 150 |
979 | This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh | 1776 | 10 |
1068 | After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ | 9 | 11 |
1120 | Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv | 204 | 170 |
1165 | Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a | 4 | 20 |
1202 | This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq | 50 | 50 |
1228 | Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 | 99 | 90 |
1254 | Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 | 80 | 80 |
1274 | From left to right:\nCletus, Jerome, Alejandro, Burp, & Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK | 45 | 50 |
1351 | Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa | 60 | 50 |
1433 | Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ | 44 | 40 |
1635 | Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 | 121 | 110 |
1662 | This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5 | 7 | 11 |
1712 | Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD | 26 | 10 |
1779 | IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq | 144 | 120 |
1843 | Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw | 88 | 80 |
2074 | After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY | 420 | 10 |
2335 | This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv | 1 | 2 |
- Some ratings were erroneously pulled from the original tweet, especially when dates (e.g. 24/7 and 9/11) or decimal ratings (e.g. 11.27/10 and 9.75/10) appear in the tweet text.
- Some high ratings are addressed to groups of dogs, for example 165/150, 84/70 and 88/80.
- A few ratings are deliberately extreme, like 1776/10 and 420/10.
During visual assessment, we identified some unusual dog names like "a" and "an". These names were only a few characters long, so we will examine the entire name column for names with four characters or fewer. We will probably find a lot of invalid names in this group:
print(Color.bold + Color.blue +
'Specially examine names with four string characters or less..\n'
+ Color.end)
# Examine the name column further especially names with 4 characters or less
print(wrd_archive.name[wrd_archive.name.apply(lambda x: len(x)<=4)].unique())
Specially examine names with four string characters or less..
['None' 'Jax' 'Zoey' 'Koda' 'Ted' 'Jim' 'Zeke' 'such' 'Maya' 'Earl' 'Lola'
'Yogi' 'Noah' 'Gus' 'Alfy' 'Koko' 'Rey' 'Gary' 'a' 'Jack' 'Emmy' 'Beau'
'Aja' 'Cash' 'Coco' 'Jed' 'Kody' 'Dawn' 'Cody' 'Lili' 'Dave' 'Burt'
'Carl' 'Thor' 'Luna' 'Arya' 'Iggy' 'Kyle' 'Leo' 'Odin' 'Tuck' 'Hank'
'Ken' 'Max' 'Odie' 'Arlo' 'Lucy' 'Ava' 'Rory' 'Eli' 'Ash' 'Tobi' 'not'
'Kuyu' 'Pete' 'Kyro' 'Loki' 'Mia' 'one' 'Mutt' 'Bear' 'Kona' 'Phil' 'Ike'
'Mo' 'Toby' 'Nala' 'Gabe' 'Luca' 'Finn' 'Anna' 'Bo' 'Tom' 'Dido' 'Levi'
'Alf' 'Sky' 'Tyr' 'Mary' 'Moe' 'Halo' 'Sam' 'Ito' 'Milo' 'Cali' 'Duke'
'Chef' 'Doc' 'Sobe' 'Iroh' 'Ruby' 'Mack' 'Juno' 'Lily' 'Newt' 'Nida'
'BeBe' 'mad' 'Dale' 'Hero' 'Godi' 'Dash' 'Bell' 'Jay' 'Mya' 'an' 'Huck'
'very' 'O' 'Blue' 'Fizz' 'Chip' 'Grey' 'Al' 'just' 'Lou' 'Tito' 'Brat'
'Tove' 'my' 'Kota' 'Eve' 'Rose' 'Theo' 'Fido' 'Emma' 'Gert' 'Dex' 'Ace'
'Fred' 'Zoe' 'Blu' 'his' 'Cora' 'Abby' 'Geno' 'Beya' 'Kilo' 'Doug' 'Aqua'
'Axel' 'Remy' 'this' 'Ziva' 'Puff' 'all' 'Ivar' 'Sid' 'Otis' 'Suki'
'Ebby' 'Link' 'Ozzy' 'old' 'Zeus' 'Nico' 'Siba' 'Kanu' 'Opie' 'Kane'
'Sora' 'Lacy' 'Olaf' 'Kara' 'Zara' 'Bode' 'Rudy' 'Fiji' 'Rilo' 'Yoda'
'Chet' 'Kaia' 'Eazy' 'CeCe' 'Ole' 'Berb' 'Bob' 'Kobe' 'Lolo' 'Eriq' 'the'
'Durg' 'Fynn' 'Ferg' 'Trip' 'Brad' 'Opal' 'Marq' 'Mona' 'Birf' 'Oreo'
'Jeph' 'Obi' 'Tino' 'Lupe' 'Lulu' 'Taco' 'Joey' 'Kreg' 'Todo' 'Tess' 'by'
'Mike' 'Evy' 'Tug' 'Izzy' 'Chuq' 'Karl' 'Herm' 'Bert' 'Zuzu' 'Jeb' 'life'
'Acro' 'Obie' 'Dot' 'Mac' 'Ed' 'Taz' 'Jazz' 'Rolf' 'Cal' 'Tuco' 'Mojo'
'Mark' 'JD' 'Pip' 'Jett' 'Amy' 'Sage' 'Andy' 'Creg' 'Gin' 'Bloo' 'Edd'
'Herb' 'Liam' 'Ben' 'Skye' 'Dug' 'Kirk' 'Ralf' 'Chaz' 'Bobb' 'Hanz'
'Zeek' 'Maks' 'Jo' 'DayZ' 'Ron' 'Erik' 'Stu' 'Kial' 'Dook' 'Hall' 'Fwed'
'Keet']
- Again we notice more unusual names like the, my, by, his, all, mad, life, very, old, this, just, etc. All these unusual names are formatted in lowercase, while the viable names are properly capitalized.

We can use this criterion to query the entire `name` column, searching for records with improper name capitalization:
# Check the entire dataframe for improper capitalizations of dog names
mask = wrd_archive['name'].str.match(r"[A-Z].?")
invalid_names = wrd_archive[~mask]['name'].value_counts()
print(Color.red + Color.bold +
'There are {} records with invalid names\n'.format(invalid_names.sum())+
Color.end)
print(invalid_names)
There are 109 records with invalid names
a 55
the 8
an 7
very 5
just 4
quite 4
one 4
getting 2
actually 2
mad 2
not 2
old 1
life 1
officially 1
light 1
by 1
infuriating 1
such 1
all 1
unacceptable 1
this 1
his 1
my 1
incredibly 1
space 1
Name: name, dtype: int64
- None of the improperly capitalized entries in the `name` column are valid dog names.
- These entries constitute 109 records in total.
# Examine the dog stage columns
for dog_stage in wrd_archive.columns[-4:]:
print(Color.bold + Color.blue +
'\nValue counts for {} column'.format(dog_stage) +
Color.end)
print(wrd_archive[dog_stage].value_counts())
Value counts for doggo column
None     2259
doggo      97
Name: doggo, dtype: int64

Value counts for floofer column
None       2346
floofer      10
Name: floofer, dtype: int64

Value counts for pupper column
None      2099
pupper     257
Name: pupper, dtype: int64

Value counts for puppo column
None     2326
puppo      30
Name: puppo, dtype: int64
- Aside from the fact that we have to tidy these columns up into one, everything looks good.
# Examine a quick summary of the dataframe
img_predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2075 non-null   int64
 1   jpg_url   2075 non-null   object
 2   img_num   2075 non-null   int64
 3   p1        2075 non-null   object
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool
 6   p2        2075 non-null   object
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool
 9   p3        2075 non-null   object
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
- There are 2075 records here, 281 fewer than in the WeRateDogs archive data.
- `tweet_id` is stored with the wrong datatype: it should be a string/object type.
- We won't be needing the `img_num` column.
# Check the dataframe for duplicate records
duplicates = img_predictions.duplicated().sum()
print(Color.green +
      'img_predictions has {} duplicate records'.format(duplicates) +
      Color.end)
img_predictions has 0 duplicate records
# Compute descriptive statistics for the numeric columns
img_predictions.describe()
tweet_id | img_num | p1_conf | p2_conf | p3_conf | |
---|---|---|---|---|---|
count | 2.075000e+03 | 2075.000000 | 2075.000000 | 2.075000e+03 | 2.075000e+03 |
mean | 7.384514e+17 | 1.203855 | 0.594548 | 1.345886e-01 | 6.032417e-02 |
std | 6.785203e+16 | 0.561875 | 0.271174 | 1.006657e-01 | 5.090593e-02 |
min | 6.660209e+17 | 1.000000 | 0.044333 | 1.011300e-08 | 1.740170e-10 |
25% | 6.764835e+17 | 1.000000 | 0.364412 | 5.388625e-02 | 1.622240e-02 |
50% | 7.119988e+17 | 1.000000 | 0.588230 | 1.181810e-01 | 4.944380e-02 |
75% | 7.932034e+17 | 1.000000 | 0.843855 | 1.955655e-01 | 9.180755e-02 |
max | 8.924206e+17 | 4.000000 | 1.000000 | 4.880140e-01 | 2.734190e-01 |
- Everything looks okay here; confidence levels range from 0 to 1 across all columns.
# Examine the p1, p2 and p3 columns
for prediction in ('p1', 'p2', 'p3'):
print(Color.bold + Color.blue +
'\n10 Random entries and counts from {} column\n'.format(prediction)+
Color.end)
print(img_predictions[prediction].value_counts().sample(10))
10 Random entries and counts from p1 column
hyena             2
mud_turtle        1
prison            3
teapot            1
Saint_Bernard     7
giant_panda       1
English_setter    8
porcupine         5
coffee_mug        1
Leonberg          3
Name: p1, dtype: int64

10 Random entries and counts from p2 column
cloak                              1
Bernese_mountain_dog               1
toaster                            1
Brittany_spaniel                   8
rotisserie                         2
American_Staffordshire_terrier    21
printer                            1
mud_turtle                         1
gibbon                             2
hyena                              1
Name: p2, dtype: int64

10 Random entries and counts from p3 column
feather_boa                  2
lakeside                     2
lampshade                    1
quilt                        4
bull_mastiff                20
Chesapeake_Bay_retriever    27
loggerhead                   1
wombat                       1
space_shuttle                1
French_loaf                  2
Name: p3, dtype: int64
- It seems that not all the predictions in our `img_predictions` dataframe correspond to actual dog breeds.
Let's investigate this case further, and check for situations where none of the predictions detected a dog breed:
# Check for situations where the three predictions were not dog breeds
mask = (~img_predictions.p1_dog) & (~img_predictions.p2_dog) & (~img_predictions.p3_dog)
no_dog_predicted = img_predictions[mask]
print(Color.red + Color.bold +
'{} records found with no dogs detected!\n'.format(no_dog_predicted.shape[0])+
Color.end)
print(Color.green+ 'Printing the first five records...' +Color.end)
no_dog_predicted.head(5)
324 records found with no dogs detected!
Printing the first five records...
tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 666051853826850816 | https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg | 1 | box_turtle | 0.933012 | False | mud_turtle | 0.045885 | False | terrapin | 0.017885 | False |
17 | 666104133288665088 | https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg | 1 | hen | 0.965932 | False | cock | 0.033919 | False | partridge | 0.000052 | False |
18 | 666268910803644416 | https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg | 1 | desktop_computer | 0.086502 | False | desk | 0.085547 | False | bookcase | 0.079480 | False |
21 | 666293911632134144 | https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg | 1 | three-toed_sloth | 0.914671 | False | otter | 0.015250 | False | great_grey_owl | 0.013207 | False |
25 | 666362758909284353 | https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg | 1 | guinea_pig | 0.996496 | False | skunk | 0.002402 | False | hamster | 0.000461 | False |
print(Color.green + Color.bold + 'Collecting two image samples for viewing..' + Color.end)
# Explore some of the images to crosscheck the predictions
for url in no_dog_predicted['jpg_url'].sample(2, random_state=12):
response = requests.get(url)
img = Image.open(BytesIO(response.content))
display(img)
Collecting two image samples for viewing..
- In 324 cases, none of the predictions `p1`, `p2` and `p3` detected a dog breed.
- The pulled images show that some of the tweets were not actually about dogs. This may explain why the algorithms didn't detect a dog in the first place.
- Further examination also shows some instances where the neural network gave false negative responses (a dog was present, but was detected as absent).
# Examine a summary of Json_tweet_info
json_tweet_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2327 non-null   object
 1   hashtag   2327 non-null   object
 2   retweets  2327 non-null   int64
 3   likes     2327 non-null   int64
dtypes: int64(2), object(2)
memory usage: 72.8+ KB
- 2327 records are present, 29 fewer than in the WeRateDogs archive data.
- The majority of these missing records were caused by Tweepy errors (probably from deleted tweets) during the gathering process.
# Check the unique records in the dataframe columns.
for col in json_tweet_info.columns[1:]:
print(Color.bold + Color.blue +
'\nValue counts for {} column\n'.format(col) + Color.end)
print(json_tweet_info[col].value_counts())
Value counts for hashtag column

None                               2300
[#BarkWeek]                           9
[#PrideMonth]                         3
[#WKCDogShow]                         1
[#notallpuppers]                      1
[#LoveTwitter]                        1
[#FinalFur]                           1
[#ImWithThor]                         1
[#WomensMarch]                        1
[#BellLetsTalk]                       1
[#GoodDogs]                           1
[#K9VeteransDay]                      1
[#ScienceMarch]                       1
[#dogsatpollingstations]              1
[#PrideMonthPuppo, #PrideMonth]       1
[#Canada150]                          1
[#BATP]                               1
[#NoDaysOff, #swole]                  1
Name: hashtag, dtype: int64

Value counts for retweets column

1019    7
552     6
50      6
409     5
406     5
       ..
1616    1
1656    1
3632    1
2719    1
706     1
Name: retweets, Length: 1637, dtype: int64

Value counts for likes column

0        160
2264       4
2205       4
659        3
3053       3
        ...
3864       1
3940       1
13424      1
5088       1
2293       1
Name: likes, Length: 1959, dtype: int64
- When present, hashtags are stored as lists instead of as Python strings (a possible normalization is sketched below).
- Some tweets are associated with multiple hashtags.
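If we wanted uniform string values, one option (a sketch, not applied here) is to join each list into a single comma-separated string:

# Sketch: flatten list-valued hashtags into comma-separated strings
json_tweet_info['hashtag'] = json_tweet_info['hashtag'].apply(
    lambda h: ', '.join(h) if isinstance(h, list) else h)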
Finally, let's check for columns that may be common across the three dataframes:
# Check for the common columns across the three dataframes.
np.intersect1d(np.intersect1d(wrd_archive.columns, img_predictions.columns),
json_tweet_info.columns)
array(['tweet_id'], dtype=object)
- The `tweet_id` column is the only common column across the three datasets.
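This means the three tables can eventually be joined on `tweet_id`. A sketch of that merge (assuming all three `tweet_id` columns have first been converted to strings, and using the cleaned copies created in the next section):

master = (archive_clean
          .merge(predictions_clean, on='tweet_id', how='inner')
          .merge(json_clean, on='tweet_id', how='inner'))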
The section below summarizes the findings from both visual and programmatic assessment of the datasets.

Quality Issues

`wrd_archive`:
- `tweet_id` is stored with the wrong datatype. Should be a string/object type.
- The `timestamp` column is stored as a string/object type rather than the Pandas datetime type.
- Null records in the `expanded_urls` column, the majority being retweeted posts and replies.
- Unexpected ratings in the `rating_numerator` and `rating_denominator` columns, with numerators as high as 1776 and denominators as low as 0.
- The `expanded_urls` column sometimes contains more than one link, separated by commas, all leading to the same page.

`img_predictions`:
- `tweet_id` is stored with the wrong datatype. Should be a string/object type.
- The predictions in columns `p1`, `p2` and `p3` are not uniformly formatted. Some entries are lowercase, some are uppercase and some are titlecase.
- 281 fewer records than in `wrd_archive`.
- Not all the predictions in `p1`, `p2`, and `p3` are dog breeds.

`json_tweet_info`:
- 29 fewer records than in `wrd_archive`; the majority caused by Tweepy errors during the gathering process.

Tidiness Issues

- Unwanted columns present: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`.
- The `text` column contains both tweet url and tweet text.
- Long and unnecessary links in the `source` column, with text embedded within HTML anchor tags.
- `rating_numerator` and `rating_denominator` can be reduced to one column.
- The `p1_dog`, `p2_dog` and `p3_dog` columns can be used to select the appropriate predictions to be used, then removed from our dataframe.
- Unneeded `img_num` column.
- All three dataframes can be merged into one master dataframe built around `wrd_archive`.

Before cleaning, we will create individual copies of the three dataframes:
# Create copies of the original dataframes
archive_clean = wrd_archive.copy()
predictions_clean = img_predictions.copy()
json_clean = json_tweet_info.copy()
# Filter out retweets and replies using a boolean mask
retweet_reply_mask = (archive_clean.retweeted_status_id.notnull() |
archive_clean.in_reply_to_status_id.notnull())
archive_clean = archive_clean[~retweet_reply_mask]
# Verify the absence of entries for the retweet and reply columns
assert archive_clean.retweeted_status_id.isnull().all()
assert archive_clean.in_reply_to_status_id.isnull().all()
print(Color.green + Color.bold +
'archive_clean has reduced to {:,} records.'.format(archive_clean.shape[0])+
Color.end)
archive_clean has reduced to 2,097 records.
# Convert tweet_ids to string datatype
archive_clean['tweet_id'] = archive_clean['tweet_id'].astype(str)
# Convert timestamp to a pandas datetime object
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])
archive_clean[['tweet_id', 'timestamp']].dtypes
tweet_id                  object
timestamp    datetime64[ns, UTC]
dtype: object
# Create a boolean mask to identify the unusual names
unusual_names_mask = archive_clean['name'].str.match(r"[a-z].?")
# Identify each unique unusual name from the name column
unusual_names = archive_clean['name'][unusual_names_mask].unique()
# Replace all unusual names with None
archive_clean['name'] = archive_clean['name'].apply(lambda n: 'None' if n in unusual_names else n)
# Verify if there are any improper names still present
assert archive_clean['name'].str.match(r"[a-z].?").sum() == 0
Null records in the `expanded_urls` column, the majority being retweets and replies. The `expanded_urls` column sometimes contains more than one link, separated by commas, all leading to the same page.

- Drop the `expanded_urls` column since the urls are already present in the tweet text.
- Having multiple links leading to the same page is redundant. We will split the text column into two columns later.
# Drop the expanded urls from archive clean
archive_clean.drop(columns='expanded_urls', inplace = True)
# Check if the expanded urls column is now absent from the dataframe
assert 'expanded_urls' not in archive_clean.columns
Unexpected ratings in the `rating_numerator` and `rating_denominator` columns, with numerators as high as 1776 and denominators as low as 0.

The fact that the rating numerators are greater than the denominators does not need to be cleaned; this unique rating system is a big part of the popularity of WeRateDogs. However, we will:

- Remove the records with the overly high ratings of 420/10 and 1776/10.
- Remove the record with a rating of 24/7. This is a date, not an actual rating; the right rating is absent from the text.
- Programmatically extract the right ratings from the text to replace the wrong ones.
- Convert high ratings allocated to dog groups to a scale of 10. This will be done later, when tidying up the dataframe.
# Filter out records with unwanted ratings: 420/10, 1776/10 and 24/7
for num, denum in zip([420, 1776, 24], [10, 10, 7]):
mask = (archive_clean['rating_numerator'] == num) & (archive_clean['rating_denominator'] == denum)
archive_clean = archive_clean[~mask]
# Isolate unusual ratings: numerator > 15 and denominator not equal to 10
unusual_rating_mask = (archive_clean['rating_numerator'] > 15) | (archive_clean['rating_denominator']!=10)
unusual_ratings = archive_clean[unusual_rating_mask].copy()
# Replace the numerator and denominators with the right values, if present in the tweet text
pattern = r"([0-9\.]+/[0-9]+)"
unusual_ratings[['rating_numerator', 'rating_denominator']] = (unusual_ratings['text']
.str.findall(pattern)
.str[-1]
.str.split('/', expand=True)
)
# Streamline the result down to the relevant columns
cleaned_ratings = unusual_ratings[['text', 'rating_numerator', 'rating_denominator']]
# Update the ratings in archive clean with the cleaned ratings
archive_clean.update(cleaned_ratings)
# Verify the removal of the unwanted ratings.
for rating_num in [420, 24, 1776]:
assert rating_num not in archive_clean.rating_numerator.unique()
# Verify the records with unusual ratings
archive_clean[unusual_rating_mask][['text', 'rating_numerator', 'rating_denominator']]
text | rating_numerator | rating_denominator | |
---|---|---|---|
433 | The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd | 84 | 70 |
695 | This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS | 9.75 | 10 |
763 | This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://... | 11.27 | 10 |
902 | Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE | 165 | 150 |
1068 | After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ | 14 | 10 |
1120 | Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv | 204 | 170 |
1165 | Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a | 13 | 10 |
1202 | This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq | 11 | 10 |
1228 | Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 | 99 | 90 |
1254 | Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 | 80 | 80 |
1274 | From left to right:\nCletus, Jerome, Alejandro, Burp, & Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK | 45 | 50 |
1351 | Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa | 60 | 50 |
1433 | Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ | 44 | 40 |
1635 | Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 | 121 | 110 |
1662 | This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5 | 10 | 10 |
1712 | Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD | 11.26 | 10 |
1779 | IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq | 144 | 120 |
1843 | Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw | 88 | 80 |
2335 | This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv | 9 | 10 |
# Drop unwanted columns from archive clean
unwanted_cols = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp']
archive_clean.drop(columns=unwanted_cols, inplace=True)
# Verify that the unwanted columns have been dropped
for col in unwanted_cols:
assert col not in archive_clean.columns
# Create a pattern to extract urls
pattern = r"(http.+)"
# Extract urls into a tweet url column
archive_clean['tweet_url'] = archive_clean['text'].str.extract(pattern)
# Account for records where tweet text does not contain a url
archive_clean['tweet_url'].fillna('None', inplace = True)
# Remove urls from the text column
archive_clean['text'] = archive_clean['text'].str.replace(pattern, '', regex=True)
archive_clean[['text', 'tweet_url']].sample(5)
text | tweet_url | |
---|---|---|
1582 | This is Baxter. He looks like a fun dog. Prefers action shots. 11/10 the last one is impeccable | https://t.co/LHcH1yhhIb |
1871 | When you're presenting a group project and the 4th guy tells the teacher that he did all the work. 10/10 | https://t.co/f50mbB4UWS |
7 | When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 | https://t.co/v0nONBcwxq |
1830 | This is Kenneth. He's stuck in a bubble. 10/10 hang in there Kenneth | https://t.co/uQt37xlYMJ |
1459 | This may be the greatest video I've ever been sent. 4/10 for Charles the puppy, 13/10 overall. (Vid by @stevenxx_) | https://t.co/uaJmNgXR2P |
# Create a pattern to extract info between the <a></a> tags
pattern = r">(.+)<"
# Extract information using the defined pattern
archive_clean['source'] = archive_clean['source'].str.extract(pattern)
# Verify the extraction process
archive_clean.source.value_counts()
Twitter for iPhone 1962 Vine - Make a Scene 91 Twitter Web Client 31 TweetDeck 10 Name: source, dtype: int64
`rating_numerator` and `rating_denominator` can be reduced to one column.

- Convert all ratings to a denominator scale of 10 using the expression: $rating = \frac{rating\,numerator}{rating\,denominator} \times 10$. With this expression, a rating of 120/100 becomes 12/10 and a rating of 55/60 becomes 9.17/10.
- Once the ratings are standardized, reduce the ratings to a single column called `rating`.
- Drop the `rating_numerator` and `rating_denominator` columns.
# Use the expression to calculate a single rating value
rating = 10 * (archive_clean['rating_numerator'].astype(float) / archive_clean['rating_denominator'].astype(float))
# Allocate the values into a new column in archive_clean
archive_clean['rating'] = rating
# Drop the rating numerator and denominator columns
archive_clean.drop(columns=['rating_numerator', 'rating_denominator'], inplace=True)
# verify the removal of the dropped columns
for col in 'rating_numerator', 'rating_denominator':
assert col not in archive_clean.columns
# Check how standardized ratings are now distributed in the dataframe
archive_clean.rating.describe().to_frame()
rating | |
---|---|
count | 2094.000000 |
mean | 10.610926 |
std | 2.147757 |
min | 0.000000 |
25% | 10.000000 |
50% | 11.000000 |
75% | 12.000000 |
max | 14.000000 |
- The ratings now appear standardized. However, it seems there are record(s) with ratings of 0. We should investigate this further:
# Verify the rating in the tweet text where the rating is equal to zero
print(Color.bold + Color.green +
'Verifying the text in records where rating is zero...\n'+
Color.end)
print(archive_clean.loc[archive_clean.rating==0, 'text'])
Verifying the text in records where rating is zero...
315 When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10
Name: text, dtype: object
- The 0/10 is a genuine rating present in the tweet text, so we will leave it as is.

Next, we tidy up the dog stage columns:

- Check and correct for conflicting dog stages, if present.
- Store all the dog stages in a single column called `stage`.
- Drop the columns `doggo`, `pupper`, `puppo`, and `floofer`.
- Set the `stage` column to a categorical type.
# Isolate the dog stage columns into a dataframe
stage_df = archive_clean[['doggo', 'pupper', 'puppo', 'floofer']]
# Check if there are situations where multiple stages co-exist
print(Color.bold + Color.green +
'Checking for the existence of multiple dog stages\n'+
Color.end)
stage = stage_df.sum(axis=1)
stage.value_counts()
Checking for the existence of multiple dog stages
NoneNoneNoneNone 1758 NonepupperNoneNone 221 doggoNoneNoneNone 72 NoneNonepuppoNone 23 NoneNoneNonefloofer 9 doggopupperNoneNone 9 doggoNonepuppoNone 1 doggoNoneNonefloofer 1 dtype: int64
- It seems that some records actually present with multiple dog stages. The raw concatenations above are hard to read, so we will trim off the extra `None` substrings.
# Remove 'None' from each entry, unless the string is made up only of Nones
stage = stage.apply(lambda x: x.replace('None', '') if x.replace('None', '') != '' else 'None')
stage.value_counts()
None 1758 pupper 221 doggo 72 puppo 23 floofer 9 doggopupper 9 doggopuppo 1 doggofloofer 1 dtype: int64
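As an aside, the same trimming can be done without `apply`, using pandas' vectorized string methods. A minimal sketch over the same `stage` series:
# Strip the 'None' substrings, then restore 'None' for rows that contained nothing else
stage_alt = stage.str.replace('None', '', regex=False).replace('', 'None')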
- 11 records have dogs classified as a mix of doggo and some other stages. To be sure this was not done in error, we can examine the tweet text in detail. These records are few, so we can manually identify and correct them.
# Assign the dog stages into a column in archive_clean
archive_clean['stage'] = stage
# Identify and isolate records where dogs were assigned multiple stages
multiple_stages = ['doggopupper', 'doggopuppo', 'doggofloofer']
multiple_stage_mask = archive_clean.stage.apply(lambda x: x in multiple_stages)
# Examine these occurrences
archive_clean[multiple_stage_mask][['tweet_id', 'tweet_url', 'text', 'stage']]
tweet_id | tweet_url | text | stage | |
---|---|---|---|---|
191 | 855851453814013952 | https://t.co/cMhq16isel | Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for | doggopuppo |
200 | 854010172552949760 | https://t.co/TXdT3tmuYk | At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs | doggofloofer |
460 | 817777686764523521 | https://t.co/m7isZrOBX7 | This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodoggo) | doggopupper |
531 | 808106460588765185 | https://t.co/ANBpEYHaho | Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time | doggopupper |
575 | 801115127852503040 | https://t.co/55Dqe0SJNj | This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine | doggopupper |
705 | 785639753186217984 | https://t.co/f2wmLZTPHd | This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously | doggopupper |
733 | 781308096455073793 | https://t.co/WQvcPEpH2u | Pupper butt 1, Doggo 0. Both 12/10 | doggopupper |
889 | 759793422261743616 | https://t.co/MYwR4DQKll | Meet Maggie & Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time | doggopupper |
956 | 751583847268179968 | https://t.co/u2c9c7qSg8 | Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho | doggopupper |
1063 | 741067306818797568 | https://t.co/o5J479bZUC | This is just downright precious af. 12/10 for both pupper and doggo | doggopupper |
1113 | 733109485275860992 | https://t.co/pG2inLaOda | Like father (doggo), like son (pupper). Both 12/10 | doggopupper |
After examining the tweet ids, the tweet text and the tweet urls, we can observe the following:
- Tweets with id: 808106460588765185, 781308096455073793, 759793422261743616, 741067306818797568, and 733109485275860992 are actually about two dogs, a doggo and a pupper, hence the doggopupper classification. We will leave them as they are.
- Some dogs were erroneously categorized, but the appropriate dog stage is in the tweet text:
- [855851453814013952](https://t.co/cMhq16isel) should be puppo.
- [854010172552949760](https://t.co/TXdT3tmuYk) should be floofer.
- [817777686764523521](https://t.co/m7isZrOBX7) should be pupper.
- [801115127852503040](https://t.co/55Dqe0SJNj) should be pupper.
- [751583847268179968](https://t.co/u2c9c7qSg8) should be doggo.
- 785639753186217984 is not about a dog. The tweet is actually about a hedgehog. We will remove this record.
# Remove the record about a hedgehog: tweet_id 785639753186217984.
archive_clean = archive_clean.query("tweet_id != '785639753186217984'")
# Correct the erroneously categorized records
correction_dict = {
'855851453814013952': 'puppo',
'854010172552949760': 'floofer',
'817777686764523521': 'pupper',
'801115127852503040': 'pupper',
'751583847268179968': 'doggo'
}
for tweet_id_value, corrected_stage in correction_dict.items():
    archive_clean.loc[archive_clean['tweet_id'] == tweet_id_value, 'stage'] = corrected_stage
# Drop the columns puppo, doggo, floofer and pupper
archive_clean.drop(columns=['doggo', 'pupper', 'puppo', 'floofer'], inplace=True)
# Convert the stage column to categorical type
archive_clean.stage = archive_clean.stage.astype('category')
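As an aside, the correction loop above can also be written as a single vectorized assignment. A sketch, where `map` supplies the corrected stages and the boolean mask limits the assignment to the affected tweet ids:
# Equivalent one-step correction using a boolean mask and Series.map
mask = archive_clean['tweet_id'].isin(correction_dict.keys())
archive_clean.loc[mask, 'stage'] = archive_clean['tweet_id'].map(correction_dict)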
# Check if the record with tweet_id 785639753186217984 has been dropped
assert '785639753186217984' not in archive_clean.tweet_id.values
# Check that the unwanted columns have been dropped
assert not archive_clean.columns.isin(['doggo', 'pupper', 'puppo', 'floofer']).any()
# Verify the datatype in the stage column
assert archive_clean.stage.dtypes == 'category'
# Verify the values in the stage column
archive_clean.stage.value_counts()
None 1758 pupper 223 doggo 73 puppo 24 floofer 10 doggopupper 5 Name: stage, dtype: int64
Note: One more thing! Let's format the dog stage entries to title case, then give `doggopupper` a more befitting value.
# Format dog stage entries to title case, giving 'doggopupper' a more befitting label
# (rename_categories keeps the column's categorical dtype intact)
archive_clean.stage = archive_clean.stage.cat.rename_categories(
    lambda x: x.title() if x != 'doggopupper' else 'Doggo with Pupper')
archive_clean.stage.value_counts()
None 1758 Pupper 223 Doggo 73 Puppo 24 Floofer 10 Doggo with Pupper 5 Name: stage, dtype: int64
Finally, let's reset the dataframe index and preview our cleaning results:
# Reset the indices of the archive clean dataframe
archive_clean = archive_clean.reset_index(drop=True)
# Preview cleaning results
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2093 entries, 0 to 2092
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   tweet_id   2093 non-null   object
 1   timestamp  2093 non-null   datetime64[ns, UTC]
 2   source     2093 non-null   object
 3   text       2093 non-null   object
 4   name       2093 non-null   object
 5   tweet_url  2093 non-null   object
 6   rating     2093 non-null   float64
 7   stage      2093 non-null   category
dtypes: category(1), datetime64[ns, UTC](1), float64(1), object(5)
memory usage: 116.8+ KB
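One more cheap guarantee worth adding here (a sketch, not part of the original checks): tweet ids should be unique in the cleaned archive, something the merges later in this notebook implicitly rely on.
# Ensure no tweet appears twice in the cleaned archive
assert not archive_clean.tweet_id.duplicated().any()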
Note: Some of the tidiness issues in `img_predictions` overlap with quality issues. Solving the tidiness issues first can make dealing with the quality issues easier later.
- The prediction columns (`p1`, `p2`, `p3`) and their respective confidence columns (`p1_conf`, `p2_conf`, `p3_conf`) can be reduced into two columns containing `prediction` and `confidence` variables.
- The `p1_dog`, `p2_dog` and `p3_dog` columns can be used to select the appropriate predictions.
- The `img_num` column is not needed and can be dropped.
column.Part A
- Iterate through each row of
predictions_clean
and extract the best prediction and confidence values.- Assign these values into new columns named
breed
andconfidence
.
Part B
- Drop all unwanted columns:
p1
,p2
,p3
,p1_conf
,p2_conf
,p3_conf
,p1_dog
,p2_dog
,p3_dog
andimg_num
.
# Create a list to store the best prediction and confidence values
prediction_list = []
# Define a function to perform the extraction process
def extract_breed_info(row):
    """
    Extract the best prediction and confidence value from the passed row.
    Params:
        row: a row from the dataframe of interest.
    Output:
        Appends a dictionary containing the breed and confidence to prediction_list.
        Returns a short status string.
    """
    if row.p1_dog:
        prediction_list.append({'breed': row.p1, 'confidence': row.p1_conf})
    elif row.p2_dog:
        prediction_list.append({'breed': row.p2, 'confidence': row.p2_conf})
    elif row.p3_dog:
        prediction_list.append({'breed': row.p3, 'confidence': row.p3_conf})
    else:
        prediction_list.append({'breed': 'Unknown', 'confidence': 0})
    return 'Info extracted to prediction list'
# Run the extraction process
predictions_clean.apply(extract_breed_info, axis=1)
0 Info extracted to prediction list 1 Info extracted to prediction list 2 Info extracted to prediction list 3 Info extracted to prediction list 4 Info extracted to prediction list ... 2070 Info extracted to prediction list 2071 Info extracted to prediction list 2072 Info extracted to prediction list 2073 Info extracted to prediction list 2074 Info extracted to prediction list Length: 2075, dtype: object
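For what it's worth, the same best-prediction selection can be vectorized with numpy's `select`, avoiding the row-wise `apply` and the module-level list. A sketch using the already-imported numpy:
# Conditions are evaluated in order, mirroring the if/elif chain above
conditions = [predictions_clean.p1_dog, predictions_clean.p2_dog, predictions_clean.p3_dog]
breeds = [predictions_clean.p1, predictions_clean.p2, predictions_clean.p3]
confidences = [predictions_clean.p1_conf, predictions_clean.p2_conf, predictions_clean.p3_conf]
predictions_clean['breed'] = np.select(conditions, breeds, default='Unknown')
predictions_clean['confidence'] = np.select(conditions, confidences, default=0)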
# Assign the values in prediction list into new columns in predictions_clean
predictions_clean[['breed', 'confidence']] = pd.DataFrame(prediction_list)
# Round confidence to three decimal places
predictions_clean.confidence = round(predictions_clean.confidence, 3)
# Verify the extraction process
predictions_clean.iloc[:, 3:].sample(5)
p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | breed | confidence | |
---|---|---|---|---|---|---|---|---|---|---|---|
1391 | beagle | 0.451697 | True | basset | 0.197513 | True | bloodhound | 0.072699 | True | beagle | 0.452 |
98 | fire_engine | 0.883493 | False | tow_truck | 0.074734 | False | jeep | 0.012773 | False | Unknown | 0.000 |
24 | malamute | 0.336874 | True | Siberian_husky | 0.147655 | True | Eskimo_dog | 0.093412 | True | malamute | 0.337 |
2009 | basset | 0.320420 | True | collie | 0.215975 | True | Appenzeller | 0.128507 | True | basset | 0.320 |
175 | Chihuahua | 0.803528 | True | Pomeranian | 0.053871 | True | chow | 0.032257 | True | Chihuahua | 0.804 |
# Create a list of unwanted columns
unwanted_columns = ['p1','p2', 'p3','p1_conf', 'p2_conf', 'p3_conf','p1_dog', 'p2_dog', 'p3_dog', 'img_num']
# Drop all unwanted columns
predictions_clean.drop(columns=unwanted_columns, inplace=True)
# Check that the unwanted columns have been dropped
assert not predictions_clean.columns.isin(unwanted_columns).any()
# Convert tweet_ids to string datatype
predictions_clean['tweet_id'] = predictions_clean['tweet_id'].astype(str)
# Verify the datatype for tweet_id
assert predictions_clean['tweet_id'].dtypes == 'O'
`p1`, `p2` and `p3` entries are not uniformly formatted: some are lowercase, some are uppercase and some are titlecase.
- Perform the cleaning on the `breed` column instead, since `p1`, `p2` and `p3` have been removed.
- Replace underscores with spaces and format all entries to titlecase.
# Remove all underscores and format the breed text to titlecase.
predictions_clean.breed = predictions_clean.breed.str.replace('_', ' ').str.title()
predictions_clean.breed.unique()
array(['Welsh Springer Spaniel', 'Redbone', 'German Shepherd', 'Rhodesian Ridgeback', 'Miniature Pinscher', 'Bernese Mountain Dog', 'Unknown', 'Chow', 'Golden Retriever', 'Miniature Poodle', 'Gordon Setter', 'Walker Hound', 'Pug', 'Bloodhound', 'Lhasa', 'English Setter', 'Italian Greyhound', 'Maltese Dog', 'Newfoundland', 'Malamute', 'Soft-Coated Wheaten Terrier', 'Chihuahua', 'Black-And-Tan Coonhound', 'Toy Terrier', 'Blenheim Spaniel', 'Pembroke', 'Irish Terrier', 'Chesapeake Bay Retriever', 'Curly-Coated Retriever', 'Dalmatian', 'Ibizan Hound', 'Border Collie', 'Labrador Retriever', 'Miniature Schnauzer', 'Airedale', 'Rottweiler', 'West Highland White Terrier', 'Toy Poodle', 'Giant Schnauzer', 'Vizsla', 'Siberian Husky', 'Papillon', 'Saint Bernard', 'Tibetan Terrier', 'Borzoi', 'Beagle', 'Yorkshire Terrier', 'Pomeranian', 'Kuvasz', 'Flat-Coated Retriever', 'Norwegian Elkhound', 'Boxer', 'Eskimo Dog', 'Standard Poodle', 'Staffordshire Bullterrier', 'Basenji', 'Lakeland Terrier', 'American Staffordshire Terrier', 'Shih-Tzu', 'Groenendael', 'French Bulldog', 'Pekinese', 'Komondor', 'Malinois', 'Kelpie', 'Brittany Spaniel', 'Cocker Spaniel', 'Basset', 'English Springer', 'Cardigan', 'Brabancon Griffon', 'German Short-Haired Pointer', 'Shetland Sheepdog', 'Cairn', 'Whippet', 'Sussex Spaniel', 'Dandie Dinmont', 'Norwich Terrier', 'Keeshond', 'Norfolk Terrier', 'Old English Sheepdog', 'Samoyed', 'Scottish Deerhound', 'Doberman', 'Irish Wolfhound', 'Great Pyrenees', 'Schipperke', 'Bull Mastiff', 'Collie', 'Greater Swiss Mountain Dog', 'Standard Schnauzer', 'Irish Water Spaniel', 'Boston Bull', 'Japanese Spaniel', 'Bedlington Terrier', 'Entlebucher', 'Bluetick', 'Irish Setter', 'Leonberg', 'Mexican Hairless', 'Weimaraner', 'Great Dane', 'Tibetan Mastiff', 'Scotch Terrier', 'Australian Terrier', 'Briard', 'Appenzeller', 'Border Terrier', 'Wire-Haired Fox Terrier', 'Saluki', 'Silky Terrier', 'Afghan Hound', 'Clumber', 'Bouvier Des Flandres'], dtype=object)
Note: We will not clean entries containing dashes (`-`), since their use is grammatically correct in this case.
Not all predictions in `p1`, `p2`, and `p3` are dog breeds.
- This has been addressed in the process of tidying up the data.
- Predictions that are not dog breeds have been assigned a value of Unknown.
# Verify the presence of records tagged unknown
predictions_clean.breed.value_counts().head()
Unknown 324 Golden Retriever 173 Labrador Retriever 113 Pembroke 96 Chihuahua 95 Name: breed, dtype: int64
The image predictions data contains fewer records than `wrd_archive`.
- Account for this by merging `archive_clean` and `predictions_clean` with an inner join. This way, only records common to both dataframes will be retained.
- We will do this after cleaning the `json_clean` dataframe.
- Transform each hashtag from the tweet into a distinct record using the `df.explode()` method.
# Transform each element in the hashtag list to a distinct row in the dataframe
json_clean = json_clean.explode('hashtag')
json_clean.hashtag.value_counts()
None 2300 #BarkWeek 9 #PrideMonth 4 #BellLetsTalk 1 #NoDaysOff 1 #notallpuppers 1 #LoveTwitter 1 #FinalFur 1 #ImWithThor 1 #WomensMarch 1 #GoodDogs 1 #WKCDogShow 1 #K9VeteransDay 1 #ScienceMarch 1 #dogsatpollingstations 1 #PrideMonthPuppo 1 #Canada150 1 #BATP 1 #swole 1 Name: hashtag, dtype: int64
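To illustrate what `explode` does, here is a toy example on hypothetical data (not from the archive):
# Each list element becomes its own row; the original index is repeated
toy = pd.DataFrame({'tweet_id': ['1', '2'], 'hashtag': [['#a', '#b'], ['#c']]})
print(toy.explode('hashtag'))
#   tweet_id hashtag
# 0        1      #a
# 0        1      #b
# 1        2      #c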
The engagement data in `json_clean` (retweets, likes and hashtags) describes the same observational unit as the tweets in `wrd_archive`.
- We can resolve this by merging the `json_clean` dataframe into `archive_clean`.
It is better to create a master dataset by merging all the cleaned dataframes with inner joins. In addition to the issue listed above, this merge also addresses the pending issue of unequal record counts across the three dataframes.
# reset dataframe indices for json_clean and predictions_clean
json_clean = json_clean.reset_index(drop=True)
predictions_clean = predictions_clean.reset_index(drop=True)
# Merge archive clean and prediction clean into master df
master_df = pd.merge(archive_clean, predictions_clean, on='tweet_id', how='inner')
# Merge json clean into master df
master_df = pd.merge(master_df, json_clean, on='tweet_id', how='inner')
# Check results
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1961 entries, 0 to 1960
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   tweet_id    1961 non-null   object
 1   timestamp   1961 non-null   datetime64[ns, UTC]
 2   source      1961 non-null   object
 3   text        1961 non-null   object
 4   name        1961 non-null   object
 5   tweet_url   1961 non-null   object
 6   rating      1961 non-null   float64
 7   stage       1961 non-null   category
 8   jpg_url     1961 non-null   object
 9   breed       1961 non-null   object
 10  confidence  1961 non-null   float64
 11  hashtag     1961 non-null   object
 12  retweets    1961 non-null   int64
 13  likes       1961 non-null   int64
dtypes: category(1), datetime64[ns, UTC](1), float64(2), int64(2), object(8)
memory usage: 216.6+ KB
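Since inner joins silently drop rows and, with duplicate keys, can multiply them, the merge cell above could be guarded with pandas' `validate` argument. A sketch: the `'one_to_one'` check assumes unique tweet ids on both sides, which holds for `archive_clean` and `predictions_clean`; the exploded `json_clean` can legitimately repeat tweet ids, so `'one_to_many'` is the most we can assert there.
# Raise a MergeError if the key assumptions are violated
master_df = pd.merge(archive_clean, predictions_clean, on='tweet_id', how='inner', validate='one_to_one')
master_df = pd.merge(master_df, json_clean, on='tweet_id', how='inner', validate='one_to_many')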
- We can reorder the columns in a more intuitive pattern.
- We should assign more descriptive names to columns like `name`, `breed`, `stage`, and `jpg_url`.
Let's add these finishing touches to our master dataframe:
# Order the columns in master df
column_order = ['tweet_id', 'timestamp', 'name', 'breed', 'confidence', 'stage', 'rating',
'hashtag', 'retweets', 'likes', 'jpg_url', 'tweet_url', 'text']
master_df = master_df[column_order]
# Give some columns descriptive names
master_df.rename(
columns={
'name': 'dog_name',
'breed': 'dog_breed',
'stage': 'dog_stage',
'jpg_url': 'image'
}, inplace=True)
# Preview master dataframe information
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1961 entries, 0 to 1960
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   tweet_id    1961 non-null   object
 1   timestamp   1961 non-null   datetime64[ns, UTC]
 2   dog_name    1961 non-null   object
 3   dog_breed   1961 non-null   object
 4   confidence  1961 non-null   float64
 5   dog_stage   1961 non-null   category
 6   rating      1961 non-null   float64
 7   hashtag     1961 non-null   object
 8   retweets    1961 non-null   int64
 9   likes       1961 non-null   int64
 10  image       1961 non-null   object
 11  tweet_url   1961 non-null   object
 12  text        1961 non-null   object
dtypes: category(1), datetime64[ns, UTC](1), float64(2), int64(2), object(7)
memory usage: 201.3+ KB
master_df.head(1)
tweet_id | timestamp | dog_name | dog_breed | confidence | dog_stage | rating | hashtag | retweets | likes | image | tweet_url | text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892420643555336193 | 2017-08-01 16:23:56+00:00 | Phineas | Unknown | 0.0 | None | 13.0 | None | 7018 | 33839 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | https://t.co/MgUWQ76dJU | This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 |
Let's store our cleaned master dataframe locally to a file named `twitter_archive_master.csv`:
# Store master_df locally
master_df.to_csv('./twitter_archive_master.csv', index=False, encoding='utf-8')
# Verify storage process
print(Color.green + Color.bold +
'Printing csv file list in local directory...'+
Color.end)
!ls -lh *.csv
Printing csv file list in local directory...
-rw-r--r--@ 1 israelogunmola staff 894K Jun 1 19:56 twitter-archive-enhanced.csv
-rw-r--r-- 1 israelogunmola staff 514K Jun 15 15:53 twitter_archive_master.csv
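As a quick sanity check (a sketch, not part of the original pipeline), we can read the file back and confirm the records survived the round trip. Note that dtypes such as category and timezone-aware datetimes do not round-trip through CSV:
# Read the stored file back and compare record counts and column order
reloaded = pd.read_csv('./twitter_archive_master.csv')
assert reloaded.shape[0] == master_df.shape[0]
assert list(reloaded.columns) == list(master_df.columns)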
Our analysis will focus on exploring the data to understand the following:
- User engagement will be assessed in terms of retweets and likes.
- Filter out records for which we don't have a dog breed.
- Consider the top 20 breeds in terms of the number of tweets posted.
- Visualize this information using a bar graph.
# Remove records with unknown breeds
known_breeds_df = master_df.query("dog_breed != 'Unknown'")
# Isolate the 20 most popular breeds and their tweet counts
popular_20 = known_breeds_df.dog_breed.value_counts().head(20)
# Create a folder to store visualizations locally
folder = 'images'
if not os.path.exists(folder):
os.makedirs(folder)
# Create a visual
fig = px.bar(x=popular_20.index, y=popular_20.values, text=popular_20.values, height=600, width=1200, template='plotly_white')
fig.update_xaxes(tickangle=90, linecolor='grey')
fig.update_yaxes(showgrid=False, showticklabels=False)
fig.update_traces(width=0.6, textposition='outside', marker_color='royalblue')
fig.update_layout(
yaxis_title='No of Tweets',
xaxis_title='Dog Breed',
title='20 Most Popular Breeds<br><sup>Popular breeds ranked by number of tweets between Nov 2015 and Jul 2017.</sup>',
paper_bgcolor='rgb(248, 248, 255)',
plot_bgcolor='rgb(248, 248, 255)',
font_family='Arial'
)
# Store graph locally
fig.write_image('images/fig1.svg')
# Display graph
fig.show('svg')
- The Golden Retriever is the most popular breed, with a total of 155 tweets. The Labrador Retriever follows with 103 tweets. Together, the two retrievers account for 258 tweets.
- Other notable breeds include the Pembroke, Chihuahua and Pug, occupying 3rd to 5th place respectively.
- Filter out records for which we don't have a dog breed. We already have this information in `known_breeds_df`.
- Filter out breeds with fewer than 10 tweets, since we are also considering popularity.
- Compute average ratings for each breed and select the top 20.
- Visualize the results.
# Identify the number of breeds with at least 10 tweets
breed_tweet_count = known_breeds_df.dog_breed.value_counts()
print(Color.red + Color.bold +
'There are {} breeds with at least 10 tweets'.format((breed_tweet_count >= 10).sum())+
Color.end)
There are 53 breeds with at least 10 tweets
# Get the names of breeds with at least 10 tweets
wanted_breeds = breed_tweet_count[breed_tweet_count >=10].index
# Select the right breeds from the dataframe
wanted_breed_ratings = master_df[master_df.dog_breed.isin(wanted_breeds)][['dog_breed', 'rating']]
# Compute average rating per breed, then select the top 20
top_20_rated = wanted_breed_ratings.groupby('dog_breed').mean().sort_values(by='rating').tail(20)
# --- Visualize results ---
# Create a color map to identify breeds in the popular 20 list
color_map = top_20_rated.reset_index()['dog_breed'].isin(popular_20.index)
# Create plot area and add traces
fig = go.Figure()
fig.add_trace(go.Bar(y=top_20_rated.index, x=top_20_rated.rating, orientation='h'))
fig.add_trace(go.Scatter(y=top_20_rated.index, x=top_20_rated.rating+0.016, mode='markers', marker_size=10))
# --- Set trace properties ---
# Set trace properties for bar plot
fig.update_traces(width=0.2, selector=dict(type="bar"))
# Set trace properties for scatter plot
fig.update_traces(textposition='middle right', selector=dict(type="scatter"),
marker_line_color='black', marker_line_width=1)
#Set trace properties common to both plots
fig.update_traces(marker=dict(color=color_map.astype(int), colorscale=[[0, '#7F7F7F'], [1, 'royalBlue']],
opacity=0.9))
# Update axes and plot layout
fig.update_yaxes(ticksuffix=' ', tickfont=dict(size=12))
fig.update_xaxes(showgrid=True, gridcolor='#ddd', showticklabels=True, range=[10, 12], tickfont_color='grey')
fig.update_layout(xaxis_title='Rating', yaxis_title='', font_family='Arial',
width=700, height=650, margin=dict(t=80, b=70),
template='plotly_white', showlegend=False,
title = 'Top 20 Breeds by Average Rating<br><sup>The blue bars represent breeds '+
'that are also present on the popular 20 list.</sup>',
paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)'
)
# Add annotations
fig.add_vline(x=10, annotation=dict(text='<b>Common denominator</b>'), annotation_position='top', annotation_font_color='IndianRed')
# Store graph locally
fig.write_image('images/fig2.svg')
fig.show('svg')
- The Samoyed, Golden Retriever, Great Pyrenees, Pembroke and Chow are the top five breeds in terms of average rating.
- 13 of the top rated breeds (13/20) are also present on the most popular list. It appears that these breeds do well in both rating and popularity.
- Identify and filter out records where a dog stage was not mentioned.
- Isolate only the proper dog stages (doggo, puppo and pupper). Floofer isn't a proper dog stage, since a dog at any of the other stages can be a floofer too.
- Compute the average retweets, likes and ratings by dog stage.
- Remove records with two stages in one tweet (e.g. a doggo and a pupper), since it is hard to tell which dog users engaged with.
- Melt the resulting dataframe for ease of plotting with Plotly.
- Visualize the results.
# First estimate the fraction of total records where dog stage was mentioned
stage_counts = master_df.dog_stage.value_counts()
print(Color.green + Color.bold +
'Only {} records mentioned the stage of the dog'.format(stage_counts.drop('None').sum())+
Color.end)
Only 302 records mentioned the stage of the dog
# Compute the mean retweets, likes and ratings for each dog stage
stage_aggregates = master_df.groupby('dog_stage')[['retweets', 'likes', 'rating']].mean()
# Remove records with None, Floofer and Doggo with Pupper
stage_aggregates.drop(index = ['None', 'Floofer', 'Doggo with Pupper'], inplace=True)
# Melt the dataframe for plotting ease
stage_aggregates = stage_aggregates.reset_index().melt(id_vars='dog_stage', var_name='criteria', value_name='mean')
stage_aggregates
dog_stage | criteria | mean | |
---|---|---|---|
0 | Doggo | retweets | 5901.269841 |
1 | Pupper | retweets | 1930.334975 |
2 | Puppo | retweets | 5703.666667 |
3 | Doggo | likes | 17403.317460 |
4 | Pupper | likes | 6282.990148 |
5 | Puppo | likes | 20418.166667 |
6 | Doggo | rating | 11.761905 |
7 | Pupper | rating | 10.656502 |
8 | Puppo | rating | 12.083333 |
# --- Visualize results ---
# Create main plot area
fig = px.bar(stage_aggregates, y='dog_stage', x = 'mean', orientation='h', facet_col='criteria', template='plotly_white',
height=300, width=1200, facet_col_spacing=0.04)
# Customize traces and annotations
fig.update_traces(width=0.7, marker_color=['grey', 'grey', '#DC2912'], opacity=0.7)
fig.for_each_annotation(lambda a: a.update(text= a.text.split("=")[-1].title() + ' on average'))
# Update axes and plot layout
fig.update_xaxes(matches=None, showline=True, linewidth=1, linecolor='grey', mirror=True,
titlefont_size=12, tickfont_color='grey')
fig.update_yaxes(ticksuffix=' ', showline=True, linewidth=1, linecolor='grey', mirror=True)
fig.update_layout(yaxis_title='Dog stage', xaxis_title='', xaxis2_title='', xaxis3_title='',
paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)',
title='People Love and Rate Puppos; Doggos Gather Retweets!<br>'+
'<sup>User retweets, likes and ratings compared across various dog stages.</sup>',
title_x=0.5, font_family='Arial',
margin=dict(t=100, b=70))
# Store graph locally
fig.write_image('images/fig3.svg')
fig.show('svg')
- Puppos seem to be the people's favorite, leading in average likes (over 20,000) and ratings (12.1). Doggo tweets also show good engagement in terms of likes (about 17,400) and ratings (11.8).
- Doggos enjoy marginally more retweets (about 5,900 on average) than Puppos (about 5,700).
- Puppers gather considerably the fewest retweets, likes and ratings.
- Identify the number of records that had hashtags included in the tweet. This helps us understand whether we have enough data to draw conclusions.
- Compute the average retweets, likes and ratings by hashtag use.
# Select only relevant columns from the master dataframe
hashtag_df = master_df[['hashtag', 'likes', 'retweets', 'rating']].copy()
# Create a new column to show if hashtags are present in each record
hashtag_df['has_hashtag'] = hashtag_df.hashtag.apply(lambda x: x!='None')
# Print information about how many records include hashtags
print(Color.green+'Printing the number of records based on hashtag use...'+Color.end)
print(hashtag_df.has_hashtag.value_counts(), '\n')
print(Color.green+'Printing the percentage of records based on hashtag use...'+Color.end)
(hashtag_df.has_hashtag.value_counts(normalize=True).round(3)*100).astype(str)+'%'
Printing the number of records based on hashtag use... False 1937 True 24 Name: has_hashtag, dtype: int64 Printing the percentage of records based on hashtag use...
False 98.8% True 1.2% Name: has_hashtag, dtype: object
# Aggregate ratings, retweets and likes by hashtag use
hashtag_df.groupby('has_hashtag')[['retweets', 'likes', 'rating']].mean().astype(int)
retweets | likes | rating | |
---|---|---|---|
has_hashtag | |||
False | 2215 | 7600 | 10 |
True | 5655 | 20709 | 12 |
- Only a very small share of records, 1.2% of the total tweets, actually used hashtags.
- Ratings, retweets and likes seem higher on average when hashtags are used. However, we cannot confidently draw this conclusion, considering that there are far fewer records for tweets with hashtags.
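One way to probe that caveat (a sketch, assuming scipy is available; it is not imported elsewhere in this notebook) is a nonparametric test comparing likes for tweets with and without hashtags. With only 24 hashtag records, a non-significant result would simply reflect low power:
from scipy import stats
# Split likes by hashtag use and compare the two samples
with_tags = hashtag_df.loc[hashtag_df.has_hashtag, 'likes']
without_tags = hashtag_df.loc[~hashtag_df.has_hashtag, 'likes']
stat, p_value = stats.mannwhitneyu(with_tags, without_tags, alternative='two-sided')
print('Mann-Whitney U p-value:', p_value)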
A better alternative to this question is to try to understand the hashtags that generated the highest user engagements (retweets and likes) when used.
- Isolate only records where hashtags are used.
- Aggregate retweets and likes based on each unique hashtag.
- Visualize the results.
# Isolate posts with hashtags
hashtag_present = hashtag_df.query('has_hashtag == True')
# Aggregate ratings, retweets and likes by each unique hashtag
hashtag_aggregates = hashtag_present.groupby('hashtag')[['retweets', 'likes']].mean().astype(int)
# --- Visualize results ---
# Create plot area
fig = px.scatter(hashtag_aggregates, y=hashtag_aggregates.index, x='retweets', size='likes',
text=hashtag_aggregates.index+'<br><sub>'+'Retweets: '+
hashtag_aggregates.retweets.astype(str)+', Likes: '+hashtag_aggregates.likes.astype(str)+'</sub>',
title='Which Hashtag generated the highest user engagement?<br>'+
'<sup>Increasing retweets from left to right. Likes are represented by dot size.</sup>')
# Format traces and axes
fig.update_traces(textposition='bottom center', opacity=1, marker_line_color='black', marker_line_width=1)
fig.update_yaxes(showticklabels=False, gridcolor='#ddd')
fig.update_xaxes(showticklabels=False, gridcolor='#ddd')
# Update layout and annotations
fig.update_layout(height=850, width=1400, template='plotly_white', xaxis_title='', yaxis_title='',
paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)',
font_family='Arial', font_size=14, margin=dict(t=70, b=70), title_x=0.5)
fig.add_hline(y=13.2, annotation=dict(text=' Increasing average retweets ->'), annotation_position='top right')
fig.add_hline(y=-1, annotation=dict(text=' Increasing average retweets ->'), annotation_position='top right')
# Store graph locally
fig.write_image('images/fig4.svg')
fig.show('svg')
- #WomensMarch and #ScienceMarch gathered the highest share of interactions. On the other hand, hashtags such as #swole and #NoDaysOff gained the fewest interactions.
- The two leading hashtags are tied to widespread events, rallies (#ScienceMarch) and protests (#WomensMarch) held worldwide. This could explain the high number of interactions recorded with their use.
- Make a copy of the master dataframe and set the timestamp as the new dataframe index.
- Resample the timestamps by month, counting tweets and averaging retweets and likes in the process.
- Visualize the results.
# Make a copy of the master dataframe, setting timestamp as the index
master_df_copy = master_df.set_index('timestamp')
# For each item to investigate, resample the dataframe by month
tweet_count = master_df_copy.tweet_id.resample('1m').count()
retweets= master_df_copy.retweets.resample('1m').mean()
likes = master_df_copy.likes.resample('1m').mean()
# Initialize figure with subplots
fig = make_subplots(rows=1, cols=2, horizontal_spacing=0.12,
specs=[[{"type": "scatter"}, {"secondary_y": True}]])
# Add trace for tweet count
fig.add_trace(
go.Scatter(x=tweet_count.index, y=tweet_count.values, name='count of tweets',
marker_color='IndianRed'),row=1, col=1
)
# Add traces for retweets and likes
# Retweets
fig.add_trace(
go.Scatter(x=retweets.index, y=retweets.values, name="retweets", marker_color='MediumSlateBlue'),
secondary_y=False, row=1,col=2
)
# Likes
fig.add_trace(
go.Scatter(x=likes.index, y=likes.values, name="likes", marker_color='green'),
secondary_y=True, row=1, col=2
)
# Update traces and axes properties
fig.update_traces(line_width=3, opacity=0.8)
fig.update_yaxes(title_text="Average<b> retweets</b>", secondary_y=False, gridcolor='#ddd',
tickfont_color='grey', row=1, col=2)
fig.update_yaxes(title_text="Average <b>likes</b>", secondary_y=True, showgrid=False,
tickfont_color='grey', row=1, col=2)
fig.update_yaxes(title_text="<b>Tweet</b> count", gridcolor='#ddd', tickfont_color='grey', row=1, col=1)
fig.update_xaxes(title_text="Timestamp", gridcolor='#ddd', linecolor='black', tickfont_color='grey')
# Update layout and annotations
fig.update_layout(height=500, width=1200, template='plotly_white', showlegend=False, font_family='Arial',
paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)',
title='How have Tweet count, Likes and Retweets varied over the time period?<br>'+
'<sup>Trends in original tweets, retweets and likes compared between Nov 2015 and Jul 2017.</sup>')
fig.add_annotation(x='2016-09', y=3200, text='<b>Retweets</b>', showarrow=False,
textangle=-45, font_color='MediumSlateBlue', row=1, col=2)
fig.add_annotation(x='2016-12', y=2400, text='<b>Likes</b>', showarrow=False,
textangle=-45, font_color='green', row=1, col=2)
fig.add_annotation(x='2016-12', y=80, text='<b>Original tweet count</b>', showarrow=False,
textangle=0, font_color='IndianRed', row=1, col=1)
# Store graph locally
fig.write_image('images/fig5.svg')
fig.show('svg')
- The number of original tweets posted on the account has been declining overall. Taken alone, this could lead one to believe that the account was gradually becoming less successful over time.
- The rising trend in retweets and likes, however, tells a different story: although the number of tweets has been declining, the account has been gaining more and more user interaction, moving from under 1,000 average retweets and 5,000 likes per month in late 2015 to over 6,000 retweets and 30,000 likes by mid-2017.
- This could be because, in its early stages, an account may need to create more tweets to gain popularity. As time progresses and people become familiar with the account, they start to like and retweet content for others to see. This can create a cycle of success, gradually reducing the number of posts needed to drive engagement.
- There is also an interesting pattern in retweets and likes: they appear to fluctuate in the same direction (when retweets increase, likes increase, and vice versa). We can investigate this further by examining the correlation between the two variables.
- Sample 1000 records from the dataframe, then evaluate the relationship between both variables using a scatter plot.
# Calculate the correlation coefficient for retweet-like relationship
correlation = master_df.retweets.corr(master_df.likes).round(2)
# Define a function that helps format scatterplots
def format_scatter(f):
    '''
    Updates a plotly scatter plot's axes, traces and layout with predefined formatting.
    Params:
        f (figure object): A plotly figure object (scatterplot)
    Output:
        None
    '''
    f.update_xaxes(tickfont_color='grey', gridcolor='#ddd')
    f.update_yaxes(tickfont_color='grey', gridcolor='#ddd')
    f.update_traces(marker_line_color='black', marker_line_width=1, marker_color='royalblue', opacity=0.7)
    f.update_layout(height=500, width=600, template='plotly_white', font_family='Arial',
                    paper_bgcolor='rgb(248, 248, 255)', plot_bgcolor='rgb(248, 248, 255)')
# Create a plotly scatter plot object
fig = px.scatter(master_df.sample(1000, random_state=1), x='retweets', y='likes', trendline='ols')
# Format the scatter plot with predefined function
format_scatter(fig)
# Update plot title
fig.update_layout(title='Is there a relationship between Retweets and Likes?<br>'+
'<sup>A plot of retweets and likes for all WeRateDogs original posts.</sup>')
# Add necessary annotations
fig.add_vrect(x0=0, x1=20000, y0=0.05, y1=0.30, opacity=0.6, line_width=2)
fig.add_annotation(x=44000, y=25000,
text='<b> A strong positive correlation masked by' +
'<br>the majority of tweets having under<br>20k retweets and 50k likes.</b>',
showarrow=False, textangle=0, font_color='royalblue')
fig.add_annotation(x=15000, y=95000, text='<b>'+'r='+str(correlation)+'</b>',
showarrow=False, font_color='royalblue')
# Store graph locally
fig.write_image('images/fig6.svg')
fig.show('svg')
- The association is hard to see for most points on the scatterplot above. Despite the strong positive correlation, outliers (tweets with a very high number of both retweets and likes) cause the points to be concentrated at the bottom left of the chart.
We can zoom in and examine this relationship better by plotting a scatterplot with both the x (retweets) and y (likes) axes on a log scale:
# Regenerate the scatterplot, this time taking the log values of both axes
fig = px.scatter(master_df.sample(1000, random_state=1), x='retweets', y='likes', log_x=True, log_y=True)
# Format the scatter plot with predefined function
format_scatter(fig)
# Update the plot layout and add necessary annotations
fig.update_layout(title='A clearer association between Retweets and Likes<br>'+
'<sup>The log plot zooms into the relationship between retweets and likes for WeRateDogs posts.</sup>',
xaxis_title= 'retweets (log scale)', yaxis_title='likes (log scale)')
fig.add_annotation(x=math.log10(70), y=math.log10(2000), text='<b>'+'r='+str(correlation)+'</b>',
showarrow=False, textangle=0, font_color='royalblue')
# Store graph locally
fig.write_image('images/fig7.svg')
fig.show('svg')
- Retweets and likes show a strong positive correlation, which is especially clear on a log scale.
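As an extra robustness check (a sketch, not part of the original analysis), a rank-based correlation is insensitive to the heavy right tail seen above and should tell the same story:
# Spearman correlation ranks the values first, taming the outliers
spearman = master_df.retweets.corr(master_df.likes, method='spearman')
print('Spearman correlation:', round(spearman, 2))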
- Isolate records for each unique dog stage from the master dataframe.
- Filter out Floofers (since this isn't an actual dog stage), Doggo with Pupper (it's hard to tell which one people liked in the tweet), and records where a dog stage wasn't specified.
- Create a leaderboard system that sorts dogs based on tweet engagements (retweets and likes), then ratings.
- Identify the leading dogs in each group and display their images as visual outputs.
# Loop through the three proper dog stages
# Do not consider 'None', 'Floofer' and 'Doggo with Pupper'
for stage in ['Doggo', 'Puppo', 'Pupper']:
    # Isolate each dog stage into its own dataframe
    stage_df = master_df.query("dog_stage == @stage")
    # Sort each tweet based on retweets >> likes >> ratings
    top_dog = stage_df.sort_values(by=['retweets', 'likes', 'rating'], ascending=False).head(1)
    # Pull full profile info for the most favored dog
    top_dog_image = top_dog.image.values[0]
    top_dog_name = top_dog.dog_name.values[0].replace('None', 'Wish we knew')
    top_dog_breed = top_dog.dog_breed.values[0]
    top_dog_retweets = str(top_dog.retweets.values[0])
    top_dog_likes = str(top_dog.likes.values[0])
    top_dog_rating = str(top_dog.rating.values[0])
    top_dog_text = top_dog.text.values[0]
    # Print dog profile as output
    print(Color.underline + Color.green +
          Color.bold + "People's favorite " + stage + Color.end)
    print(Color.blue + 'Name: ' + top_dog_name +
          '\nBreed: ' + top_dog_breed +
          '\nRetweets: ' + top_dog_retweets +
          '\nLikes: ' + top_dog_likes +
          '\nRating: ' + top_dog_rating + Color.end)
    print(Color.bold + 'Tweet: ' + top_dog_text + Color.end)
    # Pull dog image from the web then display it
    response = requests.get(top_dog_image)
    img = Image.open(BytesIO(response.content))
    display(img)
    # Write dog image locally
    folder = 'images'
    filename = 'favored_' + stage.replace(' ', '_') + '.jpg'
    img.save(os.path.join(folder, filename))
    # Output demarcator
    print('-' * 138)
People's favorite Doggo
Name: Wish we knew
Breed: Labrador Retriever
Retweets: 70826
Likes: 145013
Rating: 13.0
Tweet: Here's a doggo realizing you can stand in a pool. 13/10 enlightened af (vid by Tina Conrad)
------------------------------------------------------------------------------------------------------------------------------------------
People's favorite Puppo
Name: Wish we knew
Breed: Lakeland Terrier
Retweets: 39970
Likes: 124209
Rating: 13.0
Tweet: Here's a super supportive puppo participating in the Toronto #WomensMarch today. 13/10
------------------------------------------------------------------------------------------------------------------------------------------
People's favorite Pupper
Name: Jamesy
Breed: French Bulldog
Retweets: 30247
Likes: 108985
Rating: 13.0
Tweet: This is Jamesy. He gives a kiss to every other pupper he sees on his walk. 13/10 such passion, much tender
------------------------------------------------------------------------------------------------------------------------------------------
- The people's favorite Doggo is a Labrador Retriever swimming in a pool. We do not know its name, but it gathered 70,826 retweets, 145,013 likes, and a rating of 13.
- For the Puppos, it's a Lakeland Terrier. We couldn't get its name either, but this puppo participated in the Toronto #WomensMarch, earning 39,970 retweets, 124,209 likes, and a rating of 13 in the process.
- A French Bulldog named Jamesy won it all for the Puppers. People seem to love that he gives kisses to other dogs. He earned 30,247 retweets, 108,985 likes and a rating of 13 for being so tender.
Real-life data rarely comes clean. In the course of this project, WeRateDogs Twitter data was collected in fragments from different sources. Each piece of data was assessed for quality and tidiness, then cleaned. After wrangling, the datasets were combined into a single dataframe in preparation for further analysis.
Further analysis involved exploring the data and building visualizations. These explorations led to the following insights: