Abstract

This project is for wrangling data. There are three data sets. Two of them are given by Udacity (which are 'twitter-archive-enhanced.csv, image-predictions.tsv), one of them can be taken through 'twitter-API' which is about retweet counts and favorite counts. Those data sets are not cleaned yet. In other words, Those are dirty and messy. We should clean them.

Gather

In [1]:
import tweepy
import pandas as pd
import numpy as np
import requests
import json
from timeit import default_timer as timer
pd.set_option('display.max_rows', 7)

# change the CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET by yours
consumer_key = 'CONSUMER_KEY'
consumer_secret = 'CONSUMER_SECRET'
access_token = 'ACCESS_TOKEN'
access_secret = 'ACCESS_SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

Load the WeRateDogs twitter archive

In [2]:
df_twit_archive_raw = pd.read_csv('twitter-archive-enhanced.csv')
df_twit_archive_raw.head(1)
Out[2]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None

Load the tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network.

In [3]:
image_predictions_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
df_image_predic_raw = pd.read_csv(image_predictions_url, sep='\t')
df_image_predic_raw.head()
Out[3]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

Each tweet's retweet count and favorite ("like") count at minimum. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line.

In [4]:
'''tweet_ids = df_twit_archive.tweet_id.values

fails_dict = {}
start = timer()

# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        try:
            api_status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(api_status._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
        
end = timer()
print(end - start)
print(fails_dict)'''
Out[4]:
'tweet_ids = df_twit_archive.tweet_id.values\n\nfails_dict = {}\nstart = timer()\n\n# Save each tweet\'s returned JSON as a new line in a .txt file\nwith open(\'tweet_json.txt\', \'w\') as outfile:\n    # This loop will likely take 20-30 minutes to run because of Twitter\'s rate limit\n    for tweet_id in tweet_ids:\n        try:\n            api_status = api.get_status(tweet_id, tweet_mode=\'extended\')\n            json.dump(api_status._json, outfile)\n            outfile.write(\'\n\')\n        except tweepy.TweepError as e:\n            print("Fail")\n            fails_dict[tweet_id] = e\n            pass\n        \nend = timer()\nprint(end - start)\nprint(fails_dict)'
In [5]:
df_twit_add_raw = pd.read_json('tweet_json.txt', lines=True)

Assess

Before cleaning, I copied DataFrame to preserve original DataFrame. I solved eight Quality issues first and then tidienss issues. Some of them have been made through visual assessment by Excel, others have been made through programmatic assessment. I've made some useful function in order to clean DataFrame effectively. I could find all of library's function through documents and stackoverflow. Finally, I exported the cleaned DataFrame to a csv file

copy the DataFrame

In [6]:
df_twit_archive = df_twit_archive_raw.copy()
In [7]:
df_twit_archive.head(1)
Out[7]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
In [8]:
df_image_predic = df_image_predic_raw.copy()
In [9]:
df_image_predic.head(3)
Out[9]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
In [10]:
df_twit_add = df_twit_add_raw.copy()
In [11]:
df_twit_add.head(1)
Out[11]:
contributors coordinates created_at display_text_range entities extended_entities favorite_count favorited full_text geo ... quoted_status quoted_status_id quoted_status_id_str quoted_status_permalink retweet_count retweeted retweeted_status source truncated user
0 NaN NaN 2017-08-01 16:23:56 [0, 85] {'hashtags': [], 'symbols': [], 'user_mentions... {'media': [{'id': 892420639486877696, 'id_str'... 37511 False This is Phineas. He's a mystical boy. Only eve... NaN ... NaN NaN NaN NaN 8176 False NaN <a href="http://twitter.com/download/iphone" r... False {'id': 4196983835, 'id_str': '4196983835', 'na...

1 rows × 32 columns

In [12]:
df_twit_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [13]:
df_image_predic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [14]:
df_twit_add.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2336 entries, 0 to 2335
Data columns (total 32 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2336 non-null datetime64[ns]
display_text_range               2336 non-null object
entities                         2336 non-null object
extended_entities                2065 non-null object
favorite_count                   2336 non-null int64
favorited                        2336 non-null bool
full_text                        2336 non-null object
geo                              0 non-null float64
id                               2336 non-null int64
id_str                           2336 non-null int64
in_reply_to_screen_name          77 non-null object
in_reply_to_status_id            77 non-null float64
in_reply_to_status_id_str        77 non-null float64
in_reply_to_user_id              77 non-null float64
in_reply_to_user_id_str          77 non-null float64
is_quote_status                  2336 non-null bool
lang                             2336 non-null object
place                            1 non-null object
possibly_sensitive               2203 non-null float64
possibly_sensitive_appealable    2203 non-null float64
quoted_status                    24 non-null object
quoted_status_id                 26 non-null float64
quoted_status_id_str             26 non-null float64
quoted_status_permalink          26 non-null object
retweet_count                    2336 non-null int64
retweeted                        2336 non-null bool
retweeted_status                 165 non-null object
source                           2336 non-null object
truncated                        2336 non-null bool
user                             2336 non-null object
dtypes: bool(4), datetime64[ns](1), float64(11), int64(4), object(12)
memory usage: 520.2+ KB
In [15]:
df_twit_archive.name.value_counts()
Out[15]:
None       745
a           55
Charlie     12
          ... 
Charl        1
Chubbs       1
Gilbert      1
Name: name, Length: 957, dtype: int64
In [16]:
df_twit_archive.query("name =='a'").text
Out[16]:
56      Here is a pupper approaching maximum borkdrive...
649     Here is a perfect example of someone who has t...
801     Guys this is getting so out of hand. We only r...
                              ...                        
2352    This is a purebred Piers Morgan. Loves to Netf...
2353    Here is a very happy pup. Big fan of well-main...
2354    This is a western brown Mitsubishi terrier. Up...
Name: text, Length: 55, dtype: object
In [17]:
df_twit_archive.query('name in ["a", "an", "the", "such"]')
Out[17]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
22 887517139158093824 NaN NaN 2017-07-19 03:39:09 +0000 <a href="http://twitter.com/download/iphone" r... I've yet to rate a Venezuelan Hover Wiener. Th... NaN NaN NaN https://twitter.com/dog_rates/status/887517139... 14 10 such None None None None
56 881536004380872706 NaN NaN 2017-07-02 15:32:16 +0000 <a href="http://twitter.com/download/iphone" r... Here is a pupper approaching maximum borkdrive... NaN NaN NaN https://twitter.com/dog_rates/status/881536004... 14 10 a None None pupper None
649 792913359805018113 NaN NaN 2016-10-31 02:17:31 +0000 <a href="http://twitter.com/download/iphone" r... Here is a perfect example of someone who has t... NaN NaN NaN https://twitter.com/dog_rates/status/792913359... 13 10 a None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2352 666044226329800704 NaN NaN 2015-11-16 00:04:52 +0000 <a href="http://twitter.com/download/iphone" r... This is a purebred Piers Morgan. Loves to Netf... NaN NaN NaN https://twitter.com/dog_rates/status/666044226... 6 10 a None None None None
2353 666033412701032449 NaN NaN 2015-11-15 23:21:54 +0000 <a href="http://twitter.com/download/iphone" r... Here is a very happy pup. Big fan of well-main... NaN NaN NaN https://twitter.com/dog_rates/status/666033412... 9 10 a None None None None
2354 666029285002620928 NaN NaN 2015-11-15 23:05:30 +0000 <a href="http://twitter.com/download/iphone" r... This is a western brown Mitsubishi terrier. Up... NaN NaN NaN https://twitter.com/dog_rates/status/666029285... 7 10 a None None None None

71 rows × 17 columns

Quality Issues

df_twit_archive table

  • Null value in expanded_urls
  • There are useless rows that have non-empty retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp
  • [in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp] are useless columns
  • The type of timestamp is object
  • There are rows whose name == a, an, the, such
  • The rating_numerator is represented by an integer whose rating is a decimal point.
  • rating_denominator is not 10
  • rating_denominator and rating_numerator can be combined in one column
  • expanded_url has rows with multiple URLs

Tidiness Issues

df_twit_archive table

  • 1 variable in 4 columns in df_twit_archive table (doggo, floofer, pupper, puppo)

df_twit_add table

  • Attach favorite_count, retweet_count of df_twit_add to df_twit_archive

df_image_predic table

  • Attach p1, p1_conf, p1_dog of df_image_predic to df_twit_archive

Clean

Quality Issues

df_twit_archive: Null value in expanded_urls

Define

Drop the rows with expanded_urls is Null

Cdoe

In [18]:
df_twit_archive.dropna(subset=['tweet_id','expanded_urls'], inplace=True)
df_twit_archive.reset_index(drop=True, inplace=True)

Test

In [19]:
df_twit_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2297 entries, 0 to 2296
Data columns (total 17 columns):
tweet_id                      2297 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2297 non-null object
source                        2297 non-null object
text                          2297 non-null object
retweeted_status_id           180 non-null float64
retweeted_status_user_id      180 non-null float64
retweeted_status_timestamp    180 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2297 non-null int64
rating_denominator            2297 non-null int64
name                          2297 non-null object
doggo                         2297 non-null object
floofer                       2297 non-null object
pupper                        2297 non-null object
puppo                         2297 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 305.1+ KB

df_twit_archive: There are useless rows that have non-empty retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp

Define

Remove rows that have non-empty retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp from df_twit_archive

Code

In [20]:
retweeted_index = df_twit_archive.query('retweeted_status_id != "NaN"').index
retweeted_index
Out[20]:
Int64Index([  19,   31,   35,   65,   70,   71,   75,   88,   92,   94,
            ...
             894,  905,  911,  917,  978,  989, 1009, 1207, 2201, 2202],
           dtype='int64', length=180)
In [21]:
df_twit_archive.drop(retweeted_index, axis=0, inplace=True)
df_twit_archive.reset_index(drop=True, inplace=True)

Test

In [22]:
df_twit_archive.query('retweeted_status_id != "NaN"')
Out[22]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo

df_twit_archive: [in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp] are useless columns

Define

Delete useless [in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp] columns

In [23]:
df_twit_archive.query('retweeted_status_id != "NaN"')
Out[23]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo

Code

In [24]:
df_twit_archive.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1, inplace=True)

Test

In [25]:
df_twit_archive.head(1)
Out[25]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None

df_twit_archive: The type of timestamp is object

Define

Change type of timestamp to datatime64

code

In [26]:
df_twit_archive.timestamp = df_twit_archive.timestamp.astype('datetime64')

Test

In [27]:
df_twit_archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2117 entries, 0 to 2116
Data columns (total 12 columns):
tweet_id              2117 non-null int64
timestamp             2117 non-null datetime64[ns]
source                2117 non-null object
text                  2117 non-null object
expanded_urls         2117 non-null object
rating_numerator      2117 non-null int64
rating_denominator    2117 non-null int64
name                  2117 non-null object
doggo                 2117 non-null object
floofer               2117 non-null object
pupper                2117 non-null object
puppo                 2117 non-null object
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 198.5+ KB

df_twit_archive: There are rows whose name == a, an, the, such

Define

First, find the text of rows which name is 'a, an, the, such'.

If text has the word 'named', replace the name with the word immediately after it. Otherwise, replace the name with None.

code

In [28]:
df_twit_wrong_name = df_twit_archive.query('name in ["a", "an", "the", "such"]')
df_twit_wrong_name
Out[28]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
21 887517139158093824 2017-07-19 03:39:09 <a href="http://twitter.com/download/iphone" r... I've yet to rate a Venezuelan Hover Wiener. Th... https://twitter.com/dog_rates/status/887517139... 14 10 such None None None None
51 881536004380872706 2017-07-02 15:32:16 <a href="http://twitter.com/download/iphone" r... Here is a pupper approaching maximum borkdrive... https://twitter.com/dog_rates/status/881536004... 14 10 a None None pupper None
497 792913359805018113 2016-10-31 02:17:31 <a href="http://twitter.com/download/iphone" r... Here is a perfect example of someone who has t... https://twitter.com/dog_rates/status/792913359... 13 10 a None None None None
... ... ... ... ... ... ... ... ... ... ... ... ...
2113 666044226329800704 2015-11-16 00:04:52 <a href="http://twitter.com/download/iphone" r... This is a purebred Piers Morgan. Loves to Netf... https://twitter.com/dog_rates/status/666044226... 6 10 a None None None None
2114 666033412701032449 2015-11-15 23:21:54 <a href="http://twitter.com/download/iphone" r... Here is a very happy pup. Big fan of well-main... https://twitter.com/dog_rates/status/666033412... 9 10 a None None None None
2115 666029285002620928 2015-11-15 23:05:30 <a href="http://twitter.com/download/iphone" r... This is a western brown Mitsubishi terrier. Up... https://twitter.com/dog_rates/status/666029285... 7 10 a None None None None

70 rows × 12 columns

In [29]:
def get_correct_name(text):
    split_list = text.split()
    if 'named' in split_list:
        name_index = split_list.index('named') + 1
        name = split_list[name_index]
        return name
    else:
        return None    
In [30]:
df_twit_wrong_name['correct_name'] = df_twit_wrong_name.apply(lambda x: get_correct_name(x['text']), axis=1)
C:\Users\weroo\Anaconda\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [31]:
df_twit_wrong_name
Out[31]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo correct_name
21 887517139158093824 2017-07-19 03:39:09 <a href="http://twitter.com/download/iphone" r... I've yet to rate a Venezuelan Hover Wiener. Th... https://twitter.com/dog_rates/status/887517139... 14 10 such None None None None None
51 881536004380872706 2017-07-02 15:32:16 <a href="http://twitter.com/download/iphone" r... Here is a pupper approaching maximum borkdrive... https://twitter.com/dog_rates/status/881536004... 14 10 a None None pupper None None
497 792913359805018113 2016-10-31 02:17:31 <a href="http://twitter.com/download/iphone" r... Here is a perfect example of someone who has t... https://twitter.com/dog_rates/status/792913359... 13 10 a None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2113 666044226329800704 2015-11-16 00:04:52 <a href="http://twitter.com/download/iphone" r... This is a purebred Piers Morgan. Loves to Netf... https://twitter.com/dog_rates/status/666044226... 6 10 a None None None None None
2114 666033412701032449 2015-11-15 23:21:54 <a href="http://twitter.com/download/iphone" r... Here is a very happy pup. Big fan of well-main... https://twitter.com/dog_rates/status/666033412... 9 10 a None None None None None
2115 666029285002620928 2015-11-15 23:05:30 <a href="http://twitter.com/download/iphone" r... This is a western brown Mitsubishi terrier. Up... https://twitter.com/dog_rates/status/666029285... 7 10 a None None None None None

70 rows × 13 columns

In [32]:
df_twit_archive.head(1)
Out[32]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 2017-08-01 16:23:56 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
In [33]:
correct_name_dic = {}
for index, row in df_twit_wrong_name.iterrows():
    correct_name_dic[row.tweet_id] = row.correct_name
correct_name_dic
Out[33]:
{887517139158093824: None,
 881536004380872706: None,
 792913359805018113: None,
 772581559778025472: None,
 747885874273214464: None,
 747816857231626240: None,
 746872823977771008: None,
 746369468511756288: None,
 743222593470234624: None,
 717537687239008257: None,
 715733265223708672: None,
 704859558691414016: None,
 704054845121142784: None,
 703079050210877440: None,
 703041949650034688: None,
 702539513671897089: None,
 700864154249383937: None,
 692187005137076224: None,
 690360449368465409: None,
 685943807276412928: None,
 679530280114372609: None,
 677644091929329666: None,
 677269281705472000: None,
 676613908052996102: None,
 675706639471788032: 'Wylie.',
 675534494439489536: None,
 675109292475830276: None,
 675047298674663426: None,
 674082852460433408: None,
 673715861853720576: None,
 673636718965334016: 'Kip.',
 672604026190569472: None,
 671743150407421952: 'Jacob',
 671561002136281088: None,
 671147085991960577: 'Rufus.',
 670427002554466305: 'Spork.',
 670361874861563904: 'Cherokee.',
 670303360680108032: 'Hemry.',
 669923323644657664: 'Alphred.',
 669661792646373376: None,
 669564461267722241: 'Alfredo.',
 668955713004314625: 'Leroi.',
 668815180734689280: None,
 668636665813057536: 'Berta.',
 668614819948453888: None,
 668587383441514497: None,
 668507509523615744: 'Chuk.',
 668466899341221888: None,
 668171859951755264: 'Alfonso.',
 667861340749471744: 'Cheryl.',
 667773195014021121: 'Jessiga.',
 667538891197542400: 'Klint.',
 667470559035432960: 'Kohl.',
 667177989038297088: None,
 666983947667116034: 'Pepe.',
 666781792255496192: 'Octaviath.',
 666701168228331520: 'Johm.',
 666407126856765440: None,
 666337882303524864: None,
 666293911632134144: None,
 666287406224695296: None,
 666063827256086533: None,
 666058600524156928: None,
 666057090499244032: None,
 666055525042405380: None,
 666051853826850816: None,
 666050758794694657: None,
 666044226329800704: None,
 666033412701032449: None,
 666029285002620928: None}
In [34]:
name_column_index = list(df_twit_archive.columns).index('name')
name_column_index
Out[34]:
7
In [35]:
name_column_index = list(df_twit_archive.columns).index('name')
for index, row in df_twit_archive.iterrows():
    if row['tweet_id'] in correct_name_dic.keys():
        df_twit_archive.iloc[index, name_column_index] = correct_name_dic[row.tweet_id]

Test

In [36]:
df_twit_archive.query('name in ["a", "an", "the", "such"]')
Out[36]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo

df_twit_archive: The rating_numerator is represented by an integer whose rating is a decimal point.

Define

In the text column of df_twit_archive table, extract the rating of 00.0 / 00 type using regex and modify rating_numerator type to 00.0

Code

In [37]:
series_with_regex = df_twit_archive.text.str.extract('((\d+\.)?\d+)\/(\d+)', expand=True)
In [38]:
series_with_regex.dropna()
Out[38]:
0 1 2
41 13.5 13. 10
531 9.75 9. 10
588 11.27 11. 10
1486 11.26 11. 10
In [39]:
rating_numerator_index = 5
df_twit_archive.iloc[40, rating_numerator_index] = 13.5
df_twit_archive.iloc[549, rating_numerator_index] = 9.75
df_twit_archive.iloc[605, rating_numerator_index] = 11.27
df_twit_archive.iloc[1429, rating_numerator_index] = 11.26

Test

In [40]:
df_twit_archive.iloc[40].rating_numerator, df_twit_archive.iloc[549].rating_numerator, df_twit_archive.iloc[605].rating_numerator, df_twit_archive.iloc[1429].rating_numerator
Out[40]:
(13.5, 9.75, 11.27, 11.26)

df_twit_archive: rating_denominator is not 10

Define

drop the rows with rating_denominator != 10

Code

In [41]:
index_of_not_having_10_denominator = df_twit_archive.query('rating_denominator != 10').index
df_twit_archive.drop(index_of_not_having_10_denominator, inplace=True)

Test

In [42]:
df_twit_archive.query('rating_denominator != 10')
Out[42]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo

df_twit_archive: rating_denominator and rating_numerator can be combined in one column

Define

Evaluate rating_numerator / rating_denominator and save it as rating, delete rating_numerator and rating_denominator

In [43]:
df_twit_archive['rating'] = df_twit_archive.rating_numerator / df_twit_archive.rating_denominator
In [44]:
df_twit_archive.drop(['rating_numerator', 'rating_denominator'], axis=1, inplace=True)

Test

In [45]:
df_twit_archive.head(1)
Out[45]:
tweet_id timestamp source text expanded_urls name doggo floofer pupper puppo rating
0 892420643555336193 2017-08-01 16:23:56 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... https://twitter.com/dog_rates/status/892420643... Phineas None None None None 1.3

df_twit_archive: expanded_url has rows with multiple URLs

Define

remove duplicated URLs from expanded_url column. For example, urls = 'a,a,a' then, extract 'a' only

Code

In [46]:
def delete_duplicated_urls(url):
    first_url = url.split(',')[0]
    return first_url  
In [47]:
df_twit_archive['correct_expanded_urls'] = df_twit_archive.apply(lambda x: delete_duplicated_urls(x['expanded_urls']), axis=1)
In [48]:
df_twit_archive.drop('expanded_urls', axis=1, inplace=True)

Test

In [49]:
df_twit_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2099 entries, 0 to 2116
Data columns (total 11 columns):
tweet_id                 2099 non-null int64
timestamp                2099 non-null datetime64[ns]
source                   2099 non-null object
text                     2099 non-null object
name                     2051 non-null object
doggo                    2099 non-null object
floofer                  2099 non-null object
pupper                   2099 non-null object
puppo                    2099 non-null object
rating                   2099 non-null float64
correct_expanded_urls    2099 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(8)
memory usage: 196.8+ KB
In [50]:
for index, row in df_twit_archive.iterrows():
    if len(row['correct_expanded_urls'].split(',')) > 1:
        print(row['correct_expanded_urls'])

**df_twit_archive: 1 variable in 4 columns in df_twit_archive table (doggo, floofer, pupper, puppo)

Define

Change doggo, floofer, pupper, puppo columns to kinds column

Code

In [51]:
df_twit_archive.groupby(["doggo", "floofer", "pupper", "puppo"]).size().reset_index().rename(columns={0: "count"})
Out[51]:
doggo floofer pupper puppo count
0 None None None None 1761
1 None None None puppo 23
2 None None pupper None 222
... ... ... ... ... ...
5 doggo None None puppo 1
6 doggo None pupper None 10
7 doggo floofer None None 1

8 rows × 5 columns

In [52]:
df_twit_archive.doggo.replace('None', '', inplace=True)
df_twit_archive.doggo.replace(np.NaN, '', inplace=True)
df_twit_archive.floofer.replace('None', '', inplace=True)
df_twit_archive.floofer.replace(np.NaN, '', inplace=True)
df_twit_archive.pupper.replace('None', '', inplace=True)
df_twit_archive.pupper.replace(np.NaN, '', inplace=True)
df_twit_archive.puppo.replace('None', '', inplace=True)
df_twit_archive.puppo.replace(np.NaN, '', inplace=True)
In [53]:
df_twit_archive['stage'] = df_twit_archive.doggo + df_twit_archive.floofer + df_twit_archive.pupper + df_twit_archive.puppo
df_twit_archive.loc[df_twit_archive.stage == 'doggopupper', 'stage'] = 'doggo,pupper'
df_twit_archive.loc[df_twit_archive.stage == 'doggopuppo', 'stage'] = 'doggo,puppo'
df_twit_archive.loc[df_twit_archive.stage == 'doggofloofer', 'stage'] = 'doggo,floofer'
In [54]:
df_twit_archive.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1, inplace=True)

Test

In [55]:
df_twit_archive.stage.value_counts()
Out[55]:
                 1761
pupper            222
doggo              72
                 ... 
floofer             9
doggo,puppo         1
doggo,floofer       1
Name: stage, Length: 8, dtype: int64
In [56]:
df_twit_archive
Out[56]:
tweet_id timestamp source text name rating correct_expanded_urls stage
0 892420643555336193 2017-08-01 16:23:56 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... Phineas 1.3 https://twitter.com/dog_rates/status/892420643...
1 892177421306343426 2017-08-01 00:17:27 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... Tilly 1.3 https://twitter.com/dog_rates/status/892177421...
2 891815181378084864 2017-07-31 00:18:03 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... Archie 1.2 https://twitter.com/dog_rates/status/891815181...
... ... ... ... ... ... ... ... ...
2114 666033412701032449 2015-11-15 23:21:54 <a href="http://twitter.com/download/iphone" r... Here is a very happy pup. Big fan of well-main... None 0.9 https://twitter.com/dog_rates/status/666033412...
2115 666029285002620928 2015-11-15 23:05:30 <a href="http://twitter.com/download/iphone" r... This is a western brown Mitsubishi terrier. Up... None 0.7 https://twitter.com/dog_rates/status/666029285...
2116 666020888022790149 2015-11-15 22:32:08 <a href="http://twitter.com/download/iphone" r... Here we have a Japanese Irish Setter. Lost eye... None 0.8 https://twitter.com/dog_rates/status/666020888...

2099 rows × 8 columns

df_twit_add: Attach favorite_count, retweet_count of df_twit_add to df_twit_archive

Define

Attach only favorite count and retwit count columns of df_twit_add to df_twit_archive based on tweet_id

Code

In [57]:
df_twit_add.rename(columns={'id': 'tweet_id'}, inplace=True)
In [58]:
df_twit_archive = df_twit_archive.merge(df_twit_add)

df_twit_add_columns = list(df_twit_add.columns)
df_twit_add_columns.remove('tweet_id')
df_twit_add_columns.remove('favorite_count')
df_twit_add_columns.remove('retweet_count')


df_twit_archive.drop(columns=df_twit_add_columns, inplace=True)

Test

In [59]:
df_twit_archive
Out[59]:
tweet_id timestamp text name rating correct_expanded_urls stage favorite_count retweet_count
0 892420643555336193 2017-08-01 16:23:56 This is Phineas. He's a mystical boy. Only eve... Phineas 1.3 https://twitter.com/dog_rates/status/892420643... 37511 8176
1 892177421306343426 2017-08-01 00:17:27 This is Tilly. She's just checking pup on you.... Tilly 1.3 https://twitter.com/dog_rates/status/892177421... 32256 6051
2 891815181378084864 2017-07-31 00:18:03 This is Archie. He is a rare Norwegian Pouncin... Archie 1.2 https://twitter.com/dog_rates/status/891815181... 24298 4003
... ... ... ... ... ... ... ... ... ...
2092 666033412701032449 2015-11-15 23:21:54 Here is a very happy pup. Big fan of well-main... None 0.9 https://twitter.com/dog_rates/status/666033412... 120 43
2093 666029285002620928 2015-11-15 23:05:30 This is a western brown Mitsubishi terrier. Up... None 0.7 https://twitter.com/dog_rates/status/666029285... 124 46
2094 666020888022790149 2015-11-15 22:32:08 Here we have a Japanese Irish Setter. Lost eye... None 0.8 https://twitter.com/dog_rates/status/666020888... 2503 493

2095 rows × 9 columns

df_image_predic: Attach p1, p1_conf, p1_dog of df_image_predic to df_twit_archive

Define

Attach df_image_predic to df_twit_archive (p1, p1_conf, p1_dog only)

Code

In [60]:
df_twit_archive = pd.merge(df_twit_archive, df_image_predic)
df_twit_archive
Out[60]:
tweet_id timestamp text name rating correct_expanded_urls stage favorite_count retweet_count jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 892420643555336193 2017-08-01 16:23:56 This is Phineas. He's a mystical boy. Only eve... Phineas 1.3 https://twitter.com/dog_rates/status/892420643... 37511 8176 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg 1 orange 0.097049 False bagel 0.085851 False banana 0.076110 False
1 892177421306343426 2017-08-01 00:17:27 This is Tilly. She's just checking pup on you.... Tilly 1.3 https://twitter.com/dog_rates/status/892177421... 32256 6051 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg 1 Chihuahua 0.323581 True Pekinese 0.090647 True papillon 0.068957 True
2 891815181378084864 2017-07-31 00:18:03 This is Archie. He is a rare Norwegian Pouncin... Archie 1.2 https://twitter.com/dog_rates/status/891815181... 24298 4003 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg 1 Chihuahua 0.716012 True malamute 0.078253 True kelpie 0.031379 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1969 666033412701032449 2015-11-15 23:21:54 Here is a very happy pup. Big fan of well-main... None 0.9 https://twitter.com/dog_rates/status/666033412... 120 43 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
1970 666029285002620928 2015-11-15 23:05:30 This is a western brown Mitsubishi terrier. Up... None 0.7 https://twitter.com/dog_rates/status/666029285... 124 46 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
1971 666020888022790149 2015-11-15 22:32:08 Here we have a Japanese Irish Setter. Lost eye... None 0.8 https://twitter.com/dog_rates/status/666020888... 2503 493 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True

1972 rows × 20 columns

In [61]:
df_image_predic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [62]:
df_twit_archive.drop(columns=['jpg_url', 'img_num', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'], inplace=True)

Test

In [63]:
df_twit_archive
Out[63]:
tweet_id timestamp text name rating correct_expanded_urls stage favorite_count retweet_count p1 p1_conf p1_dog
0 892420643555336193 2017-08-01 16:23:56 This is Phineas. He's a mystical boy. Only eve... Phineas 1.3 https://twitter.com/dog_rates/status/892420643... 37511 8176 orange 0.097049 False
1 892177421306343426 2017-08-01 00:17:27 This is Tilly. She's just checking pup on you.... Tilly 1.3 https://twitter.com/dog_rates/status/892177421... 32256 6051 Chihuahua 0.323581 True
2 891815181378084864 2017-07-31 00:18:03 This is Archie. He is a rare Norwegian Pouncin... Archie 1.2 https://twitter.com/dog_rates/status/891815181... 24298 4003 Chihuahua 0.716012 True
... ... ... ... ... ... ... ... ... ... ... ... ...
1969 666033412701032449 2015-11-15 23:21:54 Here is a very happy pup. Big fan of well-main... None 0.9 https://twitter.com/dog_rates/status/666033412... 120 43 German_shepherd 0.596461 True
1970 666029285002620928 2015-11-15 23:05:30 This is a western brown Mitsubishi terrier. Up... None 0.7 https://twitter.com/dog_rates/status/666029285... 124 46 redbone 0.506826 True
1971 666020888022790149 2015-11-15 22:32:08 Here we have a Japanese Irish Setter. Lost eye... None 0.8 https://twitter.com/dog_rates/status/666020888... 2503 493 Welsh_springer_spaniel 0.465074 True

1972 rows × 12 columns

Analysis

The largest number of breeds on WeRateDogs (with a confidence of more than 70% for accuracy and p1_dog being true)

In [64]:
df_believable_dog = df_twit_archive.query('p1_conf >= 0.7 and p1_dog == True')
series_breeds = df_believable_dog.p1.value_counts()
series_breeds[:15]
Out[64]:
golden_retriever            83
Pembroke                    51
Labrador_retriever          44
                            ..
cocker_spaniel              10
Shetland_sheepdog            8
Chesapeake_Bay_retriever     8
Name: p1, Length: 15, dtype: int64
In [65]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(12,5));
plt.bar(x=series_breeds[:15].index, height=series_breeds[:15]);
plt.xlabel('breeds')
plt.ylabel('counts')
plt.title('The largest number of breeds on WeRateDogs')
plt.xticks(rotation=90);

There are much more Golden retriever than other breeds. The next many breeds are Pembroke, Labrador retriever, and Chihuahua.

In [66]:
df_twit_archive.describe()
Out[66]:
tweet_id rating favorite_count retweet_count p1_conf
count 1.972000e+03 1972.000000 1972.000000 1972.000000 1972.000000
mean 7.359168e+17 1.169867 8595.159736 2600.624239 0.593051
std 6.758359e+16 4.089260 12622.476979 4649.741849 0.271938
... ... ... ... ... ...
50% 7.083530e+17 1.100000 3901.500000 1247.500000 0.587222
75% 7.880674e+17 1.200000 10729.250000 2982.750000 0.843469
max 8.924206e+17 177.600000 162025.000000 82343.000000 1.000000

8 rows × 5 columns

In [67]:
df_believable_famous_dog = df_twit_archive.query('p1_conf >= 0.7 and p1_dog == True and favorite_count > 10000')
series_famous_breeds = df_believable_famous_dog.p1.value_counts()
series_famous_breeds[:15]
Out[67]:
golden_retriever      37
Pembroke              21
Labrador_retriever    19
                      ..
Cardigan               3
vizsla                 3
miniature_pinscher     3
Name: p1, Length: 15, dtype: int64
In [68]:
plt.figure(figsize=(12,5));
plt.bar(x=series_famous_breeds[:15].index, height=series_famous_breeds[:15]);
plt.xlabel('breeds')
plt.ylabel('counts')
plt.title('The largest number of famous breeds on WeRateDogs')
plt.xticks(rotation=90);

In this case, there are much more Golden retriever than other breeds too. The next many breeds are Pembroke, Labrador retriever, and Chihuahua as like above

In [69]:
df_twit_archive.stage.value_counts()
Out[69]:
                 1666
pupper            203
doggo              63
                 ... 
floofer             7
doggo,puppo         1
doggo,floofer       1
Name: stage, Length: 8, dtype: int64

There are 1666 pupper, 203 doggo.

Store Dataset

combined and cleaned data to csv file

In [70]:
df_twit_archive.to_csv('df_twit_archive_master', index=True)