#!/usr/bin/env python
# coding: utf-8

# # Project: Wrangling and Analyze Data

# ## Data Gathering
# In the cell below, gather **all** three pieces of data for this project and load them in the notebook.
# **Note:** the methods required to gather each piece of data are different.
# 1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

# In[3]:

# importing the necessary libraries
import numpy as np
import pandas as pd
import requests
import os
import tweepy
import json
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# In[4]:

# reading and checking the csv file
df1 = pd.read_csv('twitter-archive-enhanced.csv')


# 2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

# In[5]:

# creating a folder to hold the image_predictions file
folder = 'twitter'
if not os.path.exists(folder):
    os.makedirs(folder)


# In[6]:

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)


# In[7]:

# writing the downloaded content to a file so we can access it
with open(os.path.join(folder, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)


# In[8]:

# accessing and viewing the tsv file from the folder it was saved to
df2 = pd.read_csv(os.path.join(folder, 'image-predictions.tsv'), sep='\t')


# 3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)
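# The retweet and favorite counts in tweet-json.txt were queried from the Twitter API in advance. A minimal sketch of how that query could look with Tweepy (v3-style API) is left commented out below so the notebook still runs without credentials; the credential strings are placeholders, not real keys.

# In[ ]:

# Sketch only (assumptions: Tweepy v3-style API, placeholder credentials).
# auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
# auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
# api = tweepy.API(auth, wait_on_rate_limit=True)
#
# with open('tweet-json.txt', 'w') as outfile:
#     for tweet_id in df1.tweet_id:
#         try:
#             status = api.get_status(tweet_id, tweet_mode='extended')
#             json.dump(status._json, outfile)
#             outfile.write('\n')
#         except tweepy.TweepError:
#             pass  # skip deleted, protected, or otherwise unavailable tweets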
# In[9]:

# loading the json file into a dataframe
new_list = []  # an empty list
with open('tweet-json.txt') as file:
    for tweet in file:
        data = json.loads(tweet)
        tweet_id = data['id']
        retweet_count = data['retweet_count']
        favorite_count = data['favorite_count']
        new_list.append({"tweet_id": tweet_id,
                         "retweet_count": int(retweet_count),
                         "favourite_count": favorite_count})

df3 = pd.DataFrame(new_list, columns=['tweet_id', 'retweet_count', 'favourite_count'])


# ## Assessing Data
# In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.
#
# **Note:** pay attention to the following key points when you assess the data.
#
# * You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

# In[10]:

# viewing the twitter-archive-enhanced dataframe
df1.head()


# In[11]:

# viewing the image-predictions dataframe
df2


# In[12]:

# checking the tweet-json dataframe
df3


# In[13]:

# checking the information of df1
df1.info()


# In[14]:

# checking the information of df2
df2.info()


# In[15]:

# checking the information of df3
df3.info()


# In[16]:

# checking the descriptive statistics of df2
df2.describe()


# In[17]:

# checking for duplicated columns across the 3 dataframes
columns = pd.Series(list(df1) + list(df2) + list(df3))
columns[columns.duplicated()]


# In[18]:

# checking which rows in df1 are retweets
df1[df1.retweeted_status_id.notnull()]


# In[19]:

# checking for duplicated rows in df1
sum(df1.duplicated())


# In[20]:

# checking for duplicated rows in df2
sum(df2.duplicated())


# In[21]:

# checking for duplicated rows in df3
sum(df3.duplicated())


# In[22]:

# checking the number of missing values in each column of df1
df1.isnull().sum()


# In[23]:

# checking the number of missing values in each column of df2
df2.isnull().sum()


# In[24]:

# checking the number of missing values in each column of df3
df3.isnull().sum()


# ### Quality issues
# #### df1 (twitter-archive)
# - Erroneous datatypes in some columns.
# - Some values in the name column are in lowercase and appear not to be dog names.
# - Some tweet ids have values in the *retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp* columns, meaning they are retweets.
# - Missing data in the *in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls* columns.
#
# #### df2 (image-prediction)
# - Erroneous datatype in the tweet_id column.
# - Outliers in the *p2_conf* and *p3_conf* columns.
# - Non-descriptive column names, i.e. *p1, p2, p3, p1_conf*, etc.
# - Inconsistent alphabet case in *p1, p2, p3*.
#
# #### df3 (tweet-json)
# - Erroneous datatype in the *tweet_id* column.

# ### Tidiness issues
# - Dog stages (doggo, pupper, puppo, and floofer) should be in one column.
# - *tweet_id* in the twitter-archive dataframe is duplicated in the other two dataframes.
# - The three dataframes should be merged into one.

# ## Cleaning Data

# In[25]:

# make copies of the original pieces of data
df1_clean = df1.copy()
df2_clean = df2.copy()
df3_clean = df3.copy()


# ## Quality
# ### df1
# ### Issue #1: Erroneous datatypes in *tweet_id, timestamp, retweeted_status_timestamp*

# #### Define: Change *tweet_id* from *int* to *string*, and the two timestamp columns to *datetime*.

# #### Code

# In[26]:

df1_clean['tweet_id'] = df1_clean['tweet_id'].astype(str)


# In[27]:

date = ['timestamp', 'retweeted_status_timestamp']
for value in date:
    df1_clean[value] = df1_clean[value].apply(pd.to_datetime)


# #### Test

# In[28]:

df1_clean.info()


# ### Issue #2: Some values in the name column are in lowercase and appear not to be dog names

# #### Define: Replace the lowercase values with 'None'.

# #### Code

# In[29]:

# collect the lowercase values, which are words like 'a' and 'the' rather than names
lowercase_names = []
for name in df1_clean.name:
    if name.islower() and name not in lowercase_names:
        lowercase_names.append(name)
print(lowercase_names)


# In[30]:

# replace these values with 'None', since the column already uses 'None' for missing names
df1_clean['name'].replace(lowercase_names, 'None', inplace=True)


# #### Test

# In[31]:

df1_clean.name.sample(20)
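# As an extra programmatic test beyond sampling, we can count how many lowercase values remain; this should print 0 after the replacement.

# In[ ]:

# no lowercase pseudo-names should be left in the name column
print(df1_clean.name.str.islower().sum())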
# ### Issue #3: Some tweet ids have values in the retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp columns

# #### Define: Remove these rows, since we need original tweets and not retweets.

# #### Code

# In[32]:

# let's check how many non-null values are in these columns
print(df1_clean.retweeted_status_id.notnull().sum())
print(df1_clean.retweeted_status_user_id.notnull().sum())
print(df1_clean.retweeted_status_timestamp.notnull().sum())


# In[33]:

# keeping only the rows that are not retweets
df1_clean = df1_clean[df1_clean.retweeted_status_id.isnull()]


# #### Test

# In[34]:

# cross-check that no retweets remain (all three counts should be 0)
print(df1_clean.retweeted_status_id.notnull().sum())
print(df1_clean.retweeted_status_user_id.notnull().sum())
print(df1_clean.retweeted_status_timestamp.notnull().sum())


# ### Issue #4: Missing data in *in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls*

# #### Define: Drop these columns, as the amount of missing data is enormous and the columns won't be used for further analysis.

# #### Code

# In[35]:

# group all the columns into a list and drop them
column = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id',
          'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls']
df1_clean.drop(column, axis=1, inplace=True)


# #### Test

# In[36]:

df1_clean.isnull().sum()


# ### df2
# ### Issue #5: Erroneous datatype in the tweet_id column

# #### Define: Change the datatype from int to string.

# #### Code

# In[37]:

# changing the datatype to string using .astype()
df2_clean['tweet_id'] = df2_clean.tweet_id.astype(str)


# #### Test

# In[38]:

df2_clean.info()


# ### Issue #6: Outliers in the *p2_conf* and *p3_conf* columns

# #### Define: Drop the columns, as they will not be used in further analysis.

# #### Code

# In[39]:

conf_columns = ['p2_conf', 'p3_conf']
df2_clean.drop(conf_columns, axis=1, inplace=True)


# #### Test

# In[40]:

df2_clean.info()


# ### Issue #7: Non-descriptive column names, i.e. p1, p2, p3, p1_conf, etc.

# #### Define: Rename the columns with the following descriptive names: prediction1, p1_confidence, p1_asdog, prediction2, p2_asdog, prediction3, p3_asdog.

# #### Code

# In[41]:

df2_clean = df2_clean.rename(columns={'p1': 'prediction1', 'p1_conf': 'p1_confidence',
                                      'p1_dog': 'p1_asdog', 'p2': 'prediction2',
                                      'p2_dog': 'p2_asdog', 'p3': 'prediction3',
                                      'p3_dog': 'p3_asdog'})


# #### Test

# In[42]:

df2_clean.head(1)


# ### Issue #8: Inconsistent alphabet case in *p1, p2, p3*

# #### Define: Change the values in the three prediction columns to lowercase.

# #### Code

# In[43]:

df2_clean['prediction1'] = df2_clean.prediction1.str.lower()
df2_clean['prediction2'] = df2_clean.prediction2.str.lower()
df2_clean['prediction3'] = df2_clean.prediction3.str.lower()


# #### Test

# In[44]:

df2_clean.prediction1.sample(20)


# ### df3
# ### Issue #9: Erroneous datatype in the tweet_id column

# #### Define: Change the datatype from int to string using the astype() method.

# #### Code

# In[45]:

df3_clean['tweet_id'] = df3_clean.tweet_id.astype(str)


# #### Test

# In[46]:

df3_clean.info()


# ### Tidiness
# ### Issue #1: Dog stages (doggo, pupper, puppo, and floofer) should be in one column

# #### Define: Using pd.melt, convert the four dog-stage columns into one column. A toy illustration of the reshape is sketched below.
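# To make the reshape concrete before applying it, here is a minimal sketch on a made-up two-row frame (the ids and stage values are invented for illustration only):

# In[ ]:

# toy illustration of the wide-to-long reshape used in the next cell (made-up data)
toy = pd.DataFrame({'tweet_id': ['1', '2'],
                    'doggo': ['doggo', 'None'],
                    'pupper': ['None', 'pupper']})
# each tweet_id now appears once per melted stage column, with the stage value
# (or 'None') collected into the single dog_stages column
pd.melt(toy, id_vars=['tweet_id'], var_name='types', value_name='dog_stages')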
# #### Code

# In[47]:

# using the pd.melt method
df1_clean = pd.melt(df1_clean,
                    id_vars=['tweet_id', 'timestamp', 'source', 'text',
                             'rating_numerator', 'rating_denominator', 'name'],
                    var_name='types', value_name='dog_stages')


# In[48]:

# removing the types column, since it just repeats the names of the melted stage
# columns, and dropping the duplicate rows the melt created
df1_clean.drop(columns='types', inplace=True)
df1_clean.drop_duplicates(inplace=True)
# replacing the 'None' values with NaN
df1_clean['dog_stages'] = df1_clean.dog_stages.replace('None', np.nan)


# #### Test

# In[49]:

df1_clean['dog_stages'].value_counts()


# ### Issue #2: *tweet_id* in the twitter-archive (df1) dataframe is duplicated in the other two dataframes

# #### Define: Merge the favorite count and retweet count into the twitter-archive table, joining on tweet_id.

# #### Code

# In[50]:

df1_clean = pd.merge(df1_clean, df3_clean, on='tweet_id', how='left')


# #### Test

# In[51]:

df1_clean.info()


# ### Issue #3: Merging the dataframes into one

# #### Define: Before merging in the image-prediction table, drop some of its columns, as they will not be used for further analysis.

# #### Code

# In[52]:

# dropping the prediction2, p2_asdog, prediction3, p3_asdog columns
unused_columns = ['prediction2', 'p2_asdog', 'prediction3', 'p3_asdog']
df2_clean.drop(unused_columns, axis=1, inplace=True)


# In[53]:

# then we merge the image-prediction dataframe with the twitter-archive on tweet_id
df1_clean = pd.merge(df1_clean, df2_clean, on='tweet_id', how='left')


# #### Test

# In[54]:

df1_clean.info()


# - The merge changed the datatype of the *img_num* column to float, so we need to change it back to integer. First we convert all the NaN values to 0 to avoid errors.

# In[55]:

df1_clean['img_num'] = df1_clean.img_num.fillna(0)
df1_clean['img_num'] = df1_clean.img_num.astype(int)


# In[56]:

# checking to see that the datatype has changed
df1_clean.info()


# ## Storing Data
# Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

# In[57]:

df1_clean.to_csv('twitter_archive_master.csv', index=False)


# ## Analyzing and Visualizing Data
# In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

# In[58]:

# load the new csv into a dataframe and view it
df_new = pd.read_csv('twitter_archive_master.csv')
df_new.head(4)


# In[59]:

# shows the descriptive statistics of our dataset
df_new.describe()


# In[60]:

df_new['img_num'].value_counts()


# In[61]:

df_new.groupby('p1_asdog')['img_num'].value_counts().unstack()


# In[62]:

df_new['dog_stages'].value_counts()


# ### Insights:
# 1. Tweets with one image are the most common, and most of them were predicted to be dogs.
#
# 2. The mean of the first prediction confidence is approximately 60%.
#
# 3. On average, tweets receive more favorites than retweets.
#
# 4. More dogs are classified as pupper (smaller and younger dogs) than any of the other dog stages.

# ### Visualization
# #### Q1. Which dog stage has the highest count?

# In[71]:

# show the distribution of the dog stages on a bar chart
stages = df_new['dog_stages'].value_counts()
stages.plot(kind='bar', figsize=(5, 6))
plt.xticks(rotation=0)
plt.title("Dog Stages")
plt.xlabel("Stages")
plt.ylabel("Count");


# - Most dogs were classified as a pupper, which is a smaller and usually younger type of dog. But the data for the dog stages is incomplete, so we cannot ascertain the accuracy of this; a quick check of that incompleteness follows.
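# One way to quantify how incomplete the stage data is: compute the share of rows that have any dog stage recorded at all (a supplementary check supporting the caveat above).

# In[ ]:

# fraction of tweets with a recorded dog stage; a low share backs the caveat above
print(df_new['dog_stages'].notnull().mean())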
# #### Q2. Which image number occurred most, and how accurate was its prediction as a dog?

# In[67]:

# plotting the image numbers against the prediction as dog
pred = df_new.groupby('p1_asdog')['img_num'].value_counts().unstack()
pred.index = ['False', 'True']
pred.plot(kind='bar', figsize=(5, 6))
plt.legend(['1', '2', '3', '4'])
plt.xticks(rotation=0)
plt.title("Image Number Count with Prediction as Dog")
plt.xlabel("Prediction as Dog")
plt.ylabel("Image Number Count");


# - Tweets with one image were the most common input to the neural network, and more than 50% of those predictions were dogs.

# #### Q3. What correlation exists between the favourite count and retweet count?

# In[73]:

# we want to see the correlation between the favorite and retweet count
df_new.plot(x='favourite_count', y='retweet_count', kind='scatter')
plt.title('Favorite and Retweet Count');


# - From the above, we can see a positive correlation between the retweet and favorite count.
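# To put a number on the relationship visible in the scatter plot, the Pearson correlation coefficient can be computed directly (a supplementary check, not part of the original analysis):

# In[ ]:

# Pearson correlation between favorites and retweets; a value close to 1
# would confirm the strong positive relationship seen in the scatter plot
print(df_new['favourite_count'].corr(df_new['retweet_count']))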