Guided project 2: Exploring Hacker News


In this project, we'll work with a data set of submissions to the popular technology site Hacker News and, at the same time, I'll get the facts for a friend.

My friend had a lot of complaints about his submissions to Hacker News. He argues that every post he has submitted received fewer comments than other people's posts. As a data analyst, I had to look into the site's data to establish the facts. It took a long phone conversation for him to believe me.

This is how our conversation went, but before that, you can click here to get the data set (note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions).

Friend: Hello, how are you doing?

Me: I am doing well. Tell me, I hope this time it's not about the apps.

Friend: Not really. I don't trust this site. Imagine, ever since I started submitting my posts, the number of comments I've received has been so wanting compared to other posts, which end up with uncountable comments.

Me: That's funny. Which site are you talking about, if you don't mind?

Friend: Hacker News. I can't trust it anymore.

Me: You don't have to say that yet. Let me look into the site so that we can get the facts. You will have to hold on for at most 45 minutes.

Friend: It's okay.

1). Importing and reading the data

Without further ado, we will import and read our data set, which is stored in a CSV file named hacker_news.csv.

In [1]:
from csv import reader

opened_file = open('hacker_news.csv')  # open the file
reader_file = reader(opened_file)      # create a CSV reader over it
hn = list(reader_file)                 # convert the rows to a list of lists, stored in hn
opened_file.close()                    # all rows are in memory now, so close the file

# let's display the first five rows

print(hn[:5])
          
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
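
As a side note, a `with` block closes the file automatically once the rows have been read. Here is a minimal sketch of that idea, assuming a hypothetical helper named `read_rows` (not part of the original notebook) and using an in-memory stand-in for hacker_news.csv built from the header and one row shown above:

```python
import io
from csv import reader

def read_rows(file_obj):
    """Return every row of an open CSV file as a list of lists (hypothetical helper)."""
    return list(reader(file_obj))

# In-memory stand-in for hacker_news.csv: the header plus one row from the output above
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

with sample_csv as f:  # the file object is closed when the block ends
    sample_rows = read_rows(f)

print(sample_rows[0])  # header row
print(sample_rows[1])  # first data row
```

With the real file, the same helper would be called inside `with open('hacker_news.csv') as f:`.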

With the header row still in place, we may end up with errors while analysing the data, and my friend may not get the actual facts. So in the cell below we will remove the first row (the header row) from hn.

Have a look

In [2]:
headers = hn[0]
hn = hn[1:]

# let's now display the headers and the first five rows of hn for verification
print(headers)
print("\n")  # print blank lines as a separator
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
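
The cells that follow pick columns by position (for example, index 4 for num_comments). One way to make those positions less error-prone is to look them up from the header row. This is only a sketch; the name `column_index` is an assumption for illustration and does not appear in the notebook:

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position in a row (hypothetical helper dictionary)
column_index = {name: position for position, name in enumerate(headers)}

# Look the column up by name instead of hard-coding the number 4
num_comments = int(sample_row[column_index['num_comments']])
print(column_index['num_comments'], num_comments)
```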

2). Filtering our data

Since we have removed the headers from hn, we can now filter our data. But before that, I have to get back to my friend to find out what type of post he has been submitting.

Me: Can you hear me, friend?

Friend: Yes, I can hear you.

Me: Can you recall the kind of posts you have been submitting?

Friend: Ask posts.

Me: Okay, I'll get back to you.

For now the response is encouraging, since in the cell below we are concerned only with posts whose titles begin with Ask HN or Show HN, and we'll create new lists of lists containing just the data for those titles.

In [3]:
# let's now create three empty lists
ask_posts = []   # posts whose titles start with "Ask HN"
show_posts = []  # posts whose titles start with "Show HN"
other_posts = [] # posts whose titles start with neither

# let's now loop through each row in hn

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Let's check the number of posts in each list

print("posts in ask hn: ",len(ask_posts))
print("posts in show hn: ",len(show_posts))
print("posts in other posts: ",len(other_posts))
        
posts in ask hn:  1744
posts in show hn:  1162
posts in other posts:  17194
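
The same title test can be pulled out into a small function so it can be checked on its own. This is a sketch only; `classify_title` is a hypothetical name, and the sample titles are taken from rows printed elsewhere in this notebook:

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' depending on how the title starts (hypothetical helper)."""
    lowered = title.lower()  # compare in lower case so "ASK HN" and "Ask HN" both match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]

for title in sample_titles:
    print(classify_title(title), "|", title)
```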

Let's check a few rows in ask posts and show posts

1. Ask posts

In [4]:
print(ask_posts[:2])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]

2. Show posts

In [5]:
print(show_posts[:2])
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]

3). Finding the posts with more comments

Remember, our aim is to determine which type of post receives more comments. To do this, we will compute the average number of comments per post for each type, i.e. ask posts and show posts.

i). Ask posts and show posts

In [6]:
# working on average number of comments in ask posts

total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    total_ask_comments += int(num_comments)
    
#let's now compute the average

avg_ask_comments = total_ask_comments/len(ask_posts)
print("sum total of comments in ask posts: ",total_ask_comments)
print("Average number of comments in ask posts: ",avg_ask_comments)
print("\n") # to create space

# working on average number of comments in show posts

total_show_comments = 0

for row in show_posts:
    num_comments = row[4]
    total_show_comments += int(num_comments)
    
#let's now compute the average

avg_show_comments = total_show_comments/len(show_posts)

print("sum total of comments in show posts: ",total_show_comments)
print("Average number of comments in show posts: ",avg_show_comments)
    
sum total of comments in ask posts:  24483
Average number of comments in ask posts:  14.038417431192661


sum total of comments in show posts:  11988
Average number of comments in show posts:  10.31669535283993

From the above workings we find that ask posts receive more comments, about 14 on average, compared to show posts, which receive about 10 on average.

We are lucky: my friend's issue was with ask posts, and these are the posts with more comments compared to show posts. We'll now focus our remaining analysis on these posts (ask posts).
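
Since the same loop is written twice above, the averaging step could also be sketched as one reusable function. The name `average_comments` and its `comments_index` parameter are assumptions for illustration; the two sample rows are the ask posts printed earlier:

```python
def average_comments(rows, comments_index=4):
    """Average the num_comments column over rows; returns 0.0 for an empty list (hypothetical helper)."""
    if not rows:
        return 0.0  # avoid dividing by zero when there are no posts
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample_ask_posts))  # average over the two sample rows only
print(24483 / 1744)                        # the full ask-post total divided by the ask-post count
```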

4). Finding the number of Ask Posts and Comments by Hour Created

With this I'll be in a position to answer my friend with the facts about this site, since we'll determine whether ask posts created at a certain time are more likely to attract comments.

To do this, we'll:

i. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

ii. Calculate the average number of comments ask posts receive by hour created.

In [7]:
# we will first import datetime module

import datetime as dt

result_list = [] # this will hold a [created_at, num_comments] pair for each ask post

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# let's now create two empty dictionaries

counts_by_hour = {}
comments_by_hour ={}

for row in result_list:
    extract_date = row[0]  # the creation date string
    comment = row[1]       # the number of comments
    
    date_time = dt.datetime.strptime(extract_date, "%m/%d/%Y %H:%M")  # parse the date string
    extract_hour = date_time.strftime("%H")  # format the hour as a two-digit string

    if extract_hour not in counts_by_hour:
        counts_by_hour[extract_hour] = 1
        comments_by_hour[extract_hour] = comment
    else:
        counts_by_hour[extract_hour] += 1
        comments_by_hour[extract_hour] += comment
        
print("the number of comments on ask posts created by the hour:")
print("hour: comments")
comments_by_hour
        
the number of comments on ask posts created by the hour:
hour: comments
Out[7]:
{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

From the above output, we can see that ask posts created from early afternoon to late evening, i.e. between 13:00 and 21:00, receive the most comments in total, with a peak at around 15:00.

We can also notice that ask posts created between midnight and 8:00 in the morning, with the exception of 2:00, receive fewer comments.
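
The hour-by-hour tally in cell 7 can also be sketched with `collections.defaultdict`, which removes the need for the "not in" check. This is an alternative sketch, not the notebook's own code; the first two sample rows come from the ask posts printed earlier, and the third is a made-up row added only so that one hour has two posts:

```python
import datetime as dt
from collections import defaultdict

sample_posts = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['8/4/2016 13:05', 10],  # made-up row, for illustration only
]

sample_counts = defaultdict(int)    # missing hours start at 0 automatically
sample_comments = defaultdict(int)

for created_at, num_comments in sample_posts:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")    # two-digit hour string, e.g. "09"
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```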

Let me now get back to my friend to acquire more information.

Me: Hello... at what hours do you usually submit your posts?

Friend: Often, I submit my posts in the late afternoon.

Me: Be specific, please...

Friend: Between 12:00 and 15:00 (the time zone is East Africa Time, EAT).

Me: Now I've got you, but hold on; in no more than 5 minutes I'll present you with the facts.

So you can see why my friend had to complain. Can we say it was out of ignorance? I don't know, but there was no reason at all for blaming the site.

5). Calculating the average number of comments per post

We aren't done yet. Let's calculate the average number of comments per post for posts created during each hour of the day.

To achieve this, we use the two dictionaries we created in cell 7. Have a look:

In [8]:
avg_by_hour = [] 

for hours in comments_by_hour:
    avg_by_hour.append([hours,comments_by_hour[hours]/counts_by_hour[hours]])

print("average number of comments for posts created during each hour of the day")

avg_by_hour
average number of comments for posts created during each hour of the day
Out[8]:
[['12', 9.41095890410959],
 ['10', 13.440677966101696],
 ['22', 6.746478873239437],
 ['15', 38.5948275862069],
 ['11', 11.051724137931034],
 ['07', 7.852941176470588],
 ['00', 8.127272727272727],
 ['18', 13.20183486238532],
 ['21', 16.009174311926607],
 ['23', 7.985294117647059],
 ['02', 23.810344827586206],
 ['14', 13.233644859813085],
 ['17', 11.46],
 ['05', 10.08695652173913],
 ['08', 10.25],
 ['20', 21.525],
 ['13', 14.741176470588234],
 ['09', 5.5777777777777775],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['06', 9.022727272727273],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['16', 16.796296296296298]]
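
The per-hour division can be checked by hand on a few hours. In the sketch below, the comment totals come from the Out[7] listing, but the post counts are not printed anywhere in this notebook; they are inferred here by dividing each total by its average, so treat them as an assumption:

```python
# Totals from Out[7]; counts are inferred (total / average), not printed in the notebook
comments_sample = {'08': 492, '17': 1146, '20': 1722}
counts_sample = {'08': 48, '17': 100, '20': 80}

# A dictionary comprehension does the same division as the loop in cell 8
avg_sample = {hour: comments_sample[hour] / counts_sample[hour]
              for hour in comments_sample}

for hour in sorted(avg_sample):
    print(hour, round(avg_sample[hour], 2))
```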

6). Sorting and printing values from a list of lists

Once we sort the output above, it will be easier to read.

Have a look:

In [9]:
# we create an empty list first

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]]) #this makes it easy to sort our list
    
# let's print our new list to confirm the order of the columns
print("list with swapped columns:")
print("\n") #to create space
print(*swap_avg_by_hour[:3], sep="\n")
print("\n")

# sorting the list

sorted_swap = sorted(swap_avg_by_hour, reverse=True)  # sort the list in descending order

print("Top 5 Hours for Ask posts comments")
print("\n")
print(*sorted_swap[:5], sep="\n")
print("\n")

# formatting our top 5 hours for ask post comments

for row in sorted_swap[:5]:
    avg_comments = row[0]
    hour = row[1]
    print("{} : {:.2f} average comments per post.".format(dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg_comments))
list with swapped columns:


[9.41095890410959, '12']
[13.440677966101696, '10']
[6.746478873239437, '22']


Top 5 Hours for Ask posts comments


[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']


15:00 : 38.59 average comments per post.
02:00 : 23.81 average comments per post.
20:00 : 21.52 average comments per post.
16:00 : 16.80 average comments per post.
21:00 : 16.01 average comments per post.
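
Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which leaves each [hour, average] pair in its original order. A minimal sketch on four of the pairs from Out[8] (the name `avg_by_hour_sample` is made up for this example):

```python
avg_by_hour_sample = [
    ['12', 9.41095890410959],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), largest first
top_hours = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg_comments in top_hours[:3]:
    print("{}:00 : {:.2f} average comments per post.".format(hour, avg_comments))
```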

7). My time zone (EAT)

Before I get back to my friend, I have to convert to my time zone, East Africa Time (EAT), which is 7 hours ahead of the times in the data.

Have a look

In [14]:
print("Top 5 Hours for Ask Posts Comments in EAT time zone")
print("\n")

date_format = "%H"

for row in sorted_swap[:5]:
    avg_comments = "{:.2f}".format(row[0])  # format to two decimal places
    hour = dt.datetime.strptime(row[1], date_format)
    hour_in_eat = hour + dt.timedelta(hours=7)  # shift to EAT, 7 hours ahead
    time = hour_in_eat.strftime("%H:%M")
    print("{}: {} average comments per post".format(time, avg_comments))
Top 5 Hours for Ask Posts Comments in EAT time zone


22:00: 38.59 average comments per post
09:00: 23.81 average comments per post
03:00: 21.52 average comments per post
23:00: 16.80 average comments per post
04:00: 16.01 average comments per post
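
Because only the hour matters here, the same shift can be sketched with plain arithmetic, wrapping past midnight with the modulo operator. The function name `shift_hour` is hypothetical, and the 7-hour offset is the one assumed in this notebook:

```python
def shift_hour(hour_str, offset_hours=7):
    """Shift a two-digit hour string by offset_hours, wrapping around midnight (hypothetical helper)."""
    shifted = (int(hour_str) + offset_hours) % 24  # e.g. 20 + 7 = 27 wraps to 3
    return "{:02d}:00".format(shifted)

for hour in ['15', '02', '20', '16', '21']:
    print(hour + ":00 ->", shift_hour(hour))
```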

I think now I have the facts for my friend.

Me: Hello friend, I am done...

Friend: I hope you've seen for yourself.

Me: Not really. First, ask posts on Hacker News are highly affected by the time they are submitted, which varies. If by any chance you had submitted your posts in the afternoon at 15:00 (or 22:00 in EAT), which is not your case, you would have received more comments. But your submissions have been during the morning hours between 05:00 and 08:00 (or 12:00 to 15:00 in EAT), which have some of the lowest averages, at most about 10 comments. So always submit between 15:00 and 21:00 (or between 22:00 and 04:00 in EAT) and you'll have uncountable comments, just the way you claimed before.

Friend: Wow! I can't believe this. You mean it's a matter of time (hours)! Thank you very much, friend, and I wish you all the best in your journey (data science).

Me: Thank you, and you're always welcome.

Conclusion

From the results above we can conclude:

1.) Ask HN posts received more comments on average than Show HN posts.

2.) Ask posts receive more comments when they are created between 13:00 and 21:00 (or 20:00 to 04:00 in EAT).
