This project compares two kinds of posts on the Hacker News (HN) platform: Ask HN posts (users ask the HN community a question) and Show HN posts (users show the HN community a project, product, or generally something interesting).
The objective is to discover which type of post, and which time of day, receives the most comments and points on average.
The dataset to be used for this data analysis can be found here
The dataset has 300,000 rows; after removing posts that did not have any comments and then randomly sampling from the remainder, it has been reduced to approximately 20,000 rows.
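The reduction described above could be reproduced roughly as follows. This is a minimal sketch, not the procedure actually used to build the file: the function name `reduce_dataset`, the `sample_size` argument, the fixed seed, and the example rows are assumptions for illustration. It also assumes the comment count sits at index 4 of each row.

```python
import random

def reduce_dataset(rows, sample_size, seed=1):
    """Drop rows with zero comments, then draw a random sample.

    `rows` is a list of lists without the header row; the comment
    count is assumed to sit at index 4 (the num_comments column).
    """
    with_comments = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    # Never ask for more rows than are available.
    return rng.sample(with_comments, min(sample_size, len(with_comments)))

# Tiny made-up rows, only to show the call; not taken from the dataset.
rows = [
    ['1', 'a', '', '3', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'b', '', '5', '4', 'u2', '1/2/2016 11:00'],
    ['3', 'c', '', '1', '2', 'u3', '1/3/2016 12:00'],
]
sample = reduce_dataset(rows, sample_size=2)
print(len(sample))
```

Filtering before sampling keeps the sample size predictable, since every sampled row is guaranteed to have at least one comment.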
import csv
# Use a context manager so the file is closed after reading
with open('hacker_news.csv') as file:
    hn_data = list(csv.reader(file))
hn_header = hn_data[0]
hn_data = hn_data[1:]
# Preview data
print(hn_header)
print('\n')
print(hn_data[:2])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
This dataset includes the following columns:
- title: the title of the post
- url: the URL of the item being linked to
- num_points: the number of upvotes the post received
- num_comments: the number of comments the post received
- author: the name of the account that made the post
- created_at: the date and time the post was made

Posts have been categorised into three groups, 'ask hn', 'show hn' and 'others', based on the post title. A post whose title begins with 'ask hn' (ignoring case) is categorised as an ask post, for instance 'Ask HN: How to improve my personal website?'. The same technique is used for 'show hn' posts, and all remaining posts go in the 'others' category.
post_ask = []
post_show = []
post_other = []
for post in hn_data:
    hn_title = post[1]
    if hn_title.lower().startswith('ask hn'):
        post_ask.append(post)
    elif hn_title.lower().startswith('show hn'):
        post_show.append(post)
    else:
        post_other.append(post)
# Post count
print(len(post_ask), 'posts for Ask HN')
print(len(post_show), 'posts for Show HN')
print(len(post_other), 'posts for Others')
# Preview posts
print('\n')
print('Posts for Ask HN')
print(post_ask[:2])
print('\n')
print('Posts for Show HN')
print(post_show[:2])
print('\n')
print('Posts for others')
print(post_other[:2])
1744 posts for Ask HN
1162 posts for Show HN
17194 posts for Others

Posts for Ask HN
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]

Posts for Show HN
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]

Posts for others
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
The breakdown shows that most of the dataset (17,194 posts) falls in the 'others' category. There are 1,744 ask posts and 1,162 show posts.
The table below breaks down Hacker News posts by number of comments. Ask posts received over twice as many comments in total as show posts, and they also receive more comments on average.
Post Type | Comments (Total) | Comments (Avg) |
---|---|---|
Ask HN | 24,483 | 14.04 |
Show HN | 11,988 | 10.32 |
# Total and comment avg for ask posts
total_ask_comments = 0
for post in post_ask:
    comments = int(post[4])
    total_ask_comments += comments
avg_ask_comments = total_ask_comments / len(post_ask)
print('{:,}'.format(total_ask_comments))
print(round(avg_ask_comments, 2))
24,483
14.04
# Total and comment avg for show posts
total_show_comments = 0
for post in post_show:
    comments = int(post[4])
    total_show_comments += comments
avg_show_comments = total_show_comments / len(post_show)
print('{:,}'.format(total_show_comments))
print(round(avg_show_comments, 2))
11,988
10.32
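The two cells above repeat the same total-and-average calculation, so it could be factored into one helper. The sketch below is one possible way to do that; the name `total_and_average` is hypothetical, and the sample rows are the two Ask HN posts from the preview printed earlier.

```python
def total_and_average(posts, column):
    """Return (total, average) of an integer column across posts."""
    values = [int(post[column]) for post in posts]
    total = sum(values)
    # Guard against an empty list to avoid dividing by zero.
    average = total / len(values) if values else 0.0
    return total, average

# The two Ask HN rows from the preview (6 and 29 comments).
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
total, average = total_and_average(sample_posts, column=4)
print('{:,}'.format(total), round(average, 2))  # 35 17.5
```

Passing `column=3` instead would give the same summary for points, since the column index is the only thing that differs between the two calculations.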
Next, we would like to find out whether posts created at a certain time of day are more likely to receive comments from the Hacker News community. This part of the analysis uses only ask posts, because there is more data to work with.
Below we count the ask posts created in each hour of the day, along with the number of comments those posts received. Ask posts created during the 15:00 hour receive by far the most comments (4,477 in total).
import datetime as dt
result_list = []
for post in post_ask:
    result_list.append([post[6], int(post[4])])
comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for each_row in result_list:
    dt_created = each_row[0]
    num_comment = each_row[1]
    # Extract the hour as a zero-padded string, e.g. '09'
    time = dt.datetime.strptime(dt_created, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += num_comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = num_comment
        counts_by_hour[time] = 1
comments_by_hour
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
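The bucketing loop above, combined with a per-hour average, can be folded into one reusable function. This is only a sketch of an alternative, not the notebook's own code: `average_by_hour` and `DATE_FORMAT` are hypothetical names, `collections.defaultdict` replaces the explicit `if`/`else`, and the third sample row is invented so that one hour has two posts.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def average_by_hour(posts, value_column, date_column=6):
    """Map each hour ('00'..'23') to the mean of value_column
    over the posts created in that hour."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for post in posts:
        created = dt.datetime.strptime(post[date_column], DATE_FORMAT)
        hour = created.strftime("%H")
        totals[hour] += int(post[value_column])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

sample_posts = [
    # Two rows from the Ask HN preview printed earlier.
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    # Hypothetical row, added only so that hour 13 has two posts.
    ['0', 'Ask HN: hypothetical example', '', '0', '11', 'nobody', '11/23/2015 13:05'],
]
result = average_by_hour(sample_posts, value_column=4)
print(result)  # {'09': 6.0, '13': 20.0}
```

The same function would work for points by passing `value_column=3`.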
According to the dataset documentation, timestamps are in US Eastern Time, so 15:00 corresponds to 3:00 pm Eastern. The five hours with the highest average number of comments per ask post are 15:00, 02:00, 20:00, 16:00 and 21:00. The 15:00 hour leads with an average of about 39 comments per post (38.59).
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
#avg_by_hour
# Swap to [average, hour] so that sorting orders by average
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("The Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f}".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )
The Top 5 Hours for 'Ask HN' Comments
15:00: 38.59
02:00: 23.81
20:00: 21.52
16:00: 16.80
21:00: 16.01
The purpose of this part of the analysis is to break down Hacker News posts by points. Points represent the number of upvotes a post has received over time from the Hacker News community. The data shows that show posts receive more points on average. This suggests that while ask posts are likely to receive more comments from the HN community, show posts tend to have more upvotes.
Post Type | Points (Total) | Points (Avg) |
---|---|---|
Ask HN | 26,268 | 15.06 |
Show HN | 32,019 | 27.56 |
# Total and point avg for ask posts
total_ask_points = 0
for post in post_ask:
    points = int(post[3])
    total_ask_points += points
avg_ask_points = total_ask_points / len(post_ask)
print('Points total for ask posts:', '{:,}'.format(total_ask_points))
print('Points avg for ask posts:', round(avg_ask_points, 2))
print('\n')
# Total and point avg for show posts
total_show_points = 0
for post in post_show:
    points = int(post[3])
    total_show_points += points
avg_show_points = total_show_points / len(post_show)
print('Points total for show posts:', '{:,}'.format(total_show_points))
# Label corrected: this line reports the show-post average
print('Points avg for show posts:', round(avg_show_points, 2))
Points total for ask posts: 26,268
Points avg for ask posts: 15.06

Points total for show posts: 32,019
Points avg for show posts: 27.56
Continuing with the points analysis, we would like to find out whether posts created at a certain time are more likely to receive upvotes. The same technique used for the comment analysis is applied here: let's determine the average number of points for show posts by hour.
import datetime as dt
result_list = []
for post in post_show:
    result_list.append([post[6], int(post[3])])
points_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for each_row in result_list:
    dt_created = each_row[0]
    num_point = each_row[1]
    # Extract the hour as a zero-padded string, e.g. '09'
    time = dt.datetime.strptime(dt_created, date_format).strftime("%H")
    if time in counts_by_hour:
        points_by_hour[time] += num_point
        counts_by_hour[time] += 1
    else:
        points_by_hour[time] = num_point
        counts_by_hour[time] = 1
points_by_hour
avg_by_hour = []
for hr in points_by_hour:
    avg_by_hour.append([hr, points_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour
[['14', 25.430232558139537], ['22', 40.34782608695652], ['18', 36.31147540983606], ['07', 19.0], ['20', 30.316666666666666], ['05', 5.473684210526316], ['16', 28.322580645161292], ['19', 30.945454545454545], ['15', 28.564102564102566], ['03', 25.14814814814815], ['17', 27.107526881720432], ['06', 23.4375], ['02', 11.333333333333334], ['13', 24.626262626262626], ['08', 15.264705882352942], ['21', 18.425531914893618], ['04', 14.846153846153847], ['11', 33.63636363636363], ['12', 41.68852459016394], ['23', 42.388888888888886], ['09', 18.433333333333334], ['01', 25.0], ['10', 18.916666666666668], ['00', 37.83870967741935]]
On average, show posts created at 23:00 receive the most points. The top five hours by average points for show posts are 23:00, 12:00, 22:00, 00:00 and 18:00.
points_avg_by_hour = []
for row in avg_by_hour:
    points_avg_by_hour.append([row[1], row[0]])
sorted_list = sorted(points_avg_by_hour, reverse=True)
print("Top 5 hours for points on 'Show HN' posts")
for avg, hr in sorted_list[:5]:
    print(
        "{}: {:.2f}".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )
Top 5 hours for points on 'Show HN' posts
23:00: 42.39
12:00: 41.69
22:00: 40.35
00:00: 37.84
18:00: 36.31
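The swap-then-sort step used in both rankings could also be written with a `key` function, which sorts the `[hour, average]` pairs directly without building a swapped copy. The sketch below shows one such alternative; `top_hours` is a hypothetical helper, and the input is a subset of the Show HN averages printed earlier.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest average."""
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]

# A subset of the Show HN hourly averages shown above.
avg_by_hour = [
    ['14', 25.430232558139537],
    ['22', 40.34782608695652],
    ['18', 36.31147540983606],
    ['12', 41.68852459016394],
    ['23', 42.388888888888886],
    ['00', 37.83870967741935],
    ['05', 5.473684210526316],
]
for hr, avg in top_hours(avg_by_hour):
    print("{}:00: {:.2f}".format(hr, avg))
```

One difference worth noting: when two hours share the same average, the swapped-list version breaks the tie by hour string, whereas the `key` version keeps tied pairs in their original order.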
This project analysed ask posts and show posts on the Hacker News platform to determine which type of post and which time period received the most comments and points on average.
Based on the analysis, to improve the chance of receiving more comments, we recommend that users post with an 'Ask HN' title and create the post between 15:00 and 16:00 US Eastern Time. In my timezone, this is between 20:00 and 21:00 WAT.
Post Type | Comments (Total) | Comments (Avg) |
---|---|---|
Ask HN | 24,483 | 14.04 |
Show HN | 11,988 | 10.32 |
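The time-zone conversion in the recommendation above can be checked with a short sketch. It uses fixed UTC offsets rather than a full time-zone database: UTC-4 for US Eastern daylight time, UTC-5 for US Eastern standard time, and UTC+1 for WAT. The helper name `to_wat` and the calendar date used inside it are made up for illustration.

```python
import datetime as dt

# Fixed UTC offsets used for this sketch.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")  # US Eastern, daylight saving
EST = dt.timezone(dt.timedelta(hours=-5), "EST")  # US Eastern, standard time
WAT = dt.timezone(dt.timedelta(hours=1), "WAT")   # West Africa Time

def to_wat(hour, source_tz):
    """Convert an hour of the day in source_tz to the matching WAT time (HH:MM)."""
    # The calendar date is arbitrary; only the offsets matter here.
    moment = dt.datetime(2016, 8, 16, hour, 0, tzinfo=source_tz)
    return moment.astimezone(WAT).strftime("%H:%M")

print(to_wat(15, EDT))  # 20:00
print(to_wat(15, EST))  # 21:00
```

Under these offsets, 15:00 Eastern maps to 20:00 WAT while daylight saving time is in effect and to 21:00 WAT otherwise, so the local posting window shifts by an hour depending on the time of year.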
Furthermore, comparing both types of posts by the points (upvotes) received, the data shows that show posts receive more points on average than ask posts. This suggests that while 'ask hn' posts are more likely to receive comments from the Hacker News community, 'show hn' posts tend to receive more points.
Post Type | Points (Total) | Points (Avg) |
---|---|---|
Ask HN | 26,268 | 15.06 |
Show HN | 32,019 | 27.56 |