Hacker News is a site started by Y Combinator where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. You can find the data set here, but note that it has been reduced from almost 300,000 rows to around 20,000 rows by removing posts that didn't receive any comments, then randomly sampling from the remaining submissions.
Below are the columns' descriptions:
id: the unique identifier of the post
title: the title of the post
url: the URL the post links to, if any
num_points: the number of points the post acquired (upvotes minus downvotes)
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time the post was submitted
In this project, we are more interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a question. Below are examples of Ask HN:
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
Users submit Show HN posts to show the community a project, product, or something interesting. Below are examples:
Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
Our goal is to compare the two types of posts to determine: do Ask HN or Show HN posts receive more comments on average, and do posts created at a certain time receive more comments on average?
Let's open and read the data. We'll display the header row and the first five rows.
# Opening the dataset
import csv

def open_dataset(dataset):
    opened_file = open(dataset)
    read_file = csv.reader(opened_file)
    rows = list(read_file)
    return rows

hn = open_dataset('hacker_news.csv')[1:]
hn_header_row = open_dataset('hacker_news.csv')[0]
# Function to count the number of rows and display rows in a
# readable way
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]  # slice the dataset
    for row in dataset_slice:
        print(row)  # print each row of the slice
        print('\n')  # adds a new (empty) line after each row
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(hn_header_row)
print('\n')
explore_data(hn, 0, 5, True)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Number of rows: 20100
Number of columns: 7
We can see that the dataset has 20100 rows and 7 columns.
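As an aside, `open_dataset` above reads the file twice (once for the header, once for the data). A sketch of a single-read variant, using a context manager so the file is also closed properly; the tiny CSV written here is a hypothetical stand-in just to make the example self-contained:

```python
import csv

def open_dataset(dataset):
    # Read the file once; the with-block closes it automatically
    with open(dataset) as opened_file:
        return list(csv.reader(opened_file))

# Hypothetical two-row CSV, written only to demonstrate the split
with open('demo.csv', 'w') as f:
    f.write('id,title\n1,Ask HN: example\n2,Show HN: example\n')

rows = open_dataset('demo.csv')
header, data = rows[0], rows[1:]  # split header from data in one pass
print(header)     # ['id', 'title']
print(len(data))  # 2
```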
Now we are going to separate the Ask HN and Show HN posts from the other posts. We'll save the Ask HN posts in ask_posts, the Show HN posts in show_posts, and the rest in other_posts.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()  # convert titles to lower case
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('Other posts:', len(other_posts))
Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Other posts: 17194
We separated the ask posts and show posts into two lists of lists. You can see that we have 1,744 ask posts and 1,162 show posts. Below are the first five rows of the ask posts list:
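As a quick sanity check, the three groups should account for every row we started with. Using the counts printed above:

```python
# Counts taken from the output above
ask_count, show_count, other_count = 1744, 1162, 17194
total = ask_count + show_count + other_count
print(total)  # 20100, matching the full dataset
```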
# Printing ask posts
print(hn_header_row)
print('\n')
explore_data(ask_posts, 0, 5, True)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']

['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']

['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']

['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']

Number of rows: 1744
Number of columns: 7
Below are the first five rows of the show posts list:
# Printing show posts
print(hn_header_row)
print('\n')
explore_data(show_posts, 0, 5, True)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']

['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']

['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']

['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']

['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']

Number of rows: 1162
Number of columns: 7
# Find the total number of comments for ask posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
print('Total number of comments for ask posts:', total_ask_comments)

avg_ask_comments = round(total_ask_comments / len(ask_posts))
print('Average number of comments per ask post:', avg_ask_comments)
print('\n')

# Find the total number of comments for show posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
print('Total number of show post comments:', total_show_comments)

avg_show_comments = round(total_show_comments / len(show_posts))
print('Average number of comments per show post:', avg_show_comments)
Total number of comments for ask posts: 24483
Average number of comments per ask post: 14

Total number of show post comments: 11988
Average number of comments per show post: 10
On average, ask posts receive more comments than show posts. To reach that conclusion, we calculated the total number of comments for each post type, then divided that total by the number of posts of that type. Above, we found that there are 24,483 ask post comments, and in a previous cell we found that there are 1,744 ask posts, so the average is 24483 / 1744, rounded. We did the same for show posts.
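The arithmetic behind those averages can be checked directly, using the totals and counts from the cells above:

```python
# Totals and post counts taken from the output above
total_ask_comments = 24483
total_show_comments = 11988

avg_ask = round(total_ask_comments / 1744)    # 1,744 ask posts
avg_show = round(total_show_comments / 1162)  # 1,162 show posts
print(avg_ask, avg_show)  # 14 10
```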
Ask posts have more comments on average (14) than show posts (10). I believe the reason is that an ask post invites more interaction: people are more willing to answer a question than to comment on a project or other show post. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
Now let's determine if ask posts created at a certain time are more likely to attract comments. We'll count the ask posts and comments created in each hour of the day, then calculate the average number of comments ask posts receive by hour of creation.
import datetime as dt

result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])  # [created_at, num_comments]

counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'

for row in result_list:
    hour = row[0]
    hour = dt.datetime.strptime(hour, date_format).strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print('Posts created by hour:', '\n', counts_by_hour)
print('\n')
print('Comments posted by hour:', '\n', comments_by_hour)
Posts created by hour:
{'16': 108, '11': 58, '17': 100, '20': 80, '15': 116, '13': 85, '04': 47, '22': 71, '03': 54, '07': 34, '21': 109, '08': 48, '23': 68, '02': 58, '14': 107, '10': 59, '09': 45, '05': 46, '01': 60, '00': 55, '18': 109, '19': 110, '12': 73, '06': 44}

Comments posted by hour:
{'16': 1814, '11': 641, '17': 1146, '20': 1722, '15': 4477, '13': 1253, '04': 337, '22': 479, '03': 421, '07': 267, '21': 1745, '08': 492, '23': 543, '02': 1381, '14': 1416, '10': 793, '09': 251, '05': 464, '01': 683, '00': 447, '18': 1439, '19': 1188, '12': 687, '06': 397}
Above, we created two dictionaries: counts_by_hour for the posts created per hour and comments_by_hour for the comments posted per hour. The hours are in 24-hour format; for example, you can see that at hour 18 (6 p.m.) there were 109 posts and 1,439 comments created.
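The two-dictionary pattern above can also be written with `collections.defaultdict`, which removes the need for the `if hour not in counts_by_hour` branch. A sketch on a few hypothetical (created_at, num_comments) pairs, not rows from the real dataset:

```python
import datetime as dt
from collections import defaultdict

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)
date_format = '%m/%d/%Y %H:%M'

# Hypothetical (created_at, num_comments) pairs, for illustration only
sample = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['8/2/2016 9:20', 3]]

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    counts_by_hour[hour] += 1            # missing keys start at 0
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))    # {'09': 2, '13': 1}
print(dict(comments_by_hour))  # {'09': 9, '13': 29}
```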
Now let's calculate the average number of comments for posts created during each hour of the day. We'll use the two dictionaries created above.
avg_by_hour = []  # list of lists to store hours and average comments
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)
[['16', 16.796296296296298], ['11', 11.051724137931034], ['17', 11.46], ['20', 21.525], ['15', 38.5948275862069], ['13', 14.741176470588234], ['04', 7.170212765957447], ['22', 6.746478873239437], ['03', 7.796296296296297], ['07', 7.852941176470588], ['21', 16.009174311926607], ['08', 10.25], ['23', 7.985294117647059], ['02', 23.810344827586206], ['14', 13.233644859813085], ['10', 13.440677966101696], ['09', 5.5777777777777775], ['05', 10.08695652173913], ['01', 11.383333333333333], ['00', 8.127272727272727], ['18', 13.20183486238532], ['19', 10.8], ['12', 9.41095890410959], ['06', 9.022727272727273]]
Above, we calculated the average number of comments per post created during each hour of the day.
Now let's find the hours with the highest values. We'll swap the elements in each pair of our avg_by_hour list. For example, the pair ['01', 11.383333333333333] becomes [11.383333333333333, '01']: the average number of comments becomes the first element and the hour becomes the second. Once we do that, we'll sort the list in descending order.
swap_avg_by_hour = []
for hour, avg in avg_by_hour:
    swap_avg_by_hour.append([avg, hour])
print(swap_avg_by_hour)
[[16.796296296296298, '16'], [11.051724137931034, '11'], [11.46, '17'], [21.525, '20'], [38.5948275862069, '15'], [14.741176470588234, '13'], [7.170212765957447, '04'], [6.746478873239437, '22'], [7.796296296296297, '03'], [7.852941176470588, '07'], [16.009174311926607, '21'], [10.25, '08'], [7.985294117647059, '23'], [23.810344827586206, '02'], [13.233644859813085, '14'], [13.440677966101696, '10'], [5.5777777777777775, '09'], [10.08695652173913, '05'], [11.383333333333333, '01'], [8.127272727272727, '00'], [13.20183486238532, '18'], [10.8, '19'], [9.41095890410959, '12'], [9.022727272727273, '06']]
# Sorting the swapped list
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap[:5])
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
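The swap step isn't strictly necessary: `sorted` accepts a `key` function, so we could sort avg_by_hour directly on its second element. A sketch using a small stand-in list in place of the full avg_by_hour:

```python
# Stand-in for avg_by_hour, which holds [hour, average] pairs
avg_by_hour = [['16', 16.8], ['15', 38.59], ['02', 23.81]]

# Sort by the average (index 1) in descending order, no swapping needed
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(top)  # [['15', 38.59], ['02', 23.81], ['16', 16.8]]
```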
As you can see above, we sorted the swapped list and printed the top 5 hours for ask post comments. Hour 15 (3 p.m.) has the highest average with 38.59 comments per post, followed by 02 (2 a.m.) with 23.81, then 20 (8 p.m.) with 21.52.
for avg, hour in sorted_swap[:5]:
    each_hour = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    comment_per_hour = '{h}: {c:.2f} average comments per post'.format(h=each_hour, c=avg)
    print(comment_per_hour)
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
After our analysis of the Hacker News dataset, we conclude that the ideal time to post for a higher chance of receiving comments is hour 15 (3 p.m.), followed by 2 a.m. (I really don't know why anyone would stay awake at that time to comment, but some people do). Hour 15 tops the list with the most comments received on average, though that could also be influenced by the fact that it is the hour with the most posts created (116 posts).