Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. You can find the data set here. Note that it has been reduced from almost 300,000 rows to around 20,000 rows by removing posts that didn't receive any comments and then randomly sampling from the remaining submissions.

Below are descriptions of the columns (a quick index-mapping sketch follows the list):

  • id: The unique identifier from Hacker News for the post
  • title: The title of the post
  • url: The URL that the post links to, if the post has a URL
  • num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
  • num_comments: The number of comments that were made on the post
  • author: The username of the person who submitted the post
  • created_at: The date and time at which the post was submitted
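For quick reference while reading the code below, here is a minimal sketch that builds a name-to-index mapping from the header row, so lookups like row[4] (num_comments) are easier to follow. It assumes the same hacker_news.csv file used later in the project, and the col_index dictionary is purely illustrative:

import csv

# Illustrative only: map each column name to its position in a row.
with open('hacker_news.csv') as f:
    header = next(csv.reader(f))

col_index = {name: i for i, name in enumerate(header)}
print(col_index)
# {'id': 0, 'title': 1, 'url': 2, 'num_points': 3,
#  'num_comments': 4, 'author': 5, 'created_at': 6}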

In this project, we are more interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Users submit Show HN posts to show the community a project, product, or just something interesting. Below are a few examples:

Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm

Our goal is to compare these two types of posts to determine:

  • Do Ask HN or Show HN receive more comments on average?
  • Do posts created at a certain time receive more comments on average?

Let's open and read the data. We'll display the header row and the first five rows.

In [1]:
# Opening the dataset

import csv

def open_dataset(dataset):
    opened_file = open(dataset)
    read_file = csv.reader(opened_file)
    rows = list(read_file)
    opened_file.close()
    return rows

hn_full = open_dataset('hacker_news.csv')  # read the file only once
hn_header_row = hn_full[0]                 # header row
hn = hn_full[1:]                           # data rows, with the header removed

# Function to count the number of rows and display rows in a
# readable way

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end] # to slice the dataset
    for row in dataset_slice: 
        print(row) # to print the sliced data set
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(hn_header_row)
print('\n')
explore_data(hn, 0, 5, True)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows: 20100
Number of columns: 7

We can see that the dataset (excluding the header row) has 20,100 rows and 7 columns.

Now we are going to separate the Ask HN and Show HN posts from the other posts. We'll save the Ask HN posts in ask_posts, the Show HN posts in show_posts, and the rest of the posts in other_posts.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower() # convert titles to lower case
    if title.startswith('ask hn'): 
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('Number of other posts:', len(other_posts))
Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194

We separated the ask posts and the show posts into two lists of lists. You can see that we have 1,744 ask posts and 1,162 show posts. Below are the first five rows of the ask posts list:

In [3]:
#Printing ask posts
print(hn_header_row)
print('\n')
explore_data(ask_posts, 0, 5, True)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


Number of rows: 1744
Number of columns: 7

Below are the first five rows of the show posts list:

In [4]:
#Printing show posts
print(hn_header_row)
print('\n')
explore_data(show_posts, 0, 5, True)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']


['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']


['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']


['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']


Number of rows: 1162
Number of columns: 7

Now let's see whether ask posts or show posts receive more comments on average.

In [5]:
#find the total number of comments for ask posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
print('Total number of ask posts comments:', total_ask_comments)
avg_ask_comments = round(total_ask_comments / len(ask_posts))
print('Average number of comments per ask post:', avg_ask_comments)
print('\n')

#number of comments for show posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
print('Total number of show posts comments:', total_show_comments)
avg_show_comments = round(total_show_comments / len(show_posts))
print('Average number of comments per show post:', avg_show_comments)
Total number of ask posts comments: 24483
Average number of comments per ask post: 14


Total number of show posts comments: 11988
Average number of comments per show post: 10

On average, ask posts receive more comments than show posts. To come to that conclusion, we calculated the total number of comments for each type of post, then divided that total by the number of posts of that type. Notice above that there are 24,483 ask post comments, and in a previous cell we found that there are 1,744 ask posts. So to determine the average, we divided 24483 / 1744 (about 14.04) and rounded the result. We did the same for show posts.
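As a quick sanity check on that arithmetic, here is a standalone sketch that uses only the totals printed above:

# Quick sanity check of the averages reported above,
# using the totals already printed in the previous cell.
total_ask_comments = 24483
total_show_comments = 11988
num_ask_posts = 1744
num_show_posts = 1162

print(round(total_ask_comments / num_ask_posts))    # 14 (about 14.04 before rounding)
print(round(total_show_comments / num_show_posts))  # 10 (about 10.32 before rounding)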

Ask posts have more comments on average (14) than show posts (10). I believe the reason is that ask posts invite more interaction: people are more willing to answer a question than to comment on a project or some other show post. Since ask posts are more likely to receive comments, we'll focus our remaining analysis on just these posts.

Now let's determine whether ask posts created at a certain time are more likely to attract comments. We'll:

  1. Calculate the number of ask posts created during each hour of the day, along with the number of comments they received.
  2. Calculate the average number of comments ask posts receive by hour created.

Number of ask posts and comments created each hour

In [6]:
import datetime as dt
result_list = []  # each element: [created_at, num_comments]
for row in ask_posts:
    result_list.append([row[6], int(row[4])])  # created_at is column 6, num_comments is column 4

counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'

for row in result_list:
    hour = row[0]
    hour = dt.datetime.strptime(hour, date_format).strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
print('Posts created by hour:','\n',counts_by_hour)
print('\n')
print('Comments posted by hour:','\n',comments_by_hour)
Posts created by hour: 
 {'16': 108, '11': 58, '17': 100, '20': 80, '15': 116, '13': 85, '04': 47, '22': 71, '03': 54, '07': 34, '21': 109, '08': 48, '23': 68, '02': 58, '14': 107, '10': 59, '09': 45, '05': 46, '01': 60, '00': 55, '18': 109, '19': 110, '12': 73, '06': 44}


Comments posted by hour: 
 {'16': 1814, '11': 641, '17': 1146, '20': 1722, '15': 4477, '13': 1253, '04': 337, '22': 479, '03': 421, '07': 267, '21': 1745, '08': 492, '23': 543, '02': 1381, '14': 1416, '10': 793, '09': 251, '05': 464, '01': 683, '00': 447, '18': 1439, '19': 1188, '12': 687, '06': 397}

Above, we created two dictionaries: counts_by_hour for the number of posts created during each hour and comments_by_hour for the number of comments those posts received, by hour. The hours are in 24-hour format. For example, you can see that at hour 18 (6 p.m.) there were 109 posts and 1,439 comments created.
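To confirm those numbers for hour 18, we can index both dictionaries directly. This is just a quick check, assuming counts_by_hour and comments_by_hour from the cell above are still in scope:

# Assumes counts_by_hour and comments_by_hour from the previous cell.
hour = '18'
print('Posts created at {}: {}'.format(hour, counts_by_hour[hour]))      # 109
print('Comments posted at {}: {}'.format(hour, comments_by_hour[hour]))  # 1439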

Average number of comments for Ask HN posts by hour

Now let's calculate the average number of comments for posts created during each hour of the day. We'll use the two dictionaries created above.

In [7]:
avg_by_hour = []  # list of lists: [hour, average comments per post]
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)
        
[['16', 16.796296296296298], ['11', 11.051724137931034], ['17', 11.46], ['20', 21.525], ['15', 38.5948275862069], ['13', 14.741176470588234], ['04', 7.170212765957447], ['22', 6.746478873239437], ['03', 7.796296296296297], ['07', 7.852941176470588], ['21', 16.009174311926607], ['08', 10.25], ['23', 7.985294117647059], ['02', 23.810344827586206], ['14', 13.233644859813085], ['10', 13.440677966101696], ['09', 5.5777777777777775], ['05', 10.08695652173913], ['01', 11.383333333333333], ['00', 8.127272727272727], ['18', 13.20183486238532], ['19', 10.8], ['12', 9.41095890410959], ['06', 9.022727272727273]]

Above, we calculated the average number of comments per post created during each hour of the day.

Now let's find the hours with the highest values. We are going to swap the elements' places in our avg_by_hour list. For example, the entry ['01', 11.383333333333333] becomes [11.383333333333333, '01']: the average number of comments per post becomes the first element, and the hour becomes the second. Once we do that, we'll sort the list in descending order.

In [8]:
swap_avg_by_hour = []
for h, c in avg_by_hour:
    swap_avg_by_hour.append([c,h])
print(swap_avg_by_hour)
[[16.796296296296298, '16'], [11.051724137931034, '11'], [11.46, '17'], [21.525, '20'], [38.5948275862069, '15'], [14.741176470588234, '13'], [7.170212765957447, '04'], [6.746478873239437, '22'], [7.796296296296297, '03'], [7.852941176470588, '07'], [16.009174311926607, '21'], [10.25, '08'], [7.985294117647059, '23'], [23.810344827586206, '02'], [13.233644859813085, '14'], [13.440677966101696, '10'], [5.5777777777777775, '09'], [10.08695652173913, '05'], [11.383333333333333, '01'], [8.127272727272727, '00'], [13.20183486238532, '18'], [10.8, '19'], [9.41095890410959, '12'], [9.022727272727273, '06']]
In [9]:
# Sorting the swapped list in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap[:5])
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]

As you can see above, we sorted our swapped list and printed the top 5 hours for ask post comments. Hour 15 (3 p.m.) has the most comments per post on average with about 38.6, followed by 02 (2 a.m.) with 23.8, then 20 (8 p.m.) with 21.5.
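As a side note, the swap step isn't strictly required: sorted accepts a key function, so the same top 5 can be pulled straight from avg_by_hour. The snippet below is only an alternative sketch (it assumes avg_by_hour from the earlier cell is still in scope); the swap-and-sort approach above works just as well.

# Alternative to swapping: sort avg_by_hour by its second element (the average).
top_5 = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]
print(top_5)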

In [14]:
for comment, hour in sorted_swap[:5]:
    each_hour = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    comment_per_hour = '{h}: {c:.2f} average comments per post'.format(h = each_hour, c = comment)
    print(comment_per_hour)
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

After our analysis of the Hacker News dataset, we have concluded that the ideal time to create an ask post and have a higher chance of receiving comments is hour 15 (3 p.m.), followed by 2 a.m. (I really don't know why anyone would stay awake at that time to comment, but some people do). Hour 15 (3 p.m.) is at the top with the most comments received on average, but that could also be influenced by the fact that it's the hour with the most posts created (116 posts).