Exploring Hacker News Posts

Hacker News is a site where users submit posts that are voted and commented on. It is very popular in technology and startup circles, and posts that reach the top of its listings can get hundreds of thousands of visitors as a result.

Two types of posts matter for this project: "Ask HN" and "Show HN". Users submit "Ask HN" posts to ask the Hacker News community a specific question. Likewise, users submit "Show HN" posts to show a project, a product, or something interesting.

The purpose of this project is to compare these two types of posts and answer the following questions:

  1. Do "Ask HN" or "Show HN" posts receive more comments on average?
  2. Do posts created at a certain time receive more comments on average?
In [1]:
import csv

# Read the whole dataset into a list of rows; the with block closes the file.
with open("HN_posts_year_to_Sep_26_2016.csv") as opened_file:
    hn = list(csv.reader(opened_file))
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]

Remove the header row from hn.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-\xc3\x82\xc2\x93the-data-vault\xc3\x82\xc2\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]

Split ask posts and show posts into two different lists:

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
9139
10158
273822

Calculate the average number of comments for ask posts:

In [5]:
total_ask_comments = 0.0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
10.3934784987

Calculate the average number of comments for show posts:

In [6]:
total_show_comments = 0.0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
4.88609962591

On average, ask posts receive more comments than show posts (about 10.4 versus 4.9 per post).
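The two averaging cells above repeat the same loop, so the calculation could be factored into one helper. The following is a minimal sketch, not part of the original analysis: the function name `average_comments`, its `comments_index` parameter, and the sample rows are all hypothetical, and the rows only mimic the dataset's column order.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    # float() keeps the division exact under both Python 2 and Python 3.
    return float(total) / len(posts)


# Hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: Another question?', '', '2', '4', 'user_b', '9/26/2016 15:02'],
]

print(average_comments(sample_ask))
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` should reproduce the two figures printed above.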

Next, we will determine whether ask posts created at a certain time of day are more likely to receive comments. This analysis has two steps:

  1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
  2. Calculate the average number of comments ask posts receive by hour created.
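Both steps depend on extracting the hour from the `created_at` column. As a small illustration, the sketch below parses one timestamp taken from the sample rows shown earlier. The variable names `sample_created_at`, `parsed`, and `hour_key` are hypothetical; the format string is the one used in the analysis.

```python
import datetime as dt

# One created_at value from the sample rows printed earlier.
sample_created_at = "9/26/2016 3:26"

# strptime accepts month and hour values without zero padding.
parsed = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")

# strftime("%H") returns a zero-padded, two-character string,
# which is why the dictionary keys look like '03' rather than 3.
hour_key = parsed.strftime("%H")
print(hour_key)
```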
In [39]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date_time.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
        
comments_by_hour
Out[39]:
{'00': 2277,
 '01': 2089,
 '02': 2996,
 '03': 2154,
 '04': 2360,
 '05': 1838,
 '06': 1587,
 '07': 1585,
 '08': 2362,
 '09': 1477,
 '10': 3013,
 '11': 2797,
 '12': 4234,
 '13': 7245,
 '14': 4972,
 '15': 18525,
 '16': 4466,
 '17': 5547,
 '18': 4877,
 '19': 3954,
 '20': 4462,
 '21': 4500,
 '22': 3372,
 '23': 2297}
In [40]:
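avg_by_hour = []
for hour in comments_by_hour:
    # Note: both operands are ints, so under Python 2 this division truncates
    # to a whole number (as in the output below). Wrap one operand in float()
    # to get exact averages.
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
avg_by_hour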
Out[40]:
[['02', 11],
 ['03', 7],
 ['00', 7],
 ['01', 7],
 ['20', 8],
 ['21', 8],
 ['22', 8],
 ['23', 6],
 ['08', 9],
 ['09', 6],
 ['14', 9],
 ['06', 6],
 ['07', 7],
 ['11', 8],
 ['10', 10],
 ['13', 16],
 ['12', 12],
 ['15', 28],
 ['04', 9],
 ['17', 9],
 ['16', 7],
 ['19', 7],
 ['18', 7],
 ['05', 8]]
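The list above is in arbitrary order, which makes the busiest hours hard to spot. One possible next step is to sort the averages in descending order and keep the top few. The sketch below assumes that approach. The helper `top_hours` and the sample dictionaries are hypothetical and hold made-up numbers, not values from the dataset.

```python
# Hypothetical totals and counts for three hours (not real dataset values).
sample_comments_by_hour = {'15': 280, '02': 33, '09': 13}
sample_counts_by_hour = {'15': 10, '02': 3, '09': 2}


def top_hours(comments_by_hour, counts_by_hour, n=5):
    """Return up to n (average, hour) pairs, highest average first."""
    averages = [
        # float() avoids truncating integer division under Python 2.
        (float(comments_by_hour[hour]) / counts_by_hour[hour], hour)
        for hour in comments_by_hour
    ]
    # Tuples sort by their first element, so the average drives the order.
    averages.sort(reverse=True)
    return averages[:n]


for avg, hour in top_hours(sample_comments_by_hour, sample_counts_by_hour):
    print("{}:00  {:.2f} average comments per post".format(hour, avg))
```

Applied to the notebook's own `comments_by_hour` and `counts_by_hour`, this would put hour 15 first, since its truncated average of 28 is already the largest value in the output above.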