Exploring Hacker News (HN) Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

Our goal in this project is to determine the following:

  1. Do Ask HN or Show HN posts receive more comments on average?
  2. Do posts created at a certain time receive more comments on average?

The data set comprises 293,119 rows (excluding the header) and 7 columns. Each column is described below for reference.

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted

In [115]:
from csv import reader
hn_opened_file = open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8')
hn_csv_file = reader(hn_opened_file)
hn = list(hn_csv_file)
hn[:5]
Out[115]:
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

Remove the first row, which contains the column names.

In [116]:
headers = hn[:1] # column descriptions
hn = hn[1:] # remove the row that contains column descriptions
print(headers)
hn[:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
Out[116]:
[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]
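The loading and header-removal steps above could also be written with a context manager, so that the file is closed once it has been read. The sketch below is only an illustration, not the notebook's own code: the helper name `load_rows` is hypothetical, and a small in-memory sample stands in for the real CSV file so that the block runs on its own.

```python
import csv
import io

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# In-memory sample standing in for the real CSV file (made up for illustration).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,Example post,http://example.com,1,0,altstar,9/26/2016 3:26\n"
)

# The with-block closes the file object when the block ends.
with sample_csv as f:
    header, rows = load_rows(f)

print(header)
print(len(rows))
```

With the real file, the same helper could be called inside `with open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8') as f:`.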

We're only interested in posts whose titles start with either 'Ask HN' or 'Show HN'.

Create three lists named ask_posts, show_posts, and other_posts to separate the posts.

In [117]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Ask  HN posts count : ', len(ask_posts))
print('Show HN posts count : ', len(show_posts))
print('Other   posts count : ', len(other_posts))
Ask  HN posts count :  9139
Show HN posts count :  10158
Other   posts count :  273822
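The prefix check used above can be isolated in a small function, which makes the case-insensitive matching easy to try on a few titles. This is a sketch under stated assumptions: `classify_title` is a hypothetical helper, and the sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only.
sample_titles = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'algorithmic music']

groups = {'ask': [], 'show': [], 'other': []}
for title in sample_titles:
    groups[classify_title(title)].append(title)

for name, titles in groups.items():
    print(name, len(titles))
```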

Let's start with our first goal: determining whether Ask HN or Show HN posts receive more comments on average.

In [118]:
total_ask_comments = 0
total_num_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    total_num_ask_comments += 1

avg_ask_comments = total_ask_comments/total_num_ask_comments
print("Average 'Ask HN'  posts comments :", avg_ask_comments)

total_show_comments = 0
total_num_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    total_num_show_comments += 1

avg_show_comments = total_show_comments/total_num_show_comments
print("Average 'Show HN' posts comments :", avg_show_comments)
Average 'Ask HN'  posts comments : 10.393478498741656
Average 'Show HN' posts comments : 4.886099625910612

Above, we calculated the average number of comments received by posts whose titles start with 'Ask HN' and 'Show HN'. Ask HN posts receive more comments on average (about 10.4 per post) than Show HN posts (about 4.9 per post).
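Since the same averaging logic is applied to two lists, it could be factored into one function. The sketch below assumes the column order of this data set (num_comments at index 4); `average_comments` is a hypothetical helper and the sample rows are made up.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)

# Made-up rows in the data set's column order.
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: B', '', '3', '4', 'user_b', '9/26/2016 4:10'],
]

print(average_comments(sample_posts))
```

Guarding against an empty list avoids a ZeroDivisionError when a category has no posts.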

We'll now move on to our second and final goal: determining whether posts created at a certain time receive more comments on average.

Since we found that 'Ask HN' posts receive more comments on average, we'll analyze these posts further. To determine the time of day at which posts receive the most comments, we'll perform the steps below.

  1. Calculate the number of 'Ask HN' posts created in each hour of the day, along with the number of comments received.
  2. Calculate the average number of comments 'Ask HN' posts receive by hour created.

Calculate the number of 'Ask HN' posts created in each hour of the day, along with the number of comments received.

In [121]:
import datetime as dt
result_list = []
for row in ask_posts:
    post_create_time = row[6] # post creation time
    post_comments = int(row[4]) # number of comments received by the post
    result_list.append([post_create_time, post_comments])

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    time = row[0]
    num_comments = row[1]
    dt_obj = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(dt_obj, "%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
        
comments_by_hour
Out[121]:
{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
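The per-hour tallies above can also be built with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is a sketch, not the notebook's code: `tally_by_hour` is a hypothetical helper and the sample pairs are made up, using the same `created_at` format as the data set.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs):
    """Return (counts, comments) dicts keyed by two-digit hour strings."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Parse the timestamp and keep only the zero-padded hour, e.g. '03'.
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Made-up (created_at, num_comments) pairs for illustration.
sample_pairs = [("9/26/2016 3:26", 2), ("9/25/2016 3:05", 6), ("9/26/2016 15:00", 10)]
counts, comments = tally_by_hour(sample_pairs)
print(counts)
print(comments)
```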

Calculate the average number of comments 'Ask HN' posts receive by hour created.

In [122]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour]/counts_by_hour[hour])])
    
avg_by_hour
Out[122]:
[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

Sort the hours by average number of comments per post and print the top 5.

In [123]:
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg_comments = row[1]
    swap_avg_by_hour.append([avg_comments, hour])
    
# Sort by average comments (the first element of each pair), highest first
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for avg_cmts, hour in sorted_swap[:5]:
    display_str = "{}: {:.2f} average comments per post"
    date_object = dt.datetime.strptime(hour, "%H")
    time_str = dt.datetime.strftime(date_object, "%H:%M")
    display_str = display_str.format(time_str, avg_cmts)
    print(display_str)
Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
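The swap step is only needed because `sorted` compares lists by their first element. As an alternative sketch, a `key` function can sort the `[hour, average]` pairs directly; the sample values below are made up and only mirror the shape of `avg_by_hour`.

```python
# Made-up averages in the same [hour, average] shape as avg_by_hour.
sample_avg_by_hour = [['02', 11.1], ['15', 28.7], ['13', 16.3], ['09', 6.7]]

# Sort by the average (second element of each pair), highest first.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```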

From the output, Ask HN posts created at 15:00 receive the most comments on average, followed by posts created at 13:00. Times in the data set are in Eastern Time (ET). I live in India, so let's convert these top 5 entries to Indian Standard Time (IST), which is 9 hours 30 minutes ahead of ET during daylight saving time (and 10 hours 30 minutes ahead during standard time). The conversion below uses the 9 hour 30 minute offset.

In [124]:
print("Top 5 Hours for Ask Posts Comments in IST timezone")
for avg_cmts, hour in sorted_swap[:5]:
    display_str = "{}: {:.2f} average comments per post"
    datetime_object = dt.datetime.strptime(hour, "%H")
    time_object = dt.timedelta(hours=9, minutes=30)
    ist_time_object = datetime_object + time_object
    time_str = dt.datetime.strftime(ist_time_object, "%H:%M")
    display_str = display_str.format(time_str, avg_cmts)
    print(display_str)
Top 5 Hours for Ask Posts Comments in IST timezone
00:30: 28.68 average comments per post
22:30: 16.32 average comments per post
21:30: 12.38 average comments per post
11:30: 11.14 average comments per post
19:30: 10.68 average comments per post
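The offset between ET and IST depends on whether daylight saving time is in effect in the Eastern time zone. The sketch below shows the same conversion with timezone-aware datetimes, using fixed UTC offsets as a simplifying assumption (no daylight saving rules are modelled); `eastern_hour_to_ist` is a hypothetical helper, and the date used is arbitrary.

```python
import datetime as dt

# Fixed UTC offsets (assumption: daylight saving transitions are not modelled).
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), "IST")

def eastern_hour_to_ist(hour, eastern_tz):
    """Convert an 'HH' hour string in the given Eastern offset to an IST 'HH:MM' string."""
    # The date is arbitrary; only the time of day matters for this conversion.
    eastern_time = dt.datetime(2016, 9, 26, int(hour), 0, tzinfo=eastern_tz)
    return eastern_time.astimezone(IST).strftime("%H:%M")

print(eastern_hour_to_ist("15", EDT))
print(eastern_hour_to_ist("15", EST))
```

Under these assumptions, the same Eastern hour maps to IST times one hour apart depending on the season, so the top hours listed above shift by an hour for posts created during Eastern Standard Time.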

For someone posting from India, Ask HN posts created around 12:30 AM IST received the most comments on average in this data set.