Exploring Hacker News Posts

Aim

The aim of this project is to investigate which features of a post predict the number of comments and votes it attracts, and to gain insight into the type and timing of posts that attract the most comments on Hacker News. We will first compare which type of post, Ask HN or Show HN, receives the higher average number of comments. We will then analyze the type with the higher average in more detail by determining the average number of comments for posts created in each hour of the day (adjusted for my current timezone in Romania!). Finally, we will do a similar analysis of the number of points per post.

About Hacker News

Hacker News (HN) is a website started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Full dataset

The full dataset contains approximately 300,000 Hacker News posts from the 12 months up to September 26, 2016. The reduced dataset of approximately 20,000 rows used for this project was extracted from it by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
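
The exact reduction procedure is not documented here, so the following is only a minimal sketch of the two steps described above (drop posts with no comments, then sample at random). The function name reduce_dataset, the fixed seed and the toy rows are assumptions made for illustration; the rows use the same column order as the reduced dataset.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop posts with zero comments, then sample the rest without replacement."""
    # num_comments is at column index 4 in the column order used in this notebook
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is repeatable
    return rng.sample(commented, min(sample_size, len(commented)))

# Hypothetical toy rows: id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'Post A', 'http://a.example', '5', '0', 'user1', '1/1/2016 0:00'],
    ['2', 'Post B', 'http://b.example', '9', '3', 'user2', '1/2/2016 8:15'],
    ['3', 'Post C', 'http://c.example', '2', '1', 'user3', '1/3/2016 9:30'],
    ['4', 'Post D', 'http://d.example', '7', '12', 'user4', '1/4/2016 14:45'],
]

reduced = reduce_dataset(toy_rows, sample_size=2)
print(len(reduced))
print(all(int(row[4]) > 0 for row in reduced))
```

Capping the sample size at the number of remaining rows avoids the ValueError that random.sample raises when asked for more items than are available.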

The dataset includes the following columns:

  • title: the title of the post

  • url: the url of the item being linked to

  • num_points: the number of upvotes the post received

  • num_comments: the number of comments the post received

  • author: the name of the account that made the post

  • created_at: the date and time the post was made (the time zone is Eastern Time in the US)

GitHub link to HN scraper.

Reading the reduced dataset csv file

The reduced dataset used in this project is stored in the file 'hacker_news.csv'. We open this file and read it into the workspace as a list of lists, storing the header row in the variable hn_headers and the remaining rows in the variable hn.

In [1]:
import csv

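# open the file with a context manager so that it is closed after reading;
# newline='' is the setting the csv module documentation recommends
with open('/content/drive/My Drive/Datasets/hacker_news.csv', newline='') as opened_file:
  read_file = csv.reader(opened_file)
  hn = list(read_file)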
hn_headers = hn[0]
hn = hn[1:]

# display the headers row
print(hn_headers)
print()
# display first 5 posts in database
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

Separate types of post into 'Ask HN', 'Show HN' and other

Some post titles begin with either Ask HN or Show HN. Users submit Show HN posts to show the HN community a project, product, or just generally something interesting. Likewise, users submit Ask HN posts to ask the HN community a specific question. We will investigate which of these post types receives more comments on average.

We begin by separating the dataset into 3 lists called ask_posts, show_posts and other_posts:

In [2]:
ask_posts, show_posts, other_posts = [], [], []

for row in hn:
  title = row[1]
  if title.lower().startswith('ask hn'):
    ask_posts.append(row)
  elif title.lower().startswith('show hn'):
    show_posts.append(row)
  else:
    other_posts.append(row)

# display number of posts in each list
print('Number of posts with title beginning \'Ask HN\': ',len(ask_posts))
print('Number of posts with title beginning \'Show HN\': ',len(show_posts))
print('Number of posts with title beginning with neither : ',len(other_posts))
Number of posts with title beginning 'Ask HN':  1744
Number of posts with title beginning 'Show HN':  1162
Number of posts with title beginning with neither :  17194

We will compare the average number of comments for Ask HN posts vs. Show HN posts by creating a function to return the average number of comments for posts in a list given as argument:

In [3]:
def average_comments(posts):
  total_comments = 0
  for post in posts:
    total_comments += int(post[4])
  
  num_posts = len(posts)
  return total_comments / num_posts

avg_ask_comments = average_comments(ask_posts)
avg_show_comments = average_comments(show_posts)
avg_other_comments = average_comments(other_posts)

# display average comments for types of post
print('Average number of comments for posts with title beginning \'Ask HN\': ', avg_ask_comments)
print('Average number of comments for posts with title beginning \'Show HN\': ', avg_show_comments)
print('Average number of comments for other posts: ', avg_other_comments)
Average number of comments for posts with title beginning 'Ask HN':  14.038417431192661
Average number of comments for posts with title beginning 'Show HN':  10.31669535283993
Average number of comments for other posts:  26.8730371059672

The mean number of comments on Ask HN posts (about 14.0) is greater than the mean number of comments on Show HN posts (about 10.3). Since Ask HN posts receive more comments on average, we will focus on these posts in more detail.

How does the time a post is created affect the number of comments?

We will determine whether Ask HN posts created at a certain time are more likely to attract comments. For each hour of the day, we will calculate the number of posts and the total number of comments on those posts, and use these values to calculate the average number of comments for posts created in each hour of the day.

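By displaying the value of the created_at column for the first 10 posts in the hn list of lists, with reference to a Python strftime format guide, we can deduce that the dates and times are written as month/day/year hour:minute, with no zero padding on the month, day or hour. The format string '%m/%d/%Y %H:%M' parses these values, because strptime accepts unpadded numbers for %m, %d and %H. The form '%-m/%-d/%Y %-H:%M' describes the unpadded output, but the '-' flag is a platform-specific strftime extension and is not accepted by strptime.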

In [4]:
for post in hn[:10]:
  print(post[-1])
8/4/2016 11:52
1/26/2016 19:30
6/23/2016 22:20
6/17/2016 0:01
9/30/2015 4:12
10/31/2015 9:48
11/13/2015 0:45
8/16/2016 9:55
3/22/2016 16:18
10/13/2015 9:30
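
As a quick check of the format deduced above, the sketch below parses a few of the displayed values with '%m/%d/%Y %H:%M'. The variable names date_format, samples and parsed are chosen for illustration only.

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'

# values copied from the created_at column shown above
samples = ['8/4/2016 11:52', '6/17/2016 0:01', '10/31/2015 9:48']

# strptime accepts the unpadded month, day and hour with the plain directives
parsed = [dt.datetime.strptime(sample, date_format) for sample in samples]

for sample, value in zip(samples, parsed):
    print(sample, '->', value.isoformat(), '| hour =', value.hour)
```
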
In [5]:
import datetime as dt

strftime_format = '%m/%d/%Y %H:%M'
result_list = []

for row in ask_posts: # iterate over ask_posts
  created_at_string = row[-1]
  created_at_dt = dt.datetime.strptime(created_at_string, strftime_format)
  num_comments = int(row[4])
  result_list.append([created_at_dt, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
  if str(result[0].hour) in counts_by_hour:
    counts_by_hour[str(result[0].hour)] += 1
    comments_by_hour[str(result[0].hour)] += result[1]
  else:
    counts_by_hour[str(result[0].hour)] = 1
    comments_by_hour[str(result[0].hour)] = result[1]

# initialize and populate a list of lists where each element is a list containing
# the hour of day and the average number of comments for posts made during that hour
avg_by_hour = []
for hour in counts_by_hour:
  avg = comments_by_hour[hour] / counts_by_hour[hour]
  avg_by_hour.append([hour, avg])

# order the list by desc average number of comments and display
sorted_avg_by_hour = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)
print(sorted_avg_by_hour)

# display top 5 hours with highest average number of comments for 'Ask HN' posts
print()
print('Top 5 hours for \'Ask HN\' post comments')
template = '{hour}: average number of comments per post: {average:.2f}'
for item in sorted_avg_by_hour[:5]:
  hour_dt = dt.datetime.strptime(item[0], '%H')
  print(template.format(hour=hour_dt.strftime('%H:%M'), average = item[1]))  
[['15', 38.5948275862069], ['2', 23.810344827586206], ['20', 21.525], ['16', 16.796296296296298], ['21', 16.009174311926607], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['18', 13.20183486238532], ['17', 11.46], ['1', 11.383333333333333], ['11', 11.051724137931034], ['19', 10.8], ['8', 10.25], ['5', 10.08695652173913], ['12', 9.41095890410959], ['6', 9.022727272727273], ['0', 8.127272727272727], ['23', 7.985294117647059], ['7', 7.852941176470588], ['3', 7.796296296296297], ['4', 7.170212765957447], ['22', 6.746478873239437], ['9', 5.5777777777777775]]

Top 5 hours for 'Ask HN' post comments
15:00: average number of comments per post: 38.59
02:00: average number of comments per post: 23.81
20:00: average number of comments per post: 21.52
16:00: average number of comments per post: 16.80
21:00: average number of comments per post: 16.01

We can see that Ask HN posts created between 15:00 and 16:00 (Eastern Time) receive the highest mean number of comments. The times in the created_at column are in US Eastern Time (ET), which is UTC-5 in winter and UTC-4 in summer. I am currently in Romania (land of Dracula and lots and lots of meat!), which observes Eastern European Time (EET, UTC+2) in winter and Eastern European Summer Time (EEST, UTC+3) in summer, so my local time is 7 hours ahead of Eastern Time for almost all of the year. We will print out the top 5 hours for average number of comments per post, adjusted for this 7-hour difference:

In [6]:
print()
print('Top 5 hours for \'Ask HN\' post comments in EEST')
template = '{hour}: average number of comments per post: {average:.2f}'
for item in sorted_avg_by_hour[:5]:
  new_hour = (int(item[0]) + 7) % 24 # modular 24 value required
  hour_dt = dt.datetime.strptime(str(new_hour), '%H')
  print(template.format(hour=hour_dt.strftime('%H:%M'), average = item[1])) 
Top 5 hours for 'Ask HN' post comments in EEST
22:00: average number of comments per post: 38.59
09:00: average number of comments per post: 23.81
03:00: average number of comments per post: 21.52
23:00: average number of comments per post: 16.80
04:00: average number of comments per post: 16.01

We can see that the best hour of the day to publish an Ask HN post to maximize the number of comments received is between 10pm and 11pm in Romania (3pm to 4pm Eastern Time). Ask HN posts published during this hour receive on average about 62% more comments than those published during the next-best hour (38.59 vs. 23.81). Recall that we have only analyzed the Ask HN posts by hour of day (since these posts received more comments on average than Show HN posts), so the conclusion could be different for other posts.
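
The 7-hour shift applied above can be checked with timezone-aware datetimes. The sketch below is a minimal illustration that assumes fixed winter offsets (UTC-5 for Eastern Time, UTC+2 for Romania). The helper name et_hour_to_romania and the calendar date are arbitrary choices, and the standard-library zoneinfo module would be needed to follow daylight saving changes.

```python
import datetime as dt

# Fixed offsets (an assumption): winter time on both sides
ET = dt.timezone(dt.timedelta(hours=-5), 'EST')
EET = dt.timezone(dt.timedelta(hours=2), 'EET')

def et_hour_to_romania(hour):
    """Convert an hour of day in Eastern Time to the hour of day in Romania."""
    # the date is arbitrary; only the hour matters for a fixed offset
    et_time = dt.datetime(2016, 1, 26, hour, 0, tzinfo=ET)
    return et_time.astimezone(EET).hour

# top 5 Ask HN hours (ET) from the output above
for hour in [15, 2, 20, 16, 21]:
    print('{:02d}:00 ET -> {:02d}:00 in Romania'.format(hour, et_hour_to_romania(hour)))
```

Letting astimezone do the conversion also handles the wrap past midnight, which the manual version handles with the modulo 24 step.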

Analyzing the number of points different posts receive

We have analyzed the number of comments received by different types of posts and the average number of comments for each hour of the day for Ask HN posts. We will now repeat a similar analysis for the number of points received by HN posts:

In [7]:
def average_points(posts):
  total_points = 0
  for post in posts:
    total_points += int(post[3])
  
  num_posts = len(posts)
  return total_points / num_posts

avg_ask_points = average_points(ask_posts)
avg_show_points = average_points(show_posts)
avg_other_points = average_points(other_posts)

# display average points for types of posts
print('Average number of points for posts with title beginning \'Ask HN\': ', avg_ask_points)
print('Average number of points for posts with title beginning \'Show HN\': ', avg_show_points)
print('Average number of points for other posts: ', avg_other_points)
Average number of points for posts with title beginning 'Ask HN':  15.061926605504587
Average number of points for posts with title beginning 'Show HN':  27.555077452667813
Average number of points for other posts:  55.4067698034198

It appears that Show HN posts receive more points on average than Ask HN posts, so we will calculate an hourly average number of points for this subset of the dataset.

In [8]:
# datetime module already imported

# strftime_format = '%m/%d/%Y %H:%M'
result_list = []

for row in show_posts: # iterate over show_posts
  created_at_string = row[-1]
  created_at_dt = dt.datetime.strptime(created_at_string, strftime_format)
  num_points = int(row[3])
  result_list.append([created_at_dt, num_points])

counts_by_hour = {}
points_by_hour = {}

for result in result_list:
  if str(result[0].hour) in counts_by_hour:
    counts_by_hour[str(result[0].hour)] += 1
    points_by_hour[str(result[0].hour)] += result[1]
  else:
    counts_by_hour[str(result[0].hour)] = 1
    points_by_hour[str(result[0].hour)] = result[1]

# initialize and populate a list of lists where each element is a list containing
# the hour of day and the average number of points for posts made during that hour
avg_by_hour = []
for hour in counts_by_hour:
  avg = points_by_hour[hour] / counts_by_hour[hour]
  avg_by_hour.append([hour, avg])

# order the list by desc average number of points and display
sorted_avg_by_hour = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)
print(sorted_avg_by_hour)

# display top 5 hours with highest average number of points for 'Show HN' posts
print()
print('Top 5 hours for \'Show HN\' post points')
template = '{hour}: average number of points per post: {average:.2f}'
for item in sorted_avg_by_hour[:5]:
  hour_dt = dt.datetime.strptime(item[0], '%H')
  print(template.format(hour=hour_dt.strftime('%H:%M'), average = item[1]))  
[['23', 42.388888888888886], ['12', 41.68852459016394], ['22', 40.34782608695652], ['0', 37.83870967741935], ['18', 36.31147540983606], ['11', 33.63636363636363], ['19', 30.945454545454545], ['20', 30.316666666666666], ['15', 28.564102564102566], ['16', 28.322580645161292], ['17', 27.107526881720432], ['14', 25.430232558139537], ['3', 25.14814814814815], ['1', 25.0], ['13', 24.626262626262626], ['6', 23.4375], ['7', 19.0], ['10', 18.916666666666668], ['9', 18.433333333333334], ['21', 18.425531914893618], ['8', 15.264705882352942], ['4', 14.846153846153847], ['2', 11.333333333333334], ['5', 5.473684210526316]]

Top 5 hours for 'Show HN' post points
23:00: average number of points per post: 42.39
12:00: average number of points per post: 41.69
22:00: average number of points per post: 40.35
00:00: average number of points per post: 37.84
18:00: average number of points per post: 36.31
In [10]:
# adjusting to the land of Dracula (I waaaanntt to succckkk youur blooood...) timezone
print()
print('Top 5 hours for \'Show HN\' post points in EEST')
template = '{hour}: average number of points per post: {average:.2f}'
for item in sorted_avg_by_hour[:5]:
  new_hour = (int(item[0]) + 7) % 24 # modular 24 value required
  hour_dt = dt.datetime.strptime(str(new_hour), '%H')
  print(template.format(hour=hour_dt.strftime('%H:%M'), average = item[1])) 
Top 5 hours for 'Show HN' post points in EEST
06:00: average number of points per post: 42.39
19:00: average number of points per post: 41.69
05:00: average number of points per post: 40.35
07:00: average number of points per post: 37.84
01:00: average number of points per post: 36.31

Show HN posts published between 6am and 7am EEST (11pm to midnight Eastern Time) receive the highest average number of points, but there is not much difference between the averages for the top 5 hours in the list: the top hour has approximately 17% more points per post than the fifth-ranked hour.
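
The relative differences between hours quoted in this analysis can be reproduced with a short calculation. The sketch below uses a hypothetical helper percent_more and averages copied from the outputs above. It gives roughly 62% for the top two Ask HN hours (comments) and roughly 17% for the first and fifth Show HN hours (points).

```python
def percent_more(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return (a / b - 1) * 100

# Ask HN average comments for hours 15 and 2 (ET), copied from the output above
ask_top, ask_second = 38.5948275862069, 23.810344827586206

# Show HN average points for hours 23 and 18 (ET), copied from the output above
show_top, show_fifth = 42.388888888888886, 36.31147540983606

print('Ask HN, top hour vs. second hour: {:.1f}% more comments'.format(
    percent_more(ask_top, ask_second)))
print('Show HN, top hour vs. fifth hour: {:.1f}% more points'.format(
    percent_more(show_top, show_fifth)))
```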

Analyzing 'other' type posts

In this analysis, other_posts is the subset of rows from the dataset whose title does not begin with 'Ask HN' or 'Show HN'. We refer to these as 'other' posts.

We redisplay the earlier results below:

In [11]:
print('Comments: ')
# display average comments for types of post
print('Average number of comments for posts with title beginning \'Ask HN\': ', avg_ask_comments)
print('Average number of comments for posts with title beginning \'Show HN\': ', avg_show_comments)
print('Average number of comments for other posts: ', avg_other_comments)
print()
print('Points: ')
# display average points for types of posts
print('Average number of points for posts with title beginning \'Ask HN\': ', avg_ask_points)
print('Average number of points for posts with title beginning \'Show HN\': ', avg_show_points)
print('Average number of points for other posts: ', avg_other_points)
Comments: 
Average number of comments for posts with title beginning 'Ask HN':  14.038417431192661
Average number of comments for posts with title beginning 'Show HN':  10.31669535283993
Average number of comments for other posts:  26.8730371059672
Points: 
Average number of points for posts with title beginning 'Ask HN':  15.061926605504587
Average number of points for posts with title beginning 'Show HN':  27.555077452667813
Average number of points for other posts:  55.4067698034198

It is clear that other posts receive considerably more comments and points per post than Ask HN and Show HN posts: 26.9 comments against 14.0 and 10.3, and 55.4 points against 15.1 and 27.6.

We will repeat the earlier process of breaking down the comments per post and the points per post by hour for the other_posts subset of the dataset, this time ranking each hour by the mean of its two averages:

In [17]:
# datetime module already imported

# strftime_format = '%m/%d/%Y %H:%M'
result_list = []

for row in other_posts: # iterate over other_posts
  created_at_string = row[-1]
  created_at_dt = dt.datetime.strptime(created_at_string, strftime_format)
  num_comments = int(row[4])
  num_points = int(row[3])
  result_list.append([created_at_dt, num_comments, num_points])

counts_by_hour = {}
points_by_hour = {}
comments_by_hour = {}

for result in result_list:
  if str(result[0].hour) in counts_by_hour:
    counts_by_hour[str(result[0].hour)] += 1
    comments_by_hour[str(result[0].hour)] += result[1]
    points_by_hour[str(result[0].hour)] += result[2]
  else:
    counts_by_hour[str(result[0].hour)] = 1
    comments_by_hour[str(result[0].hour)] = result[1]
    points_by_hour[str(result[0].hour)] = result[2]

# initialize and populate a list of lists where each element is a list containing
# the hour of day, the average number of comments and the average number of points
# for posts made during that hour
avg_by_hour = []
for hour in counts_by_hour:
  avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
  avg_points = points_by_hour[hour] / counts_by_hour[hour]
  avg_by_hour.append([hour, avg_comments, avg_points])

# order the list by desc mean of the average number of comments per post and
# the average number of points per post
sorted_avg_by_hour = sorted(avg_by_hour, key=lambda x: (x[1] + x[2]) / 2, reverse=True)

# display by hour (adjusted to EEST from ET) the mean of comments per post and
# points per post, ordered desc
print()
print('Average of comments per post and points per post for \'other\' posts by hour (EEST) ordered desc: ')
template = '{hour}: average of comments and points per post: {average:.2f}'
for item in sorted_avg_by_hour[:24]:
  new_hour = (int(item[0]) + 7) % 24 # modular 24 value required
  hour_dt = dt.datetime.strptime(str(new_hour), '%H')
  avg_comments_points = (item[1] + item[2]) / 2
  print(template.format(hour=hour_dt.strftime('%H:%M'), average = avg_comments_points))  
Average of comments per post and points per post for 'other' posts by hour (EEST) ordered desc: 
21:00: average of comments and points per post: 47.06
20:00: average of comments and points per post: 46.71
22:00: average of comments and points per post: 45.03
19:00: average of comments and points per post: 43.87
18:00: average of comments and points per post: 43.58
17:00: average of comments and points per post: 43.55
02:00: average of comments and points per post: 43.36
09:00: average of comments and points per post: 43.13
00:00: average of comments and points per post: 42.99
07:00: average of comments and points per post: 42.77
10:00: average of comments and points per post: 41.87
14:00: average of comments and points per post: 41.82
16:00: average of comments and points per post: 40.76
15:00: average of comments and points per post: 40.56
01:00: average of comments and points per post: 40.43
23:00: average of comments and points per post: 39.79
06:00: average of comments and points per post: 38.32
12:00: average of comments and points per post: 37.57
11:00: average of comments and points per post: 36.90
08:00: average of comments and points per post: 36.84
05:00: average of comments and points per post: 36.75
04:00: average of comments and points per post: 36.49
03:00: average of comments and points per post: 34.19
13:00: average of comments and points per post: 33.80

Looking at the mean of the comments per post and the points per post for 'other' posts, the evening hours from 6pm to 11pm EEST (the top five hours in the list) are the most successful, although the spread across the day is modest (47.06 for the best hour against 33.80 for the worst).
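
The hourly breakdown has now been written out three times with only the column changing, so it could be factored into one helper. The following is a sketch, not the code used to produce the results above: the name average_by_hour, its parameters and the toy rows are assumptions, and it keys the results by integer hour rather than by string.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'

def average_by_hour(posts, value_index):
    """Return (hour, average) pairs for one numeric column, highest average first."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in posts:
        # created_at is the last column; value_index selects points (3) or comments (4)
        hour = dt.datetime.strptime(row[-1], DATE_FORMAT).hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    averages = [(hour, totals[hour] / counts[hour]) for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy rows: id, title, url, num_points, num_comments, author, created_at
toy_posts = [
    ['1', 'Ask HN: a?', '', '10', '4', 'user1', '8/4/2016 11:52'],
    ['2', 'Ask HN: b?', '', '20', '8', 'user2', '8/5/2016 11:05'],
    ['3', 'Ask HN: c?', '', '6', '1', 'user3', '8/5/2016 9:30'],
]

print(average_by_hour(toy_posts, 4))  # average comments per hour
print(average_by_hour(toy_posts, 3))  # average points per hour
```

Using defaultdict removes the need for the if/else branch that initializes each hour on first sight.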

Conclusion

It stands to reason that Ask HN posts receive more comments and fewer points per post than Show HN posts. Questions are likely to attract more comments since by their nature they solicit qualitative interaction, whereas expositions are more likely to attract quantitative kudos than comment engagement. Posts in the other_posts subset appear to be more successful in the evening Romanian time, which is late morning to mid-afternoon US Eastern Time. To draw the most useful conclusions from this analysis, it is important to focus on the timezones where Hacker News has the largest number of users, most likely the US Eastern and Pacific timezones.
The dataset has a limited number of columns, which limits the scope for extending this project unless we find or scrape data with more features. It would still be possible to investigate the profiles of the 'other' posts further: there may be certain authors or URLs whose posts receive a high number of comments or points per post.
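
As a starting point for that further investigation, the sketch below groups posts by the domain of their URL and averages the points per domain. It was not run on the real dataset: the function name average_points_by_domain, the min_posts threshold and the toy rows are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def average_points_by_domain(posts, min_posts=1):
    """Return (domain, average points, post count) tuples, highest average first."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in posts:
        domain = urlparse(row[2]).netloc.lower()  # row[2] is the url column
        if not domain:  # skip posts without a usable URL
            continue
        totals[domain] += int(row[3])  # row[3] is num_points
        counts[domain] += 1
    averages = [
        (domain, totals[domain] / counts[domain], counts[domain])
        for domain in counts
        if counts[domain] >= min_posts  # ignore domains with too few posts
    ]
    return sorted(averages, key=lambda item: item[1], reverse=True)

# Hypothetical toy rows: id, title, url, num_points, num_comments, author, created_at
toy_posts = [
    ['1', 'Post A', 'http://blog.example/a', '10', '2', 'user1', '1/1/2016 0:00'],
    ['2', 'Post B', 'http://blog.example/b', '30', '5', 'user2', '1/2/2016 8:15'],
    ['3', 'Post C', 'https://news.example/c', '8', '1', 'user3', '1/3/2016 9:30'],
    ['4', 'Post D', '', '50', '9', 'user4', '1/4/2016 14:45'],
]

print(average_points_by_domain(toy_posts))
```

On the real data a min_posts threshold above 1 would probably be sensible, so that a single popular post does not put its domain at the top of the list. The same grouping could be applied to the author column by replacing the domain lookup with row[5].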