In this project we are tasked with recommending what to post on Hacker News in order to reach the most people. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
We're specifically interested in posts whose titles begin with either 'Ask HN' or 'Show HN'. Users submit 'Ask HN' posts to ask the Hacker News community a specific question and 'Show HN' to show off projects or information relevant to the community.
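We'll compare these two types of posts to determine which receives more comments and points on average, and whether the day of the week or hour at which a post is created affects those averages.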
The dataset can be downloaded here.
On the whole, we see the following average metrics for ask and show posts.
Average comments per ask post: 10.39
Average points per ask post: 11.31
Average comments per show post: 4.89
Average points per show post: 14.84
If we want to pursue comment engagement, we should focus on ask posts. Show posts receive more points on average but far fewer comments.
# Read in the data as a list of lists
from csv import reader

with open("HN_posts_year_to_Sep_26_2016.csv") as handle:
    data_set = list(reader(handle))

# Separate the header row and the data into their own variables
data_set_header = data_set[:1]
data_set = data_set[1:]

# Simple exploratory data query
print("Headers:")
print(data_set_header)
print('\n')
print("Sample Data:")
for row in data_set[:5]:
    print(row)
print('\n')
print('Number of rows:', len(data_set))
Headers:
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
Sample Data:
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
Number of rows: 293119
Our dataset consists of 7 columns and 293,119 rows of post data. The most useful columns for our purposes are title, num_comments, num_points, and created_at.
The posts we are interested in analyzing are those whose titles start with 'Ask HN' or 'Show HN'. There may be other categories of posts, but we will focus on these two. Let's start by making sure there are enough of each type of post to conduct a fair analysis.
# Lists that will store our three categories of posts
ask_posts = []
show_posts = []
other_posts = []

# Loop through the dataset and separate the rows, by title, into our three lists
for row in data_set:
    title = row[1].lower()
    # startswith checks whether the first characters of a string match the argument
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):  # elif keeps ask posts out of other_posts
        show_posts.append(row)
    else:
        other_posts.append(row)

# Sample output for our two relevant lists
print("Ask Posts Titles Sample:")
print(ask_posts[:5])
print("Total posts: ", len(ask_posts))
print('\n')
print("Show Posts Titles Sample:")
print(show_posts[:5])
print("Total posts: ", len(show_posts))
Ask Posts Titles Sample:
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]
Total posts: 9139
Show Posts Titles Sample:
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17'], ['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']]
Total posts: 10158
There is enough data in both categories for a fair analysis: 9,139 'Ask HN' posts and 10,158 'Show HN' posts.
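The prefix check used to split the posts can be sketched on a handful of titles. This is a minimal illustration rather than the notebook's own code: the helper name categorize is an assumption, and the sample titles are copied from the outputs above. Chaining the checks with elif means each title lands in exactly one bucket.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' from the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    elif lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Titles copied from the sample output above
sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'SQLAR the SQLite Archiver',
]

# Each title is appended to exactly one bucket
buckets = {'ask': [], 'show': [], 'other': []}
for title in sample_titles:
    buckets[categorize(title)].append(title)

for name, titles in buckets.items():
    print(name, len(titles))
```

Because the comparison runs on the lowercased title, a title written as 'ASK HN: ...' would also be counted as an ask post.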
Next, let's compute the average number of comments and points per post for each category.
# Running totals of comments and points across the ask posts
total_ask_comments = 0
total_ask_points = 0
for row in ask_posts:
    total_ask_comments += int(row[4])  # num_comments column
    total_ask_points += int(row[3])    # num_points column
avg_ask_comments = total_ask_comments / len(ask_posts)  # average comments per post
avg_ask_points = total_ask_points / len(ask_posts)
print("Average comments per ask post:", round(avg_ask_comments, 2))
print("Average points per ask post:", round(avg_ask_points, 2))
print('\n')

# Running totals of comments and points across the show posts
total_show_comments = 0
total_show_points = 0
for row in show_posts:
    total_show_comments += int(row[4])
    total_show_points += int(row[3])
avg_show_comments = total_show_comments / len(show_posts)
avg_show_points = total_show_points / len(show_posts)  # divide by the number of show posts
print("Average comments per show post:", round(avg_show_comments, 2))
print("Average points per show post:", round(avg_show_points, 2))
Average comments per ask post: 10.39
Average points per ask post: 11.31
Average comments per show post: 4.89
Average points per show post: 14.84
Ask posts receive fewer points on average than show posts, but roughly twice as many comments. If our goal is to create posts with a lot of community engagement, comments matter more and 'Ask HN' is the better opportunity. However, it may be prudent to create a mix of both types of posts.
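The averaging step can also be written once and reused for either category. The sketch below is an illustration under assumptions: the helper name average_column is hypothetical, and the three rows are copied from the Ask HN sample output above, so the printed values describe only those three rows, not the full dataset. Taking the total and the count from the same list keeps each average tied to its own category.

```python
def average_column(rows, index):
    """Mean of the integer values stored as strings in column `index` (hypothetical helper)."""
    if not rows:
        return 0.0
    return sum(int(row[index]) for row in rows) / len(rows)

# Three rows copied from the Ask HN sample output above
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

avg_points = average_column(sample_ask, 3)    # num_points column
avg_comments = average_column(sample_ask, 4)  # num_comments column
print("Sample average points:", round(avg_points, 2))
print("Sample average comments:", round(avg_comments, 2))
```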
Next, we'll determine the best time to submit each type of post. We'll do this by finding the day of the week and the hour of the day with the highest average comments and points. First, however, we need to prepare our dataset.
The created_at values are currently stored as strings. Converting them to datetime objects lets us extract more information, such as the weekday and the hour, than we could from the raw strings. Let's convert the dates in the ask and show lists.
import datetime as dt  # import the Python 'datetime' library

date_format = "%m/%d/%Y %H:%M"  # template describing the layout of our timestamp strings

# Convert the created_at string of every ask and show row to a datetime, in place
for row in ask_posts + show_posts:
    time_stamp = row[6]
    row[6] = dt.datetime.strptime(time_stamp, date_format)

print(ask_posts[:1], '\n', show_posts[:1])
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', datetime.datetime(2016, 9, 26, 2, 53)]]
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', datetime.datetime(2016, 9, 26, 0, 36)]]
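To make concrete what the conversion buys us, here is a small sketch that parses one timestamp from the sample output and reads off the fields the later analysis relies on. The variable name parsed is an assumption for illustration.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Timestamp string copied from the first Ask HN row above
parsed = dt.datetime.strptime('9/26/2016 2:53', date_format)

print(parsed.hour)       # hour on the 24-hour clock
print(parsed.minute)     # minutes past the hour
print(parsed.weekday())  # Monday is 0, Sunday is 6
```

September 26, 2016 was a Monday, so weekday() returns 0 for this timestamp.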
We will be repeating many of the same steps whether we're analyzing the hour of the day or the day of the week, and whether we're measuring points or comments. A function will save us time, so let's create one now.
def analyze_time(data, weekday_or_hour, points_or_comments):
    # check that the right values are entered
    if weekday_or_hour not in ('weekday', 'hour'):
        print('please enter weekday or hour for weekday_or_hour')
        return
    if points_or_comments not in ('points', 'comments'):
        print('please use points or comments for points_or_comments')
        return

    # reusable non-specific metrics
    count_by_time = {}
    metric_by_time = {}

    # count and sum tables based on conditions
    for row in data:
        # select the time bucket and the metric based on the arguments
        if weekday_or_hour == 'weekday':
            time = row[6].weekday()
        else:
            time = row[6].hour
        if points_or_comments == 'points':
            metric = float(row[3])
        else:
            metric = float(row[4])

        # create a frequency table
        if time not in count_by_time:
            count_by_time[time] = 1
        else:
            count_by_time[time] += 1

        # create a sum table
        if time not in metric_by_time:
            metric_by_time[time] = metric
        else:
            metric_by_time[time] += metric

    # create an average metric list by dividing our dictionaries
    avg_metric_by_time = []
    for time in count_by_time:
        avg_metric_by_time.append([metric_by_time[time] / count_by_time[time], time])

    # sort the table from highest to lowest average
    sorted_avg_metric_by_time = sorted(avg_metric_by_time, reverse=True)
    return sorted_avg_metric_by_time
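The frequency-table and sum-table pattern used in the function can be sketched more compactly with collections.defaultdict. This is an alternative illustration, not a replacement for the function above: the name average_by, its parameters, and the three rows are assumptions made up for the example and are not taken from the dataset.

```python
import datetime as dt
from collections import defaultdict

def average_by(rows, time_key, value_index):
    """Average the numeric column `value_index`, grouped by time_key(created_at).

    Hypothetical helper: returns [average, bucket] pairs sorted from highest
    to lowest average.
    """
    counts = defaultdict(int)    # frequency table: posts per bucket
    totals = defaultdict(float)  # sum table: metric total per bucket
    for row in rows:
        bucket = time_key(row[6])
        counts[bucket] += 1
        totals[bucket] += float(row[value_index])
    averages = [[totals[bucket] / counts[bucket], bucket] for bucket in counts]
    return sorted(averages, reverse=True)

# Illustrative rows (made up), using the same column layout as the dataset
rows = [
    ['1', 'Ask HN: a', '', '4', '7', 'u1', dt.datetime(2016, 9, 26, 2, 53)],
    ['2', 'Ask HN: b', '', '6', '3', 'u2', dt.datetime(2016, 9, 26, 2, 17)],
    ['3', 'Ask HN: c', '', '1', '0', 'u3', dt.datetime(2016, 9, 25, 22, 57)],
]

result = average_by(rows, lambda created: created.hour, 4)
print(result)  # [[5.0, 2], [0.0, 22]]
```

Passing the grouping as a function (here a lambda returning the hour) is one way to avoid branching on string arguments; passing lambda created: created.weekday() would group by weekday instead.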
We'll start by looking at the average comments and points per post by weekday, for both show and ask posts. Weekdays are numerically encoded, with 0 representing Monday and 6 representing Sunday.
analyze_time(show_posts, 'weekday', 'points')
[[16.116863905325445, 4], [15.418122270742359, 6], [14.691267605633803, 0], [14.646458583433374, 3], [14.642737896494157, 1], [14.419898819561551, 2], [14.231386025200457, 5]]
analyze_time(show_posts, 'weekday', 'comments')
[[5.154585798816568, 4], [5.135469364811692, 2], [4.967587034813926, 3], [4.857961053837342, 5], [4.80056338028169, 0], [4.6705620478575405, 1], [4.472707423580786, 6]]
Friday is the most engaging day for show posts overall, ranking first for both points and comments. Interestingly, Sunday ranks second for points but last for comments, so people tend to upvote posts that day but engage through comments less.
analyze_time(ask_posts, 'weekday', 'points')
[[15.244648318042813, 6], [13.04232424677188, 4], [12.223585548738923, 0], [11.214525139664804, 5], [10.561085972850679, 2], [9.589481373265157, 3], [8.61843876177658, 1]]
analyze_time(ask_posts, 'weekday', 'comments')
[[12.576757532281205, 4], [12.281345565749236, 6], [11.773006134969325, 0], [9.934078212290503, 5], [9.130021913805697, 3], [9.109017496635262, 1], [8.538461538461538, 2]]
For ask posts, the weekday rankings for points and comments agree more closely. The best four days to post are the same for either metric: Sunday and Friday lead (Sunday for points, Friday for comments), followed by Monday and Saturday.
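Since the results are keyed by weekday number, a small lookup makes them easier to read. The sketch below is illustrative: the names WEEKDAY_NAMES and labeled are assumptions, and the three entries are the top rows of the ask posts weekday/comments output above.

```python
# Index matches datetime.weekday(): 0 is Monday, 6 is Sunday
WEEKDAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Top three [average, weekday] pairs copied from the ask/comments output above
ask_weekday_comments = [
    [12.576757532281205, 4],
    [12.281345565749236, 6],
    [11.773006134969325, 0],
]

# Swap each weekday number for its name and round the average
labeled = [(WEEKDAY_NAMES[weekday], round(average, 2))
           for average, weekday in ask_weekday_comments]

for name, average in labeled:
    print(name, average)
```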
Next we'll look at the average comments and points per post by hour of the day, for both show and ask posts. Hours use the 24-hour clock: 0 represents midnight, 12 represents noon, and 23 represents 11 pm.
analyze_time(show_posts, 'hour', 'points')
[[20.905038759689923, 12], [19.258706467661693, 11], [17.018032786885247, 13], [16.057553956834532, 19], [15.994791666666666, 6], [15.862068965517242, 23], [15.547101449275363, 0], [15.144817073170731, 18], [15.09051724137931, 14], [14.683544303797468, 8], [14.340823970037453, 16], [13.995762711864407, 7], [13.95360824742268, 4], [13.94377990430622, 15], [13.930232558139535, 21], [13.88042049934297, 17], [13.331564986737401, 22], [13.321981424148607, 10], [13.234285714285715, 20], [13.224880382775119, 2], [12.456953642384105, 9], [11.866396761133604, 1], [10.662790697674419, 5], [10.524271844660195, 3]]
analyze_time(show_posts, 'hour', 'comments')
[[6.994186046511628, 12], [6.682203389830509, 7], [6.002487562189055, 11], [5.6044303797468356, 8], [5.515804597701149, 14], [5.432786885245902, 13], [5.148325358851674, 2], [5.041237113402062, 4], [5.01978417266187, 19], [4.942073170731708, 18], [4.708333333333333, 6], [4.705368289637953, 16], [4.672185430463577, 9], [4.648550724637682, 0], [4.574162679425838, 15], [4.533980582524272, 3], [4.5266457680250785, 23], [4.252299605781866, 17], [4.158095238095238, 20], [4.090697674418605, 21], [4.0728744939271255, 1], [3.8461538461538463, 22], [3.801857585139319, 10], [3.441860465116279, 5]]
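The noon hour (12) is the best time to submit show posts, ranking first for both points and comments. The 11 am hour is close behind, ranking second for points and third for comments.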
analyze_time(ask_posts, 'hour', 'points')
[[21.637770897832816, 15], [17.93243243243243, 13], [13.576023391812866, 12], [13.436170212765957, 10], [12.189097103918229, 17], [11.156351791530945, 18], [10.944237918215613, 2], [10.905349794238683, 4], [10.67704280155642, 8], [10.50682261208577, 14], [10.310880829015543, 16], [9.789473684210526, 5], [9.733590733590734, 21], [9.439716312056738, 1], [9.418604651162791, 0], [9.402088772845953, 22], [9.3690036900369, 3], [9.153846153846153, 11], [9.026548672566372, 7], [8.805882352941177, 20], [8.675213675213675, 6], [8.66304347826087, 19], [7.941441441441442, 9], [7.626822157434402, 23]]
analyze_time(ask_posts, 'hour', 'comments')
[[28.676470588235293, 15], [16.31756756756757, 13], [12.380116959064328, 12], [11.137546468401487, 2], [10.684397163120567, 10], [9.7119341563786, 4], [9.692007797270955, 14], [9.449744463373083, 17], [9.190661478599221, 8], [8.96474358974359, 11], [8.804177545691905, 22], [8.794258373205741, 5], [8.749019607843136, 20], [8.687258687258687, 21], [7.948339483394834, 3], [7.94299674267101, 18], [7.713298791018998, 16], [7.5647840531561465, 0], [7.407801418439717, 1], [7.163043478260869, 19], [7.013274336283186, 7], [6.782051282051282, 6], [6.696793002915452, 23], [6.653153153153153, 9]]
The 3 pm hour (15) is the best time to submit ask posts, ranking first for both points and comments, followed by the 1 pm hour (13).
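The hour numbers in these results can be turned into 12-hour labels with a short helper. This is a minimal sketch; the function name hour_label is an assumption, and it only formats the hour without making any claim about the time zone of the dataset.

```python
def hour_label(hour):
    """Convert a 24-hour clock hour (0-23) to a 12-hour label (hypothetical helper)."""
    suffix = 'am' if hour < 12 else 'pm'
    display = hour % 12
    if display == 0:
        display = 12  # hours 0 and 12 are shown as 12 on a 12-hour clock
    return f"{display}:00 {suffix}"

for hour in (0, 12, 15, 23):
    print(hour, '->', hour_label(hour))
```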
Depending on the type of engagement that we're after, we may choose to submit an ask post or a show post. Points can be read as some function of the number of people a post reaches: the more points a post receives, the more people likely saw it. Commenting is a less passive form of engagement and may have a larger impact on those who do see the post.
On the whole, we see the following average metrics for ask and show posts.
Average comments per ask post: 10.39
Average points per ask post: 11.31
Average comments per show post: 4.89
Average points per show post: 14.84
If we want to pursue comment engagement, we should focus on ask posts. Show posts receive more points on average but far fewer comments.