Identifying How and When to Post on Hacker News

In this project we are tasked with recommending what to post on Hacker News in order to reach the most people. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either 'Ask HN' or 'Show HN'. Users submit 'Ask HN' posts to ask the Hacker News community a specific question and 'Show HN' posts to show off projects or information relevant to the community.

We'll compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?
Do these patterns change when we look at points instead of comments?

The dataset can be downloaded here.

Summary of Results

Show Posts vs Ask Posts on Average

On average, ask and show posts compare as follows:
Average comments per ask post: 10.39
Average points per ask post: 11.31

Average comments per show post: 4.89
Average points per show post: 14.84

If we want to maximize comment engagement, we should focus on ask posts. Show posts earn more points on average but far fewer comments.

Best Time to Post

Weekday Insights

The most engaging day overall for show posts is Friday, which ranks first for both points and comments. Interestingly, Sunday posts earn the second-highest average points but the fewest average comments.
For ask posts, the weekday rankings for points and comments largely agree. The best four days to post are the same for both metrics: Sunday, Friday, Monday, and Saturday, with Sunday leading on points and Friday leading on comments.

Hourly Insights

Posting between 11:00 am and 1:00 pm (hours 11 and 12) is best for show posts, with hour 12 ranking first for both points and comments.
3:00 pm is the best time to post ask posts.

Data Exploration

In [1]:

# Read in the data as a list of lists
from csv import reader
with open("HN_posts_year_to_Sep_26_2016.csv") as handle:
    data_set = list(reader(handle))

# Separate the header row and the data rows into their own variables
data_set_header = data_set[:1]
data_set = data_set[1:]

# Simple exploratory data query
print("Headers:")
print(data_set_header)
print('\n')
print("Sample Data:")
for row in data_set[:5]:
    print(row)
    print('\n')
    
print('Number of rows:', len(data_set))

Headers:
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


Sample Data:
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of rows: 293119

Our dataset consists of 7 columns and 293,119 rows of post data. The most useful columns for our purposes are title, num_comments, num_points, and created_at.
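As an aside, the four useful columns can also be pulled out by name rather than by position. The following is only a sketch, assuming the header row shown above; it uses `csv.DictReader` on two rows copied from the sample output (held in an in-memory `io.StringIO` buffer instead of the real file), and the names `sample_csv`, `useful_columns`, and `trimmed` are illustrative.

```python
import csv
import io

# Two rows copied from the sample output above, standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader uses the header row as keys, so each data row becomes a dictionary.
rows = list(csv.DictReader(sample_csv))

# Keep only the four columns this analysis relies on.
useful_columns = ["title", "num_comments", "num_points", "created_at"]
trimmed = [{name: row[name] for name in useful_columns} for row in rows]

print(trimmed[0])
```

The rest of this notebook keeps the list-of-lists layout, so positional indexes such as `row[1]` and `row[6]` are used from here on.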

Engagement by Post Type

We know that the posts we are interested in analyzing are those that start with 'Ask HN' or 'Show HN'. There may be other categories of posts, but we will focus on these two. Let's start by making sure there are enough of each type of post to conduct a fair analysis.

In [2]:

# Lists that will store our three categories of posts
ask_posts = []
show_posts = []
other_posts = []

# Loop through the dataset and separate the rows, by title, into our three lists
for row in data_set:
    title = row[1].lower()
    # startswith checks whether a string begins with the given characters
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):  # elif keeps ask posts out of other_posts
        show_posts.append(row)
    else:
        other_posts.append(row)
        
#Sample output for our two relevant lists
print("Ask Posts Titles Sample:")
print(ask_posts[:5])
print("Total posts: ", len(ask_posts))
print('\n')
print("Show Posts Titles Sample:")
print(show_posts[:5])
print("Total posts: ", len(show_posts)) 

Ask Posts Titles Sample:
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]
Total posts:  9139


Show Posts Titles Sample:
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17'], ['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']]
Total posts:  10158

We can see that there is enough data in both categories for a fair analysis: 9,139 'Ask HN' posts and 10,158 'Show HN' posts.
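The title-based split can also be expressed as a small helper, which makes the rule easy to check on a few titles. This is a minimal sketch rather than the code used to produce the counts above; `classify_post` and `sample_titles` are hypothetical names, and the titles are copied from the sample output.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    elif lowered.startswith("show hn"):
        return "show"
    return "other"


# Titles copied from the sample output earlier in the notebook.
sample_titles = [
    "Ask HN: What TLD do you use for local development?",
    "Show HN: Finding puns computationally",
    "algorithmic music",
]

# Tally how many sample titles fall into each category.
counts = {"ask": 0, "show": 0, "other": 0}
for title in sample_titles:
    counts[classify_post(title)] += 1

print(counts)
```

Because each title receives exactly one label, a post can never be counted in two categories.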

Next, let's compute the average number of comments and points for each type of post.

In [3]:

# Running totals of comments and points for our ask posts
total_ask_comments = 0
total_ask_points = 0

for row in ask_posts:
    num_comments = int(row[4])
    num_points = int(row[3])
    total_ask_comments += num_comments  # add this row's comment count to the running total
    total_ask_points += num_points

avg_ask_comments = total_ask_comments / len(ask_posts)  # average comments per ask post
avg_ask_points = total_ask_points / len(ask_posts)  # average points per ask post
print("Average comments per ask post:", avg_ask_comments)
print("Average points per ask post:", avg_ask_points)

print('\n')

# Running totals of comments and points for our show posts
total_show_comments = 0
total_show_points = 0

for row in show_posts:
    num_comments = int(row[4])
    num_points = int(row[3])
    total_show_comments += num_comments
    total_show_points += num_points

avg_show_comments = total_show_comments / len(show_posts)
avg_show_points = total_show_points / len(show_posts)  # divide by the number of show posts
print("Average comments per show post:", avg_show_comments)
print("Average points per show post:", avg_show_points)

Average comments per ask post: 10.393478498741656
Average points per ask post: 11.31174089068826


Average comments per show post: 4.886099625910612
Average points per show post: 14.843571569206537
Ask posts receive fewer points on average but more than twice as many comments. If our goal is to create posts with a lot of community engagement, comments matter more and 'Ask HN' is the better choice. However, it may be prudent to create a mix of both types of posts.
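Since the same averaging is done four times, it could be wrapped in one helper that always divides by the length of the list it was given. The sketch below assumes the column layout shown earlier (points in column 3, comments in column 4); `average_column` is a hypothetical name, and the three rows are copied from the ask post sample, so the averages are for illustration only.

```python
def average_column(rows, index):
    """Average the integer values stored as strings in one column of rows."""
    if not rows:
        raise ValueError("cannot average an empty list of rows")
    return sum(int(row[index]) for row in rows) / len(rows)


# Three rows copied from the ask post sample output.
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

sample_avg_points = average_column(sample_ask, 3)    # (4 + 6 + 1) / 3
sample_avg_comments = average_column(sample_ask, 4)  # (7 + 3 + 0) / 3

print(round(sample_avg_points, 2), round(sample_avg_comments, 2))
```

Passing the list once, rather than naming it separately for the sum and the count, removes the chance of pairing a total with the wrong denominator.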

Engagement by Time

We'll want to determine the best time to publish each type of post. We'll do this by finding the day of the week and the hour of the day with the highest average comments and points. First, however, we will need to prepare our dataset.

Turning Date Values to Datetime Format

Datetime objects will allow us to extract more information from our dates than we could from the raw timestamp strings. Let's convert the created_at values to datetimes for the ask and show lists.

In [4]:

import datetime as dt #import the Python 'datetime' library

date_format = "%m/%d/%Y %H:%M" # template to positionally encode our timestamps to datetime format

# Convert the created_at string (column 6) to a datetime object in both lists
for row in ask_posts + show_posts:
    time_stamp = row[6]
    row[6] = dt.datetime.strptime(time_stamp, date_format)  # parse using our template
    
print(ask_posts[:1],'\n',show_posts[:1])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', datetime.datetime(2016, 9, 26, 2, 53)]] 
 [['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', datetime.datetime(2016, 9, 26, 0, 36)]]
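To illustrate what the conversion buys us, here is a short sketch that parses one timestamp from the sample output with the same format string and reads off the two attributes used later in the analysis. The variable name `created_at` is illustrative.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Timestamp copied from the first ask post in the sample output.
created_at = dt.datetime.strptime("9/26/2016 2:53", date_format)

# weekday() returns 0 for Monday through 6 for Sunday.
print(created_at.weekday())

# hour is on the 24-hour clock, from 0 to 23.
print(created_at.hour)
```

September 26, 2016 was a Monday, so `weekday()` returns 0 here and `hour` returns 2.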

Identifying Best Times to Post

We will be repeating many of the same steps whether we're analyzing the hour of the day or the day of the week, and whether we're measuring points or comments. A function will save us time, so let's create one now.

In [5]:

def analyze_time(data, weekday_or_hour, points_or_comments):
    """Return [average metric, time bucket] pairs, sorted from highest average to lowest."""
    # check that the right values are entered
    if weekday_or_hour not in ('weekday', 'hour'):
        print('please enter weekday or hour for weekday_or_hour')
        return
    if points_or_comments not in ('points', 'comments'):
        print('please use points or comments for points_or_comments')
        return

    # num_points is column 3 and num_comments is column 4
    metric_index = 3 if points_or_comments == 'points' else 4

    count_by_time = {}   # number of posts in each time bucket
    metric_by_time = {}  # summed points or comments in each time bucket

    for row in data:
        # row[6] is a datetime object; pick the weekday or the hour as the bucket
        time = row[6].weekday() if weekday_or_hour == 'weekday' else row[6].hour
        metric = float(row[metric_index])

        # update the frequency table and the sum table
        count_by_time[time] = count_by_time.get(time, 0) + 1
        metric_by_time[time] = metric_by_time.get(time, 0) + metric

    # create an average metric list by dividing the sums by the counts
    avg_metric_by_time = []
    for time in count_by_time:
        avg_metric_by_time.append([metric_by_time[time] / count_by_time[time], time])

    # sort from highest average to lowest
    return sorted(avg_metric_by_time, reverse=True)
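The count-and-sum pattern inside the function can be tried on a handful of values to see how the averages and the sort order come out. The following is a sketch of the same idea using `collections.defaultdict`; `average_by_bucket` is a hypothetical helper, and the numbers in `toy_pairs` are made up for illustration, not taken from the dataset.

```python
from collections import defaultdict


def average_by_bucket(pairs):
    """Average the values in each bucket and sort from highest average to lowest."""
    totals = defaultdict(float)  # summed values per bucket
    counts = defaultdict(int)    # number of values per bucket
    for bucket, value in pairs:
        totals[bucket] += value
        counts[bucket] += 1
    averages = [[totals[bucket] / counts[bucket], bucket] for bucket in counts]
    return sorted(averages, reverse=True)


# Made-up (hour, comments) pairs, for illustration only.
toy_pairs = [(15, 10), (15, 20), (9, 3), (9, 5), (12, 6)]

result = average_by_bucket(toy_pairs)
print(result)
```

Putting the average first in each pair is what lets a plain `sorted(..., reverse=True)` rank the buckets by average.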

The Best Weekdays to Post

We'll start by looking at the average comments and points by weekday for both show and ask posts. Weekdays are numerically encoded, with 0 representing Monday and 6 representing Sunday.

Show_posts Weekday Points and Comments

In [6]:

analyze_time(show_posts, 'weekday', 'points')

Out[6]:

[[16.116863905325445, 4],
 [15.418122270742359, 6],
 [14.691267605633803, 0],
 [14.646458583433374, 3],
 [14.642737896494157, 1],
 [14.419898819561551, 2],
 [14.231386025200457, 5]]

In [7]:

analyze_time(show_posts, 'weekday', 'comments')

Out[7]:

[[5.154585798816568, 4],
 [5.135469364811692, 2],
 [4.967587034813926, 3],
 [4.857961053837342, 5],
 [4.80056338028169, 0],
 [4.6705620478575405, 1],
 [4.472707423580786, 6]]

The most engaging day overall for show posts is Friday, which ranks first for both points and comments. Interestingly, Sunday posts earn the second-highest average points but the fewest average comments.

ask_posts Weekday Points and Comments

In [8]:

analyze_time(ask_posts, 'weekday', 'points')

Out[8]:

[[15.244648318042813, 6],
 [13.04232424677188, 4],
 [12.223585548738923, 0],
 [11.214525139664804, 5],
 [10.561085972850679, 2],
 [9.589481373265157, 3],
 [8.61843876177658, 1]]

In [9]:

analyze_time(ask_posts, 'weekday', 'comments')

Out[9]:

[[12.576757532281205, 4],
 [12.281345565749236, 6],
 [11.773006134969325, 0],
 [9.934078212290503, 5],
 [9.130021913805697, 3],
 [9.109017496635262, 1],
 [8.538461538461538, 2]]

For ask posts, the weekday rankings for points and comments largely agree. The best four days to post are the same for both metrics: Sunday, Friday, Monday, and Saturday, with Sunday leading on points and Friday leading on comments.
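Reading the rankings is easier with day names in place of the numeric codes. The sketch below assumes the 0 = Monday encoding described earlier and relabels the top four rows of Out[8]; `WEEKDAY_NAMES` and `labelled` are illustrative names.

```python
# Index 0 is Monday and index 6 is Sunday, matching datetime.weekday().
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Top four rows copied from Out[8] (average points for ask posts by weekday).
ask_points_by_weekday = [
    [15.244648318042813, 6],
    [13.04232424677188, 4],
    [12.223585548738923, 0],
    [11.214525139664804, 5],
]

# Swap each weekday number for its name and round the average for display.
labelled = [(WEEKDAY_NAMES[day], round(avg, 2)) for avg, day in ask_points_by_weekday]

for name, avg in labelled:
    print(name, avg)
```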

The Best Hours to Post

Next we'll look at the average comments and points by hour of the day for both show and ask posts. Hours use the 24-hour clock: 0 represents midnight, 12 represents noon, and 23 represents 11 pm.

Show_posts Hourly Points and Comments

In [10]:

analyze_time(show_posts, 'hour', 'points')

Out[10]:

[[20.905038759689923, 12],
 [19.258706467661693, 11],
 [17.018032786885247, 13],
 [16.057553956834532, 19],
 [15.994791666666666, 6],
 [15.862068965517242, 23],
 [15.547101449275363, 0],
 [15.144817073170731, 18],
 [15.09051724137931, 14],
 [14.683544303797468, 8],
 [14.340823970037453, 16],
 [13.995762711864407, 7],
 [13.95360824742268, 4],
 [13.94377990430622, 15],
 [13.930232558139535, 21],
 [13.88042049934297, 17],
 [13.331564986737401, 22],
 [13.321981424148607, 10],
 [13.234285714285715, 20],
 [13.224880382775119, 2],
 [12.456953642384105, 9],
 [11.866396761133604, 1],
 [10.662790697674419, 5],
 [10.524271844660195, 3]]

In [11]:

analyze_time(show_posts, 'hour', 'comments')

Out[11]:

[[6.994186046511628, 12],
 [6.682203389830509, 7],
 [6.002487562189055, 11],
 [5.6044303797468356, 8],
 [5.515804597701149, 14],
 [5.432786885245902, 13],
 [5.148325358851674, 2],
 [5.041237113402062, 4],
 [5.01978417266187, 19],
 [4.942073170731708, 18],
 [4.708333333333333, 6],
 [4.705368289637953, 16],
 [4.672185430463577, 9],
 [4.648550724637682, 0],
 [4.574162679425838, 15],
 [4.533980582524272, 3],
 [4.5266457680250785, 23],
 [4.252299605781866, 17],
 [4.158095238095238, 20],
 [4.090697674418605, 21],
 [4.0728744939271255, 1],
 [3.8461538461538463, 22],
 [3.801857585139319, 10],
 [3.441860465116279, 5]]

Posting between 11:00 am and 1:00 pm (hours 11 and 12) is best for show posts, with hour 12 ranking first for both points and comments.

ask_posts Hourly Points and Comments

In [12]:

analyze_time(ask_posts, 'hour', 'points')

Out[12]:

[[21.637770897832816, 15],
 [17.93243243243243, 13],
 [13.576023391812866, 12],
 [13.436170212765957, 10],
 [12.189097103918229, 17],
 [11.156351791530945, 18],
 [10.944237918215613, 2],
 [10.905349794238683, 4],
 [10.67704280155642, 8],
 [10.50682261208577, 14],
 [10.310880829015543, 16],
 [9.789473684210526, 5],
 [9.733590733590734, 21],
 [9.439716312056738, 1],
 [9.418604651162791, 0],
 [9.402088772845953, 22],
 [9.3690036900369, 3],
 [9.153846153846153, 11],
 [9.026548672566372, 7],
 [8.805882352941177, 20],
 [8.675213675213675, 6],
 [8.66304347826087, 19],
 [7.941441441441442, 9],
 [7.626822157434402, 23]]

In [13]:

analyze_time(ask_posts, 'hour', 'comments')

Out[13]:

[[28.676470588235293, 15],
 [16.31756756756757, 13],
 [12.380116959064328, 12],
 [11.137546468401487, 2],
 [10.684397163120567, 10],
 [9.7119341563786, 4],
 [9.692007797270955, 14],
 [9.449744463373083, 17],
 [9.190661478599221, 8],
 [8.96474358974359, 11],
 [8.804177545691905, 22],
 [8.794258373205741, 5],
 [8.749019607843136, 20],
 [8.687258687258687, 21],
 [7.948339483394834, 3],
 [7.94299674267101, 18],
 [7.713298791018998, 16],
 [7.5647840531561465, 0],
 [7.407801418439717, 1],
 [7.163043478260869, 19],
 [7.013274336283186, 7],
 [6.782051282051282, 6],
 [6.696793002915452, 23],
 [6.653153153153153, 9]]

3:00 pm is the best time to post ask posts.
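The 24-hour bucket numbers in the tables above can be converted to 12-hour clock labels with a few lines of code. This is a small sketch; `hour_label` is a hypothetical helper, and each label marks the start of a one-hour bucket (hour 15 covers 15:00 through 15:59).

```python
def hour_label(hour):
    """Convert an hour from 0 to 23 into a 12-hour clock label such as '3:00 pm'."""
    suffix = "am" if hour < 12 else "pm"
    display = hour % 12 or 12  # 0 and 12 both display as 12
    return f"{display}:00 {suffix}"


# Top three hours for ask post comments, copied from Out[13].
top_ask_hours = [15, 13, 12]
labels = [hour_label(hour) for hour in top_ask_hours]

print(labels)
```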

Conclusion

Depending on the type of engagement we're after, we may choose to post an Ask post or a Show post. Points can be read as a rough function of the number of people a post reaches: the more points a post receives, the more people likely saw it. Commenting is a less passive form of engagement and may have a larger impact on those who do see the post.