In this project, we will use Hacker News data to understand which posting behaviors receive the highest levels of positive engagement from the Hacker News community.
GOAL: Determine the type and timing of posts that lead to positive engagement from Hacker News users to inform what and when we should publish.
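The dataset has been shortened to include only a random sample of posts that received comments on Hacker News. Based on how the rows are indexed in the code below, each row holds, in order: a post ID, the title, the URL, the point count, the comment count, the author, and the creation timestamp.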
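from csv import reader
# Read the whole file into a list of rows
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()
# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]
# Preview the first four rows
for row in hn[:4]:
    print(row)
    print("\n")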
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
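As a minimal sketch, the loading step above could also be wrapped in a small helper and used with a `with` block so the file is closed automatically. To stay self-contained, the example reads from an in-memory string rather than `hacker_news.csv`. The header names in `SAMPLE_CSV` and the helper name `load_rows` are assumptions for illustration (the real header row is whatever `headers` holds); the two data rows are copied from the output above.

```python
import csv
import io

# Hypothetical header names; only the two data rows come from the output above.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)


def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# With the real file this would look like:
#     with open('hacker_news.csv', newline='') as f:
#         headers, hn = load_rows(f)
with io.StringIO(SAMPLE_CSV) as f:
    sample_headers, sample_rows = load_rows(f)

print(sample_headers)
print(len(sample_rows))
```

Printing the header row this way is also a quick check on which column index holds which field before indexing into the rows.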
This data does not have any 'category' delineations. To drill down to certain types of posts, we will need to use their titles, as there are some naming conventions that are in widespread use across the site.
We will focus specifically on posts whose titles begin with Ask HN and Show HN. These are posts that either pose a question to the community or show the community something new, respectively. We will filter down to those posts below:
ask_posts = []
show_posts = []
other_posts = []
# Sort each post into a list based on how its title starts (case-insensitive)
for row in hn:
    title = row[1]
    lowercase = title.lower()
    if lowercase.startswith('ask hn'):
        ask_posts.append(row)
    elif lowercase.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("There are " + str(len(ask_posts)) + " Ask HN posts")
print("There are " + str(len(show_posts)) + " Show HN posts")
print("There are " + str(len(other_posts)) + " Other posts")
There are 1744 Ask HN posts
There are 1162 Show HN posts
There are 17194 Other posts
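The prefix test above can be isolated into a small function, which makes it easy to check on a handful of titles before running it over the full list. This is only a sketch: the function name `categorize` is an assumption, and the first two example titles are made up for illustration.

```python
def categorize(title):
    """Label a title as 'ask', 'show', or 'other' by its prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# The first two titles are invented; the third appears in the sample rows.
examples = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]
labels = [categorize(t) for t in examples]
print(labels)  # ['ask', 'show', 'other']
```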
Next, we can dig in and look at engagement with the Ask HN and Show HN posts. One proxy for engagement is the number of comments that each type of post received on average.
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Avg Ask HN comments:")
print(round(avg_ask_comments, 2))
Avg Ask HN comments:
14.04
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Avg Show HN comments:")
print(round(avg_show_comments, 2))
Avg Show HN comments:
10.32
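The two averaging loops above differ only in the list they read, so one way to avoid the repetition is a small helper. The sketch below assumes a hypothetical function name, `average_column`, and demonstrates it on two of the sample rows printed near the top, where index 4 is the comment count.

```python
def average_column(rows, index):
    """Mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)


# Two rows taken from the sample output near the top of this write-up.
sample = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]
print(average_column(sample, 4))  # 26.5
```

With the full lists, `average_column(ask_posts, 4)` and `average_column(show_posts, 4)` would be expected to reproduce the two averages above.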
We know that there are more Ask posts than Show posts, and in looking at the average comments per post, we also see heightened engagement with Ask posts. On average, Show posts see about 10 comments each, while Ask posts see around 14.
For our analysis of comment activity, let's focus our attention on these Ask posts. Next, we'll look at the timing of the post: does the hour at which a post was created correlate with higher or lower engagement?
import datetime as dt
# Pair each Ask post's creation time with its comment count
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
# Tally posts and comments for each hour of the day
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = row[0]
    num_comments = row[1]
    clean_date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = clean_date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
# Average comments per post for each hour
avg_by_hour = []
for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
# Put the average first so that sorted() orders by it
swap_avg_by_hour = []
for row in avg_by_hour:
    new_zero = row[1]
    new_one = row[0]
    swap_avg_by_hour.append([new_zero, new_one])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Post Comments")
print("\n")
for row in sorted_swap[:5]:
    avg_comments = row[0]
    hour = row[1]
    clean_hour = dt.datetime.strptime(hour, "%H")
    final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
    output_string = "{0}: {1:.2f} average comments per post".format(final_hour, avg_comments)
    print(output_string)
Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
Ask posts created during the 3pm, 2am, 8pm, 4pm, and 9pm hours receive the most comments on average.
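The hourly grouping above can be sketched more compactly with `collections.defaultdict` and a sort key, which removes the need to swap columns before sorting. The function name `average_by_hour` is an assumption, and the comment counts in `pairs` are illustrative values rather than figures from the dataset.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(pairs):
    """Map each creation hour (0-23) to the mean of its values.

    `pairs` is an iterable of (created_at, value), where created_at
    looks like '8/4/2016 11:52'.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for created_at, value in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        totals[hour] += value
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Illustrative comment counts only, not taken from the dataset.
pairs = [("8/4/2016 11:52", 6), ("8/5/2016 11:05", 2), ("6/23/2016 22:20", 10)]
averages = average_by_hour(pairs)

# Sort by average, highest first, without swapping the columns.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for hour, avg in ranked:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

Keeping the hour as an integer also makes later range checks simpler, since no string-to-int conversion is needed.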
It is intuitive that people would be more likely to engage with posts toward the end of, or after, their work day. We do see engagement with posts created earlier in the day (8am through 2pm), but it is a bit lower on average:
work_day = []
for row in sorted_swap:
    if 8 <= int(row[1]) <= 14:
        avg_comments = row[0]
        hour = row[1]
        clean_hour = dt.datetime.strptime(hour, "%H")
        final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
        output_string = "{0}: {1:.2f} average comments per post".format(final_hour, avg_comments)
        print(output_string)
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
11:00: 11.05 average comments per post
08:00: 10.25 average comments per post
12:00: 9.41 average comments per post
09:00: 5.58 average comments per post
In this sample, 3pm stands out as the best time to catch commenters, at an average of about 39 comments per post. This may be because folks are already at their computers for work, but are winding down their day, or perhaps burning out for the day and exploring Hacker News.
If we are not able to post at 3pm, posting late at night around 2am could help us catch the night owl visitors to the site, who also seem to comment quite often.
Comments aren't the only way to determine engagement. After all, people could be furiously commenting on a post because they strongly dislike or disagree with it. We care not only about attracting attention, but also about making a positive contribution to the Hacker News community.
Point counts on Hacker News factor in positive as well as negative reactions. Let's take a look at Ask vs Show posts and see which type of post is earning the most points on average.
# Average points per Show HN post
show_post_count = 0
show_point_count = 0
for row in show_posts:
    points = int(row[3])
    show_post_count += 1
    show_point_count += points
avg_show_points = show_point_count / show_post_count
print("Average points of Show HN posts:")
print(round(avg_show_points, 2))
# Average points per Ask HN post
ask_post_count = 0
ask_point_count = 0
for row in ask_posts:
    points = int(row[3])
    ask_post_count += 1
    ask_point_count += points
avg_ask_points = ask_point_count / ask_post_count
print("Average points of Ask HN posts:")
print(round(avg_ask_points, 2))
Average points of Show HN posts:
27.56
Average points of Ask HN posts:
15.06
In looking at point counts, we can see that even though Show posts get ~4 fewer comments on average than Ask posts, they are seeing more positive engagement in the form of points. Show posts are earning close to double the points of Ask posts.
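As a quick arithmetic check on the comparison above, the sketch below plugs in the four averages reported in the earlier outputs. The variable names `points_ratio` and `comment_gap` are introduced here for illustration only.

```python
# Averages copied from the outputs earlier in this analysis.
avg_show_points = 27.56
avg_ask_points = 15.06
avg_show_comments = 10.32
avg_ask_comments = 14.04

# How many times more points Show posts earn than Ask posts.
points_ratio = avg_show_points / avg_ask_points
# How many more comments Ask posts receive than Show posts.
comment_gap = avg_ask_comments - avg_show_comments

print(round(points_ratio, 2))  # 1.83
print(round(comment_gap, 2))   # 3.72
```

A ratio of about 1.83 is consistent with "close to double", and a gap of about 3.72 with "~4 fewer comments".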
This may be because a 'Show' post and an 'Ask' post invite two different types of engagement. A 'Show' post brings a new idea or thought to the community, which will often provoke excitement and engagement in the form of upvotes. An 'Ask' post invites conversation and troubleshooting, which most effectively takes the form of comments.
Both types of engagement can be positive - that said, using this data, we are able to have more confidence in point counts as a measure of positive engagement, since they represent both upvotes and downvotes.
With this in mind, I'll focus on point counts as a proxy for positive engagement moving forward.
Let's see if there's a best time of day to post in order to gain positive engagement through points. Here we will focus on 'Show' posts, since we've seen above that they outperform Ask posts in terms of point counts.
# Pair each Show post's point count with its creation time
result_list = []
for row in show_posts:
    num_points = int(row[3])
    date = row[6]
    result_list.append([num_points, date])
# Tally posts and points for each hour of the day
show_points_by_hour = {}
show_count_by_hour = {}
for row in result_list:
    date = row[1]
    num_points = row[0]
    clean_date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = clean_date.strftime("%H")
    if hour not in show_count_by_hour:
        show_count_by_hour[hour] = 1
        show_points_by_hour[hour] = num_points
    else:
        show_count_by_hour[hour] += 1
        show_points_by_hour[hour] += num_points
# Average points per post for each hour
show_avg_by_hour = []
for hour in show_points_by_hour:
    avg = show_points_by_hour[hour] / show_count_by_hour[hour]
    show_avg_by_hour.append([hour, avg])
# Put the average first so that sorted() orders by it
sort_show_avg = []
for row in show_avg_by_hour:
    new_zero = row[1]
    new_one = row[0]
    sort_show_avg.append([new_zero, new_one])
sort_show_avg = sorted(sort_show_avg, reverse=True)
print("Top 5 Posting Hours for Point Counts")
for row in sort_show_avg[:5]:
    avg_points = row[0]
    hour = row[1]
    clean_hour = dt.datetime.strptime(hour, "%H")
    final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
    output_string = "{0}: {1:.2f} average points per post".format(final_hour, avg_points)
    print(output_string)
Top 5 Posting Hours for Point Counts
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post
We see some similarities between time-of-day engagement for comments on Ask posts and points on Show posts. In the case of point counts, however, three of the top 5 hours run in immediate succession (the 10pm, 11pm, and 12am hours), showing a clear trend of higher point counts for posts created during the 10pm-12am window.
It looks as if this would be the ideal time frame to publish a Show HN post in order to maximize positive engagement.
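The observation that several top hours sit next to each other can be checked mechanically. The sketch below looks for the longest run of consecutive clock hours in a list, treating midnight as following 23:00. The function name `longest_consecutive_run` is an assumption, and `top_hours` holds the five hours from the output above.

```python
def longest_consecutive_run(hours):
    """Longest run of consecutive clock hours in `hours`, wrapping 23 -> 0."""
    hour_set = set(hours)
    best = []
    for start in hour_set:
        # Only begin a run at an hour whose predecessor is absent
        # (unless every hour is present, in which case any start works).
        if (start - 1) % 24 in hour_set and len(hour_set) < 24:
            continue
        run = [start]
        while (run[-1] + 1) % 24 in hour_set and len(run) < len(hour_set):
            run.append((run[-1] + 1) % 24)
        if len(run) > len(best):
            best = run
    return best


top_hours = [23, 12, 22, 0, 18]  # top 5 hours from the output above
print(longest_consecutive_run(top_hours))  # [22, 23, 0]
```

The wrap-around matters here: without the modulo step, 23:00 and 00:00 would not be recognized as adjacent.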
For learning purposes, early on in this analysis I decided to focus on 'Ask HN' and 'Show HN' posts in the Hacker News community. As a quick gut check to ensure we're not missing any high-value posting types, we will do a quick comparison of our findings to the other_posts list, which contains all posts that do not begin with the 'Ask HN' and 'Show HN' keywords.
Let's take a quick look at the average comment & point activity in the 'other' category, compared to the highest performing post type for each measure:
# Total up points and comments across all 'Other' posts
other_post_count = 0
other_comment_count = 0
other_point_count = 0
for row in other_posts:
    points = int(row[3])
    comments = int(row[4])
    other_point_count += points
    other_comment_count += comments
    other_post_count += 1
other_avg_comments = other_comment_count / other_post_count
other_avg_points = other_point_count / other_post_count
# Compare against the best-performing post type for each measure
print("Average 'Other' Comments")
print(round(other_avg_comments, 2))
print("Average 'Ask HN' Comments")
print(round(avg_ask_comments, 2))
print('\n')
print("Average 'Other' Points")
print(round(other_avg_points, 2))
print("Average 'Show HN' Points")
print(round(avg_show_points, 2))
Average 'Other' Comments
26.87
Average 'Ask HN' Comments
14.04

Average 'Other' Points
55.41
Average 'Show HN' Points
27.56
Checking our Ask and Show analysis against the 'Other' data provides a really interesting next path to explore in this analysis. Though we could get some great engagement through 'Show HN' posts, it seems as though there are some hidden opportunities in the 'Other' category that we should explore.
That said, one challenge we may run into is that this data is not categorized - we were able to use a proxy for the category by analyzing two types of posts that always begin with the same string. Let's take a high level look at the 'Other' data to see if there is a clear path forward.
Just to see if we can glean some immediate insights from, or easily categorize, this 'other' data, below we've taken a look at the top 20 titles by point value.
# Collect [points, title] pairs so that sorted() orders by points
other_post_points = []
for row in other_posts:
    title = row[1]
    points = int(row[3])
    other_post_points.append([points, title])
sorted_other_points = sorted(other_post_points, reverse=True)
print("Top 20 'Other' Titles by Points")
for row in sorted_other_points[:20]:
    title = row[1]
    points = row[0]
    string = "{} points : {}"
    final_string = string.format(points, title)
    print(final_string)
Top 20 'Other' Titles by Points
2553 points : Pardon Snowden
2381 points : Tell HN: New features and a moderator
1851 points : Master Plan, Part Deux
1622 points : Responsive Pixel Art
1573 points : I've Just Liberated My Modules
1565 points : Being sued, in East Texas, for using the Google Play Store [video]
1562 points : Instagram's Million Dollar Bug
1559 points : TensorFlow: open-source library for machine intelligence
1447 points : Amazon's customer service backdoor
1395 points : Lee Sedol Beats AlphaGo in Game 4
1368 points : VLC contributor living in Aleppo writing about the Paris attacks
1323 points : It's The Future
1304 points : He Always Had a Dark Side
1302 points : Graphing when your Facebook friends are awake
1284 points : SpaceX launch webcast: Orbcomm-2 Mission [video]
1282 points : My First 10 Minutes on a Server
1260 points : Let's Encrypt is Trusted
1238 points : Philae Found
1207 points : Google achieves AI 'breakthrough' by beating Go champion
1195 points : Massachusetts Bans Employers from Asking Applicants About Previous Pay
At first glance, no major categorizations of these top titles jump out that may be able to give us guidance on which type of 'other' post to pursue. I'm interested in hearing how one might approach digging further into this section of the data.
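One possible starting point, offered only as a sketch, is to count short title prefixes that end in a colon, since that is the same kind of naming convention that identified the Ask HN and Show HN posts. The function name `title_prefix` and the three-word cutoff are assumptions chosen for illustration, and the titles are a handful from the list above.

```python
from collections import Counter


def title_prefix(title):
    """Return the text before the first ': ' in a title, or None."""
    head, sep, _ = title.partition(': ')
    # Treat only short heads (three words or fewer) as a possible convention.
    if sep and len(head.split()) <= 3:
        return head
    return None


# A few titles from the top-20 list above.
titles = [
    'Pardon Snowden',
    'Tell HN: New features and a moderator',
    'TensorFlow: open-source library for machine intelligence',
    'SpaceX launch webcast: Orbcomm-2 Mission [video]',
    'Philae Found',
]
prefix_counts = Counter(p for p in map(title_prefix, titles) if p is not None)
print(prefix_counts.most_common())
```

Run over all of other_posts, prefixes that recur many times would be candidates for new categories, while one-off prefixes such as product names could be ignored.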
To maximize positive engagement on the Hacker News site, we should opt to publish an engaging 'Show HN' post between the hours of 10pm-12am. This type of post and time period have shown promising levels of engagement based on point values, and would be our best bet to catch the attention of the HN community.
That said, there is an interesting opportunity to explore the other_posts list, which is more difficult to categorize & filter, but shows very high levels of engagement upon a cursory exploration of the data.