In this project, we will use Hacker News data to understand which posting behaviors receive the highest levels of positive engagement from the Hacker News community.
GOAL: Determine the type and timing of posts that lead to positive engagement from Hacker News users to inform what and when we should publish.
The dataset has been shortened to include only a random sample of posts that received comments on Hacker News. The columns are labeled as follows:
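from csv import reader
# Read the whole file into a list of rows; the file is closed automatically
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]
# Preview the first four rows
for row in hn[:4]:
    print(row)
    print("\n")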
This data does not have any 'category' delineations. To drill down to certain types of posts, we will need to use their titles, as there are some naming conventions that are in widespread use across the site.
We will focus specifically on posts whose titles begin with Ask HN and Show HN. These are posts that either pose a question to the community or show the community something new, respectively. We will filter down to those posts below:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    # Compare in lowercase so 'Ask HN', 'ask hn', and 'ASK HN' all match
    lowercase = title.lower()
    if lowercase.startswith('ask hn'):
        ask_posts.append(row)
    elif lowercase.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("There are " + str(len(ask_posts)) + " Ask HN posts")
print("There are " + str(len(show_posts)) + " Show HN posts")
print("There are " + str(len(other_posts)) + " Other posts")
Next, we can dig in and look at engagement with the Ask HN and Show HN posts. One proxy for engagement is the number of comments that each type of post received on average.
# Average comments per Ask HN post
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Avg Ask HN comments:")
print(round(avg_ask_comments, 2))
# Average comments per Show HN post
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Avg Show HN comments:")
print(round(avg_show_comments, 2))
We know that there are more Ask posts than Show posts, and in looking at the average comments per post, we also see heightened engagement with Ask posts. On average, Show posts see about 10 comments each, while Ask posts see around 14.
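The two averaging loops above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, not part of the original analysis: the name `average_column` and the sample rows are hypothetical, and the only assumption carried over from the notebook is that index 4 holds the comment count.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # avoid ZeroDivisionError on an empty list
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)


# Hypothetical sample rows; only index 4 (comment count) is used here.
sample_rows = [
    ["1", "Ask HN: Example A", "", "5", "10", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Example B", "", "7", "20", "user_b", "8/16/2016 15:10"],
]
sample_avg = average_column(sample_rows, 4)
print(round(sample_avg, 2))
```

With the real data, `average_column(ask_posts, 4)` and `average_column(show_posts, 4)` should give the same values as the loops above.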
For our analysis of comment activity, let's focus our attention on these Ask posts. Next, we'll look at the timing of the post - does the time of day that the publisher posted their thoughts correlate to higher or lower engagement?
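Grouping by time of day depends on turning each timestamp string into an hour. As a small sketch of that step, assuming the timestamps look like `8/16/2016 9:55` (the sample values and the helper name `hour_of` are made up for illustration):

```python
import datetime as dt


def hour_of(timestamp):
    """Parse a 'month/day/year hour:minute' string and return the zero-padded hour."""
    parsed = dt.datetime.strptime(timestamp, "%m/%d/%Y %H:%M")
    return parsed.strftime("%H")


# Hypothetical timestamp in the same format used by the analysis code.
example_hour = hour_of("8/16/2016 9:55")
print(example_hour)
```

Returning the hour as a two-character string keeps it usable as a dictionary key and lets it be parsed back with `"%H"` later for display.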
import datetime as dt
# Collect the creation time and comment count of each Ask HN post
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
# Tally the number of posts and the total comments for each hour of the day
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = row[0]
    num_comments = row[1]
    clean_date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = clean_date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
# Average comments per post for each hour
avg_by_hour = []
for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
# Put the average first so that sorting orders the rows by average
swap_avg_by_hour = []
for row in avg_by_hour:
    new_zero = row[1]
    new_one = row[0]
    swap_avg_by_hour.append([new_zero, new_one])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Post Comments")
print("\n")
for row in sorted_swap[:5]:
    avg_comments = row[0]
    hour = row[1]
    clean_hour = dt.datetime.strptime(hour, "%H")
    final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
    output_string = "{0}: {1:.2f} average comments per post".format(final_hour, avg_comments)
    print(output_string)
The five posting hours with the highest average number of comments on Ask posts are 3pm, 2am, 8pm, 4pm, and 9pm.
It is intuitive that people would be more likely to engage with posts toward the end of, or after, their work day. We do see engagement throughout the day, but it is a bit lower on average:
work_day = []
# Show the averages for posts created between 08:00 and 14:59
for row in sorted_swap:
    if 8 <= int(row[1]) <= 14:
        avg_comments = row[0]
        hour = row[1]
        clean_hour = dt.datetime.strptime(hour, "%H")
        final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
        output_string = "{0}: {1:.2f} average comments per post".format(final_hour, avg_comments)
        print(output_string)
It is clear that 3pm is the best time to catch commenters, at an average of 39 comments. This may be because folks are already at their computers for work, but are winding down their day, or perhaps burning out for the day and exploring Hacker News.
If we are not able to post at 3pm, posting late at night around 2am could help us catch the night owl visitors to the site, who also seem to comment quite often.
Comments aren't the only way to determine engagement. After all, people could be furiously commenting on a post because they strongly dislike or disagree with it. We care not only about attracting attention, but also about making a positive contribution to the Hacker News community.
Point counts on Hacker News factor in positive as well as negative reactions. Let's take a look at Ask vs Show posts and see which type of post is earning the most points on average.
# Average points per Show HN post
show_post_count = 0
show_point_count = 0
for row in show_posts:
    points = int(row[3])
    show_post_count += 1
    show_point_count += points
avg_show_points = show_point_count / show_post_count
print("Average points of Show HN posts:")
print(round(avg_show_points, 2))
# Average points per Ask HN post
ask_post_count = 0
ask_point_count = 0
for row in ask_posts:
    points = int(row[3])
    ask_post_count += 1
    ask_point_count += points
avg_ask_points = ask_point_count / ask_post_count
print("Average points of Ask HN posts:")
print(round(avg_ask_points, 2))
In looking at point counts, we can see that even though Show posts get ~4 fewer comments on average than Ask posts, they are seeing more positive engagement in the form of points. Show posts are earning close to double the points of Ask posts.
This may be because a 'Show' post and an 'Ask' post invite two different types of engagement. A 'Show' post brings a new idea or thought to the community, which will often provoke excitement and engagement in the form of upvotes. An 'Ask' post invites conversation and troubleshooting, which most effectively takes the form of comments.
Both types of engagement can be positive - that said, using this data, we are able to have more confidence in point counts as a measure of positive engagement, since they represent both upvotes and downvotes.
With this in mind, I'll focus on point counts as a proxy for positive engagement moving forward.
Let's see if there's a best time of day to post in order to gain positive engagement through points. Here we will focus on 'Show' posts, since we've seen above that they outperform Ask posts in terms of point counts.
# Collect the point count and creation time of each Show HN post
result_list = []
for row in show_posts:
    num_points = int(row[3])
    date = row[6]
    result_list.append([num_points, date])
# Tally the number of posts and the total points for each hour of the day
show_points_by_hour = {}
show_count_by_hour = {}
for row in result_list:
    date = row[1]
    num_points = row[0]
    clean_date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = clean_date.strftime("%H")
    if hour not in show_count_by_hour:
        show_count_by_hour[hour] = 1
        show_points_by_hour[hour] = num_points
    else:
        show_count_by_hour[hour] += 1
        show_points_by_hour[hour] += num_points
# Average points per post for each hour
show_avg_by_hour = []
for hour in show_points_by_hour:
    avg = show_points_by_hour[hour] / show_count_by_hour[hour]
    show_avg_by_hour.append([hour, avg])
# Put the average first so that sorting orders the rows by average
sort_show_avg = []
for row in show_avg_by_hour:
    new_zero = row[1]
    new_one = row[0]
    sort_show_avg.append([new_zero, new_one])
sort_show_avg = sorted(sort_show_avg, reverse=True)
print("Top 5 Posting Hours for Point Counts")
for row in sort_show_avg[:5]:
    avg_points = row[0]
    hour = row[1]
    clean_hour = dt.datetime.strptime(hour, "%H")
    final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
    output_string = "{0}: {1:.2f} average points per post".format(final_hour, avg_points)
    print(output_string)
We see some similarities between time-of-day engagement for comments on Ask posts and points on Show posts. In the case of point counts, however, three of the top 5 hours run in immediate succession, showing a clear trend of upvoting activity during the 10pm-12am window.
It looks as if this would be the ideal time frame to publish a Show HN post in order to maximize positive engagement.
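The hour-by-hour calculation has now been written twice, once for comments and once for points. A possible refactoring is sketched below under stated assumptions: the helper name `average_by_hour`, its parameters, and the sample rows are all hypothetical, and the default date format is simply the one used in the code above.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, date_index, date_format="%m/%d/%Y %H:%M"):
    """Return (average, hour) pairs sorted from highest to lowest average."""
    totals = defaultdict(int)  # sum of the chosen value per hour
    counts = defaultdict(int)  # number of posts per hour
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], date_format).hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    averages = [(totals[hour] / counts[hour], hour) for hour in totals]
    return sorted(averages, reverse=True)


# Hypothetical sample rows: index 3 is the point count, index 6 the timestamp.
sample_rows = [
    ["1", "Show HN: Example A", "", "10", "2", "user_a", "8/16/2016 22:05"],
    ["2", "Show HN: Example B", "", "30", "4", "user_b", "8/17/2016 22:40"],
    ["3", "Show HN: Example C", "", "5", "1", "user_c", "8/18/2016 9:15"],
]
for avg_points, hour in average_by_hour(sample_rows, 3, 6):
    print("{:02d}:00: {:.2f} average points per post".format(hour, avg_points))
```

Here the hour is kept as an integer rather than a string, which is a design choice of this sketch and differs from the code above. With the real data, `average_by_hour(show_posts, 3, 6)[:5]` would correspond to the top five hours for points, and `average_by_hour(ask_posts, 4, 6)[:5]` to the top five hours for comments.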
For learning purposes, early on in this analysis I decided to focus on 'Ask HN' and 'Show HN' posts in the Hacker News community. As a quick gut check to ensure we're not missing any high-value posting types, we will compare our findings to the other_posts list, which contains all posts whose titles do not begin with 'Ask HN' or 'Show HN'.
Let's take a quick look at the average comment & point activity in the 'other' category, compared to the highest performing post type for each measure:
other_post_count = 0
other_comment_count = 0
other_point_count = 0
for row in other_posts:
    points = int(row[3])
    comments = int(row[4])
    other_point_count += points
    other_comment_count += comments
    other_post_count += 1
other_avg_comments = other_comment_count / other_post_count
other_avg_points = other_point_count / other_post_count
# Compare 'Other' against the best-performing post type for each measure
print("Average 'Other' Comments")
print(round(other_avg_comments, 2))
print("Average 'Ask HN' Comments")
print(round(avg_ask_comments, 2))
print('\n')
print("Average 'Other' Points")
print(round(other_avg_points, 2))
print("Average 'Show HN' Points")
print(round(avg_show_points, 2))
Checking our Ask and Show analysis against the 'Other' data provides a really interesting next path to explore in this analysis. Though we could get some great engagement through 'Show HN' posts, it seems as though there are some hidden opportunities in the 'Other' category that we should explore.
That said, one challenge we may run into is that this data is not categorized - we were able to use a proxy for the category by analyzing two types of posts that always begin with the same string. Let's take a high level look at the 'Other' data to see if there is a clear path forward.
Just to see if we can glean some immediate insights from, or easily categorize, this 'other' data, below we've taken a look at the top 20 titles by point value.
# Pair each 'Other' post's points with its title, points first for sorting
other_post_points = []
for row in other_posts:
    title = row[1]
    points = int(row[3])
    other_post_points.append([points, title])
sorted_other_points = sorted(other_post_points, reverse=True)
print("Top 20 'Other' Titles by Points")
for row in sorted_other_points[:20]:
    title = row[1]
    points = row[0]
    string = "{} points : {}"
    final_string = string.format(points, title)
    print(final_string)
At first glance, no major categorizations of these top titles jump out that may be able to give us guidance on which type of 'other' post to pursue. I'm interested in hearing how one might approach digging further into this section of the data.
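One possible starting point, offered as a rough sketch rather than a finished method, is to count which words appear most often in the titles and see whether any themes emerge. The function name `top_title_words`, the small stop-word set, and the sample titles are all assumptions made for illustration; none of them come from the dataset.

```python
import re
from collections import Counter

# A deliberately small, hypothetical stop-word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "for", "and", "is", "on", "with"}


def top_title_words(titles, n=10):
    """Return the n most common non-stop-words across the given titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


# Made-up titles used only to show the shape of the output.
sample_titles = [
    "Introducing a new Python release",
    "Python packaging explained",
    "A guide to Rust for Python developers",
]
print(top_title_words(sample_titles, n=1))
```

With the real data, the titles could be pulled from the 'Other' rows with `[row[1] for row in other_posts]`, optionally restricted to the highest-scoring posts first.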
To maximize positive engagement on the Hacker News site, we should opt to publish an engaging 'Show HN' post between the hours of 10pm-12am. This type of post and time period have shown promising levels of engagement based on point values, and would be our best bet to catch the attention of the HN community.
That said, there is an interesting opportunity to explore the other_posts list, which is more difficult to categorize and filter, but shows very high levels of engagement upon a cursory exploration of the data.