Hacker News (HN) is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. HN is extremely popular in technology and startup circles. On HN, users can post questions (ask hn posts) as well as seek feedback on current projects (show hn posts).
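In this project, I will explore a cleaned dataset of Hacker News post data to answer two main research questions about ask hn and show hn posts:
Which receive more comments on average, ask hn or show hn items?
Do ask hn posts created at a certain time of day receive more comments on average?
I'll start the project by first reading in the data.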
### open and format the hacker news dataset
import csv
opened_file = open('hacker_news')
read_file = csv.reader(opened_file)
hn = list(read_file)
header = hn[0]
hn = hn[1:]
### display the header and first few rows of the dataset
print(header)
print("\n")
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
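As an aside, the same loading step can be sketched with a context manager, which closes the file automatically. The example below is only an illustration: it reads from a small in-memory sample (two rows copied from the output above) instead of the real file, and the names `sample_csv`, `load_rows`, `sample_header`, and `sample_rows` are hypothetical, not part of the original project.

```python
import csv
import io

# Illustrative in-memory stand-in for the hacker_news file (assumption:
# the real file is comma-separated with a header row, as the output suggests).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# With the real file this would read: with open('hacker_news') as f: ...
with sample_csv as f:
    sample_header, sample_rows = load_rows(f)

print(sample_header)
print(len(sample_rows))
```

Separating the header from the rows inside one helper keeps the two results together, which can make the later indexing by column position a little easier to follow.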
Now that we've seen the data, let's categorize the posts into three categories: ask_posts, show_posts, and other_posts. These categories will help us see which types of posts are more common.
To categorize each post in the dataset, we'll initialize an empty list for each category and iterate over each post in the hn dataset, using if/elif/else logic to assign it to the right list. All ask hn posts start with ask hn and all show hn posts start with show hn. We will use this fact to categorize each post with the startswith() method. To avoid being fooled by case sensitivity, we will first apply the string method .lower() to the title of each post.
### create new variables to count types of posts
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(post)
    elif title.startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
### display number of posts by category
print("# of ask hn posts: ", len(ask_posts))
print("# of show hn posts: ", len(show_posts))
print("# of other posts: ", len(other_posts))
print("check: ", len(hn) == len(ask_posts) + len(show_posts) + len(other_posts))
# of ask hn posts: 1744
# of show hn posts: 1162
# of other posts: 17194
check: True
Now that we see which category more users post under, let's see which category drives more comments. To do this, we will need to calculate the average number of comments by category.
We start by creating new variables to accumulate the comment counts for each post type, total_ask_comments and total_show_comments. We will then loop over each category list to sum the comments per category. Finally, we will calculate the average comment count for each category.
### create new variables to store comment counts
total_ask_comments = 0
total_show_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
for post in show_posts:
    total_show_comments += int(post[4])
### calculate average comment counts
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)
### output template
template = "average number of comments for {name}: {avg: ,.2f}"
print(template.format(name = "ask hn posts", avg = avg_ask_comments))
print(template.format(name = "show hn posts", avg = avg_show_comments))
average number of comments for ask hn posts: 14.04
average number of comments for show hn posts: 10.32
Since ask hn posts receive about 36% more comments on average than show hn posts (14.04 versus 10.32), we will focus the rest of our analysis on these types of posts. Here, we will research which times are best for creating ask hn posts.
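The size of the gap can be checked directly from the two averages printed above. The short sketch below shows the arithmetic; the variable names are illustrative, and the two values are copied from the output rather than recomputed from the dataset. It puts ask hn posts roughly 36% ahead.

```python
# Averages copied from the output above (rounded to two decimals).
avg_ask = 14.04
avg_show = 10.32

# Relative difference: how much larger the ask hn average is,
# expressed as a share of the show hn average.
relative_diff = (avg_ask - avg_show) / avg_show

print("{:.1%}".format(relative_diff))
```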
We will do this by calculating the number of ask hn posts created during each hour of the day, along with the number of comments they received. We will then calculate the average number of comments ask hn posts receive by hour created. To do all of this, we will need the datetime module and its datetime.strptime() constructor to parse dates stored as strings into datetime objects. We will combine datetime.strptime() with the strftime() method to extract the hour from each datetime object as a two-digit string.
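To make the parsing step concrete, here is a minimal sketch applied to a single created_at value taken from the first row of the dataset. The variable names are illustrative. It also shows the `.hour` attribute, which returns the hour as an integer and could be used instead of strftime if string keys are not needed.

```python
import datetime as dt

# A created_at value copied from the first row of the dataset.
created_at = "8/4/2016 11:52"

# Parse the string into a datetime object using the dataset's date format.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_as_string = parsed.strftime("%H")  # zero-padded string: "11"
hour_as_int = parsed.hour               # integer alternative: 11

print(hour_as_string, hour_as_int)
```

The string form sorts correctly as text because it is zero-padded ("02" comes before "15"), which is why it works as a dictionary key in the analysis that follows.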
### import the datetime module
import datetime as dt
### create empty list
results_list = []
### iterate over ask_posts for created_at times and comment counts
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    results_list.append([created_at, num_comments])
### initiate two empty dictionaries for counts by hour
count_by_hour = {}
comments_by_hour = {}
### iterate over results_list to isolate counts and comments by hour into dictionaries above
for entry in results_list:
    date = entry[0]
    comments = entry[1]
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in count_by_hour:
        count_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        count_by_hour[hour] += 1
        comments_by_hour[hour] += comments
count_by_hour
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
comments_by_hour
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
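The if/else bookkeeping used to build these two dictionaries can also be expressed with `collections.defaultdict`, which supplies a starting value of 0 for any hour not seen yet. This is an alternative sketch, not the approach used in this project; `sample_results` holds made-up pairs shaped like results_list, and the `_dd` names are hypothetical.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical [created_at, num_comments] pairs shaped like results_list.
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

# defaultdict(int) returns 0 for missing keys, so no membership check is needed.
count_by_hour_dd = defaultdict(int)
comments_by_hour_dd = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    count_by_hour_dd[hour] += 1
    comments_by_hour_dd[hour] += num_comments

print(dict(count_by_hour_dd))
print(dict(comments_by_hour_dd))
```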
To calculate the average number of comments for posts created during each hour of the day, we will create an empty list that will hold the hour of the day when posts were created and the average number of comments those posts received. Each post_hour and avg_comment_by_hour pair will be stored in the list as a two-element list.
### create empty list
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / count_by_hour[hr]])
avg_by_hour
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
The output above is correct, but it is difficult to read. We will add a little clean-up code to sort the results and make the insights easier to understand.
### sort the output
swap_avg_by_hour = []
for l in avg_by_hour:
    swap_avg_by_hour.append([l[1], l[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Comments on Ask Posts")
print("-------------------------------------")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )
Top 5 Hours for Comments on Ask Posts
-------------------------------------
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
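The swap-then-sort step above can also be done in one pass by giving `sorted()` a `key` function that picks out the average. The sketch below is an alternative under that assumption, using a handful of pairs copied from avg_by_hour; `sample_avg_by_hour` and `ranked` are illustrative names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["20", 21.525],
    ["02", 23.810344827586206],
]

# Sort by the average (second element), highest first, without swapping columns.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

One small difference: when two hours have exactly the same average, the swapped-list version breaks the tie by hour, while the `key` version keeps the pairs in their original order.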
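It appears that 3 p.m. (15:00) is the best time to post questions on Hacker News, with an average of 38.59 comments per post, well ahead of the next-best hour (02:00, with 23.81). One possible rationale, assuming the created_at timestamps are recorded in US Eastern Time, is that this is a generally accessible time across the United States: mid-afternoon on the East Coast is around midday on the West Coast.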