Exploring Hacker News: When Is the Best Time to Post?¶

Hacker News (HN) is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. HN is extremely popular in technology and startup circles. On HN, users can post questions (ask hn posts) as well as seek feedback on current projects (show hn posts).

In this project, I will be exploring a cleaned dataset of Hacker News post data to answer two main research questions about ask hn and show hn posts:

Which category elicits the most commentary and feedback?
When is the best time of day to post ask hn and show hn items?

I'll start the project by first reading in the data.

In [1]:

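### open and format the hacker news dataset
import csv

### a with block closes the file once the rows have been read
with open('hacker_news') as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)

### separate the header row from the data rows
header = hn[0]
hn = hn[1:]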

### display the header and first few rows of the dataset
print(header)
print("\n")
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
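
As the printed rows show, csv.reader returns every field as a string, including num_points and num_comments. The sketch below illustrates this with an in-memory stand-in for the file (io.StringIO is used here only for illustration; the notebook itself reads from disk), which is why later cells wrap the comment count in int().

```python
import csv
import io

# In-memory stand-in for the file; the data row is the first row printed above
text = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = list(csv.reader(io.StringIO(text)))
header, data = rows[0], rows[1:]

# Every field comes back as a string, so numeric columns need converting
num_comments = data[0][4]
print(type(num_comments).__name__)
print(int(num_comments) + 1)
```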

Categorizing Posts¶

Now that we've seen the data, let's categorize the posts into three categories -- ask_posts, show_posts, and other_posts. These categories will help us see which types of posts are more common.

To categorize each post in the dataset, we'll initialize an empty list for each category and iterate over the posts in hn, using if/elif/else logic to assign each one. All ask hn posts start with ask hn and all show hn posts start with show hn, so we can check the start of each title with the startswith() method. To avoid being fooled by case differences, we will first apply the string method .lower() to each title.

In [2]:

### create empty lists to hold each type of post
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1].lower()
    
    if title.startswith("ask hn"):
        ask_posts.append(post)
    elif title.startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

### display number of posts by category
print("# of ask hn posts: ", len(ask_posts))
print("# of show hn posts: ", len(show_posts))
print("# of other posts: ", len(other_posts))
print("check: ", len(hn) == len(ask_posts) + len(show_posts) + len(other_posts))

# of ask hn posts:  1744
# of show hn posts:  1162
# of other posts:  17194
check:  True
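
The prefix check used above can also be wrapped in a small helper, which makes the case-insensitive matching easy to try on individual titles. This is only a sketch: the categorize function is a hypothetical name, and the first two sample titles are made up rather than taken from the dataset.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are invented for illustration; the third is from the data
sample_titles = [
    "Ask HN: How do you learn a new codebase?",
    "SHOW HN: A tiny static site generator",
    "Interactive Dynamic Video",
]

labels = [categorize(t) for t in sample_titles]
print(labels)
```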

Which Post Category Receives More Comments on Average?¶

Now that we know which category has more posts, let's see which one draws more comments. To do this, we will need to calculate the average number of comments by category.

We start by creating new variables for collecting the comment counts of each post type. We will call these variables total_ask_comments and total_show_comments. We will then loop over each category list to count the comments per category. Finally, we will calculate the average comment count for each category.

In [3]:

### create new variables to store comment counts
total_ask_comments = 0
total_show_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
for post in show_posts:
    total_show_comments += int(post[4])

### calculate average comment counts
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

### output template
template = "average number of comments for {name}: {avg: ,.2f}"

print(template.format(name = "ask hn posts", avg = avg_ask_comments))
print(template.format(name = "show hn posts", avg = avg_show_comments))
    

average number of comments for ask hn posts:  14.04
average number of comments for show hn posts:  10.32

Digging Deeper on Ask HN Posts¶

Since ask hn posts receive about 36% more comments on average than show hn posts (14.04 versus 10.32), we will focus the rest of our analysis on these types of posts. Here, we will research what times are the best for creating ask hn posts.
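
The size of the gap between the two categories can be checked with a quick calculation. The sketch below uses the rounded averages printed earlier, so the result is approximate.

```python
avg_ask = 14.04   # rounded average for ask hn posts, from the output above
avg_show = 10.32  # rounded average for show hn posts, from the output above

# Relative difference: how many more comments ask hn posts get, as a fraction
relative_gap = (avg_ask - avg_show) / avg_show
print(f"{relative_gap:.1%}")
```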

We will do this by counting the number of ask hn posts created in each hour of the day, along with the comments they received. We will then calculate the average number of comments per ask hn post for each hour. To do this, we need the datetime module: datetime.strptime() parses the dates stored as strings into datetime objects, and the strftime() method then formats each datetime object as a two-digit hour string.
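
As a small illustration of the parsing step, the sketch below converts one created_at value from the rows printed earlier. It also shows the .hour attribute, an alternative that gives the hour as an integer; the notebook itself uses the string form.

```python
import datetime as dt

sample = "9/30/2015 4:12"  # created_at value from the fifth row printed earlier

# Parse the string into a datetime object using the dataset's date layout
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

hour_str = parsed.strftime("%H")  # zero-padded string
hour_int = parsed.hour            # plain integer

print(hour_str, hour_int)
```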

In [4]:

### import the datetime module
import datetime as dt

### create empty list
results_list = []

### iterate over ask_posts for created_at times and comment counts
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])

    results_list.append([created_at, num_comments])
    
### initiate two empty dictionaries for counts by hour
count_by_hour = {}
comments_by_hour = {}

### iterate over results_list to isolate counts and comments by hour into dictionaries above
for entry in results_list:
    date = entry[0]
    comments = entry[1]
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    
    if hour not in count_by_hour:
        count_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        count_by_hour[hour] += 1
        comments_by_hour[hour] += comments

In [5]:

count_by_hour

Out[5]:

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [6]:

comments_by_hour

Out[6]:

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
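
The same two tallies can be built without the if/else branch by using collections.defaultdict, which starts every missing key at zero. This is an alternative sketch, not the approach used in the cells above, and the sample pairs are hypothetical values rather than rows from the dataset.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs, not taken from the dataset
sample_results = [
    ["8/4/2016 11:52", 5],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 1],
]

sample_count_by_hour = defaultdict(int)
sample_comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    # Missing hours default to 0, so no membership check is needed
    sample_count_by_hour[hour] += 1
    sample_comments_by_hour[hour] += num_comments

print(dict(sample_count_by_hour))
print(dict(sample_comments_by_hour))
```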

Calculating Average Number of Comments by Post per Hour¶

To calculate the average number of comments per post for each hour of the day, we will create an empty list and append one [hour, average] pair to it for every hour: the hour in which the posts were created and the average number of comments those posts received.

In [7]:

### create empty list
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / count_by_hour[hr]])

In [8]:

avg_by_hour

Out[8]:

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Cleaning up the Output¶

The output above contains the numbers we need, but it is difficult to read. We will sort the averages and format the top results to make the output more readable and the insights easier to see.

In [17]:

### swap each pair to [average, hour] so that sorting orders by average, highest first
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Comments on Ask Posts")
print("-------------------------------------")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for Comments on Ask Posts
-------------------------------------
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
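
The ranking above can also be produced without swapping the columns, by passing a key function to sorted(). The sketch below is an alternative, not the method used in the cell above, and it works on a few [hour, average] pairs copied from the avg_by_hour output.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above
sample_avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["20", 21.525],
    ["02", 23.810344827586206],
]

# Sort on the second element (the average), highest first
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```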

Conclusions¶

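Ask hn posts created in the 15:00 hour average 38.59 comments, well ahead of the next-best hour, 02:00, at 23.81, so 3 p.m. appears to be the best time to post questions on Hacker News. One possible reason is that it is a convenient time across the United States: if the timestamps are in Eastern Time, which this notebook does not confirm, 3 p.m. on the East Coast is noon on the West Coast.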