Finding Optimal Post Times -- Analyzing High Volume of Comments at Specific Times of the Day

We are analyzing a data set of submissions to the popular technology site Hacker News. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

This project will determine whether Ask HN or Show HN posts receive more comments on average, and then whether posts created at a specific time of day receive more comments on average.

Summary of Result

We found that Ask HN posts receive more comments on average than Show HN posts (14.04 vs. 10.32), and that Ask HN posts created at 3:00 pm Eastern Time receive the most comments on average (38.59 per post).

These results are useful for anyone who wants answers to a specific question: they indicate what time to post a question on Hacker News to attract the most comments.

For more details, you can explore the steps below.

First let's make the data set readable from Jupyter Notebook.

In [1]:

from csv import reader

# Read the data set
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)

So that the header does not interfere with our analysis, we will separate it from the actual data that we are querying.

In [2]:

# Turn the data set into a list of lists
hn = list(read_file)
opened_file.close()  # The file has been fully read, so it can be closed

# Separate the header from the data
headers = hn[0]
hn = hn[1:]  # The data set without the header

print(headers, "\n")

# Display 4 sample rows: indices 5, 10, 15 and 20
for i in range(5, 25, 5):
    print(hn[i], "\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48'] 

['11370829', 'Crate raises $4M seed round for its next-gen SQL database', 'http://techcrunch.com/2016/03/15/crate-raises-4m-seed-round-for-its-next-gen-sql-database/', '3', '1', 'hitekker', '3/27/2016 18:08'] 

['12335860', 'How often to update third party libraries?', '', '7', '5', 'rabid_oxen', '8/22/2016 12:37'] 

['11079821', 'APOD: LIGO detects gravity waves...', 'http://apod.nasa.gov/apod/astropix.html', '1', '2', 'AliCollins', '2/11/2016 12:57']
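As a small illustration of separating the header from the data, here is a minimal sketch that reads an in-memory CSV instead of hacker_news.csv. The sample rows and the names sample_csv, sample_headers, and sample_data are made up for the example; only the column layout follows the data set.

```python
import io
from csv import reader

# Hypothetical two-row sample with the same columns as the data set
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,2,6,someone,8/16/2016 9:55\n"
    "2,Show HN: Example project,http://example.com,5,3,someone_else,1/26/2016 19:30\n"
)

rows = list(reader(sample_csv))
sample_headers = rows[0]  # The first row holds the column names
sample_data = rows[1:]    # Everything after it is actual data

print(sample_headers)
print(len(sample_data))
```

Note that a missing URL comes through as an empty string rather than as a missing column, so the row still has seven fields.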

Dataset Explanation

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
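To make the column positions concrete, the sketch below pairs the column names with one of the rows printed earlier. The names column_names, sample_row, and sample_post are chosen for this illustration only and are not used elsewhere in the notebook.

```python
column_names = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12335860', 'How often to update third party libraries?', '', '7', '5',
              'rabid_oxen', '8/22/2016 12:37']

# Pair each column name with the value at the same position
sample_post = dict(zip(column_names, sample_row))

print(sample_post['num_comments'])       # Every value is read as a string
print(column_names.index('created_at'))  # Position of the timestamp column
```

Because every value is a string, numeric columns such as num_comments need to be converted with int() before any arithmetic.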

Example

id	title	url	num_points	num_comments	author	created_at
12296411	Ask HN: How to improve my personal website?		2	6	ahmedbaracat	8/16/2016 9:55:00 AM
10975351	How to Use Open Source and Shut the F*ck Up at the Same Time	http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/	39	10	josep2	1/26/2016 19:30
11964716	Florida DJs May Face Felony for April Fools' Water Joke	http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/	2	1	vezycash	6/23/2016 22:20

As you can see, some entries have missing values and others have no tag. The missing value above is the URL, which we do not need, so that entry is fine as it is. Posts whose titles do not begin with a tag (Ask HN or Show HN) will be separated from the posts that we need later in the process.

Prepping

Segregating Tagged Posts for Analysis

Since we are only interested in posts tagged Ask HN or Show HN, we will use the startswith() method to test whether the title of each row begins with Ask HN or Show HN. We will then store each kind of post in its own list.
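The title test can be sketched on a few titles before applying it to the whole data set. The function classify_title and the sample titles are hypothetical and only illustrate the case-insensitive prefix check.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # Lowercase first so the check ignores case
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: How to improve my personal website?",
    "SHOW HN: A hypothetical side project",
    "How often to update third party libraries?",
    "Why you should ask HN first",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

The last title mentions "ask HN" but does not begin with it, so it is classified as "other", which is the behavior we want from startswith().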

Code Explanation

ask_post, show_post, and other_post are lists of lists containing the separated posts.

This code will segregate each post into its corresponding list so that we can process our target posts (ask_post and show_post).

In [3]:

ask_post = []
show_post = []
other_post = []

# Separate "ask hn" and "show hn" posts into their own lists
for row in hn:  # Iterate over each row in the data set "hn"
    title = row[1]

    # Rows whose title starts with "ask hn" go to "ask_post"
    if title.lower().startswith("ask hn"):
        ask_post.append(row)
    # Rows whose title starts with "show hn" go to "show_post"
    elif title.lower().startswith("show hn"):
        show_post.append(row)
    # Everything else goes to "other_post"
    else:
        other_post.append(row)

# Print the number of rows in each list
print("Number of post in Ask HN: {:,}".format(len(ask_post)))
print("Number of post in Show HN: {:,}".format(len(show_post)))
print("Number of post in other tagged post: {:,}".format(len(other_post)))

Number of post in Ask HN: 1,744
Number of post in Show HN: 1,162
Number of post in other tagged post: 17,194

Output Explanation

There are more Ask HN posts than Show HN posts. In this sample, users post questions more often than they post something to show.

However, this does not tell us whether ask_post receives more comments than show_post.

Analysis

Average Comments per Tag

Now that we have segregated the data set by tag, we will sum the comments on all posts for each tag and divide by the number of posts with that tag, which gives the average number of comments per post.

Code Explanation

This code determines the average number of comments for both kinds of tagged post (ask_post and show_post).
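The same calculation can be sketched as a small reusable function. The name average_comments and the three sample rows are hypothetical; the only thing taken from the data set is that num_comments sits at index 4.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: A?', '', '2', '6', 'a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B?', '', '1', '10', 'b', '8/16/2016 10:05'],
    ['3', 'Ask HN: C?', '', '4', '2', 'c', '8/17/2016 21:40'],
]
sample_average = average_comments(sample_posts)
print("{:.2f}".format(sample_average))
```

Guarding against an empty list avoids a division by zero if a tag happens to have no posts.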

In [4]:

total_ask_comments = 0  # Running total of ask_post comments
total_show_comments = 0  # Running total of show_post comments

# Sum the comments over all posts in "ask_post"
for post in ask_post:
    num_comments = int(post[4])  # Column 4 is num_comments
    total_ask_comments += num_comments

# Sum the comments over all posts in "show_post"
for post in show_post:
    num_comments = int(post[4])
    total_show_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_post)  # Average comments per ask_post
print("Average Comments in ask_post: {:.2f}".format(avg_ask_comments))

avg_show_comments = total_show_comments / len(show_post)  # Average comments per show_post
print("Average Comments in show_post: {:.2f}".format(avg_show_comments))

Average Comments in ask_post: 14.04
Average Comments in show_post: 10.32

Output Explanation

Show HN posts received an average of 10.32 comments per post, while Ask HN posts are 3.72 ahead with an average of 14.04. Because Ask HN posts receive more comments on average, we will use ask_post to determine the peak time for receiving comments.

Count Post and Comment by Hour

In the previous cell we determined which tag receives more comments on average. Now we will use it to find out in more detail what time is optimal for creating an Ask HN post that attracts many comments. To do that, we need to count the posts created in each hour of the day and the comments those posts received, using the hour as the key. Dividing the comment count by the post count for each hour then gives the average comments per post for that hour.

Code Explanation

This code creates two dictionaries, one for the number of posts per hour and one for the number of comments per hour, which are used later to compute the average comments per post per hour.

Since Ask HN has more comments on average than Show HN, we use ask_post to count posts and comments for each hour of the day.
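As an alternative sketch of the same counting idea, collections.defaultdict removes the need to check whether an hour is already a key. This is not the approach used in the cell below; the sample pairs and the names posts_per_hour and comments_per_hour are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs
sample = [
    ["8/16/2016 9:55", 6],
    ["8/16/2016 9:05", 4],
    ["1/26/2016 19:30", 10],
]

posts_per_hour = defaultdict(int)     # Missing hours start at 0
comments_per_hour = defaultdict(int)

for created_at, n_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += n_comments

print(dict(posts_per_hour))
print(dict(comments_per_hour))
```

Here the hour is kept as an integer, whereas the notebook stores it as a string; either works as long as both dictionaries use the same kind of key.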

In [5]:

import datetime as dt  # Module to process dates and times

result_list = []  # List of [created_at, num_comments] pairs

# Build the list of pairs from the Ask HN posts
for post in ask_post:
    time_created = post[6]
    comments = int(post[4])  # Convert the comment count to an integer

    result_list.append([time_created, comments])

counts_by_hour = {}  # Number of posts created in each hour of the day
comments_by_hour = {}  # Total comments received by posts created in each hour

# Fill both dictionaries, using the hour as the key
for date_str, comments in result_list:
    # Parse "created_at" into a datetime object and keep only the hour
    time = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = str(time.hour)

    if hour not in counts_by_hour:  # First post seen for this hour: initialize
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:  # Otherwise accumulate
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

Determine Average Comments per Hour

Having counted the posts and comments for each hour in the previous cell, we now compute the average comments per post for each hour. This will later show which hour of the day has the most comments per post.
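The division can be illustrated on two tiny dictionaries. The numbers below are invented for the example and are not taken from the data set.

```python
# Hypothetical per-hour totals, keyed by hour as in the notebook
sample_counts = {"15": 2, "2": 4}       # Posts created in each hour
sample_comments = {"15": 80, "2": 60}   # Comments received by those posts

# Both dictionaries share the same keys, so one lookup per hour is enough
sample_averages = {
    hour: sample_comments[hour] / sample_counts[hour] for hour in sample_counts
}
print(sample_averages)
```

With these numbers, hour 15 averages 40 comments per post and hour 2 averages 15.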

In [6]:

avg_by_hour = []  # List of [hour, average comments per post] pairs

# Both dictionaries share the same keys (hours), so one pass is enough
for hour, post_count in counts_by_hour.items():
    avg_by_hour.append([hour, comments_by_hour[hour] / post_count])

Sort Average Comments per Post

Now that we have the average comments per post for each hour of the day, all that remains is to make it readable so we can see the peak time for creating an Ask HN post.

Code Explanation

This code sorts the list of average comments in descending order.

To do so, we swap the elements of each inner list so that sorting the list of lists orders it by average comments per post rather than by hour.
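For comparison, sorted() also accepts a key function, which allows sorting by the second element without swapping. This is an alternative sketch, not what the cell below does, and the sample values are invented.

```python
# Hypothetical [hour, average comments] pairs
sample_avg_by_hour = [["9", 5.5], ["15", 38.5], ["2", 23.75]]

# Sort by the average (index 1), highest first, leaving each pair as is
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The swapped version used in this notebook gives the same ordering by average; the key function simply keeps each pair in its original [hour, average] layout.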

In [7]:

swap_avg_by_hour = []  # `avg_by_hour` with the elements of each pair swapped, for sorting

# Swap each pair so that the average comes first
for row in avg_by_hour:
    # Assign columns to variables for readability
    hour = row[0]
    comments = row[1]

    swap_avg_by_hour.append([comments, hour])  # Append the swapped pair

sorted_swap = sorted(swap_avg_by_hour, reverse=True)  # Sort in descending order of average comments
sorted_swap

Out[7]:

[[38.5948275862069, '15'],
 [23.810344827586206, '2'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '1'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '8'],
 [10.08695652173913, '5'],
 [9.41095890410959, '12'],
 [9.022727272727273, '6'],
 [8.127272727272727, '0'],
 [7.985294117647059, '23'],
 [7.852941176470588, '7'],
 [7.796296296296297, '3'],
 [7.170212765957447, '4'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '9']]

Output Explanation

The list is now sorted. The only remaining step is to display the top 5 hours for creating an Ask HN post, formatted for readability.

In [8]:

print("Top 5 Hours for Ask Posts Comments in Eastern Time")
# Print the 5 hours with the highest average number of comments per Ask HN post
for i in range(5):
    hours = dt.datetime.strptime(sorted_swap[i][1], "%H").strftime("%H:%M:%S")  # Format the hour as HH:MM:SS
    print("{hours}: {comments:.2f} average comments per post".format(hours=hours, comments=sorted_swap[i][0]))  # Formatted output

Top 5 Hours for Ask Posts Comments in Eastern Time
15:00:00: 38.59 average comments per post
02:00:00: 23.81 average comments per post
20:00:00: 21.52 average comments per post
16:00:00: 16.80 average comments per post
21:00:00: 16.01 average comments per post

Convert to Different Timezones

The output above is in Eastern Time and may not be convenient for users who live in a different timezone. For your convenience, I will convert these Eastern times to 3 other timezones.

In [9]:

import pytz

local_format = "%H:%M:%S"  # Format of time
timezones = ["America/Los_Angeles", "Europe/Madrid", "America/Puerto_Rico"]  # Target timezones
eastern = pytz.timezone("US/Eastern")

# Parsing with "%H" alone yields a datetime in the year 1900, for which pytz
# applies historical local-mean-time offsets that are not whole hours.
# Attaching a modern date avoids this and removes the need for manual corrections.
# Assumption: a winter reference date, so every zone is on standard time;
# during daylight saving time some differences change (e.g. Puerto Rico).
reference_date = dt.date(2016, 1, 15)

for tz in timezones:
    print("\nEastern Timezone -> {} Timezone".format(tz))
    for avg_comments, hour in sorted_swap[:5]:  # Top 5 hours by average comments per post
        naive_moment = dt.datetime.combine(reference_date, dt.time(int(hour)))
        est_moment = eastern.localize(naive_moment)  # Mark the naive time as Eastern
        local_datetime = est_moment.astimezone(pytz.timezone(tz))  # Convert to the target timezone
        print("{0} -> {1} : {2:.2f} Average Comments per Post".format(
            est_moment.strftime(local_format), local_datetime.strftime(local_format), avg_comments))

Eastern Timezone -> America/Los_Angeles Timezone
15:00:00 -> 12:00:00 : 38.59 Average Comments per Post
02:00:00 -> 23:00:00 : 23.81 Average Comments per Post
20:00:00 -> 17:00:00 : 21.52 Average Comments per Post
16:00:00 -> 13:00:00 : 16.80 Average Comments per Post
21:00:00 -> 18:00:00 : 16.01 Average Comments per Post

Eastern Timezone -> Europe/Madrid Timezone
15:00:00 -> 21:00:00 : 38.59 Average Comments per Post
02:00:00 -> 08:00:00 : 23.81 Average Comments per Post
20:00:00 -> 02:00:00 : 21.52 Average Comments per Post
16:00:00 -> 22:00:00 : 16.80 Average Comments per Post
21:00:00 -> 03:00:00 : 16.01 Average Comments per Post

Eastern Timezone -> America/Puerto_Rico Timezone
15:00:00 -> 16:00:00 : 38.59 Average Comments per Post
02:00:00 -> 03:00:00 : 23.81 Average Comments per Post
20:00:00 -> 21:00:00 : 21.52 Average Comments per Post
16:00:00 -> 17:00:00 : 16.80 Average Comments per Post
21:00:00 -> 22:00:00 : 16.01 Average Comments per Post

And there you have it: the top 5 hours to create your Ask HN post to attract a large volume of comments. Hope you find this useful!

TIP: You can convert the hours to your own timezone by modifying the list of timezones above.
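A conversion can also be sanity-checked by hand with fixed UTC offsets, which needs no time zone database. The sketch below assumes standard-time offsets (EST = UTC-5, PST = UTC-8, CET = UTC+1, AST = UTC-4) and an arbitrary winter date; the names STANDARD_OFFSETS and convert_hour are hypothetical, and the results shift by an hour for some zone pairs when daylight saving time is in effect.

```python
import datetime as dt

# Assumed standard-time offsets from UTC, in hours (daylight saving time is ignored)
STANDARD_OFFSETS = {"EST": -5, "PST": -8, "CET": 1, "AST": -4}

def convert_hour(hour, source, target):
    """Convert a whole hour from one fixed-offset zone to another."""
    source_tz = dt.timezone(dt.timedelta(hours=STANDARD_OFFSETS[source]))
    target_tz = dt.timezone(dt.timedelta(hours=STANDARD_OFFSETS[target]))
    # The date is arbitrary; only the time of day matters with fixed offsets
    moment = dt.datetime(2016, 1, 15, hour, tzinfo=source_tz)
    return moment.astimezone(target_tz).strftime("%H:%M:%S")

for zone in ["PST", "CET", "AST"]:
    print("15:00:00 EST -> {} {}".format(convert_hour(15, "EST", zone), zone))
```

Under these assumptions, 15:00 Eastern corresponds to 12:00 Pacific, 21:00 Central European, and 16:00 Atlantic time.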

Conclusion

We found that Ask HN posts receive more comments on average than Show HN posts, and that Ask HN posts created at 3:00 pm Eastern Time receive the most comments on average.

This answers the two questions that we posed earlier:

Do Ask HN or Show HN receive more comments on average?
Do posts created at a certain time receive more comments on average?

To answer them, we took the following steps:

Segregate Tagged Posts for Analysis
Determine the Highest Average Comments in each Tagged Post
Count Post and Comment by Hour of the Highest Average Comments Determined Earlier
Determine Average Comments per Hour
Sort Average Comments per Post

These results are useful for anyone who wants answers to a specific question: they indicate what time to post a question on Hacker News to attract the most comments.