Ask HN vs Show HN Posts: Hacker News Engagement Analysis

Overview

In this project, we look at the popular technology news site Hacker News and analyse the posts whose titles begin with either Ask HN or Show HN.

Ask HN posts are submitted to ask the community a specific question, while Show HN posts are submitted to show the community a project, product, or something interesting.

We will compare the two types of posts to determine:

  • Which receive more comments on average?
  • Do posts created at a certain time receive more comments on average?

Dataset

The dataset for this project can be downloaded from here. Below is a description of the columns:

  • id: unique identifier from Hacker News for the post
  • title: the title of the post
  • url: the URL that the post links to (if it has one)
  • num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
  • num_comments: the number of comments on the post
  • author: the username of the person who submitted the post
  • created_at: the date and time of the post's submission (the time zone is US Eastern Time)
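
To make the column layout concrete, here is a minimal sketch that parses one row by column name rather than by position. It assumes an in-memory sample (the header plus a single row from the dataset) and uses `csv.DictReader` with `io.StringIO`; both are choices made for this illustration, not the way the notebook itself loads the file.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv: the header plus one row from the dataset
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader maps each row to the column names taken from the header line
rows = list(csv.DictReader(io.StringIO(sample_csv)))
first = rows[0]

# CSV values arrive as strings, so numeric columns need an explicit int()
num_points = int(first["num_points"])
num_comments = int(first["num_comments"])
print(first["title"], num_points, num_comments)
```

Reading fields by name avoids having to remember that, for example, index 4 is num_comments and index 6 is created_at.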

Note: This dataset is based on approximately 20,000 rows randomly sampled from the submissions after removing posts without any comments.

Let's read the dataset and display some posts.

Data Analysis

In [2]:
from csv import reader

# Use a context manager so the file is closed once the rows are read
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
headers = hn[0]
hn = hn[1:]
print(headers)
for post in hn[:5]:
    print(post)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Now, let's separate the posts, as we are only interested in Ask HN and Show HN posts.

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(post)
    elif title.startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
print('Number of Ask HN posts: ',len(ask_posts))
print('Number of Show HN posts: ',len(show_posts))
print('Number of Other posts: ',len(other_posts))
print('\n')
print('='*5,'Ask HN posts','='*5)
print(ask_posts[:5])
print('\n')
print('='*5,'Show HN posts','='*5)
print(show_posts[:5])
Number of Ask HN posts:  1744
Number of Show HN posts:  1162
Number of Other posts:  17194


===== Ask HN posts =====
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


===== Show HN posts =====
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]

1. Do Ask HN or Show HN posts receive more comments on average?

Let's determine the total number of comments and the average number of comments per post for Ask HN vs Show HN posts.

In [4]:
def get_comments_stats(posts, print_answers = True):
    total_comments = 0
    avg_comments = 0
    total_posts = len(posts)
    for post in posts:
        num_comments = int(post[4])  # column 4 holds num_comments
        total_comments += num_comments
    # Guard against an empty list to avoid a ZeroDivisionError
    if total_posts > 0:
        avg_comments = round(total_comments/total_posts,2)
    
    if print_answers:
        print('Total posts = ',total_posts)
        print('Total comments = ',total_comments)
        print('Average comments per post = ',avg_comments)
        
    return avg_comments, total_comments

print('='*5,'Ask HN','='*5)
avg_ask_comments,total_ask_comments = get_comments_stats(ask_posts)
print('\n')
print('='*5,'Show HN','='*5)
avg_show_comments,total_show_comments = get_comments_stats(show_posts)
===== Ask HN =====
Total posts =  1744
Total comments =  24483
Average comments per post =  14.04


===== Show HN =====
Total posts =  1162
Total comments =  11988
Average comments per post =  10.32

After separating the Ask HN and Show HN posts and calculating the average comments across posts, we can see that the average number of comments per post is higher for Ask HN posts (about 14 comments per post) than for Show HN posts (about 10 comments per post).
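
As a quick arithmetic check, the two averages can be reproduced directly from the totals and post counts printed above. This is only a sketch that hard-codes those reported figures; the variable names are introduced here for illustration.

```python
# Totals and post counts as reported in the output above
ask_total_comments, ask_total_posts = 24483, 1744
show_total_comments, show_total_posts = 11988, 1162

# Average = total comments / number of posts, rounded to two decimals
ask_avg = round(ask_total_comments / ask_total_posts, 2)
show_avg = round(show_total_comments / show_total_posts, 2)

# Ratio of the two averages, to express the gap in relative terms
ratio = round(ask_avg / show_avg, 2)
print(ask_avg, show_avg, ratio)
```

The ratio of about 1.36 means that, in this sample, an Ask HN post receives roughly a third more comments on average than a Show HN post.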

This answers our first question:

Ask HN posts receive more comments on average than Show HN posts

Since Ask HN posts receive more comments on average, we will focus only on these posts to answer our next question.

2. Do posts created at a certain time receive more comments on average?

To answer this, we need to:

  1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received
  2. Calculate the average number of comments ask posts receive by hour created

For this, we will extract the hour a post was made from the created_at datetime field in our dataset and then create two dictionaries: one for the number of posts and another for the number of comments, both keyed by the hour.

In [5]:
import datetime as dt
import pytz
result_list = []
for post in ask_posts:
    created_at = dt.datetime.strptime(post[6],'%m/%d/%Y %H:%M')
    num_comments = int(post[4])
    result_list.append([created_at,num_comments])

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_date = row[0]
    created_hour = created_date.strftime('%H')
    num_comments = row[1]
    if created_hour in counts_by_hour:
        counts_by_hour[created_hour] += 1
        comments_by_hour[created_hour] += num_comments
    else:
        counts_by_hour[created_hour] = 1
        comments_by_hour[created_hour] = num_comments
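
The same tally can also be sketched with `collections.defaultdict`, which removes the need for the if/else branch. This is an alternative under stated assumptions, not the notebook's approach: the sample uses five (created_at, num_comments) pairs from the Ask HN posts shown earlier plus one invented row, and the names `sample`, `sample_counts`, and `sample_comments` are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs in the dataset's date format
sample = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 10:14", 1),
    ("8/2/2016 14:20", 3),
    ("10/15/2015 16:38", 17),
    ("8/4/2016 9:05", 4),  # invented row so that one hour appears twice
]

# Missing keys start at 0, so no membership check is needed
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Separate variable names are used so that running the sketch does not overwrite counts_by_hour and comments_by_hour from the cell above.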

Now that we have the number of posts and the total number of comments for each hour of the day, we will compute the average number of comments per post for each hour and display the top 5 hours during which Ask HN posts get the most comments.
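
Ranking hours by their average comes down to sorting a dictionary by value. A minimal sketch, assuming a small dictionary of illustrative averages (the values and the name `avg_by_hour_sample` are made up for this example and are not the dataset's results):

```python
# Illustrative averages keyed by two-digit hour; not taken from the dataset
avg_by_hour_sample = {"09": 5.5, "15": 38.6, "02": 23.8, "20": 21.5, "13": 14.7}

# Sort the (hour, average) pairs by the average, highest first, and keep three
top_3 = sorted(avg_by_hour_sample.items(), key=lambda item: item[1], reverse=True)[:3]

for hour, avg in top_3:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Sorting with a key function keeps each pair in (hour, average) order, so no swapped tuples are needed to make the sort work.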

Note: According to the documentation, the dataset's time zone is US/Eastern, so we will display the results both in this time zone and in the local time zone, which in my case is Europe/London.
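
To illustrate the conversion between the two zones, here is a minimal sketch using only fixed UTC offsets from the standard library. It assumes standard time on both sides (US/Eastern at UTC-5, Europe/London at UTC+0) and therefore ignores daylight saving time; the helper name `shift_hour` is hypothetical.

```python
import datetime as dt

def shift_hour(hour, from_utc_offset, to_utc_offset):
    """Convert an hour of the day between two fixed UTC offsets (in hours)."""
    source = dt.timezone(dt.timedelta(hours=from_utc_offset))
    target = dt.timezone(dt.timedelta(hours=to_utc_offset))
    # The calendar date is arbitrary; only the hour matters here
    moment = dt.datetime(2016, 1, 1, hour, tzinfo=source)
    return moment.astimezone(target).hour

# 15:00 at UTC-5 expressed at UTC+0
print(shift_hour(15, -5, 0))
# 21:00 at UTC-5 wraps past midnight at UTC+0
print(shift_hour(21, -5, 0))
```

With these offsets the two zones are five hours apart, so 3 pm US/Eastern corresponds to 8 pm Europe/London.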

In [12]:
from IPython.display import display, Markdown

# Function to return the results (hour in the day) sorted by highest average comments
def sorted_result(freq_tbl):
    table = freq_tbl
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    return table_sorted

# Function to sort and display the results formatted - By default displays top 5 in US/Eastern timezone but can change to local timezone and pass a Local timezone
def display_result(freq_tbl, display_top_n = 5, display_in_local_time = False, local_timezone = 'Europe/London'):
    # Dataset timezone per documentation
    us_eastern = pytz.timezone('US/Eastern')
    # Ability to display in a different timezone by taking the parameter
    local_tz = pytz.timezone(local_timezone)
    heading_template = ""
    
    if display_in_local_time:
        heading_template = 'Top {} Hours ({}) for Ask Posts Comments'.format(display_top_n,local_tz.zone)
    else:
        heading_template = 'Top {} Hours ({}) for Ask Posts Comments'.format(display_top_n,us_eastern.zone)
        
    display(Markdown('**'+heading_template+'**'))
    #print(heading_template)
    #print('='*len(heading_template))
    
    table_sorted = sorted_result(freq_tbl)
    display_str_template = "{}: {:.2f} average comments per post"
    row_count = 0
    
    for entry in table_sorted:
        row_count += 1
        if row_count <= display_top_n:
            # strptime with only '%H' yields the placeholder date 1900-01-01
            hour = dt.datetime.strptime(entry[1], '%H')
            
            if display_in_local_time:
                # pytz zones must be attached with localize(); replace(tzinfo=...)
                # applies the zone's historical LMT offset (-4:56) instead of -5:00
                hour = us_eastern.localize(hour)
                hour = hour.astimezone(local_tz)

            hour_fmt = hour.strftime('%H:%M')
            print(display_str_template.format(hour_fmt,entry[0]))
        else:
            # Stop once the requested number of rows has been printed
            break
            
avg_by_hour = {}
for hour in comments_by_hour:
    posts = counts_by_hour[hour]
    num_comments = comments_by_hour[hour]
    avg_comments = num_comments/posts
    avg_by_hour[hour] = avg_comments
display_result(freq_tbl = avg_by_hour,display_in_local_time = False)
print('\n')
display_result(freq_tbl = avg_by_hour,display_in_local_time = True,local_timezone = 'Europe/London')

Top 5 Hours (US/Eastern) for Ask Posts Comments

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Top 5 Hours (Europe/London) for Ask Posts Comments

20:00: 38.59 average comments per post
07:00: 23.81 average comments per post
01:00: 21.52 average comments per post
21:00: 16.80 average comments per post
02:00: 16.01 average comments per post

This answers our last question:

Ask HN posts created around 3 pm US/Eastern time (around 8 pm Europe/London time) seem to have the highest average number of comments per post.

Conclusion

After our data analysis of the Ask HN and Show HN posts, we conclude that Ask HN posts receive more comments on average and that the best hour of the day for comment activity on Ask HN posts is around 3 pm US/Eastern time (8 pm Europe/London time).