Data Analysis of Hacker News Posts

Background

Hacker News (sometimes abbreviated as HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity." (Source: Wikipedia)

In this project, we'll compare two types of posts from Hacker News: those whose titles begin with Ask HN and those whose titles begin with Show HN.

The dataset used in this analysis can be found on this Kaggle page.

Objective

The objective is to compare Ask HN and Show HN posts to determine the following:

  • Do Ask HN or Show HN receive more comments on average?
  • Do posts created at a certain time receive more comments on average?

Note that the dataset was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments.
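The kind of filtering described above can be sketched as follows. This is only an illustration, not the preprocessing actually applied to the dataset: the rows are made up, and the column position of `num_comments` is assumed to match `hacker_news.csv`.

```python
# Made-up rows in the hacker_news.csv column order (illustration only).
rows = [
    ['1', 'Post with comments', '', '10', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Post without comments', '', '5', '0', 'user_b', '1/1/2016 11:00'],
    ['3', 'Another commented post', '', '2', '7', 'user_c', '1/2/2016 12:00'],
]

NUM_COMMENTS = 4  # assumed index of the num_comments column

# Keep only the submissions that received at least one comment.
commented = [row for row in rows if int(row[NUM_COMMENTS]) > 0]

print(len(rows), "rows before,", len(commented), "rows after")
```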

Fields Description

Below is the description of the columns in the hacker_news.csv dataset:

id: The post unique identifier
title: The post title
url: The URL that the posts link to if the post has a URL
num_points: The total number of points the post received, which is calculated as the total number of upvotes minus the total number of downvotes
num_comments: The total number of comments the post received
author: The person that submitted the post
created_at: The date and time at which the post was submitted (time zone - Eastern Time in the US)
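Because the rows are read as plain lists, each column is accessed by position. A minimal sketch of one way to keep those positions readable is shown below; the `col` lookup is a hypothetical helper, not part of the original analysis, and it assumes the columns appear in the order listed above.

```python
# Column names in the order described above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
col = {name: index for index, name in enumerate(headers)}

# One row from the dataset, used here only to demonstrate the lookup.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

title = sample_row[col['title']]
num_comments = int(sample_row[col['num_comments']])  # values are read as strings
print(title, num_comments)
```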

In [1]:
# Import the needed libraries
from csv import reader
import datetime as dt
In [2]:
# Read in the data
# Use a context manager so the file is closed after reading
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

# View the first 5 records
hn[:5]
Out[2]:
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

Removing Headers from a List of Lists

In [3]:
# Remove the header row
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

Extracting Ask HN and Show HN Posts

First of all, we'll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.

In [4]:
# Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print("Total number of ASK Posts HN: ", len(ask_posts))
print("Total number of SHOW Posts HN: ", len(show_posts))
print("Total number of OTHER Posts HN:", len(other_posts))
        
    
Total number of ASK Posts HN:  1744
Total number of SHOW Posts HN:  1162
Total number of OTHER Posts HN: 17194
In [5]:
# Below are the first five rows of ask_posts
print(ask_posts[:5])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
In [6]:
# Below are the first five rows of show_posts
print(show_posts[:5])
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
In [7]:
# Below are the first five rows of other_posts
print(other_posts[:5])
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that we have separated Ask HN posts and Show HN posts into different lists, we'll calculate the average number of comments each type of post receives.

In [8]:
# Calculate the average number of comments `Ask HN` posts receive.
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of Ask HN comments:", round(avg_ask_comments, 2))
Average number of Ask HN comments: 14.04
In [9]:
# Calculate the average number of comments `Show HN` posts receive.
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of Show HN comments: ", round(avg_show_comments,2))
Average number of Show HN comments:  10.32

From the analysis above, Ask HN posts receive about 14 comments on average, whereas Show HN posts receive about 10. Since Ask HN posts tend to receive more comments, the remaining analysis focuses on them.
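The two averaging loops above follow the same pattern, so they could be folded into one function. The sketch below is one possible refactoring, not the notebook's own code: `average_comments` is a hypothetical helper name, and it assumes `num_comments` sits at index 4 as in the rows shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Two Ask HN rows copied from the output above.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample_ask))  # (6 + 29) / 2 = 17.5
```

Calling the same function on `ask_posts` and `show_posts` would avoid repeating the loop for each post type.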

Finding the Amount of Ask Posts and Comments by Hour Created

Next, we'll determine whether creating an Ask post at a certain time increases the number of comments it is likely to receive. First, we'll count the Ask posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments that Ask posts created at each hour receive.

In [10]:
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time_t = dt.datetime.strptime(date, date_format).strftime("%H")
    if time_t not in counts_by_hour:
        counts_by_hour[time_t] = 1
        comments_by_hour[time_t] = comment
        
    else:
        counts_by_hour[time_t] +=1
        comments_by_hour[time_t] += comment
    
comments_by_hour
Out[10]:
{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
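The same tally can be written without the explicit `if`/`else` branch by using `collections.defaultdict`. This is an alternative sketch rather than the approach used in the cell above; it runs on the first five Ask HN rows shown earlier, reduced to `[created_at, num_comments]` pairs.

```python
import datetime as dt
from collections import defaultdict

# First five Ask HN rows from the earlier output: [created_at, num_comments].
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

date_format = "%m/%d/%Y %H:%M"
counts_by_hour = defaultdict(int)    # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'.
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(comments_by_hour))
```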

Calculating the Average Number of Comments for Ask HN Posts by Hour

We use the dictionaries we just created to calculate the average number of comments for Ask HN posts by hour.

In [11]:
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour
Out[11]:
[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Sorting and Printing Values from a List of Lists

The average number of comments for posts created during each hour of the day was calculated above. However, the format makes it difficult to identify the hours with the highest values, so we'll sort the list and print the five highest values in a more readable format.

In [12]:
swap_avg_by_hour = []

for row in avg_by_hour:
    first_e = row[1]
    second_e = row[0]
    swap_avg_by_hour.append([first_e, second_e])

print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('\n')
sorted_swap
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Out[12]:
[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
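Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` also accepts a `key` function, which sorts on the second element directly and leaves each row in its original `[hour, average]` order. The values below are a subset of the averages computed above.

```python
# A subset of avg_by_hour from the output above: [hour, average comments].
avg_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (index 1), highest first; no column swap needed.
by_average = sorted(avg_sample, key=lambda row: row[1], reverse=True)

for hour, avg in by_average:
    print("{}:00  {:.2f} average comments per post".format(hour, avg))
```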
In [27]:
# Print the 5 hours with the highest average comments.
# West Africa Time (WAT, UTC+1) is 5 hours ahead of Eastern Daylight Time (EDT, UTC-4)
print("Top 5 Hours for 'Ask HN' Comments")

for row in sorted_swap[:5]:    
    # West Africa Time (WAT)
    # Convert the hour from Eastern Time to WAT; the 5-hour offset assumes EDT (UTC-4), while EST (UTC-5) would need 6 hours
    wat_hr_dt = dt.datetime.strptime(row[1], '%H') + dt.timedelta(hours=5)
    wat_hr_str = wat_hr_dt.strftime('%H:%M')
    
    print('   ', '{wat_time} WAT:    {avg:.2f} average comments per post'.format(wat_time=wat_hr_str, avg=row[0]))
Top 5 Hours for 'Ask HN' Comments
    20:00 WAT:    38.59 average comments per post
    07:00 WAT:    23.81 average comments per post
    01:00 WAT:    21.52 average comments per post
    21:00 WAT:    16.80 average comments per post
    02:00 WAT:    16.01 average comments per post

Conclusion

From the analysis, the hour with the most comments per post on average is 20:00 WAT (15:00 Eastern Time), with an average of 38.59 comments per post. That is roughly seven times the lowest hourly average of 5.58 comments per post, for posts created at 09:00 Eastern Time.

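The dataset source states that the timestamps are in US Eastern Time. The hours were converted to WAT, so 20:00 can also be written as 8:00 PM WAT. Note that WAT (UTC+1) is 5 hours ahead of Eastern Daylight Time (UTC-4) and 6 hours ahead of Eastern Standard Time (UTC-5); the conversion here uses the 5-hour offset throughout.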
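The offset between Eastern Time and WAT depends on whether daylight saving time is in effect in the US. The sketch below illustrates this with fixed UTC offsets from the standard `datetime` module; the two timestamps are hypothetical examples, and the fixed offsets are a simplification that does not switch between EDT and EST automatically.

```python
import datetime as dt

# Fixed UTC offsets, assumed for illustration only.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")  # Eastern Daylight Time
EST = dt.timezone(dt.timedelta(hours=-5), "EST")  # Eastern Standard Time
WAT = dt.timezone(dt.timedelta(hours=1), "WAT")   # West Africa Time

def to_wat(timestamp, source_tz):
    """Parse a created_at-style string and convert it to WAT."""
    naive = dt.datetime.strptime(timestamp, "%m/%d/%Y %H:%M")
    return naive.replace(tzinfo=source_tz).astimezone(WAT)

# Hypothetical timestamps: one in summer (EDT), one in winter (EST).
summer_post = to_wat("8/16/2016 15:00", EDT)
winter_post = to_wat("1/26/2016 15:00", EST)

print(summer_post.strftime("%H:%M"), "WAT")  # 5 hours ahead of EDT
print(winter_post.strftime("%H:%M"), "WAT")  # 6 hours ahead of EST
```

A full conversion would apply the correct offset per post date, which would shift the WAT hour by one for posts created during standard time.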

Finally, to maximize the number of comments, a post should be framed as an Ask HN post and created between 20:00 and 21:00 WAT (8:00 PM to 9:00 PM WAT).