Hacker News is a site where users ask questions or share information about products or projects. Other users can vote and comment on each post, and the dataset used here records both the points and the number of comments each post received.

This project analyzes and compares the number of comments received by question-based posts (Ask HN) and information-based posts (Show HN) to find the best hour of the day to create a post. To make our recommendation, we'll try to find out:

  • Which posts (Ask HN or Show HN) receive more comments on average.
  • Which hour of the day a post is likely to receive more comments on average.

Summary of Results

After analyzing the data, we found that Ask HN posts created at 15:00 (Eastern Time in the US) receive the most comments on average (38.59 per post), followed by those created at 02:00 (23.81 per post).

For further details, please refer to the full analysis below.

In [1]:
# Read in the data
from csv import reader

# Open the file in a context manager so that it is closed after reading
with open('hacker_news.csv', newline='') as opened_file:
    # Transform the read file into a list of lists
    hn = list(reader(opened_file))

# Display the first 5 rows (the header row plus four posts)
for row in hn[:5]:
    print(row)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
In [2]:
#display headers
headers = hn[0] #the first row of data, it contains headers
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
In [3]:
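# Display the first 5 rows of data without the header.
hn_without_header = hn[1:]
# enumerate numbers the rows directly; list.index would rescan the list
# and return the wrong position for duplicate rows
for number, row in enumerate(hn_without_header[:5], start=1):
    print(f'Row {number}: {row}')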
Row 1: ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
Row 2: ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
Row 3: ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
Row 4: ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
Row 5: ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Filtering Out The Data of Interest

As mentioned in the introduction, we're only interested in question-based and information-based posts, so we need to separate them from all other posts.

Fortunately, a brief exploration of the data shows that the titles of question-based posts begin with Ask HN, and the titles of information-based posts begin with Show HN. Based on this, we'll separate the posts into three categories:

  • ask posts (question-based)
  • show posts (information-based)
  • other posts (others)
In [4]:
ask_posts = []    # question-based posts (Ask HN)
show_posts = []   # information-based posts (Show HN)
other_posts = []  # all remaining posts

for row in hn_without_header:
    title = row[1].lower()  # lowercase once so the comparison is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
In [5]:
# Display the first 5 rows of the list of question-based posts.
for number, row in enumerate(ask_posts[:5], start=1):
    print(f'Row {number}: {row}')
Row 1: ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
Row 2: ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
Row 3: ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
Row 4: ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
Row 5: ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']
In [6]:
# Display the first 5 rows of the list of information-based posts.
for number, row in enumerate(show_posts[:5], start=1):
    print(f'Row {number}: {row}')
Row 1: ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
Row 2: ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
Row 3: ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']
Row 4: ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']
Row 5: ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']

Which posts (Ask HN or Show HN) receive more comments on average?

Recall that the first objective of this project is to determine which posts (Ask HN or Show HN) receive more comments on average. We will accomplish this by:

  • calculating the total number of comments received by Ask HN and Show HN posts
  • calculating the average number of comments received by Ask HN and Show HN posts
  • comparing the two averages
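As a small worked illustration of these steps, the sketch below uses only the comment counts of the five Ask HN rows displayed earlier (6, 29, 1, 3 and 17). The name `sample_ask_comments` is hypothetical, and the figures are for this five-row sample only, not the full dataset.

```python
# Comment counts copied from the five Ask HN rows printed earlier (sample only).
sample_ask_comments = [6, 29, 1, 3, 17]

total = sum(sample_ask_comments)            # step 1: total number of comments
average = total / len(sample_ask_comments)  # step 2: average comments per post

print(f'Total = {total}, average = {average:.2f}')
```

The same two quantities, computed separately for the Ask HN and Show HN lists, are what the comparison in the third step relies on.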
In [7]:
# Function to calculate the total number of comments
def calcTotalComment(posts):
    """Return the sum of the num_comments column (index 4) over all posts."""
    total_comments = 0
    for item in posts:
        total_comments += int(item[4])  # the CSV stores counts as strings
    return total_comments
In [8]:
# Function to calculate the average number of comments
def avg_comments(total_comments, posts):
    """Return the average comments per post as a string with two decimals."""
    calc_avg_comments = total_comments / len(posts)  # len gives the post count
    return f'{calc_avg_comments:.2f}'
In [9]:
#Total number of comments received by the question-based posts(Ask HN).
total_comments_askHN = calcTotalComment(ask_posts)
print(f'Total number of Ask HN comments = {total_comments_askHN:,}')


#Total number of comments received by the information-based posts(Show HN).
total_comments_showHN = calcTotalComment(show_posts)
print(f'Total number of Show HN comments = {total_comments_showHN:,}')
Total number of Ask HN comments = 24,483
Total number of Show HN comments = 11,988
In [10]:
#Average number of comments for the question-based posts from the total number of comments.
avg_askHN_comments = avg_comments(total_comments_askHN, ask_posts)
print(f'Average number of Ask HN comments = {avg_askHN_comments}')

#Average number of comments for the information-based posts from the total number of comments.
avg_showHN_comments = avg_comments(total_comments_showHN, show_posts)
print(f'Average number of Show HN comments = {avg_showHN_comments}')
Average number of Ask HN comments = 14.04
Average number of Show HN comments = 10.32

As we can see, question-based posts (Ask HN) receive more comments on average (14.04 per post) than information-based posts (Show HN, 10.32 per post). Since Ask HN posts attract more comments, we will focus on them to address our second objective.

Does the time the question-based posts are created influence the number of comments attracted by the posts?

To answer the above question we need to:

  • calculate the number of ask posts created during each hour of the day
  • calculate the total number of comments received by the posts created during each hour
  • calculate the average number of comments per post for each hour of the day
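The three steps above can be sketched on a toy input before applying them to the real data. The rows in `sample_rows` are made up for illustration (they are not taken from the dataset), and the use of `collections.defaultdict` is one possible design choice, not necessarily the approach used in the cells that follow.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs, made up for illustration.
sample_rows = [
    ('8/16/2016 9:55', 6),
    ('8/17/2016 9:10', 4),
    ('11/22/2015 13:43', 29),
    ('5/2/2016 13:14', 1),
]

counts = defaultdict(int)    # step 1: posts created per hour
comments = defaultdict(int)  # step 2: total comments per hour
for created_at, num_comments in sample_rows:
    parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = parsed.strftime('%H')  # zero-padded hour, e.g. '09'
    counts[hour] += 1
    comments[hour] += num_comments

# Step 3: average comments per post for each hour.
sample_avg_by_hour = {hour: comments[hour] / counts[hour] for hour in counts}
print(sample_avg_by_hour)
```

With `defaultdict(int)`, a missing hour starts at zero, so no explicit check for a first occurrence is needed.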
In [13]:
#Extract comments and time created 
result_list = [] #A list to hold the comments and time created. 
for item in ask_posts:
    time_created = item[6]
    comment_number = item[4]
    result_list.append([time_created, int(comment_number)])
In [14]:
# Parse each creation time and tally posts and comments per hour
import datetime as dt
counts_by_hr = {}    # dict: hour -> number of ask posts created in that hour
comments_by_hr = {}  # dict: hour -> total number of comments for that hour
for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")  # parse string to datetime
    hour = date.strftime("%H")  # extract the zero-padded hour, e.g. '09'
    comment = row[1]
    if hour not in counts_by_hr:
        counts_by_hr[hour] = 1
        comments_by_hr[hour] = comment
    else:
        counts_by_hr[hour] += 1
        comments_by_hr[hour] += comment
In [16]:
# Calculate the average number of comments per post during each hour of the day
avg_by_hour = []

# Both dictionaries share the same keys, so a single loop is enough
for hour in comments_by_hr:
    avg_by_hour.append([hour, comments_by_hr[hour] / counts_by_hr[hour]])
In [17]:
avg_by_hour
Out[17]:
[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

We successfully calculated the average number of comments received by posts created during each hour of the day. However, in this unsorted form it is difficult to identify the hours with the highest averages.

To make the results readable, we'll sort the list by average and print the five highest values in a clearer format.

In [18]:
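# Put the average first in each pair so that sorting orders by average.
# New pairs are built here so that avg_by_hour itself is left unchanged.
swap_avg_by_hour = []

for hour, avg in avg_by_hour:
    swap_avg_by_hour.append([avg, hour])

# Sort from the highest average to the lowest
sorted_swap = sorted(swap_avg_by_hour, reverse=True)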
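As an alternative sketch, the swap can be avoided by passing a `key` function to `sorted`, which tells it to order the pairs by their second element. The four pairs below are copied from the `Out[17]` listing; the names `sample_avg_by_hour` and `sample_sorted` are hypothetical.

```python
# A few [hour, average] pairs copied from the Out[17] listing.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (second element), highest first, without swapping columns.
sample_sorted = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sample_sorted:
    print(f'{hour}:00 -> {avg:.2f} average comments per post')
```

Either approach yields the same ordering; the `key` version simply keeps each pair in its original `[hour, average]` layout.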

Top 5 Hours for Ask HN Post Comments

In [20]:
import datetime as dt

top_five = sorted_swap[:5]
for avg, hour in top_five:
    # Format the hour as HH:MM and the average to two decimals
    hour_label = dt.datetime.strptime(hour, "%H").strftime('%H:%M')
    print(f'The average comments at {hour_label} is {avg:.2f}')
The average comments at 15:00 is 38.59
The average comments at 02:00 is 23.81
The average comments at 20:00 is 21.52
The average comments at 16:00 is 16.80
The average comments at 21:00 is 16.01

Conclusion and Recommendation

As we can see, the data show that Ask HN posts created at 15:00 (Eastern Time in the US) receive more comments on average than those created at any other hour of the day, followed by posts created at 02:00.

For a contributor living in the United Arab Emirates, 15:00 Eastern Time corresponds to 23:00 Gulf Standard Time while daylight saving time is in effect in the US, and to 00:00 otherwise.

If posting that late is not practical, the second-best hour, 02:00 Eastern Time, corresponds to 10:00 or 11:00 Gulf Standard Time, so posts created in the late morning can also attract a high number of comments.
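The conversion from Eastern Time to Gulf Standard Time can be checked with a few lines of code. The sketch below is a minimal illustration that assumes fixed UTC offsets (UTC-4 for Eastern daylight time, UTC-5 for Eastern standard time, UTC+4 for Gulf Standard Time) rather than a time zone database, and it converts the two hours with the highest averages in the top-five list (15:00 and 02:00). The helper `to_gulf_time` is hypothetical.

```python
import datetime as dt

# Fixed UTC offsets, stated as assumptions rather than looked up from a tz database.
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')  # US Eastern, daylight time
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')  # US Eastern, standard time
GST = dt.timezone(dt.timedelta(hours=4), 'GST')   # Gulf Standard Time

def to_gulf_time(hour, eastern_tz):
    """Return the GST clock time (HH:MM) for a given Eastern-time hour."""
    # The date is arbitrary; only the clock time matters with fixed offsets.
    eastern = dt.datetime(2016, 1, 1, hour, 0, tzinfo=eastern_tz)
    return eastern.astimezone(GST).strftime('%H:%M')

for hour in (15, 2):
    print(f'{hour:02d}:00 ET -> {to_gulf_time(hour, EDT)} GST (EDT) '
          f'or {to_gulf_time(hour, EST)} GST (EST)')
```

Because the offsets are hard-coded, the caller must choose between `EDT` and `EST` according to the time of year.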
