Hacker News is a site where users ask questions or share information about products or projects. Questions or information are ranked based on the number of comments/views.
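This project analyzes and compares the number of comments received by question-based posts (Ask HN) and information-based posts (Show HN) to find the best hours of the day to create a post. To make our recommendation, we'll try to find out which type of post (Ask HN or Show HN) receives more comments on average, and whether the hour at which a post is created influences the number of comments it receives.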
After analyzing the data, we found that posts created at 13:00 (Eastern Time in the US) receive the most comments on average, followed by those created at 02:00.
For further details, please refer to the full analysis below.
# Read in the data
from csv import reader
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    # Transform the read file into a list of lists
    hn = list(read_file)
# Display the first 5 rows
hn_5 = hn[:5]
for row in hn_5:
    print(row)
# Display the headers
headers = hn[0]  # the first row of the data contains the headers
print(headers)
# Display the first 5 rows of data without the header
hn_without_header = hn[1:]
for number, row in enumerate(hn_without_header[:5], start=1):
    print(f'Row {number}: {row}')
As mentioned in the introduction, we're only interested in question-based and information-based posts, so we need to filter out the unrelated posts.
Fortunately, a brief exploration of the data shows that the titles of question-based posts begin with Ask HN and the titles of information-based posts begin with Show HN. Based on this, we'll separate the posts into three categories: Ask HN posts, Show HN posts, and all other posts.
ask_posts = []  # list to hold the question-based posts
show_posts = []  # list to hold the information-based posts
other_posts = []  # list to hold the other, unrelated posts
for row in hn_without_header:
    title = row[1].lower()  # lowercase so the prefix check is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Display the first 5 rows of the list of question-based posts
for number, row in enumerate(ask_posts[:5], start=1):
    print(f'Row {number}: {row}')
# Display the first 5 rows of the list of information-based posts
for number, row in enumerate(show_posts[:5], start=1):
    print(f'Row {number}: {row}')
Recall that the first objective of this project is to determine which posts (Ask HN or Show HN) receive more comments on average. We will accomplish this by calculating the total number of comments and the average number of comments per post for each category.
# Function to calculate the total number of comments
def calcTotalComment(posts):
    total_comments = 0
    for item in posts:
        comments = int(item[4])  # index 4 holds the number of comments
        total_comments += comments
    return total_comments
# Function to calculate the average number of comments per post
def avg_comments(total_comments, posts):
    calc_avg_comments = total_comments / len(posts)
    return f'{calc_avg_comments:.2f}'  # string formatted to two decimal places
# Total number of comments received by the question-based posts (Ask HN)
total_comments_askHN = calcTotalComment(ask_posts)
print(f'Total number of Ask HN comments = {total_comments_askHN:,}')
# Total number of comments received by the information-based posts (Show HN)
total_comments_showHN = calcTotalComment(show_posts)
print(f'Total number of Show HN comments = {total_comments_showHN:,}')
# Average number of comments per question-based post
avg_askHN_comments = avg_comments(total_comments_askHN, ask_posts)
print(f'Average number of Ask HN comments = {avg_askHN_comments}')
# Average number of comments per information-based post
avg_showHN_comments = avg_comments(total_comments_showHN, show_posts)
print(f'Average number of Show HN comments = {avg_showHN_comments}')
As we can see, on average, question-based posts (Ask HN) receive more comments than information-based posts (Show HN). Now that we have identified the posts with the higher average number of comments, we can focus on the Ask HN posts to address our second objective.
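Does the time at which question-based posts are created influence the number of comments they attract?
To answer this question, we need to extract the creation time and the number of comments of each Ask HN post, count the posts and total the comments for each hour of the day, and then calculate the average number of comments per post for each hour.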
# Extract the comments and time created
import datetime as dt
result_list = []  # a list to hold the time created and the number of comments
for item in ask_posts:
    time_created = item[6]
    comment_number = item[4]
    result_list.append([time_created, int(comment_number)])
# Parse the date strings and tally the posts and comments for each hour
counts_by_hr = {}  # dictionary: number of Ask HN posts created in each hour
comments_by_hr = {}  # dictionary: total number of comments in each hour
for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")  # parse to datetime
    hour = date.strftime("%H")  # extract the hour as a two-digit string
    comment = row[1]
    if hour not in counts_by_hr:
        counts_by_hr[hour] = 1
        comments_by_hr[hour] = comment
    else:
        counts_by_hr[hour] += 1
        comments_by_hr[hour] += comment
# Calculate the average number of comments per post for each hour of the day
avg_by_hour = []
for hour in comments_by_hr:
    avg_by_hour.append([hour, comments_by_hr[hour] / counts_by_hr[hour]])
avg_by_hour
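The per-hour aggregation can be checked on a handful of rows before trusting it on the full dataset. The following is a minimal sketch, not part of the original analysis: the `sample_rows` values are made up for illustration, the helper name `average_comments_by_hour` is hypothetical, and `collections.defaultdict` is swapped in for the explicit `if hour not in ...` check used above.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Return {hour: average comments per post} for [created_at, comments] rows."""
    counts = defaultdict(int)  # number of posts per hour
    totals = defaultdict(int)  # total comments per hour
    for created_at, comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        totals[hour] += comments
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Made-up rows in the same [time_created, comments] layout as result_list
sample_rows = [
    ["8/16/2016 9:55", 6],
    ["8/17/2016 9:10", 2],
    ["11/22/2015 13:43", 29],
]
sample_averages = average_comments_by_hour(sample_rows)
print(sample_averages)  # {'09': 4.0, '13': 29.0}
```

Two posts fall in hour 09 with 6 and 2 comments, so the average is 4.0, while the single post in hour 13 keeps its own count. If the hand-computed values match, the same logic can reasonably be expected to hold for the full list.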
We successfully calculated the average number of comments received for posts created during each hour of the day. However, it's difficult to identify the hour with the highest values.
To make the results easier to read, we'll sort the list and print the five highest averages.
# Swap the columns so that each item is [average, hour], then sort by average
swap_avg_by_hour = []
for item in avg_by_hour:
    swap_avg_by_hour.append([item[1], item[0]])  # copy, so avg_by_hour is left unchanged
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
# Print the five hours with the highest average number of comments
top_five = sorted_swap[:5]
for average, hour in top_five:
    hour_label = dt.datetime.strptime(hour, "%H").strftime('%H:%M')
    print(f'The average number of comments at {hour_label} is {average:.2f}')
The results show that posts created at 13:00 (Eastern Time in the US) receive more comments on average than those created at any other hour of the day.
For a contributor living in the United Arab Emirates, which is 9 hours ahead of US Eastern Time (8 hours during US daylight saving time), this corresponds to posts created at 22:00 or 21:00 (Gulf Standard Time).
If posting late in the evening is not practical, one can post at 11:00 or 10:00 (Gulf Standard Time), which corresponds to 02:00 Eastern Time, as posts created at that hour also attract a high number of comments.
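The conversion from Eastern Time to Gulf Standard Time can be checked programmatically rather than by hand. The sketch below is an illustration only: it assumes Python 3.9 or later with time zone data available to the standard `zoneinfo` module, the helper name `eastern_to_gulf` is hypothetical, and the two dates are arbitrary examples chosen to cover US standard time and US daylight saving time.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

EASTERN = ZoneInfo("America/New_York")
GULF = ZoneInfo("Asia/Dubai")


def eastern_to_gulf(year, month, day, hour):
    """Convert a wall-clock hour in US Eastern Time to Gulf Standard Time."""
    eastern_time = datetime(year, month, day, hour, tzinfo=EASTERN)
    return eastern_time.astimezone(GULF)


# Arbitrary example dates: one in winter (EST) and one in summer (EDT)
winter = eastern_to_gulf(2016, 1, 15, 13)
summer = eastern_to_gulf(2016, 7, 15, 13)
print(winter.strftime("%H:%M"))  # 22:00
print(summer.strftime("%H:%M"))  # 21:00
```

Because the United Arab Emirates does not observe daylight saving time while the US Eastern zone does, the offset between the two shifts by one hour over the year, which is why a fixed date is passed in instead of a bare hour.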