In this project, we're working with a data set of submissions to popular technology site Hacker News. This data set has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. The column descriptions are as follows:
We're looking for posts in the Ask HN or Show HN category (from the title column). Ask HN posts are created to ask the Hacker News community a specific question. Show HN posts are submitted to show the Hacker News community a project, product, or just generally something interesting. We want to compare these two types of posts to determine the following:
# reading in the data set
import csv
# open the file in a with block so it is closed after reading
with open('hacker_news.csv') as opened_file:
    hn = list(csv.reader(opened_file))
hn[:5]
In order to analyze the data, it is useful to first remove the header row:
# extract the header row and assign it to a variable
headers = hn[0]
# remove the header row from our list of lists
hn = hn[1:]
# print the header row and the updated list to confirm the header was removed
print(headers)
print('\n')
print(hn[:5])
Now we're ready to filter the data to retain only titles beginning with Ask HN or Show HN. To do this, we'll split the data into three new lists:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(ask_posts[:5])
print('\n')
print(show_posts[:5])
print('\n')
print(other_posts[:5])
print('\n')
print("Length of Ask HN list:", (len(ask_posts)))
print("Length of Show HN list:", (len(show_posts)))
print("Length of Other Posts list:", (len(other_posts)))
We've printed the first five rows of each list to make sure the data filtered correctly, as well as the length of each list.
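The filtering rule used above can also be written as a small function, which makes it easy to check against a few titles. This is only a minimal sketch: `classify_title` and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles used only to exercise the function
sample_titles = ['Ask HN: How do you learn?', 'SHOW HN: My side project', 'A post about Python']
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing the title before the prefix check is what makes the match case-insensitive, as in the loop above.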
Now we'll determine if the Ask HN or Show HN list received more comments on average:
# finding the average number of comments on Ask HN posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average Number of Comments on Ask HN posts:", avg_ask_comments)
# finding the average number of comments on Show HN posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print("Average Number of Comments on Show HN posts:", avg_show_comments)
The average number of comments is approximately 14 for Ask HN posts and approximately 10 for Show HN posts, so in this sample Ask HN posts receive more comments on average. Since this is the case, we'll focus the rest of our analysis on just the Ask HN posts.
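The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below assumes the comment count sits at index 4, as in the code above; `average_comments` and the sample rows are hypothetical and only illustrate the idea.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index (hypothetical helper)."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows; only the value at index 4 matters here
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B', '', '2', '4', 'user_b', '11/22/2015 13:43'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)
```

With the real lists, the same call would be `average_comments(ask_posts)` and `average_comments(show_posts)`.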
In the second part of our analysis, we want to determine if Ask HN posts created at a certain time receive more comments. In order to do this, we'll:
# pair each Ask HN post's creation time with its number of comments
import datetime as dt
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
print(result_list[:5])
The above code builds a list that pairs each Ask HN post's creation time with its number of comments, and prints the first five entries so we can check that the values were extracted correctly.
# to calculate the number of Ask HN posts published each hour and the number of comments on those posts
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
print("Number of Ask HN posts published each hour:",
      counts_by_hour)
print('\n')
print("Number of comments on Ask HN posts published each hour:",
      comments_by_hour)
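The per-hour tallies above can also be sketched with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. The sample pairs below are made up and only stand in for `result_list`; the names `sample_results`, `sample_counts`, and `sample_comments` are hypothetical.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Made-up [created_at, num_comments] pairs standing in for result_list
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 9:10', 4],
    ['5/2/2016 15:30', 20],
]

# Missing keys start at 0, so no membership check is needed
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)
for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

The hour keys are zero-padded strings such as `'09'`, matching the keys produced by the code above.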
Next, we'll use the two dictionaries created above to calculate the average number of comments for Ask HN posts created during each hour of the day:
# to calculate the average number of comments for Ask HN posts created each hour of the day
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour
# to swap and sort the avg_by_hour list
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap
# to print the 5 highest values
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(
        dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))
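As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that picks the value to sort on. The sketch below uses made-up `[hour, average]` pairs in place of `avg_by_hour`; the names `sample_avg_by_hour` and `sorted_by_avg` are hypothetical.

```python
# Made-up [hour, average] pairs standing in for avg_by_hour
sample_avg_by_hour = [['09', 5.0], ['15', 20.0], ['02', 12.5]]

# Sort on the average (index 1), highest first, without swapping columns
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
for hr, avg in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post".format(hr, avg))
```

One difference worth noting: the swapped version breaks ties on the hour string, while the `key` version keeps tied entries in their original order.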
The data shows that Ask HN posts created during the 15:00 hour (3:00pm) receive the highest average number of comments per post, so in this sample, publishing during that hour gives a post the best chance of receiving comments. The dataset documentation says the timezone is U.S. Eastern Time, so that translates to 2:00pm in my timezone (U.S. Central Time).
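The timezone conversion can be checked with the standard `datetime` module. This is a rough sketch that assumes fixed standard-time offsets (Eastern as UTC-5, Central as UTC-6) and ignores daylight saving time; since both zones shift together, the one-hour difference holds either way. The date is arbitrary and the variable names are hypothetical.

```python
import datetime as dt

# Assumption: fixed standard-time offsets (EST = UTC-5, CST = UTC-6), no daylight saving handling
eastern = dt.timezone(dt.timedelta(hours=-5), "EST")
central = dt.timezone(dt.timedelta(hours=-6), "CST")

# Arbitrary date; only the hour matters for this illustration
peak_eastern = dt.datetime(2016, 1, 15, 15, 0, tzinfo=eastern)
peak_central = peak_eastern.astimezone(central)
print(peak_central.strftime("%H:%M"))
```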