In this project, we'll work with a data set of submissions to the popular technology site Hacker News. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
We're specifically interested in posts whose titles begin with either Ask HN or Show HN.
- Users submit Ask HN posts to ask the Hacker News community a specific question.
- Users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.
Our goal for this project is to compare these two types of post and determine the following:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
In the code cell below, we:
- Import the reader() function from the csv module
- Open the hacker_news.csv file using the open() function, and assign the output to a variable named opened_file. If you run into an error named UnicodeDecodeError, add encoding="utf8" to the open() function (for instance, use open('hacker_news.csv', encoding='utf8'))
- Read opened_file using the reader() function, and assign the output to a variable named read_file
- Convert read_file to a list of lists using list() and save it to a variable named hn
- Extract the first row of the data and assign it to a variable named header
- Remove the first row from hn
- Display header and the first 5 rows of the data set.
# hacker_news data set
from csv import reader
opened_file = open('hacker_news.csv', encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
header = hn[0]
hn = hn[1:]
print(header)
print('\n')
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
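As an optional refinement, and not what the cell above does, the file can be opened in a with block so that it is closed automatically once the rows have been read. The sketch below is a minimal illustration under stated assumptions: it reads from a small in-memory sample instead of hacker_news.csv so that it runs on its own, and the names sample_csv, opened_sample, sample_rows and sample_header are made up for this example. The two data rows are copied from the output above.

```python
from csv import reader
from io import StringIO

# Stand-in for hacker_news.csv: the header plus two rows from the output above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# With the real file this line would read:
#     with open('hacker_news.csv', encoding='utf8') as opened_sample:
with StringIO(sample_csv) as opened_sample:
    # Read every row while the file is still open.
    sample_rows = list(reader(opened_sample))

sample_header = sample_rows[0]  # first row holds the column names
sample_rows = sample_rows[1:]   # remaining rows are the posts

print(sample_header)
print(len(sample_rows))
```

The rows have to be converted to a list inside the with block, because the reader can no longer pull lines from the file once it is closed.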
Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.
To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith(). Given a string object, say, string1, we can check if it starts with, say, dq by inspecting the output of string1.startswith('dq'). If string1 starts with dq, it will return True; otherwise it will return False.
If we wish to control for case, we can use the lower() method, which returns a lowercase version of the original string.
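To make the behavior of the two methods concrete, here is a small sketch. The titles are hypothetical and exist only for illustration; the names titles and ask_flags are not part of the project.

```python
# Hypothetical titles, used only to illustrate startswith() and lower().
titles = [
    "Ask HN: How do you learn SQL?",
    "ASK HN: Best editor?",
    "Show HN: My side project",
    "Why I ask hn questions",
]

for title in titles:
    # lower() returns a new lowercase string; the original title is unchanged.
    is_ask = title.lower().startswith("ask hn")
    print(title, "->", is_ask)

# The same check collected into a list of booleans.
ask_flags = [title.lower().startswith("ask hn") for title in titles]
```

The last title contains "ask hn" but does not begin with it, so it is not flagged. Without lower(), the second title would be missed because startswith() is case sensitive.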
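In the code cell below, we:
- Create three empty lists called ask_posts, show_posts and other_posts
- Loop through each row in hn
- Assign the title in each row to a variable named title
  - Because the title column is the second column, you'll need to get the element at index 1 in each row
- Implement the following logic (using the lower() and startswith() methods):
  - If the lowercase version of title starts with ask hn, append the row to ask_posts
  - Else if the lowercase version of title starts with show hn, append the row to show_posts
  - Else append the row to other_posts
- Check the number of posts in ask_posts, show_posts, and other_posts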
# Separate posts whose titles begin with Ask HN or Show HN
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('ask_posts: ', len(ask_posts))
print('show_posts: ', len(show_posts))
print('other_posts: ', len(other_posts))
ask_posts: 9139
show_posts: 10158
other_posts: 273822
In the previous step, we separated the ask posts and the show posts into two lists of lists named ask_posts and show_posts.
Next, let's determine if ask posts or show posts receive more comments on average.
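In the code cell below, we:
- Find the total number of comments in ask posts and assign it to total_ask_comments
  - Set total_ask_comments to 0
  - Use a loop to iterate through the ask posts
    - Because the num_comments column is the fifth column in ask_posts, you'll need to get the element at index 4 in each row
    - Convert the value to an integer and add it to total_ask_comments
- Compute the average number of comments on ask posts and assign it to avg_ask_comments
- Print avg_ask_comments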
# Average Ask HN
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
10.393478498741656
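In the code cell below, we:
- Find the total number of comments in show posts and assign it to total_show_comments
  - Set total_show_comments to 0
  - Use a loop to iterate through the show posts
    - Because the num_comments column is the fifth column in show_posts, you'll need to get the element at index 4 in each row
    - Convert the value to an integer and add it to total_show_comments
- Compute the average number of comments on show posts and assign it to avg_show_comments
- Print avg_show_comments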
# Average Show HN
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
4.886099625910612
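The two cells above repeat the same logic for ask_posts and show_posts. One possible way to avoid the repetition is a small helper function. The sketch below is only an illustration: the names average_comments, comments_index and sample_posts are assumptions introduced here, and the sample rows are hypothetical rows shaped like the data set rather than real posts.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the integer values stored at comments_index in each row."""
    if not posts:
        # Avoid dividing by zero when the list is empty.
        return 0.0
    total = 0
    for row in posts:
        total += int(row[comments_index])
    return total / len(posts)


# Hypothetical rows with the same seven columns as the data set.
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example two', '', '3', '4', 'user_b', '9/26/2016 15:10'],
]

print(average_comments(sample_posts))  # (10 + 4) / 2 = 7.0
```

With this helper, the two averages could be computed as average_comments(ask_posts) and average_comments(show_posts).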
On average, ask posts receive approximately 10 comments, whereas show posts receive almost 5 comments. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
We'll tackle the first step: calculating the number of ask posts and comments by hour created. We'll use the datetime module to work with the data in the created_at column.
Recall that we can use the datetime.strptime() constructor to parse dates stored as strings and return datetime objects.
# import datetime module
import datetime as dt
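As a quick illustration of parsing and formatting, the sketch below converts one created_at value (copied from the first row of the data set) into a datetime object and then pulls out the hour. The variable names date_string, parsed and hour_label are chosen for this example only.

```python
import datetime as dt

# A created_at value copied from the first row of the data set.
date_string = "9/26/2016 3:26"

# strptime() parses the string according to the format codes:
# %m month, %d day, %Y four-digit year, %H hour (24-hour clock), %M minute.
parsed = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M")

# strftime() goes the other way and formats the datetime as a string.
hour_label = parsed.strftime("%H")

print(parsed)       # 2016-09-26 03:26:00
print(hour_label)   # 03
print(parsed.hour)  # 3
```

Note that strftime("%H") returns a zero-padded string, while the hour attribute returns an integer.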
In the cell below, we:
- Create an empty list and assign it to result_list. This will be a list of lists
- Iterate over ask_posts and append to result_list a list with two elements:
  - The first element is the value of the created_at column
    - Because the created_at column is the seventh column in ask_posts, you'll need to get the element at index 6 in each row
  - The second element is the number of comments of the post
# Appending columns: created_at and num_comments
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
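In the code cell below, we:
- Create two empty dictionaries called counts_by_hour and comments_by_hour
- Loop through each row of result_list
  - Extract the hour from the date, which is the first element of the row
    - Use the datetime.strptime() method to parse the date and create a datetime object
    - Use the datetime.strftime() method to select just the hour from the datetime object
  - If the hour isn't a key in counts_by_hour:
    - Create the key in counts_by_hour and set it equal to 1
    - Create the key in comments_by_hour and set it equal to the comment number
  - If the hour is already a key in counts_by_hour:
    - Increment the value in counts_by_hour by 1
    - Increment the value in comments_by_hour by the comment number
# Number of ask posts and comments by hour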
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date_string = row[0]
    time = dt.datetime.strptime(date_string, date_format)
    hour = time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
We created two dictionaries:
- counts_by_hour: contains the number of ask posts created during each hour of the day
- comments_by_hour: contains the number of comments that ask posts created at each hour received
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
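In the code below, we:
- Create an empty list called avg_by_hour
- Loop through the keys of comments_by_hour
  - Compute the average number of comments per post for that hour and assign it to avg_num
  - Append a list containing comment (the hour key) and avg_num to avg_by_hour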
# Average number of comments
avg_by_hour = []
for comment in comments_by_hour:
    avg_num = comments_by_hour[comment] / counts_by_hour[comment]
    avg_by_hour.append([comment, avg_num])
avg_by_hour
[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
In the code below, we:
- Create a list that equals avg_by_hour with swapped columns
  - Create an empty list and assign it to swap_avg_by_hour
  - Iterate over the rows of avg_by_hour and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row
- Print swap_avg_by_hour
- Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments
  - Set the reverse argument to True, so that the highest value in the first column appears first in the list
  - Assign the result to sorted_swap
- Loop through the first five rows of sorted_swap and use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post
  - To format the hours, use the datetime.strptime() constructor to return a datetime object and then use the strftime() method to specify the format of the time
  - To format the average, use {:.2f} to indicate that just two decimal places should be used
# Sorting and Printing Values
# Part one
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
# Part two
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
# Part three
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    time_string = row[1]
    time_top = dt.datetime.strptime(time_string, "%H")
    hour_top = time_top.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_top, row[0]))
[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]
Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
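As an aside, sorted() also accepts a key argument, which offers a way to sort by the average without building a swapped copy of the list. This is an alternative to the approach used in this project, not a description of it. The sketch below uses four [hour, average] pairs copied from the avg_by_hour output; the names sample_avg_by_hour, sorted_by_avg and formatted are assumptions made for this example.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
sample_avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# key tells sorted() which value to compare (here the average at index 1),
# and reverse=True puts the largest average first.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

formatted = []
for hour, avg in sorted_by_avg:
    # {:.2f} rounds the average to two decimal places for display.
    formatted.append("{}:00: {:.2f} average comments per post".format(hour, avg))
    print(formatted[-1])
```

Because the hour strings are already zero-padded, appending ":00" gives the same label as the strptime() and strftime() round trip in the cell above.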
The hour that receives the most comments per post on average is 15:00, with an average of 28.68 comments per post. The time zone used is Eastern Time in the US; as a result, we could also write 15:00 as 3:00 pm EST.
Based on our analysis, we recommend posting at 15:00 (3:00 pm EST) in order to have a higher chance of receiving more comments on an Ask HN post.
Furthermore, posts created at 13:00 and 12:00 receive on average 16.32 and 12.38 comments per post respectively, so those hours are good alternatives as well.