Overview:
Hacker News (https://news.ycombinator.com/) is a social news site run by the startup incubator Y Combinator, where users submit stories that the community votes and comments on. In this project, we will look at two specific types of posts on HN: "Ask HN" and "Show HN". Users use the former to ask the HN community a specific question, and the latter to show the community something interesting, such as a project or product.
This project compares these two types of posts and determines which receives more comments on average. It also analyzes the timestamps of these posts to see which time of day is best to post in order to receive comments.
Part 1: Reading the CSV file and understanding the data
import datetime as dt
from csv import reader
opened = open("HN_posts_year_to_Sep_26_2016.csv")
read_file = reader(opened)
data = list(read_file)
headers = data[0]
hn = data[1:]
print(headers,'\n')
for row in hn[:5]:
    print(row, '\n')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
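As a side note, the loading step can also be sketched with a `with` block, which closes the file automatically. The sketch below runs on a small in-memory sample instead of the real CSV; `sample_csv` and `load_rows` are hypothetical names used only for illustration.

```python
import io
from csv import reader

# Hypothetical two-row sample in the same column layout as the dataset.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
    "12578335,Show HN: Finding puns computationally,http://puns.samueltaylor.org/,2,0,saamm,9/26/2016 0:36\n"
)

def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# With a file on disk this would be: with open(path, newline="") as f: ...
with io.StringIO(sample_csv) as f:
    sample_headers, sample_rows = load_rows(f)

print(sample_headers)
print(len(sample_rows))
```

Every value comes back as a string, which is why the later steps convert num_comments with int().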
Part 2:
To isolate only the posts whose titles begin with "Ask HN" or "Show HN", we can use the startswith string method. The cell below demonstrates the approach on a single title before we separate the posts into lists.
eg_1 = hn[42][1]
eg_1
'Ask HN: How do you pass on your work when you die?'
eg_1.startswith("Ask HN")
True
Now, we apply this approach to all the post titles in our dataset.
To do so, we first create three empty lists, one for each category of post. We then loop through each row in the dataset and store the title (index 1 of each row) in a title variable, lowercased with the .lower() method so that the match is case-insensitive. Finally, a conditional checks whether the title starts with 'ask hn' or 'show hn' and appends the row to the matching list; every other row goes into other_posts.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
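The prefix check can also be sketched as a small standalone function, which makes it easy to try on a few titles. `classify_title` and the example titles are hypothetical and only illustrate the logic described above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles: only a prefix match counts, not a match elsewhere.
examples = [
    "Ask HN: How do you pass on your work when you die?",
    "SHOW HN: demo",
    "algorithmic music",
    "Why I never ask hn anything",
]
labels = [classify_title(t) for t in examples]
print(labels)
```

The last title contains "ask hn" but does not start with it, so it is labeled 'other'.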
Next, we create a function called show_rows to double-check that our lists contain the correct posts.
def show_rows(dataset):
    for row in dataset[:5]:
        print(row, '\n')
show_rows(ask_posts)
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']

['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']

['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']

['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']

['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']
show_rows(show_posts)
['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']

['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']

['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']

['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17']

['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']
The titles match the lists they are in, so we now have an idea of the breakdown of the number of posts on HN from our dataset. There are 9,139 ask posts, 10,158 show posts, and 273,822 other posts.
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
9139
10158
273822
Part 3:
In this section, we compare the average number of comments in the ask_posts and show_posts lists. To do this, we create a function, total_comments, that returns the total number of comments in a list. Then, we create another function, avg_comment, that returns the average number of comments per post for a list.
def total_comments(dataset):
    # Use a local name that does not shadow the function itself.
    total = 0
    for row in dataset:
        comments = int(row[4])  # num_comments column
        total += comments
    return total
total_ask_comments = total_comments(ask_posts)
total_show_comments = total_comments(show_posts)
def avg_comment(total_com, dataset):
    return round(total_com / len(dataset), 2)
avg_ask_comments = avg_comment(total_ask_comments, ask_posts)
avg_show_comments = avg_comment(total_show_comments, show_posts)
print(avg_ask_comments)
print(avg_show_comments)
10.39
4.89
From our analysis, we can see that the average "Ask HN" post receives roughly 10.4 comments, while the average "Show HN" post receives about 4.9 comments.
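The two helper functions can also be folded into a single one. The sketch below is one possible variant, not the approach used in this notebook: `average_comments` and `sample_posts` are hypothetical names, and the guard for an empty list is an added assumption to avoid dividing by zero.

```python
def average_comments(dataset, comments_index=4):
    """Average number of comments per row, rounded to two decimals."""
    if not dataset:  # assumption: treat an empty list as an average of 0
        return 0.0
    total = sum(int(row[comments_index]) for row in dataset)
    return round(total / len(dataset), 2)

# Three rows taken from the ask_posts sample shown earlier (7, 3 and 0 comments).
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]
sample_average = average_comments(sample_posts)
print(sample_average)
```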
Part 4:
Now that we know which type of post receives more comments, we can look deeper into the timing of each post. Our objective is to determine whether posts created at a specific time, or range of times, receive more comments. To narrow the scope of our investigation, we will look only at the comments and timestamps in the 'Ask HN' dataset.
The first step for this part is to create a new, empty list called results_list. We then iterate through each row of data in ask_posts and append the time each post was created, along with the number of comments the post received, to results_list.
results_list = []
for row in ask_posts:
    time_create = row[-1]
    num_comment = int(row[4])
    combined = [time_create, num_comment]
    results_list.append(combined)
Below is a sample of the results_list.
show_rows(results_list)
['9/26/2016 2:53', 7]

['9/26/2016 1:17', 3]

['9/25/2016 22:57', 0]

['9/25/2016 22:48', 3]

['9/25/2016 21:50', 2]
We then pass the information from results_list into two dictionaries, counts_by_hour and comments_by_hour. To do this, we iterate through each entry in results_list and create a time variable and a comment_num variable for the respective values in the entry.
Next, we turn the time variable into a datetime object, which allows us to extract the hour of each post. We save this value as hour. Then, we check whether it is already a key in the counts_by_hour dict. If it isn't, we create a key-value pair for it with a value of 1. If it is, we add one to the existing value.
We use a similar process for the comments_by_hour dict. We check whether hour is already a key in the dict, and if it is not, we set that key equal to the comment_num variable. If it is already a key, we add comment_num to the existing value.
counts_by_hour = {}
comments_by_hour = {}
for row in results_list:
    time = row[0]
    comment_num = row[1]
    time_obj = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = time_obj.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
    else:
        counts_by_hour[hour] += 1
    if hour not in comments_by_hour:
        comments_by_hour[hour] = comment_num
    else:
        comments_by_hour[hour] += comment_num
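The "create the key if it is missing, otherwise add to it" pattern can also be written with collections.defaultdict from the standard library, which supplies a starting value of 0 for any new key. This is an alternative sketch rather than the method used above; the names `sample_results`, `posts_per_hour` and `comments_per_hour` are hypothetical, and the second sample entry is invented so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

# First and third entries come from results_list above; the second is made up.
sample_results = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 2:10", 3],
    ["9/25/2016 22:57", 0],
]

posts_per_hour = defaultdict(int)     # missing keys start at 0
comments_per_hour = defaultdict(int)

for created_at, n_comments in sample_results:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")      # zero-padded hour string, e.g. '02'
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += n_comments

print(dict(posts_per_hour))
print(dict(comments_per_hour))
```

Both versions produce the same dictionaries; the defaultdict form simply removes the explicit membership checks.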
Now we are ready to calculate the average number of comments per post for each hour. To do this, we create a new, empty list called avg_by_hour. We then loop through each item in counts_by_hour, where k is the hour and v is the number of posts created during that hour.
We use k to look up the total number of comments for that hour in the comments_by_hour dict and divide that total by v. This division (the total number of comments for the hour / the number of posts in that hour) gives us the average number of comments per post, which we append to avg_by_hour together with the hour.
avg_by_hour = []
for k, v in counts_by_hour.items():
    avg_by_hour.append([k, round(comments_by_hour[k] / v, 2)])
Below is a sample of the avg_by_hour list. The first value of each row is the hour, and the second value is the average number of comments per post for that hour.
show_rows(avg_by_hour)
['02', 11.14]

['01', 7.41]

['22', 8.8]

['21', 8.69]

['19', 7.16]
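To make the division concrete, here is the same calculation on two tiny dictionaries, written as a list comprehension. The dictionaries and their values are hypothetical and are not taken from the dataset.

```python
# Hypothetical totals: hour '02' has 2 posts with 10 comments in total,
# hour '22' has 1 post with no comments.
sample_counts = {"02": 2, "22": 1}
sample_comments = {"02": 10, "22": 0}

sample_avg = [
    [hour, round(sample_comments[hour] / n_posts, 2)]
    for hour, n_posts in sample_counts.items()
]
print(sample_avg)
```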
We then sort avg_by_hour in descending order of average comments per post, and format the top five values for display.
avg_by_hour.sort(key=lambda row: row[1], reverse=True)
print("Top 5 Hours for 'Ask HN' Comments:")
for row in avg_by_hour[:5]:
    hour = row[0]
    time_obj = dt.datetime.strptime(hour, "%H")
    hour = time_obj.strftime("%H:%M")
    avg_value = row[1]  # named so it does not shadow the avg_comment function
    avg_comment_str = f"{avg_value} average comments per post"
    print(f"{hour} : {avg_comment_str}")
Top 5 Hours for 'Ask HN' Comments:
15:00 : 28.68 average comments per post
13:00 : 16.32 average comments per post
12:00 : 12.38 average comments per post
02:00 : 11.14 average comments per post
10:00 : 10.68 average comments per post
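If the original order of the list needs to be preserved, the built-in sorted() returns a new sorted list instead of sorting in place. The sketch below applies it to the five sample rows printed earlier, so it ranks only those five hours and not the full dataset; `sample_avg_by_hour` and `top_two` are hypothetical names.

```python
# The five sample rows printed from avg_by_hour before sorting.
sample_avg_by_hour = [["02", 11.14], ["01", 7.41], ["22", 8.8], ["21", 8.69], ["19", 7.16]]

# sorted() leaves sample_avg_by_hour untouched and returns a new list.
top_two = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)[:2]

for hour, avg in top_two:
    print(f"{hour}:00 : {avg:.2f} average comments per post")
```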
Posts created during the 15:00 hour receive the most comments on average, so this is the best time to post a question on HN. The second-best hour is 13:00, and the third-best is 12:00. Since the top three hours all fall between 12:00 and 16:00 (in the time zone of the dataset's created_at column), the afternoon appears to be the best period to ask a question.
We can also observe that the 15:00 hour is far ahead of the other hours. It averages roughly 76% more comments per post than the second-best hour, 13:00 (28.68 versus 16.32).
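The relative gap between the top two hours can be checked directly from the averages printed above. The variable names below are hypothetical; the two values are copied from the output.

```python
top_avg = 28.68     # 15:00 average comments per post
second_avg = 16.32  # 13:00 average comments per post

# Percentage by which the top hour exceeds the second-best hour.
pct_more = (top_avg - second_avg) / second_avg * 100
print(f"{pct_more:.1f}% more comments per post")
```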
Conclusion:
This project analyzed roughly 293,000 rows of data on posts submitted to Hacker News (HN). Specifically, it looked at two types of posts on the website, 'Ask HN' and 'Show HN', which together accounted for roughly 19,000 of the rows in our dataset.
We determined that 'Ask HN' posts receive on average about 5.5 more comments per post than 'Show HN' posts (10.39 versus 4.89). We also identified the best hour to post an 'Ask HN' question in order to receive comments: 15:00.