In this project, we're going to work with a data set of submissions posted on Hacker News, a popular technology website. The site is hosted by startup incubator Y Combinator, and users submit posts that other readers can vote and comment on, much like Reddit.
Hacker News is extremely popular in technology and startup circles, and posts that attract many votes and comments can draw hundreds of thousands of visits as a result.
The data set is available online and can be found and downloaded here. It is a reduced version of the original dataset, which contains almost 300k rows; ours contains approximately 20k rows, since posts that did not receive any comments were removed.
We're specifically interested in 'Ask Posts', which are posts where the user asks the community a question and looks for feedback; their titles begin with 'Ask HN:'.
On the other hand, 'Show Posts' are posts where the user showcases a project or something interesting to the community, which then provides feedback; their titles begin with 'Show HN:'.
At the end of this project, we'll come up with conclusions such as which category receives more comments on average, which receives more points, and whether the time a post is submitted affects the feedback it gets.
from csv import reader # import reader module
opened_file = open('hacker_news.csv') # open the csv file 'hacker_news.csv'
read_file = reader(opened_file) # use reader function to load the opened_file
hn = list(read_file) # assign the read_file to a variable 'hn' in the form of a list
hn[:5] # display the first 5 rows of the dataset
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
In order to work with the data itself, we need to separate the header row from the rest of the dataset, as below:
headers = hn[0] # Headers at index 0 set to 'headers'
hn = hn[1:] # First row removed
print(headers) # Output 'headers'
print(hn[:5]) # Output first five rows
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Now that the header row is removed, we can start filtering our data, a process known as data cleaning.
In order to make proper use of our data, we will separate the Ask Posts from the Show Posts. Posts that fall under neither category will be listed as Other Posts.
In the next block of code, three lists are created: ask_posts, show_posts, and other_posts. To sort the posts into these lists, we use the startswith() string method to check whether each title, lowercased first, begins with 'ask hn' or 'show hn'.
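Note that startswith() is case-sensitive, which is why each title is lowercased before the check. A quick illustration, using a made-up title:

```python
# startswith() is case-sensitive, so 'Ask HN' does not match 'ask hn' directly
title = 'Ask HN: How do you learn a new language?'  # hypothetical title

print(title.startswith('ask hn'))          # False: case mismatch
print(title.lower().startswith('ask hn'))  # True: matches after lowercasing
```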
The number of posts under each category is listed below.
# Creating 3 empty lists
ask_posts = []
show_posts = []
other_posts = []
# Iterating through the dataset (headers removed); the title is at index 1:
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Output for each category
print('Nr of ask posts is', len(ask_posts))
print('Nr of show posts is', len(show_posts))
print('Nr of other posts is', len(other_posts))
Nr of ask posts is 1744 Nr of show posts is 1162 Nr of other posts is 17194
# Displaying the first 5 rows of each list
# (only the last expression in a cell is echoed, here show_posts):
ask_posts[:5]
show_posts[:5]
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
Next, let's determine which category receives more comments. To do this, we're going to compute the average number of comments for each category: a for loop per category to total the comments, then the averages to compare.
# Finding total nr of comments in ask posts:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print('average nr of ask posts comments is', avg_ask_comments)
# Finding total nr of comments in show posts:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print('average nr of show posts comments is', avg_show_comments)
average nr of ask posts comments is 14.038417431192661 average nr of show posts comments is 10.31669535283993
The results above show that, on average, Ask Posts receive more comments (about 14 per post) than Show Posts (about 10 per post).
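As an aside, the two loops above share the same shape, so they could be factored into a small helper. A minimal sketch; avg_comments and the sample rows below are our own, not part of the project code:

```python
def avg_comments(posts, comments_index=4):
    """Average number of comments for a list of post rows (sketch)."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows with the same column layout as the dataset
sample = [['1', 'title a', 'url', '5', '10', 'user1', '1/1/2016 0:00'],
          ['2', 'title b', 'url', '3', '20', 'user2', '1/1/2016 1:00']]
print(avg_comments(sample))  # (10 + 20) / 2 = 15.0
```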
The next interesting question is whether the time at which a post is submitted has any effect on the amount of feedback it receives, be it comments or upvotes.
In this part, we use the datetime module to parse the created_at column in our dataset, which has index 6. Furthermore, two dictionaries are created, one for the comment totals and one for the post counts, keyed by the hour in 24-hour format.
# Importing datetime module as dt
import datetime as dt
result_list = []
for posts in ask_posts:
    created = posts[6]
    comments = int(posts[4])
    result_list.append([created, comments])
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'
for row in result_list:
    date = row[0]
    comment = row[1]
    date_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if date_obj not in counts_by_hour:
        counts_by_hour[date_obj] = 1
        comments_by_hour[date_obj] = comment
    else:
        counts_by_hour[date_obj] += 1
        comments_by_hour[date_obj] += comment
print('the nr of comments on ask posts by the hour are:')
comments_by_hour
the nr of comments on ask posts by the hour are:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
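As an aside, the two-dictionary tally above can also be written with dict.get(), which returns a default value when a key is missing and removes the need for the if/else branch. A sketch over made-up (hour, comments) pairs:

```python
sketch_counts = {}
sketch_comments = {}
pairs = [('09', 5), ('15', 40), ('09', 7)]  # hypothetical (hour, comments) pairs
for hour, comments in pairs:
    # get(key, 0) returns 0 the first time an hour is seen
    sketch_counts[hour] = sketch_counts.get(hour, 0) + 1
    sketch_comments[hour] = sketch_comments.get(hour, 0) + comments
print(sketch_counts)    # {'09': 2, '15': 1}
print(sketch_comments)  # {'09': 12, '15': 40}
```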
Result: from the output above, one can immediately see that Ask Posts submitted in the afternoon, specifically between 13:00 and 19:00, generate the most comments, peaking at 15:00.
On the contrary, posts submitted during the night (22:00 - 07:00) generate the fewest comments. However, there is an anomaly at 02:00, which generated a lot of interest.
This result can also be shown by calculating the average number of comments per post for each hour:
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
swap_avg_by_hour = []
for rows in avg_by_hour:
    a = rows[1]
    b = rows[0]
    swap_avg_by_hour.append([a, b])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 hours for Ask Posts Comments')
for avg, hr in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(
        dt.datetime.strptime(hr, '%H').strftime('%H:%M'), avg))
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']] Top 5 hours for Ask Posts Comments 15:00: 38.59 average comments per post 02:00: 23.81 average comments per post 20:00: 21.52 average comments per post 16:00: 16.80 average comments per post 21:00: 16.01 average comments per post
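As an aside, swapping the columns is one way to sort by average; the same top list can be produced without the swap step by passing a key function to sorted(). A sketch over a few of the averages above (values rounded, sketch_avgs is our own name):

```python
sketch_avgs = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]
# Sort by the second element of each pair (the average), highest first
top = sorted(sketch_avgs, key=lambda pair: pair[1], reverse=True)
for hour, avg in top[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```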
It would also be interesting to determine which type of post gets upvoted the most. In this next section, the average number of points (the num_points column) for Ask Posts and Show Posts is computed separately, with the output listed underneath each section of code.
# Finding average nr of points for Ask Posts (num_points is at index 3)
total_ask_counts = 0
for row in ask_posts:
    counts = int(row[3])
    total_ask_counts += counts
avg_ask_counts = total_ask_counts / len(ask_posts)
print('The avg number of counts for Ask Posts is ', avg_ask_counts)
The avg number of counts for Ask Posts is 15.061926605504587
# Finding average nr of points for Show Posts (num_points is at index 3):
total_show_counts = 0
for row in show_posts:
    show_counts = int(row[3])
    total_show_counts += show_counts
avg_show_counts = total_show_counts / len(show_posts)
print('The avg number of counts for Show Posts is ', avg_show_counts)
The avg number of counts for Show Posts is 27.555077452667813
From the results above, it is clear that Show Posts receive more points than Ask Posts (about 28 vs. 15). This is understandable, since people tend to show appreciation for others' work by upvoting it more readily than they upvote a simple question (an Ask Post).
On the other hand, it is also understandable that Ask Posts get more comments than Show Posts, since their users are explicitly looking for feedback in the form of an answer.
A post submitted at a certain time might get more attention or upvotes than one submitted at a different time. For example, we already saw that posts submitted at 15:00 receive more comments than posts submitted at any other time of day.
In this section, we will see whether time has any effect on the upvotes a certain post may get.
We will do this by repeating the hourly analysis, this time tallying points instead of comments.
import datetime as dt
ask_list_counts_vs_time = []
# Checking for upvoting vs. time for Ask Posts (num_points is at index 3)
for posts in ask_posts:
    created = posts[6]
    points = int(posts[3])
    ask_list_counts_vs_time.append([created, points])
ask_counts_by_hour = {}
ask_points_by_hour = {}  # total points per hour
date_format = '%m/%d/%Y %H:%M'
for row in ask_list_counts_vs_time:
    date = row[0]
    points = row[1]
    created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if created_obj not in ask_counts_by_hour:
        ask_counts_by_hour[created_obj] = 1
        ask_points_by_hour[created_obj] = points
    else:
        ask_counts_by_hour[created_obj] += 1
        ask_points_by_hour[created_obj] += points
print('the nr of counts on ask posts by the hour are:')
ask_counts_by_hour
the nr of counts on ask posts by the hour are:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
import datetime as dt
show_list_counts_vs_time = []
# Checking for upvoting vs. time for Show Posts (num_points is at index 3)
for post in show_posts:
    created = post[6]
    points = int(post[3])
    show_list_counts_vs_time.append([created, points])
show_counts_by_hour = {}
show_points_by_hour = {}  # total points per hour
date_format = '%m/%d/%Y %H:%M'
for row in show_list_counts_vs_time:
    date = row[0]
    points = row[1]
    show_created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if show_created_obj not in show_counts_by_hour:
        show_counts_by_hour[show_created_obj] = 1
        show_points_by_hour[show_created_obj] = points
    else:
        show_counts_by_hour[show_created_obj] += 1
        show_points_by_hour[show_created_obj] += points
print('the nr of counts on show posts by the hour are:')
show_counts_by_hour
the nr of counts on show posts by the hour are:
{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}
From the results above, we can conclude that both categories are created most heavily in the afternoon and evening. In particular, Ask Posts cluster between 13:00 and 21:00, a busier window than Show Posts see at any time of day.
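Note that the dictionaries above hold per-hour totals rather than averages. To mirror the earlier comment analysis, one further step would divide each hour's total points by the number of posts created that hour. A minimal sketch, over hypothetical dictionaries with the same shape as those built above:

```python
# Hypothetical per-hour post counts and point totals (same shape as above)
sample_counts = {'15': 116, '13': 85, '02': 58}
sample_points = {'15': 3479, '13': 2062, '02': 793}

# Average points per post for each hour, highest first
avg_points = sorted(
    ([hr, sample_points[hr] / sample_counts[hr]] for hr in sample_counts),
    key=lambda pair: pair[1], reverse=True)
for hr, avg in avg_points:
    print('{}:00: {:.2f} average points per post'.format(hr, avg))
```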