In this project, we'll work with a data set of submissions to the popular technology site Hacker News.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
Our analysis focuses on two major types of posts: Ask HN and Show HN.
Users submit Ask HN posts to ask the Hacker News community a specific question.
Users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.
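We'll compare these two types of posts to determine which type receives more comments on average, and whether ask posts created at a certain time of day receive more comments on average.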
Let's start by importing the libraries we need, reading the data set into a list of lists, and displaying the first five rows.
from csv import reader
# The Hacker News data set
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
# Display the first five rows
hn[:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
About the columns: this data set has seven columns, namely id, title, url, num_points, num_comments, author, and created_at.
Extract the first row of data, assign it to the variable headers, and then remove it from hn.
headers = hn[0]
headers
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
hn = hn[1:]
hn[:5]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
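The reading and header-splitting steps above leave the file handle open. As a minimal sketch, the same steps could be wrapped in one helper that closes the file automatically. The name `read_rows` and the `utf-8` encoding are assumptions for illustration and are not part of the original notebook.

```python
from csv import reader

def read_rows(path, encoding="utf-8"):
    """Return (header, rows) from a CSV file. Hypothetical helper."""
    # newline="" is the setting the csv module documentation recommends
    # when passing a file object to csv.reader.
    with open(path, newline="", encoding=encoding) as f:
        rows = list(reader(f))
    # The first row holds the column names; the rest are data rows.
    return rows[0], rows[1:]
```

With this helper, `headers, hn = read_rows('hacker_news.csv')` would give the same two variables as the cells above.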
Next, we loop through hn and check whether each title starts with Ask HN or Show HN (ignoring case). Matching posts are appended to ask_posts or show_posts respectively, and all remaining posts go into other_posts.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744 1162 17194
From the result above, there are more Ask HN posts (1,744) than Show HN posts (1,162).
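The prefix check used above is different from checking whether the phrase appears anywhere in the title. The sketch below illustrates this with a hypothetical helper, `classify_post`, and a few titles, most of them made up for the example.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: How do you learn a new language?",  # made-up title
    "Show HN: My weekend project",               # made-up title
    "Interactive Dynamic Video",                 # title from the data set preview
    "Why I ask HN for advice",                   # made-up: contains, but does not start with, 'ask hn'
]
for t in sample_titles:
    print(classify_post(t), "-", t)
```

The last title is classified as `other` because `startswith` only matches at the beginning of the string.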
Total and average number of comments on Ask HN posts
total_ask_comment = 0
for row in ask_posts:
    asks_comment = int(row[4])
    total_ask_comment += asks_comment
avg_ask_comments = total_ask_comment/len(ask_posts)
Total and average number of comments on Show HN posts
total_show_comment = 0
for row in show_posts:
    show_comment = int(row[4])
    total_show_comment += show_comment
avg_show_comments = total_show_comment/len(show_posts)
print(f"Average comments per Ask HN post: {avg_ask_comments}")
print(f"Average comments per Show HN post: {avg_show_comments}")
Average comments per Ask HN post: 14.038417431192661
Average comments per Show HN post: 10.31669535283993
Ask HN posts received more comments on average than Show HN posts (about 14.04 versus 10.32 per post).
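The two averaging loops above follow the same pattern, so they could be folded into one function. This is a sketch only: the name `average_comments` and the empty-list guard are assumptions added for illustration. The sample rows are the first two rows of the data set preview, which are not Ask HN or Show HN posts and serve only to exercise the function.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# First two rows of the data set preview (52 and 10 comments)
sample = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
]
print(average_comments(sample))  # (52 + 10) / 2 = 31.0
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it would reproduce the two averages computed above.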
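Next, to determine whether ask posts created at a certain time are more likely to attract comments, we count the ask posts created in each hour of the day along with the comments they received, and then calculate the average number of comments per post for each hour.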
import datetime as dt
# Pair each ask post's creation time with its comment count
resultlist = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    resultlist.append([created_at, num_comments])
counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total comments on ask posts created in each hour
date_format = '%m/%d/%Y %H:%M'
for row in resultlist:
    date = row[0]
    comment = row[1]
    date_dt = dt.datetime.strptime(date, date_format)
    hour_dt = date_dt.strftime("%H")
    if hour_dt not in counts_by_hour:
        counts_by_hour[hour_dt] = 1
        comments_by_hour[hour_dt] = comment
    else:
        counts_by_hour[hour_dt] += 1
        comments_by_hour[hour_dt] += comment
print(counts_by_hour)
print(comments_by_hour)
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
avg_by_hour = []
for hour in comments_by_hour:
    comments = comments_by_hour[hour]
    counts = counts_by_hour[hour]
    avg_by_hour.append([hour, comments/counts])
print(avg_by_hour)
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
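The counting and averaging steps above can also be sketched with `collections.defaultdict`, which removes the need for the `if ... not in` branch. The function name `average_comments_by_hour` is an assumption for illustration. The sample pairs reuse timestamps and comment counts from the data set preview, plus one made-up row so that one hour has two posts.

```python
import datetime as dt
from collections import defaultdict

def average_comments_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Map each two-digit hour string to its average comments per post."""
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # total comments for each hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

sample = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 22:20", 1],
    ["6/17/2016 0:01", 1],
    ["8/5/2016 11:10", 8],   # made-up row, so hour 11 has two posts
]
print(average_comments_by_hour(sample))
```

Here hour `11` averages (52 + 8) / 2 = 30.0, and every other hour has a single post.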
Next, we sort avg_by_hour in descending order of average comments and print the five hours of the day with the highest average number of comments per post.
swap_avg_hour = []
for row in avg_by_hour:
    swap_avg_hour.append([row[1], row[0]])
print(swap_avg_hour)
sorted_swap = sorted(swap_avg_hour, reverse=True)
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
print("Top 5 Hours for Ask Posts Comments")
Top 5 Hours for Ask Posts Comments
for row in sorted_swap[0:5]:
    date = dt.datetime.strptime(row[1], '%H')
    time = date.strftime("%H:%M")
    print(f"Ask post at {time}: {row[0]} average comments per post")
Ask post at 15:00: 38.5948275862069 average comments per post
Ask post at 02:00: 23.810344827586206 average comments per post
Ask post at 20:00: 21.525 average comments per post
Ask post at 16:00: 16.796296296296298 average comments per post
Ask post at 21:00: 16.009174311926607 average comments per post
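Swapping the columns before sorting works, but `sorted` can also order the original `[hour, average]` pairs directly through its `key` argument. The sketch below assumes a hypothetical helper named `top_hours`; the sample pairs are copied from the avg_by_hour output above.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest average."""
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]

# A few [hour, average] pairs copied from the avg_by_hour output
sample = [
    ['09', 5.5777777777777775],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['20', 21.525],
    ['02', 23.810344827586206],
]
for hour, avg in top_hours(sample):
    # Round to two decimals for display only
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One difference worth noting: sorting the swapped lists breaks ties on the hour string, while a key-based sort keeps tied entries in their original order.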
To receive more comments, an Ask HN post should be made around 15:00, the hour with the highest average number of comments per post (about 38.59).
The average at 15:00 is approximately 62% higher than the average at 02:00, the hour with the second-highest average (about 23.81).
Users submit more Ask HN posts than Show HN posts on Hacker News.
To get more comments, users should aim to post at either 15:00 or 02:00, since ask posts created in these hours received more comments on average than those created in any other hour.
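The relative gap between the two highest hourly averages can be checked with a short calculation. The two values below are copied from the printed results; the variable names are assumptions for illustration.

```python
highest = 38.5948275862069    # average comments per ask post at 15:00
second = 23.810344827586206   # average comments per ask post at 02:00

# Percentage increase of the highest average over the second highest
pct_increase = (highest - second) / second * 100
print(f"{pct_increase:.1f}%")
```

This works out to roughly 62%.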