In [1]:
"""
Analysing Hacker News Posts
In this project, two post types from Hacker News are analysed:
Ask HN and Show HN.

Users submit Ask HN posts to ask the community a question, while
Show HN posts share a project, product, or something else of interest.

The aim of this project is to determine:
1. Which type of post receives more comments on average.
2. Whether posts created at a certain time receive more comments on average.

Note that the data explored is reduced from 300,000 rows to about 20,000 by removing
all submissions that did not receive any comments.
"""
In [2]:
'''
First, read in the data and display the first few rows.
'''

#Reading the file and displaying the first five rows (header row included)
import csv

#Use a context manager so the file is closed once it has been read
with open('hacker_news.csv') as f:
    hn = list(csv.reader(f))
hn[:5]
Out[2]:
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
In [3]:
'''
Removing headers from a list of lists
'''

#Removing the headers
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
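As an alternative to slicing the list, the header can be pulled off the reader before the remaining rows are collected. The sketch below assumes the same column layout as above and uses an in-memory string (`sample_csv`, a hypothetical stand-in for `hacker_news.csv`) so that it runs without the file.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

reader = csv.reader(io.StringIO(sample_csv))
sample_headers = next(reader)   # the first row is the header
sample_rows = list(reader)      # the remaining rows are the data

print(sample_headers)
print(len(sample_rows))
```

Either approach leaves the header and the data in separate variables; `next(reader)` simply avoids building the combined list first.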
In [4]:
'''
As seen above, the data set contains the title of each post, the number
of comments for each post and the date the post was created. Next, the number
of comments for each type of post will be explored.
'''
In [5]:
'''
Posts that begin with either 'Ask HN' or 'Show HN' will 
be identified and the data for those two types of posts 
separated into different lists.
'''

#Extracting the posts

#First identifying posts that begin with either Ask HN or Show HN

ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
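The classification rule in the loop above can be isolated into a small helper, which makes the case-insensitive prefix check easy to test on its own. This is only a sketch: `classify_title` is a hypothetical name that does not appear in the notebook, and the example titles are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles for illustration only.
examples = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]
labels = [classify_title(t) for t in examples]
print(labels)
```

Lowercasing the title before the check means that variants such as "ASK HN" or "show hn" land in the same bucket as the usual capitalisation.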
In [6]:
'''
Calculating the average number of comments for Ask HN and Show HN posts

Now that the ask posts and show posts have been separated into different
lists, the average number of comments each type of post receives will be calculated.
'''

#Calculate the average number of comments 'Ask HN' posts receive

total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
14.038417431192661
In [7]:
#Calculate the average number of comments 'Show HN' posts receive

total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
10.31669535283993
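Since the two averaging loops differ only in the list they iterate over, they could be folded into one helper. A minimal sketch, assuming each row keeps `num_comments` as a string at index 4 as in the data above; `average_comments` and the sample rows are illustrative and not part of the notebook.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments over a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Illustrative rows: only index 4 (num_comments) matters here.
sample_posts = [
    ["1", "Ask HN: A", "", "5", "10", "a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "3", "20", "b", "8/4/2016 12:10"],
]
print(average_comments(sample_posts))
```

The guard for an empty list avoids a `ZeroDivisionError` if a post type happens to have no rows.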
In [8]:
'''
From the analysis so far, ask posts receive approximately 14 comments on average, while
show posts receive approximately 10. The focus of this project will now shift solely to ask
posts, since they receive more comments on average.
'''
In [10]:
'''
Finding the number of Ask posts and comments by hour created.

The next step is to determine whether the number of comments an ask post receives depends on the time it is created.
First, we'll find the number of ask posts created during each hour of the day, along with the number of comments those posts received.
Then, we'll calculate the average number of comments received by ask posts created at each hour of the day.
'''
In [11]:
#Calculate the number of ask posts created during each hour of the day and the number of comments received.

import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1
        
comments_by_hour
Out[11]:
{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
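The tallying loop above can also be written with `collections.defaultdict`, which removes the `if`/`else` branch for hours seen for the first time. The sketch below assumes the same `created_at` format and uses made-up `[created_at, num_comments]` pairs; the `sample_*` names are illustrative.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Made-up [created_at, num_comments] pairs for illustration.
sample_results = [
    ["8/4/2016 15:52", 6],
    ["1/26/2016 15:30", 4],
    ["6/17/2016 0:01", 1],
]

sample_counts = defaultdict(int)    # posts created in each hour
sample_comments = defaultdict(int)  # comments received by those posts

for created_at, n_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "15".
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

A missing key in a `defaultdict(int)` starts at 0, so both tallies can be updated unconditionally.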
In [12]:
'''
Calculating the average number of comments for Ask HN posts by the hour
'''

#Calculate the average number of comments `Ask HN` posts created at each hour of the day receive.

avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])

avg_by_hour
Out[12]:
[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
In [20]:
#Sorting and printing values from a list of lists by swapping values in 'avg_by_hour'

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
Out[20]:
[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
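Swapping the columns is one way to sort by the average. `sorted` also accepts a `key` function, which sorts on the second column while leaving each `[hour, average]` row in its original layout. A sketch using three of the rows from the output above, with the averages rounded for brevity:

```python
# Three [hour, average] rows taken from avg_by_hour, rounded for brevity.
sample_avg_by_hour = [["09", 5.58], ["15", 38.59], ["02", 23.81]]

# Sort on the average (index 1), highest first, without swapping columns.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

print(sorted_by_avg)
```

One difference to be aware of: with the swapped list, ties in the average are broken by the hour string, whereas the `key` version keeps tied rows in their original order.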
In [21]:
# Print the 5 hours with the highest average number of comments, using the sorted list.

print("Top 5 Hours for 'Ask HN' comments")

for avg, hr in sorted_swap[:5]:
    print(
          "{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg)
    )
Top 5 Hours for 'Ask HN' comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
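The gap between the top two hours in the list above can be expressed as a relative increase. A quick check, using the two averages from the sorted output:

```python
top_avg = 38.5948275862069         # 15:00, from the sorted output
second_avg = 23.810344827586206    # 02:00, from the sorted output

# Relative increase of the highest average over the second highest.
relative_increase = (top_avg - second_avg) / second_avg
print("{:.0%}".format(relative_increase))
```

This compares averages only; it says nothing about how many posts contribute to each hour.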
In [23]:
'''
Ask HN posts created at 15:00 receive the most comments, with an average of 38.59 comments per post.
This suggests that 15:00 is the best time to create an Ask HN post in order to attract a high number of comments.
It may also indicate that this is the busiest hour on Hacker News, although the data does not measure traffic directly.
The highest hourly average is about 62% higher than the second highest (23.81 comments per post at 02:00).

The timezone according to the dataset is Eastern Time in the US.
'''
In [24]:
'''
In this project, ask posts and show posts were analyzed to determine which type of post and which creation time
receive the most comments on average. Based on the analysis, to maximize the number of comments
a post receives, I recommend that it be created as an ask post between 15:00 and 16:00.

However, it should be noted that the dataset analyzed excluded posts without any comments. Given that,
it's more accurate to say that out of the posts that received comments, ask posts received more comments
on average and the ask posts created between 15:00 and 16:00 received the most comments on average.
'''