
Exploring Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

Our analysis focuses on two major types of posts: Ask HN and Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question.

Users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need, reading the data set into a list of lists, and displaying the first five rows (the header row plus four posts).

In [1]:

from csv import reader

# Read the Hacker News data set into a list of lists.
# The with-block closes the file once it has been read.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Display the first five rows (the header row is still included here)
hn[:5]

Out[1]:

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

About the columns: the data set has seven columns, listed below.

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted

Extracting the first row of data and assigning it to the variable headers.

In [2]:

headers = hn[0]
headers

Out[2]:

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Removing the header row from hn, then displaying the first five rows to verify that it was removed properly.

In [3]:

hn = hn[1:]
hn[:5] 

Out[3]:

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
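The read-and-split steps above can also be sketched as one small helper. This is only an illustration, not the notebook's own code: the `read_rows` function and the in-memory `sample_csv` string are assumptions introduced here, and the two sample rows are copied from the output above so the sketch runs without `hacker_news.csv`.

```python
import csv
import io

# Two rows copied from the output above, in the same column layout as the data set.
# sample_csv and read_rows are illustrative names, not part of the original notebook.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)


def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# io.StringIO stands in for open('hacker_news.csv') in this sketch.
with io.StringIO(sample_csv) as f:
    sample_header, sample_rows = read_rows(f)

print(sample_header)
print(len(sample_rows))
```

Returning the header and the data rows together avoids reassigning `hn` in a separate cell, which matters in a notebook: re-running a cell like `hn = hn[1:]` would silently drop one more row each time.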

Next, we separate posts whose titles begin with Ask HN or Show HN (in any letter case) into two different lists.

We loop through the list of lists hn and check whether each lowercased title starts with ask hn or show hn. Matching posts are appended to ask_posts or show_posts respectively; all remaining posts go to other_posts.

In [4]:

ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [5]:

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194

From the result above, there are more Ask HN posts (1,744) than Show HN posts (1,162).
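The prefix check used for the split can be isolated in a small function, which makes the case handling easy to test. This is a minimal sketch: `classify_title` and the example titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles chosen to exercise each branch.
example_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',  # contains "ask HN" but does not start with it
]

for example in example_titles:
    print(classify_title(example), '-', example)
```

The last title shows why `startswith` is used rather than a substring test: a title that merely mentions "ask HN" is not an Ask HN post.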

Next, let's determine if ask posts or show posts receive more comments on average.

Total and average number of comments on Ask HN posts

In [6]:

total_ask_comment = 0
for row in ask_posts:
    asks_comment = int(row[4])
    total_ask_comment += asks_comment
avg_ask_comments = total_ask_comment/len(ask_posts)

Total and average number of comments on Show HN posts

In [7]:

total_show_comment = 0
for row in show_posts:
    show_comment = int(row[4])
    total_show_comment += show_comment
avg_show_comments = total_show_comment/len(show_posts)

In [8]:

print(f"Average number of comments per Ask HN post: {avg_ask_comments}")
print(f"Average number of comments per Show HN post: {avg_show_comments}")

Average number of comments per Ask HN post: 14.038417431192661
Average number of comments per Show HN post: 10.31669535283993

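Ask HN posts received more comments on average (about 14) than Show HN posts (about 10). Since Ask HN posts attract more comments, the rest of the analysis focuses on them.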
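The two averaging loops above differ only in the list they read, so the calculation can be sketched once as a helper. The `average_comments` function is an illustrative name, not part of the original notebook, and the sample rows are copied from the first rows of the data set shown earlier (they are not Ask HN posts).

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows copied from the output shown earlier; num_comments is at index 4.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

print(average_comments(sample_posts))
```

The empty-list guard is a design choice for this sketch: it avoids a `ZeroDivisionError` if a category happens to contain no posts.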

Calculating the number of Ask HN posts created per hour, along with the total number of comments

Next, to determine whether Ask HN posts created at a certain time are more likely to attract comments, we take the following steps:

Calculate the number of Ask HN posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments Ask HN posts receive by hour created.

In [9]:

import datetime as dt
resultlist = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    resultlist.append([created_at, num_comments])

    
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'
for row in resultlist:
    date = row[0]
    comment = row[1]
    date_dt = dt.datetime.strptime(date, date_format)
    hour_dt = date_dt.strftime("%H")
    if hour_dt not in counts_by_hour:
        counts_by_hour[hour_dt] = 1
        comments_by_hour[hour_dt] = comment
    else:
        counts_by_hour[hour_dt] += 1
        comments_by_hour[hour_dt] += comment

print(counts_by_hour) 
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}

In [10]:

avg_by_hour = []
for hour in comments_by_hour:
    comments = comments_by_hour[hour]
    counts = counts_by_hour[hour]
    avg_by_hour.append([hour, comments/counts])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
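The counting and averaging steps above can be combined in one function. The sketch below swaps the explicit `if hour not in ...` check for `collections.defaultdict`, which is a different technique from the notebook's own code. The `average_by_hour` name and the sample pairs are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'


def average_by_hour(pairs):
    """Map each hour string ('00'-'23') to the mean comment count for that hour."""
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Hypothetical [created_at, num_comments] pairs in the data set's date format.
sample_pairs = [
    ['8/4/2016 11:52', 52],
    ['6/23/2016 11:05', 8],
    ['1/26/2016 19:30', 10],
]

print(average_by_hour(sample_pairs))
```

With `defaultdict(int)`, a missing hour starts at 0 automatically, so the first-seen and already-seen cases share a single code path.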

Sorting the list of lists (avg_by_hour) in descending order and getting the five hours of the day with the highest average number of comments per post

In [11]:

swap_avg_hour = []
for row in avg_by_hour:
    swap_avg_hour.append([row[1], row[0]])
print(swap_avg_hour)    
sorted_swap = sorted(swap_avg_hour, reverse = True)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

In [12]:

print("Top 5 Hours for Ask Posts Comments")

Top 5 Hours for Ask Posts Comments

In [13]:

for row in sorted_swap[0:5]:
    # row is [average, hour]; these are Ask HN posts, so label them as such
    date = dt.datetime.strptime(row[1],'%H')
    time = date.strftime("%H:%M")
    print(f"Ask post at {time}: {row[0]} average comments per post")

Ask post at 15:00: 38.5948275862069 average comments per post
Ask post at 02:00: 23.810344827586206 average comments per post
Ask post at 20:00: 21.525 average comments per post
Ask post at 16:00: 16.796296296296298 average comments per post
Ask post at 21:00: 16.009174311926607 average comments per post
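Swapping the columns before sorting works, but the same ranking can be obtained by passing a `key` function to `sorted`. This is a sketch of that alternative, not the notebook's own method; `avg_by_hour_sample` is an illustrative subset of the averages printed above.

```python
# A subset of the [hour, average] pairs printed above (values copied as-is).
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the average (index 1) in descending order and keep the top five.
top_five = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top_five:
    print(f"{hour}:00: {average:.2f} average comments per post")
```

Sorting with a key leaves each pair in its original [hour, average] order, so no second list is needed, and the `:.2f` format keeps the printed averages readable.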

Analysis

To receive more comments, an Ask HN post should be created during the 15:00 hour, which had the highest average number of comments per post (about 38.59).

The average for the top hour (15:00) is about 62% higher than the average for the second-highest hour (02:00): 38.59 versus 23.81 comments per post.
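The relative increase between the two top hours can be checked directly from the averages printed earlier. This is a quick arithmetic sketch; the variable names are illustrative.

```python
# Averages for the two top hours, copied from the output above.
highest_avg = 38.5948275862069      # 15:00
second_avg = 23.810344827586206     # 02:00

# Relative increase of the highest average over the second highest, in percent.
pct_increase = (highest_avg - second_avg) / second_avg * 100

print(f"{pct_increase:.1f}% increase")
```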

Conclusion

In this data set, users submit more Ask HN posts than Show HN posts on Hacker News.

To get more comments, users should aim to post during the 15:00 hour or, failing that, the 02:00 hour, since Ask HN posts created in these hours received more comments on average than posts created at any other time.