Hacker News: Guided Project.

This project analyzes a data set of submissions to Hacker News, a site that is very popular in technology and startup circles. Users submit stories (posts), which other users vote and comment on, similar to reddit. Posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The data set has been reduced from almost 300,000 rows to about 20,000 rows: submissions that did not receive any comments were removed, and the remaining submissions were then randomly sampled. The data set has 7 columns, described below.

  • id: The unique identifier from Hacker News for the post
  • title: The title of the post
  • url: The URL that points to the post, if the post has a URL
  • num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
  • num_comments: The number of comments that were made on a post
  • author: The username of the person who submitted a post.
  • created_at: The date and time at which the post was submitted.

NB: Here is a link to the dataset: Hacker News

Our main interest in this project is in posts whose titles begin with Ask HN or Show HN. Users submit Ask HN posts to ask the community a specific question, and Show HN posts to show the community a project, product, or generally something interesting.

    Our goal is to:

    1. Compare posts whose titles begin with Ask HN and Show HN and determine which receive more comments on average.
    2. Determine whether the time at which a post is created affects the number of comments it receives.

Opening and Exploring the dataset

In [1]:
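# Opening the dataset and storing it as a list of rows.
from csv import reader

# The with block closes the file once it has been read.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]   # the first row holds the column names
hn = hn[1:]       # the remaining rows are the posts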
In [2]:
# Displaying five rows of the dataset (the slice 1:6 starts at the second row)
print(hn[1:6])
[['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48']]
In [3]:
# displaying the columns in our dataset
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
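
Each row is a plain list, so the column positions used later (for example row[1] for the title and row[4] for the number of comments) follow the order of headers. As a minimal sketch, using one row copied from the output above, the header names can be paired with a row's values like this (the names sample_row and labelled are only illustrative):

```python
# Column names and one sample row, copied from the notebook output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)',
              'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
              '8', '2', 'walterbell', '9/30/2015 4:12']

# zip pairs each column name with the value at the same position.
labelled = dict(zip(headers, sample_row))

print(labelled['title'])         # same value as sample_row[1]
print(labelled['num_comments'])  # same value as sample_row[4]; still a string
```

Note that every value is read from the CSV file as a string, which is why the later cells convert the counts with int().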

Filtering Ask HN and Show HN posts from other posts.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
In [5]:
# a function to count the number of posts in a list
def num_of_posts(post_list):
    return len(post_list)


print('This is the number of posts that begin with Ask HN post: {}'.format(num_of_posts(ask_posts)))
print('This is the number of posts that begin with Show HN post: {}'.format(num_of_posts(show_posts)))
print('These are the rest of the posts: {}'.format(num_of_posts(other_posts)))
This is the number of posts that begin with Ask HN post: 1744
This is the number of posts that begin with Show HN post: 1162
These are the rest of the posts: 17194

The counts show that users submit more Ask HN posts than Show HN posts. Our hypothesis is that an Ask HN post also tends to receive more comments than a Show HN post: people who know the answer to a question are inclined to reply, and their replies can stir up further discussion among other users. The analysis below tests this.
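
The prefix test used to separate the posts can also be written as a small helper. This is only a sketch: the function name classify_title is hypothetical and not part of the notebook, and the sample titles are taken from the dataset output shown in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Titles taken from the dataset output shown in this notebook.
sample_titles = [
    'Ask HN: Limiting CPU, memory, and I/O usage on a program for testing',
    'Show HN: Something pointless I made',
    'Technology ventures: From Idea to Enterprise',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing the title first means that a title written as "ASK HN" or "ask hn" lands in the same group as "Ask HN".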

In [6]:
# printing five posts from each of our separated post lists (the slice 1:6 starts at the second post)
print('================= Ask Posts ==================')
print(ask_posts[1:6])
print('================= Show Posts =================')
print(show_posts[1:6])
print('================= Other Posts ================')
print(other_posts[1:6])
================= Ask Posts ==================
[['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38'], ['10284812', 'Ask HN: Limiting CPU, memory, and I/O usage on a program for testing', '', '2', '1', 'zatkin', '9/26/2015 23:23']]
================= Show Posts =================
[['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45'], ['11237259', 'Show HN: Run with Mark (Runkeeper only)', 'http://runwithmark.github.io/#/', '3', '3', 'ecesena', '3/7/2016 5:17']]
================= Other Posts ================
[['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48']]

Calculating average number of comments for Ask HN posts and Show HN posts

In [7]:
# a function to calculate the average number of comments for any post_title category
def avg_no_comments(post_list, index=4):
    total_comments = 0
    for row in post_list:
        comments = int(row[index])
        total_comments += comments

        
    avg_comments_in_post = total_comments/(len(post_list))
    return avg_comments_in_post, total_comments
In [8]:
# displaying the average no of comments and the total number of comments for 'Ask HN' posts
avg_no_comments(ask_posts)
Out[8]:
(14.038417431192661, 24483)
In [9]:
# displaying the average no of comments and the total number of comments for Show HN posts.
avg_no_comments(show_posts)
Out[9]:
(10.31669535283993, 11988)

The values above support our hypothesis. Ask HN posts receive about 14.04 comments per post on average, compared with about 10.32 for Show HN posts. In total, Ask HN posts received 24,483 comments and Show HN posts 11,988. There are about 1.5 times as many Ask HN posts as Show HN posts (1,744 versus 1,162), yet Ask HN posts collected roughly twice as many comments. Because an average already divides by the number of posts, the larger number of Ask HN posts cannot by itself explain the higher average; a plausible explanation is that questions invite answers, and discussions then stir up around the replies.

However, the gap between the two averages is fairly small (fewer than 4 comments per post), so Show HN posts are also quite interactive, even though there are fewer of them.

We will nevertheless keep our focus on posts that begin with Ask HN.
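
As a quick arithmetic check, the averages can be recomputed from the totals and post counts reported in the outputs above. This is only a sketch; the variable names are illustrative.

```python
# Totals and post counts copied from the outputs above.
ask_total_comments, ask_post_count = 24483, 1744
show_total_comments, show_post_count = 11988, 1162

ask_avg = ask_total_comments / ask_post_count
show_avg = show_total_comments / show_post_count

print('Ask HN:  {:.2f} comments per post'.format(ask_avg))
print('Show HN: {:.2f} comments per post'.format(show_avg))
print('Ratio of post counts:    {:.2f}'.format(ask_post_count / show_post_count))
print('Ratio of total comments: {:.2f}'.format(ask_total_comments / show_total_comments))
```

The ratio of total comments is larger than the ratio of post counts, which is another way of seeing that the average per Ask HN post is higher.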

Finding the Amount of Ask Posts and Comments by Hour Created

We try to determine whether Ask posts created at a certain time (using the created_at column) are likely to attract more comments by:

  • Calculating the number of Ask posts created in each hour of the day, along with the number of comments received.
  • Calculating the average number of comments Ask posts receive by hour created.
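
Both steps depend on extracting the hour from a created_at string. A minimal sketch of that step on a single timestamp, copied from one of the Ask HN posts shown earlier:

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
created_at = '8/2/2016 14:20'   # created_at value of one Ask HN post shown earlier

# strptime turns the string into a datetime object ...
parsed = dt.datetime.strptime(created_at, date_format)
# ... and strftime("%H") turns it back into a zero-padded hour string.
hour_key = parsed.strftime("%H")

print(parsed)
print(hour_key)
```

Because the hour is zero-padded, a post created at 0:01 falls under the key '00', matching the keys of the dictionaries built in the next cell.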
In [10]:
# Calculate the amount of ask posts created during each hour of day and the number of comments received.

import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )

comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour
Out[10]:
{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

The output above shows that Ask posts created at 15:00 received by far the most comments in total. In general, posts created between the early afternoon (13:00) and late evening (21:00) received more comments than posts created at other hours.

This could be because most users are busy with work, school, etc. during the early hours of the day; as the afternoon approaches, the pressure eases off, leaving time for leisure activities.

NB: There is an oddity: posts created at 02:00 received a large number of comments for such an early hour. We will see the reason for that below.

Calculating the average number of comments for posts created during each hour of the day

In [11]:
# calculating the average number of comments per post for each hour of the day
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

avg_by_hour
Out[11]:
[['21', 16.009174311926607],
 ['14', 13.233644859813085],
 ['11', 11.051724137931034],
 ['23', 7.985294117647059],
 ['22', 6.746478873239437],
 ['15', 38.5948275862069],
 ['12', 9.41095890410959],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['17', 11.46],
 ['16', 16.796296296296298],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['18', 13.20183486238532],
 ['08', 10.25],
 ['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['02', 23.810344827586206],
 ['05', 10.08695652173913],
 ['04', 7.170212765957447],
 ['03', 7.796296296296297],
 ['07', 7.852941176470588],
 ['13', 14.741176470588234],
 ['20', 21.525]]

The result above further affirms the earlier results. For example, Ask posts created at 15:00 receive about 39 comments per post on average: 116 Ask posts were created in that hour, and together they received 4,477 comments. We also notice that the averages are generally higher between 13:00 and 21:00 than in the morning, which is consistent with the large comment totals seen during those hours.

This also accounts for the oddity we identified earlier: the 58 Ask posts created at 02:00 received about 23.8 comments each on average, the second-highest average of any hour, which explains the large total for that hour.
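
The relationship between the totals and the averages can be checked by hand for the two most notable hours. The sketch below uses the comment totals from Out[10] and the number of Ask posts per hour reported in Out[16] later in this notebook; the dictionary names are illustrative.

```python
# Total comments per hour (Out[10]) and number of Ask HN posts per hour (Out[16]).
comments_at = {'15': 4477, '02': 1381}
posts_at = {'15': 116, '02': 58}

for hour in comments_at:
    avg = comments_at[hour] / posts_at[hour]
    print('{}:00 -> {} comments / {} posts = {:.2f} per post'.format(
        hour, comments_at[hour], posts_at[hour], avg))
```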

In [12]:
# Sorting our avg_by_hour list (each pair is swapped so that sorted() orders by the average)

swap_avg_by_hour = []
for row in avg_by_hour:
    a = row[1]
    b = row[0]
    swap_avg_by_hour.append([a,b])

print('============== Unsorted avg values ===========')    
print(swap_avg_by_hour[:5])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('============= Sorted avg values ==============')
print('Top 5 Hours for Ask Posts Comments')
# sorted_swap[:6]

for avg, hr in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(
        dt.datetime.strptime(hr, '%H').strftime('%H:%M'), avg))
============== Unsorted avg values ===========
[[16.009174311926607, '21'], [13.233644859813085, '14'], [11.051724137931034, '11'], [7.985294117647059, '23'], [6.746478873239437, '22']]
============= Sorted avg values ==============
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

The values above show that, to maximize the number of comments on a post, the post should preferably be created at 15:00. Other suitable hours are 02:00, 20:00, 16:00 and 21:00.
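
As an alternative to swapping the columns before sorting, sorted() accepts a key function that selects the value to sort by. The following is only a sketch of that approach, using a few [hour, average] pairs copied from the avg_by_hour output above; it is not the method used in the notebook.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['21', 16.009174311926607],
    ['15', 38.5948275862069],
    ['16', 16.796296296296298],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key picks the average (second element) as the sort value, so no swap is needed.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

The pairs stay in their original [hour, average] order, so the loop can unpack them directly.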

Calculating the average number of points Ask posts and Show posts get.

In [13]:
# Finding the average number of points (num_points) for Ask posts

total_ask_count = 0
for row in ask_posts:
    count = row[3]
    if count != '':
        count = int(row[3])
        total_ask_count += count
    
    
avg_ask_counts = total_ask_count/len(ask_posts)
print('The average number of counts for Ask Posts is {:.2f}'.format(avg_ask_counts))
The average number of counts for Ask Posts is 15.06
In [14]:
# Finding the average number of points (num_points) for Show posts

total_show_count = 0
for row in show_posts:
    show_count = row[3]
    if show_count != '':
        show_count = int(row[3])
        total_show_count += show_count
    
    
avg_show_counts = total_show_count/len(show_posts)
print('The average number of counts for Show Posts is {:.2f}'.format(avg_show_counts))
The average number of counts for Show Posts is 27.56

The values above imply that Show posts earn more points (a higher rating) on average than Ask posts. This suggests that the community values contributions of projects, products, or generally something interesting more than posts that only seek information.

Are posts created at a certain time more upvoted than others?

It could be that a post uploaded at a certain time gets more attention or upvotes than posts uploaded at a different time. For example, we already saw that Ask posts uploaded at 15:00 get more feedback (comments) on average than posts uploaded at any other time of day.

In this section, we will see whether time has any effect on the upvotes a certain post may get.

We will do this in the following order:

  • Ask Posts
  • Show Posts
  • Other Posts
In [16]:
ask_list_counts_vs_time = []

# Collecting the creation time and points (num_points) of each Ask post
for posts in ask_posts:
    created = posts[6]
    counts = int(posts[3])
    ask_list_counts_vs_time.append([created, counts])
    
ask_counts_by_hour = {}   # number of Ask posts created in each hour
ask_points_by_hour = {}   # total points of the Ask posts created in each hour
date_format = '%m/%d/%Y %H:%M'

for row in ask_list_counts_vs_time:
    date = row[0]
    count = row[1]
    created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if created_obj not in ask_counts_by_hour:
        ask_counts_by_hour[created_obj] = 1
        ask_points_by_hour[created_obj] = count
    else:
        ask_counts_by_hour[created_obj] += 1
        ask_points_by_hour[created_obj] += count
        
print('Number of Ask posts created in each hour:')
ask_counts_by_hour  
Number of Ask posts created in each hour:
Out[16]:
{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}
In [18]:
show_list_counts_vs_time = []

# Collecting the creation time and points (num_points) of each Show post
for post in show_posts:
    created = post[6]
    counts = int(post[3])
    show_list_counts_vs_time.append([created, counts])
    
show_counts_by_hour = {}   # number of Show posts created in each hour
show_points_by_hour = {}   # total points of the Show posts created in each hour
date_format = '%m/%d/%Y %H:%M'

for row in show_list_counts_vs_time:
    date = row[0]
    count = row[1]
    show_created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if show_created_obj not in show_counts_by_hour:
        show_counts_by_hour[show_created_obj] = 1
        show_points_by_hour[show_created_obj] = count
    else:
        show_counts_by_hour[show_created_obj] += 1
        show_points_by_hour[show_created_obj] += count
        
print('Number of Show posts created in each hour:')
show_counts_by_hour
Number of Show posts created in each hour:
Out[18]:
{'00': 31,
 '01': 28,
 '02': 30,
 '03': 27,
 '04': 26,
 '05': 19,
 '06': 16,
 '07': 26,
 '08': 34,
 '09': 30,
 '10': 36,
 '11': 44,
 '12': 61,
 '13': 99,
 '14': 86,
 '15': 78,
 '16': 93,
 '17': 93,
 '18': 61,
 '19': 55,
 '20': 60,
 '21': 47,
 '22': 46,
 '23': 36}
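
The two tables above count how many posts were created in each hour. To compare upvotes across hours, the total points per hour would also have to be divided by these counts. The sketch below shows one way this could be done; the helper name avg_points_by_hour is hypothetical, and it is run here only on three Show HN rows copied from the output earlier in this notebook, not on the full dataset.

```python
import datetime as dt

def avg_points_by_hour(posts, date_format='%m/%d/%Y %H:%M'):
    """Return {hour: average num_points} for rows laid out like the hn rows."""
    points = {}
    counts = {}
    for row in posts:
        # row[6] is created_at and row[3] is num_points.
        hour = dt.datetime.strptime(row[6], date_format).strftime('%H')
        points[hour] = points.get(hour, 0) + int(row[3])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: points[hour] / counts[hour] for hour in points}

# Three Show HN rows copied from the output earlier in this notebook.
sample_posts = [
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/',
     '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm',
     'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'],
    ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients',
     'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'],
]

print(avg_points_by_hour(sample_posts))
```

Calling the same helper on the full ask_posts and show_posts lists would give the average points per post for each hour, which could then be sorted in the same way as the average comments.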

Conclusion

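From the results above we can conclude the following:

  • Show posts are submitted most often between 13:00 and 17:00. The tables above count posts per hour rather than points, so they show when Show posts are most active; they do not by themselves show that these hours earn the most upvotes.
  • Ask posts are submitted most often between 13:00 and 21:00, which is also the period in which Ask posts receive the most comments, with 15:00 the best hour on average.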