Analysing Posts on Hacker News

Index

1

Introduction

[Image: image.png. Source: Zox News]

Y Combinator is a startup incubator that hosts a site called Hacker News. The site contains posts (or stories, as they are called on Hacker News) submitted by users, which can be voted or commented on. The site is very popular in start-up circles, so posts that receive many up-votes can attract a large following.

Among the various post types, HN (Hacker News) has special sections for two in particular: Ask HN and Show HN posts.

If a post title is prefixed with Ask HN, the post is meant to pose a question to the HN community. The question could be a query or a doubt. If a post title is prefixed with Show HN, the post is meant to exhibit a project the user has done, possibly to get the community's opinion on that project or work.

The goal of this project is to compare Ask HN and Show HN posts. The focus will be on:

  • which post type received more comments on average
  • whether the time at which a post was created influences the average number of comments.

2

Reading the Data

The full dataset for the project can be found here. The explanation of every column has been provided in the link.

It must be noted that while the dataset contains nearly 300,000 records, the dataset that will be used for the project only contains a sample of about 20,000 records.

The dataset being used here was created by DQ (Dataquest) for the purpose of learning. It was created by removing all posts that had no comments and then randomly sampling 20,000 records from the remaining posts.
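
The filtering and sampling described above can be sketched roughly as follows. This is only an illustration: the exact procedure Dataquest used is not documented here, and the function name `sample_commented_posts`, the `seed` parameter and the toy frame are assumptions made for the example.

```python
import pandas as pd

def sample_commented_posts(posts: pd.DataFrame, n: int = 20_000, seed: int = 0) -> pd.DataFrame:
    """Keep posts with at least one comment, then draw up to n of them at random.

    Hypothetical helper, not the procedure actually used to build hacker_news.csv.
    """
    commented = posts[posts["num_comments"] > 0]
    # Cap n so that small inputs do not raise an error
    return commented.sample(n=min(n, len(commented)), random_state=seed)

# Tiny made-up frame standing in for the full dataset
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "num_comments": [0, 3, 0, 7, 1],
})
subset = sample_commented_posts(toy, n=2)
print(subset)
```

Fixing `random_state` makes the draw repeatable, which helps when the same sample has to be shared between readers.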

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
In [2]:
#Read the dataset
hn = pd.read_csv("hacker_news.csv")
hn[:10]
Out[2]:
id title url num_points num_comments author created_at
0 12224879 Interactive Dynamic Video http://www.interactivedynamicvideo.com/ 386 52 ne0phyte 8/4/2016 11:52
1 10975351 How to Use Open Source and Shut the Fuck Up at... http://hueniverse.com/2016/01/26/how-to-use-op... 39 10 josep2 1/26/2016 19:30
2 11964716 Florida DJs May Face Felony for April Fools' W... http://www.thewire.com/entertainment/2013/04/f... 2 1 vezycash 6/23/2016 22:20
3 11919867 Technology ventures: From Idea to Enterprise https://www.amazon.com/Technology-Ventures-Ent... 3 1 hswarna 6/17/2016 0:01
4 10301696 Note by Note: The Making of Steinway L1037 (2007) http://www.nytimes.com/2007/11/07/movies/07ste... 8 2 walterbell 9/30/2015 4:12
5 10482257 Title II kills investment? Comcast and other I... http://arstechnica.com/business/2015/10/comcas... 53 22 Deinos 10/31/2015 9:48
6 10557283 Nuts and Bolts Business Advice NaN 3 4 shomberj 11/13/2015 0:45
7 12296411 Ask HN: How to improve my personal website? NaN 2 6 ahmedbaracat 8/16/2016 9:55
8 11337617 Shims, Jigs and Other Woodworking Concepts to ... http://firstround.com/review/shims-jigs-and-ot... 34 7 zt 3/22/2016 16:18
9 10379326 That self-appendectomy http://www.southpolestation.com/trivia/igy1/ap... 91 10 jimsojim 10/13/2015 9:30

Since the id column is only a unique identifier and the url column has no relevance to our analysis, both can be removed.

In [3]:
#Remove the ID and URL columns
hn = hn.drop(["id",'url'],axis = 1)
In [4]:
hn[:15]
Out[4]:
title num_points num_comments author created_at
0 Interactive Dynamic Video 386 52 ne0phyte 8/4/2016 11:52
1 How to Use Open Source and Shut the Fuck Up at... 39 10 josep2 1/26/2016 19:30
2 Florida DJs May Face Felony for April Fools' W... 2 1 vezycash 6/23/2016 22:20
3 Technology ventures: From Idea to Enterprise 3 1 hswarna 6/17/2016 0:01
4 Note by Note: The Making of Steinway L1037 (2007) 8 2 walterbell 9/30/2015 4:12
5 Title II kills investment? Comcast and other I... 53 22 Deinos 10/31/2015 9:48
6 Nuts and Bolts Business Advice 3 4 shomberj 11/13/2015 0:45
7 Ask HN: How to improve my personal website? 2 6 ahmedbaracat 8/16/2016 9:55
8 Shims, Jigs and Other Woodworking Concepts to ... 34 7 zt 3/22/2016 16:18
9 That self-appendectomy 91 10 jimsojim 10/13/2015 9:30
10 Crate raises $4M seed round for its next-gen S... 3 1 hitekker 3/27/2016 18:08
11 Advertising Cannot Maintain the Internet. Here... 2 1 dredmorbius 5/10/2016 4:46
12 Coding Is Over 18 14 prostoalex 6/26/2016 16:36
13 Show HN: Wio Link ESP8266 Based Web of Things... 26 22 kfihihc 11/25/2015 14:03
14 Custom Deleters for C++ Smart Pointers 59 18 ingve 4/28/2016 10:01

3

Identifying Posts by their Type

Based on the discussion above, we need to classify the posts by type. Once we have segregated the posts, we can calculate the average number of comments and the total number of comments for each post type.
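
As a rough sketch of the idea, the same prefix-based classification can be written in one step with `np.select`. The three titles are taken from the preview of the data; the variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

titles = pd.Series([
    "Ask HN: How to improve my personal website?",
    "Show HN: Wio Link ESP8266 Based Web of Things...",
    "Coding Is Over",
])

# Compare on lower-cased titles so the match is case-insensitive
lower = titles.str.lower()
conditions = [lower.str.startswith("ask hn"), lower.str.startswith("show hn")]

# The first matching condition wins; anything else falls back to "others"
post_type = pd.Series(
    np.select(conditions, ["ask", "show"], default="others"),
    index=titles.index,
)
print(post_type.tolist())
```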

In [5]:
#Set the title column to lower case
hn["title"] = hn["title"].str.lower()
In [6]:
#Identify the type of each post
hn.loc[hn["title"].str.startswith("ask hn"), "type"]="ask"
hn.loc[hn["title"].str.startswith("show hn"), "type"]="show"
hn.loc[hn["type"].isnull(),"type"]="others"
In [7]:
hn[:10]
Out[7]:
title num_points num_comments author created_at type
0 interactive dynamic video 386 52 ne0phyte 8/4/2016 11:52 others
1 how to use open source and shut the fuck up at... 39 10 josep2 1/26/2016 19:30 others
2 florida djs may face felony for april fools' w... 2 1 vezycash 6/23/2016 22:20 others
3 technology ventures: from idea to enterprise 3 1 hswarna 6/17/2016 0:01 others
4 note by note: the making of steinway l1037 (2007) 8 2 walterbell 9/30/2015 4:12 others
5 title ii kills investment? comcast and other i... 53 22 Deinos 10/31/2015 9:48 others
6 nuts and bolts business advice 3 4 shomberj 11/13/2015 0:45 others
7 ask hn: how to improve my personal website? 2 6 ahmedbaracat 8/16/2016 9:55 ask
8 shims, jigs and other woodworking concepts to ... 34 7 zt 3/22/2016 16:18 others
9 that self-appendectomy 91 10 jimsojim 10/13/2015 9:30 others

Now that we have segregated the posts by type, we can find the average number of comments and average points by post type.

In [8]:
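#Calculate the average number of comments and points by post type
#Select the numeric columns explicitly: recent pandas versions raise an error when averaging text columns
new_hn_avg = hn.groupby(by="type")[["num_points","num_comments"]].mean().round(2).reset_index()
new_hn_avg.rename(columns = {"num_points":"avg_points", "num_comments":"avg_num_comments"},inplace=True)
new_hn_avg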
Out[8]:
type avg_points avg_num_comments
0 ask 15.06 14.04
1 others 55.41 26.87
2 show 27.56 10.32

We can also find the total number of comments and the total number of points by post type.

In [9]:
#Calculate the total number of comments and points by post type
#Select the numeric columns explicitly so that text columns are not summed (concatenated)
new_hn_total = hn.groupby(by="type")[["num_points","num_comments"]].sum().reset_index()
new_hn_total.rename(columns = {"num_points":"total_points","num_comments":"total_num_comments"},inplace=True)
new_hn_total
Out[9]:
type total_points total_num_comments
0 ask 26268 24483
1 others 952664 462055
2 show 32019 11988
In [10]:
#Combine the aggregated data
joined_data = pd.merge(new_hn_avg,new_hn_total,on = "type")

With the combined data, we can view everything in a single table and also analyse it with graphs.
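
The averaging, totalling and merging steps above can also be sketched as a single pass using pandas named aggregation. The toy frame below is made up for illustration; only the column names match the notebook.

```python
import pandas as pd

# Small made-up frame with the same column names as hn
toy = pd.DataFrame({
    "type": ["ask", "ask", "show", "others"],
    "num_points": [2, 4, 26, 386],
    "num_comments": [6, 2, 22, 52],
})

# One groupby producing the averages and totals side by side
summary = (
    toy.groupby("type")
       .agg(
           avg_points=("num_points", "mean"),
           avg_num_comments=("num_comments", "mean"),
           total_points=("num_points", "sum"),
           total_num_comments=("num_comments", "sum"),
       )
       .round(2)
       .reset_index()
)
print(summary)
```

Computing all four columns in one call avoids the separate merge, at the cost of spelling out each output column by name.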

In [11]:
joined_data["avg_points"] = joined_data["avg_points"].round(2)
joined_data["avg_num_comments"] = joined_data["avg_num_comments"].round(2)
joined_data
Out[11]:
type avg_points avg_num_comments total_points total_num_comments
0 ask 15.06 14.04 26268 24483
1 others 55.41 26.87 952664 462055
2 show 27.56 10.32 32019 11988
In [12]:
def gen_barplot(x_axis,y_axis,title):
    """
    Generate a barplot with a fixed style 
    
    Args:
    x_axis (string): Column name of joined_data whose values are plotted along the x-axis
    y_axis (string): Column name of joined_data whose values are plotted along the y-axis
    title (string): Title of the plot
    """
    #Name the returned Axes "ax" so that it does not shadow the pyplot module imported as plt
    ax = sns.barplot(data = joined_data, x = x_axis, y = y_axis, order = ['ask','show','others'])
    sns.despine(left = True, top=False, bottom=True)
    ax.set_title(title, size = 17)
    ax.set_xlabel(None)
    ax.set_ylabel(None)
    ax.tick_params(left=False, labelsize=13)
    ax.xaxis.tick_top()
In [13]:
gen_barplot(x_axis = 'avg_points', y_axis = 'type', title = 'Average Number of Points per Post Type')
In [14]:
gen_barplot(x_axis = 'avg_num_comments', y_axis = 'type', title = 'Average Number of Comments per Post Type')

The analysis above shows that show posts receive more points on average than ask posts (27.56 against 15.06), while ask posts receive more comments on average (14.04 against 10.32). A question invites many people to attempt an answer. Questions may also lead to discussions and therefore to an increase in the number of comments. Some questions may be common or helpful to the general public and so attract up-votes as well.

Show posts, on the other hand, do not necessarily warrant a deep discussion. Instead they seek feedback, which is more easily given as points, possibly with some additional comments.

4

Analysing the Impact of Time of Day

Each post has a record of the date and time at which it was created. Analysing this time could reveal when a question should be asked to receive the most answers. More specifically, we can find the hour of the day during which a question is likely to receive the most comments.

To get this information, we first extract the hour from the date-time details provided and then average the number of comments on ask posts for each hour.
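
A minimal sketch of the hour extraction on three timestamps copied from the data preview, assuming the same month/day/year format throughout the column:

```python
import pandas as pd

# Three created_at values taken from the preview of the dataset
stamps = pd.Series(["8/4/2016 11:52", "6/17/2016 0:01", "1/26/2016 19:30"])

# Parse with an explicit format, then read the hour through the .dt accessor
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %H:%M")
hours = parsed.dt.hour
print(hours.tolist())  # [11, 0, 19]
```

Passing `format` explicitly avoids any ambiguity between month-first and day-first dates.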

In [15]:
#Extract the hour from the date-time detail.
hn["hour"] = pd.to_datetime(hn["created_at"],format = "%m/%d/%Y %H:%M")
#The .dt accessor belongs to pandas, so the datetime module does not need to be imported
hn["hour"] = hn["hour"].dt.hour
hn[:10]
Out[15]:
title num_points num_comments author created_at type hour
0 interactive dynamic video 386 52 ne0phyte 8/4/2016 11:52 others 11
1 how to use open source and shut the fuck up at... 39 10 josep2 1/26/2016 19:30 others 19
2 florida djs may face felony for april fools' w... 2 1 vezycash 6/23/2016 22:20 others 22
3 technology ventures: from idea to enterprise 3 1 hswarna 6/17/2016 0:01 others 0
4 note by note: the making of steinway l1037 (2007) 8 2 walterbell 9/30/2015 4:12 others 4
5 title ii kills investment? comcast and other i... 53 22 Deinos 10/31/2015 9:48 others 9
6 nuts and bolts business advice 3 4 shomberj 11/13/2015 0:45 others 0
7 ask hn: how to improve my personal website? 2 6 ahmedbaracat 8/16/2016 9:55 ask 9
8 shims, jigs and other woodworking concepts to ... 34 7 zt 3/22/2016 16:18 others 16
9 that self-appendectomy 91 10 jimsojim 10/13/2015 9:30 others 9
In [16]:
#Average the number of comments on Ask posts by hour
ask_posts = hn[hn["type"] == 'ask'][["num_comments","hour"]]
ask_posts = ask_posts.groupby("hour").mean().round().reset_index()
print("\033[1m"+"Average Number of Comments for Ask Posts by Hour"+"\033[0m")
ask_posts
Average Number of Comments for Ask Posts by Hour
Out[16]:
hour num_comments
0 0 8.0
1 1 11.0
2 2 24.0
3 3 8.0
4 4 7.0
5 5 10.0
6 6 9.0
7 7 8.0
8 8 10.0
9 9 6.0
10 10 13.0
11 11 11.0
12 12 9.0
13 13 15.0
14 14 13.0
15 15 39.0
16 16 17.0
17 17 11.0
18 18 13.0
19 19 11.0
20 20 22.0
21 21 16.0
22 22 7.0
23 23 8.0
In [17]:
fig = plt.figure(figsize=(7,7))

sns.set(style = "white")
#Name the returned Axes "ax" so that the pyplot module imported as plt is not overwritten
ax = sns.lineplot(data = ask_posts, x = 'hour', y = 'num_comments', marker = 'o', color='green')

sns.despine(left = True,bottom = True, top=True)
ax.set_title("Hourly Average of Comments for Ask Posts", size=18)
ax.set_xlabel(None)
ax.set_ylabel(None)
ax.xaxis.tick_top()
ax.set_xticks([0,2,4,6,8,10,12,14,16,18,20,22])
ax.tick_params(axis='x',top=False,labelsize=16)
ax.tick_params(axis='y',top=False,labelsize=13)

The analysis above shows that questions asked during the 15th hour of the day, between 3:00 and 4:00 PM, attract the most comments on average (about 39 per post).

In the same vein, it would be interesting to know the impact of the time of day on the points allotted. We saw earlier that show posts tend to have more points on average than ask posts. However, the total points of show and ask posts are not too far apart. For this reason we shall compare the average number of points per hour for all posts.
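
The busiest hours can also be read off programmatically rather than from the plot. The sketch below uses a handful of the hourly averages copied from the table above; the Series name and the choice of hours are only for illustration.

```python
import pandas as pd

# Selected hourly averages copied from the ask-post table (hour -> average comments)
avg_comments = pd.Series(
    {2: 24.0, 13: 15.0, 15: 39.0, 16: 17.0, 20: 22.0, 21: 16.0},
    name="num_comments",
)
avg_comments.index.name = "hour"

# Hour with the highest average, and the three highest overall
best_hour = avg_comments.idxmax()
top3 = avg_comments.nlargest(3)
print(best_hour)
print(top3)
```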

In [18]:
#Average the post points by hour
most_points = hn[["num_points","hour"]]
most_points = most_points.groupby("hour").mean().round().reset_index()
print("\033[1m"+"Average Number of Points for all Posts by Hour"+"\033[0m")
most_points
Average Number of Points for all Posts by Hour
Out[18]:
hour num_points
0 0 54.0
1 1 45.0
2 2 51.0
3 3 50.0
4 4 44.0
5 5 44.0
6 6 42.0
7 7 52.0
8 8 48.0
9 9 49.0
10 10 55.0
11 11 53.0
12 12 53.0
13 13 56.0
14 14 54.0
15 15 56.0
16 16 50.0
17 17 53.0
18 18 50.0
19 19 54.0
20 20 42.0
21 21 44.0
22 22 46.0
23 23 48.0
In [19]:
from matplotlib import pyplot as plt

# Set size of layout
fig = plt.figure(figsize=(7,7))
ax1 = fig.add_subplot(111)
# Set the line plot
sns.lineplot(data = most_points, x = 'hour', y = 'num_points', marker = 'o')

#Clean up the plot
sns.despine(left = True, bottom = True)
ax1.set_title("Hourly Average of Points for All Posts", size=18)
ax1.set_xlabel(None)
ax1.set_ylabel(None)
ax1.xaxis.tick_top()
ax1.tick_params(axis='x',top=False,labelsize=16)
ax1.tick_params(axis='y',top=False,labelsize=13)
ax1.set_xticks([0,2,4,6,8,10,12,14,16,18,20,22])
ax1.set_yticks([40,45,50,55])
plt.show()

As can be seen above, the average points per post range between 42 and 56 across the hours of the day. We cannot definitively say that the time of day has a significant impact on the points allotted, as the averages do not vary much through the course of the day and the allotment of points depends on the kind of posts that have been put up.
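
The spread of the hourly averages can be checked directly from the table. The values below are copied from the output for hours 0 to 23; the variable names are chosen only for this sketch.

```python
import pandas as pd

# Average points per post for hours 0..23, copied from the table above
avg_points = pd.Series(
    [54, 45, 51, 50, 44, 44, 42, 52, 48, 49, 55, 53,
     53, 56, 54, 56, 50, 53, 50, 54, 42, 44, 46, 48],
    name="num_points",
)
avg_points.index.name = "hour"

lowest, highest = avg_points.min(), avg_points.max()
print(lowest, highest, highest - lowest)
```

A narrow spread relative to the averages themselves is what suggests that the hour of posting has little effect on points.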

5

Conclusion

The goal of this project was to analyse a sample of 20,000 posts from the Hacker News site run by Y Combinator and deduce the impact of the type of post on the number of comments from users and the number of points allotted by users. In addition, we also tried to analyse the impact of the time of day on the number of comments put up by users and the number of points allotted by readers.

Ask posts tend to get more comments, while Show posts tend to get more points in comparison. Questions posed between 3 and 4 PM tend to get more comments than at any other time of the day. The average points per post, however, stay between 42 and 56 throughout the day.