Exploring Hacker News Posts

In this project I will analyze a data set from the popular website Hacker News. The data set contains 293,119 rows of user-submitted posts.

There are two main goals in this project:

  • To determine whether Ask HN or Show HN posts receive more comments
  • To determine which time of day has the biggest influence on the number of comments

There are also two side goals:

  • To determine whether Ask HN or Show HN posts receive more points
  • To determine which time of day has the biggest influence on the number of points

As quoted from the FAQ, "Ask HN lists questions and other text submissions. Show HN is for sharing your personal work and has special rules."

The points system is a little more complicated, and there is even an Ask HN post on the subject. For simplicity's sake, we'll assume that points equal upvotes minus downvotes.

In [1]:
##Let's begin by opening and reading the data set
from csv import reader

##The with block closes the file once the rows have been read
with open(r"C:\Users\Green Miracle\csv files\hacker_news.csv", encoding="utf8") as opened:
    hacker = list(reader(opened))


##The data of our set will be labeled hn
hn = hacker[1:]
##And the header will be labeled hnh
hnh = hacker[0]

##Let's print the first five rows to ensure everything is correct
print(hnh, "\n")
for row in hn[:5]:
    print(row, "\n")
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'] 

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'] 

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'] 

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'] 

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'] 

Now that we have our data, let's start cleaning it.

Data Cleaning: Ask HN and Show HN

We are only interested in the data that use the subjects Ask HN and Show HN, so we'll create and fill two new lists, respectively, with this pertinent data.

In [2]:
##We'll create two lists for the pertinent data, 
##and another for the rest of the data.
ask_posts = []
show_posts = []
other_posts = []

##Now we'll loop through our data set and fill each of our new 
##lists with its respective data.
for row in hn:
    title = row[1].lower()
    
    if title.startswith("ask hn"):
        ask_posts.append(row)
        
    elif title.startswith("show hn"):
        show_posts.append(row)
        
    else:
        other_posts.append(row)
        
##Let's see how much data we are left with in each list
print("Ask HN Posts :", len(ask_posts), "\n"
      "Show HN Posts :", len(show_posts), "\n"
      "Other Posts :", len(other_posts))
Ask HN Posts : 9139 
Show HN Posts : 10158 
Other Posts : 273822

We are left with 19,297 posts to work with. Our data is all in a clean, readable format for our first goal of determining which subject receives more comments. So let's begin.
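As a quick sanity check on the prefix test used above, here is a minimal sketch that applies the same rule to a few made-up titles. The `classify_title` helper and the sample titles are hypothetical and are not part of the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles, for illustration only
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Why I stopped reading Ask HN threads",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing first makes the match case-insensitive, and `startswith` skips titles that only mention Ask HN later in the text.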

Analysis: Ask HN and Show HN Average Comments and Points

In [3]:
##Let's begin by creating our variables for our average comments
total_ask_comments = 0
total_show_comments = 0

avg_ask_comments = 0
avg_show_comments = 0

total_ask = 0
total_show = 0

##Now let's loop through our respective subjects to determine
##the number of comments in each subject, as well as the average
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    total_ask += 1

avg_ask_comments = total_ask_comments // total_ask
    
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    total_show += 1
    
avg_show_comments = total_show_comments // total_show
    
##Now that we have our relevant information, let's print it out and view it
print("-------------------" "\n"
      "||Comment Average||" "\n"
      "-------------------" "\n"
      "Total Ask HN Comments :", total_ask_comments, "\n"
      "Average Ask HN Comments :", avg_ask_comments, "\n", "\n"
     
      "Total Show HN Comments :", total_show_comments, "\n"
      "Average Show HN Comments :", avg_show_comments, "\n")


##We can repeat the same process here for the average number of points
total_ask_points = 0
total_show_points = 0

avg_ask_points = 0
avg_show_points = 0

total_ask_pt = 0
total_show_pt = 0

for row in ask_posts:
    points = int(row[3])
    total_ask_points += points
    total_ask_pt += 1

avg_ask_points = total_ask_points // total_ask_pt
    
for row in show_posts:
    points = int(row[3])
    total_show_points += points
    total_show_pt += 1
    
avg_show_points = total_show_points // total_show_pt

##And print our results
print("-----------------" "\n"
      "||Point Average||" "\n"
      "-----------------" "\n"
      "Total Ask HN Points :", total_ask_points, "\n"
      "Average Ask HN Points :", avg_ask_points, "\n", "\n"
     
      "Total Show HN Points :", total_show_points, "\n"
      "Average Show HN Points :", avg_show_points)
-------------------
||Comment Average||
-------------------
Total Ask HN Comments : 94986 
Average Ask HN Comments : 10 
 
Total Show HN Comments : 49633 
Average Show HN Comments : 4 

-----------------
||Point Average||
-----------------
Total Ask HN Points : 103378 
Average Ask HN Points : 11 
 
Total Show HN Points : 150781 
Average Show HN Points : 14

It appears that Ask HN posts receive more comments and discussion than Show HN posts: 45,353 more comments in total, even though there are fewer Ask HN posts. Ask HN is a much more popular subject for comments than Show HN.

However, Show HN posts receive more points on average than Ask HN posts, though the difference is only 3 points per post.
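One caveat on the averages above: they were computed with floor division (`//`), which drops the fractional part. As a rough cross-check, the sketch below recomputes them with true division from the totals and post counts printed earlier. The variable names here are my own.

```python
# Totals and post counts copied from the outputs above: (total, number of posts)
totals = {
    "Ask HN comments": (94986, 9139),
    "Show HN comments": (49633, 10158),
    "Ask HN points": (103378, 9139),
    "Show HN points": (150781, 10158),
}

# True division keeps the fractional part that // discards
averages = {label: total / count for label, (total, count) in totals.items()}

for label, average in averages.items():
    print(f"Average {label}: {average:.2f}")
```

With the fractions kept, Show HN comes out closer to 5 comments per post than 4, so the gap in comments is about 5.5 and the gap in points is about 3.5.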

Although there are more Show HN posts, Ask HN posts average 6 more comments each. I wonder whether a few posts with a very large number of comments are inflating this average.

In [4]:
##Let's find all the posts that contain more than 500 comments on them
big_ask_comments = []
big_show_comments = []

for row in ask_posts:
    comments = int(row[4])
    if comments > 500:
        big_ask_comments.append(row)
        
for row in show_posts:
    comments = int(row[4])
    if comments > 500:
        big_show_comments.append(row)

print("Posts with over 500 Comments :" "\n" "\n"
      "Ask HN Posts :", len(big_ask_comments), "\n"
      "Show HN Posts :", len(big_show_comments))

##I'll find the new average for Ask HN posts as well, skipping
##Show HN posts: none exceed 500 comments, so their average doesn't change
under_500_comments = 0
under_500_total = 0
under_500_avg = 0

for row in ask_posts:
    comments = int(row[4])
    ##Use <= so posts with exactly 500 comments are not dropped
    if comments <= 500:
        under_500_comments += comments
        under_500_total += 1

under_500_avg = under_500_comments // under_500_total

print("\n" "\n" "Fixed Ask HN Average Comments :", under_500_avg)
Posts with over 500 Comments :

Ask HN Posts : 19 
Show HN Posts : 0


Fixed Ask HN Average Comments : 8

As I thought, Ask HN has some large outliers. It seems that Ask HN posts have a greater chance of going viral than Show HN posts. Out of 9,139 total Ask HN posts, just 19 are enough to raise the average by two comments per post.
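To illustrate how a handful of very large values can pull a mean upward, here is a small sketch on made-up comment counts (hypothetical numbers, not taken from the data set). It also prints the median, which is not used elsewhere in this project but is less sensitive to outliers.

```python
from statistics import mean, median

# Hypothetical comment counts; the last value plays the role of a viral post
comment_counts = [2, 3, 5, 8, 4, 6, 900]

# Apply a 500-comment cutoff, as in the cell above
typical_counts = [count for count in comment_counts if count <= 500]

print("Mean with outlier    :", round(mean(comment_counts), 2))
print("Mean without outlier :", round(mean(typical_counts), 2))
print("Median with outlier  :", median(comment_counts))
```

A single extreme value dominates the mean here, while the median barely notices it.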

Data Cleaning: Time of Day

Now we can begin our second goal of finding the most popular time of day for posts receiving comments and points. Since Ask HN posts receive almost double the number of comments compared to Show HN posts, we'll base this analysis on the Ask HN data. We'll also use Ask HN for our points comparison: the point averages were similar, and sticking to one set of posts keeps the analysis easier to read.

We'll first need to put the data into a standard, readable form, and for this we'll use dictionaries. We only need the hour of day at which each post was created, along with its comments or points, so we'll start by creating two new lists that hold just this information.

In [5]:
##We'll import the datetime module, which contains many useful classes 
##and methods for managing dates and times. We'll give the module an 
##alias of dt, to help with code readability
import datetime as dt

##We'll create two lists of lists that contain the pertinent information
##from our posts: the date created, paired with the comments or the points
result_comments = []
result_points = []

##Then we'll create a loop to grab this information
for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    points = int(row[3])
    
    result_comments.append([created_at, comments])
    result_points.append([created_at, points])
    
##Now we'll populate three dictionaries with this information.
##One dictionary will contain the number of posts made for
##a certain hour of the day, while the other two will contain the
##number of comments/points for a certain hour of the day.
counts_by_hour = {}
comments_by_hour = {}
points_by_hour = {}

##Loop for our comments/posts dictionaries
for row in result_comments:
    comments = row[1]
    dates = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = dates.strftime("%H")
    hour = hour + ":00"
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
        
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    
##Loop for our points    
for row in result_points:
    points = row[1]
    dates = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = dates.strftime("%H")
    hour = hour + ":00"
    
    if hour not in points_by_hour:
        points_by_hour[hour] = points
        
    else:
        points_by_hour[hour] += points
        
##Let's view our dictionaries
print("------------------" "\n"
      "||Posts by Hour ||" "\n"
      "------------------" "\n", 
       counts_by_hour, "\n" "\n"
      "---------------------" "\n"
      "||Comments by Hour ||" "\n"
      "---------------------" "\n", 
       comments_by_hour, "\n" "\n"
      "---------------------" "\n"
      "||Points by Hour ||" "\n"
      "---------------------" "\n", 
       points_by_hour)
------------------
||Posts by Hour ||
------------------
 {'02:00': 269, '01:00': 282, '22:00': 383, '21:00': 518, '19:00': 552, '17:00': 587, '15:00': 646, '14:00': 513, '13:00': 444, '11:00': 312, '10:00': 282, '09:00': 222, '07:00': 226, '03:00': 271, '23:00': 343, '20:00': 510, '16:00': 579, '08:00': 257, '00:00': 301, '18:00': 614, '12:00': 342, '04:00': 243, '06:00': 234, '05:00': 209} 

---------------------
||Comments by Hour ||
---------------------
 {'02:00': 2996, '01:00': 2089, '22:00': 3372, '21:00': 4500, '19:00': 3954, '17:00': 5547, '15:00': 18525, '14:00': 4972, '13:00': 7245, '11:00': 2797, '10:00': 3013, '09:00': 1477, '07:00': 1585, '03:00': 2154, '23:00': 2297, '20:00': 4462, '16:00': 4466, '08:00': 2362, '00:00': 2277, '18:00': 4877, '12:00': 4234, '04:00': 2360, '06:00': 1587, '05:00': 1838} 

---------------------
||Points by Hour ||
---------------------
 {'02:00': 2944, '01:00': 2662, '22:00': 3601, '21:00': 5042, '19:00': 4782, '17:00': 7155, '15:00': 13978, '14:00': 5390, '13:00': 7962, '11:00': 2856, '10:00': 3789, '09:00': 1763, '07:00': 2040, '03:00': 2539, '23:00': 2616, '20:00': 4491, '16:00': 5970, '08:00': 2744, '00:00': 2835, '18:00': 6850, '12:00': 4643, '04:00': 2650, '06:00': 2030, '05:00': 2046}
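To make the hour extraction concrete, the sketch below parses the `created_at` value from the first row printed at the top of this notebook and then buckets a few rows by hour. The `hour_bucket` helper and the other sample rows are hypothetical. `defaultdict` is shown as an alternative to the membership checks used above, not as what this project uses.

```python
import datetime as dt
from collections import defaultdict

def hour_bucket(created_at):
    """Turn a timestamp like '9/26/2016 3:26' into an hour label like '03:00'."""
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    return parsed.strftime("%H") + ":00"

# [created_at, comments] pairs; only the first timestamp comes from the data shown earlier
sample_rows = [
    ["9/26/2016 3:26", 0],
    ["9/26/2016 15:05", 12],
    ["9/25/2016 15:40", 7],
]

counts_sample = defaultdict(int)    # posts per hour
comments_sample = defaultdict(int)  # comments per hour

for created_at, comments in sample_rows:
    hour = hour_bucket(created_at)
    counts_sample[hour] += 1
    comments_sample[hour] += comments

print(dict(counts_sample))
print(dict(comments_sample))
```

Because a `defaultdict(int)` starts every missing key at 0, the loop needs no separate branch for the first post seen in each hour.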

Analysis: Average Posts per Hour of Day

Now that we have our dictionaries, we can use them to find the average number of comments and points per post for each hour of the day.

In [6]:
##Let's create two lists that'll contain our averages. In each, the first
##column will hold the time of day, and the second will contain the average
avg_comments_hour = []
avg_points_hour = []

for key in counts_by_hour:
    average1 = comments_by_hour[key] // counts_by_hour[key]
    average2 = points_by_hour[key] // counts_by_hour[key]
    
    avg_comments_hour.append([key, average1])
    avg_points_hour.append([key, average2])
    
##And let's print our results
print("------------------------------------" "\n"
      "||Average Comments to Time of Day ||" "\n"
      "------------------------------------" "\n",
      avg_comments_hour, "\n" "\n"
      "----------------------------------" "\n"
      "||Average Points to Time of Day ||" "\n"
      "----------------------------------" "\n",
      avg_points_hour)
------------------------------------
||Average Comments to Time of Day ||
------------------------------------
 [['02:00', 11], ['01:00', 7], ['22:00', 8], ['21:00', 8], ['19:00', 7], ['17:00', 9], ['15:00', 28], ['14:00', 9], ['13:00', 16], ['11:00', 8], ['10:00', 10], ['09:00', 6], ['07:00', 7], ['03:00', 7], ['23:00', 6], ['20:00', 8], ['16:00', 7], ['08:00', 9], ['00:00', 7], ['18:00', 7], ['12:00', 12], ['04:00', 9], ['06:00', 6], ['05:00', 8]] 

----------------------------------
||Average Points to Time of Day ||
----------------------------------
 [['02:00', 10], ['01:00', 9], ['22:00', 9], ['21:00', 9], ['19:00', 8], ['17:00', 12], ['15:00', 21], ['14:00', 10], ['13:00', 17], ['11:00', 9], ['10:00', 13], ['09:00', 7], ['07:00', 9], ['03:00', 9], ['23:00', 7], ['20:00', 8], ['16:00', 10], ['08:00', 10], ['00:00', 9], ['18:00', 11], ['12:00', 13], ['04:00', 10], ['06:00', 8], ['05:00', 9]]

Now we have our results! This is a little difficult to read, however, so let's make it look a little nicer.

In [7]:
##Let's swap the columns so that we can sort our data from the
##highest average to the lowest.
swap_avg_by_hour = []

for row in avg_comments_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
back_swap = sorted(swap_avg_by_hour, reverse=False)

##And do the same for points
swap_avg_pts_by_hour = []

for row in avg_points_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_pts_by_hour.append([avg, hour])

sorted_swap_pts = sorted(swap_avg_pts_by_hour, reverse=True)
back_swap_pts = sorted(swap_avg_pts_by_hour, reverse=False)

##Then let's print our results for comments
print("-------------------------------------------" "\n"
      "||Top 5 Best Hours for Ask Posts Comments||" "\n"
      "-------------------------------------------")

for avg in sorted_swap[:5]:
    format_str = "{0}: {1} average comments per post"
    print(format_str.format(avg[1], avg[0]))
    
    
print("--------------------------------------------" "\n"
      "||Top 3 Worst Hours for Ask Posts Comments||" "\n"
      "--------------------------------------------")

for avg in back_swap[:3]:
    format_str = "{0}: {1} average comments per post"
    print(format_str.format(avg[1], avg[0]))
    
##And points
print("-------------------------------------------" "\n"
      "||Top 5 Best Hours for Ask Posts Points||" "\n"
      "-------------------------------------------")

for avg in sorted_swap_pts[:5]:
    format_str = "{0}: {1} average points per post"
    print(format_str.format(avg[1], avg[0]))
    
    
print("--------------------------------------------" "\n"
      "||Top 3 Worst Hours for Ask Posts Points||" "\n"
      "--------------------------------------------")

for avg in back_swap_pts[:3]:
    format_str = "{0}: {1} average points per post"
    print(format_str.format(avg[1], avg[0]))
-------------------------------------------
||Top 5 Best Hours for Ask Posts Comments||
-------------------------------------------
15:00: 28 average comments per post
13:00: 16 average comments per post
12:00: 12 average comments per post
02:00: 11 average comments per post
10:00: 10 average comments per post
--------------------------------------------
||Top 3 Worst Hours for Ask Posts Comments||
--------------------------------------------
06:00: 6 average comments per post
09:00: 6 average comments per post
23:00: 6 average comments per post
-------------------------------------------
||Top 5 Best Hours for Ask Posts Points||
-------------------------------------------
15:00: 21 average points per post
13:00: 17 average points per post
12:00: 13 average points per post
10:00: 13 average points per post
17:00: 12 average points per post
--------------------------------------------
||Top 3 Worst Hours for Ask Posts Points||
--------------------------------------------
09:00: 7 average points per post
23:00: 7 average points per post
06:00: 8 average points per post

It appears that 3PM Eastern Time is when Ask HN posts receive the most comments on average, with 1PM and 12PM in second and third place. The hours with the fewest comments on average are 6AM, 9AM, and 11PM. For the best chance of receiving comments on your Ask HN post, you should post at 3PM EST and avoid posting at 6AM, 9AM, and 11PM EST.

Points follow a very similar pattern, with nearly the same hours showing the highest averages. This leads me to believe that the pattern reflects the user base's active hours: users appear to be most active at 3PM EST, and least active at 9AM and 11PM EST.
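The column swap used for sorting can also be avoided by giving `sorted` a key function. This is a sketch of that alternative, using a few of the [hour, average] pairs from the comment averages above. The variable names are my own.

```python
# A few [hour, average comments] pairs copied from the output above
avg_comments_sample = [
    ["02:00", 11],
    ["15:00", 28],
    ["13:00", 16],
    ["09:00", 6],
    ["12:00", 12],
]

# Sort on the second column (the average) without swapping columns
best_first = sorted(avg_comments_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in best_first[:3]:
    print("{0}: {1} average comments per post".format(hour, average))
```

One difference to keep in mind: with a key on the average alone, tied hours keep their original order because Python's sort is stable, whereas the swapped lists above break ties by hour.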

Conclusion

Our initial goals were to answer two questions: whether Ask HN or Show HN posts receive more comments on average, and which time of day has the greatest influence on the number of comments.

Our side goals were to answer the same two questions for points.

In conclusion, we have successfully answered all of our questions. Ask HN posts receive the most comments from the community, with almost double the number of comments compared to Show HN posts, and tend to have a greater chance of going viral. Show HN posts, however, receive more points than Ask HN posts, though the averages are similar. To receive the most comments and points on an Ask HN post, you should post at 3PM EST, when the user base is most active.