Analyzing the submissions of Hacker News

I am going to examine Hacker News, a website that is popular among people interested in technology and startups. I am particularly curious about posts whose titles begin with "Ask HN" and "Show HN".

"Show HN" posts are created by people who want the community's opinion on a project, a product, or just something interesting. Likewise, "Ask HN" posts are created by people who want to know what the community thinks about a question.

The main questions to explore are:

  • Do "Ask HN" or "Show HN" posts receive more comments on average?
  • Is there any relationship between the number of comments a post receives and the time it was created?

The source data is available here.

Columns are explained below:

  • title: title of the post (self explanatory)

  • url: the url of the item being linked to

  • num_points: the number of upvotes the post received

  • num_comments: the number of comments the post received

  • author: the name of the account that made the post

  • created_at: the date and time the post was made (the time zone is Eastern Time in the US)

In [42]:
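### Read the CSV file and assign its rows to the variable hn
import csv
# The with statement closes the file once the rows have been read
with open("C:/Users/Toshiba1/Desktop/DataQuest/hacker_news.csv", encoding="utf8") as opened_file:
    hn = list(csv.reader(opened_file))
headers = hn[0]  # The first row holds the column names
hn = hn[1:]      # The remaining rows are the posts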

### Display headers and the first five rows
print(headers)
print('-' * 125)
for post in hn[0:5]:
    print(post)
    print('\n')
print("Total number of posts: " + str(len(hn)))
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
-----------------------------------------------------------------------------------------------------------------------------
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Total number of posts: 20100

Filtering Titles

The posts need to be filtered so that we only work with titles that start with "Ask HN" or "Show HN". To do this, I will sort them into new lists.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts) + len(show_posts) + len(other_posts)) # Check that every post landed in exactly one list
20100
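The prefix check above can also be pulled into a small function, which makes it easy to test on a few titles before running it over the whole data set. This is only a sketch: the name classify_title and the sample rows are hypothetical, and the rows merely imitate the column order of the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical sample rows in the same column order as the data set
sample_rows = [
    ["1", "Ask HN: How do you learn?", "", "5", "3", "user_a", "8/4/2016 11:52"],
    ["2", "Show HN: My side project", "http://example.com", "7", "2", "user_b", "8/4/2016 12:10"],
    ["3", "Interactive Dynamic Video", "http://example.com/v", "386", "52", "user_c", "8/4/2016 11:52"],
]

# Group the rows by post type
groups = {"ask": [], "show": [], "other": []}
for row in sample_rows:
    groups[classify_title(row[1])].append(row)

print({kind: len(rows) for kind, rows in groups.items()})
```

Keeping the rule in one place means the "Ask HN" and "Show HN" checks cannot drift apart if the rule changes later.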

Finding Out Which Post Type Receives More Comments on Average

We want to find out which post type is more likely to encourage people to comment.

In [3]:
## Calculating average number of comments for "Ask HN" posts
total_ask_comment = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comment += num_comments
avg_ask_comments = total_ask_comment / len(ask_posts)
print(avg_ask_comments)
14.038417431192661
In [4]:
## Calculating average number of comments for "Show HN" posts
total_show_comment = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comment += num_comments
avg_show_comments = total_show_comment / len(show_posts)
print(avg_show_comments)
10.31669535283993
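The two loops above differ only in the list they read, so they could be folded into one helper. The sketch below assumes the comment count sits at index 4, as in this data set; the name average_comments and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows; only index 4 (num_comments) matters here
sample_posts = [
    ["1", "Ask HN: A", "", "5", "3", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "9", "5", "user_b", "8/4/2016 12:10"],
]

print(average_comments(sample_posts))
```

With the real lists, the calls would be average_comments(ask_posts) and average_comments(show_posts). The empty-list guard is an extra precaution that the original loops do not have.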

The results show that "Ask HN" posts receive more comments on average than "Show HN" posts (about 14.04 versus 10.32). From here on, we narrow our scope to "Ask HN" posts because they attract more comments.

In [12]:
import datetime as dt
result_list = []
for row in ask_posts:
    create_time = row[6]
    num_comment = int(row[4])
    result_list.append([create_time, num_comment])

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "09"
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
print(counts_by_hour) # Number of "Ask HN" posts created in each hour
print("")
print(comments_by_hour) # Number of comments received by "Ask HN" posts created in each hour
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
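The same per-hour tallies can be built without the "is the key already there" branch by using collections.defaultdict. This is a sketch of an alternative, not the approach used in the cell above; aggregate_by_hour and the sample rows are hypothetical, and the column indexes are assumed to match this data set.

```python
import datetime as dt
from collections import defaultdict

def aggregate_by_hour(posts, created_index=6, comments_index=4):
    """Return (posts per hour, comments per hour) keyed by zero-padded hour."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[created_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)

# Hypothetical rows using the timestamp format of the data set
sample_posts = [
    ["1", "Ask HN: A", "", "10", "52", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "4", "8", "user_b", "8/4/2016 11:05"],
    ["3", "Ask HN: C", "", "3", "1", "user_c", "6/17/2016 0:01"],
]

counts, comments = aggregate_by_hour(sample_posts)
print(counts)
print(comments)
```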

Calculating Average Number of Comments for "Ask HN" Posts by Hour

In [14]:
avg_by_hour = []
for key in counts_by_hour:
    avg_by_hour.append([key, comments_by_hour[key] / counts_by_hour[key]])
print(avg_by_hour)
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]

Sorted List Ordered by Highest Value

In [24]:
swap_avg_by_hour = []
for i in avg_by_hour:
    key = i[0]
    value = i[1]
    swap_avg_by_hour.append([value, key])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("")
print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:
    datetime_type = dt.datetime.strptime(hour, "%H")
    datetime_str = dt.datetime.strftime(datetime_type, "%H:%M")
    result = f"{datetime_str}: {avg:.2f} average comments per post"
    print(result)
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
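The swap-and-sort step above exists only so that sorted compares the averages first. Passing a key function gives the same ordering without building a second list. The sketch below uses a four-hour subset of the averages printed earlier, so the ranking covers only those four hours.

```python
# Subset of the [hour, average] pairs computed above
avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort by the average (second element), highest first
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One difference worth knowing: with a key function, ties between equal averages keep their original order, whereas the swapped list falls back to comparing the hour strings.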

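If you decide to create a new post on Hacker News, my analysis suggests posting at 15:00. Note that the time zone is Eastern Time in the US (GMT-4 while daylight saving time is in effect, GMT-5 otherwise). Let's convert it to the Turkey time zone (GMT+3), using the 7-hour difference that applies during daylight saving time.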

In [30]:
## Convert time to Turkey time zone (GMT+3)
print('Top 5 Hours for Ask Posts Comments (Turkey time zone (GMT+3))')
for row in sorted_swap[:5]:
    datetime_str = (dt.datetime.strptime(row[1], "%H") + dt.timedelta(hours=7)).strftime("%H:%M")
    output = "{hour}: {avg:.2f} average comments per post".format(hour = datetime_str, avg = row[0])
    print(output)
Top 5 Hours for Ask Posts Comments (Turkey time zone (GMT+3))
22:00: 38.59 average comments per post
09:00: 23.81 average comments per post
03:00: 21.52 average comments per post
23:00: 16.80 average comments per post
04:00: 16.01 average comments per post
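The fixed 7-hour shift above can also be written with timezone-aware datetimes, which makes the two offsets explicit. This is a sketch under stated assumptions: convert_hour is a hypothetical helper, the offsets are fixed numbers rather than real time zone rules, and the date inside the function is an arbitrary placeholder because only the hour matters.

```python
import datetime as dt

def convert_hour(hour, src_utc_offset, dst_utc_offset):
    """Convert a zero-padded hour string between two fixed UTC offsets."""
    src_tz = dt.timezone(dt.timedelta(hours=src_utc_offset))
    dst_tz = dt.timezone(dt.timedelta(hours=dst_utc_offset))
    # Arbitrary placeholder date; only the hour of day is of interest
    moment = dt.datetime(2016, 1, 1, int(hour), 0, tzinfo=src_tz)
    return moment.astimezone(dst_tz).strftime("%H:%M")

# Eastern Time with daylight saving (GMT-4) to Turkey (GMT+3)
print(convert_hour("15", -4, 3))
# Eastern Time without daylight saving (GMT-5) to Turkey (GMT+3)
print(convert_hour("15", -5, 3))
```

The second call shows why the offset matters: when Eastern Time is at GMT-5, the same 15:00 post lands one hour later in Turkey.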

Conclusion

In this data set, "Ask HN" posts created between 22:00 and 23:00 Turkey time (15:00 Eastern Time) received the most comments on average, 38.59 per post, so posting in that hour may improve your chances of getting comments. We also found that "Ask HN" posts attract more comments on average than "Show HN" posts, and we identified the five hours with the highest average number of comments. Keep in mind, however, that the data set excludes posts without any comments; this exclusion reduced it from about 300,000 rows to about 20,000.