Exploring Hacker News Posts

Introduction

Hacker News is a site (started by a startup incubator Y Combinator) where user submitted stories('posts') receive votes and comments. As Hacker News is popular among technology and startup circles, top posts get hundreds of thousands of visitors. Our interests are in posts whose title begin with Ask HN (posts to ask Hacker News community a specific question) and Show HN (posts to show Hacker News community a project, product or generally something interesting).

In this project we will be determining the following questions:

  • Do Ask HN or Show HN receive more comments on an average?
  • Do posts created at a certain time receive more comments on an average?

Link to the original data is here Link. The data we are working with has been reduced in size from 300,000 rows to 20,000 rows by removing all submissions that did not receive any comments.

Opening, reading and viewing the data

In [1]:
#import the csv module and open, read the file
import csv
open_file = open('hacker_news.csv')
read_file = csv.reader(open_file)
hn = list(read_file)

#iterate through the data & print first five rows
for i in hn[:5]:
    print(i, sep=' ')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

The dataset contains: author of the post, title of the post, url links (if any), num_points (number of points the post acquired: No. of upvotes - No. of downvotes), num_comments (number of comments made on the post), created_at (the date and time at which the post was created).

Removing Headers from a List of Lists for Ease of Analysis

In here we are going to remove the header row and keep just the data body.

In [2]:
headers = hn[0] #select the header row
hn = hn[1:] #slice the header row from the rest
print(headers) #print the header row
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
In [3]:
#loop through the 'hn data' and print the first five rows
for i in hn[:5]:
    print(i, sep=' ')
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Extracting ASK HN and Show HN Posts for the Analysis

In here we will create three empty lists one for ask posts, one for show posts and one for other posts. We will iterate over each row and append the appropriate rows (by identifying the matching title) to the above lists.

In [4]:
#create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

#loop through the data & select the appropriate rows
for i, row in enumerate(hn):
    title = row[1]         #select the title
    title = title.lower()  #convert the title to lower-case
    title = str(title)     #convert all titles to string type

    
    #append the lists by appropriate rows by 
    #identifying the title using conditional statements
    if title.startswith('ask hn'):
        ask_posts.append(hn[i])
    elif title.startswith('show hn'):
        show_posts.append(hn[i])
    else:
        other_posts.append(hn[i])

We will now check the number of posts (rows) in each list with len() built-in function

In [5]:
print("Ask Posts: ", len(ask_posts))
print("Show Posts: ", len(show_posts))
print("Other Posts: ", len(other_posts))
# print(ask_posts[:5])
# print(show_posts[:5])
Ask Posts:  1744
Show Posts:  1162
Other Posts:  17194

Comparing the Average Number of Comments for Ask HN and Show HN Posts

Below we are going to calculate the average number of ask posts and show posts inorder to understand which one is greater

In [6]:
#assign 0 to variables, total_ask_comments & count_ask_comments
total_ask_comments = 0
count_ask_comments = 0

#loop through the ask_posts & select the column 5(no.of comments)
for row in ask_posts:
    num_comments = int(row[4])   
    total_ask_comments += num_comments  #add the value of comments
    count_ask_comments += 1             #add the no.of comments

    #calculate the average no. of comments by deviding the total by count
    avg_ask_comments = total_ask_comments / count_ask_comments


print('Average no. of ask comments:', round(avg_ask_comments, 3))
Average no. of ask comments: 14.038

Below we will follow the same steps for calculating the average number of show comments as that of ask comments

In [7]:
#assign 0 to variables, total_show_comments & count_show_comments
total_show_comments = 0
count_show_comments = 0

#loop through the show_posts & select the column 5(no.of comments)
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    count_show_comments += 1
    
    #calculating the average comments
    avg_show_comments = total_show_comments / count_show_comments

print('Average no. of show comments:', round(avg_show_comments, 3))
Average no. of show comments: 10.317

Answer to the question:

Do Ask HN or Show HN receive more comments on an average? Average ask comments(avg_ask_comments) is ~14 and Average show comments(avg_show_comments) is ~10. That means Ask HN recieves *~4* more comments than Show HN on an average`.

Finding the Number of Ask Posts and Comments by Hour Created

Further we will determine if the number of ask posts created at a certain time of the day are more likely to attract more comments. We are going to follow two steps in order to achieve this:

  1. Calculate the number of ask posts created each hour of the day, along with the number of comments received.

  2. Second, calculate the average number of comments ask posts receive by hour created.

1. Calculating the Number of Ask Posts Created Each Hour of the Day

In [8]:
# import datetime module
import datetime as dt
result_list = []   #create an empty list

#loop through the ask post, select created_at(col 5) & comments 
#columns(col 3) and append them to result_list
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
In [9]:
#create two empty dictionaries one for counts and another 
#for comments
counts_by_hour = {}
comments_by_hour = {}

#loop through the result_list and select date & comments column
for row in result_list:
    date = row[0]
    num_comment = row[1]

    #convert the date column to datetime object and exctract
    #just the hour
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    
    #using conditional statements check if hour is not 
    #a key in counts_by_hour, then set value to 1 & comments by 
    #hour to comment number
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comment
    
    #if hour is a key in counts_by_hour, increment the value
    #by 1 and increment the value in comments_by_hour
    #by comment number
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comment
        
# print(counts_by_hour)
# print(comments_by_hour)

2. Calculating the Average Number of Comments for Ask HN Posts by Hour

In [10]:
#create an empty list for adding average hour
avg_by_hour = []

#loop through the dict counts_by_hour and append the above list by
#calculating average comments by hour and print it
for hour in counts_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/ counts_by_hour[hour], 2)])

for i in avg_by_hour:
    print(i, sep=' ')
['09', 5.59]
['13', 14.91]
['10', 13.23]
['14', 13.14]
['16', 16.8]
['23', 7.88]
['12', 9.34]
['17', 11.36]
['15', 38.27]
['21', 15.9]
['20', 21.28]
['02', 23.46]
['18', 13.1]
['03', 7.67]
['05', 10.49]
['19', 10.73]
['01', 11.74]
['22', 6.68]
['08', 10.14]
['04', 7.08]
['00', 8.16]
['06', 8.84]
['07', 7.69]
['11', 10.9]

Sorting and Printing Values for the Better Understanding of the Above Analysis

avg_by_hour list makes it difficult to identify the hours with the highest values. So we will sort the list and choose the five highest values in a format which is easier to read.

In [11]:
#create an empty list for swapping
swap_avg_by_hour = []

#loop through the avg_by_hour list & append the 
#swap_avg_by_hour list after swapping the rows
for x in avg_by_hour:
    swap_list = [x[1], x[0]] 
    swap_avg_by_hour.append(swap_list)
    
#print the list
for i in swap_avg_by_hour:
    print(i, sep=' ')
[5.59, '09']
[14.91, '13']
[13.23, '10']
[13.14, '14']
[16.8, '16']
[7.88, '23']
[9.34, '12']
[11.36, '17']
[38.27, '15']
[15.9, '21']
[21.28, '20']
[23.46, '02']
[13.1, '18']
[7.67, '03']
[10.49, '05']
[10.73, '19']
[11.74, '01']
[6.68, '22']
[10.14, '08']
[7.08, '04']
[8.16, '00']
[8.84, '06']
[7.69, '07']
[10.9, '11']
In [12]:
#swap the list using sorted function in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

#print the first five rows
for i in sorted_swap[:5]:
    print(i, sep=' ')
[38.27, '15']
[23.46, '02']
[21.28, '20']
[16.8, '16']
[15.9, '21']

Stacking Time and Average Comments per Post for Five Top Comments

In [13]:
#loop through the five rows of sorted_swap list &
#select the appropriate rows
for row in sorted_swap[:5]:
    avg_com_per_post = row[0]
    time = row[1]
    
    #first use the strptime() constructor to convert time to datetime
    #object, then use strftime() to specify the format
    time = time+":"+"00"
    time = dt.datetime.strptime(time, "%H:%M")
    time = time.strftime("%H:%M")   
    
    #format the average comments to indicate only two decimal places
    template = "{hour}: {acpp:.2f} average comments per post." 
    output = template.format(hour=time, acpp=avg_com_per_post)
    print(output)
15:00: 38.27 average comments per post.
02:00: 23.46 average comments per post.
20:00: 21.28 average comments per post.
16:00: 16.80 average comments per post.
21:00: 15.90 average comments per post.

Answer to the question:

Do posts created at a certain time receive more comments on an average?

The five best hours to create an ask post to receive maximum comments are:-

1. 15:00 (3 PM)

2. 02:00 (2 AM)

3. 20:00 (8 PM)

4. 16:00 (4 PM)

5. 21:00 (9 PM)

Converting the Times to IST (Indian Standard Time)

In [14]:
#import pytz & datetime modules
import pytz
from datetime import datetime

#define est & ist
est = pytz.timezone('US/Eastern')
ist = pytz.timezone('Asia/Kolkata')

#convert the times to est
t1 = datetime(2020,7,4,15,0,tzinfo=est)
t2 = datetime(2020,7,4,2,0,tzinfo=est)
t3 = datetime(2020,7,4,20,0,tzinfo=est)
t4 = datetime(2020,7,12,16,0,tzinfo=est)
t5 = datetime(2020,7,12,21,0,tzinfo=est)

#convert the above times to ist & format them
t1_ist = t1.astimezone(ist).strftime("%H:%M")
t2_ist = t2.astimezone(ist).strftime("%H:%M")
t3_ist = t3.astimezone(ist).strftime("%H:%M")
t4_ist = t4.astimezone(ist).strftime("%H:%M")
t5_ist = t5.astimezone(ist).strftime("%H:%M")

#print the times in ist
print(t1_ist)
print(t2_ist)
print(t3_ist)
print(t4_ist)
print(t5_ist)
01:26
12:26
06:26
02:26
07:26

The best time to receive maximum comments in IST are:-

1. 1.30 AM

2. 12.30 PM

3. 6.30 AM

4. 2.30 AM

5. 7.30 AM

Further We Will Perform Some More Exploration on the Data in Order to Understand the Similarity/Difference Between Different Posts

Calculating the Number of Show Posts Created Each Hour of the Day

Here we will follow the same proceedure as above (used for ask posts) for calculating the number of show posts created each hour of the day.

In [15]:
#import the datetime module
import datetime as dt

#create an empty list
show_result_list = []

#loop through the show_posts and append the 
#appropriate rows
for row in show_posts:
    created_at = row[6]
    num_comments = int(row[4])
    show_result_list.append([created_at, num_comments])
    
#create two empty dictionaries
show_counts_by_hour = {}
show_comments_by_hour = {}

#loop through the show_result_list
for row in show_result_list:
    date = row[0]
    num_comments = row[1]
    
    #extract hour from the date list 
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    
    #using conditional statements, assign appropriate
    #key and value
    if hour not in show_counts_by_hour:
        show_counts_by_hour[hour] =1
        show_comments_by_hour[hour] = num_comments
        
        if hour in show_counts_by_hour:
            show_counts_by_hour[hour] += 1
            show_comments_by_hour[hour] += num_comments
            
# print(other_comments_by_hour)
# print(other_counts_by_hour)

Calculating the Average Number of Comments for Show Posts by Hour

In [16]:
#create an empty list
show_avg_by_hour = []

#loop through the show_counts_by_hour and append the
#avg_by_hour
for hour in show_counts_by_hour:
    show_avg_by_hour.append([hour, show_comments_by_hour[hour]/show_counts_by_hour[hour]])
    
# for i in other_avg_by_hour:
#     print(i, sep=' ')

Sorting Values for Better Understanding

In [17]:
#create an empty list
show_swap_avg_by_hour = []

#loop through the show_avg_by_hour
for x in show_avg_by_hour:
    swap_list = [x[1], x[0]]
    show_swap_avg_by_hour.append(swap_list)
    
#sort the swap_list
show_sorted_swap = sorted(show_swap_avg_by_hour, reverse=True)

for i in show_sorted_swap[:5]:
    print(i, sep=' ')
[102.0, '22']
[22.0, '14']
[9.0, '20']
[5.0, '01']
[4.0, '15']

The five best hours to create a show post to receive maximum comments are:-

1) 22:00 (10 PM)

2) 14:00 (2 PM)

3) 20:00 (8 PM)

4) 01:00 (1 AM)

5) 15:00 (3 PM)

We observe that the five best times for getting more comments for both Ask Posts and Show Posts are very close.

Average Number of Points Received by Show Posts and Ask Posts

Calculation of Average Number of Points for Ask Posts

In [18]:
#assign 0 to total points and count of points
total_ask_points = 0
count_ask_points = 0

#loop through the ask posts & select the appropriate row
#add the num_points to total_ask_points & add 1 for each
#num_points and add it to count_ask_points
for row in ask_posts:
    num_points = int(row[3])
    total_ask_points += num_points
    count_ask_points += 1
    
    #calculate the average ask points & print it
    avg_ask_points = total_ask_points/count_ask_points
    
print('average ask points:', round(avg_ask_points, 2))
average ask points: 15.06

Calculation of Average Number of Points for Show Posts

In [19]:
#assign 0 to total points and count of points
total_show_points = 0
count_show_points = 0

#loop through the show posts & select the appropriate row
#add the num_points to total_show_points & add 1 for each
#num_points and add it to count_show_points
for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
    count_show_points += 1
    
    #calculate the average ask points & print it
    avg_show_points = total_show_points/count_show_points
    
print('average show points:', round(avg_show_points, 2))
average show points: 27.56

Above analysis shows that average ask points is ~ 15 and average show points is ~ 27. Clearly Show Posts receive more points than Ask Posts.

Determining if the Posts Created at Certain Time are Likely to Receive More Points

Ask Posts

Calculating the Number of Ask Post Points Created Each Hour of the Day

In [20]:
#import datetime object and create an empty list
import datetime as dt
ask_point_list = []

#iterate through the ask_posts and select aprropriate 
#rows and append the empty list
for row in ask_posts:
    created_at = row[6]
    num_points = int(row[3])
    ask_point_list.append([created_at, num_points])

#create two dictionaries one for counts by hour & the 
#other points by hour
ask_counts_by_hour = {}
points_by_hour = {}

#iterate through the ask_point_list, select date and
#no.of points
for row in ask_point_list:
    date = row[0]
    num_points = row[1]
    
    #parse the date and select just the hour
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    
    #apply the conditional statement & determine key 
    #and value for the dictionary
    if hour not in ask_counts_by_hour:
        ask_counts_by_hour[hour] = 1
        points_by_hour[hour] = num_points
        
    if hour in ask_counts_by_hour:
        ask_counts_by_hour[hour] += 1
        points_by_hour[hour] += num_points
        
# print(ask_counts_by_hour)
# print(points_by_hour)

Calculating Average Number of Points for Ask Posts by Hour

In [21]:
#create an empty list for average points
avg_points_by_hour = []

#iterate through the ask_counts_by_hour & append the list with hour & average points
for hour in ask_counts_by_hour:
    avg_points_by_hour.append([hour, round(points_by_hour[hour]/ask_counts_by_hour[hour], 2)])
    
# print(avg_points_by_hour)

Sorting Values

In [22]:
#create an empty list 
swap_avg_points_by_hour = []

#iterate through the avg_points_by_hour & append the above list
for x in avg_points_by_hour:
    swap_list = [x[1], x[0]]
    swap_avg_points_by_hour.append(swap_list)
    
#sort the swapped list
sorted_swap_avg = sorted(swap_avg_points_by_hour, reverse=True)
for i in sorted_swap_avg[:5]:
    print(i)
[29.74, '15']
[24.3, '13']
[23.39, '16']
[19.24, '17']
[18.38, '10']

The five best hours to create other ask posts to receive maximum points are:-

1) 15:00 (3 PM)

2) 13:00 (1 PM)

3) 16:00 (4 PM)

4) 17:00 (5 PM)

5) 10:00 (10 AM)

Show Posts

Calculating the Number of Show Post Points Created Each Hour of the Day

In [23]:
#import datetime object and create an empty list
import datetime as dt
show_point_list = []

#iterate through the show_posts and select aprropriate 
#rows and append the empty list
for row in show_posts:
    created_at = row[6]
    num_points = int(row[3])
    show_point_list.append([created_at, num_points])

#create two dictionaries one for counts by hour & the 
#other points by hour
sh_counts_by_hour = {}
show_points_by_hour = {}

#iterate through the show_point_list, select date and
#no.of points
for row in show_point_list:
    date = row[0]
    num_points = row[1]
    
    #parse the date and select just the hour
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    
    #apply the conditional statement & determine key 
    #and value for the dictionary
    if hour not in sh_counts_by_hour:
        sh_counts_by_hour[hour] = 1
        show_points_by_hour[hour] = num_points
        
    if hour in sh_counts_by_hour:
        sh_counts_by_hour[hour] += 1
        show_points_by_hour[hour] += num_points
        
# print(sh_counts_by_hour)
# print(show_points_by_hour)

Calculating Average Number of Points for Show Posts by Hour

In [24]:
#create an empty list for average points
avg_show_points_by_hour = []

#iterate through the sh_counts_by_hour & append the list with hour & average points
for hour in sh_counts_by_hour:
    avg_show_points_by_hour.append([hour, round(show_points_by_hour[hour]/sh_counts_by_hour[hour], 2)])
    
# print(avg_show_points_by_hour)

Sorting Values

In [25]:
#create an empty list 
swap_avg_show_points_by_hour = []

#iterate through the avg_show_points_by_hour & append the above list
for x in avg_show_points_by_hour:
    swap_list = [x[1], x[0]]
    swap_avg_show_points_by_hour.append(swap_list)
    
#sort the swapped list
sorted_swap_show_avg = sorted(swap_avg_show_points_by_hour, reverse=True)
for i in sorted_swap_show_avg[:5]:
    print(i)
[55.38, '22']
[41.49, '23']
[41.13, '12']
[36.94, '00']
[35.74, '18']

The five best hours to create show posts to receive maximum points are:-

1) 22:00 (10 PM)

2) 23:00 (11 PM)

3) 12:00 (12 AM)

4) 00:00 (12 PM)

5) 18:00 (6 PM)

We observe that the five best times for receiving maximum ponits for Ask posts and Show posts are very different. These are different than the times for receiving maximum comments as well.

In [26]:
#assign 0 to total_other_comments & count_other_comments
total_other_comments = 0
count_other_comments = 0

#loop through the other_posts & select the column 5(no.of comments)
for row in other_posts:
    num_comments = int(row[4])   #convert it to an integer
    total_other_comments += num_comments  #add the value of comments
    count_other_comments += 1             #add the no.of comments

    #calculate the average no. of comments by deviding the total by count
    avg_other_comments = total_other_comments / count_other_comments
print('Average no. of other comments:', round(avg_other_comments, 2))
Average no. of other comments: 26.87

Average Number of Other Comments is ~ 26. This is more than both Average Number of Ask Comments (~ 14) and Average Number of Show Comments (~ 10).

In [27]:
#assign 0 to total points and count of points
total_other_points = 0
count_other_points = 0

#loop through the other posts & select the appropriate row
#add the num_points to total_other_points & add 1 for each
#num_points and add it to count_other_points
for row in other_posts:
    num_points = int(row[3])
    total_other_points += num_points
    count_other_points += 1
    
    #calculate the average other points & print it
    avg_other_points = total_other_points/count_other_points
    
print('average other points:', round(avg_other_points, 2))
average other points: 55.41

Average Number of Other Points is ~ 55. This is more than both Average Number of Ask Points (~ 15) and Average Number of Show Points (~ 27).

Here we will follow the same proceedure (as that of ask posts and show posts) for calculating the number of other posts created each hour of the day.

In [28]:
#import the datetime module
import datetime as dt

#create an empty list
other_result_list = []

#loop through the other_posts and append the 
#appropriate rows
for row in other_posts:
    created_at = row[6]
    num_comments = int(row[4])
    other_result_list.append([created_at, num_comments])
    
#create two empty dictionaries
other_counts_by_hour = {}
other_comments_by_hour = {}

#loop through the show_result_list
for row in other_result_list:
    date = row[0]
    num_comments = row[1]
    
    #extract hour from the date list 
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    
    #using conditional statements, assign appropriate
    #key and value
    if hour not in other_counts_by_hour:
        other_counts_by_hour[hour] =1
        other_comments_by_hour[hour] = num_comments
        
        if hour in other_counts_by_hour:
            other_counts_by_hour[hour] += 1
            other_comments_by_hour[hour] += num_comments
            
# print(other_comments_by_hour)
# print(other_counts_by_hour)
In [29]:
#create an empty list
other_avg_by_hour = []

#loop through the other_counts_by_hour and append the
#avg_by_hour
for hour in other_counts_by_hour:
    other_avg_by_hour.append([hour, round(other_comments_by_hour[hour]/other_counts_by_hour[hour],2)])
    
# for i in other_avg_by_hour:
#     print(i, sep=' ')

Sorting Values for Better Understanding

In [30]:
#create an empty list
other_swap_avg_by_hour = []

#loop through the other_avg_by_hour
for x in other_avg_by_hour:
    swap_list = [x[1], x[0]]
    other_swap_avg_by_hour.append(swap_list)
    
#sort the swap_list
other_sorted_swap = sorted(other_swap_avg_by_hour, reverse=True)

for i in other_sorted_swap[:5]:
    print(i, sep=' ')
[213.0, '02']
[112.0, '21']
[112.0, '03']
[106.0, '17']
[68.0, '07']

The five best hours to create other post to receive maximum comments are:-

1) 02:00 (2 AM)

2) 21:00 (9 PM)

3) 03:00 (3 AM)

4) 17:00 (5 PM)

5) 07:00 (7 AM)

We observe that the five best times for getting maximum comments for Other Posts is somewhat similar to both Ask Posts and Show Posts best times for receiving maximum comments.

Calculating the Number of Other Post Ponits Created Each Hour of the Day

In [31]:
#import datetime object and create an empty list
import datetime as dt
other_point_list = []

#iterate through the other_posts and select aprropriate 
#rows and append the empty list
for row in other_posts:
    created_at = row[6]
    num_points = int(row[3])
    other_point_list.append([created_at, num_points])

#create two dictionaries one for counts by hour & the 
#other points by hour
other_counts_by_hour = {}
other_points_by_hour = {}

#iterate through the other_point_list, select date and
#no.of points
for row in other_point_list:
    date = row[0]
    num_points = row[1]
    
    #parse the date and select just the hour
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    
    #apply the conditional statement & determine key 
    #and value for the dictionary
    if hour not in other_counts_by_hour:
        other_counts_by_hour[hour] = 1
        other_points_by_hour[hour] = num_points
        
    if hour in other_counts_by_hour:
        other_counts_by_hour[hour] += 1
        other_points_by_hour[hour] += num_points
        
# print(other_counts_by_hour)
# print(other_points_by_hour)
In [32]:
#create an empty list
avg_other_points_by_hour = []

#iterate through the other_counts_by_hour & append the list with hour & average points
for hour in other_counts_by_hour:
    avg_other_points_by_hour.append([hour, round(other_points_by_hour[hour]/other_counts_by_hour[hour],2)])
    
# print(avg_other_points_by_hour)

Sorting Values

In [33]:
#create an empty list 
swap_avg_other_points_by_hour = []

#iterate through the avg_other_points_by_hour & append the above list
for x in avg_other_points_by_hour:
    swap_list = [x[1], x[0]]
    swap_avg_other_points_by_hour.append(swap_list)
    
#sort the swapped list
sorted_swap_other_avg = sorted(swap_avg_other_points_by_hour, reverse=True)
for i in sorted_swap_other_avg[:5]:
    print(i)
[62.52, '13']
[61.72, '14']
[60.49, '15']
[60.48, '10']
[59.99, '19']

The five best hours to create other posts to receive maximum points are:-

1) 13:00 (1 PM)

2) 14:00 (2 PM)

3) 15:00 (3 PM)

4) 10:00 (10 AM)

5) 19:00 (7 PM)

Conclusion

We performed very eloborate analysis on the Hacker News Posts dataset. We specifically analyzed Ask and Show posts trying to understand what kind of posts get more comments and what are the best times to post to get responses. We also analyzed other posts (non-ask and non-show posts) and compared them with Ask and Show posts. We make the following observations.

1) We found that the number of show posts (~ 1164) and ask posts (~ 1744) together are only ~ 20% of other posts (~ 17194).

2) Ask posts gather more comments (~14) than show posts (~10). Whereas show posts gather more points (~27) than ask posts (~15). However the average number of comments and points are higher for other posts as compared to these two (~26 & ~55) respectively.

3) We find that regardless of the type of post (ask, show or other) some hours of the day get more comments than others. We list these as

  • Afrenoon :- from 2 PM to 5 PM
  • Late Evening :- from 8 PM to 10 PM
  • Late Night :- from 1 AM to 3 AM

4) In contrast to the above, the best times to post to get highest points for all three posts are different.

  • Ask Posts :- Afternoon - from 1 PM till 5 PM
  • Show Posts :- Night - from 10 PM till 12 PM
  • Other Posts :- Afternoon - from 1 PM till 3 PM

Results of our analysis can be used to guide users to decide what kind and time of post they should post on Hacker News in order to get maximum comments and points.