Guided Project: Exploring Hacker News Posts

In this project, we are going to work with a dataset from Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The original dataset is available on Kaggle.
In this project, however, we use a modified version of it. In the modified dataset, the number of rows was reduced from almost 300,000 to approximately 20,000 by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

Here are the first few rows of the data set:

id title url
12224879 Interactive Dynamic Video http://www.interactivedynamicvideo.com/
10975351 How to Use Open Source and Shut the F*ck Up at the Same Time http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/
11964716 Florida DJs May Face Felony for April Fools' Water Joke http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/
11919867 Technology ventures: From Idea to Enterprise https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429
10301696 Note by Note: The Making of Steinway L1037 (2007) http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0



We're specifically interested in posts whose titles begin with either Ask HN or Show HN.
Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:


Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, a product, or just something interesting. Below are a few examples:


Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm

The purpose of this project is to answer the following questions:

(1) Which kind of posts receive the most comments?
(2) What is the best time to create posts that gather the most comments?
(3) Which kind of posts receive the most points?
(4) What is the best time to create posts that gather the most points?

1. Opening The Data: hacker_news.csv

As a start, we will do the following:

(1) Read the hacker_news.csv file in as a list of lists.
(2) Assign the result to the variable hn.

In [1]:
opened_file = open('hacker_news.csv') #(1)

from csv import reader
read_file = reader(opened_file)

hn = list(read_file) #(2)
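
The cell above leaves the file handle open. As a minimal sketch of an alternative, the same read can be wrapped in a with block so the file is closed automatically. The helper name read_csv_rows and the tiny temporary file are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile

def read_csv_rows(path):
    """Return every row of the CSV file at `path` as a list of lists of strings."""
    # The with block closes the file even if reading fails part-way.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

# Tiny stand-in file so the sketch runs without hacker_news.csv.
sample_text = 'id,title\n1,"Ask HN: Hello, world?"\n2,Show HN: Demo\n'
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

sample_rows = read_csv_rows(sample_path)
os.remove(sample_path)
print(sample_rows)
```

With the real data, the call would presumably be hn = read_csv_rows('hacker_news.csv').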

1.1. Header and columns

Next, after opening the data we want to:

  1. Extract the first row of data, and assign it to the variable hn_header.
  2. Display the header.
  3. Describe the columns.
In [2]:
hn_header = hn[0] #(1)
print('header: ' + str(hn_header)) #(2)

#(3) column description below
header: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Column description:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
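
The rest of the project refers to columns by position (for example index 4 for num_comments). One possible way to avoid memorizing those positions is to build a name-to-index lookup from the header. This is only a sketch; the name column_index is an assumption and is not used elsewhere in the notebook.

```python
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
first_row = ['12224879', 'Interactive Dynamic Video',
             'http://www.interactivedynamicvideo.com/',
             '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position in a row.
column_index = {name: position for position, name in enumerate(header)}

print(column_index['num_points'])          # 3
print(column_index['num_comments'])        # 4
print(first_row[column_index['author']])   # ne0phyte
```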

1.2. Overview

Next, we want to get an overview of the dataset that we are working on.
To do that, we will do the following:

(1) Separate the header from the rest of the file
(2) Display the first four rows of hn with the header removed
(3) Display the length of the file

In [3]:
hn = hn[1:] #(1)

print("The first 4 rows are:") #(2)
hn[:4]  # the slice end is exclusive, so this returns rows 0-3; use hn[:5] for five rows
The first 4 rows are:
Out[3]:
[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
In [4]:
print('length: ' + str(len(hn)) + ' rows') #(3)
length: 20100 rows

2. Filtering The Data


Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

Note: To find the posts that begin with either Ask HN or Show HN, we'll use the string method, startswith(  )
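
A short illustration of why the title is lowercased before the check: startswith(  ) is case-sensitive. The title below is made up for the example. The last line shows that startswith(  ) also accepts a tuple of prefixes, which is not used in this project but can be handy.

```python
title = 'ASK HN: Is anyone there?'  # made-up title, not from the dataset

case_sensitive = title.startswith('Ask HN')
case_insensitive = title.lower().startswith('ask hn')
either_prefix = title.lower().startswith(('ask hn', 'show hn'))

print(case_sensitive)    # False
print(case_insensitive)  # True
print(either_prefix)     # True
```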


We will do the following:
(1) Create three empty lists called ask_posts, show_posts, and other_posts.
(2) Loop through each row in hn.
   - Assign the title in each row to a variable named title (element index 1)
   - If the lowercase version of title starts with ask hn, append the row to ask_posts.
   - Else if the lowercase version of title starts with show hn, append the row to show_posts.
   - Else append to other_posts.
(3) Print the first four rows in ask_posts, show_posts, and other_posts.
(4) Check the number of posts in each list

In [5]:
#(1)
ask_posts = [] #Ask HN
show_posts = [] #Show HN
other_posts = []
In [6]:
for row in hn: #(2)
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
In [7]:
print('The first four rows in each list:')  #(3)
print('\n')
print('--ask_posts--')
print(ask_posts[:4])
print('\n')
print('--show_posts--')
print(show_posts[:4])
print('\n')
print('--other_posts--')
print(other_posts[:4])
The first four rows in each list:


--ask_posts--
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']]


--show_posts--
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']]


--other_posts--
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
In [8]:
print('Checking the number of posts:')  #(4)
print('\n')
print('length of ask_posts: ' + str(len(ask_posts)) + ' posts')
print('length of show_posts: ' + str(len(show_posts)) + ' posts')
print('length of other_posts: ' + str(len(other_posts)) + ' posts')
Checking the number of posts:


length of ask_posts: 1744 posts
length of show_posts: 1162 posts
length of other_posts: 17194 posts

3. Exploring num_comments


3.1. explore(  )

To explore the num_comments (index 4) or num_points (index 3) column in each list, we create a function, explore(  ).
This function takes list_name and variable_name as arguments, and does the following:

(1) Create two variables, total_variable and row_variable, and set both to 0
   - total_variable accumulates the sum of comments or points in a list
   - row_variable holds the column index for comments (4) or points (3), chosen from variable_name
(2) Loop through each row in list_name.
   - Convert the element at index 4 (num_comments) or index 3 (num_points) to an int
   - Assign it to num_variable and add it to total_variable
(3) Compute the average by dividing total_variable by the length of the list
   - Assign it to avg_variable
(4) Print avg_variable rounded to two decimals

Using explore(  ), we explore the num_comments column in ask_posts and show_posts.

In [9]:
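def explore(list_name, variable_name):
    total_variable = 0 #(1)
    row_variable = 0

    if variable_name == 'comments':
        row_variable = 4   # num_comments column
    elif variable_name == 'points':
        row_variable = 3   # num_points column
    else:
        # without this check, an unknown name would silently average the id column (index 0)
        raise ValueError("variable_name must be 'comments' or 'points'")

    for row in list_name: #(2)
        num_variable = int(row[row_variable])
        total_variable += num_variable
    avg_variable = total_variable / len(list_name) #(3)

    print(round(avg_variable, 2)) #(4)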
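
As a hedged variation on the same idea, the average can be returned instead of printed, which makes it easier to reuse or compare. The names average_column and COLUMN_INDEX are assumptions for this sketch; the sample rows are the first four ask_posts rows shown earlier.

```python
COLUMN_INDEX = {'points': 3, 'comments': 4}  # positions of num_points and num_comments

def average_column(rows, variable_name):
    """Return the mean of the chosen column; raises KeyError for an unknown name."""
    index = COLUMN_INDEX[variable_name]
    return sum(int(row[index]) for row in rows) / len(rows)

sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
]

avg_comments = average_column(sample_rows, 'comments')
avg_points = average_column(sample_rows, 'points')
print(round(avg_comments, 2))  # 9.75
print(round(avg_points, 2))    # 8.0
```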
In [10]:
print('Average comments on ask_posts:') #calling explore( ) 
explore(ask_posts, 'comments')           #to explore the num_comments column
print('\n')                               #in ask_posts, and show_posts
print('Average comments on show_posts:')
explore(show_posts, 'comments')
Average comments on ask_posts:
14.04


Average comments on show_posts:
10.32

Using explore(  ), we found the following:

  • ask_posts receive more comments on average, with 14.04 comments per post
  • show_posts receive fewer comments on average, with 10.32 comments per post

Since ask_posts receive more comments on average, Ask HN posts appear more likely to draw comments from users. Hence, we will focus on ask_posts when exploring the num_comments column.

This answers our first question: Which kind of posts receive the most comments?
Answer: Posts that start with Ask HN (ask_posts)

Finding the optimal time to create a post in ask_posts in order to get the most comments

We want to determine if there is an optimal time to create a post in ask_posts that can get us the most comments.
We will do the following:

(1) Calculate the number of posts created in ask_posts in each hour of the day (posts per hour).
(2) Calculate the average number of comments per post in ask_posts for each hour (average comments per post, by hour).

Note: To calculate the number of posts created in ask_posts in each hour of the day, we will use the datetime module to work with the data in the created_at column.

3.2 hourly_average(  )

We will do both of these by creating a function, hourly_average(  ).
This function takes list_name and variable_name as arguments, and does the following:

(1)   Create a variable, row_variable, and set it to 0
    - row_variable holds the column index for comments (4) or points (3), chosen from variable_name
(2)   Create two empty dictionaries called posts_per_hour and variable_per_hour
(3)   Create a string to store the date format, and assign it to date_format
(4)   Loop through each row in list_name
(5)   Extract the hour from the date
    - Use the datetime.strptime() method to parse the date and create a datetime object.
    - Use the datetime.strftime() method with "%I %p" to keep just the hour, as a 12-hour label such as 03 PM.
(6)   If the hour isn't a key in posts_per_hour:
    - Create the key in posts_per_hour and set it equal to 1
    - Create the key in variable_per_hour and set it equal to the number of comments or points (num_variable)
      Otherwise:
    - Increment the key in posts_per_hour by 1 and the key in variable_per_hour by num_variable
(7)   Print posts_per_hour and variable_per_hour
(8)   Create an empty list called avg_by_hour
(9)   Loop through posts_per_hour and, for each hour, append [average per post, hour] to avg_by_hour
(10) Sort avg_by_hour using sorted(  ), then print it
    - Set reverse=True to sort in descending order
    - Since the first element of each entry is the average, sorting the list sorts by the average.
(11) Return avg_by_hour
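
The grouping logic in steps (2) to (10) can also be sketched more compactly with collections.defaultdict, which removes the need for the "is the key already there?" check. This is an illustrative variant, not the function used in this notebook; the name average_by_hour and the three rows below are made up.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(rows, value_index=4, date_index=6):
    """Return [[average, hour_label], ...] sorted from highest to lowest average."""
    counts = defaultdict(int)   # posts per hour
    totals = defaultdict(int)   # comments (or points) per hour
    for row in rows:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%I %p")
        counts[hour] += 1
        totals[hour] += int(row[value_index])
    averages = [[round(totals[h] / counts[h], 2), h] for h in counts]
    return sorted(averages, reverse=True)

# Made-up rows in the same column layout as the dataset.
made_up_rows = [
    ['1', 'Ask HN: a', '', '1', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: b', '', '1', '4', 'user_b', '8/17/2016 9:10'],
    ['3', 'Ask HN: c', '', '1', '12', 'user_c', '5/2/2016 15:14'],
]

result = average_by_hour(made_up_rows)
print(result)  # [[12.0, '03 PM'], [5.0, '09 AM']]
```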

In [11]:
import datetime as dt

def hourly_average(list_name, variable_name):
#(1)  
    row_variable = 0                   
    if variable_name == 'comments':
        row_variable = 4
    if variable_name == 'points':
        row_variable = 3
        
#(2)       
    posts_per_hour = {} #counting number of post per hour              
    variable_per_hour = {} #counting number of comments per hour
    
#(3)   
    date_format = "%m/%d/%Y %H:%M"      
    
#(4)  
    for row in list_name: #(4)
        num_variable = int(row[row_variable])
        datetime_str = row[6]
#(5)          
        datetime_dt = dt.datetime.strptime(datetime_str, date_format) #parse the date and create a datetime object.
        hour_str = datetime_dt.strftime("%I %p")

#(6)  
        if hour_str not in posts_per_hour:
            posts_per_hour[hour_str] = 1
            variable_per_hour[hour_str] = num_variable
        else:
            posts_per_hour[hour_str] += 1
            variable_per_hour[hour_str] += num_variable

#(7)  
    print('posts_per_hour')
    print(posts_per_hour)
    print('\n')
    
    if variable_name == 'comments':
        print('comments_per_hour')
    if variable_name == 'points':
        print('points_per_hour')
    
    print(variable_per_hour) 

#(8)  
    avg_by_hour = []
    
#(9)  
    for hour in posts_per_hour:
        avg_by_hour.append([ round(variable_per_hour[hour] / posts_per_hour[hour],2), hour ])
    
#(10)  
    avg_by_hour = sorted(avg_by_hour, reverse = True)
    print('\n')
    
    print('avg_by_hour (sorted in descending order)')

    print(avg_by_hour)

#(11)  
    return avg_by_hour
In [12]:
avg_hr_comments = hourly_average(ask_posts, 'comments') #function call
posts_per_hour
{'09 AM': 45, '01 PM': 85, '10 AM': 59, '02 PM': 107, '04 PM': 108, '11 PM': 68, '12 PM': 73, '05 PM': 100, '03 PM': 116, '09 PM': 109, '08 PM': 80, '02 AM': 58, '06 PM': 109, '03 AM': 54, '05 AM': 46, '07 PM': 110, '01 AM': 60, '10 PM': 71, '08 AM': 48, '04 AM': 47, '12 AM': 55, '06 AM': 44, '07 AM': 34, '11 AM': 58}


comments_per_hour
{'09 AM': 251, '01 PM': 1253, '10 AM': 793, '02 PM': 1416, '04 PM': 1814, '11 PM': 543, '12 PM': 687, '05 PM': 1146, '03 PM': 4477, '09 PM': 1745, '08 PM': 1722, '02 AM': 1381, '06 PM': 1439, '03 AM': 421, '05 AM': 464, '07 PM': 1188, '01 AM': 683, '10 PM': 479, '08 AM': 492, '04 AM': 337, '12 AM': 447, '06 AM': 397, '07 AM': 267, '11 AM': 641}


avg_by_hour (sorted in descending order)
[[38.59, '03 PM'], [23.81, '02 AM'], [21.52, '08 PM'], [16.8, '04 PM'], [16.01, '09 PM'], [14.74, '01 PM'], [13.44, '10 AM'], [13.23, '02 PM'], [13.2, '06 PM'], [11.46, '05 PM'], [11.38, '01 AM'], [11.05, '11 AM'], [10.8, '07 PM'], [10.25, '08 AM'], [10.09, '05 AM'], [9.41, '12 PM'], [9.02, '06 AM'], [8.13, '12 AM'], [7.99, '11 PM'], [7.85, '07 AM'], [7.8, '03 AM'], [7.17, '04 AM'], [6.75, '10 PM'], [5.58, '09 AM']]

3.3. print_hourly_avg(  )

After getting the data we need, it is time to display our findings.

We want to print the top 5 hours for posting in order to get the most comments.
To do that, we will create a function called print_hourly_avg(  ).
This function takes list_name as an argument and does the following:

(1) Loop through the first five elements of list_name (since we want to print the top 5 hours)
(2) Use format(  ) to build a formatted string
   - We use {:.2f} to print the average comments per post so that it shows only two decimals
(3) Print the output

In [13]:
def print_hourly_avg(list_name):
    
    print('==============================================================================') 
    
    for variable_per_post, hour in list_name[:5]: #(1)
        output = "{h}: {cp:.2f} per post".format(h = hour, cp = variable_per_post) #(2)
        print(output)     

3.4. convert_pt(  )

Note that the timezone used in the dataset is Eastern Time (ET). I currently live in California, which uses Pacific Time (PT).
So we are going to convert the hours.
To do that, we are going to create a function, convert_pt(  ).
This function takes list_name as an argument and does the following:

(1) Import the pytz module (done once, outside the function)
(2) Assign timezones
   - "US/Eastern" timezone to et_timezone
   - "America/Los_Angeles" timezone to pt_timezone
(3) Loop through each row in list_name
(4) Extract the hour from the row, then convert it
   - Use the datetime.strptime() method to parse the hour label (row[1]) and create a datetime object.
   - Localize the datetime using et_timezone and assign it to et
   - Convert et to Pacific Time using astimezone(  )
   - Use the datetime.strftime() method to format the result as an hour label.
   - Assign the converted hour back to the original row (the list is modified in place)
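
For intuition, the conversion amounts to shifting each hour label back by three hours. The sketch below does that with the standard library only, under the stated assumption that Eastern and Pacific Time are three hours apart (both zones follow the same US daylight-saving calendar, so this holds except briefly around the switch). The helper name et_hour_to_pt and the anchor date are assumptions for illustration; the notebook itself uses pytz.

```python
import datetime as dt

def et_hour_to_pt(hour_label):
    """Shift a 12-hour label such as '03 PM' from Eastern to Pacific Time."""
    # Anchor the label to an arbitrary date so the subtraction can cross midnight safely.
    parsed = dt.datetime.strptime("2016-01-02 " + hour_label, "%Y-%m-%d %I %p")
    shifted = parsed - dt.timedelta(hours=3)  # assumption: PT = ET - 3 hours
    return shifted.strftime("%I %p")

print(et_hour_to_pt("03 PM"))  # 12 PM
print(et_hour_to_pt("02 AM"))  # 11 PM
```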

In [14]:
import pytz #(1)

def convert_pt(list_name):
    et_timezone = pytz.timezone("US/Eastern") #(2)
    pt_timezone = pytz.timezone("America/Los_Angeles")
    
    for row in list_name: #(3)
        hour = dt.datetime.strptime(row[1] ,"%I %p") #(4)
        et = et_timezone.localize(hour)
        pt = et.astimezone(pt_timezone)
        pt_string = pt.strftime("%I %p")
        row[1] = pt_string

We just created 2 functions:
(i). print_hourly_avg(  )
(ii). convert_pt(  )

Next, we will use both functions to display the top 5 hours to get the most comments in both Eastern Time (ET) and Pacific Time (PT)

In [15]:
#Calling both functions: print_hourly_avg(  ), and convert_pt(  )

print('Top 5 Hours to post on Ask Posts to get the most comments in Eastern Time (ET)')
print_hourly_avg(avg_hr_comments)

convert_pt(avg_hr_comments)

print('\n')
print('Top 5 Hours to post on Ask Posts to get the most comments in Pacific Time (PT)')
print_hourly_avg(avg_hr_comments)
Top 5 Hours to post on Ask Posts to get the most comments in Eastern Time (ET)
==============================================================================
03 PM: 38.59 per post
02 AM: 23.81 per post
08 PM: 21.52 per post
04 PM: 16.80 per post
09 PM: 16.01 per post


Top 5 Hours to post on Ask Posts to get the most comments in Pacific Time (PT)
==============================================================================
12 PM: 38.59 per post
11 PM: 23.81 per post
05 PM: 21.52 per post
01 PM: 16.80 per post
06 PM: 16.01 per post

Finally! We have the top 5 hours for posting that give us the most comments.
However, one question comes to mind:

Do the posts created at those hours really get the most comments because of their timing? Or is some other factor at work, such as the author or the post title?
Since the dataset has columns for both author and post title, let's check.

3.5. check_author(  )

We want to check the top 5 authors with the most comments.
In order to do so, we are going to create a function called check_author(  ).
This function takes list_name and variable_name as arguments and does the following:
(1) Create an empty dictionary, authors
(2) Loop through each row in list_name
(3) If the author isn't a key in authors:
  - Create the key in authors and set it equal to the number of comments (num_variable)
    Otherwise:
  - Increment the key in authors by the number of comments (num_variable)
(4) Create an empty list, authors_list
(5) Loop through authors and append [total, author] for each author to the newly created authors_list
(6) Sort authors_list using sorted(  )
  - Set reverse=True to sort in descending order
(7) Return authors_list
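
The same per-author tally can be sketched with collections.Counter, whose most_common(  ) method already returns entries sorted from highest to lowest total. This is an alternative for illustration only: the name totals_by_author and the rows are made up, and the result holds (author, total) pairs rather than the [total, author] lists described above.

```python
from collections import Counter

def totals_by_author(rows, value_index=4, author_index=5):
    """Return [(author, total), ...] sorted from highest to lowest total."""
    totals = Counter()
    for row in rows:
        totals[row[author_index]] += int(row[value_index])
    return totals.most_common()

# Made-up rows in the same column layout as the dataset.
made_up_rows = [
    ['1', 'Ask HN: a', '', '1', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: b', '', '1', '12', 'user_b', '8/17/2016 9:10'],
    ['3', 'Ask HN: c', '', '1', '4', 'user_a', '5/2/2016 15:14'],
]

ranking = totals_by_author(made_up_rows)
print(ranking)  # [('user_b', 12), ('user_a', 10)]
```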

In [16]:
def check_author(list_name, variable_name): 
    
    authors = {} #(1)
    
    if variable_name == 'comments':
        row_variable = 4
    if variable_name == 'points':
        row_variable = 3
    
    for row in list_name: 
        author = row[5]
        num_variable = int(row[row_variable])
        
        if author not in authors: #(3)
            authors[author] = num_variable
        else:
            authors[author] += num_variable
            
    authors_list = [] #(4)
    
    for author in authors: #(5)
        authors_list.append([authors[author], author])
    
    authors_list = sorted(authors_list, reverse = True) #(6)
    return authors_list #(7)
In [17]:
#calling check_author( )

print('Top 5 Authors with the most comments')
print('====================================')

author_most_comments = check_author(ask_posts, 'comments')
author_most_comments[:5]
Top 5 Authors with the most comments
====================================
Out[17]:
[[3046, 'whoishiring'],
 [868, 'mod50ack'],
 [691, 'boren_ave11'],
 [531, 'schappim'],
 [520, 'sama']]

Now that we have the top 5 authors, we want to check the titles of their posts.

3.6. print_title(  )
We want to check the post titles so we create a function, print_title(  )
This function takes list_name and author_name as arguments and does the following:

(1) Loop through each row in list_name
(2) Assign 'author' and 'title' from the corresponding row index
(3) Print the title that belongs to the author we want to check

In [18]:
def print_title(list_name, author_name):
    for row in list_name: #(1)
        author = row[5]    #(2)
        title = row[1]
        
        if author == author_name: #(3)
            print(title)
In [19]:
#calling print_title( )

print("whoishiring's posts title")
print('==========================')
print_title(ask_posts, 'whoishiring')

print('\n')
print("mod50ack's posts title")
print('======================')
print_title(ask_posts, 'mod50ack')

print('\n')
print("boren_ave11's posts title")
print('=========================')
print_title(ask_posts, 'boren_ave11')

print('\n')
print("schappim's posts title")
print('======================')
print_title(ask_posts, 'schappim')

print('\n')
print("sama's posts title")
print('===================')
print_title(ask_posts, 'sama')
whoishiring's posts title
==========================
Ask HN: Who wants to be hired? (June 2016)
Ask HN: Freelancer? Seeking freelancer? (December 2015)
Ask HN: Who is hiring? (September 2016)
Ask HN: Who wants to be hired? (August 2016)
Ask HN: Freelancer? Seeking freelancer? (September 2016)
Ask HN: Who is hiring? (August 2016)
Ask HN: Who wants to be hired? (April 2016)
Ask HN: Freelancer? Seeking freelancer? (November 2015)
Ask HN: Who wants to be hired? (March 2016)


mod50ack's posts title
======================
Ask HN: What's the best tool you used to use that doesn't exist anymore?


boren_ave11's posts title
=========================
Ask HN: How much do you make at Amazon? Here is how much I make at Amazon


schappim's posts title
======================
Ask HN: What book have you given as a gift?
Ask HN: Is it feasible to port Apple's Swift to the ESP8266?


sama's posts title
===================
Ask HN: What should we fund at YC Research?

From the result above, we find that all of whoishiring's posts are related to hiring, whether they look for people who want to be hired, people who are hiring, or freelancers.

It appears that these posts gathered a lot of comments not because they were timed perfectly, but because of their nature. Hiring-related posts naturally draw a lot of comments, regardless of the time.

For this reason, I decided to check when these posts were created. If they were created during one of the top 5 hours listed above, I am going to remove whoishiring's posts from the analysis, because they differ in nature from the other authors' posts.

3.7. author_post_hr_list(  )

To get the author's posting times, we create a function, author_post_hr_list(  ).
This function takes list_name and author_name as arguments and does the following:

(1) Create an empty dictionary, author_dict, to serve as a frequency table with the hour as key and the number of posts created as its value
(2) For each row written by author_name, extract the hour from the date
   - Use the datetime.strptime(  ) method to parse the date and create a datetime object.
   - Use the datetime.strftime(  ) method to select just the hour from the datetime object.
(3) If the hour isn't a key in author_dict:
   - Create the key in author_dict and set it equal to 1
     Otherwise:
   - Increment the key in author_dict by 1
(4) Create an empty list, author_list, and append [post count, hour] to it for each hour in author_dict.
   - We create author_list so that we can convert the hour from ET to PT later on.
(5) Return author_list

In [20]:
def author_post_hr_list(list_name, author_name):
    date_format = "%m/%d/%Y %H:%M"
    
    author_dict = {}   #(1)
    
    for row in list_name:
        if (row[5] == author_name): 
            datetime_str = row[6]
            datetime_dt = dt.datetime.strptime(datetime_str, date_format)  #(2)
            hour_str = datetime_dt.strftime("%I %p") 
            
            if hour_str not in author_dict:  #(3)
                author_dict[hour_str] = 1
            else:
                author_dict[hour_str] += 1
                
    author_list = []   #(4)
           
    for hour in author_dict:
        author_list.append([author_dict[hour], hour])
        
    return author_list    #(5)
In [21]:
#calling author_post_hr_list( )

whoishiring_hr = author_post_hr_list(ask_posts, 'whoishiring')
print('whoishiring_hr')
print(whoishiring_hr)
whoishiring_hr
[[9, '03 PM']]

3.8. print_post_count(  )

Now that we have the author's posting time (whoishiring_hr), it's time to display the result in a readable format.
In order to do so, we create a function, print_post_count(  ).
This function takes list_name as an argument and does the following:


(1) Loop through each elements in list_name

(2) Use format(  ) to print string in a formatted way

(3) Print the output

In [22]:
def print_post_count(list_name):   #(1)
    for posts, hour in list_name:
        output = '{p} post(s) created at {h}'.format(h = hour, p = posts)   #(2)
        print(output)   #(3)
In [23]:
print("whoishiring's post count in Eastern Time (ET)")  
print('=============================================') 
print_post_count(whoishiring_hr)
whoishiring's post count in Eastern Time (ET)
=============================================
9 post(s) created at 03 PM

Next, we will convert the time from ET to PT using convert_pt(  ), which we defined earlier and then print the posting time in PT using print_post_count(  )

In [24]:
convert_pt(whoishiring_hr) #calling a function that we defined earlier

print("whoishiring's post count in Pacific Time (PT)")  
print('=============================================')
print_post_count(whoishiring_hr) #calling a function that we defined earlier
whoishiring's post count in Pacific Time (PT)
=============================================
9 post(s) created at 12 PM

Based on the result above, we find that whoishiring created 9 posts at 12 PM (PT). If we look back to our findings above, 12 PM (PT) is ranked number 1 in the Top 5 Hours to post on Ask Posts to get the most comments in Pacific Time (PT).
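
To see how much these nine posts weigh on the 03 PM (ET) average, here is a short worked calculation using only figures printed earlier: 116 posts and 4477 comments at 03 PM, of which whoishiring contributed 9 posts and 3046 comments. The variable names are chosen for this sketch.

```python
posts_3pm, comments_3pm = 116, 4477                # from posts_per_hour / comments_per_hour
whoishiring_posts, whoishiring_comments = 9, 3046  # from check_author and whoishiring_hr

avg_with = round(comments_3pm / posts_3pm, 2)
avg_without = round((comments_3pm - whoishiring_comments)
                    / (posts_3pm - whoishiring_posts), 2)
share = round(whoishiring_comments / comments_3pm * 100, 1)

print(avg_with)     # 38.59
print(avg_without)  # 13.37
print(share)        # 68.0 (percent of all 03 PM comments)
```

In other words, roughly two thirds of the comments in that hour come from a single author's posts.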

3.9. Remove whoishiring's posts from the analysis and re-explore the num_comments column

As previously mentioned, since whoishiring's posts were created during one of the top 5 hours, I am going to remove them from the analysis, because they differ in nature from the other authors' posts.
(Note: whoishiring's posts are related to hiring)

We are going to do the following:
(1) Create an empty list, posts_without_whoishiring, and append to it every row from ask_posts whose author is not whoishiring.
(2) Use hourly_average(  ) on posts_without_whoishiring to:
   - Count the number of posts created in each hour of the day (posts per hour)
   - Total the number of comments received in each hour (comments per hour)
   - Calculate the average number of comments per post for each hour
(3) Use print_hourly_avg(  ) to print the top 5 hours in ET
(4) Use convert_pt(  ) to convert the time from ET to PT
(5) Use print_hourly_avg(  ) to print the top 5 hours in PT
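
Step (1) is a plain filter, so it could also be written as a list comprehension. The helper name drop_author and the rows below are made up for this sketch.

```python
def drop_author(rows, author_name, author_index=5):
    """Return a new list with every row by `author_name` left out."""
    return [row for row in rows if row[author_index] != author_name]

# Made-up rows in the same column layout as the dataset.
made_up_rows = [
    ['1', 'Ask HN: Who is hiring?', '', '1', '900', 'whoishiring', '8/1/2016 15:00'],
    ['2', 'Ask HN: a', '', '1', '6', 'user_a', '8/16/2016 9:55'],
    ['3', 'Ask HN: b', '', '1', '4', 'user_b', '8/17/2016 9:10'],
]

kept = drop_author(made_up_rows, 'whoishiring')
print(len(kept))                  # 2
print([row[5] for row in kept])   # ['user_a', 'user_b']
```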

In [25]:
posts_without_whoishiring = [] #(1)

for row in ask_posts:
    author = row[5]
    if author != 'whoishiring':
        posts_without_whoishiring.append(row)

avg_hr_comments_wo_whoishiring = hourly_average(posts_without_whoishiring, 'comments') #(2)
posts_per_hour
{'09 AM': 45, '01 PM': 85, '10 AM': 59, '02 PM': 107, '04 PM': 108, '11 PM': 68, '12 PM': 73, '05 PM': 100, '03 PM': 107, '09 PM': 109, '08 PM': 80, '02 AM': 58, '06 PM': 109, '03 AM': 54, '05 AM': 46, '07 PM': 110, '01 AM': 60, '10 PM': 71, '08 AM': 48, '04 AM': 47, '12 AM': 55, '06 AM': 44, '07 AM': 34, '11 AM': 58}


comments_per_hour
{'09 AM': 251, '01 PM': 1253, '10 AM': 793, '02 PM': 1416, '04 PM': 1814, '11 PM': 543, '12 PM': 687, '05 PM': 1146, '03 PM': 1431, '09 PM': 1745, '08 PM': 1722, '02 AM': 1381, '06 PM': 1439, '03 AM': 421, '05 AM': 464, '07 PM': 1188, '01 AM': 683, '10 PM': 479, '08 AM': 492, '04 AM': 337, '12 AM': 447, '06 AM': 397, '07 AM': 267, '11 AM': 641}


avg_by_hour (sorted in descending order)
[[23.81, '02 AM'], [21.52, '08 PM'], [16.8, '04 PM'], [16.01, '09 PM'], [14.74, '01 PM'], [13.44, '10 AM'], [13.37, '03 PM'], [13.23, '02 PM'], [13.2, '06 PM'], [11.46, '05 PM'], [11.38, '01 AM'], [11.05, '11 AM'], [10.8, '07 PM'], [10.25, '08 AM'], [10.09, '05 AM'], [9.41, '12 PM'], [9.02, '06 AM'], [8.13, '12 AM'], [7.99, '11 PM'], [7.85, '07 AM'], [7.8, '03 AM'], [7.17, '04 AM'], [6.75, '10 PM'], [5.58, '09 AM']]
In [26]:
print('avg_hr_comments_wo_whoishiring')
print(avg_hr_comments_wo_whoishiring) #printing the result from (2)
avg_hr_comments_wo_whoishiring
[[23.81, '02 AM'], [21.52, '08 PM'], [16.8, '04 PM'], [16.01, '09 PM'], [14.74, '01 PM'], [13.44, '10 AM'], [13.37, '03 PM'], [13.23, '02 PM'], [13.2, '06 PM'], [11.46, '05 PM'], [11.38, '01 AM'], [11.05, '11 AM'], [10.8, '07 PM'], [10.25, '08 AM'], [10.09, '05 AM'], [9.41, '12 PM'], [9.02, '06 AM'], [8.13, '12 AM'], [7.99, '11 PM'], [7.85, '07 AM'], [7.8, '03 AM'], [7.17, '04 AM'], [6.75, '10 PM'], [5.58, '09 AM']]
In [27]:
print('Top 5 Hours to post on Ask Posts to get the most comments in Eastern Time (ET)')
print_hourly_avg(avg_hr_comments_wo_whoishiring) #(3)
Top 5 Hours to post on Ask Posts to get the most comments in Eastern Time (ET)
==============================================================================
02 AM: 23.81 per post
08 PM: 21.52 per post
04 PM: 16.80 per post
09 PM: 16.01 per post
01 PM: 14.74 per post
In [28]:
convert_pt(avg_hr_comments_wo_whoishiring) #(4)
In [29]:
print('Top 5 Hours to post on Ask Posts to get the most comments in Pacific Time (PT)')
print_hourly_avg(avg_hr_comments_wo_whoishiring) #(5)
Top 5 Hours to post on Ask Posts to get the most comments in Pacific Time (PT)
==============================================================================
11 PM: 23.81 per post
05 PM: 21.52 per post
01 PM: 16.80 per post
06 PM: 16.01 per post
10 AM: 14.74 per post

At last! We managed to find the optimal time to create an Ask HN post in order to get the most comments. Based on the result above, after removing whoishiring's posts from the analysis, we find that the best time to post is 11 PM (PT), or 2 AM (ET).

That answers our second question, about the best time to create posts that gather the most comments.

4. Exploring num_points

There is one more thing: we also want to find the optimal time to get the most points. To do that, we will explore num_points the same way we explored num_comments.

Thankfully, we defined a lot of functions while exploring num_comments, so this exploration will be easier. All we need to do is call the functions we defined earlier.

4.1. explore(  )

Using explore(  ), we will be able to find the average points in both ask_posts and show_posts

In [30]:
print('Average points on ask_posts:')
explore(ask_posts, 'points')
print('\n')

print('Average points on show_posts:')
explore(show_posts, 'points')
Average points on ask_posts:
15.06


Average points on show_posts:
27.56

Since show_posts has more points per post on average (27.56 versus 15.06), Show HN posts tend to receive more points from users. Hence, we will focus on show_posts when exploring the num_points column.

This answers our third question: Which kind of posts receive the most points?
Answer: Posts that start with Show HN (show_posts)

4.2 hourly_average(  )

Using hourly_average(  ), we will do the following:

(1) Count the number of posts created in show_posts in each hour of the day (posts_per_hour).
(2) Total the points in each hour (points_per_hour) and divide by posts_per_hour to get the average number of points per post in each hour.
(3) Sort the list by the average number of points (descending order).
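
The three steps above can be sketched as follows. This is only an illustration, not the hourly_average(  ) function defined earlier in the notebook: the name hourly_average_sketch, the (hour label, points) input format, and the sample values are all assumptions made for this example.

```python
from collections import defaultdict


def hourly_average_sketch(rows):
    """rows: iterable of (hour_label, points) pairs, e.g. ('11 PM', 40).

    Hypothetical helper for illustration; the notebook's own
    hourly_average() takes a dataset and a column name instead.
    """
    posts_per_hour = defaultdict(int)
    points_per_hour = defaultdict(int)

    for hour, points in rows:
        posts_per_hour[hour] += 1          # step (1): posts per hour
        points_per_hour[hour] += points    # total points, used in step (2)

    # step (2): average points per post, stored as [average, hour]
    avg_by_hour = [
        [round(points_per_hour[hour] / posts_per_hour[hour], 2), hour]
        for hour in posts_per_hour
    ]

    # step (3): highest average first
    avg_by_hour.sort(reverse=True)
    return avg_by_hour


# Made-up sample data, not taken from the dataset
sample_rows = [('11 PM', 50), ('11 PM', 30), ('09 AM', 10), ('12 PM', 25)]
print(hourly_average_sketch(sample_rows))
# [[40.0, '11 PM'], [25.0, '12 PM'], [10.0, '09 AM']]
```

Putting the average first in each pair means a plain sort orders the list by average, which matches the [average, hour] layout of the avg_by_hour output in this notebook.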

In [31]:
print('avg_hr_points')
print('=============')
avg_hr_points = hourly_average(show_posts, 'points')  #(1), (2), (3)
avg_hr_points
=============
posts_per_hour
{'02 PM': 86, '10 PM': 46, '06 PM': 61, '07 AM': 26, '08 PM': 60, '05 AM': 19, '04 PM': 93, '07 PM': 55, '03 PM': 78, '03 AM': 27, '05 PM': 93, '06 AM': 16, '02 AM': 30, '01 PM': 99, '08 AM': 34, '09 PM': 47, '04 AM': 26, '11 AM': 44, '12 PM': 61, '11 PM': 36, '09 AM': 30, '01 AM': 28, '10 AM': 36, '12 AM': 31}


points_per_hour
{'02 PM': 2187, '10 PM': 1856, '06 PM': 2215, '07 AM': 494, '08 PM': 1819, '05 AM': 104, '04 PM': 2634, '07 PM': 1702, '03 PM': 2228, '03 AM': 679, '05 PM': 2521, '06 AM': 375, '02 AM': 340, '01 PM': 2438, '08 AM': 519, '09 PM': 866, '04 AM': 386, '11 AM': 1480, '12 PM': 2543, '11 PM': 1526, '09 AM': 553, '01 AM': 700, '10 AM': 681, '12 AM': 1173}


avg_by_hour (sorted in descending order)
[[42.39, '11 PM'], [41.69, '12 PM'], [40.35, '10 PM'], [37.84, '12 AM'], [36.31, '06 PM'], [33.64, '11 AM'], [30.95, '07 PM'], [30.32, '08 PM'], [28.56, '03 PM'], [28.32, '04 PM'], [27.11, '05 PM'], [25.43, '02 PM'], [25.15, '03 AM'], [25.0, '01 AM'], [24.63, '01 PM'], [23.44, '06 AM'], [19.0, '07 AM'], [18.92, '10 AM'], [18.43, '09 PM'], [18.43, '09 AM'], [15.26, '08 AM'], [14.85, '04 AM'], [11.33, '02 AM'], [5.47, '05 AM']]

4.3. print_hourly_avg(  )

Using print_hourly_avg(  ), we will display the top 5 hours to post on Show Posts in order to get the most points, first in Eastern Time (ET) and then, after the conversion in 4.4, in Pacific Time (PT).

In [32]:
print('Top 5 Hours to post on Show Posts to get the most points in Eastern Time (ET)')
print_hourly_avg(avg_hr_points) #calling print_hourly_avg for ET
Top 5 Hours to post on Show Posts to get the most points in Eastern Time (ET)
==============================================================================
11 PM: 42.39 per post
12 PM: 41.69 per post
10 PM: 40.35 per post
12 AM: 37.84 per post
06 PM: 36.31 per post

4.4. convert_pt(  )

Using convert_pt(  ), we will convert the time from ET to PT
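
Converting an hour label from Eastern Time to Pacific Time amounts to moving it back by three hours and wrapping around midnight. The sketch below shows one way to do that. It is not the convert_pt(  ) defined earlier in the notebook; the names et_to_pt_label and convert_pt_sketch are hypothetical, and the input is assumed to use the same [average, 'HH AM/PM'] layout as the lists above.

```python
import datetime as dt


def et_to_pt_label(label):
    """Shift an hour label such as '11 PM' back by 3 hours (ET -> PT)."""
    hour_et = dt.datetime.strptime(label, '%I %p').hour   # 0-23
    hour_pt = (hour_et - 3) % 24                          # wrap past midnight
    return dt.time(hour_pt).strftime('%I %p')


def convert_pt_sketch(avg_by_hour):
    """Return a new [[average, label], ...] list with labels in PT."""
    return [[avg, et_to_pt_label(label)] for avg, label in avg_by_hour]


# Three ET rows copied from the avg_by_hour output in 4.2
et_rows = [[42.39, '11 PM'], [41.69, '12 PM'], [37.84, '12 AM']]
print(convert_pt_sketch(et_rows))
# [[42.39, '08 PM'], [41.69, '09 AM'], [37.84, '09 PM']]
```

Unlike this sketch, which returns a new list, the notebook's convert_pt(  ) appears to update the list in place, since it is called without assigning the result.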

In [33]:
convert_pt(avg_hr_points)
In [34]:
print('Top 5 Hours to post on Show Posts to get the most points in Pacific Time (PT)')
print_hourly_avg(avg_hr_points) #calling print_hourly_avg for PT (see 4.3)
Top 5 Hours to post on Show Posts to get the most points in Pacific Time (PT)
==============================================================================
08 PM: 42.39 per post
09 AM: 41.69 per post
07 PM: 40.35 per post
09 PM: 37.84 per post
03 PM: 36.31 per post

4.5. check_author(  )

Using check_author(  ), we will display the top 5 authors with the most points
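
A ranking like this can be built by totalling the points for each author and sorting the totals in descending order. The sketch below assumes that check_author(  ) sums points per author, which the notebook excerpt here does not confirm. The name check_author_sketch, the (author, points) input format, and the sample authors are hypothetical.

```python
from collections import defaultdict


def check_author_sketch(rows):
    """rows: iterable of (author, points) pairs.

    Assumption: an author's score is the sum of points over their posts.
    """
    totals = defaultdict(int)
    for author, points in rows:
        totals[author] += points

    # Store as [total, author] so sorting orders by total first
    ranked = [[total, author] for author, total in totals.items()]
    ranked.sort(reverse=True)
    return ranked


# Made-up sample data, not taken from the dataset
sample = [('alice', 10), ('bob', 7), ('alice', 5), ('carol', 20)]
print(check_author_sketch(sample)[:2])
# [[20, 'carol'], [15, 'alice']]
```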

In [35]:
print('Top 5 Authors with the most points')
print('====================================')

author_most_points = check_author(show_posts, 'points')
print(author_most_points[:5])
Top 5 Authors with the most points
====================================
[[825, 'petermolyneux'], [747, 'dhotson'], [681, 'damjanstankovic'], [572, 'orf'], [553, 'Capira']]

4.6. print_title(  )

Using print_title(  ), we will display the titles of the posts created by the authors above

In [36]:
print("petermolyneux's posts title")
print('==========================')
print_title(show_posts, 'petermolyneux')

print('\n')
print("dhotson's posts title")
print('======================')
print_title(show_posts, 'dhotson')

print('\n')
print("damjanstankovic's posts title")
print('=============================')
print_title(show_posts, 'damjanstankovic')

print('\n')
print("orf's posts title")
print('=================')
print_title(show_posts, 'orf')

print('\n')
print("Capira's posts title")
print('======================')
print_title(show_posts, 'Capira')
petermolyneux's posts title
==========================
Show HN: New calendar app idea


dhotson's posts title
======================
Show HN: Something pointless I made


damjanstankovic's posts title
=============================
Show HN: I spent a year making an electro-mechanical prototype of a liquid clock


orf's posts title
=================
Show HN: Hacker News Simulator


Capira's posts title
======================
Show HN: What every browser knows about you

The results suggest that none of these authors' posts belong to a specific recurring category, such as hiring (which was the case in the num_comments exploration). Therefore, we will not remove any posts from the analysis, unlike what we did with whoishiring in the num_comments exploration.

Thus, the answer to our final question, 'What is the most optimal time to create posts that gather the most points?',
is 08 PM (PT), or 11 PM (ET), with 42.39 points per post (see 4.3 and 4.4)

5. Conclusion

In [37]:
#Conclusion data
print('Top 5 Hours to post on Ask Posts to get the most comments Pacific Time (PT)')
print_hourly_avg(avg_hr_comments_wo_whoishiring) #(see 3.9)
print('\n')
print('Top 5 Hours to post on Show Posts to get the most points in Pacific Time (PT)')
print_hourly_avg(avg_hr_points) #(see 4.4)
Top 5 Hours to post on Ask Posts to get the most comments Pacific Time (PT)
==============================================================================
11 PM: 23.81 per post
05 PM: 21.52 per post
01 PM: 16.80 per post
06 PM: 16.01 per post
10 AM: 14.74 per post


Top 5 Hours to post on Show Posts to get the most points in Pacific Time (PT)
==============================================================================
08 PM: 42.39 per post
09 AM: 41.69 per post
07 PM: 40.35 per post
09 PM: 37.84 per post
03 PM: 36.31 per post

After exploring the Hacker News dataset, we found the following:

(1) Posts whose titles begin with Ask HN receive more comments on average than posts whose titles begin with Show HN.
(2) To get the most comments, we need to create a post with a title that begins with Ask HN, and post it at 11 PM (PT), which is 2 AM (ET).
(3) Posts whose titles begin with Show HN receive more points on average than posts whose titles begin with Ask HN.
(4) To get the most points, we need to create a post with a title that begins with Show HN, and post it at 8 PM (PT), which is 11 PM (ET).