Hacker News is a widely read social news site run by the startup incubator Y Combinator, where users submit posts about technology, startups, and programming. Visitors submit posts that receive comments and votes depending on how interesting the community finds them.
This particular dataset has about 300,000 rows, and we'll filter it down as we analyze the data.
We're particularly interested in titles that begin with Ask HN and Show HN.
Ask HN - Users submit posts to ask the Hacker News community a specific question.
Show HN - Users submit posts to show the Hacker News community a project, product, or just something interesting.
For more information, please visit Hacker News
We are here to compare Ask HN and Show HN posts to determine the following: do Ask HN or Show HN posts receive more comments on average, and do posts created at a certain time receive more comments on average?
import csv

# Open the dataset and read it into a list of lists.
open_file = open('HN_posts_year_to_Sep_26_2016.csv', encoding="utf8")
read_file = csv.reader(open_file)
dataset = list(read_file)

# Separate the header row from the data rows.
h_header = dataset[0]
hnews = dataset[1:]
Here we define a function called explore_data, which takes four arguments: dataset, start, end, and rows_and_columns.
For readability and functionality purposes, we created the explore_data function to take a look at a few of the records inside the Hacker News dataset.
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print each row in the requested slice of the dataset.
    data_slice = dataset[start:end]
    for each_row in data_slice:
        print(each_row)
        print("\n")
    if rows_and_columns:
        print("Number of Columns: ", len(dataset[0]))
        print("Number of rows: ", len(dataset))

# Print the column names, then preview the first five rows.
print("Columns: ", h_header, "\n")
explore_data(hnews, 0, 5, True)
Columns:  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
Number of Columns:  7
Number of rows:  293119
MISSING DATA:
print("\nRows with Missing Data Points Testing Purposes")
for each_row in hnews:
length = len(each_row)
if length < len(h_header):
print(each_row)
print('\n')
Rows with Missing Data Points Testing Purposes
Since we are only interested in titles from Ask HN and Show HN, we must first extract these rows from the dataset so we can begin our analysis.
For example, we will use a string method called startswith. Given a string object, let's say string1 = "We Love to Learn from DataQuest!", we can use the startswith method to get a boolean result (True or False) telling us whether our string object starts with a certain sequence of characters.
string1 = "We Love to Learn from DataQuest!"
print(string1.startswith("We love"))
print(string1.startswith("We Love"))
False
True
Notice that this method is case sensitive. To work around this, we can add on the method .lower(), which will make all the characters lowercase.
print(string1.lower())
we love to learn from dataquest!
print(string1.lower().startswith("we love"))
True
Now let's begin to use the .startswith() and .lower() methods to filter our dataset for only Ask HN and Show HN post titles. These posts will be separated into their own lists: show_hn_posts, which will only contain Show HN posts, and ask_hn_posts, which will only contain Ask HN posts. Since we are not interested in other post types, we'll just add them to the other_posts list.
show_hn_posts = []
ask_hn_posts = []
other_posts = []

for each_row in hnews:
    title_filter = each_row[1]
    if title_filter.lower().startswith("ask hn"):
        ask_hn_posts.append(each_row)
    elif title_filter.lower().startswith('show hn'):
        show_hn_posts.append(each_row)
    else:
        other_posts.append(each_row)
print("Number of Ask HN Posts: \n", len(ask_hn_posts))
print("Number of Show HN Posts: \n",len(show_hn_posts))
print("Number of other Posts: \n",len(other_posts))
Number of Ask HN Posts: 
 9139
Number of Show HN Posts: 
 10158
Number of other Posts: 
 273822
Now let's take a quick look at the number of rows we have in the Ask HN and Show HN posts. We notice there is only a little over a thousand-post difference between Ask HN and Show HN. Fortunately, the data at this point is not skewed by a large margin, since the two groups are relatively close in size.
9139
10158
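To make that claim concrete, here is a minimal sketch (reusing the ask_hn_posts and show_hn_posts lists built above) that computes the gap between the two groups and each group's share of the combined total:

# Quick size comparison of the two filtered groups.
ask_count = len(ask_hn_posts)
show_count = len(show_hn_posts)
total = ask_count + show_count

print("Difference in posts:", show_count - ask_count)
print("Ask HN share: {:.1f}%".format(100 * ask_count / total))
print("Show HN share: {:.1f}%".format(100 * show_count / total))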
Below we showcase the first five rows in the Show HN posts that were added to our show_hn_posts list.
explore_data(show_hn_posts,0,5)
['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']
['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']
['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']
['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17']
['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']
Below we showcase the first five rows in the Ask HN posts that were added to our ask_hn_posts list.
explore_data(ask_hn_posts,0,5)
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']
['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']
Now let's determine whether Ask HN or Show HN posts receive more comments on average. We'll use the num_comments column to gather our data.
num_ask_comment = []
total_ask_comments = 0
average_ask_comments = 0

# Sum the comments across Ask HN posts, keeping only posts with comments.
for each_row in ask_hn_posts:
    each_comment = int(each_row[4])
    if each_comment > 0:
        total_ask_comments += each_comment
        num_ask_comment.append(each_comment)

average_ask_comments = total_ask_comments / len(num_ask_comment)

print("Total Number of comments: \n", total_ask_comments)
print("Average Number of comments per post: \n", round(average_ask_comments))
Total Number of comments: 
 94986
Average Number of comments per post: 
 14
num_show_comment = []
total_show_comments = 0
average_show_comments = 0

# Sum the comments across Show HN posts, keeping only posts with comments.
for each_row in show_hn_posts:
    each_comment = int(each_row[4])
    if each_comment > 0:
        total_show_comments += each_comment
        num_show_comment.append(each_comment)

average_show_comments = total_show_comments / len(num_show_comment)

print("Total Number of comments: \n", total_show_comments)
print("Average Number of comments per post: \n", round(average_show_comments))
Total Number of comments: 
 49633
Average Number of comments per post: 
 10
Earlier we determined that there are more Show HN posts than Ask HN posts. As we delve deeper and calculate the average number of comments per post, the results are surprising: on average, Ask HN posts receive about 14 comments each, while Show HN posts receive about 10.
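As a side note, the two loops above follow the same pattern, so the per-group average could also be computed with a small helper function. This is just a sketch, not part of the original analysis; it keeps the same rule of only counting posts that received at least one comment:

def average_comments(posts, comment_index=4):
    # Average comments per post, counting only posts with at least one comment.
    counts = [int(row[comment_index]) for row in posts if int(row[comment_index]) > 0]
    return sum(counts) / len(counts)

print(round(average_comments(ask_hn_posts)))   # ~14
print(round(average_comments(show_hn_posts)))  # ~10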
So far we have determined that Ask HN posts receive more comments on average than Show HN posts. Now that we know Ask HN posts are more likely to receive comments, we will use the Ask HN data moving forward to continue our analysis.
Now we must determine if Ask posts created at a certain time garner more comments. To achieve this, we must do the following: calculate the number of Ask posts created in each hour of the day, along with the comments they received, and then calculate the average number of comments per post for each hour.
We'll use the datetime module to work with the created_at column.
For example, we'll take a string that contains a date and extract that data into a variable, using the datetime.strptime() constructor to parse dates stored as strings and return datetime objects.
import datetime as dt
date_string = "November 15, 2012"
date_datetime = dt.datetime.strptime(date_string, "%B %d, %Y")
print(date_datetime)
2012-11-15 00:00:00
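Later in the analysis we will also rely on datetime.strftime() to pull just the hour back out of a parsed datetime object. As a quick illustration, using the same date format that appears in the created_at column:

import datetime as dt

# Parse a created_at-style string, then extract only the hour as a string.
sample = dt.datetime.strptime("9/26/2016 2:53", "%m/%d/%Y %H:%M")
print(dt.datetime.strftime(sample, "%H"))  # '02'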
Now let's use these techniques to parse every string date in our dataset as a datetime.
We will import the datetime module under the alias dt to reduce confusion when calling the datetime class.
result_list - a list that will house inner lists containing the date and time in one column and the number of comments in the other.
counts_by_hour - a dictionary that will house the number of posts created during each hour.
comments_by_hour - a dictionary that will house the total number of comments per hour.
created_at - a variable that will contain the date and time pulled from our ask_hn_posts dataset.
comments - a variable that will contain the comment count, converted to an int datatype to better analyze our data.
To reduce irrelevant data, only posts with comments will be recorded and appended to result_list.
ask_hn_final_posts - a list that will house all rows whose comment count is not 0.
import datetime as dt

ask_hn_final_posts = []
result_list = []
counts_by_hour = {}
comments_by_hour = {}

for each_row in ask_hn_posts:
    # Column created_at holds the post's creation date and time.
    created_at = each_row[6]
    comments = int(each_row[4])
    if comments > 0:
        result_list.append([created_at, comments])
        ask_hn_final_posts.append(each_row)
Sample of the ask_hn_final_posts after removing rows with 0 comments.
print(h_header, "\n")
explore_data(ask_hn_final_posts,0,3,True)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']
Number of Columns:  7
Number of rows:  6911
Sample of the result_list containing date, time, and number of comments
print(result_list[:3])
[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:48', 3]]
If the hour is not in counts_by_hour, we create the key, set its count to 1, and set comments_by_hour for that hour to the post's comment count. If the hour is already in counts_by_hour, we increment its count by 1 and add the post's comment count to comments_by_hour.
for each_date_time in result_list:
    date_time = each_date_time[0]
    comment = each_date_time[1]
    # Parse the created_at string and keep only the hour of creation.
    the_date = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    only_hour = dt.datetime.strftime(the_date, "%H")
    if only_hour not in counts_by_hour:
        counts_by_hour[only_hour] = 1
        comments_by_hour[only_hour] = comment
    else:
        counts_by_hour[only_hour] += 1
        comments_by_hour[only_hour] += comment
Now we have the following data regarding the Ask HN posts:
print("Number of posts per hour: \n", counts_by_hour)
print("\n")
print("Number of comments per hour: \n",comments_by_hour)
Number of posts per hour: 
 {'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}

Number of comments per hour: 
 {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
Next we'll be using the counts_by_hour and comments_by_hour dictionaries to calculate the average number of comments for posts created during each hour of the day.
As an example before we resume the analysis, let's illustrate the technique with the following dictionary:
sample_dict = {
    'apple': 2,
    'banana': 4,
    'orange': 6
}
Suppose we wanted to multiply each of the values by ten and return the results as a list of lists. We can use the following code:
fruits = []
for fruit in sample_dict:
    fruits.append([fruit, 10 * sample_dict[fruit]])

# results
print(fruits)
[['apple', 20], ['banana', 40], ['orange', 60]]
In the example above, we iterated over the keys of sample_dict and appended to fruits a list whose first element is the key (the fruit name) and whose second element is the key's value multiplied by ten.
Let's use this format to create a list of lists containing the hours during which posts were created and the average number of comments those posts received.
Below we print a list of lists in which the first element is the hour and the second element is the average number of comments per post, all contained within the avg_by_hour list.
avg_by_hour = []
for hour_comment in counts_by_hour:
    avg_by_hour.append([hour_comment, comments_by_hour[hour_comment] / counts_by_hour[hour_comment]])

print("Average number of Comments for Ask HN Posts by Hour: ")
for row in avg_by_hour:
    print(row)
Average number of Comments for Ask HN Posts by Hour: 
['02', 13.198237885462555]
['01', 9.367713004484305]
['22', 11.749128919860627]
['21', 11.056511056511056]
['19', 9.414285714285715]
['17', 13.73019801980198]
['15', 39.66809421841542]
['14', 13.153439153439153]
['13', 22.2239263803681]
['11', 11.143426294820717]
['10', 13.757990867579908]
['09', 8.392045454545455]
['07', 10.095541401273886]
['03', 10.160377358490566]
['16', 10.76144578313253]
['08', 12.43157894736842]
['00', 9.857142857142858]
['23', 8.322463768115941]
['20', 11.38265306122449]
['18', 10.789823008849558]
['12', 15.452554744525548]
['04', 12.688172043010752]
['06', 9.017045454545455]
['05', 11.139393939393939]
Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
The first step is to swap the first and second elements of each inner list. We create an empty list, swap_avg_by_hour, append each swapped pair to it, and print out the results.
swap_avg_by_hour = []
for each_row in avg_by_hour:
    swap_avg_by_hour.append([each_row[1], each_row[0]])

for row in swap_avg_by_hour:
    print(row)
[13.198237885462555, '02']
[9.367713004484305, '01']
[11.749128919860627, '22']
[11.056511056511056, '21']
[9.414285714285715, '19']
[13.73019801980198, '17']
[39.66809421841542, '15']
[13.153439153439153, '14']
[22.2239263803681, '13']
[11.143426294820717, '11']
[13.757990867579908, '10']
[8.392045454545455, '09']
[10.095541401273886, '07']
[10.160377358490566, '03']
[10.76144578313253, '16']
[12.43157894736842, '08']
[9.857142857142858, '00']
[8.322463768115941, '23']
[11.38265306122449, '20']
[10.789823008849558, '18']
[15.452554744525548, '12']
[12.688172043010752, '04']
[9.017045454545455, '06']
[11.139393939393939, '05']
Next we will use the sorted() function to sort swap_avg_by_hour in descending order. Since the first element of each inner list is the average number of comments per post, the list will be sorted by average number of comments.
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

for entry in sorted_swap:
    swap = [entry[0], entry[1]]
    print(swap)
[39.66809421841542, '15']
[22.2239263803681, '13']
[15.452554744525548, '12']
[13.757990867579908, '10']
[13.73019801980198, '17']
[13.198237885462555, '02']
[13.153439153439153, '14']
[12.688172043010752, '04']
[12.43157894736842, '08']
[11.749128919860627, '22']
[11.38265306122449, '20']
[11.143426294820717, '11']
[11.139393939393939, '05']
[11.056511056511056, '21']
[10.789823008849558, '18']
[10.76144578313253, '16']
[10.160377358490566, '03']
[10.095541401273886, '07']
[9.857142857142858, '00']
[9.414285714285715, '19']
[9.367713004484305, '01']
[9.017045454545455, '06']
[8.392045454545455, '09']
[8.322463768115941, '23']
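As an aside, the same descending order can be produced without swapping the elements at all by passing a key function to sorted(). This is an alternative sketch, not the approach used in this analysis:

# Sort avg_by_hour directly by its second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg[:5])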
Below we have our average comments sorted into the top 5 hours in which the most comments are posted. Surprisingly, 3:00 p.m. has the highest average, followed by 1:00 p.m. Eastern Time.
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
print(row)
Top 5 Hours for Ask Posts Comments
[39.66809421841542, '15']
[22.2239263803681, '13']
[15.452554744525548, '12']
[13.757990867579908, '10']
[13.73019801980198, '17']
print("TOP 5 HOURS in which per post receive a number of comments\n")
count = 1
for each_date_time in sorted_swap[:5]:
time = each_date_time[1]
comment = each_date_time[0]
the_date = dt.datetime.strptime(time, "%H")
only_hour = dt.datetime.strftime(the_date, "%H")
Top_five_hours = "{}:{}: {:.2f} average comments per post.".format(only_hour,"00",comment )
print(Top_five_hours)
Top 5 Hours in which posts receive the most comments on average

15:00: 39.67 average comments per post.
13:00: 22.22 average comments per post.
12:00: 15.45 average comments per post.
10:00: 13.76 average comments per post.
17:00: 13.73 average comments per post.
The timestamps in this dataset are in Eastern Time. During our analysis, we wanted to know whether Ask HN or Show HN posts garnered more comments, and we concluded that Ask HN posts were the winner. We then wanted to know how the average number of comments varied depending on what time an Ask HN post was created.
The largest average comments per post belong to posts created at 15:00 (about 39.67 comments), 13:00 (about 22.22), and 12:00 (about 15.45) Eastern Time. These times could be due to when most people are having a lunch break or getting off work.
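Since these hours are in Eastern Time, readers in another time zone may want to shift the best posting hour before using it. Below is a minimal sketch of that idea (an extension, not part of the analysis above); the three-hour Eastern-to-Pacific offset is only an example assumption:

import datetime as dt

# Shift the best posting hour (15:00 Eastern) by an example offset of -3 hours.
best_hour_eastern = dt.datetime.strptime("15", "%H")
best_hour_pacific = best_hour_eastern + dt.timedelta(hours=-3)
print(best_hour_pacific.strftime("%H:%M"))  # 12:00 Pacific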