Hacker News is a widely read social news site run by the startup incubator Y Combinator, where users submit posts about technology, startups, and programming. Visitors submit posts that receive comments and votes depending on how interesting the community finds them.
This particular dataset has about 300,000 rows, and we'll filter it down as we analyze the data.
We're particularly interested in titles that begin with Ask HN and Show HN.
Ask HN - Users submit posts to ask the Hacker News community a specific question.
Show HN - Users submit posts to show the Hacker News community a project, product, or just something interesting.
For more information, please visit Hacker News
We are here to compare Ask HN and Show HN posts to determine the following: do Ask HN or Show HN posts receive more comments on average, and do posts created at a certain time receive more comments on average?
import csv

# Open the dataset and read it into a list of lists.
open_file = open('HN_posts_year_to_Sep_26_2016.csv', encoding="utf8")
read_file = csv.reader(open_file)
dataset = list(read_file)

# Separate the header row from the data rows.
h_header = dataset[0]
hnews = dataset[1:]
Here we define a function called explore_data, which takes four arguments: dataset, start, end, and rows_and_columns.
For readability and functionality purposes, we created the explore_data function to take a look at a few of the records inside the Hacker News dataset.
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print each row in the requested slice of the dataset.
    data_slice = dataset[start:end]
    for each_row in data_slice:
        print(each_row)
        print("\n")
    if rows_and_columns:
        print("Number of Columns: ", len(dataset[0]))
        print("Number of rows: ", len(dataset))

# Print the column names, then preview the first five rows.
print("Columns: ", h_header, "\n")
explore_data(hnews, 0, 5, True)
Columns:  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
Number of Columns:  7
Number of rows:  293119
MISSING DATA:
print("\nRows with Missing Data Points Testing Purposes")
for each_row in hnews:
length = len(each_row)
if length < len(h_header):
print(each_row)
print('\n')
Rows with Missing Data Points Testing Purposes
Since we are only interested in titles from Ask HN and Show HN, we must first extract these rows from the dataset so we can begin our analysis.
For example, we will use a string method called startswith. Given a string object, let's say string1 = "We Love to Learn from DataQuest!", we can use the startswith method to get a boolean result (True or False) telling us whether our string object starts with a certain sequence of characters.
string1 = "We Love to Learn from DataQuest!"
print(string1.startswith("We love"))
print(string1.startswith("We Love"))
False
True
Notice that this method is case sensitive. To work around this, we can add on the method .lower(), which will make all the characters lowercase.
print(string1.lower())
we love to learn from dataquest!
print(string1.lower().startswith("we love"))
True
Now let's begin to use the .startswith() and .lower() methods to filter our dataset for only Ask HN and Show HN post titles. These posts will be separated into their own lists: show_hn_posts, which will only contain Show HN posts, and ask_hn_posts, which will only contain Ask HN posts. Since we are not interested in other post types, we'll just add them to the other_posts list.
show_hn_posts = []
ask_hn_posts = []
other_posts = []

for each_row in hnews:
    title_filter = each_row[1]
    if title_filter.lower().startswith("ask hn"):
        ask_hn_posts.append(each_row)
    elif title_filter.lower().startswith('show hn'):
        show_hn_posts.append(each_row)
    else:
        other_posts.append(each_row)
print("Number of Ask HN Posts: \n", len(ask_hn_posts))
print("Number of Show HN Posts: \n",len(show_hn_posts))
print("Number of other Posts: \n",len(other_posts))
Number of Ask HN Posts: 
 9139
Number of Show HN Posts: 
 10158
Number of other Posts: 
 273822
Now let's take a quick look at the number of rows we have in the Ask HN and Show HN posts. We notice there is only a little over a thousand-post difference between Ask HN and Show HN. Fortunately, the data at this point is not skewed by a large margin, since the two groups are relatively close in size.
9139
10158
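To make that claim concrete, here is a minimal sketch (reusing the ask_hn_posts and show_hn_posts lists built above) that computes the gap between the two groups and each group's share of the combined total:

# Quick size comparison of the two filtered groups.
ask_count = len(ask_hn_posts)
show_count = len(show_hn_posts)
total = ask_count + show_count

print("Difference in posts:", show_count - ask_count)
print("Ask HN share: {:.1f}%".format(100 * ask_count / total))
print("Show HN share: {:.1f}%".format(100 * show_count / total))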
Below we showcase the first five rows in the Show HN posts that were added to our show_hn_posts list.
explore_data(show_hn_posts,0,5)
['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']
['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']
['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']
['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17']
['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']
Below we showcase the first five rows in the Ask HN posts that were added to our ask_hn_posts list.
explore_data(ask_hn_posts,0,5)
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']
['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']
Now let's determine whether Ask HN or Show HN posts receive more comments on average. We'll use the num_comments column to gather our data.
num_ask_comment = []
total_ask_comments = 0
average_ask_comments = 0

# Sum the comments across Ask HN posts, keeping only posts with comments.
for each_row in ask_hn_posts:
    each_comment = int(each_row[4])
    if each_comment > 0:
        total_ask_comments += each_comment
        num_ask_comment.append(each_comment)

average_ask_comments = total_ask_comments / len(num_ask_comment)

print("Total Number of comments: \n", total_ask_comments)
print("Average Number of comments per post: \n", round(average_ask_comments))
Total Number of comments: 
 94986
Average Number of comments per post: 
 14
num_show_comment = []
total_show_comments = 0
average_show_comments = 0

# Sum the comments across Show HN posts, keeping only posts with comments.
for each_row in show_hn_posts:
    each_comment = int(each_row[4])
    if each_comment > 0:
        total_show_comments += each_comment
        num_show_comment.append(each_comment)

average_show_comments = total_show_comments / len(num_show_comment)

print("Total Number of comments: \n", total_show_comments)
print("Average Number of comments per post: \n", round(average_show_comments))
Total Number of comments: 
 49633
Average Number of comments per post: 
 10
Earlier we determined that there are more Show HN posts than Ask HN posts. As we delve deeper and calculate the average number of comments per post, the results are surprising: on average, Ask HN posts receive about 14 comments each, while Show HN posts receive about 10.
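As a side note, the two loops above follow the same pattern, so the per-group average could also be computed with a small helper function. This is just a sketch, not part of the original analysis; it keeps the same rule of only counting posts that received at least one comment:

def average_comments(posts, comment_index=4):
    # Average comments per post, counting only posts with at least one comment.
    counts = [int(row[comment_index]) for row in posts if int(row[comment_index]) > 0]
    return sum(counts) / len(counts)

print(round(average_comments(ask_hn_posts)))   # ~14
print(round(average_comments(show_hn_posts)))  # ~10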
So far we have determined that Ask HN posts receive more comments on average than Show HN posts. Now that we know Ask HN posts are more likely to receive comments, we will use the Ask HN data moving forward to continue our analysis.
Now we must determine if Ask posts created at a certain time garner more comments. To achieve this, we must do the following: calculate the number of Ask posts created in each hour of the day, along with the comments they received, and then calculate the average number of comments per post for each hour.
We'll use the datetime module to work with the created_at column.
For example, we'll take a string that contains a date and extract that data into a variable, using the datetime.strptime() constructor to parse dates stored as strings and return datetime objects.
import datetime as dt
date_string = "November 15, 2012"
date_datetime = dt.datetime.strptime(date_string, "%B %d, %Y")
print(date_datetime)
2012-11-15 00:00:00
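Later in the analysis we will also rely on datetime.strftime() to pull just the hour back out of a parsed datetime object. As a quick illustration, using the same date format that appears in the created_at column:

import datetime as dt

# Parse a created_at-style string, then extract only the hour as a string.
sample = dt.datetime.strptime("9/26/2016 2:53", "%m/%d/%Y %H:%M")
print(dt.datetime.strftime(sample, "%H"))  # '02'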
Now let's use these techniques to parse every string date in our dataset as a datetime.
We will import the datetime module under the alias dt to reduce confusion when calling the datetime class.
result_list - a list that will house inner lists containing the date and time in one column and the number of comments in the other.
counts_by_hour - a dictionary that will house the number of posts created during each hour.
comments_by_hour - a dictionary that will house the total number of comments per hour.
created_at - a variable that will contain the date and time pulled from our ask_hn_posts dataset.
comments - a variable that will contain the comment count, converted to an int datatype to better analyze our data.
To reduce irrelevant data, only posts with comments will be recorded and appended to result_list.
ask_hn_final_posts - a list that will house all rows whose comment count is not 0.
import datetime as dt

ask_hn_final_posts = []
result_list = []
counts_by_hour = {}
comments_by_hour = {}

for each_row in ask_hn_posts:
    # Column created_at holds the post's creation date and time.
    created_at = each_row[6]
    comments = int(each_row[4])
    if comments > 0:
        result_list.append([created_at, comments])
        ask_hn_final_posts.append(each_row)
Sample of the ask_hn_final_posts after removing rows with 0 comments.
print(h_header, "\n")
explore_data(ask_hn_final_posts,0,3,True)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']
Number of Columns:  7
Number of rows:  6911
Sample of the result_list containing date, time, and number of comments
print(result_list[:3])
[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:48', 3]]
If the hour is not in counts_by_hour, we create the key, set its count to 1, and set comments_by_hour for that hour to the post's comment count. If the hour is already in counts_by_hour, we increment its count by 1 and add the post's comment count to comments_by_hour.
for each_date_time in result_list:
    date_time = each_date_time[0]
    comment = each_date_time[1]
    # Parse the created_at string and keep only the hour of creation.
    the_date = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    only_hour = dt.datetime.strftime(the_date, "%H")
    if only_hour not in counts_by_hour:
        counts_by_hour[only_hour] = 1
        comments_by_hour[only_hour] = comment
    else:
        counts_by_hour[only_hour] += 1
        comments_by_hour[only_hour] += comment
Now we have the following data regarding the Ask HN posts:
print("Number of posts per hour: \n", counts_by_hour)
print("\n")
print("Number of comments per hour: \n",comments_by_hour)
Number of posts per hour: 
 {'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}

Number of comments per hour: 
 {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
Next we'll be using the counts_by_hour and comments_by_hour dictionaries to calculate the average number of comments for posts created during each hour of the day.
As an example before we resume the analysis, let's illustrate the technique with the following dictionary:
sample_dict = {
    'apple': 2,
    'banana': 4,
    'orange': 6
}
Suppose we wanted to multiply each of the values by ten and return the results as a list of lists. We can use the following code:
fruits = []
for fruit in sample_dict:
    fruits.append([fruit, 10 * sample_dict[fruit]])

# results
print(fruits)
[['apple', 20], ['banana', 40], ['orange', 60]]
In the example above, we iterated over the keys of sample_dict and appended to fruits a list whose first element is the key (the fruit name) and whose second element is the key's value multiplied by ten.
Let's use this format to create a list of lists containing the hours during which posts were created and the average number of comments those posts received.
Below we print a list of lists in which the first element is the hour and the second element is the average number of comments per post, all contained within the avg_by_hour list.
avg_by_hour = []
for hour_comment in counts_by_hour:
    avg_by_hour.append([hour_comment, comments_by_hour[hour_comment] / counts_by_hour[hour_comment]])

print("Average number of Comments for Ask HN Posts by Hour: ")
for row in avg_by_hour:
    print(row)
Average number of Comments for Ask HN Posts by Hour: 
['02', 13.198237885462555]
['01', 9.367713004484305]
['22', 11.749128919860627]
['21', 11.056511056511056]
['19', 9.414285714285715]
['17', 13.73019801980198]
['15', 39.66809421841542]
['14', 13.153439153439153]
['13', 22.2239263803681]
['11', 11.143426294820717]
['10', 13.757990867579908]
['09', 8.392045454545455]
['07', 10.095541401273886]
['03', 10.160377358490566]
['16', 10.76144578313253]
['08', 12.43157894736842]
['00', 9.857142857142858]
['23', 8.322463768115941]
['20', 11.38265306122449]
['18', 10.789823008849558]
['12', 15.452554744525548]
['04', 12.688172043010752]
['06', 9.017045454545455]
['05', 11.139393939393939]
Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
The first step is to swap the first and second elements of each inner list. We create an empty list, swap_avg_by_hour, append each swapped pair to it, and print out the results.
swap_avg_by_hour = []
for each_row in avg_by_hour:
    swap_avg_by_hour.append([each_row[1], each_row[0]])

for row in swap_avg_by_hour:
    print(row)
[13.198237885462555, '02']
[9.367713004484305, '01']
[11.749128919860627, '22']
[11.056511056511056, '21']
[9.414285714285715, '19']
[13.73019801980198, '17']
[39.66809421841542, '15']
[13.153439153439153, '14']
[22.2239263803681, '13']
[11.143426294820717, '11']
[13.757990867579908, '10']
[8.392045454545455, '09']
[10.095541401273886, '07']
[10.160377358490566, '03']
[10.76144578313253, '16']
[12.43157894736842, '08']
[9.857142857142858, '00']
[8.322463768115941, '23']
[11.38265306122449, '20']
[10.789823008849558, '18']
[15.452554744525548, '12']
[12.688172043010752, '04']
[9.017045454545455, '06']
[11.139393939393939, '05']
Next we will use the sorted() function to sort swap_avg_by_hour in descending order. Since the first element of each inner list is the average number of comments per post, the list will be sorted by average number of comments.
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

for entry in sorted_swap:
    swap = [entry[0], entry[1]]
    print(swap)
[39.66809421841542, '15']
[22.2239263803681, '13']
[15.452554744525548, '12']
[13.757990867579908, '10']
[13.73019801980198, '17']
[13.198237885462555, '02']
[13.153439153439153, '14']
[12.688172043010752, '04']
[12.43157894736842, '08']
[11.749128919860627, '22']
[11.38265306122449, '20']
[11.143426294820717, '11']
[11.139393939393939, '05']
[11.056511056511056, '21']
[10.789823008849558, '18']
[10.76144578313253, '16']
[10.160377358490566, '03']
[10.095541401273886, '07']
[9.857142857142858, '00']
[9.414285714285715, '19']
[9.367713004484305, '01']
[9.017045454545455, '06']
[8.392045454545455, '09']
[8.322463768115941, '23']
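As an aside, the same descending order can be produced without swapping the elements at all by passing a key function to sorted(). This is an alternative sketch, not the approach used in this analysis:

# Sort avg_by_hour directly by its second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg[:5])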
Below we have our average comments sorted into the top 5 hours in which the most comments are posted. Surprisingly, 3:00 p.m. has the highest average, followed by 1:00 p.m. Eastern Time.
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
print(row)
Top 5 Hours for Ask Posts Comments
[39.66809421841542, '15']
[22.2239263803681, '13']
[15.452554744525548, '12']
[13.757990867579908, '10']
[13.73019801980198, '17']
print("TOP 5 HOURS in which per post receive a number of comments\n")
count = 1
for each_date_time in sorted_swap[:5]:
time = each_date_time[1]
comment = each_date_time[0]
the_date = dt.datetime.strptime(time, "%H")
only_hour = dt.datetime.strftime(the_date, "%H")
Top_five_hours = "{}:{}: {:.2f} average comments per post.".format(only_hour,"00",comment )
print(Top_five_hours)
Top 5 Hours in which posts receive the most comments on average

15:00: 39.67 average comments per post.
13:00: 22.22 average comments per post.
12:00: 15.45 average comments per post.
10:00: 13.76 average comments per post.
17:00: 13.73 average comments per post.
The timestamps in this dataset are in Eastern Time. During our analysis, we wanted to know whether Ask HN or Show HN posts garnered more comments, and we concluded that Ask HN posts were the winner. We then wanted to know how the average number of comments varied depending on what time an Ask HN post was created.
The largest average comments per post belong to posts created at 15:00 (about 39.67 comments), 13:00 (about 22.22), and 12:00 (about 15.45) Eastern Time. These times could be due to when most people are having a lunch break or getting off work.
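Since these hours are in Eastern Time, readers in another time zone may want to shift the best posting hour before using it. Below is a minimal sketch of that idea (an extension, not part of the analysis above); the three-hour Eastern-to-Pacific offset is only an example assumption:

import datetime as dt

# Shift the best posting hour (15:00 Eastern) by an example offset of -3 hours.
best_hour_eastern = dt.datetime.strptime("15", "%H")
best_hour_pacific = best_hour_eastern + dt.timedelta(hours=-3)
print(best_hour_pacific.strftime("%H:%M"))  # 12:00 Pacific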