"Hey, I am going to become a Hacker!"
"You going to become what?", waking up from my slumber, I asked with a tinge of irritation.
"A hacker and I need your help with that.", he replied enthusiastically.
"You called me to say this on a Sunday morning?! Is that one of your new impulsive half-hearted projects?! But whatever it might be, I won't be much of a help. I'm no coding expert to guide you on this journey of yours. So bye, let me go back to sleep. "
"Oh wait, wait. You can help me. Didn't you tell me last week that you have been studying data analysis. All you have to do is help me to find out how to I can create a popular post on Hacker News. You know its a social news website where people upvote good stuff related to hacking and other things that generates curiosity. If you can help me find a topic that I can submit and come on top of their first page, I think I'm as good as a hacker. This is my shortcut to become a hacker. Now tell me that you will help me. Please."
The idea of using data analysis to find the popular post topic kind of excited me. At least it will be a good practice on my data quest!
"Okay, so you are not going to be a hacker. You just want some eyeballs on you by posting a popular story on that website. I think I can help you with that. But before giving you any hope, let me see if I can find any dataset for this purpose. Without data, there is no data analysis. So hang in there while I look for the data. I'll call you later."
"Okay man, thank you for this. I owe you!"
"Okay, 'Hacker'. Bye."
My Sunday morning sleep-in has quickly made way to some googlin instead. After a quick search I found a data set on Kaggle.
'Such a lucky bugger, he is'. I thought. Since I found the data set, the project is on.
The Description of data set says
This data set is Hacker News posts from the last 12 months (up to September 26 2016).
It includes the following columns:
title: title of the post (self explanatory)
url: the url of the item being linked to
num_points: the number of upvotes the post received
num_comments: the number of comments the post received
author: the name of the account that made the post
created_at: the date and time the post was made (the time zone is Eastern Time in the US)
Let me open the data set now.
from csv import reader # importing CSV module to read the .csv file
opened_file=open(r"\Users\Surface GO\Downloads\HN_posts_year_to_Sep_26_2016.csv", encoding='UTF-8')
read_file=reader(opened_file)
hacker_news_full=list(read_file) # The whole data set in List form
hn_header=hacker_news_full[0] # Getting the header of the data set
hn_full=hacker_news_full[1:] # The whole data set without Header
print("The number of rows in data set: ", len(hn_full) )
hn_full[:2]
The number of rows in data set: 293119
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']]
So there are 293199 rows in the current data set. I need to remove the unwanted rows. But how do I find the unwanted rows? Well, I think I need a discussion now. . .
"Hey lucky boy. I found the data set for you. Now you need to help me with something. So you want to post popular posts, right? What if I remove posts that have less than 10 comments from this data. Do you think that makes sense?", I asked
"Man, you know it better. So do what you feel like doing. But I do think that you can remove posts with less than 10 comments. I am looking at at least 3 digit numbers, you know!", I could sense that he was being cheeky, but then he continued.
"Maybe you can also look at the point. They usually show the points along with the comments. So can you apply the same logic with points too?", he asked.
"Okay, let me try first. So I am going to create a new list that contains titles that have more than 10 comment and more than 10 points. But let me first figure out the index and column names."
def indexer(dataset_header): # Input the dataset header to retrieve its index
index=0
print("Index") #### is there a way to print the variable name used? (eg. index of appstore)
for column_name in dataset_header: #finding each column name from the header row
print(index, ' : ', column_name)
index+=1
print('\n')
indexer(hn_header)
Index 0 : id 1 : title 2 : url 3 : num_points 4 : num_comments 5 : author 6 : created_at
"Okay, so there is 'num-points' at index 3 and 'num_comments' at index 4. Now I am going to find out how many rows of these data set satify the conditions of having more than 10 comments and more than 10 points."
hn=[] # Creating an empty list to store new data set
for row in hn_full: # Iterating through each rows of the whole data set
comments=int(row[4])
points=int(row[3])
if comments >10 and points>10: # Applying the condition to create a new list
hn.append(row)
print("Length of new data set is ", len(hn))
Length of new data set is 25153
" Hey dude, hear this! The data set reduced from 293k to 25k! Can you believe the number of posts that didn't have at least 10 comments and points?! Are you sure you can do better than this?!" I shared my first finding with my friend.
"Oh wow! Now that is drastic! Well I think I can do better because I know if you 'Ask' or 'Show' something, the chances are that you get better responses. But I don't know which one is better."
"What do you mean?", I asked him.
"Well, if you go to Hacker News Website, you can see that there are two categories, 'Ask' and 'Show' and then there is all other posts. If you are posting on 'Ask' category, it will be given as "Ask HN: Topic name" and if it is on 'Show' category, it will be given as "Show HN: Topic name". Other categories are simply posted with their titles. So if you can find which of these categories are doing better, I am sure I can find some posts to share under that. Did you get it now?"
"Okay, thank you for that information. Now that makes it easier to categorise them. So let me go ahead and make another sets of list for these categories. It will make the analysis simpler, I believe. "
The startswith() function will be used to find if the string starts with given slice of string. Here in this case, if it starts with 'ask hn' or 'show hn'. To standardise the string, it will be converted into smaller case by using lower function.
ask_posts = [] # Creating an empty list to store ask_post rows
show_posts = [] # Creating an empty list to store show_post rows
other_posts = [] # Creating an empty list to store other_post rows
for row in hn: # Iterating through the new reduced hn data set
title=row[1]
title_lower=title.lower() # Finding the lower case version of title
if title_lower.startswith("ask hn"): # Checking if the title starts with the phrase 'ask hn'
ask_posts.append(row) # if it starts with 'ask hn' gets added to the ask_post list
elif title_lower.startswith('show hn'): # checking for titles starting with 'show hn'
show_posts.append(row) # adding titles with 'show hn' to show_posts list
else:
other_posts.append(row) # adding all other posts to other_post list
print("No. of Ask posts", len(ask_posts)) # printing the number of ask posts
print("No. of Show posts", len(show_posts)) # printing the number of show posts
print("No. of Other posts", len(other_posts)) # printing the number of other posts
print("Total length of Hacker News", len(hn)) # checking the lenght of reduced hn list
print("Total",len(ask_posts)+len(show_posts)+len(other_posts)) # comparing with the total to confirm
No. of Ask posts 1091 No. of Show posts 945 No. of Other posts 23117 Total length of Hacker News 25153 Total 25153
" Hey look, I have the initial numbers from the analysis. Looks like there are plenty of posts on other category. The reason for which is probably very obvious. But out of ask and show, 'Ask' category has more posts. i think it is a good idea to be specific and post in one of these categories rather than drowning in a big pile of all other post. So lets look more into Ask and Show category. But as of now it doesn't say which category attracts more comments and interactions. So I will check for the average comments and point on these two categories to see which one is better."
# Finding Average Comments and Points in ASK Category
total_ask_comments=0 # Setting total comments to 0
total_ask_points=0 # Setting total points to 0
for row in ask_posts: # Iterating through each row in Ask_post list
comment = int(row[4]) # assigning integer value of row[4] to comment
point = int(row[3]) # assigning integer value of row[3] to point
total_ask_comments += comment # adding the total comments in each iteration
total_ask_points += point # adding the total points in each iteration
avg_ask_comments = total_ask_comments/len(ask_posts) # finding average number of comments
avg_ask_points = total_ask_points/len(ask_posts) # finding average number of points
print("Total Comments for Ask Posts: ", total_ask_comments)
print("Average comment count for Ask posts: ", avg_ask_comments)
print(" Total points for Ask posts: ", total_ask_points )
print("Average points for Ask posts: ", avg_ask_points)
Total Comments for Ask Posts: 70559 Average comment count for Ask posts: 64.6736938588451 Total points for Ask posts: 75262 Average points for Ask posts: 68.98441796516957
# Finding Average Comments and Points in SHOW Category
total_show_comments=0 # Setting total comments to 0
total_show_points=0 # Setting total points to 0
for row in show_posts: # Iterating through each row in Show_post list
comment = int(row[4]) # assigning integer value of row[4] to comment
point = int(row[3]) # assigning integer value of row[3] to point
total_show_comments += comment # adding the total comments in each iteration
total_show_points += point # adding the total points in each iteration
avg_show_comments = total_show_comments/len(show_posts) # finding average number of comments
avg_show_points = total_show_points/len(show_posts) # finding average number of points
print("Total Comments for Show Posts: ", total_show_comments)
print("Average comment count for Show posts: ", avg_show_comments)
print(" Total points for Show posts: ", total_show_points )
print("Average points for Show posts: ", avg_show_points)
Total Comments for Show Posts: 38278 Average comment count for Show posts: 40.505820105820106 Total points for Show posts: 101455 Average points for Show posts: 107.35978835978835
" So here is the results after checking the averages. Average comment count for Ask posts is 64.67 and Average comment count for Show posts: 40.50. But interestingly Average points for Ask posts is 68.98 and Average points for Show posts: 107.35. Comments are better for Ask posts while points are better for the show posts. So I went to Hacker News website to read more about it. Looks like getting a number of comments are straight forward while point system follows an algorithm. Since you are looking at getting quick way to get recognition, I think we should focus on the average of comments rather than the points. Do you agree?"
"Yes, I think so too. Let me post some content that gets most interaction. If that gets me a better point well and good. But let us look at things that we have control over. So lets stick to comments or average of it. "
"Okay, in that case, let me analyse a bit deeper in the ask post list. There are author name and created time also given in the data. Maybe I can get something out of it too. "
"Yes please. Can you check what time is the best time to post? For example if there are more users online at a specific time, there will be more upvotes in that time. So if we can find out the best time to post, that will be great. "
"Sure, let me play around with the created_time data field and hopefully I will give you something interesting."
The datetime module is used to work with the date and time data provided. What I am going to do is to create a seperate list for
# Creating a dictionary with Hours as Key and Number of Comments as value.
import datetime as dt # importing datetime module to work with date & time given as strings
hourly_comment={} # Empty dictionary to store number of comments made in each hours
hourly_post= {} # Empty dictionary to store number of posts created in each hours
for row in ask_posts: # Iterating through each row of ask post list
time_string = row[6] # assigning row[6], "created at" to time given as string
comments_number = int(row[4]) # Assigning the integer value of number of comments to a variable
# Converting time given in String to datetime object. Sample time is given as 9/26/2016 3:24
# This is in the format of month/date/4 digit Year Hour: Minute --> %m/%d/%Y %H:%M
converted_time = dt.datetime.strptime(time_string,"%m/%d/%Y %H:%M")
hour_posted = converted_time.strftime("%H") # Assigning datetime object in Hour format; eg '13' for 1PM
# Creating a frequency table using Dictionary
if hour_posted in hourly_comment: # if hour_posted is already presnent in the dictionary
hourly_comment[hour_posted] += comments_number # Add up the comments as value of 'hour' key
hourly_post[hour_posted] += 1 # Add 1 to number of posts made in that hour
else: # if it is not present in the dictionary
hourly_comment[hour_posted] = 1 # Assign the value as 1 corresponding to the 'hour' key
hourly_post[hour_posted] = 1 # Assign the value as 1 corresponding to the hour key
print("Here is the Number of posts in each hour \n \n", hourly_post)
print("\n Here is the Number of comments in each hour \n \n", hourly_comment)
Here is the Number of posts in each hour {'19': 62, '15': 103, '09': 18, '20': 52, '17': 58, '14': 66, '11': 33, '23': 39, '13': 64, '02': 37, '21': 51, '16': 62, '07': 26, '06': 31, '00': 29, '03': 38, '04': 25, '22': 50, '10': 44, '12': 55, '18': 69, '08': 29, '01': 32, '05': 18} Here is the Number of comments in each hour {'19': 2513, '15': 17124, '09': 832, '20': 2530, '17': 3968, '14': 3639, '11': 1630, '23': 1261, '13': 5980, '02': 2022, '21': 2997, '16': 3001, '07': 1037, '06': 949, '00': 1372, '03': 1403, '04': 1611, '22': 2336, '10': 2213, '12': 3234, '18': 3222, '08': 1639, '01': 1232, '05': 1139}
Now that I have two dictionaries with number of comments and number of posts against each hours, I can straight away find the average number of comments posted in each hours.
A list will be created to list down the hour and average number of post in each hour. For the ease of sorting we will keep the first column as the average number of posts per hour.
avg_by_hour = [] # Creating an empty list to add avg comments per post
for key in hourly_comment: # Iterating through each key of the dictionary
total_comments = hourly_comment[key] # Assigning comments using dictionary[key]
number_of_posts = hourly_post[key] # Assigning number of posts using dictionary[key]
avg_comment_per_post = total_comments/number_of_posts # finding the average
avg_by_hour.append([avg_comment_per_post, key]) # Appending the average to the list
print("The list of Average comments received per post in each hour is below \n \n", avg_by_hour)
The list of Average comments received per post in each hour is below [[40.53225806451613, '19'], [166.25242718446603, '15'], [46.22222222222222, '09'], [48.65384615384615, '20'], [68.41379310344827, '17'], [55.13636363636363, '14'], [49.39393939393939, '11'], [32.333333333333336, '23'], [93.4375, '13'], [54.648648648648646, '02'], [58.76470588235294, '21'], [48.403225806451616, '16'], [39.88461538461539, '07'], [30.612903225806452, '06'], [47.310344827586206, '00'], [36.921052631578945, '03'], [64.44, '04'], [46.72, '22'], [50.29545454545455, '10'], [58.8, '12'], [46.69565217391305, '18'], [56.51724137931034, '08'], [38.5, '01'], [63.27777777777778, '05']]
Now that I have a list of average comments per post per each hour, I can sort it to see at what hour most number of comments were made per post. To sort the list, now I am going to use sorted() function in descending order.
sorted_avg = sorted(avg_by_hour, reverse=True) # Using sorting function to sort in descending order
print ("The sorted averages per post per hour is here \n \n")
sorted_avg
The sorted averages per post per hour is here
[[166.25242718446603, '15'], [93.4375, '13'], [68.41379310344827, '17'], [64.44, '04'], [63.27777777777778, '05'], [58.8, '12'], [58.76470588235294, '21'], [56.51724137931034, '08'], [55.13636363636363, '14'], [54.648648648648646, '02'], [50.29545454545455, '10'], [49.39393939393939, '11'], [48.65384615384615, '20'], [48.403225806451616, '16'], [47.310344827586206, '00'], [46.72, '22'], [46.69565217391305, '18'], [46.22222222222222, '09'], [40.53225806451613, '19'], [39.88461538461539, '07'], [38.5, '01'], [36.921052631578945, '03'], [32.333333333333336, '23'], [30.612903225806452, '06']]
From this list I can say that 3PM, 1PM, 5PM are the top 3 hours to post to generate most number of comments according to this data. Now I have to tell my friend about this. But let me write some code to generate a report kind of thing to send to my friend and I can pretend like I am a programmer in front of him(at least!) So I am going to use some formatting techinques to impress him.
for row in sorted_avg[:5]: # iterating through the first five averages
time = dt.datetime.strptime(row[1],"%H").strftime("%I %p") # converting time string to datetime object
# %I shows Hour in 12 hour format, %p shows AM or PM
avg = row[0] # Assigning avg from list
# Printing using .format method. Argument this_time is stored at time and this_avg stored at avg
# this_avg is formatted to have 2 decimal points
print(" If you post at {this_time}, you have a chance of getting {this_avg:,.2f} \n"
.format(this_time=time,this_avg=avg))
# Printing our conclusion
print("So the best time to post to get good traction is {} EST".format(
dt.datetime.strptime(sorted_avg[0][1],"%H").strftime("%I %p")))
If you post at 03 PM, you have a chance of getting 166.25 If you post at 01 PM, you have a chance of getting 93.44 If you post at 05 PM, you have a chance of getting 68.41 If you post at 04 AM, you have a chance of getting 64.44 If you post at 05 AM, you have a chance of getting 63.28 So the best time to post to get good traction is 03 PM EST
" Hey bro, I think I have come to conclusion regarding what time to post. From the data we analysed, which is the data collected during 2016, we collected a subset of the data which has more than 10 comments and 10 points.
From that data set I can tell you that posting under Ask HN category can create a better engangement which leads to more number of comments.
But if you want more comments instantly, I have listed down some better time to post. 3PM, 1PM, 5 M, 4AM and 5AM are those times. Out of this 3PM is the best time according to our analysis with 166.25 as the average comments per post per hour.
So in short if you want to become popular on Hacker News, you need to find posts that you can submit on Ask HN Category and post them at 3PM EST and cross your fingers. If it works out, show some gratitude!"
"Wow, that is a great news. Thanks a lot brother. Let me try that trick in the next possible opportunity. And yes I will definitely show my gratitude soon."
"Okay, that is great. Anyway I will be working on this data a bit more and try to find if there is any connection with the authors, if someone is doing better than others, if so how and so on. So hopefully you will hear from me soon again. Till then bye and thank you for motivating me to do such a project. This was fun."
In the mean time I got really interested in it and started digging deep to find some kind of correlation between the authors and number of comments. So I created a dictionary to find the comment distribution among the authors.
# Creating a Author - Comment distribution
authors={} # Creating an empty dictionary to store the values
for row in ask_posts: # iterating through ask_posts
name=row[5] # assigning the name of the authors
comment=int(row[4]) # assigning the number of comments
if name in authors: # Checking if authors name is present in the dictionary
authors[name] += comment # if it is present add the number of comments received for that title
else: # if not
authors[name] = comment # assign the first number of comments associated with that name
author_list=[] # Creating a list with author names and comments
for name in authors: # Iterating through author dictionary
author_list.append([authors[name], name]) # append the empty list with author and no. of comments
sorted_list=sorted(author_list, reverse=True) # Sort the list in descending order
print(sorted_list[:10]) # Print the first 10 authors with highest no of comments
[[12892, 'whoishiring'], [868, 'mod50ack'], [767, 'throw94'], [718, 'barefootcoder'], [691, 'boren_ave11'], [650, 'gtirloni'], [648, 'sebg'], [602, 'sama'], [581, 'milfseriously'], [571, 'dang']]
Now that we have a list of authors with highest number of comments, let me go through each author and find at least 10 titles of their posts and corresponding comments they received. First 5 author's will be analysed to see if anything can be deduced from it.
# Printing the details of the author with highest comments
count=0
for row in ask_posts: # Iterating through each row of ask_posts
if (row[5]=='whoishiring' and count<11): # finding the author and limiting it to 10 entries.
count+=1
print( "Title :",row[1], "\n", "No. of comments: ", row[4]) #Printing Title : No of comments
Title : Ask HN: Who wants to be hired? (September 2016) No. of comments: 166 Title : Ask HN: Freelancer? Seeking freelancer? (September 2016) No. of comments: 85 Title : Ask HN: Who is hiring? (September 2016) No. of comments: 910 Title : Ask HN: Who wants to be hired? (August 2016) No. of comments: 118 Title : Ask HN: Freelancer? Seeking freelancer? (August 2016) No. of comments: 127 Title : Ask HN: Who is hiring? (August 2016) No. of comments: 947 Title : Ask HN: Who wants to be hired? (July 2016) No. of comments: 210 Title : Ask HN: Freelancer? Seeking freelancer? (July 2016) No. of comments: 81 Title : Ask HN: Who is hiring? (July 2016) No. of comments: 898 Title : Ask HN: Who wants to be hired? (June 2016) No. of comments: 250 Title : Ask HN: Freelancer? Seeking freelancer? (June 2016) No. of comments: 200
# Printing author with 2nd highest comments
count=0
for row in ask_posts:
if (row[5]=='mod50ack' and count<11):
count+=1
print( "Title :",row[1], "\n", "No. of comments: ", row[4])
Title : Ask HN: What's the best tool you used to use that doesn't exist anymore? No. of comments: 868
# Printing author with 3rd highest comments
count=0
for row in ask_posts:
if (row[5]=='throw94' and count<11):
count+=1
print( "Title :",row[1], "\n", "No. of comments: ", row[4])
Title : Ask HN: What was your why didn't I start doing this sooner moment? No. of comments: 767
# Printing author with 4th highest comments
count=0
for row in ask_posts:
if (row[5]=='barefootcoder' and count<11):
count+=1
print( "Title :",row[1], "\n", "No. of comments: ", row[4])
Title : Ask HN: Is web programming a series of hacks on hacks? No. of comments: 660 Title : Ask HN: How do you find a good corp-to-corp tech recruiter? No. of comments: 58
# Printing author with 5th highest comments
count=0
for row in ask_posts:
if (row[5]=='boren_ave11' and count<11):
count+=1
print( "Title :",row[1], "\n", "No. of comments: ", row[4])
Title : Ask HN: How much do you make at Amazon? Here is how much I make at Amazon No. of comments: 691
From this analysis it came to my understanding that by consistently posting about Hiring related questions, the author 'whoishiring' received the top number of comments. So talking about Hiring process can be a good idea. But after going through Hacker News portal, I understood that this is a periodic post created by team behind Hacker News to help with recruitment process. So maybe this is not where my friend should focus.
Second highest commented post asks about Tools, third one about general life decision, fourth one about programming languages and the fifth one making profits from Amazon. So there are a few things I could possibly infer from these topics.
If one can find simple, genuine yet curious, at times controversial topics that can create emotions in people, which also have connection with the technical fraternity, I think they can create some popular post on Hacker News channel. That is what I am able to infer from this data set.
Now I have to write this all in an email and send to my friend. What a Sunday it was! Such a fun day! All thanks to Hacker News data set!