Notebook

A Study on 'Hacker News' Posting Patterns¶

"Hey, I am going to become a Hacker!"

"You going to become what?", waking up from my slumber, I asked with a tinge of irritation.

"A hacker and I need your help with that.", he replied enthusiastically.

"You called me to say this on a Sunday morning?! Is that one of your new impulsive half-hearted projects?! But whatever it might be, I won't be much of a help. I'm no coding expert to guide you on this journey of yours. So bye, let me go back to sleep. "

"Oh wait, wait. You can help me. Didn't you tell me last week that you have been studying data analysis. All you have to do is help me to find out how to I can create a popular post on Hacker News. You know its a social news website where people upvote good stuff related to hacking and other things that generates curiosity. If you can help me find a topic that I can submit and come on top of their first page, I think I'm as good as a hacker. This is my shortcut to become a hacker. Now tell me that you will help me. Please."

The idea of using data analysis to find the popular post topic kind of excited me. At least it will be a good practice on my data quest!

"Okay, so you are not going to be a hacker. You just want some eyeballs on you by posting a popular story on that website. I think I can help you with that. But before giving you any hope, let me see if I can find any dataset for this purpose. Without data, there is no data analysis. So hang in there while I look for the data. I'll call you later."

"Okay man, thank you for this. I owe you!"

"Okay, 'Hacker'. Bye."

Finding the Data set¶

My Sunday morning sleep-in has quickly made way to some googlin instead. After a quick search I found a data set on Kaggle.

'Such a lucky bugger, he is'. I thought. Since I found the data set, the project is on.

The Description of data set says

This data set is Hacker News posts from the last 12 months (up to September 26 2016).

It includes the following columns:

title: title of the post (self explanatory)
url: the url of the item being linked to
num_points: the number of upvotes the post received
num_comments: the number of comments the post received
author: the name of the account that made the post
created_at: the date and time the post was made (the time zone is Eastern Time in the US)

Let me open the data set now.

Opening the Data Set¶

In [1]:

from csv import reader                               # importing CSV module to read the .csv file

opened_file=open(r"\Users\Surface GO\Downloads\HN_posts_year_to_Sep_26_2016.csv", encoding='UTF-8')

read_file=reader(opened_file)

hacker_news_full=list(read_file)                    # The whole data set in List form

hn_header=hacker_news_full[0]                       # Getting the header of the data set

hn_full=hacker_news_full[1:]                             # The whole data set without Header

print("The number of rows in data set: ", len(hn_full) )

hn_full[:2]

The number of rows in data set:  293119

Out[1]:

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24']]

So there are 293199 rows in the current data set. I need to remove the unwanted rows. But how do I find the unwanted rows? Well, I think I need a discussion now. . .

"Hey lucky boy. I found the data set for you. Now you need to help me with something. So you want to post popular posts, right? What if I remove posts that have less than 10 comments from this data. Do you think that makes sense?", I asked

"Man, you know it better. So do what you feel like doing. But I do think that you can remove posts with less than 10 comments. I am looking at at least 3 digit numbers, you know!", I could sense that he was being cheeky, but then he continued.

"Maybe you can also look at the point. They usually show the points along with the comments. So can you apply the same logic with points too?", he asked.

"Okay, let me try first. So I am going to create a new list that contains titles that have more than 10 comment and more than 10 points. But let me first figure out the index and column names."

Function to find Index along with Column name¶

In [2]:

def indexer(dataset_header):         # Input the dataset header to retrieve its index
    index=0
    print("Index")                   #### is there a way to print the variable name used? (eg. index of appstore)
    for column_name in dataset_header:      #finding each column name from the header row
        print(index, ' : ', column_name)    
        index+=1
    print('\n')
    
indexer(hn_header)

Index
0  :  id
1  :  title
2  :  url
3  :  num_points
4  :  num_comments
5  :  author
6  :  created_at

"Okay, so there is 'num-points' at index 3 and 'num_comments' at index 4. Now I am going to find out how many rows of these data set satify the conditions of having more than 10 comments and more than 10 points."

Removing rows with less number of comments and points¶

In [3]:

hn=[]                                          # Creating an empty list to store new data set

for row in hn_full:                            # Iterating through each rows of the whole data set
    comments=int(row[4])
    points=int(row[3])
    if comments >10 and points>10:             # Applying the condition to create a new list
        hn.append(row)
        
print("Length of new data set is ", len(hn))

Length of new data set is  25153

" Hey dude, hear this! The data set reduced from 293k to 25k! Can you believe the number of posts that didn't have at least 10 comments and points?! Are you sure you can do better than this?!" I shared my first finding with my friend.

"Oh wow! Now that is drastic! Well I think I can do better because I know if you 'Ask' or 'Show' something, the chances are that you get better responses. But I don't know which one is better."

"What do you mean?", I asked him.

"Well, if you go to Hacker News Website, you can see that there are two categories, 'Ask' and 'Show' and then there is all other posts. If you are posting on 'Ask' category, it will be given as "Ask HN: Topic name" and if it is on 'Show' category, it will be given as "Show HN: Topic name". Other categories are simply posted with their titles. So if you can find which of these categories are doing better, I am sure I can find some posts to share under that. Did you get it now?"

"Okay, thank you for that information. Now that makes it easier to categorise them. So let me go ahead and make another sets of list for these categories. It will make the analysis simpler, I believe. "

Creating Seperate list for Ask, Show and Other posts¶

The startswith() function will be used to find if the string starts with given slice of string. Here in this case, if it starts with 'ask hn' or 'show hn'. To standardise the string, it will be converted into smaller case by using lower function.

In [4]:

ask_posts = []                             # Creating an empty list to store ask_post rows
show_posts = []                             # Creating an empty list to store show_post rows
other_posts = []                             # Creating an empty list to store other_post rows


for row in hn:                                  # Iterating through the new reduced hn data set
    title=row[1]
    title_lower=title.lower()                    # Finding the lower case version of title
    
    if title_lower.startswith("ask hn"):            # Checking if the title starts with the phrase 'ask hn'
        ask_posts.append(row)                        # if it starts with 'ask hn' gets added to the ask_post list
    elif title_lower.startswith('show hn'):           # checking for titles starting with 'show hn'
        show_posts.append(row)                         # adding titles with 'show hn' to show_posts list
    else:
        other_posts.append(row)                          # adding all other posts to other_post list
        
        
print("No. of Ask posts", len(ask_posts))                   # printing the number of ask posts
print("No. of Show posts", len(show_posts))                  # printing the number of show posts
print("No. of Other posts", len(other_posts))                 # printing the number of other posts
print("Total length of Hacker News", len(hn))                  # checking the lenght of reduced hn list
print("Total",len(ask_posts)+len(show_posts)+len(other_posts))  # comparing with the total to confirm

No. of Ask posts 1091
No. of Show posts 945
No. of Other posts 23117
Total length of Hacker News 25153
Total 25153

" Hey look, I have the initial numbers from the analysis. Looks like there are plenty of posts on other category. The reason for which is probably very obvious. But out of ask and show, 'Ask' category has more posts. i think it is a good idea to be specific and post in one of these categories rather than drowning in a big pile of all other post. So lets look more into Ask and Show category. But as of now it doesn't say which category attracts more comments and interactions. So I will check for the average comments and point on these two categories to see which one is better."

Finding the Average Number of Comments and Points¶

In [5]:

# Finding Average Comments and Points in ASK Category


total_ask_comments=0                 # Setting total comments to 0
total_ask_points=0                     # Setting total points to 0

for row in ask_posts:                     # Iterating through each row in Ask_post list
    comment = int(row[4])                   # assigning integer value of row[4] to comment
    point = int(row[3])                       # assigning integer value of row[3] to point
    
    total_ask_comments += comment                # adding the total comments in each iteration
    total_ask_points += point                      # adding the total points in each iteration
    

avg_ask_comments = total_ask_comments/len(ask_posts)     # finding average number of comments
avg_ask_points = total_ask_points/len(ask_posts)           # finding average number of points

print("Total Comments for Ask Posts: ", total_ask_comments)
print("Average comment count for Ask posts: ", avg_ask_comments)  
print(" Total points for Ask posts: ", total_ask_points )
print("Average points for Ask posts: ", avg_ask_points)
    

Total Comments for Ask Posts:  70559
Average comment count for Ask posts:  64.6736938588451
 Total points for Ask posts:  75262
Average points for Ask posts:  68.98441796516957

In [6]:

# Finding Average Comments and Points in SHOW Category


total_show_comments=0                 # Setting total comments to 0
total_show_points=0                     # Setting total points to 0

for row in show_posts:                     # Iterating through each row in Show_post list
    comment = int(row[4])                   # assigning integer value of row[4] to comment
    point = int(row[3])                       # assigning integer value of row[3] to point
    
    total_show_comments += comment                # adding the total comments in each iteration
    total_show_points += point                      # adding the total points in each iteration
    

avg_show_comments = total_show_comments/len(show_posts)     # finding average number of comments
avg_show_points = total_show_points/len(show_posts)           # finding average number of points

print("Total Comments for Show Posts: ", total_show_comments)
print("Average comment count for Show posts: ", avg_show_comments)  
print(" Total points for Show posts: ", total_show_points )
print("Average points for Show posts: ", avg_show_points)
    

Total Comments for Show Posts:  38278
Average comment count for Show posts:  40.505820105820106
 Total points for Show posts:  101455
Average points for Show posts:  107.35978835978835

" So here is the results after checking the averages. Average comment count for Ask posts is 64.67 and Average comment count for Show posts: 40.50. But interestingly Average points for Ask posts is 68.98 and Average points for Show posts: 107.35. Comments are better for Ask posts while points are better for the show posts. So I went to Hacker News website to read more about it. Looks like getting a number of comments are straight forward while point system follows an algorithm. Since you are looking at getting quick way to get recognition, I think we should focus on the average of comments rather than the points. Do you agree?"

"Yes, I think so too. Let me post some content that gets most interaction. If that gets me a better point well and good. But let us look at things that we have control over. So lets stick to comments or average of it. "

"Okay, in that case, let me analyse a bit deeper in the ask post list. There are author name and created time also given in the data. Maybe I can get something out of it too. "

"Yes please. Can you check what time is the best time to post? For example if there are more users online at a specific time, there will be more upvotes in that time. So if we can find out the best time to post, that will be great. "

"Sure, let me play around with the created_time data field and hopefully I will give you something interesting."

Finding the best time to post for more comments¶

The datetime module is used to work with the date and time data provided. What I am going to do is to create a seperate list for

In [7]:

# Creating a dictionary with Hours as Key and Number of Comments as value.

import datetime as dt          # importing datetime module to work with date & time given as strings

hourly_comment={}                # Empty dictionary to store number of comments made in each hours
hourly_post= {}                   # Empty dictionary to store number of posts created in each hours

for row in ask_posts:              # Iterating through each row of ask post list
    time_string = row[6]            # assigning row[6], "created at" to time given as string
    comments_number = int(row[4])    # Assigning the integer value of number of comments to a variable
    
        # Converting time given in String to datetime object. Sample time is given as 9/26/2016 3:24
        # This is in the format of month/date/4 digit Year Hour: Minute --> %m/%d/%Y %H:%M
    
    converted_time = dt.datetime.strptime(time_string,"%m/%d/%Y %H:%M")
    hour_posted = converted_time.strftime("%H")   # Assigning datetime object in Hour format; eg '13' for 1PM
    
        # Creating a frequency table using Dictionary
    
    if hour_posted in hourly_comment:                     # if hour_posted is already presnent in the dictionary
        hourly_comment[hour_posted] += comments_number     # Add up the comments as value of 'hour' key
        hourly_post[hour_posted] += 1                       # Add 1 to number of posts made in that hour
        
    else:                                                  # if it is not present in the dictionary
        hourly_comment[hour_posted] = 1                   # Assign the value as 1 corresponding to the 'hour' key
        hourly_post[hour_posted] = 1                    # Assign the value as 1 corresponding to the hour key
        
print("Here is the Number of posts in each hour \n \n", hourly_post)
print("\n Here is the Number of comments in each hour \n \n", hourly_comment)

Here is the Number of posts in each hour 
 
 {'19': 62, '15': 103, '09': 18, '20': 52, '17': 58, '14': 66, '11': 33, '23': 39, '13': 64, '02': 37, '21': 51, '16': 62, '07': 26, '06': 31, '00': 29, '03': 38, '04': 25, '22': 50, '10': 44, '12': 55, '18': 69, '08': 29, '01': 32, '05': 18}

 Here is the Number of comments in each hour 
 
 {'19': 2513, '15': 17124, '09': 832, '20': 2530, '17': 3968, '14': 3639, '11': 1630, '23': 1261, '13': 5980, '02': 2022, '21': 2997, '16': 3001, '07': 1037, '06': 949, '00': 1372, '03': 1403, '04': 1611, '22': 2336, '10': 2213, '12': 3234, '18': 3222, '08': 1639, '01': 1232, '05': 1139}

Now that I have two dictionaries with number of comments and number of posts against each hours, I can straight away find the average number of comments posted in each hours.

Finding the Average Comments per post in Each Hour¶

A list will be created to list down the hour and average number of post in each hour. For the ease of sorting we will keep the first column as the average number of posts per hour.

In [8]:

avg_by_hour = []                                       # Creating an empty list to add avg comments per post

for key in hourly_comment:                                # Iterating through each key of the dictionary
    total_comments = hourly_comment[key]                    # Assigning comments using dictionary[key]
    number_of_posts = hourly_post[key]                        # Assigning number of posts using dictionary[key]
    
    avg_comment_per_post = total_comments/number_of_posts        # finding the average
    
    avg_by_hour.append([avg_comment_per_post, key])                  # Appending the average to the list
    
print("The list of Average comments received per post in each hour is below \n \n", avg_by_hour)

The list of Average comments received per post in each hour is below 
 
 [[40.53225806451613, '19'], [166.25242718446603, '15'], [46.22222222222222, '09'], [48.65384615384615, '20'], [68.41379310344827, '17'], [55.13636363636363, '14'], [49.39393939393939, '11'], [32.333333333333336, '23'], [93.4375, '13'], [54.648648648648646, '02'], [58.76470588235294, '21'], [48.403225806451616, '16'], [39.88461538461539, '07'], [30.612903225806452, '06'], [47.310344827586206, '00'], [36.921052631578945, '03'], [64.44, '04'], [46.72, '22'], [50.29545454545455, '10'], [58.8, '12'], [46.69565217391305, '18'], [56.51724137931034, '08'], [38.5, '01'], [63.27777777777778, '05']]

Sorting the Average comments per post per hour¶

Now that I have a list of average comments per post per each hour, I can sort it to see at what hour most number of comments were made per post. To sort the list, now I am going to use sorted() function in descending order.

In [9]:

sorted_avg = sorted(avg_by_hour, reverse=True)                 # Using sorting function to sort in descending order
print ("The sorted averages per post per hour is here \n \n")
sorted_avg

The sorted averages per post per hour is here

Out[9]:

[[166.25242718446603, '15'],
 [93.4375, '13'],
 [68.41379310344827, '17'],
 [64.44, '04'],
 [63.27777777777778, '05'],
 [58.8, '12'],
 [58.76470588235294, '21'],
 [56.51724137931034, '08'],
 [55.13636363636363, '14'],
 [54.648648648648646, '02'],
 [50.29545454545455, '10'],
 [49.39393939393939, '11'],
 [48.65384615384615, '20'],
 [48.403225806451616, '16'],
 [47.310344827586206, '00'],
 [46.72, '22'],
 [46.69565217391305, '18'],
 [46.22222222222222, '09'],
 [40.53225806451613, '19'],
 [39.88461538461539, '07'],
 [38.5, '01'],
 [36.921052631578945, '03'],
 [32.333333333333336, '23'],
 [30.612903225806452, '06']]

Making a short Report¶

From this list I can say that 3PM, 1PM, 5PM are the top 3 hours to post to generate most number of comments according to this data. Now I have to tell my friend about this. But let me write some code to generate a report kind of thing to send to my friend and I can pretend like I am a programmer in front of him(at least!) So I am going to use some formatting techinques to impress him.

In [10]:

for row in sorted_avg[:5]:                                         # iterating through the first five averages
    
    time = dt.datetime.strptime(row[1],"%H").strftime("%I %p")     # converting time string to datetime object
        # %I shows Hour in 12 hour format, %p shows AM or PM
    
    avg = row[0]                                                   # Assigning avg from list
    
        # Printing using .format method. Argument this_time is stored at time and this_avg stored at avg
        # this_avg is formatted to have 2 decimal points
    
    print(" If you post at {this_time}, you have a chance of getting {this_avg:,.2f} \n"
          .format(this_time=time,this_avg=avg))
    
   
    # Printing our conclusion
    
print("So the best time to post to get good traction is {} EST".format(
    dt.datetime.strptime(sorted_avg[0][1],"%H").strftime("%I %p")))

 If you post at 03 PM, you have a chance of getting 166.25 

 If you post at 01 PM, you have a chance of getting 93.44 

 If you post at 05 PM, you have a chance of getting 68.41 

 If you post at 04 AM, you have a chance of getting 64.44 

 If you post at 05 AM, you have a chance of getting 63.28 

So the best time to post to get good traction is 03 PM EST

Mini Conclusion¶

" Hey bro, I think I have come to conclusion regarding what time to post. From the data we analysed, which is the data collected during 2016, we collected a subset of the data which has more than 10 comments and 10 points.

From that data set I can tell you that posting under Ask HN category can create a better engangement which leads to more number of comments.

But if you want more comments instantly, I have listed down some better time to post. 3PM, 1PM, 5 M, 4AM and 5AM are those times. Out of this 3PM is the best time according to our analysis with 166.25 as the average comments per post per hour.

So in short if you want to become popular on Hacker News, you need to find posts that you can submit on Ask HN Category and post them at 3PM EST and cross your fingers. If it works out, show some gratitude!"

"Wow, that is a great news. Thanks a lot brother. Let me try that trick in the next possible opportunity. And yes I will definitely show my gratitude soon."

"Okay, that is great. Anyway I will be working on this data a bit more and try to find if there is any connection with the authors, if someone is doing better than others, if so how and so on. So hopefully you will hear from me soon again. Till then bye and thank you for motivating me to do such a project. This was fun."

Relation between Authors and number of comments¶

In the mean time I got really interested in it and started digging deep to find some kind of correlation between the authors and number of comments. So I created a dictionary to find the comment distribution among the authors.

In [11]:

# Creating a Author - Comment distribution

authors={}                         # Creating an empty dictionary to store the values
for row in ask_posts:               # iterating through ask_posts
    name=row[5]                      # assigning the name of the authors
    comment=int(row[4])               # assigning the number of comments
    if name in authors:                # Checking if authors name is present in the dictionary
        authors[name] += comment        # if it is present add the number of comments received for that title
    else:                                # if not
        authors[name] = comment           # assign the first number of comments associated with that name
        


author_list=[]                                 # Creating a list with author names and comments

for name in authors:                             # Iterating through author dictionary
    author_list.append([authors[name], name])     # append the empty list with author and no. of comments
    

sorted_list=sorted(author_list, reverse=True)        # Sort the list in descending order
print(sorted_list[:10])                                # Print the first 10 authors with highest no of comments

[[12892, 'whoishiring'], [868, 'mod50ack'], [767, 'throw94'], [718, 'barefootcoder'], [691, 'boren_ave11'], [650, 'gtirloni'], [648, 'sebg'], [602, 'sama'], [581, 'milfseriously'], [571, 'dang']]

Finding Titles and corresponding comments¶

Now that we have a list of authors with highest number of comments, let me go through each author and find at least 10 titles of their posts and corresponding comments they received. First 5 author's will be analysed to see if anything can be deduced from it.

In [12]:

# Printing the details of the author with highest comments

count=0

for row in ask_posts:                                                # Iterating through each row of ask_posts    
    if (row[5]=='whoishiring' and count<11):                         # finding the author and limiting it to 10 entries.
        count+=1
        print( "Title :",row[1], "\n", "No. of comments: ", row[4])  #Printing Title : No of comments
                                                    

Title : Ask HN: Who wants to be hired? (September 2016) 
 No. of comments:  166
Title : Ask HN: Freelancer? Seeking freelancer? (September 2016) 
 No. of comments:  85
Title : Ask HN: Who is hiring? (September 2016) 
 No. of comments:  910
Title : Ask HN: Who wants to be hired? (August 2016) 
 No. of comments:  118
Title : Ask HN: Freelancer? Seeking freelancer? (August 2016) 
 No. of comments:  127
Title : Ask HN: Who is hiring? (August 2016) 
 No. of comments:  947
Title : Ask HN: Who wants to be hired? (July 2016) 
 No. of comments:  210
Title : Ask HN: Freelancer? Seeking freelancer? (July 2016) 
 No. of comments:  81
Title : Ask HN: Who is hiring? (July 2016) 
 No. of comments:  898
Title : Ask HN: Who wants to be hired? (June 2016) 
 No. of comments:  250
Title : Ask HN: Freelancer? Seeking freelancer? (June 2016) 
 No. of comments:  200

In [13]:

# Printing author with 2nd highest comments

count=0
for row in ask_posts:
    
    if (row[5]=='mod50ack' and count<11):
        count+=1
        print( "Title :",row[1], "\n", "No. of comments: ", row[4])
    

Title : Ask HN: What's the best tool you used to use that doesn't exist anymore? 
 No. of comments:  868

In [14]:

# Printing author with 3rd highest comments

count=0
for row in ask_posts:
    
    if (row[5]=='throw94' and count<11):
        count+=1
        print( "Title :",row[1], "\n", "No. of comments: ", row[4])

Title : Ask HN: What was your why didn't I start doing this sooner moment? 
 No. of comments:  767

In [15]:

# Printing author with 4th highest comments

count=0
for row in ask_posts:
    
    if (row[5]=='barefootcoder' and count<11):
        count+=1
        print( "Title :",row[1], "\n", "No. of comments: ", row[4])

Title : Ask HN: Is web programming a series of hacks on hacks? 
 No. of comments:  660
Title : Ask HN: How do you find a good corp-to-corp tech recruiter? 
 No. of comments:  58

In [16]:

# Printing author with 5th highest comments

count=0
for row in ask_posts:
    
    if (row[5]=='boren_ave11' and count<11):
        count+=1
        print( "Title :",row[1], "\n", "No. of comments: ", row[4])

Title : Ask HN: How much do you make at Amazon? Here is how much I make at Amazon 
 No. of comments:  691

More Conclusions¶

From this analysis it came to my understanding that by consistently posting about Hiring related questions, the author 'whoishiring' received the top number of comments. So talking about Hiring process can be a good idea. But after going through Hacker News portal, I understood that this is a periodic post created by team behind Hacker News to help with recruitment process. So maybe this is not where my friend should focus.

Second highest commented post asks about Tools, third one about general life decision, fourth one about programming languages and the fifth one making profits from Amazon. So there are a few things I could possibly infer from these topics.

Ask a simple question that everyone can relate, and people will responde, for eg about an obsolete tool. Maybe talk about MS Paint and some people might get all nostalgic while others talk about it being completely obsolete.
Ask about something that can make people curious, such as some one's online earning from Amazon or blog etc
Ask about things that everyone has some kind of opinion, for eg what would you tell your younger self
Ask about any controversial topic in the world of programming, might create an engaging discussion.

If one can find simple, genuine yet curious, at times controversial topics that can create emotions in people, which also have connection with the technical fraternity, I think they can create some popular post on Hacker News channel. That is what I am able to infer from this data set.

Now I have to write this all in an email and send to my friend. What a Sunday it was! Such a fun day! All thanks to Hacker News data set!