Analyzing Hacker News posts and their popularity

The goal of this project is to find out which type of post receives more comments on average: Ask HN or Show HN. We also check whether posts created at certain hours of the day receive more comments on average.

We are going to use a CSV file taken from an open Kaggle dataset, which can be found here. First, let's import the libraries we'll need and read the file.

1 Importing and reading the data

In [2]:
# Importing csv reader
from csv import reader

# Opening the csv file and reading it into a list of lists
with open('/home/nathalia/Documents/2 data science/6 DataQuest Projects/2 Exploring Hacker News Posts/hacker_news.csv') as csv_file:
    hn = list(reader(csv_file))

# Printing the header row and the first five data rows
print(hn[:6])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

Now we're going to store the headers in a separate list so we can analyze the data more freely, but without losing our reference.

In [3]:
# Storing the headers in another list
headers = hn[0]

# Removing the first row of hn
hn = hn[1:]

# Checking the result
print(headers)
print("\n")
print(hn[:2])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
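
The reading and header-splitting steps above could also be wrapped in one small function. The sketch below is only an illustration: `read_rows` is a hypothetical helper name, and the in-memory sample stands in for the real hacker_news.csv so the example runs on its own.

```python
import csv
import io


def read_rows(file_obj):
    """Return (header, rows) from an already opened CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Assumption: a one-row in-memory sample replaces the real file
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

header, rows = read_rows(sample)
print(header)
print(rows[0])
```

With the real file, the same function would be called inside a `with open(...)` block, so the file is closed as soon as the rows are read.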

2 Analyzing which type of post has more comments

The next step consists of keeping only the posts that are relevant to our analysis, since it is only about Ask HN and Show HN posts.

In [4]:
# Creating the lists to store the data
ask_posts = []
show_posts = []
other_posts = []

# Populating the lists
for row in hn:
    title = row[1].lower()     # Index 1 holds the title (index 0 holds the 'id')
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Checking if everything worked fine
a = len(ask_posts)
b = len(show_posts)
c = len(other_posts)
print(f'ask_posts length: {a} \nshow_posts length: {b} \nother_posts length: {c}')
    
ask_posts length: 1744 
show_posts length: 1162 
other_posts length: 17194
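
The title check used above can be isolated in a small function, which makes it easy to try on a few titles before running it over the whole dataset. This is a minimal sketch: `classify` is a hypothetical name, and the first two sample titles are made up for illustration.

```python
def classify(title):
    """Label a post as 'ask', 'show' or 'other' based on how its title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Assumption: the first two titles are invented examples
sample_titles = [
    'Ask HN: How do you learn new things?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]

labels = [classify(title) for title in sample_titles]
print(labels)
```

Lowercasing the title first means that variations such as 'ASK HN' or 'Show hn' are matched too.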

It seems everything worked fine. Now let's see whether ask posts or show posts receive more comments on average.

In [5]:
# Creating the variable to store the ask comments values
total_ask_comments = 0

# Populating the total_ask_comments variable
for item in ask_posts:
    n_comments_ask = int(item[4])
    total_ask_comments += n_comments_ask
    
# Calculating the average number of comments on ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
14.038417431192661
In [6]:
# Creating the variable to store the show comments values
total_show_comments = 0
    
# Populating the total_show_comments
for item in show_posts:
    n_comments_show = int(item[4])
    total_show_comments += n_comments_show

# Calculating the average of show comments    
avg_show_comments = total_show_comments/(len(show_posts))
print(avg_show_comments)
10.31669535283993
In [7]:
# Comparing the results
print(f"Average comments on Ask HN posts: {avg_ask_comments}")
print(f"Average comments on Show HN posts: {avg_show_comments}")
Average comments on Ask HN posts: 14.038417431192661
Average comments on Show HN posts: 10.31669535283993
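
Since the two averages are computed with the same loop, the calculation could be written once and reused for both lists. The sketch below assumes the same row layout as the dataset (comment count as a string at index 4); `average_comments` is a hypothetical helper, and the sample rows are shortened stand-ins rather than real data.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of rows (0.0 if empty)."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Assumption: two shortened rows with made-up ids and titles
sample_posts = [
    ['1', 'Ask HN: first', '', '10', '52', 'someone', '8/4/2016 11:52'],
    ['2', 'Ask HN: second', '', '3', '10', 'someone', '1/26/2016 19:30'],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```

The empty-list check avoids a division by zero if a category happens to contain no posts.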

As we can see above, Ask HN posts receive more comments on average than Show HN posts: about 14.0 versus 10.3, a difference of almost 4 comments per post. Because of that, we are going to focus the rest of the analysis on Ask HN posts: does the time of publication affect the number of comments?

3 Checking if time of publication affects amount of comments

To do so, we are going to follow two steps. First, we will count the ask posts created in each hour of the day, along with the number of comments they received. Second, we will calculate the average number of comments ask posts receive by hour created.

Calculating the number of posts and comments per hour

In [8]:
# Importing the datetime module
import datetime as dt

# Creating a list to store the values of posts and comments per hour
result_list = []

# Populating our result list
for item in ask_posts:
    created_at = item[6]
    n_comments = item[4]
    row = [created_at, n_comments]
    result_list.append(row)

# Creating two dictionaries to make the frequency tables
counts_by_hour = {}
comments_by_hour = {}

# Populating our dictionaries
for item in result_list:
    datetime_str = item[0]
    comments = int(item[1])
    datetime_hour = dt.datetime.strptime(datetime_str, "%m/%d/%Y %H:%M")
    hour = datetime_hour.hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments

# Checking if it worked
print(comments_by_hour)
print(counts_by_hour)
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
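
The two frequency tables above can also be built without the if/else branch by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch under the same assumptions as the cell above (dates formatted like "8/4/2016 11:52"); `tally_by_hour` is a hypothetical name, and the sample pairs are a tiny stand-in for `result_list`.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Return (counts_by_hour, comments_by_hour) for (created_at, n_comments) pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        comments[hour] += int(n_comments)
    return dict(counts), dict(comments)


# Assumption: three sample pairs instead of the full result_list
sample_pairs = [
    ["8/4/2016 11:52", "52"],
    ["1/26/2016 19:30", "10"],
    ["6/23/2016 11:20", "1"],
]

sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```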

With that done, we will calculate the average number of comments per post for each hour.

In [9]:
# Creating the list to store the data
avg_by_hour = []

# Populating our list with the average number of comments per post for each hour
for hour in counts_by_hour:
    average = round(comments_by_hour[hour] / counts_by_hour[hour], 2)
    avg_by_hour.append([hour, average])
    
# Checking if everything worked fine
print(avg_by_hour)
[[9, 5.58], [13, 14.74], [10, 13.44], [14, 13.23], [16, 16.8], [23, 7.99], [12, 9.41], [17, 11.46], [15, 38.59], [21, 16.01], [20, 21.52], [2, 23.81], [18, 13.2], [3, 7.8], [5, 10.09], [19, 10.8], [1, 11.38], [22, 6.75], [8, 10.25], [4, 7.17], [0, 8.13], [6, 9.02], [7, 7.85], [11, 11.05]]

Now we need to display the data in a clear way to see the hours with the most comments. We'll do that with a second list, in which the average comes first so that the list can be sorted by it.

In [10]:
# Creating the empty list that will be sorted
swap_avg_by_hour = []

# Populating the list inverting the elements of the first one
for item in avg_by_hour:
    new_items = list([item[1], item[0]])
    swap_avg_by_hour.append(new_items)
    
# Checking to see if it worked
print(swap_avg_by_hour)
[[5.58, 9], [14.74, 13], [13.44, 10], [13.23, 14], [16.8, 16], [7.99, 23], [9.41, 12], [11.46, 17], [38.59, 15], [16.01, 21], [21.52, 20], [23.81, 2], [13.2, 18], [7.8, 3], [10.09, 5], [10.8, 19], [11.38, 1], [6.75, 22], [10.25, 8], [7.17, 4], [8.13, 0], [9.02, 6], [7.85, 7], [11.05, 11]]
In [11]:
# Sorting the swap list
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Title of our small table
print("Top 5 Hours for Ask Posts Comments")

# Looping through our data to print the top 5 hours
for row in sorted_swap[:5]:
    datetime_object = dt.datetime.strptime(str(row[1]), "%H")
    time_object = datetime_object.strftime("%H:%M")
    print("{}: {} average comments per post".format(time_object, row[0]))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post
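
As an alternative to building a swapped list, `sorted` accepts a `key` function, so the [hour, average] pairs can be sorted by their second element directly. The sketch below uses four of the pairs printed earlier as sample data; the names `sample_avg_by_hour`, `top_hours` and `lines` are introduced only for this illustration.

```python
# Four [hour, average] pairs copied from the avg_by_hour output above
sample_avg_by_hour = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.52]]

# Sorting by the average (index 1), highest first, without swapping the columns
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

# Formatting each hour as HH:00 with a zero-padded format specifier
lines = [
    f"{hour:02d}:00: {average} average comments per post"
    for hour, average in top_hours[:3]
]

for line in lines:
    print(line)
```

The `{hour:02d}` format specifier pads single-digit hours with a zero, which replaces the round trip through `strptime` and `strftime`.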

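As we can see above, ask posts created at 15:00, 02:00 or 20:00, Eastern Time, received more than 20 comments on average, so those are probably the best hours to post if we want to reach over 20 comments. Or, if you live here in Brazil too, that corresponds to roughly 17:00, 04:00 or 22:00, Brasília time, assuming a two-hour difference (the exact offset changes with daylight saving time).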
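
The hour conversion can be checked with a few lines of arithmetic. This sketch assumes a fixed offset of two hours between Eastern Time and Brasília time, which is a simplification because the real difference depends on daylight saving time; `shift_hour` and `ASSUMED_OFFSET` are hypothetical names.

```python
def shift_hour(hour, offset_hours):
    """Shift an hour of the day (0-23) by a whole-hour offset, wrapping at midnight."""
    return (hour + offset_hours) % 24


# Assumption: Brasília is taken to be two hours ahead of Eastern Time
ASSUMED_OFFSET = 2

eastern_hours = [15, 2, 20]
brasilia_hours = [shift_hour(hour, ASSUMED_OFFSET) for hour in eastern_hours]
print(brasilia_hours)
```

The modulo keeps the result within 0 to 23, so an hour such as 23 wraps around to 1 instead of becoming 25.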