# Analyzing Hacker News posts and their popularity

### The goal of this project is to find out which type of post receives more comments, Ask HN or Show HN, and whether posts created at a certain time of day receive more comments than average.

We are going to use a CSV file taken from a Kaggle open dataset that can be found here. First, let's import the libraries we'll need and read the file.

## 1 Importing and reading the data

In [2]:
# Importing csv reader
from csv import reader

# Opening, reading and creating a list of lists with the csv file
hn = list(reader(open('/home/nathalia/Documents/2 data science/6 DataQuest Projects/2 Exploring Hacker News Posts/hacker_news.csv')))

# Printing the header row and the first five data rows
print(hn[:6])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Now we're going to store the headers in a separate list so we can analyze the data more freely, but without losing our reference.

In [3]:
# Storing the headers in another list
headers = hn[0]

# Removing the first row of hn
hn = hn[1:]

# Checking the result
print(headers)
print("\n")
print(hn[:2])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
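The two cells above can also be combined: a csv reader is an iterator, so `next()` can pull the header row off before the remaining rows are collected. The sketch below is only an illustration, not the code used in this project. `split_header` is a hypothetical helper, and an in-memory `io.StringIO` with one row copied from the output above stands in for the real file.

```python
import io
from csv import reader

def split_header(lines):
    """Return (headers, rows) from an iterable of CSV lines."""
    rows = reader(lines)
    headers = next(rows)      # the first row holds the column names
    return headers, list(rows)

# In-memory sample standing in for hacker_news.csv (an assumption for this sketch)
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

headers, rows = split_header(sample)
print(headers)
print(rows[0][1])
```

With a real file, the same function could be called inside `with open(path, newline='') as f:` so the file is closed once the rows are read.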


## 2 Analyzing which type of post has more comments

The next step is to keep only the posts that are relevant to our analysis, since it is just about the Ask HN and Show HN posts.

In [4]:
# Creating the lists to store the data
ask_posts = []
show_posts = []
other_posts = []

# Populating the lists
for row in hn:
    title = row[1].lower()     # The title is in the second column (the first one holds the id)
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Checking if everything worked fine
a = len(ask_posts)
b = len(show_posts)
c = len(other_posts)
print(f'ask_posts length: {a} \nshow_posts length: {b} \nother_posts length: {c}')


ask_posts length: 1744
show_posts length: 1162
other_posts length: 17194
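To see the title test in isolation, here is a small sketch. `classify_title` is a hypothetical helper that is not part of this notebook, and the Ask HN and Show HN titles are made up for the example.

```python
def classify_title(title):
    """Label a title as 'ask', 'show' or 'other' based on its prefix."""
    lowered = title.lower()   # lower-casing makes the prefix test case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: How do you learn a new language?',   # made-up title
    'SHOW HN: My weekend project',                # made-up title, upper case on purpose
    'Interactive Dynamic Video',                  # title taken from the dataset
]
labels = [classify_title(t) for t in titles]
print(labels)
```

Lower-casing first is what lets titles such as "ASK HN" or "ask hn" land in the same group.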


It seems everything worked fine. Now let's see whether ask posts or show posts receive more comments on average.

In [5]:
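# Creating the variable to store the ask comments values
total_ask_comments = 0
for item in ask_posts:
    total_ask_comments += int(item[4])

# Calculating the average of ask comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661

In [6]:
# Creating the variable to store the show comments values
total_show_comments = 0
for item in show_posts:
    total_show_comments += int(item[4])

# Calculating the average of show comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993

In [7]:
# Comparing the results
print(f'Average comments on Ask HN posts: {avg_ask_comments}')
print(f'Average comments on Show HN posts: {avg_show_comments}')

Average comments on Ask HN posts: 14.038417431192661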
Average comments on Show HN posts: 10.31669535283993
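Both averages follow the same steps, so they could share one function. The sketch below is one possible way to do it. `average_comments` is a hypothetical helper, not something defined in this notebook, and the two sample rows are copied from the first output above only to have numbers to work with.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column (index 4 in this dataset)."""
    if not posts:
        return 0.0            # avoids dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two rows from the dataset, with 52 and 1 comments
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

sample_average = average_comments(sample_posts)
print(sample_average)   # (52 + 1) / 2
```

The same call would then work for both `ask_posts` and `show_posts`.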


As we can see above, Ask HN posts receive more comments on average than Show HN posts: about 14 against about 10, almost 4 more per post. Because of that, we are going to focus the rest of the analysis on Ask HN posts: does the time of publication affect the number of comments?

## 3 Checking if time of publication affects amount of comments

To do so we are going to follow two steps. First, we will calculate the number of ask posts created in each hour of the day, along with the number of comments they received. Second, we will calculate the average number of comments ask posts receive by hour created.

### Calculating the amount of comments per post per hour

In [8]:
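# Importing the datetime module
import datetime as dt

# Creating a list to store the values of posts and comments per hour
result_list = []

# Populating our result list with [created_at, number of comments] pairs
for item in ask_posts:
    created_at = item[6]
    comments = int(item[4])
    row = [created_at, comments]
    result_list.append(row)

# Creating two dictionaries to make the frequency tables
counts_by_hour = {}
comments_by_hour = {}

# Populating our dictionaries
for item in result_list:
    datetime_str = item[0]
    comments = item[1]
    datetime_hour = dt.datetime.strptime(datetime_str, "%m/%d/%Y %H:%M")
    hour = datetime_hour.hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments

# Checking if it worked: comments per hour first, then posts per hour
print(comments_by_hour)
print(counts_by_hour)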

{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
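The if/else bookkeeping used to fill the two frequency tables could be shortened with `collections.defaultdict`, which starts every missing key at 0. This is just a sketch of that alternative. `tally_by_hour` is a hypothetical helper, and the sample pairs are illustration values (the third one is made up), not real ask posts.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Return (posts per hour, comments per hour) from (created_at, comments) pairs."""
    counts = defaultdict(int)      # missing hours start at 0
    comments = defaultdict(int)
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).hour
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)

sample_pairs = [
    ('8/4/2016 11:52', 52),
    ('1/26/2016 19:30', 10),
    ('9/30/2015 11:12', 2),    # made-up pair so that hour 11 appears twice
]

sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```

The date format string is the same one used in the cell above, so dates such as `8/4/2016 11:52` parse without zero padding.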


With that done, we will calculate the average number of comments per post for each hour.

In [9]:
# Creating the list to store the data
avg_by_hour = []

# Populating our list with the average number of comments per post for each hour
for item in counts_by_hour:
    information = [item, round(comments_by_hour[item] / counts_by_hour[item], 2)]
    avg_by_hour.append(information)

# Checking if everything worked fine
print(avg_by_hour)

[[9, 5.58], [13, 14.74], [10, 13.44], [14, 13.23], [16, 16.8], [23, 7.99], [12, 9.41], [17, 11.46], [15, 38.59], [21, 16.01], [20, 21.52], [2, 23.81], [18, 13.2], [3, 7.8], [5, 10.09], [19, 10.8], [1, 11.38], [22, 6.75], [8, 10.25], [4, 7.17], [0, 8.13], [6, 9.02], [7, 7.85], [11, 11.05]]
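As a quick check of the arithmetic, each average is the total number of comments in an hour divided by the number of posts in that hour, rounded to two decimals. The sketch below redoes it for three hours, with the totals copied from the two dictionaries printed earlier. The `sample_` names are only for this example.

```python
# Totals copied from the printed dictionaries for hours 9, 15 and 2
sample_comments = {9: 251, 15: 4477, 2: 1381}
sample_counts = {9: 45, 15: 116, 2: 58}

# One [hour, average] pair per hour, rounded to two decimals
sample_avg = [
    [hour, round(sample_comments[hour] / sample_counts[hour], 2)]
    for hour in sample_counts
]
print(sample_avg)
```

For example, hour 15 has 4477 comments over 116 posts, which gives the 38.59 shown in the output above.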


Now that we have the averages, we need to display them in a clear way to see the hours with the most comments. We'll do that with a second list, which we'll create next.

In [10]:
# Creating the empty list that will be sorted
swap_avg_by_hour = []

# Populating the list inverting the elements of the first one
for item in avg_by_hour:
    new_items = [item[1], item[0]]
    swap_avg_by_hour.append(new_items)

# Checking to see if it worked
print(swap_avg_by_hour)

[[5.58, 9], [14.74, 13], [13.44, 10], [13.23, 14], [16.8, 16], [7.99, 23], [9.41, 12], [11.46, 17], [38.59, 15], [16.01, 21], [21.52, 20], [23.81, 2], [13.2, 18], [7.8, 3], [10.09, 5], [10.8, 19], [11.38, 1], [6.75, 22], [10.25, 8], [7.17, 4], [8.13, 0], [9.02, 6], [7.85, 7], [11.05, 11]]

In [11]:
# Sorting the swap list
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Printing the title of our small table
print("Top 5 Hours for Ask Posts Comments")

# Looping through our data to print the top 5 hours
for row in sorted_swap[:5]:
    datetime_object = dt.datetime.strptime(str(row[1]), "%H")
    time_object = datetime_object.strftime("%H:%M")
    print("{}: {} average comments per post".format(time_object, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post
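The swap step is there because `sorted` compares lists by their first element. An alternative, sketched below, is to give `sorted` a `key` function that picks the average directly, so no swapped list is needed. The five pairs are copied from the averages printed earlier, and `top_hours` and `lines` are names chosen for this example.

```python
# A few [hour, average] pairs copied from the averages computed earlier
sample_avg_by_hour = [[9, 5.58], [15, 38.59], [20, 21.52], [2, 23.81], [16, 16.8]]

# Sorting by the second element (the average), highest first, and keeping the top 3
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

# Formatting the hour with two digits instead of going through strptime/strftime
lines = [f"{hour:02d}:00: {avg} average comments per post" for hour, avg in top_hours]
for line in lines:
    print(line)
```

Both approaches give the same ranking; this one just skips the intermediate list.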


As we can see above, ask posts created at 15:00, 02:00 or 20:00, Eastern Time, received more than 20 comments on average. So if we want a post to have a better chance of getting comments, we should publish it at one of those hours. Or, if you live here in Brazil too, at 17:00, 04:00 or 22:00, Brasília time.