Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [1]:

```
# import reader to open the data set file
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
data_set = list(read_file)
headers = data_set[0]
hn = data_set[1:]
```

To make it easier to explore the data set, we'll first write a function named explore_data() that we can use repeatedly to print rows in a more readable way. We'll also add an option for the function to show the number of rows and columns of any data set.

In [2]:

```
# define a function to explore the data set
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows is ', len(dataset))
        print('Number of columns is ', len(dataset[0]))

# explore the first 5 rows of our data set
print(headers)
print('\n')
explore_data(hn, 0, 5, True)
```

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll separate posts beginning with Ask HN and Show HN (and case variations) into two different lists next.

In [3]:

```
# create 3 lists for the different post types
ask_posts = []
show_posts = []
other_posts = []
# loop over the data set and append each row to ask_posts, show_posts,
# or other_posts, depending on how its lowercased title starts
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# create a function which returns the number of posts in a category
def num_posts(posts):
    number = 0
    for row in posts:
        number += 1
    return number

num_ask = num_posts(ask_posts)
num_show = num_posts(show_posts)
num_others = num_posts(other_posts)
print('The number of posts in ask_posts is ', num_ask)
print('The number of posts in show_posts is ', num_show)
print('The number of posts in other_posts is ', num_others)
```
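As a side note, Python's built-in `len()` returns the same counts as the hand-written `num_posts()` helper. A minimal sketch on made-up rows (the `sample_posts` list is purely illustrative and is not taken from the data set):

```python
# Hypothetical sample rows; only their number matters here.
sample_posts = [
    ['1', 'Ask HN: first example'],
    ['2', 'Ask HN: second example'],
    ['3', 'Ask HN: third example'],
]

def num_posts(posts):
    # Same counting loop as in the cell above.
    number = 0
    for row in posts:
        number += 1
    return number

# len() gives the same result without an explicit loop.
same_count = num_posts(sample_posts) == len(sample_posts)
print(len(sample_posts), same_count)
```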

Now that we have separated the different types of posts, we'll determine whether ask posts or show posts receive more comments on average.

In [4]:

```
# compute the total number of Ask HN comments
total_ask_comments = 0
# loop over the ask_posts list to add up the number of comments
for row in ask_posts:
    total_ask_comments += int(row[4])
# divide by the number of ask posts to get the average
avg_ask_comments = total_ask_comments / num_ask
print(avg_ask_comments)
```

We've computed that, on average, an Ask HN post has ~14 comments. Now we'll take a look at Show HN posts.

In [5]:

```
# compute the total number of Show HN comments
total_show_comments = 0
# loop over the show_posts list to add up the number of comments
for row in show_posts:
    total_show_comments += int(row[4])
# divide by the number of show posts to get the average
avg_show_comments = total_show_comments / num_show
print(avg_show_comments)
```

We've computed that, on average, a Show HN post gets ~10 comments.

So Ask HN posts receive more comments on average than Show HN posts. We can assume that people are more willing to respond to questions than to discuss, criticize, or praise whatever other users present in Show HN posts.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
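The two average calculations above follow the same pattern, so they could be folded into one helper. The following is only a sketch: `average_column` and the sample rows are made up for illustration, and the rows merely mimic the layout used in this project (comment count at index 4).

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # assumption: an empty list averages to 0 instead of raising
    total = 0
    for row in rows:
        total += int(row[index])
    return total / len(rows)

# Made-up rows; only index 4 (the comment count) is used here.
sample_rows = [
    ['1', 'Ask HN: example A', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: example B', '', '2', '10', 'user_b', '11/22/2015 13:43'],
]
print(average_column(sample_rows, 4))
```

With the real lists, `average_column(ask_posts, 4)` and `average_column(show_posts, 4)` should reproduce the two averages computed above.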

Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

In [6]:

```
# import the datetime class using the alias dt
from datetime import datetime as dt
# append the date in row[6] and the number of comments in row[4] to the result list
result_list = []
for row in ask_posts:
    a_list = []
    a_list.append(row[6])
    a_list.append(int(row[4]))
    result_list.append(a_list)
# create dictionaries: counts_by_hour for the number of posts created at each hour
# and comments_by_hour for the number of comments those posts received
counts_by_hour = {}
comments_by_hour = {}
# loop over the result list and add data to the dictionaries
for row in result_list:
    time_str = row[0]
    datetime_dt = dt.strptime(time_str, "%m/%d/%Y %H:%M")
    hour_dt = datetime_dt.hour
    if hour_dt not in counts_by_hour:
        counts_by_hour[hour_dt] = 1
        comments_by_hour[hour_dt] = row[1]
    else:
        counts_by_hour[hour_dt] += 1
        comments_by_hour[hour_dt] += row[1]
print(counts_by_hour)
print(comments_by_hour)
```
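To see what the parsing step in the loop does in isolation, here is a small sketch. The timestamp below is a made-up example in the same month/day/year hour:minute layout that the format string expects, not a value quoted from the data set.

```python
from datetime import datetime as dt

# Hypothetical timestamp in the "%m/%d/%Y %H:%M" layout parsed above.
sample_time = '8/16/2016 9:55'
parsed = dt.strptime(sample_time, '%m/%d/%Y %H:%M')

# .hour is the hour of the day as an integer from 0 to 23.
print(parsed.hour)
```

The integer hour is what serves as the dictionary key in `counts_by_hour` and `comments_by_hour`.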

We have created two dictionaries: one containing the number of ask posts created during each hour of the day, and the other containing the corresponding number of comments those posts received. Now we are ready to calculate the average number of comments for posts created during each hour of the day.

In [7]:

```
avg_by_hour = []
# compute the average number of comments, append the result to the avg_by_hour list
for key in counts_by_hour:
    a_list = []
    a_list.append(key)
    avg = comments_by_hour[key] / counts_by_hour[key]
    a_list.append(avg)
    avg_by_hour.append(a_list)
for a_list in avg_by_hour:
    print(a_list)
```
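The same averages could also be kept in a dictionary keyed by hour, which avoids building nested lists. A sketch with made-up totals (the two sample dictionaries are hypothetical stand-ins for `counts_by_hour` and `comments_by_hour`, not the real output):

```python
# Hypothetical per-hour totals in the same shape as counts_by_hour and comments_by_hour.
sample_counts = {9: 2, 15: 4}
sample_comments = {9: 10, 15: 60}

# One average per hour: total comments divided by the number of posts.
sample_avg = {hour: sample_comments[hour] / sample_counts[hour] for hour in sample_counts}
print(sample_avg)
```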

In [8]:

```
swap_avg_by_hour = []
# swap the time and the average number of comments and append the list to swap_avg_by_hour
for row in avg_by_hour:
    a_list = []
    a_list.append(row[1])
    a_list.append(row[0])
    swap_avg_by_hour.append(a_list)
for a_list in swap_avg_by_hour:
    print(a_list)
```

In [9]:

```
# sort the swapped list in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
string = 'Top 5 Hours for Ask Posts Comments'
print(string)
for row in sorted_swap[:5]:
    string1 = dt.strptime(str(row[1]), '%H')
    string1 = dt.strftime(string1, '%H:%M')
    output = "{0}: {1:.2f} average comments per post".format(string1, row[0])
    print(output)
```
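Swapping the columns is one way to sort by the average. An alternative is to pass a `key` function to `sorted()`, which leaves the `[hour, average]` order untouched. The sketch below uses made-up pairs (`sample_avg_by_hour` is hypothetical), not the real results:

```python
# Hypothetical [hour, average] pairs in the same shape as avg_by_hour.
sample_avg_by_hour = [[9, 5.0], [15, 15.0], [2, 12.5]]

# Sort on the second element (the average) in descending order, so no swap is needed.
sample_sorted = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
for hour, avg in sample_sorted:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

One difference to keep in mind: with a `key`, hours that share the same average keep their original relative order, whereas sorting the swapped lists breaks ties by the hour.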

According to the results, the best hour to create an Ask HN post is 15:00 (3pm), when a post is most likely to receive comments. The next best hour is 2:00 (2am), although the gap between the averages at 3pm and 2am is large (~15 comments). Posts created at 20:00 (8pm) receive about 21 comments on average, only slightly fewer (~2 comments) than posts created at 2am.

In the next step, we are going to calculate the average number of comments per hour for Show HN posts.

In [10]:

```
# create a list with the time in row[6] and the number of comments in row[4]
show_result_list = []
for row in show_posts:
    a_list1 = []
    a_list1.append(row[6])
    a_list1.append(int(row[4]))
    show_result_list.append(a_list1)
# create two empty dictionaries: s_counts_by_hour for the number of show posts for each hour,
# and s_comments_by_hour for the number of comments for each hour
s_counts_by_hour = {}
s_comments_by_hour = {}
# loop through show_result_list and add data to the dictionaries
for row in show_result_list:
    s_time_str = row[0]
    s_datetime_dt = dt.strptime(s_time_str, "%m/%d/%Y %H:%M")
    s_hour_dt = s_datetime_dt.hour
    if s_hour_dt not in s_counts_by_hour:
        s_counts_by_hour[s_hour_dt] = 1
        s_comments_by_hour[s_hour_dt] = row[1]
    else:
        s_counts_by_hour[s_hour_dt] += 1
        s_comments_by_hour[s_hour_dt] += row[1]
print(s_counts_by_hour)
print(s_comments_by_hour)
```

In [11]:

```
s_avg_by_hour = []
# compute the average number of comments and append a list with the time
# and the average comments to the s_avg_by_hour list
for key in s_counts_by_hour:
    s_a_list = []
    s_a_list.append(key)
    s_avg = s_comments_by_hour[key] / s_counts_by_hour[key]
    s_a_list.append(s_avg)
    s_avg_by_hour.append(s_a_list)
for s_a_list in s_avg_by_hour:
    print(s_a_list)
```

In [12]:

```
s_swap_avg_by_hour = []
# swap the index of time and comments and append the result to s_swap_avg_by_hour
for row in s_avg_by_hour:
    sa_list = []
    sa_list.append(row[1])
    sa_list.append(row[0])
    s_swap_avg_by_hour.append(sa_list)
for a_list in s_swap_avg_by_hour:
    print(a_list)
```

In [13]:

```
# sort the swapped list in descending order of average comments
s_sorted_swap = sorted(s_swap_avg_by_hour, reverse=True)
s_string = 'Top 5 Hours for Show Posts Comments'
print(s_string)
for row in s_sorted_swap[:5]:
    sstring1 = dt.strptime(str(row[1]), '%H')
    sstring1 = dt.strftime(sstring1, '%H:%M')
    s_output = "{0}: {1:.2f} average comments per post".format(sstring1, row[0])
    print(s_output)
```

We can see that the best time for creating a Show HN post is 18:00 (6pm). On average, such a post receives almost the same number of comments as one posted at 00:00 (12am). The third best time for posting is 14:00 (2pm). The differences between the average numbers of comments at these hours are small.
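Because 24-hour values are easy to mislabel in 12-hour terms (00:00 is midnight, 12am, and 12:00 is noon, 12pm), the conversion can be left to `strftime`. A small sketch, where `to_12_hour` is a hypothetical helper name and the exact AM/PM text depends on the locale:

```python
from datetime import datetime as dt

def to_12_hour(hour):
    """Format an hour from 0 to 23 as a 12-hour clock label."""
    # %I is the hour on a 12-hour clock; %p is the locale's AM/PM marker.
    return dt.strptime(str(hour), '%H').strftime('%I:%M %p')

for hour in (0, 12, 18):
    print('{:02d}:00 -> {}'.format(hour, to_12_hour(hour)))
```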

Now we'll calculate the average number of comments per hour for other posts.

In [14]:

```
# create a list for posts in the other category: loop through other_posts,
# append the time (row[6]) and the number of comments (row[4])
other_result_list = []
for row in other_posts:
    o_list1 = []
    o_list1.append(row[6])
    o_list1.append(int(row[4]))
    other_result_list.append(o_list1)
# create empty dictionaries:
# o_counts_by_hour for the number of posts at a given hour
# o_comments_by_hour for the number of comments on posts of a given hour
o_counts_by_hour = {}
o_comments_by_hour = {}
# loop over other_result_list and add data to the dictionaries
for row in other_result_list:
    o_time_str = row[0]
    o_datetime_dt = dt.strptime(o_time_str, "%m/%d/%Y %H:%M")
    o_hour_dt = o_datetime_dt.hour
    if o_hour_dt not in o_counts_by_hour:
        o_counts_by_hour[o_hour_dt] = 1
        o_comments_by_hour[o_hour_dt] = row[1]
    else:
        o_counts_by_hour[o_hour_dt] += 1
        o_comments_by_hour[o_hour_dt] += row[1]
print(o_counts_by_hour)
print(o_comments_by_hour)
```

In [15]:

```
o_avg_by_hour = []
# compute the average number of comments and append it with the time to a new list
for key in o_counts_by_hour:
    o_a_list = []
    o_a_list.append(key)
    o_avg = o_comments_by_hour[key] / o_counts_by_hour[key]
    o_a_list.append(o_avg)
    o_avg_by_hour.append(o_a_list)
for o_a_list in o_avg_by_hour:
    print(o_a_list)
```

In [16]:

```
o_swap_avg_by_hour = []
# swap the index of time and the number of comments
for row in o_avg_by_hour:
    oa_list = []
    oa_list.append(row[1])
    oa_list.append(row[0])
    o_swap_avg_by_hour.append(oa_list)
for o_list in o_swap_avg_by_hour:
    print(o_list)
```

In [17]:

```
# sort the swapped list in descending order to create a top list
o_sorted_swap = sorted(o_swap_avg_by_hour, reverse=True)
o_string = 'Top 5 Hours for Other Posts Comments'
print(o_string)
for row in o_sorted_swap[:5]:
    ostring1 = dt.strptime(str(row[1]), '%H')
    ostring1 = dt.strftime(ostring1, '%H:%M')
    ooutput = "{0}: {1:.2f} average comments per post".format(ostring1, row[0])
    print(ooutput)
```

We found that other posts created at 14:00 (2pm) receive the most comments. However, the averages are distributed almost evenly, so you are likely to get more comments if you create a post between 11:00 (11am) and 16:00 (4pm).

Now let's compare the results for all three post types.

In [18]:

```
# print all the results for the different post categories
print(string)
for row in sorted_swap[:5]:
    string1 = dt.strptime(str(row[1]), '%H')
    string1 = dt.strftime(string1, '%H:%M')
    output = "{0}: {1:.2f} average comments per post".format(string1, row[0])
    print(output)
print('\n')
print(s_string)
for row in s_sorted_swap[:5]:
    sstring1 = dt.strptime(str(row[1]), '%H')
    sstring1 = dt.strftime(sstring1, '%H:%M')
    s_output = "{0}: {1:.2f} average comments per post".format(sstring1, row[0])
    print(s_output)
print('\n')
print(o_string)
for row in o_sorted_swap[:5]:
    ostring1 = dt.strptime(str(row[1]), '%H')
    ostring1 = dt.strftime(ostring1, '%H:%M')
    ooutput = "{0}: {1:.2f} average comments per post".format(ostring1, row[0])
    print(ooutput)
```

On average, Ask HN posts get the most comments. The most commented ones are created at 15:00 (3pm), 2:00 (2am), and 20:00 (8pm); after that, user activity declines. So the best way to get more comments is to create an Ask HN post at 15:00 (3pm).

Other posts receive fewer comments. The most commented ones are created at 14:00 (2pm), 13:00 (1pm), and 12:00 (12pm). The differences within the top list are not striking, so you can expect a similar number of comments when posting between 12:00 and 16:00.

Show HN posts receive the fewest comments. The most commented ones are created at 18:00 (6pm), 00:00 (12am), and 14:00 (2pm).
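The three per-hour pipelines above differ only in the list of posts they read, so they could be expressed as one function. This is a sketch under assumptions, not the approach used in this notebook: `top_hours`, its parameters, and the sample rows are all made up for illustration, and the rows only mimic the layout used here (value column at index 3 or 4, creation time at index 6).

```python
from datetime import datetime as dt

def top_hours(posts, value_index, date_index=6, limit=5):
    """Return up to `limit` (hour, average) pairs, highest average first."""
    totals = {}
    counts = {}
    for row in posts:
        hour = dt.strptime(row[date_index], '%m/%d/%Y %H:%M').hour
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    averages = [(hour, totals[hour] / counts[hour]) for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:limit]

# Made-up rows: points at index 3, comments at index 4, creation time at index 6.
sample_posts = [
    ['1', 'Ask HN: a', '', '3', '10', 'u1', '8/16/2016 15:05'],
    ['2', 'Ask HN: b', '', '1', '20', 'u2', '8/17/2016 15:40'],
    ['3', 'Ask HN: c', '', '7', '4', 'u3', '8/18/2016 2:10'],
]
for hour, avg in top_hours(sample_posts, 4):
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

Passing `3` instead of `4` as `value_index` would rank the hours by average points rather than comments.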

Karma points are calculated as the number of upvotes a given user's content has received minus the number of downvotes. We want to know which type of post gets more upvotes.

In [19]:

```
show_points = 0
# iterate over show_posts and add up the number of points
for row in show_posts:
    points = int(row[3])
    show_points += points
# find the average by dividing show_points by num_show
avg_show_points = show_points / num_show
print(avg_show_points)
```

In [20]:

```
ask_points = 0
# iterate over ask_posts and add up the number of points
for row in ask_posts:
    points = int(row[3])
    ask_points += points
# find the average ask points by dividing ask_points by num_ask
avg_ask_points = ask_points / num_ask
print(avg_ask_points)
```

We can see that Show HN posts receive more points on average, which means that creating a Show HN post is likely to earn you more upvotes than an Ask HN post.

Next, we'll figure out whether posts created at a certain time are more likely to receive more points.

In [38]:

```
ask_result_list = []
# loop through ask_posts, append the time (row[6]) and number of points (row[3]) to the list
for row in ask_posts:
    a_list = []
    a_list.append(row[6])
    a_list.append(int(row[3]))
    ask_result_list.append(a_list)
# create two empty dictionaries:
# counts_ask_by_hour for the number of posts created at a given hour
# ask_points_by_hour for the number of points for posts at a given hour
counts_ask_by_hour = {}
ask_points_by_hour = {}
# loop through ask_result_list and fill both dictionaries
for row in ask_result_list:
    time_str = row[0]
    datetime_dt = dt.strptime(time_str, "%m/%d/%Y %H:%M")
    hour_dt = datetime_dt.hour
    if hour_dt not in ask_points_by_hour:
        ask_points_by_hour[hour_dt] = row[1]
        counts_ask_by_hour[hour_dt] = 1
    else:
        ask_points_by_hour[hour_dt] += row[1]
        counts_ask_by_hour[hour_dt] += 1
average_by_hour = []
# find the average by dividing the number of points (ask_points_by_hour) by the number of posts
for key in counts_ask_by_hour:
    a_list = []
    a_list.append(key)
    avg1 = ask_points_by_hour[key] / counts_ask_by_hour[key]
    a_list.append(avg1)
    average_by_hour.append(a_list)
swap_average_by_hour = []
# swap the index of time and average points, append the list to swap_average_by_hour
for row in average_by_hour:
    av_list = []
    av_list.append(row[1])
    av_list.append(row[0])
    swap_average_by_hour.append(av_list)
# sort swap_average_by_hour in descending order to create a top list
sor_ask_average = sorted(swap_average_by_hour, reverse=True)
string2 = 'Top 5 Hours for Ask Posts Points:'
print(string2)
for row in sor_ask_average[:5]:
    string3 = dt.strptime(str(row[1]), '%H')
    string3 = dt.strftime(string3, '%H:%M')
    output = "{0}: {1:.2f} points per post".format(string3, row[0])
    print(output)
```

We can see that Ask HN posts created at 15:00 (3pm) get considerably more points than those created at any other time.

Now we'll analyse the Show HN posts.

In [39]:

```
show_result_list = []
# loop through show_posts and append the time (row[6]) and number of points (row[3]) to show_result_list
for row in show_posts:
    c_list = []
    c_list.append(row[6])
    c_list.append(int(row[3]))
    show_result_list.append(c_list)
# create two empty dictionaries:
# counts_show_by_hour for the number of posts at a given hour
# show_points_by_hour for the number of points for these posts
counts_show_by_hour = {}
show_points_by_hour = {}
# loop through show_result_list, add up the points and count the posts for each hour
for row in show_result_list:
    time_str1 = row[0]
    datetime_dt1 = dt.strptime(time_str1, "%m/%d/%Y %H:%M")
    hour_dt1 = datetime_dt1.hour
    if hour_dt1 not in show_points_by_hour:
        show_points_by_hour[hour_dt1] = row[1]
        counts_show_by_hour[hour_dt1] = 1
    else:
        show_points_by_hour[hour_dt1] += row[1]
        counts_show_by_hour[hour_dt1] += 1
show_average_by_hour = []
# compute the average number of points per post, append to show_average_by_hour
for key in counts_show_by_hour:
    a_list = []
    a_list.append(key)
    avg2 = show_points_by_hour[key] / counts_show_by_hour[key]
    a_list.append(avg2)
    show_average_by_hour.append(a_list)
swap_showp_by_hour = []
# swap the index of time and average points and append to swap_showp_by_hour
for row in show_average_by_hour:
    v_list = []
    v_list.append(row[1])
    v_list.append(row[0])
    swap_showp_by_hour.append(v_list)
# sort in descending order to create a top list
sor_show_pointsbh = sorted(swap_showp_by_hour, reverse=True)
string4 = 'Top 5 Hours for Show Posts Points:'
print(string4)
for row in sor_show_pointsbh[:5]:
    string5 = dt.strptime(str(row[1]), '%H')
    string5 = dt.strftime(string5, '%H:%M')
    output = '{0}: {1:.2f} points per post'.format(string5, row[0])
    print(output)
```

We can see that the differences between the average points Show HN posts receive at the top hours are small, so points are distributed fairly evenly. It doesn't matter much when exactly you create a Show HN post, as long as it falls in one of these periods:

- from 22:00 (10pm) to 00:00 (12am)
- at 12:00 (12pm)

Now we will compute the average number of points other posts receive.

In [40]:

```
other_result_list = []
# loop through other_posts, append the time (row[6]) and the number of points (row[3]) to other_result_list
for row in other_posts:
    oth_list = []
    oth_list.append(row[6])
    oth_list.append(int(row[3]))
    other_result_list.append(oth_list)
# create two empty dictionaries:
# counts_other_by_hour for the number of posts at a given hour
# other_points_by_hour for the number of points for these posts
counts_other_by_hour = {}
other_points_by_hour = {}
# loop through other_result_list, add up the points and count the posts for each hour
for row in other_result_list:
    time_str3 = row[0]
    datetime_dt3 = dt.strptime(time_str3, "%m/%d/%Y %H:%M")
    hour_dt3 = datetime_dt3.hour
    if hour_dt3 not in other_points_by_hour:
        other_points_by_hour[hour_dt3] = row[1]
        counts_other_by_hour[hour_dt3] = 1
    else:
        other_points_by_hour[hour_dt3] += row[1]
        counts_other_by_hour[hour_dt3] += 1
av_other_points = []
# compute the average points per post for each hour, append to av_other_points
for key in counts_other_by_hour:
    f_list = []
    f_list.append(key)
    avg_f = other_points_by_hour[key] / counts_other_by_hour[key]
    f_list.append(avg_f)
    av_other_points.append(f_list)
swap_other_points = []
# swap the index of the time and average points
for row in av_other_points:
    v_list = []
    v_list.append(row[1])
    v_list.append(row[0])
    swap_other_points.append(v_list)
# sort swap_other_points in descending order to create a top list
sor_other_pointsbh = sorted(swap_other_points, reverse=True)
string7 = 'Top 5 Hours for Other Posts Points:'
print(string7)
for row in sor_other_pointsbh[:5]:
    string8 = dt.strptime(str(row[1]), '%H')
    string8 = dt.strftime(string8, '%H:%M')
    oth_output = '{0}: {1:.2f} points per post'.format(string8, row[0])
    print(oth_output)
```

The best period for creating an other post is between 13:00 (1pm) and 16:00 (4pm). Now we'll compare the results for the different post types.

In [42]:

```
# print all the results
string2 = 'Top 5 Hours for Ask Posts Points:'
print(string2)
for row in sor_ask_average[:5]:
    string3 = dt.strptime(str(row[1]), '%H')
    string3 = dt.strftime(string3, '%H:%M')
    output = "{0}: {1:.2f} points per post".format(string3, row[0])
    print(output)
print('\n')
string4 = 'Top 5 Hours for Show Posts Points:'
print(string4)
for row in sor_show_pointsbh[:5]:
    string5 = dt.strptime(str(row[1]), '%H')
    string5 = dt.strftime(string5, '%H:%M')
    output = '{0}: {1:.2f} points per post'.format(string5, row[0])
    print(output)
print('\n')
string7 = 'Top 5 Hours for Other Posts Points:'
print(string7)
for row in sor_other_pointsbh[:5]:
    string8 = dt.strptime(str(row[1]), '%H')
    string8 = dt.strftime(string8, '%H:%M')
    oth_output = '{0}: {1:.2f} points per post'.format(string8, row[0])
    print(oth_output)
```

Posts outside the Ask HN and Show HN categories receive the most points on average. If you want to create such a post, the best time for it is between 13:00 (1pm) and 16:00 (4pm).

Show HN posts take second place and get fewer points. If you want to get the maximum points, the best times are 23:00 (11pm), 12:00 (12pm), and 22:00 (10pm).

The Ask HN post category receives half as many points as the Show HN one. You can get the maximum points if you create a post at 15:00 (3pm), 13:00 (1pm), or 16:00 (4pm).
