This project analyses the number of comments on posts found on *Hacker News*, a popular technology website. Users of *Hacker News* submit many different types of posts, but we will focus on two in this project:

- posts that begin with **Ask HN**
- posts that begin with **Show HN**

We will isolate these posts from the data set and use them to answer two questions:

- Between **Ask HN** and **Show HN** posts, which type of post generates more comments on average?
- Does the time that the post is created affect the average number of comments?

The original data set for this project can be found on Kaggle. However, this project uses a reduced data set. As the purpose of this project is to analyse the number of comments that different types of posts receive, all posts that did not receive comments have been removed from the data set.
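The reduction described above could be reproduced along the following lines. This is only a sketch: the sample rows are made up, the column order is an assumption based on the indexes used later in this notebook (title at index 1, comment count at index 4, creation time at index 6), and `drop_uncommented` is a hypothetical helper, not part of the original data preparation.

```python
# Hypothetical sample rows; the assumed column order is
# [id, title, url, num_points, num_comments, author, created_at]
sample_rows = [
    ["1", "Ask HN: Example question?", "", "5", "3", "user_a", "8/16/2016 9:55"],
    ["2", "Show HN: Example project", "http://example.com", "2", "0", "user_b", "8/16/2016 10:15"],
    ["3", "Some other story", "http://example.org", "10", "12", "user_c", "8/17/2016 15:30"],
]


def drop_uncommented(rows, comments_index=4):
    """Keep only the rows whose comment count is greater than zero."""
    return [row for row in rows if int(row[comments_index]) > 0]


commented_rows = drop_uncommented(sample_rows)
print(len(commented_rows), "of", len(sample_rows), "rows kept")
```

Because every remaining post has at least one comment, the averages computed later describe posts that received engagement, not all posts submitted.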

```
# read in the csv file:
import csv

with open('hacker_news.csv') as file:
    reader = csv.reader(file)
    hn = list(reader)

header = hn[0]  # keep the header row to track what each column holds
hn = hn[1:]

# display the header and the first 5 rows:
print(header)
print('\n')
print(hn[:5])
```

```
# create new lists to isolate the 'Ask HN', 'Show HN' and 'Other' posts
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()  # put all letters in lower case for easier search
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# check the number of posts in each list:
print('Number of Ask posts:', len(ask_posts))
print('Number of Show posts:', len(show_posts))
print('Number of all other posts:', len(other_posts))
```

```
# create a function to calculate the average number of comments for 'Ask' and 'Show' posts
def avg_comments(posts):
    total_comments = 0
    for row in posts:
        num_comments = int(row[4])
        total_comments += num_comments
    return total_comments / len(posts)

# store the averages first, then print them (print() itself returns None)
avg_ask_comments = avg_comments(ask_posts)
avg_show_comments = avg_comments(show_posts)
print('Average number of comments on Ask posts:', avg_ask_comments)
print('Average number of comments on Show posts:', avg_show_comments)
```

So far, we have read the *Hacker News* csv file, created a list containing only the header information from the data and another list of data with the header removed. We then created 3 new lists to isolate the data according to what we would like to analyse:

- ask_posts list
- show_posts list
- other_posts list

We then created a function to compute the average number of comments that the different types of posts get.
**The results are as follows:**

*Ask HN* posts get, on average, 14.04 comments, while *Show HN* posts get 10.32. This suggests that *Ask HN* posts have more engagement than the *Show HN* posts.
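As a cross-check on the averaging step, the same calculation can be sketched with the standard library's `statistics.mean`. The toy rows and the `average_comments` name are illustrative assumptions, and the empty-list guard is an addition that the notebook's own function does not have.

```python
from statistics import mean


def average_comments(rows, comments_index=4):
    """Return the mean comment count over rows, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    return mean(int(row[comments_index]) for row in rows)


# Hypothetical rows: only the title (index 1) and comment count (index 4) matter here
toy_ask = [["", "Ask HN: a", "", "", "10"], ["", "Ask HN: b", "", "", "20"]]
toy_show = [["", "Show HN: c", "", "", "6"]]

print("Ask:", average_comments(toy_ask))
print("Show:", average_comments(toy_show))
```

Guarding against an empty list avoids a division by zero if one of the post categories happens to be missing from a data set.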

Since *Ask HN* posts receive more comments, we will analyse these posts further to see whether the time that the post is posted affects the number of comments that the post receives.

```
# import datetime module
import datetime as dt

# this list will hold the time each post was created and how many comments it received
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    result_list.append([created_at, num_comments])

# print first few rows to check that the list has been correctly appended
print(result_list[:3])
```

```
# create frequency tables for the number of posts per hour, and number of comments per hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    date = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    num_comments = int(row[1])
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

print('Posts per hour:')
print(counts_by_hour)
print('\n')
print('Comments per hour:')
print(comments_by_hour)
```

```
# calculate the average number of comments per post for each hour
avg_by_hour = []

for hour in counts_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])

print('Average number of comments per hour (hour first):')
print(avg_by_hour)
```

```
# swap the columns around so that average is the first element
swap_avg_by_hour = []

for row in avg_by_hour:
    swapped = [row[1], row[0]]
    swap_avg_by_hour.append(swapped)

print('Unsorted average number of comments per hour (average first):')
print(swap_avg_by_hour)

# sort the list in descending order so that it is easier to analyse
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('\n')
print('Sorted average number of comments per hour (average first):')
print(sorted_swap)
```

Up to this point, we have isolated the time of post creation and the number of comments per post, and generated a new list with this data. Using this list, we extracted the hour from each time, and created two frequency tables:

- the number of posts per hour
- the number of comments per hour

These frequency tables then allowed us to calculate the average comments per hour. Although this gave us the information we needed, the data was not in a readable format for analysis. Therefore, we sorted the average number of comments in descending order, for ease of analysis.
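The steps summarised above can also be sketched more compactly. The version below is an alternative to the notebook's cells, not a replacement for them: it uses `collections.defaultdict` for the two frequency tables and a `key` function for sorting, so no column swap is needed. The `toy_results` pairs are made-up values in the same `[created_at, num_comments]` shape as `result_list`.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical [created_at, num_comments] pairs
toy_results = [
    ["8/16/2016 9:55", "6"],
    ["8/17/2016 9:05", "2"],
    ["8/17/2016 15:30", "30"],
    ["8/18/2016 15:45", "10"],
    ["8/18/2016 2:10", "5"],
]

posts_per_hour = defaultdict(int)     # hour -> number of posts
comments_per_hour = defaultdict(int)  # hour -> total number of comments

for created_at, num_comments in toy_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += int(num_comments)

# average comments per post for each hour
average_per_hour = {
    hour: comments_per_hour[hour] / posts_per_hour[hour] for hour in posts_per_hour
}

# sort (hour, average) pairs by the average, highest first
ranked_hours = sorted(average_per_hour.items(), key=lambda item: item[1], reverse=True)
print(ranked_hours)
```

Sorting with `key=lambda item: item[1]` keeps the hour as the first element of each pair, which may be easier to read than the swapped layout.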

Next, we will format a string and print out the 5 hours with the highest average number of comments per post:

```
print('Top 5 Hours for Ask Posts Comments.')

for row in sorted_swap[:5]:
    average = row[0]
    hour = row[1]
    # convert the hour string (e.g. '15') into an 'HH:MM' label (e.g. '15:00')
    hour = dt.datetime.strptime(hour, "%H")
    hour = hour.strftime("%H:%M")
    top_5 = '{}: {:.2f} average comments per post'.format(hour, average)
    print(top_5)
```

**Top 5 Hours for Ask Posts Comments (US/Eastern Time)**

- 15:00: 38.59 average comments per post
- 02:00: 23.81 average comments per post
- 20:00: 21.52 average comments per post
- 16:00: 16.80 average comments per post
- 21:00: 16.01 average comments per post

The *Ask HN* posts with the highest average number of comments are those posted at **15:00** US/Eastern Time. The next best time to post would be 02:00, then 20:00, and finally 16:00 and 21:00, which have essentially the same average number of comments.

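In South Africa, the best times to post would be 6 hours ahead of these times while US daylight saving time is in effect (7 hours ahead during US standard time). Therefore, in order to get a higher number of comments, posts should be created at 21:00 South African time (22:00 during US standard time).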
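The conversion to South African time can be sketched as follows. This sketch assumes fixed UTC offsets: US Eastern Daylight Time at UTC-4 and South African Standard Time at UTC+2. The `eastern_hour_to_sast` helper and the default date are hypothetical, chosen only for illustration; a time zone database would be needed to handle the switch to US standard time automatically.

```python
import datetime as dt

# Assumed fixed offsets: US Eastern Daylight Time (UTC-4) and
# South African Standard Time (UTC+2). During US standard time (UTC-5),
# the gap is 7 hours instead of 6.
EASTERN_DAYLIGHT = dt.timezone(dt.timedelta(hours=-4))
SOUTH_AFRICA = dt.timezone(dt.timedelta(hours=2))


def eastern_hour_to_sast(hour_text, day=dt.date(2016, 8, 16)):
    """Convert an 'HH' hour string in Eastern Daylight Time to 'HH:MM' in SAST."""
    eastern = dt.datetime.combine(day, dt.time(int(hour_text)), tzinfo=EASTERN_DAYLIGHT)
    return eastern.astimezone(SOUTH_AFRICA).strftime("%H:%M")


# the top 5 hours from the analysis, as hour strings
for hour in ["15", "02", "20", "16", "21"]:
    print(hour + ":00 Eastern ->", eastern_hour_to_sast(hour), "SAST")
```

Note that the evening Eastern hours roll over into the early morning of the next day in South Africa.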

This project had two aims: to determine whether 'Ask' or 'Show' posts receive more comments on average, and to determine whether the time that the post was created affected the number of comments.

We found that 'Ask' posts receive more comments on average than 'Show' posts. We also found that posts created at 15:00 US/Eastern time received considerably more comments than posts created at other times, with the second best time being 02:00.

Therefore, if you are looking to create a post that receives optimal engagement from readers, I'd suggest creating an 'Ask' post, and posting it at 15:00 Eastern time.