
What Time is it?¶

Introduction¶

This guided project showcases how to work with strings, dates, times and object-oriented programming (OOP).

We'll introduce the datetime module to work with date and time data. The dataset we will be analyzing is from submissions to a site called Hacker News.

It is a platform where users post stories that can receive votes and comments. Posts that gain traction appear at the top of the site's listings, which can attract many more visitors.

The original dataset contained about 300,000 rows. We will work with a slice of the data, approximately 20,000 rows, provided by Dataquest. They cleaned the data by 'removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions'.

Dataset Description:¶

id: the unique post identifier

title: post title

url: the URL that the post links to, if the post has a URL

num_points: net votes (total number of upvotes minus the total number of downvotes)

num_comments: the number of comments on the post

author: the username of the person who submitted the post

created_at: the date and time of the post's submission

Users begin a post title with 'Ask HN' when asking a specific question, or with 'Show HN' when showing a project, product, or something recreational.

We will find which type of post receives more comments on average and at what time of day user interaction is most likely.

Let's begin by importing the appropriate libraries.

In [1]:

import datetime as dt
from csv import reader

Read in Data and Create List of Lists¶

In [2]:

with open('hacker_news.csv') as file: # the with block closes the file once it has been read
    file_reader = reader(file)
    hack_h = list(file_reader) # list with header row
print(hack_h[:3])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]

Remove the header row and assign it to the variable header, and the rest of the dataset to hack.

In [3]:

header = hack_h[0]
print(header)
hack = hack_h[1:]
print(hack[0])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
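
As an aside, the header row can also be used to look columns up by name rather than by position. The following is a minimal sketch, not part of the original notebook: it uses `csv.DictReader` on an in-memory copy of the header and first data row shown above (the `sample_csv` name is only for illustration).

```python
import csv
import io

# In-memory stand-in for the first two lines of hacker_news.csv (values copied from the output above).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader consumes the header row and yields one dict per data row.
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Columns can now be read by name instead of by index.
print(first["title"], first["num_comments"])
```

The rest of this notebook keeps the list-of-lists layout, so positional indexes such as `row[1]` and `row[4]` are used from here on.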

Extracting 'Ask HN' and 'Show HN' Posts¶

The different types of posts need to be isolated to find which had more comments on average.

Begin by initializing three empty lists named ask_posts, show_posts, and other_posts.

Loop over the hack dataset and find the submissions whose titles begin with 'Ask HN' or 'Show HN'.

We will use the string method str.startswith() to filter posts according to these conditions.

Applying the str.lower() method first simplifies the search, because it makes the comparison case-insensitive.

In [4]:

ask_posts = []
show_posts = []
other_posts = []

for row in hack:
    
    title = row[1] # assign the title element in the dataset to the variable 'title'
    lowc = title.lower() # apply the .lower() method to return the lowercase version of string
    
    if lowc.startswith('ask hn'):
        ask_posts.append(row)
        
    elif lowc.startswith('show hn'):
        show_posts.append(row)
        
    else:
        other_posts.append(row)

print(
    'Total ask posts: ', len(ask_posts), '\n',
    'Total show posts: ', len(show_posts), '\n',
    'Other Posts: ', len(other_posts)
     )

Total ask posts:  1744 
 Total show posts:  1162 
 Other Posts:  17194
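
The filtering rule used in the loop above can also be written as a small function, which makes it easy to check on individual titles. This is only a sketch: `classify_title` is a hypothetical helper name, and the titles passed to it are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, used only to exercise the three branches.
for title in ['Ask HN: What are you reading?', 'SHOW HN: My weekend project', 'Interactive Dynamic Video']:
    print(title, '->', classify_title(title))
```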

Calculating the Average Number of Comments for Ask HN and Show HN Posts¶

We have filtered the posts into their respective lists: 1,744 ask posts, 1,162 show posts, and 17,194 other posts.

Now we will determine which type of post, on average, received more comments.

We will do this by iterating over the appropriate list and summing the comments each post received.

Let's begin with the 'Ask HN' posts.

In [5]:

total_ask_comments = 0 # this variable will start the sum for the total number of comments

for row in ask_posts:

    comments = int(row[4]) # assign the integer value of the num_comments column to the variable comments 
    total_ask_comments += comments # begin the summation
    
print(f"Total 'ask' comments: {total_ask_comments:,}")

avg_ask_comments = total_ask_comments / len(ask_posts) # find the average by dividing the total no. of comments by the total no. of posts

print(f"Average 'ask post' comments: {avg_ask_comments:.2f}")

Total 'ask' comments: 24,483
Average 'ask post' comments: 14.04

Now do the same for the show posts.

In [6]:

total_show_comments = 0

for row in show_posts:

    comm = int(row[4])
    total_show_comments += comm
    
print(f"Total 'show' comments: {total_show_comments:,}")

avg_show_comments = total_show_comments / len(show_posts)

print(f"Average 'show post' comments: {avg_show_comments:.2f}")

Total 'show' comments: 11,988
Average 'show post' comments: 10.32
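
Since the two cells above differ only in the list they loop over, the calculation could be factored into one function. The sketch below assumes the same row layout (num_comments at index 4). `average_comments` is a hypothetical helper, and the sample rows reuse three comment counts that appear later in this notebook with placeholder values in the other columns.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column over a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Placeholder rows: only the num_comments and created_at fields carry real values.
sample_posts = [
    ['0', 'Ask HN: placeholder', '', '0', '6', 'user', '8/16/2016 9:55'],
    ['0', 'Ask HN: placeholder', '', '0', '29', 'user', '11/22/2015 13:43'],
    ['0', 'Ask HN: placeholder', '', '0', '1', 'user', '5/2/2016 10:14'],
]

print(f"Average comments: {average_comments(sample_posts):.2f}")
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.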

Finding the Number of Ask Posts and Comments per Hour¶

Comparing the averages, we find that 'ask' posts receive roughly 1.4 times as many comments as 'show' posts.

The average is about 14 comments per ask post versus about 10 comments per show post.

Because 'ask' posts receive more comments on average than 'show' posts, we will focus only on 'ask' posts for now.

Initialize a list to store, for each post, a list containing its creation time and its number of comments.
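
The ratio quoted above can be checked directly from the totals printed in the two previous cells. This is just the arithmetic, with the totals and post counts copied from the outputs above.

```python
# Totals and post counts copied from the outputs above.
total_ask_comments, ask_count = 24_483, 1_744
total_show_comments, show_count = 11_988, 1_162

avg_ask = total_ask_comments / ask_count
avg_show = total_show_comments / show_count
ratio = avg_ask / avg_show

print(f"ask: {avg_ask:.2f}, show: {avg_show:.2f}, ratio: {ratio:.2f}")
```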

In [7]:

result_list = []

for row in ask_posts:
    
    created = row[6] # assign the date and time column to the variable created
    num_comments = int(row[4]) # assign the number of comments the post received to the variable num_comments
    comment_info = [created, num_comments] # make a list of the previous two variables
    result_list.append(comment_info)

print(result_list[:3])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

In [8]:

posts_p_hour = {}
comments_p_hour = {}
    
for row in result_list: # iterate over the list of lists that stores the datetime and no. of comments for each post
    
    date = row[0] # assign the date data to the variable date
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M") # strptime() parses the date string into a datetime object
    hour = hour.strftime("%H") # strftime() formats that datetime object as a zero-padded hour string, e.g. '09'
    
    if hour not in posts_p_hour: # counts the number of posts /hr while summing the no. of comments for each post /hr
        posts_p_hour[hour] = 1 
        comments_p_hour[hour] = row[1]
    else:
        posts_p_hour[hour] += 1
        comments_p_hour[hour] += row[1]
        
print(comments_p_hour, '\n', posts_p_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641} 
 {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
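
The same tallying can be written without the `if hour not in ...` branch by using `collections.defaultdict`, and the hour can be read from the parsed object's `.hour` attribute (an integer) instead of formatting it back to a string. The following is a sketch of that alternative, run on the first three entries of result_list shown earlier. The `_by_hour` names are illustrative and not part of the original notebook.

```python
import datetime as dt
from collections import defaultdict

# First three [created_at, num_comments] pairs from result_list above.
sample_results = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

posts_by_hour = defaultdict(int)     # missing keys start at 0
comments_by_hour = defaultdict(int)

for created, num_comments in sample_results:
    hour = dt.datetime.strptime(created, "%m/%d/%Y %H:%M").hour  # integer hour, 0 to 23
    posts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(posts_by_hour))
print(dict(comments_by_hour))
```

Note that the keys here are integers such as 9, whereas the cell above uses zero-padded strings such as '09'.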

Find the Average Number of Comments Ask Posts Receive by Hour.¶

In [9]:

avg_by_hour = []

for hour in comments_p_hour:
    
    num_comm = comments_p_hour[hour]
    num_posts = posts_p_hour[hour]
    
    avg = num_comm / num_posts
    
    avg_by_hour.append([hour, avg])

print(sorted(avg_by_hour))

[['00', 8.127272727272727], ['01', 11.383333333333333], ['02', 23.810344827586206], ['03', 7.796296296296297], ['04', 7.170212765957447], ['05', 10.08695652173913], ['06', 9.022727272727273], ['07', 7.852941176470588], ['08', 10.25], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['11', 11.051724137931034], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['15', 38.5948275862069], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532], ['19', 10.8], ['20', 21.525], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.985294117647059]]

Sorting and Printing Values from List of Lists¶

Create a list that reverses the columns in avg_by_hour.

As seen above, when given a list of lists, the sorted() function orders the inner lists by their first element.

We would like to see in descending order the hours with the highest average number of comments.

Swap the columns.

In [10]:

swap_columns = []

for row in avg_by_hour:
    swap_columns.append([row[1], row[0]])
    
print(swap_columns[:6])

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23']]

Sort `swap_columns` in descending order and print the first five entries.

In [11]:

sorted_swap = sorted(swap_columns, reverse = True)

print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]: # slice the first five entries to match the 'Top 5' heading
    
    hour = dt.datetime.strptime(row[1], "%H") # return a datetime object from the hour index
    hour = hour.strftime("%H:%M") # utilizing the .strftime() method to specify the time format
    comments = row[0]
    
    print(f"{hour}: {comments:.2f} average comments per post")

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

The output above lists the hours with the highest average number of comments per Ask post.

Note that the timestamps in the original dataset are in Eastern time (EDT), and we want them in Central time (CDT), which is one hour behind.

E.g., 11am Central is 12pm Eastern.

Modify the previous loop to include this conversion. Alternatively, we could go back to the beginning and convert the timestamps before doing any analysis.
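
Swapping the columns is one way to sort by the average. Another is to pass a `key` function to sorted(), which leaves each `[hour, average]` pair in its original order. Here is a minimal sketch on four pairs copied from avg_by_hour (the `sample_avg_by_hour` name is illustrative):

```python
# Four [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key picks the value to sort on (the average at index 1); reverse=True gives descending order.
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```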

In [12]:

print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]: # slice the first five entries to match the 'Top 5' heading
    hour = dt.datetime.strptime(row[1], "%H")
    hour = hour - dt.timedelta(hours=1) # shift from Eastern to Central time
    hour = hour.strftime("%H:%M")
    comments = row[0]
    print(f"{hour}: {comments:.2f} average comments per post")

Top 5 Hours for Ask Posts Comments
14:00: 38.59 average comments per post
01:00: 23.81 average comments per post
19:00: 21.52 average comments per post
15:00: 16.80 average comments per post
20:00: 16.01 average comments per post

This works nicely for the top 5 hours. However, note that subtracting one hour from the 0th hour wraps around to the 23rd hour of the previous day. We shall delve into this at a later time.
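
To make the wrap-around concrete, here is a short sketch of what happens at hour 00, along with two possible ways of handling the shift. Neither is the notebook's chosen method. `shift_hour` is a hypothetical helper, and the fixed offsets (EDT = UTC-4, CDT = UTC-5) are an assumption that ignores the switch to standard time in winter.

```python
import datetime as dt

# strptime with only "%H" fills in the default date 1900-01-01,
# so subtracting an hour from 00:00 lands on the previous day.
midnight = dt.datetime.strptime("00", "%H")
shifted = midnight - dt.timedelta(hours=1)
print(midnight, '->', shifted)

# Option 1: shift only the hour label, wrapping with modular arithmetic.
def shift_hour(hour_str, offset=-1):
    """Shift a zero-padded hour string by offset hours, wrapping within 00 to 23."""
    return f"{(int(hour_str) + offset) % 24:02d}"

print(shift_hour("15"), shift_hour("00"))

# Option 2: convert each full timestamp before grouping by hour.
# Assumption: fixed offsets, EDT = UTC-4 and CDT = UTC-5.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")
CDT = dt.timezone(dt.timedelta(hours=-5), "CDT")

created = dt.datetime.strptime("8/4/2016 11:52", "%m/%d/%Y %H:%M").replace(tzinfo=EDT)
central = created.astimezone(CDT)
print(central.strftime("%m/%d/%Y %H:%M %Z"))
```

Option 2 keeps the date attached to each post, so a post made just after midnight Eastern is counted on the previous Central day rather than only having its hour label changed.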

Conclusion¶

We wanted to find the number of posts and comments per hour, and the average number of comments per post per hour, from the dataset sourced from the site Hacker News.

For loops, dictionaries and the datetime module were utilized to sort, format and analyze the data.

The findings show that posts beginning with 'Ask HN' are more numerous (1,744 versus 1,162) and receive more comments on average (about 14 versus 10) than posts beginning with 'Show HN'.

One possible inference is that users are more inclined to ask the community for help than to showcase a project and, given the higher number of comments, that other users often share the same questions.

In CDT, the top three hours for average comments per post were 2pm, 1am, and 7pm.

Further Analysis¶

Let's delve deeper into the dataset and gain more insight from the following:

Determine if show or ask posts receive more points on average.
Determine if posts created at a certain time are more likely to receive more points.
Compare your results to the average number of comments and points other posts receive.
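
As a possible starting point for these questions, the per-hour averaging used for comments could be generalized to any numeric column. The sketch below is not part of the analysis above: `average_by_hour` is a hypothetical helper, and it is run on the two data rows printed at the top of this notebook, so its output says nothing about the full dataset.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(rows, value_index, date_index=6):
    """Average the integer column at value_index, grouped by hour of created_at."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# The two data rows shown in the first output cell of this notebook.
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
]

print(average_by_hour(sample_rows, value_index=3))  # num_points
print(average_by_hour(sample_rows, value_index=4))  # num_comments
```

Passing ask_posts, show_posts, or other_posts with `value_index=3` would give the average points per hour for each group.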