This guided project showcases how to work with strings, dates, times, and object-oriented programming (OOP).
We'll introduce the datetime module to work with date and time data. The dataset we will be analyzing is from submissions to a site called Hacker News.
It is a platform where users post stories that can receive votes and comments. Posts that gain traction appear at the top of the site listings, which can attract many more visitors.
The original dataset contained about 300,000 rows. We will work with a slice of the data, approximately 20,000 rows, provided by Dataquest. They cleaned the data by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.
The dataset has the following columns:
id: the unique post identifier
title: post title
url: the URL that the post links to, if the post has a URL
num_points: net votes (total number of upvotes minus the total number of downvotes)
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time of the post's submission
Users begin a post's title with 'Ask HN' when asking a specific question, or with 'Show HN' when showing a project, product, or something recreational.
We will find which type of post receives more comments on average and at what time of day user interaction is most likely.
Let's begin by importing the appropriate libraries.
import datetime as dt
from csv import reader
file = open('hacker_news.csv')
file_reader = reader(file)
hack_h = list(file_reader) # list with header row
print(hack_h[:3])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
Remove the header row and assign it to the variable header, and the rest of the dataset to hack.
header = hack_h[0]
print(header)
hack = hack_h[1:]
print(hack[0])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
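The file opened above is never explicitly closed. A minimal sketch of the same read using a with block, so the file is closed automatically, might look like the following. The helper name read_rows and the in-memory sample are assumptions made for illustration; they are not part of the original project.

```python
import csv
import io


def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Hypothetical two-line sample standing in for hacker_news.csv.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# With the real file this would be: with open('hacker_news.csv', newline='') as f:
with sample as f:
    header, hack = read_rows(f)

print(header)
print(hack[0])
```

Passing the file object into a small function keeps the parsing step separate from where the data comes from, which makes it easy to try on a small sample first.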
The different types of posts need to be isolated to find which had more comments on average.
Begin by initializing three empty lists named ask_posts, show_posts, and other_posts.
Loop over the hack dataset and find the submissions whose titles begin with 'Ask HN' or 'Show HN'.
We will use the string method str.startswith() to filter posts by these conditions.
Applying the str.lower() method first simplifies the search, because the comparison no longer depends on capitalization.
ask_posts = []
show_posts = []
other_posts = []
for row in hack:
    title = row[1] # assign the title element in the dataset to the variable 'title'
    lowc = title.lower() # apply the .lower() method to return the lowercase version of string
    if lowc.startswith('ask hn'):
        ask_posts.append(row)
    elif lowc.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(
    'Total ask posts: ', len(ask_posts), '\n',
    'Total show posts: ', len(show_posts), '\n',
    'Other Posts: ', len(other_posts)
)
Total ask posts: 1744
Total show posts: 1162
Other Posts: 17194
We have filtered the posts into their respective lists: 1,744 ask posts, 1,162 show posts, and 17,194 other posts.
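The prefix test used in the loop can also be isolated in a small function, which makes it easy to check on a few titles before running it over the whole dataset. The sketch below is illustrative only: the function name classify_title and the sample titles are assumptions, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalise case so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles, not rows taken from the dataset.
titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in titles]
print(labels)  # ['ask', 'show', 'other']
```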
Now we will determine which type of post, on average, received more comments.
We will do this by iterating over the appropriate list and summing the comments each post received.
Let's begin with the 'Ask HN' posts.
total_ask_comments = 0 # running total of comments across all ask posts
for row in ask_posts:
    comments = int(row[4]) # assign the integer value of the num_comments column to the variable comments
    total_ask_comments += comments # add this post's comments to the running total
print(f"Total 'ask' comments: {total_ask_comments:,}")
avg_ask_comments = total_ask_comments / len(ask_posts) # find the average by dividing total no. of comments by total posts
print(f"Average 'ask post' comments: {avg_ask_comments:.2f}")
Total 'ask' comments: 24,483
Average 'ask post' comments: 14.04
Now do the same for the show posts.
total_show_comments = 0
for row in show_posts:
    comm = int(row[4])
    total_show_comments += comm
print(f"Total 'show' comments: {total_show_comments:,}")
avg_show_comments = total_show_comments / len(show_posts)
print(f"Average 'show post' comments: {avg_show_comments:.2f}")
Total 'show' comments: 11,988
Average 'show post' comments: 10.32
Comparing the averages, we find that 'ask' posts receive about 1.4 times as many comments as 'show' posts.
The average is roughly 14 comments for ask posts versus 10 comments for show posts.
Because 'ask' posts tend to receive more comments than 'show' posts, we will focus only on 'ask' posts for now.
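The two loops above differ only in the list they iterate over, so the averaging step could be factored into one helper. The sketch below is a possible refactoring, not the notebook's own code: the function name average_comments, its comments_index parameter, and the three sample rows are assumptions. The ratio is recomputed from the totals and counts reported above.

```python
def average_comments(posts, comments_index=4):
    """Average the num_comments column (index 4 by default) over a list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the dataset's column order; only num_comments matters here.
sample_posts = [
    ['1', 'Ask HN: a', '', '3', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: b', '', '5', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: c', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)  # 12.0

# Ratio of the two averages, using the totals and post counts printed above.
avg_ask = 24483 / 1744
avg_show = 11988 / 1162
ratio = avg_ask / avg_show
print(f"ask: {avg_ask:.2f}, show: {avg_show:.2f}, ratio: {ratio:.2f}")
```

With the full lists, average_comments(ask_posts) and average_comments(show_posts) would replace the two separate loops.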
Initialize a list to store, for each ask post, a two-element list holding the time the post was created and its number of comments.
result_list = []
for row in ask_posts:
    created = row[6] # assign the date and time column to the variable created
    num_comments = int(row[4]) # assign the number of comments the post received to the variable num_comments
    comment_info = [created, num_comments] # make a list of the previous two variables
    result_list.append(comment_info)
print(result_list[:3])
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]
Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
posts_p_hour = {}
comments_p_hour = {}
for row in result_list: # iterate over the list of lists that stores the datetime and no. of comments for each post
    date = row[0] # assign the date data to the variable date
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M") # dt.datetime.strptime() parses the string into a datetime object
    hour = hour.strftime("%H") # datetime.strftime() formats the datetime as a zero-padded hour string, e.g. '09'
    if hour not in posts_p_hour: # count the posts per hour while summing the comments per hour
        posts_p_hour[hour] = 1
        comments_p_hour[hour] = row[1]
    else:
        posts_p_hour[hour] += 1
        comments_p_hour[hour] += row[1]
print(comments_p_hour, '\n', posts_p_hour)
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
avg_by_hour = []
for hour in comments_p_hour:
    num_comm = comments_p_hour[hour]
    num_posts = posts_p_hour[hour]
    avg = num_comm / num_posts
    avg_by_hour.append([hour, avg])
print(sorted(avg_by_hour))
[['00', 8.127272727272727], ['01', 11.383333333333333], ['02', 23.810344827586206], ['03', 7.796296296296297], ['04', 7.170212765957447], ['05', 10.08695652173913], ['06', 9.022727272727273], ['07', 7.852941176470588], ['08', 10.25], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['11', 11.051724137931034], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['15', 38.5948275862069], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532], ['19', 10.8], ['20', 21.525], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.985294117647059]]
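The same per-hour tallies can be built a little more compactly. The sketch below is one possible variant, not the notebook's method: it reads the hour from the parsed datetime's .hour attribute (an integer from 0 to 23) instead of formatting it with strftime, and it uses collections.defaultdict so no "first time seen" branch is needed. The variable names are assumptions; the sample data are the first three entries of result_list shown earlier.

```python
import datetime as dt
from collections import defaultdict

# The first three [created_at, num_comments] pairs printed earlier.
sample = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

posts_per_hour = defaultdict(int)     # missing keys start at 0
comments_per_hour = defaultdict(int)
for created, num_comments in sample:
    parsed = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")
    hour = parsed.hour                # integer hour, 0-23
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Average comments per post for each hour that appears in the sample.
avg_per_hour = {h: comments_per_hour[h] / posts_per_hour[h] for h in posts_per_hour}
print(avg_per_hour)  # {9: 6.0, 13: 29.0, 10: 1.0}
```

Note that the keys here are integers rather than zero-padded strings such as '09', so the later formatting step would need adjusting if this variant were adopted.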
Create a list that reverses the columns in avg_by_hour.
As seen above, the sorted() function orders a list of lists by the first element of each inner list, which here is the hour.
We would like to see the hours in descending order of average number of comments, so we swap the columns to put the average first.
swap_columns = []
for row in avg_by_hour:
    swap_columns.append([row[1], row[0]])
print(swap_columns[:6])
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23']]
Sort swap_columns in descending order and print the first five entries.
sorted_swap = sorted(swap_columns, reverse = True)
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]: # slice [:5] keeps exactly five entries
    hour = dt.datetime.strptime(row[1], "%H") # return a datetime object from the hour index
    hour = hour.strftime("%H:%M") # utilizing the .strftime() method to specify the time format
    comments = row[0]
    print(f"{hour}: {comments:.2f} average comments per post")
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
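Swapping the columns is one way to sort by the average. As an alternative sketch, sorted() also accepts a key function, so the original [hour, average] pairs can be ordered by their second element directly. The four pairs below are copied from avg_by_hour above; the variable name top is an assumption for illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour = [
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['09', 5.5777777777777775],
    ['20', 21.525],
]

# key picks the value to sort on (the average); reverse=True gives descending order.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

This avoids building a second list, at the cost of a slightly less obvious sorted() call.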
The average number of comments per post for the top five hours for Ask posts is shown above.
Note that the original dataset records these times in EDT, and we need them in CDT.
For example, 11am Central is 12pm Eastern.
Modify the previous loop to include this conversion. Alternatively, we could go back to the beginning and convert the times before doing any analysis.
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]: # slice [:5] keeps exactly five entries
    hour = dt.datetime.strptime(row[1], "%H")
    hour = hour - dt.timedelta(hours=1) # Central time is one hour behind Eastern
    hour = hour.strftime("%H:%M")
    comments = row[0]
    print(f"{hour}: {comments:.2f} average comments per post")
Top 5 Hours for Ask Posts Comments
14:00: 38.59 average comments per post
01:00: 23.81 average comments per post
19:00: 21.52 average comments per post
15:00: 16.80 average comments per post
20:00: 16.01 average comments per post
This works nicely for the top five hours. Note, however, that with this approach hour 00 becomes hour 23, which belongs to the previous day. We shall delve into this at a later time.
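The wraparound can be seen in a short sketch. It assumes a fixed one-hour offset between Eastern and Central time; the function name eastern_to_central_hour and the sample timestamp are hypothetical. The first part shifts an hour string with modular arithmetic, and the second shows that subtracting a timedelta from a full timestamp also rolls the date back.

```python
import datetime as dt


def eastern_to_central_hour(hour_str):
    """Shift an 'HH' hour string back one hour, wrapping '00' around to '23'."""
    return f"{(int(hour_str) - 1) % 24:02d}"


print(eastern_to_central_hour('15'))  # 14
print(eastern_to_central_hour('00'))  # 23

# With a full timestamp, the date changes along with the hour.
created = dt.datetime.strptime('8/4/2016 0:30', "%m/%d/%Y %H:%M")  # hypothetical post time
central = created - dt.timedelta(hours=1)
print(central.strftime("%m/%d/%Y %H:%M"))  # 08/03/2016 23:30
```

Converting the full created_at timestamps before grouping by hour, rather than shifting the hour labels afterwards, keeps the date and the hour consistent with each other.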
We set out to find the number of posts and comments per hour, and the average number of comments per post per hour, in a dataset sourced from the site Hacker News.
For loops, dictionaries, and the datetime module were used to sort, format, and analyze the data.
The findings show that posts beginning with 'Ask HN' are more numerous and receive more comments on average than posts beginning with 'Show HN'.
One possible inference is that users are more prone to ask the community for help and, given the higher number of comments, that other users often share the same question.
In CDT, the top three hours for average comments per post were 2pm, 1am, and 7pm.
Let's delve deeper into the dataset and gain more insight from the following:
Determine if show or ask posts receive more points on average.
Determine if posts created at a certain time are more likely to receive more points.
Compare your results to the average number of comments and points other posts receive.