#!/usr/bin/env python
# coding: utf-8

# # Project 2: Exploring Hacker News Posts

# ## 1. Introduction

# In this project, we'll work with a data set of submissions to the popular technology site `Hacker News`. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
#
# We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`.
# - Users submit `Ask HN` posts to ask the Hacker News community a specific question.
# - Users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.
#
# Our goal for this project is to compare these two types of posts and determine the following:
# - Do Ask HN or Show HN posts receive more comments on average?
# - Do posts created at a certain time receive more comments on average?

# ## 2. Removing the Header from a List of Lists

# ### 2.1. File: "hacker_news.csv"
#
# In the code cell below, we:
#
# - Import the `reader()` function from the `csv` module
# - Open the `hacker_news.csv` file using the `open()` function, and assign the output to a variable named `opened_file`. If you run into an error named `UnicodeDecodeError`, add `encoding="utf8"` to the `open()` function (for instance, use `open('hacker_news.csv', encoding='utf8')`)
# - Read in the `opened_file` using the `reader()` function, and assign the output to a variable named `read_file`
# - Transform the `read_file` to a list of lists using `list()` and save it to a variable named `hn`
# - Save the header to a variable named `header`
# - Remove the first row from `hn`
# - Display the header row and the first `5` rows of the data set.

# In[1]:


# hacker_news data set
from csv import reader

opened_file = open('hacker_news.csv', encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
header = hn[0]   # first row holds the column names
hn = hn[1:]      # keep only the data rows

print(header)
print('\n')
print(hn[:5])


# ## 3. Extracting Ask HN and Show HN Posts
#
# Now that we've removed the header from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.
#
# To find the posts that begin with either `Ask HN` or `Show HN`, we'll use the string method `startswith()`. Given a string object, say, `string1`, we can check whether it starts with, say, `dq` by inspecting the output of `string1.startswith('dq')`. If `string1` starts with `dq`, it will return `True`; otherwise it will return `False`.
#
# If we wish to control for case, we can use the `lower()` method, which returns a lowercase version of the starting string.

# ### 3.1. Separating Posts
#
# In the code cell below, we:
#
# - Create three empty lists called `ask_posts`, `show_posts` and `other_posts`
# - Loop through each row in `hn`
#     - Assign the title in each row to a variable named `title`
#         - Because the `title` column is the second column, you'll need to get the element at index `1` in each row
#     - Implement the following steps (use the `lower()` and `startswith()` methods):
#         - If the lowercase version of `title` starts with `ask hn`, append the row to `ask_posts`
#         - Else if the lowercase version of `title` starts with `show hn`, append the row to `show_posts`
#         - Else append the row to `other_posts`
# - Check the number of posts in `ask_posts`, `show_posts`, and `other_posts`
#

# In[2]:


# Separate posts whose titles begin with Ask HN or Show HN
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('ask_posts: ', len(ask_posts))
print('show_posts: ', len(show_posts))
print('other_posts: ', len(other_posts))


# ## 4. Calculating the Average Number of Comments for Ask HN and Show HN Posts
#
# In the previous section, we separated the `ask posts` and the `show posts` into two lists of lists named `ask_posts` and `show_posts`.
#
# Next, let's determine if `ask posts` or `show posts` receive more comments on average.

# ### 4.1. Average Number of Comments for Ask HN
#
# In the code cell below, we:
#
# - Find the total number of comments in ask posts and assign it to `total_ask_comments`
#     - Set `total_ask_comments` to `0`
#     - Use a for loop to iterate over the `ask posts`
#         - Because the `num_comments` column is the fifth column in `ask_posts`, you'll need to get the element at index `4` in each row
#         - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments
#         - Add this value to `total_ask_comments`
# - Compute the average number of comments on `ask posts` and assign it to `avg_ask_comments`
# - Print `avg_ask_comments`
#

# In[3]:


# Average Ask HN
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)


# ### 4.2. Average Number of Comments for Show HN
#
# In the code cell below, we:
#
# - Find the total number of comments in show posts and assign it to `total_show_comments`
#     - Set `total_show_comments` to `0`
#     - Use a for loop to iterate over the `show posts`
#         - Because the `num_comments` column is the fifth column in `show_posts`, you'll need to get the element at index `4` in each row
#         - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments
#         - Add this value to `total_show_comments`
# - Compute the average number of comments on `show posts` and assign it to `avg_show_comments`
# - Print `avg_show_comments`

# In[4]:


# Average Show HN
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)


# On average, ask posts receive approximately 10 comments, whereas show posts receive almost 5. Ask posts are therefore more likely to receive comments.

# ## 5. Finding the Number of Ask Posts and Comments by Hour Created
#
# Since `ask posts` are more likely to receive comments, we'll focus our remaining analysis just on these posts.
#
# Next, we'll determine if `ask posts` created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
#
# - Calculate the number of ask posts created in each hour of the day, along with the number of comments received
# - Calculate the average number of comments ask posts receive by hour created
#
# We'll tackle the first step here: calculating the number of ask posts and comments by hour created. We'll use the `datetime` module to work with the data in the `created_at` column.
#
# Recall that we can use the `datetime.strptime()` constructor to parse dates stored as `strings` and return `datetime objects`.

# ### 5.1. Importing the datetime Module
#
# In the cell below, we:
#
# - Import the `datetime` module as `dt`

# In[5]:


# import datetime module
import datetime as dt


# ### 5.2. Appending created_at and num_comments
#
# In the cell below, we:
#
# - Create an empty list and assign it to `result_list`. This will be a list of lists
# - Iterate over `ask_posts` and append to `result_list` a list with two elements:
#     - The first element shall be the column `created_at`
#         - Because the `created_at` column is the seventh column in `ask_posts`, you'll need to get the element at index `6` in each row
#     - The second element shall be the `number of comments` of the post
#         - You'll also need to convert the value to an integer
#

# In[6]:


# Appending columns: created_at and num_comments
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])


# ### 5.3. Calculating the Number of Ask Posts and Comments
#
# - Create two empty dictionaries called `counts_by_hour` and `comments_by_hour`
# - Loop through each row of `result_list`
#     - Extract the hour from the date, which is the first element of the row
#         - Use the `datetime.strptime()` method to parse the date string and create a datetime object
#             - Use the string we want to parse as the first argument and a string that specifies the format as the second argument
#         - Use the `datetime.strftime()` method to select just the hour from the datetime object
#     - If the hour isn't a key in `counts_by_hour`:
#         - Create the key in `counts_by_hour` and set it equal to `1`
#         - Create the key in `comments_by_hour` and set it equal to the `comment` number
#     - If the hour is already a key in `counts_by_hour`:
#         - Increment the value in `counts_by_hour` by `1`
#         - Increment the value in `comments_by_hour` by the `comment` number
#

# In[7]:


# number of ask posts and comments per hour
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date_string = row[0]
    time = dt.datetime.strptime(date_string, date_format)
    hour = time.strftime("%H")   # two-digit hour string, e.g. "15"
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]


# ## 6. Calculating the Average Number of Comments for Ask HN Posts by Hour
#
# We created two dictionaries:
#
# - **counts_by_hour:** Contains `the number of ask posts` created during each hour of the day
# - **comments_by_hour:** Contains the corresponding `number of comments` that ask posts created at each hour received
#
# Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

# In the code below, we:
# - Initialize an empty list of lists and assign it to `avg_by_hour`
# - Iterate over the keys of `comments_by_hour` (the hours)
# - Calculate the average number of comments for each hour and assign it to `avg_num`
# - Append the hour and `avg_num` to `avg_by_hour`

# In[8]:


# Average number of comments per hour
avg_by_hour = []

for hour in comments_by_hour:
    avg_num = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_num])

avg_by_hour


# ## 7. Sorting and Printing Values from a List of Lists
#
# ### 7.1. Part One:
#
# - Create a list that equals `avg_by_hour` with swapped columns
#     - Create an empty list and assign it to `swap_avg_by_hour`
#     - Iterate over the rows of `avg_by_hour` and append to `swap_avg_by_hour` a list whose first element is the second element of the row, and whose second element is the first element of the row
# - Print `swap_avg_by_hour`
#
# ### 7.2. Part Two:
#
# - Use the `sorted()` function to sort `swap_avg_by_hour` in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments
#     - Set the `reverse` argument to `True`, so that the highest value in the first column appears first in the list
#     - Assign the result to `sorted_swap`
#
# ### 7.3. Part Three:
#
# - Print the string "Top 5 Hours for Ask Posts Comments"
# - Loop through each average and each hour (in this order) in the first five lists of `sorted_swap`
#     - Use the `str.format()` method to print the hour and average in the following format: `15:00: 38.59 average comments per post`
#         - To format the hours, use the `datetime.strptime()` constructor to return a datetime object and then use the `strftime()` method to specify the format of the time
#         - To format the average, you can use `{:.2f}` to indicate that just two decimal places should be used
#

# In[24]:


# Sorting and Printing Values

# Part one
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

# Part two
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Part three
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    time_string = row[1]
    time_top = dt.datetime.strptime(time_string, "%H")
    hour_top = time_top.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_top, row[0]))


# The hour that receives the most comments per post on average is `15:00`, with an average of `28.68` comments per post. The time zone used is Eastern Time in the US; as a result, we could also write `15:00` as `3:00 pm EST`.

# ## 8. Conclusion:
#
# Based on our analysis, we recommend posting at `15:00` (`3:00 pm EST`) in order to have a higher chance of receiving more comments on an `Ask HN` post.
# Furthermore, posts created from `12:00` to `13:00` receive on average `28.7` comments per post, which makes that window another good option.
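
# The hour-by-hour steps from sections 5 to 7 can also be sketched as a single helper. This is only a minimal illustration and not part of the original analysis: the function name `average_comments_by_hour`, the constant `DATE_FORMAT`, and the sample rows are hypothetical and were invented for this example. The sketch assumes the same `%m/%d/%Y %H:%M` date format used above, and it sorts with a `key` function instead of swapping columns.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"  # assumed: same format string as in section 5.3


def average_comments_by_hour(rows):
    """Return [hour, average] pairs from [created_at, num_comments] rows."""
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return [[hour, comments[hour] / counts[hour]] for hour in counts]


# Hypothetical sample rows, made up for illustration (not taken from hacker_news.csv)
sample_rows = [
    ["8/16/2016 15:05", 10],
    ["8/17/2016 15:40", 20],
    ["8/17/2016 9:12", 4],
]

averages = average_comments_by_hour(sample_rows)

# Sort by the average (second element), highest first, without swapping columns
top = sorted(averages, key=lambda pair: pair[1], reverse=True)

for hour, avg in top:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

# With the real data, `result_list` from section 5.2 could be passed in place of `sample_rows`, and slicing the sorted list with `[:5]` would give the top five hours.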