#!/usr/bin/env python
# coding: utf-8

# **Overview:**
#
# **Hacker News (https://news.ycombinator.com/) is a social news site run by the startup incubator Y Combinator, where users submit, vote on, and discuss stories about technology and startups. In this project, we will look at two types of posts on HN: "Ask HN" and "Show HN". Users use the former to ask the HN community a specific question, and the latter to show the community something interesting, such as a project or product.**
#
# **This project will compare these two types of posts and analyze which receives more comments on average. It will also analyze the timestamps of these posts to see which hour of the day is best for posting in order to receive comments.**

# **Part 1: Reading the csv file and understanding the data**

# In[1]:


import datetime as dt
from csv import reader


# In[2]:


# Read the whole CSV into memory and split off the header row.
with open("HN_posts_year_to_Sep_26_2016.csv") as opened:
    read_file = reader(opened)
    data = list(read_file)
headers = data[0]
hn = data[1:]


# In[3]:


print(headers, '\n')
for row in hn[:5]:
    print(row, '\n')


# **Part 2:**
#
# **To isolate only the posts whose titles begin with "Ask HN" or "Show HN", we can use the `startswith` string method. The cells below demonstrate the approach on a single title.**

# In[4]:


eg_1 = hn[42][1]
eg_1


# In[5]:


eg_1.startswith("Ask HN")


# **Now, we apply this approach to all the post titles in our dataset.**
#
# **To do so, we first create three empty lists, one for each category of title. We then loop through each row in the dataset and read the title from index 1 of each row.
# We then lowercase the title with the `.lower()` method and pass it into a conditional that checks whether it starts with 'ask hn' or 'show hn'.**

# In[6]:


ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()  # lowercase so the prefix check is case-insensitive
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)


# **Next, we create a function called `show_rows` to double check that our lists contain the correct posts.**

# In[7]:


def show_rows(dataset):
    # Print the first five rows, separated by blank lines.
    for row in dataset[:5]:
        print(row, '\n')


# In[8]:


show_rows(ask_posts)


# In[9]:


show_rows(show_posts)


# **The titles match the lists they are in. The dataset breaks down into 9,139 ask posts, 10,158 show posts, and 273,822 other posts.**

# In[10]:


print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


# **Part 3:**
#
# **In this section, we compare the average number of comments in the `ask_posts` and `show_posts` lists. To do this, we create a function `total_comments` that returns the total number of comments from each list.
# Then, we create another function named `avg_comment` that returns the average number of comments per post for each list.**

# In[11]:


def total_comments(dataset):
    # Sum the comment counts (index 4) over all rows.
    total = 0
    for row in dataset:
        total += int(row[4])
    return total

total_ask_comments = total_comments(ask_posts)
total_show_comments = total_comments(show_posts)


# In[12]:


def avg_comment(total_com, dataset):
    # Average comments per post, rounded to two decimal places.
    return round(total_com / len(dataset), 2)

avg_ask_comments = avg_comment(total_ask_comments, ask_posts)
avg_show_comments = avg_comment(total_show_comments, show_posts)


# In[13]:


print(avg_ask_comments)
print(avg_show_comments)


# **From our analysis, we can see that the average "Ask HN" post receives roughly 10.4 comments, while the average "Show HN" post receives about 4.9 comments.**

# **Part 4:**
#
# **Now that we know which type of post receives more comments, we can look deeper into the timing of each post. Our objective is to determine whether posts created at a specific hour, or range of hours, receive the most comments. To narrow the scope of our investigation, we will look only at the comments and timestamps in the 'Ask HN' dataset.**
#
# **The first step for this part is to create a new, empty list called `results_list`. We then iterate through each row in `ask_posts` and append the time each post was created, along with the number of comments it received, to `results_list`.**

# In[14]:


results_list = []

for row in ask_posts:
    time_create = row[-1]      # creation timestamp (last column)
    num_comment = int(row[4])  # comment count
    combined = [time_create, num_comment]
    results_list.append(combined)


# **Below is a sample of the `results_list`.**

# In[15]:


show_rows(results_list)


# **We then pass the info from `results_list` into two dictionaries, `counts_by_hour` and `comments_by_hour`.
# To do this, we iterate through each entry in `results_list` and create `time` and `comment_num` variables for the respective values in the entry.**
#
# **Next, we turn the `time` variable into a datetime object, which allows us to extract the hour of each post. We save this value as `hour`. Then, we check whether it is a key in the `counts_by_hour` dict. If it isn't, we create a key-value pair for it with a count of 1. If it is, we add one to the existing value.**
#
# **We use a similar process for the `comments_by_hour` dict. We check whether the hour is a key in the dict, and if it is not, we set that key equal to the `comment_num` variable. If it is already a key, we add `comment_num` to the existing value.**

# In[16]:


counts_by_hour = {}
comments_by_hour = {}

for row in results_list:
    time = row[0]
    comment_num = row[1]
    time_obj = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = time_obj.strftime("%H")  # zero-padded hour string, e.g. "15"

    # Number of posts created in this hour.
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
    else:
        counts_by_hour[hour] += 1

    # Total comments received by posts created in this hour.
    if hour not in comments_by_hour:
        comments_by_hour[hour] = comment_num
    else:
        comments_by_hour[hour] += comment_num


# **Now, we are ready to calculate the average number of comments per post for each hour. To do this, we create a new, empty list called `avg_by_hour`. We then loop through each item in `counts_by_hour` and append the hour, the `k` variable, to the `avg_by_hour` list.**
#
# **We use the same `k` variable to access the total number of comments for that hour from the `comments_by_hour` dict. We divide this value by `v`, the corresponding post count from `counts_by_hour`. This division (the total number of comments for that hour / the number of posts in that hour) gives us the average number of comments per post.**

# In[17]:


avg_by_hour = []

for k, v in counts_by_hour.items():
    avg_by_hour.append([k, round(comments_by_hour[k] / v, 2)])


# **Below is a sample of the `avg_by_hour` list.
# The first value of each row is the hour, and the second value is the average number of comments per post for that hour.**

# In[18]:


show_rows(avg_by_hour)


# **We then sort `avg_by_hour` in descending order of average comments per post, and format the top five values for display.**

# In[19]:


avg_by_hour.sort(key=lambda row: row[1], reverse=True)


# In[20]:


print("Top 5 Hours for 'Ask HN' Comments:")
for row in avg_by_hour[:5]:
    time_obj = dt.datetime.strptime(row[0], "%H")
    hour = time_obj.strftime("%H:%M")
    avg = row[1]  # named `avg` so the `avg_comment` function is not overwritten
    avg_comment_str = f"{avg} average comments per post"
    print(f"{hour} : {avg_comment_str}")


# **The best time to post a question on HN is during the 15:00 hour. The second best hour to ask a question is 13:00, and the third best is 12:00. Since the top three hours all fall between noon and 16:00, the early afternoon appears to be the best period to ask a question.**
#
# **We can also observe that the 15:00 hour is well ahead of the other hours. Its posts receive roughly 60% more comments on average than those of the second best hour, 13:00.**

# **Conclusion:**
#
# **This project analyzed roughly 293,000 rows of data on posts submitted to Hacker News (HN). Specifically, it looked at two types of posts on the website, 'Ask HN' and 'Show HN', which together accounted for roughly 19,000 rows of our dataset.**
#
# **We were able to determine that 'Ask HN' posts receive on average about 5.5 more comments per post than 'Show HN' posts (roughly 10.4 versus 4.9). We also identified the best hour to post an 'Ask HN' question and receive responses: 15:00.**
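
# **As a compact restatement of the hourly averaging in Part 4, the sketch below counts posts and comments per hour with `collections.defaultdict` and returns the averages sorted from highest to lowest. It is only an illustration, assuming rows shaped like `results_list` (a timestamp string followed by a comment count). The function name `average_comments_by_hour` and the sample rows are hypothetical and are not part of the analysis above.**

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, fmt="%m/%d/%Y %H:%M"):
    """Return [hour, average comments per post] pairs, highest average first.

    `rows` is a list of [created_at, num_comments] pairs, the same shape
    as `results_list` in Part 4.
    """
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # comments received by those posts
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, fmt).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    averages = [[hour, round(comments[hour] / counts[hour], 2)] for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample rows for illustration, not taken from the dataset.
sample_rows = [
    ["9/26/2016 15:05", 30],
    ["9/25/2016 15:40", 10],
    ["9/26/2016 2:53", 7],
    ["9/24/2016 13:10", 12],
]
print(average_comments_by_hour(sample_rows))
```

# **Calling `average_comments_by_hour(results_list)[:5]` should yield the same kind of top-five list as the cells in Part 4. Using `defaultdict(int)` removes the explicit "is this hour already a key" checks, at the cost of one extra import.**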