#!/usr/bin/env python
# coding: utf-8

# # Project 2: Exploring Hacker News Posts

# ## 1. Introduction

# In this project, we'll work with a data set of submissions to the popular technology site `Hacker News`. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
# 
# We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`.
# - Users submit `Ask HN` posts to ask the Hacker News community a specific question. 
# - Users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.
# 
# Our goal for this project is to compare these two types of posts and determine the following:
# - Do Ask HN or Show HN posts receive more comments on average?
# - Do posts created at a certain time receive more comments on average?

# ## 2. Removing the Header from a List of Lists

# ### 2.1. File: "hacker_news.csv"
# 
# In the code cell below, we:
# 
# - Import the `reader()` function from the `csv` module
# - Open the `hacker_news.csv` file using the `open()` function, and assign the output to a variable named `opened_file`. If you run into a `UnicodeDecodeError`, add `encoding="utf8"` to the `open()` call (for instance, use `open('hacker_news.csv', encoding='utf8')`)
# - Read in the `opened_file` using the `reader()` function, and assign the output to a variable named `read_file`
# - Transform the `read_file` to a list of lists using `list()` and save it to a variable named `hn`
# - Save the header to a variable named `header`
# - Remove the first row from `hn`
# - Display the header row and the first `5` rows of the data set.
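# 
# As a minimal sketch of the header/data split described above, the cell below runs the same steps on a small in-memory CSV instead of the real file. The `demo_*` names and the two sample rows are made up for illustration and are not part of the data set.

# In[ ]:


import io
from csv import reader

# Hypothetical two-row CSV held in memory, so no file on disk is needed
demo_file = io.StringIO("id,title\n1,Ask HN: Example question\n2,Show HN: Example project\n")

demo_rows = list(reader(demo_file))  # list of lists, header included
demo_header = demo_rows[0]           # first row holds the column names
demo_rows = demo_rows[1:]            # everything after the header is data

print(demo_header)  # ['id', 'title']
print(demo_rows)    # [['1', 'Ask HN: Example question'], ['2', 'Show HN: Example project']]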

# In[1]:


# hacker_news data set

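from csv import reader

# Use a context manager so the file is closed once it has been read
with open('hacker_news.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Separate the header row from the data rows
header = hn[0]
hn = hn[1:]

print(header)
print('\n')
print(hn[:5])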


# ## 3. Extracting Ask HN and Show HN Posts
# 
# Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.
# 
# To find the posts that begin with either `Ask HN` or `Show HN`, we'll use the string method `startswith()`. Given a string object, say `string1`, we can check whether it starts with, say, `dq` by calling `string1.startswith('dq')`. If `string1` starts with `dq`, the call returns `True`; otherwise it returns `False`.
# 
# If we want the check to ignore case, we can first call the `lower()` method, which returns a lowercase version of the original string.
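# 
# As a quick illustration of the two string methods described above, here is a minimal sketch. The `sample_title` string is made up for the example and is not taken from the data set.

# In[ ]:


# Hypothetical title, used only to illustrate the string methods
sample_title = "Ask HN: How do you learn Python?"

# startswith() is case-sensitive
print(sample_title.startswith("Ask HN"))          # True
print(sample_title.startswith("ask hn"))          # False

# Lowercasing first makes the check case-insensitive
print(sample_title.lower().startswith("ask hn"))  # True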

# ### 3.1 Separating Posts
# 
# In the code cell below, we:
# 
# - Create three empty lists called `ask_posts`, `show_posts` and `other_posts`
# - Loop through each row in `hn`
#     - Assign the title in each row to a variable named `title`
#         - Because the `title` column is the second column, you'll need to get the element at index `1` in each row
# - Implement the following steps (use the `lower()` and `startswith()` methods):
#     - If the lowercase version of `title` starts with `ask hn`, append the row to `ask_posts`
#     - Else if the lowercase version of `title` starts with `show hn`, append the row to `show_posts`
#     - Else append to `other_posts`
# - Check the number of posts in `ask_posts`, `show_posts`, and `other_posts`
#     

# In[2]:


# Separate posts beginning with Ask HN and Show HN

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('ask_posts: ', len(ask_posts))
print('show_posts: ', len(show_posts))
print('other_posts: ', len(other_posts))   


# ## 4. Calculating the Average Number of Comments for Ask HN and Show HN Posts
# 
# In the previous section, we separated the `ask posts` and the `show posts` into two lists of lists named `ask_posts` and `show_posts`.
# 
# Next, let's determine if `ask posts` or `show posts` receive more comments on average.

# ### 4.1. Average Number of Comments for Ask HN
# 
# In the code cell below, we:
# 
# - Find the total number of comments in ask posts and assign it to `total_ask_comments`
#     - Set `total_ask_comments` to `0`
# - Use a for loop to iterate over the `ask posts`
#     - Because the `num_comments` column is the fifth column in `ask_posts`, you'll need to get the element at index `4` in each row
#         - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments
#         - Add this value to `total_ask_comments`
# - Compute the average number of comments on `ask posts` and assign it to `avg_ask_comments`
# - Print `avg_ask_comments`
# 

# In[3]:


# Average Ask HN

total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

# Compute the average once, after the loop has summed every comment count
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)
    

# ### 4.2. Average Number of Comments for Show HN
# 
# In the code cell below, we:
# 
# - Find the total number of comments in show posts and assign it to `total_show_comments`
#     - Set `total_show_comments` to `0`
# - Use a for loop to iterate over the `show posts`
#     - Because the `num_comments` column is the fifth column in `show_posts`, you'll need to get the element at index `4` in each row
#         - You'll also need to convert the value to an integer so that we can calculate the sum of all the comments
#         - Add this value to `total_show_comments`
# - Compute the average number of comments on `show posts` and assign it to `avg_show_comments`
# - Print `avg_show_comments`

# In[4]:


# Average Show HN

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

# Compute the average once, after the loop has summed every comment count
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)


# On average, ask posts receive approximately 10 comments, whereas show posts receive almost 5. Ask posts are therefore more likely to receive comments.
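# 
# Sections 4.1 and 4.2 repeat the same summing pattern, so it could be factored into a helper. The sketch below is one possible way to do that; `average_comments()` and `sample_posts` are hypothetical names introduced for illustration, and the function assumes the comment count sits at index `4`, as described above.

# In[ ]:


def average_comments(rows, comments_index=4):
    """Return the mean number of comments per row, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Made-up rows with seven columns; only the value at index 4 matters here
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '2', '4', 'user_b', '8/4/2016 15:10'],
]

print(average_comments(sample_posts))  # 7.0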

# ## 5. Finding the Number of Ask Posts and Comments by Hour Created
# 
# Since `ask posts` are more likely to receive comments, we'll focus our remaining analysis just on these posts.
# 
# Next, we'll determine if `ask posts` created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
# 
# - Calculate the number of ask posts created in each hour of the day, along with the number of comments received
# - Calculate the average number of comments ask posts receive by hour created
# 
# We'll tackle the first step first: calculating the number of ask posts and comments by hour created. We'll use the `datetime` module to work with the data in the `created_at` column.
# 
# Recall that we can use the `datetime.strptime()` constructor to parse dates stored as strings and return `datetime` objects.
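# 
# As a minimal sketch of that parsing step, the cell below parses a made-up timestamp. The `sample_created_at` value is hypothetical, and the format string assumes dates are laid out as `month/day/year hour:minute`.

# In[ ]:


import datetime as dt

# Hypothetical timestamp in the assumed `created_at` layout
sample_created_at = "8/16/2016 9:55"

# strptime() turns the string into a datetime object
sample_dt = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")
print(sample_dt)                 # 2016-08-16 09:55:00

# The hour is available as an integer attribute or as a formatted string
print(sample_dt.hour)            # 9
print(sample_dt.strftime("%H"))  # 09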

# ### 5.1. Importing the datetime Module
# 
# In the cell below, we:
# 
# - Import the `datetime` module as `dt`

# In[5]:


# import datetime module

import datetime as dt


# ### 5.2. Appending created_at and num_comments
# 
# In the cell below, we:
# 
# - Create an empty list and assign it to `result_list`. This will be a list of lists
# - Iterate over `ask_posts` and append to `result_list` a list with two elements:
#     - The first element shall be the column `created_at`
#         - Because the `created_at` column is the seventh column in `ask_posts`, you'll need to get the element at index `6` in each row
#     - The second element shall be the `number of comments` of the post
#         - You'll also need to convert the value to an integer
# 

# In[6]:


# Appending columns: created_at and num_comments

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    

# ### 5.3. Calculating the Number of Ask Posts and Comments by Hour
# 
# In the cell below, we:
# 
# - Create two empty dictionaries called `counts_by_hour` and `comments_by_hour`
# - Loop through each row of `result_list`
#     - Extract the hour from the date, which is the first element of the row
#         - Use the `datetime.strptime()` method to parse the date string into a datetime object, passing the string we want to parse as the first argument and a string that specifies the format as the second argument
#         - Use the `datetime.strftime()` method to select just the hour from the datetime object
#     - If the hour isn't a key in `counts_by_hour`:
#         - Create the key in `counts_by_hour` and set it equal to `1`
#         - Create the key in `comments_by_hour` and set it equal to the `comment` number
#     - If the hour is already a key in `counts_by_hour`:
#         - Increment the value in `counts_by_hour` by `1`
#         - Increment the value in `comments_by_hour` by the `comment` number
#  

# In[7]:


# Number of ask posts and comments by hour

counts_by_hour = {}
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date_string = row[0]
    time = dt.datetime.strptime(date_string, date_format)
    
    hour = time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
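

# The if/else bookkeeping above can also be written with `dict.get()`, which returns a default value when a key is missing. This is a sketch of an alternative, not a replacement for the cell above; the `sample_*` names and data are made up for illustration.

# In[ ]:


import datetime as dt

# Made-up [created_at, num_comments] pairs in the same shape as `result_list`
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

sample_counts = {}
sample_comments = {}

for sample_date, sample_num in sample_results:
    sample_hour = dt.datetime.strptime(sample_date, "%m/%d/%Y %H:%M").strftime("%H")
    # get() falls back to 0 the first time an hour is seen
    sample_counts[sample_hour] = sample_counts.get(sample_hour, 0) + 1
    sample_comments[sample_hour] = sample_comments.get(sample_hour, 0) + sample_num

print(sample_counts)    # {'09': 2, '13': 1}
print(sample_comments)  # {'09': 7, '13': 29}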
    

# ## 6. Calculating the Average Number of Comments for Ask HN Posts by Hour
# 
# We created two dictionaries:
# 
# - **counts_by_hour:** Contains the number of ask posts created during each hour of the day
# - **comments_by_hour:** Contains the total number of comments received by ask posts created during each hour
# 
# Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

# In the code below, we:
# - Initialize an empty list of lists and assign it to `avg_by_hour`
# - Iterate over the keys of `comments_by_hour`, naming each key `hour`
#     - Calculate the average number of comments for that hour and assign it to `avg_num`
#     - Append `hour` and `avg_num` to `avg_by_hour` as a two-element list

# In[8]:


# Average number of comments per ask post, by hour

avg_by_hour = []

for hour in comments_by_hour:
    avg_num = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_num])

avg_by_hour


# ## 7. Sorting and Printing Values from a List of Lists
# 
# ### 7.1. Part One
# 
# - Create a list that equals `avg_by_hour` with swapped columns
#     - Create an empty list and assign it to `swap_avg_by_hour`
#     - Iterate over the rows of `avg_by_hour` and append to `swap_avg_by_hour` a list whose first element is the second element of the row, and whose second element is the first element of the row
# - Print `swap_avg_by_hour`
# 
# ### 7.2. Part Two
# 
# - Use the `sorted()` function to sort `swap_avg_by_hour` in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments
#     - Set the `reverse` argument to `True`, so that the highest value in the first column appears first in the list
#     - Assign the result to `sorted_swap`
#     
# ### 7.3. Part Three
# 
# - Print the string "Top 5 Hours for Ask Posts Comments"
# - Loop through each average and each hour (in this order) in the first five lists of `sorted_swap`
# - Use the `str.format()` method to print the hour and average in the following format: `15:00: 38.59 average comments per post`
#     - To format the hours, use the `datetime.strptime()` constructor to return a datetime object and then use the `strftime()` method to specify the format of the time
#     - To format the average, you can use `{:.2f}` to indicate that just two decimal places should be used
#     
#     
# 

# In[24]:


# Sorting and Printing Values

# Part one

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

# Part two

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Part three

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    time_string = row[1]
    time_top = dt.datetime.strptime(time_string, "%H")
    hour_top = time_top.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_top, row[0])) 
    

# The hour that receives the most comments per post on average is `15:00`, with an average of `28.68` comments per post. The time zone used is Eastern Time in the US, so we could also write `15:00` as `3:00 p.m. EST`.
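# 
# The swap step in section 7 exists only so that `sorted()` compares the averages first. As a sketch of an alternative, the `key` argument of `sorted()` can select the sort column directly; `sample_avg_by_hour` below is made-up data in the same `[hour, average]` shape as `avg_by_hour`.

# In[ ]:


# Made-up [hour, average] pairs
sample_avg_by_hour = [["09", 5.5], ["15", 12.25], ["02", 8.0]]

# Sort by the second element (the average), highest first
sample_sorted = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for sample_hour, sample_avg in sample_sorted:
    print("{}:00: {:.2f} average comments per post".format(sample_hour, sample_avg))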

# ## 8. Conclusion
# 
# Based on our analysis, we recommend posting at `15:00` (`3:00 p.m. EST`) for a higher chance of receiving more comments on an `Ask HN` post.
# Furthermore, posts created from `12:00` to `13:00` receive `28.7` comments per post on average, which makes that window another good option.