In this project, we're going to work with a data set of submissions posted on Hacker News, a popular technology website. The site is hosted by startup incubator Y Combinator, and users submit posts that other readers can vote and comment on, much like Reddit.
Hacker News is extremely popular in technology and startup circles, and posts that attract many votes and comments can draw hundreds of thousands of visits as a result.
The data set is available online and can be found and downloaded here. It is a reduced version of the original dataset, which contains almost 300k rows; ours contains approximately 20k rows, since posts that did not receive any comments were removed.
We're specifically interested in 'Ask Posts', which are posts where the user asks the community a question and looks for feedback; their titles begin with 'Ask HN:'.
On the other hand, 'Show Posts' are posts where the user showcases a project or something interesting to the community, which then provides feedback; their titles begin with 'Show HN:'.
At the end of this project, we'll come up with conclusions such as which category receives more comments on average, which receives more points, and whether the time a post is submitted affects the feedback it gets.
from csv import reader # import reader module
opened_file = open('hacker_news.csv') # open the csv file 'hacker_news.csv'
read_file = reader(opened_file) # use reader function to load the opened_file
hn = list(read_file) # assign the read_file to a variable 'hn' in the form of a list
hn[:5] # display the first 5 rows of the dataset
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
In order to work with the data itself, we need to separate the header row from the rest of the dataset, as below:
headers = hn[0] # Headers at index 0 set to 'headers'
hn = hn[1:] # First row removed
print(headers) # Output 'headers'
print(hn[:5]) # Output first five rows
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Now that the header row is removed, we can start filtering our data, a process known as data cleaning.
In order to make proper use of our data, we will separate the Ask Posts from the Show Posts. Posts that fall under neither category will be listed as Other Posts.
In the next block of code, three lists are created: ask_posts, show_posts, and other_posts. To sort the posts into these lists, we use the startswith() string method to check whether each title, lowercased first, begins with 'ask hn' or 'show hn'.
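Note that startswith() is case-sensitive, which is why each title is lowercased before the check. A quick illustration, using a made-up title:

```python
# startswith() is case-sensitive, so 'Ask HN' does not match 'ask hn' directly
title = 'Ask HN: How do you learn a new language?'  # hypothetical title

print(title.startswith('ask hn'))          # False: case mismatch
print(title.lower().startswith('ask hn'))  # True: matches after lowercasing
```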
The number of posts under each category is listed below.
# Creating 3 empty lists
ask_posts = []
show_posts = []
other_posts = []
# Iterating through the dataset (headers removed); the title is at index 1:
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Output for each category
print('Nr of ask posts is', len(ask_posts))
print('Nr of show posts is', len(show_posts))
print('Nr of other posts is', len(other_posts))
Nr of ask posts is 1744 Nr of show posts is 1162 Nr of other posts is 17194
# Displaying the first 5 rows of each list
# (only the last expression in a cell is echoed, here show_posts):
ask_posts[:5]
show_posts[:5]
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
Next, let's determine which category receives more comments. To do this, we're going to compute the average number of comments for each category: a for loop per category to total the comments, then the averages to compare.
# Finding total nr of comments in ask posts:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print('average nr of ask posts comments is', avg_ask_comments)
# Finding total nr of comments in show posts:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print('average nr of show posts comments is', avg_show_comments)
average nr of ask posts comments is 14.038417431192661 average nr of show posts comments is 10.31669535283993
The results above show that, on average, Ask Posts receive more comments (about 14 per post) than Show Posts (about 10 per post).
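As an aside, the two loops above share the same shape, so they could be factored into a small helper. A minimal sketch; avg_comments and the sample rows below are our own, not part of the project code:

```python
def avg_comments(posts, comments_index=4):
    """Average number of comments for a list of post rows (sketch)."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows with the same column layout as the dataset
sample = [['1', 'title a', 'url', '5', '10', 'user1', '1/1/2016 0:00'],
          ['2', 'title b', 'url', '3', '20', 'user2', '1/1/2016 1:00']]
print(avg_comments(sample))  # (10 + 20) / 2 = 15.0
```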
The next interesting question is whether the time at which a post is submitted has any effect on the amount of feedback it receives, be it comments or upvotes.
In this part, we use the datetime module to parse the created_at column in our dataset, which has index 6. Furthermore, two dictionaries are created, one for the comment totals and one for the post counts, keyed by the hour in 24-hour format.
# Importing datetime module as dt
import datetime as dt
result_list = []
for posts in ask_posts:
    created = posts[6]
    comments = int(posts[4])
    result_list.append([created, comments])
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'
for row in result_list:
    date = row[0]
    comment = row[1]
    date_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if date_obj not in counts_by_hour:
        counts_by_hour[date_obj] = 1
        comments_by_hour[date_obj] = comment
    else:
        counts_by_hour[date_obj] += 1
        comments_by_hour[date_obj] += comment
print('the nr of comments on ask posts by the hour are:')
comments_by_hour
the nr of comments on ask posts by the hour are:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
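As an aside, the two-dictionary tally above can also be written with dict.get(), which returns a default value when a key is missing and removes the need for the if/else branch. A sketch over made-up (hour, comments) pairs:

```python
sketch_counts = {}
sketch_comments = {}
pairs = [('09', 5), ('15', 40), ('09', 7)]  # hypothetical (hour, comments) pairs
for hour, comments in pairs:
    # get(key, 0) returns 0 the first time an hour is seen
    sketch_counts[hour] = sketch_counts.get(hour, 0) + 1
    sketch_comments[hour] = sketch_comments.get(hour, 0) + comments
print(sketch_counts)    # {'09': 2, '15': 1}
print(sketch_comments)  # {'09': 12, '15': 40}
```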
Result: from the output above, one can immediately see that Ask Posts submitted in the afternoon, specifically between 13:00 and 19:00, generate the most comments, peaking at 15:00.
On the contrary, posts submitted during the night (22:00 - 07:00) generate the fewest comments. However, there is an anomaly at 02:00, which generated a lot of interest.
This result can also be shown by calculating the average number of comments per post for each hour:
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
swap_avg_by_hour = []
for rows in avg_by_hour:
    a = rows[1]
    b = rows[0]
    swap_avg_by_hour.append([a, b])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 hours for Ask Posts Comments')
for avg, hr in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(
        dt.datetime.strptime(hr, '%H').strftime('%H:%M'), avg))
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']] Top 5 hours for Ask Posts Comments 15:00: 38.59 average comments per post 02:00: 23.81 average comments per post 20:00: 21.52 average comments per post 16:00: 16.80 average comments per post 21:00: 16.01 average comments per post
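As an aside, swapping the columns is one way to sort by average; the same top list can be produced without the swap step by passing a key function to sorted(). A sketch over a few of the averages above (values rounded, sketch_avgs is our own name):

```python
sketch_avgs = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]
# Sort by the second element of each pair (the average), highest first
top = sorted(sketch_avgs, key=lambda pair: pair[1], reverse=True)
for hour, avg in top[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```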
It would also be interesting to determine which type of post gets upvoted the most. In this next section, the average number of points (the num_points column) for Ask Posts and Show Posts is computed separately, with the output listed underneath each section of code.
# Finding average nr of points for Ask Posts (num_points is at index 3)
total_ask_counts = 0
for row in ask_posts:
    counts = int(row[3])
    total_ask_counts += counts
avg_ask_counts = total_ask_counts / len(ask_posts)
print('The avg number of counts for Ask Posts is ', avg_ask_counts)
The avg number of counts for Ask Posts is 15.061926605504587
# Finding average nr of points for Show Posts (num_points is at index 3):
total_show_counts = 0
for row in show_posts:
    show_counts = int(row[3])
    total_show_counts += show_counts
avg_show_counts = total_show_counts / len(show_posts)
print('The avg number of counts for Show Posts is ', avg_show_counts)
The avg number of counts for Show Posts is 27.555077452667813
From the results above, it is clear that Show Posts receive more points than Ask Posts (about 28 vs. 15). This is understandable, since people tend to show appreciation for others' work by upvoting it more readily than they upvote a simple question (an Ask Post).
On the other hand, it is also understandable that Ask Posts get more comments than Show Posts, since their users are explicitly looking for feedback in the form of an answer.
A post submitted at a certain time might get more attention or upvotes than one submitted at a different time. For example, we already saw that posts submitted at 15:00 receive more comments than posts submitted at any other time of day.
In this section, we will see whether time has any effect on the upvotes a certain post may get.
We will do this by repeating the hourly analysis, this time tallying points instead of comments.
import datetime as dt
ask_list_counts_vs_time = []
# Checking for upvoting vs. time for Ask Posts (num_points is at index 3)
for posts in ask_posts:
    created = posts[6]
    points = int(posts[3])
    ask_list_counts_vs_time.append([created, points])
ask_counts_by_hour = {}
ask_points_by_hour = {}  # total points per hour
date_format = '%m/%d/%Y %H:%M'
for row in ask_list_counts_vs_time:
    date = row[0]
    points = row[1]
    created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if created_obj not in ask_counts_by_hour:
        ask_counts_by_hour[created_obj] = 1
        ask_points_by_hour[created_obj] = points
    else:
        ask_counts_by_hour[created_obj] += 1
        ask_points_by_hour[created_obj] += points
print('the nr of counts on ask posts by the hour are:')
ask_counts_by_hour
the nr of counts on ask posts by the hour are:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
import datetime as dt
show_list_counts_vs_time = []
# Checking for upvoting vs. time for Show Posts (num_points is at index 3)
for post in show_posts:
    created = post[6]
    points = int(post[3])
    show_list_counts_vs_time.append([created, points])
show_counts_by_hour = {}
show_points_by_hour = {}  # total points per hour
date_format = '%m/%d/%Y %H:%M'
for row in show_list_counts_vs_time:
    date = row[0]
    points = row[1]
    show_created_obj = dt.datetime.strptime(date, date_format).strftime('%H')
    if show_created_obj not in show_counts_by_hour:
        show_counts_by_hour[show_created_obj] = 1
        show_points_by_hour[show_created_obj] = points
    else:
        show_counts_by_hour[show_created_obj] += 1
        show_points_by_hour[show_created_obj] += points
print('the nr of counts on show posts by the hour are:')
show_counts_by_hour
the nr of counts on show posts by the hour are:
{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}
From the results above, we can conclude that both categories are created most heavily in the afternoon and evening. In particular, Ask Posts cluster between 13:00 and 21:00, a busier window than Show Posts see at any time of day.
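Note that the dictionaries above hold per-hour totals rather than averages. To mirror the earlier comment analysis, one further step would divide each hour's total points by the number of posts created that hour. A minimal sketch, over hypothetical dictionaries with the same shape as those built above:

```python
# Hypothetical per-hour post counts and point totals (same shape as above)
sample_counts = {'15': 116, '13': 85, '02': 58}
sample_points = {'15': 3479, '13': 2062, '02': 793}

# Average points per post for each hour, highest first
avg_points = sorted(
    ([hr, sample_points[hr] / sample_counts[hr]] for hr in sample_counts),
    key=lambda pair: pair[1], reverse=True)
for hr, avg in avg_points:
    print('{}:00: {:.2f} average points per post'.format(hr, avg))
```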