#!/usr/bin/env python
# coding: utf-8
#
#
# # Document recommendation system
# $$
# \newcommand{\eg}{{\it e.g.}}
# \newcommand{\ie}{{\it i.e.}}
# \newcommand{\argmin}{\operatornamewithlimits{argmin}}
# \newcommand{\mc}{\mathcal}
# \newcommand{\mb}{\mathbb}
# \newcommand{\mf}{\mathbf}
# \newcommand{\minimize}{{\text{minimize}}}
# \newcommand{\diag}{{\text{diag}}}
# \newcommand{\cond}{{\text{cond}}}
# \newcommand{\rank}{{\text{rank }}}
# \newcommand{\range}{{\mathcal{R}}}
# \newcommand{\null}{{\mathcal{N}}}
# \newcommand{\tr}{{\text{trace}}}
# \newcommand{\dom}{{\text{dom}}}
# \newcommand{\dist}{{\text{dist}}}
# \newcommand{\R}{\mathbf{R}}
# \newcommand{\Z}{\mathbf{Z}}
# \newcommand{\SM}{\mathbf{S}}
# \newcommand{\ball}{\mathcal{B}}
# \newcommand{\bmat}[1]{\begin{bmatrix}#1\end{bmatrix}}
# $$
# __
ASE2030: Linear algebra and statistice, Inha University.
__
# _ Jong-Han Kim (jonghank@inha.ac.kr)
_
# ---
#
# You are given a corpus (collection) of 500 Wikipedia pages, collected from weekly lists of the most popular pages for the past nine months.
#
# The data set was preprocessed as follows, which will give you a idea of how you can interpret and handle the data.
#
# - First, the section titles and reference sections (bibliography, notes, references, further reading) from each document were removed.
# - Then each document was converted to a list of words. The conversion removes numbers and stop words (short or common words such as 'the'/'is'/'that'/'what'), and applies a stemming (removing endings from words so you can count things like 'work'/'works'/'worked' together) algorithm to nouns and verbs. So each document is now a list of nouns and verbs.
# - We then formed a _dictionary_ of all the words that appear in at least 20 documents. The dictionary contains 4423 words (nouns and verbs).
# - Using this dictionary, we can represent each document in the corpus by a (normalized) _word histogram vector_ of length 4423. The word histogram vector can be interpreted as a feature vector for a document for this problem.
#
#
# Running the following cell loads the following arrays:
#
# - `titles` : The list of titles of the documents. For example `title[i]` returns the title of the `i`-th document, where `i` ranges from 0 to 499.
# - `dictionary` : The list of 4423 words used for this problem, so each of these words appears in at least 20 documents.
# - `word_histogram` : This is a 500x4423 matrix where each row represents the (normalized) word histogram of a document. For example, `word_histogram[i]` returns a 4423 dimensional word histogram vector for document `i`, and `word_histogram[i,j]` describes how frequently the `j`-th word (the word in `dictionary[j]`) appears in document `i`. So `word_histogram[i,j]` is zero if the word `dictionary[j]` does not appear in document `i`, and `word_histogram[i,j]` is high if the word `dictionary[j]` appears in document `i` many times.
#
# In[ ]:
import numpy as np
import json, urllib.request
url = "https://jonghank.github.io/ee370/files/documents_data.json"
data = urllib.request.urlopen(url).read().decode()
obj = json.loads(data)
word_histogram = np.array([ obj["articles"][i] for i in range(500) ])
dictionary = obj["dictionary"]
titles = obj["titles"]
#
#
# **_(Problem 1a)_** This is a warm-up problem. What is the title of the last document? What are the 20 most frequently appearing words in the last document? Google the name in the resulting title (or "Princess Leia") to figure out who she is, and check if your results make sense.
# In[ ]:
# your code here
import numpy as np
i = word_histogram.shape[0]-1
print("Title: "+titles[i])
word_index_sorted = np.argsort(word_histogram[i])[::-1]
print("20 most frequently appearing words:")
for k in range(20):
print("%16s" % dictionary[word_index_sorted[k]], end="")
if k%5==4:
print("")
#
#
# **_(Problem 1b)_** Fit the $k$-means model with $K=9$ (nine clusters).
# You may make use of the `KMeans` function from the `scikit-learn` Python machine learning package (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans).
#
# Summarize your clustering results by clearly displaying the followings, and present short discussions on your results.
#
# For each of the $K$ clusters,
# - Show 10 most frequently appearing words in each cluster representative.
# - Show the number of articles in each cluster.
# - Show the titles of the ten articles closest to the cluster representative.
# - Briefly explain your results, characterizing each of the $K$ clusters.
#
# _Note that the $k$-means algorithm does not always give you the globally optimal solution, in other words it may return similar but slightly different solutions for each run. Please re-run the clustering until every cluster has at least 10 documents. You may need more than 10 runs for such results._
# In[ ]:
# your code here
import sklearn.cluster as skc
K = 9
model = skc.KMeans(n_clusters=K, max_iter=100).fit(word_histogram)
theta = model.cluster_centers_
c = model.labels_
for k in range(K):
print(f"CLUSTER {k}: {sum(c==k)} articles")
# In[ ]:
# your code here
for k in range(K):
# the number of articles in each cluster
print(f"CLUSTER {k}: {sum(c==k)} articles")
# 10 most frequently appearing words in each cluster representative
frequent_word_index = np.argsort(theta[k])[::-1]
print(" most frequently appearing words: ")
for i in range(10):
print("%16s" % dictionary[frequent_word_index[i]], end="")
if i%5==4:
print("")
# titles of 10 articles closest to the representative
dist = np.linalg.norm(word_histogram-theta[k], axis=1)
sorted = np.argsort(dist)
print(" ten articles closest to the representative: ")
for j in range(10):
print("%40s" % titles[sorted[j]], end="")
if j%2==1:
print("")
#
#
# **_(Problem 1c)_** You started a web service called the 'Kiwipedia', which goes like this.
#
# - When a user is reading a web article, your service opens an annoying pop-up window with an ugly Kiwi-bot character in it suggesting five Wikipedia pages that the current user will most probably be interested in.
# - This is important since you will get paid by the number of incoming clicks via your service.
# - Since there is no other direct ways to guess the users' preferences, you just assume that s/he will prefer popular Wikipedia pages with topics similar to the current webpage.
#
# Explain in details how you will use the results obtained in the previous subproblems in order to implement your idea.
# In[ ]:
# your answer here
#
# you can do either of the followings.
#
# 1) you can find the cluster to which the current article belongs to, by
# measuring the distance to the cluster centers, and then list the
# articles with closest distance to the chosen cluster center. this is
# desirable since they are the "important" pages that capture the
# central issues that the current page is dealing with.
# 2) you can simply list the articles with closest distance to the current
# article. this is good since they are simply close to the current
# page, however you cannot tell whether they are "important" or not.
# this implies your suggention will probably be "garbage" if the
# current page is garbage.
#
#
#
# **_(Problem 1d)_** Now implement your idea. Suppose a user is currently at a webpage, which can be characterized by a 4423 dimensional word histogram vector, `word_histogram_current_page`, given below. The word histogram uses the same 4423-word dictionary you used in the previous subproblems.
#
# List the titles of the five pages that you will have your Kiwi-bot to display in your pop-up window.
#
# _Though your instructor computed and gave the word histogram vector for you today, everyone in this class should now be able to build this vector on your own from raw webpages._
# In[ ]:
word_histogram_current_page = np.array(obj["article_crnt"])
# your code here
dist_to_centers = np.linalg.norm(word_histogram_current_page-theta, axis=1)
dist_to_articles = np.linalg.norm(word_histogram_current_page-word_histogram, axis=1)
index_closest_centers = np.argsort(dist_to_centers)[0]
k = index_closest_centers
# titles of 5 articles closest to the representative
dist = np.linalg.norm(word_histogram-theta[k], axis=1)
sorted = np.argsort(dist)
print(" five articles closest to the representative: ")
for j in range(5):
print(" ", titles[sorted[j]])
index_closest_articles = np.argsort(dist_to_articles)
k = index_closest_articles
# titles of 5 articles closest to the current page
print(" five articles closest to the current page: ")
for j in range(5):
print(" ", titles[index_closest_articles[j]])
import matplotlib.pyplot as plt
plt.figure(figsize=(10,4), dpi=100)
plt.plot(dist_to_centers, "o-")
plt.grid()
plt.xlabel("cluster number")
plt.ylabel("distance")
plt.title(f"distance from {K} cluster centers")
plt.show()
plt.figure(figsize=(10,4), dpi=100)
plt.plot(dist_to_articles, "o-")
plt.grid()
plt.xlabel("article number")
plt.ylabel("distance")
plt.title(f"distance from {500} articles")
plt.show()