You are given a corpus (collection) of 500 Wikipedia pages, collected from weekly lists of the most popular pages over the past nine months.
The data set was preprocessed as follows, which will give you an idea of how to interpret and handle the data.
Running the following cell loads the following arrays:

- `titles`: The list of titles of the documents. For example, `titles[i]` returns the title of the `i`-th document, where `i` ranges from 0 to 499.
- `dictionary`: The list of 4423 words used for this problem; each of these words appears in at least 20 documents.
- `word_histogram`: A 500x4423 matrix where each row is the (normalized) word histogram of a document. For example, `word_histogram[i]` returns the 4423-dimensional word histogram vector for document `i`, and `word_histogram[i,j]` describes how frequently the `j`-th word (the word in `dictionary[j]`) appears in document `i`. So `word_histogram[i,j]` is zero if the word `dictionary[j]` does not appear in document `i`, and high if the word `dictionary[j]` appears in document `i` many times.

import numpy as np
import json, urllib.request
url = "https://jonghank.github.io/ee370/files/documents_data.json"
data = urllib.request.urlopen(url).read().decode()
obj = json.loads(data)
word_histogram = np.array([ obj["articles"][i] for i in range(500) ])
dictionary = obj["dictionary"]
titles = obj["titles"]
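As a sanity check on this representation, the following minimal sketch shows how a row of `word_histogram` relates to raw word counts. The tiny corpus and 4-word dictionary below are made up for illustration; the real arrays are 500 x 4423.

```python
import numpy as np

# toy stand-ins for the real arrays (the real ones are 500 x 4423)
toy_dictionary = ["film", "star", "season", "album"]
toy_counts = np.array([[3.0, 2.0, 0.0, 1.0],    # raw word counts, document 0
                       [0.0, 1.0, 4.0, 0.0]])   # raw word counts, document 1

# normalize each row so it becomes a word histogram (each row sums to 1)
toy_histogram = toy_counts / toy_counts.sum(axis=1, keepdims=True)

# toy_histogram[i, j] is zero iff toy_dictionary[j] never appears in document i
print(toy_histogram[0, 2])                                # "season" is absent from document 0
print(toy_dictionary[int(np.argmax(toy_histogram[0]))])   # most frequent word in document 0
```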
(Problem 1a) This is a warm-up problem. What is the title of the last document? What are the 20 most frequently appearing words in the last document? Google the name in the resulting title (or "Princess Leia") to figure out who she is, and check if your results make sense.
# your code here
import numpy as np

i = word_histogram.shape[0] - 1    # index of the last document
print("Title: " + titles[i])

# sort word indices by frequency, in descending order
word_index_sorted = np.argsort(word_histogram[i])[::-1]
print("20 most frequently appearing words:")
for k in range(20):
    print("%16s" % dictionary[word_index_sorted[k]], end="")
    if k % 5 == 4:
        print("")
Title: Carrie_Fisher
20 most frequently appearing words:
    film  star  wars  appear  play
    episode  actress  role  reynolds  princess
    comedy  interview  drink  award  script
    relationship  book  original  drug  novel
(Problem 1b) Fit the $k$-means model with $K=9$ (nine clusters).
You may make use of the `KMeans` class from the `scikit-learn` Python machine learning package (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans).
Summarize your clustering results by clearly displaying the following, and present short discussions on your results.
For each of the $K$ clusters, report:
- the number of documents in the cluster,
- the 10 most frequently appearing words in the cluster representative,
- the titles of the 10 documents closest to the cluster representative.

Note that the $k$-means algorithm does not always give you the globally optimal solution; in other words, it may return similar but slightly different solutions on each run. Please re-run the clustering until every cluster has at least 10 documents. You may need more than 10 runs to get such a result.
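The restart loop described above can be sketched as follows. This is a minimal, self-contained example on synthetic data: the random points, the 50-restart cap, and the seeds are assumptions for illustration; only $K=9$ and the 10-document threshold come from the problem.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))   # synthetic stand-in for word_histogram

K, min_size = 9, 10
for run in range(50):               # cap the number of restarts
    model = KMeans(n_clusters=K, n_init=1, random_state=run).fit(X)
    sizes = np.bincount(model.labels_, minlength=K)
    if sizes.min() >= min_size:     # accept only runs with no tiny cluster
        break

print(f"accepted run {run}, cluster sizes: {sizes}")
```

Varying `random_state` across restarts changes the centroid initialization, which is what makes repeated runs land in different local optima.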
# your code here
import sklearn.cluster as skc

K = 9
model = skc.KMeans(n_clusters=K, max_iter=100).fit(word_histogram)
theta = model.cluster_centers_   # (K, 4423) cluster representatives
c = model.labels_                # cluster assignment for each document

for k in range(K):
    print(f"CLUSTER {k}: {np.sum(c == k)} articles")
CLUSTER 0: 51 articles
CLUSTER 1: 21 articles
CLUSTER 2: 45 articles
CLUSTER 3: 43 articles
CLUSTER 4: 12 articles
CLUSTER 5: 12 articles
CLUSTER 6: 194 articles
CLUSTER 7: 46 articles
CLUSTER 8: 76 articles
# your code here
for k in range(K):
    # the number of articles in each cluster
    print(f"CLUSTER {k}: {np.sum(c == k)} articles")

    # 10 most frequently appearing words in each cluster representative
    frequent_word_index = np.argsort(theta[k])[::-1]
    print("  most frequently appearing words: ")
    for i in range(10):
        print("%16s" % dictionary[frequent_word_index[i]], end="")
        if i % 5 == 4:
            print("")

    # titles of the 10 articles closest to the representative
    dist = np.linalg.norm(word_histogram - theta[k], axis=1)
    order = np.argsort(dist)    # avoid shadowing the builtin `sorted`
    print("  ten articles closest to the representative: ")
    for j in range(10):
        print("%40s" % titles[order[j]], end="")
        if j % 2 == 1:
            print("")
CLUSTER 0: 51 articles
  most frequently appearing words:
    game  season  team  win  player  play  league  final  score  match
  ten articles closest to the representative:
    Kobe_Bryant  Lamar_Odom  Johan_Cruyff  Yogi_Berra  Jose_Mourinho
    Halo_5:_Guardians  Tom_Brady  Ben_Affleck  Eli_Manning  Hillsborough_disaster
CLUSTER 1: 21 articles
  most frequently appearing words:
    fight  win  event  champion  fighter  bout  title  ali  championship  bonus
  ten articles closest to the representative:
    Floyd_Mayweather,_Jr.  Kimbo_Slice  Ronda_Rousey  Jose_Aldo  Joe_Frazier
    Wladimir_Klitschko  Saul_Alvarez  Gennady_Golovkin  Nate_Diaz  Conor_McGregor
CLUSTER 2: 45 articles
  most frequently appearing words:
    series  season  episode  film  television  character  role  play  cast  star
  ten articles closest to the representative:
    The_X-Files  Game_of_Thrones  House_of_Cards_(U.S._TV_series)  Charlie_Sheen
    Supergirl_(U.S._TV_series)  Ben_Affleck  American_Horror_Story
    Daredevil_(TV_series)  Rod_Serling  Sherlock_(TV_series)
CLUSTER 3: 43 articles
  most frequently appearing words:
    holiday  celebrate  festival  calendar  celebration  date  united  moon  church  eclipse
  ten articles closest to the representative:
    Halloween  Mahatma_Gandhi  Frederick_Douglass  Sigmund_Freud  Christopher_Columbus
    Ben_Affleck  Carly_Fiorina  Harriet_Tubman  Marco_Rubio  Jim_Webb
CLUSTER 4: 12 articles
  most frequently appearing words:
    match  championship  event  win  style  raw  team  title  perform  episode
  ten articles closest to the representative:
    Wrestlemania_32  Royal_Rumble_(2016)  Payback_(2016)  Night_of_Champions_(2015)
    Survivor_Series_(2015)  Extreme_Rules_(2016)  Fastlane_(2016)  Hell_in_a_Cell_(2015)
    TLC:_Tables,_Ladders_&_Chairs_(2015)  Andre_the_Giant
CLUSTER 5: 12 articles
  most frequently appearing words:
    black  character  marvel  appear  strange  series  hole  cage  power  kill
  ten articles closest to the representative:
    Deadpool  Elektra_(comics)  Punisher  Black_Panther_(comics)  Frederick_Douglass
    Ben_Affleck  Sigmund_Freud  Daredevil_(TV_series)  Harriet_Tubman  Pablo_Escobar
CLUSTER 6: 194 articles
  most frequently appearing words:
    united  family  film  report  president  school  american  university  government  city
  ten articles closest to the representative:
    Ben_Affleck  Mahatma_Gandhi  Sigmund_Freud  Carly_Fiorina  Frederick_Douglass
    Christopher_Columbus  Marco_Rubio  Genie_(feral_child)  Fidel_Castro  Pablo_Escobar
CLUSTER 7: 46 articles
  most frequently appearing words:
    album  release  song  music  single  record  band  tour  perform  chart
  ten articles closest to the representative:
    David_Bowie  Kanye_West  Celine_Dion  Ariana_Grande  Kesha
    Adele  Gwen_Stefani  Anti_(album)  Dolly_Parton  Sia_Furler
CLUSTER 8: 76 articles
  most frequently appearing words:
    film  star  million  role  release  play  character  movie  actor  award
  ten articles closest to the representative:
    Leonardo_DiCaprio  Kate_Beckinsale  Star_Wars:_The_Force_Awakens
    Star_Wars_Episode_I:_The_Phantom_Menace  Maureen_O'Hara  Keanu_Reeves
    Johnny_Depp  The_Hateful_Eight  The_Revenant_(2015_film)  Back_to_the_Future_II
(Problem 1c) You started a web service called the 'Kiwipedia', which goes like this.
Explain in detail how you will use the results obtained in the previous subproblems to implement your idea.
# your answer here
#
# you can do either of the following.
#
# 1) you can find the cluster to which the current article belongs, by
#    measuring the distance to the cluster centers, and then list the
#    articles closest to the chosen cluster center. this is desirable
#    since they are the "important" pages that capture the central
#    issues that the current page deals with.
# 2) you can simply list the articles closest to the current article.
#    this is good since they are simply close to the current page;
#    however, you cannot tell whether they are "important" or not.
#    this implies your suggestion will probably be "garbage" if the
#    current page is garbage.
#
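Both strategies reduce to a few lines of NumPy. The sketch below runs them side by side on random stand-in data (`theta`, `word_histogram`, and the query vector are synthetic here; in the actual problem they come from the previous subproblems):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.random((9, 20))               # stand-in cluster centers
word_histogram = rng.random((500, 20))    # stand-in article histograms
query = rng.random(20)                    # stand-in current page

# strategy 1: nearest cluster center, then the articles closest to that center
k = int(np.argmin(np.linalg.norm(theta - query, axis=1)))
dist_to_center = np.linalg.norm(word_histogram - theta[k], axis=1)
suggest_by_cluster = np.argsort(dist_to_center)[:5]

# strategy 2: the articles closest to the current page itself
dist_to_query = np.linalg.norm(word_histogram - query, axis=1)
suggest_by_distance = np.argsort(dist_to_query)[:5]

print(suggest_by_cluster, suggest_by_distance)
```

Strategy 1 anchors the suggestions to the cluster representative, so they stay "central"; strategy 2 tracks the query exactly, for better or worse.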
(Problem 1d) Now implement your idea. Suppose a user is currently at a webpage, which can be characterized by a 4423-dimensional word histogram vector, word_histogram_current_page
, given below. The word histogram uses the same 4423-word dictionary you used in the previous subproblems.
List the titles of the five pages that you will have your Kiwi-bot display in your pop-up window.
Though your instructor computed the word histogram vector for you today, everyone in this class should now be able to build such a vector on their own from raw webpages.
word_histogram_current_page = np.array(obj["article_crnt"])
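A minimal sketch of how such a histogram vector could be built from raw text, assuming access to the same `dictionary`. The regex tokenization here is a simplistic stand-in for whatever preprocessing the instructor actually used (note it does no stemming, so "stars" does not count as "star"):

```python
import re
import numpy as np

def text_to_histogram(text, dictionary):
    """Count dictionary words in `text` and normalize to a histogram."""
    index = {w: j for j, w in enumerate(dictionary)}
    counts = np.zeros(len(dictionary))
    for token in re.findall(r"[a-z]+", text.lower()):
        j = index.get(token)
        if j is not None:          # ignore words outside the dictionary
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

# toy usage with a made-up 4-word dictionary
toy_dictionary = ["film", "star", "wars", "album"]
h = text_to_histogram("Star Wars is a film, and the film stars many actors.",
                      toy_dictionary)
print(h)    # film counted twice; "stars" is not matched (no stemming)
```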
# your code here
# distances from the current page to the cluster centers and to all articles
dist_to_centers = np.linalg.norm(word_histogram_current_page - theta, axis=1)
dist_to_articles = np.linalg.norm(word_histogram_current_page - word_histogram, axis=1)

# titles of the 5 articles closest to the representative of the nearest cluster
k = np.argmin(dist_to_centers)
dist = np.linalg.norm(word_histogram - theta[k], axis=1)
order = np.argsort(dist)
print(" five articles closest to the representative: ")
for j in range(5):
    print("  ", titles[order[j]])

# titles of the 5 articles closest to the current page
index_closest_articles = np.argsort(dist_to_articles)
print(" five articles closest to the current page: ")
for j in range(5):
    print("  ", titles[index_closest_articles[j]])
import matplotlib.pyplot as plt
plt.figure(figsize=(10,4), dpi=100)
plt.plot(dist_to_centers, "o-")
plt.grid()
plt.xlabel("cluster number")
plt.ylabel("distance")
plt.title(f"distance from {K} cluster centers")
plt.show()
plt.figure(figsize=(10,4), dpi=100)
plt.plot(dist_to_articles, "o-")
plt.grid()
plt.xlabel("article number")
plt.ylabel("distance")
plt.title("distance from 500 articles")
plt.show()
five articles closest to the representative:
   David_Bowie
   Kanye_West
   Celine_Dion
   Ariana_Grande
   Kesha
five articles closest to the current page:
   David_Bowie_discography
   Blackstar_(David_Bowie_album)
   Sia_Furler
   David_Bowie
   Purpose_(Justin_Bieber_album)