You are given a corpus (collection) of 500 Wikipedia pages, collected from weekly lists of the most popular pages over the past nine months.
The data set was preprocessed as follows, which will give you an idea of how to interpret and handle the data.
Running the following cell loads the following arrays:

- `titles`: The list of titles of the documents. For example, `titles[i]` returns the title of the `i`-th document, where `i` ranges from 0 to 499.
- `dictionary`: The list of 4423 words used for this problem; each of these words appears in at least 20 documents.
- `word_histogram`: A 500x4423 matrix where each row is the (normalized) word histogram of a document. For example, `word_histogram[i]` returns the 4423-dimensional word histogram vector for document `i`, and `word_histogram[i,j]` describes how frequently the `j`-th word (the word in `dictionary[j]`) appears in document `i`. So `word_histogram[i,j]` is zero if the word `dictionary[j]` does not appear in document `i`, and high if the word `dictionary[j]` appears in document `i` many times.

import numpy as np
import json, urllib.request
url = "https://jonghank.github.io/ee370/files/documents_data.json"
data = urllib.request.urlopen(url).read().decode()
obj = json.loads(data)
word_histogram = np.array([ obj["articles"][i] for i in range(500) ])
dictionary = obj["dictionary"]
titles = obj["titles"]
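As a sanity check on this representation, the following minimal sketch shows how a row of `word_histogram` relates to raw word counts. The tiny corpus and 4-word dictionary below are made up for illustration; the real arrays are 500 x 4423.

```python
import numpy as np

# toy stand-ins for the real arrays (the real ones are 500 x 4423)
toy_dictionary = ["film", "star", "season", "album"]
toy_counts = np.array([[3.0, 2.0, 0.0, 1.0],    # raw word counts, document 0
                       [0.0, 1.0, 4.0, 0.0]])   # raw word counts, document 1

# normalize each row so it becomes a word histogram (each row sums to 1)
toy_histogram = toy_counts / toy_counts.sum(axis=1, keepdims=True)

# toy_histogram[i, j] is zero iff toy_dictionary[j] never appears in document i
print(toy_histogram[0, 2])                                # "season" is absent from document 0
print(toy_dictionary[int(np.argmax(toy_histogram[0]))])   # most frequent word in document 0
```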
(Problem 1a) This is a warm-up problem. What is the title of the last document? What are the 20 most frequently appearing words in the last document? Google the name in the resulting title (or "Princess Leia") to figure out who she is, and check if your results make sense.
# your code here
import numpy as np

i = word_histogram.shape[0] - 1    # index of the last document
print("Title: " + titles[i])

# sort word indices by frequency, in descending order
word_index_sorted = np.argsort(word_histogram[i])[::-1]
print("20 most frequently appearing words:")
for k in range(20):
    print("%16s" % dictionary[word_index_sorted[k]], end="")
    if k % 5 == 4:
        print("")
Title: Carrie_Fisher
20 most frequently appearing words:
    film  star  wars  appear  play
    episode  actress  role  reynolds  princess
    comedy  interview  drink  award  script
    relationship  book  original  drug  novel
(Problem 1b) Fit the $k$-means model with $K=9$ (nine clusters).
You may make use of the `KMeans` class from the `scikit-learn` Python machine learning package (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans).
Summarize your clustering results by clearly displaying the following, and present short discussions on your results.
For each of the $K$ clusters, report:
- the number of documents in the cluster,
- the 10 most frequently appearing words in the cluster representative,
- the titles of the 10 documents closest to the cluster representative.

Note that the $k$-means algorithm does not always give you the globally optimal solution; in other words, it may return similar but slightly different solutions on each run. Please re-run the clustering until every cluster has at least 10 documents. You may need more than 10 runs to get such a result.
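The restart loop described above can be sketched as follows. This is a minimal, self-contained example on synthetic data: the random points, the 50-restart cap, and the seeds are assumptions for illustration; only $K=9$ and the 10-document threshold come from the problem.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))   # synthetic stand-in for word_histogram

K, min_size = 9, 10
for run in range(50):               # cap the number of restarts
    model = KMeans(n_clusters=K, n_init=1, random_state=run).fit(X)
    sizes = np.bincount(model.labels_, minlength=K)
    if sizes.min() >= min_size:     # accept only runs with no tiny cluster
        break

print(f"accepted run {run}, cluster sizes: {sizes}")
```

Varying `random_state` across restarts changes the centroid initialization, which is what makes repeated runs land in different local optima.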
# your code here
import sklearn.cluster as skc

K = 9
model = skc.KMeans(n_clusters=K, max_iter=100).fit(word_histogram)
theta = model.cluster_centers_   # (K, 4423) cluster representatives
c = model.labels_                # cluster assignment for each document

for k in range(K):
    print(f"CLUSTER {k}: {np.sum(c == k)} articles")
CLUSTER 0: 51 articles
CLUSTER 1: 21 articles
CLUSTER 2: 45 articles
CLUSTER 3: 43 articles
CLUSTER 4: 12 articles
CLUSTER 5: 12 articles
CLUSTER 6: 194 articles
CLUSTER 7: 46 articles
CLUSTER 8: 76 articles
# your code here
for k in range(K):
    # the number of articles in each cluster
    print(f"CLUSTER {k}: {np.sum(c == k)} articles")

    # 10 most frequently appearing words in each cluster representative
    frequent_word_index = np.argsort(theta[k])[::-1]
    print("  most frequently appearing words: ")
    for i in range(10):
        print("%16s" % dictionary[frequent_word_index[i]], end="")
        if i % 5 == 4:
            print("")

    # titles of the 10 articles closest to the representative
    dist = np.linalg.norm(word_histogram - theta[k], axis=1)
    order = np.argsort(dist)    # avoid shadowing the builtin `sorted`
    print("  ten articles closest to the representative: ")
    for j in range(10):
        print("%40s" % titles[order[j]], end="")
        if j % 2 == 1:
            print("")
CLUSTER 0: 51 articles
  most frequently appearing words:
    game  season  team  win  player  play  league  final  score  match
  ten articles closest to the representative:
    Kobe_Bryant  Lamar_Odom  Johan_Cruyff  Yogi_Berra  Jose_Mourinho
    Halo_5:_Guardians  Tom_Brady  Ben_Affleck  Eli_Manning  Hillsborough_disaster
CLUSTER 1: 21 articles
  most frequently appearing words:
    fight  win  event  champion  fighter  bout  title  ali  championship  bonus
  ten articles closest to the representative:
    Floyd_Mayweather,_Jr.  Kimbo_Slice  Ronda_Rousey  Jose_Aldo  Joe_Frazier
    Wladimir_Klitschko  Saul_Alvarez  Gennady_Golovkin  Nate_Diaz  Conor_McGregor
CLUSTER 2: 45 articles
  most frequently appearing words:
    series  season  episode  film  television  character  role  play  cast  star
  ten articles closest to the representative:
    The_X-Files  Game_of_Thrones  House_of_Cards_(U.S._TV_series)  Charlie_Sheen
    Supergirl_(U.S._TV_series)  Ben_Affleck  American_Horror_Story
    Daredevil_(TV_series)  Rod_Serling  Sherlock_(TV_series)
CLUSTER 3: 43 articles
  most frequently appearing words:
    holiday  celebrate  festival  calendar  celebration  date  united  moon  church  eclipse
  ten articles closest to the representative:
    Halloween  Mahatma_Gandhi  Frederick_Douglass  Sigmund_Freud  Christopher_Columbus
    Ben_Affleck  Carly_Fiorina  Harriet_Tubman  Marco_Rubio  Jim_Webb
CLUSTER 4: 12 articles
  most frequently appearing words:
    match  championship  event  win  style  raw  team  title  perform  episode
  ten articles closest to the representative:
    Wrestlemania_32  Royal_Rumble_(2016)  Payback_(2016)  Night_of_Champions_(2015)
    Survivor_Series_(2015)  Extreme_Rules_(2016)  Fastlane_(2016)  Hell_in_a_Cell_(2015)
    TLC:_Tables,_Ladders_&_Chairs_(2015)  Andre_the_Giant
CLUSTER 5: 12 articles
  most frequently appearing words:
    black  character  marvel  appear  strange  series  hole  cage  power  kill
  ten articles closest to the representative:
    Deadpool  Elektra_(comics)  Punisher  Black_Panther_(comics)  Frederick_Douglass
    Ben_Affleck  Sigmund_Freud  Daredevil_(TV_series)  Harriet_Tubman  Pablo_Escobar
CLUSTER 6: 194 articles
  most frequently appearing words:
    united  family  film  report  president  school  american  university  government  city
  ten articles closest to the representative:
    Ben_Affleck  Mahatma_Gandhi  Sigmund_Freud  Carly_Fiorina  Frederick_Douglass
    Christopher_Columbus  Marco_Rubio  Genie_(feral_child)  Fidel_Castro  Pablo_Escobar
CLUSTER 7: 46 articles
  most frequently appearing words:
    album  release  song  music  single  record  band  tour  perform  chart
  ten articles closest to the representative:
    David_Bowie  Kanye_West  Celine_Dion  Ariana_Grande  Kesha
    Adele  Gwen_Stefani  Anti_(album)  Dolly_Parton  Sia_Furler
CLUSTER 8: 76 articles
  most frequently appearing words:
    film  star  million  role  release  play  character  movie  actor  award
  ten articles closest to the representative:
    Leonardo_DiCaprio  Kate_Beckinsale  Star_Wars:_The_Force_Awakens
    Star_Wars_Episode_I:_The_Phantom_Menace  Maureen_O'Hara  Keanu_Reeves
    Johnny_Depp  The_Hateful_Eight  The_Revenant_(2015_film)  Back_to_the_Future_II
(Problem 1c) You started a web service called the 'Kiwipedia', which goes like this.
Explain in detail how you will use the results obtained in the previous subproblems to implement your idea.
# your answer here
#
# you can do either of the following.
#
# 1) you can find the cluster to which the current article belongs, by
#    measuring the distance to the cluster centers, and then list the
#    articles closest to the chosen cluster center. this is desirable
#    since they are the "important" pages that capture the central
#    issues that the current page deals with.
# 2) you can simply list the articles closest to the current article.
#    this is good since they are simply close to the current page;
#    however, you cannot tell whether they are "important" or not.
#    this implies your suggestion will probably be "garbage" if the
#    current page is garbage.
#
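Both strategies reduce to a few lines of NumPy. The sketch below runs them side by side on random stand-in data (`theta`, `word_histogram`, and the query vector are synthetic here; in the actual problem they come from the previous subproblems):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.random((9, 20))               # stand-in cluster centers
word_histogram = rng.random((500, 20))    # stand-in article histograms
query = rng.random(20)                    # stand-in current page

# strategy 1: nearest cluster center, then the articles closest to that center
k = int(np.argmin(np.linalg.norm(theta - query, axis=1)))
dist_to_center = np.linalg.norm(word_histogram - theta[k], axis=1)
suggest_by_cluster = np.argsort(dist_to_center)[:5]

# strategy 2: the articles closest to the current page itself
dist_to_query = np.linalg.norm(word_histogram - query, axis=1)
suggest_by_distance = np.argsort(dist_to_query)[:5]

print(suggest_by_cluster, suggest_by_distance)
```

Strategy 1 anchors the suggestions to the cluster representative, so they stay "central"; strategy 2 tracks the query exactly, for better or worse.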
(Problem 1d) Now implement your idea. Suppose a user is currently at a webpage, which can be characterized by a 4423-dimensional word histogram vector, word_histogram_current_page
, given below. The word histogram uses the same 4423-word dictionary you used in the previous subproblems.
List the titles of the five pages that you will have your Kiwi-bot display in your pop-up window.
Though your instructor computed the word histogram vector for you today, everyone in this class should now be able to build such a vector on their own from raw webpages.
word_histogram_current_page = np.array(obj["article_crnt"])
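A minimal sketch of how such a histogram vector could be built from raw text, assuming access to the same `dictionary`. The regex tokenization here is a simplistic stand-in for whatever preprocessing the instructor actually used (note it does no stemming, so "stars" does not count as "star"):

```python
import re
import numpy as np

def text_to_histogram(text, dictionary):
    """Count dictionary words in `text` and normalize to a histogram."""
    index = {w: j for j, w in enumerate(dictionary)}
    counts = np.zeros(len(dictionary))
    for token in re.findall(r"[a-z]+", text.lower()):
        j = index.get(token)
        if j is not None:          # ignore words outside the dictionary
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

# toy usage with a made-up 4-word dictionary
toy_dictionary = ["film", "star", "wars", "album"]
h = text_to_histogram("Star Wars is a film, and the film stars many actors.",
                      toy_dictionary)
print(h)    # film counted twice; "stars" is not matched (no stemming)
```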
# your code here
# distances from the current page to the cluster centers and to all articles
dist_to_centers = np.linalg.norm(word_histogram_current_page - theta, axis=1)
dist_to_articles = np.linalg.norm(word_histogram_current_page - word_histogram, axis=1)

# titles of the 5 articles closest to the representative of the nearest cluster
k = np.argmin(dist_to_centers)
dist = np.linalg.norm(word_histogram - theta[k], axis=1)
order = np.argsort(dist)
print(" five articles closest to the representative: ")
for j in range(5):
    print("  ", titles[order[j]])

# titles of the 5 articles closest to the current page
index_closest_articles = np.argsort(dist_to_articles)
print(" five articles closest to the current page: ")
for j in range(5):
    print("  ", titles[index_closest_articles[j]])
import matplotlib.pyplot as plt
plt.figure(figsize=(10,4), dpi=100)
plt.plot(dist_to_centers, "o-")
plt.grid()
plt.xlabel("cluster number")
plt.ylabel("distance")
plt.title(f"distance from {K} cluster centers")
plt.show()
plt.figure(figsize=(10,4), dpi=100)
plt.plot(dist_to_articles, "o-")
plt.grid()
plt.xlabel("article number")
plt.ylabel("distance")
plt.title("distance from 500 articles")
plt.show()
five articles closest to the representative:
   David_Bowie
   Kanye_West
   Celine_Dion
   Ariana_Grande
   Kesha
five articles closest to the current page:
   David_Bowie_discography
   Blackstar_(David_Bowie_album)
   Sia_Furler
   David_Bowie
   Purpose_(Justin_Bieber_album)