In this notebook, we're going to apply modern natural language processing models - phrase models, LDA, and Word2Vec - to Amazon mobile phone reviews, and find out how customers feel about different phone brands through sentiment analysis.
Here's a breakdown of what we're going to do in this notebook.
The data is from "Amazon Reviews: Unlocked Mobile Phones" on Kaggle. To run this notebook yourself, you need to get the data and place it in the data directory. You can freely download it from the link.
import pandas as pd
data = pd.read_csv('data/Amazon_Unlocked_Mobile.csv')
data.head()
 | Product Name | Brand Name | Price | Rating | Reviews | Review Votes
---|---|---|---|---|---|---
0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 |
1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 |
2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | Very pleased | 0.0 |
3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | It works good but it goes slow sometimes but i... | 0.0 |
4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | Great phone to replace my lost phone. The only... | 0.0 |
The data actually contains more information than just reviews (you can try your own analysis with the other columns!).
For this analysis, we'll extract only the review text from the data and store it in a file.
%%time
USE_PREMADE_REVIEWS_TEXT = False
from os import path
reviews_text_filepath = 'medium/reviews_text.txt'
if not USE_PREMADE_REVIEWS_TEXT:
    with open(reviews_text_filepath, 'w') as f:
        for review in data.Reviews.values:
            # if the row lacks a review, skip it.
            if pd.isnull(review):
                continue
            f.write(review + '\n')
else:
    assert path.exists(reviews_text_filepath)
CPU times: user 679 ms, sys: 133 ms, total: 813 ms Wall time: 849 ms
Let's build a simple function that reads each line from the reviews text file.
def read_reviews(filepath):
    """
    helper function to read in the file and yield lines one at a time.
    """
    with open(filepath) as f:
        for review in f:
            yield review
We have 413K reviews in total. Let's take a sample and see what they look like.
from itertools import islice
def retrieve_review(sample_num):
    """
    get a specific review from the reviews text file and return it.
    """
    return next(islice(read_reviews(reviews_text_filepath), sample_num, sample_num+1))
sample_review = retrieve_review(200)
sample_review
"As good as you can hope for in a phone for seniors. Reminiscent of a cordless house phone -- makes it easy to use for an 84-year-old. Buttons light up bright, ringtone volume is adjustable and on max it's more than loud enough for anyone with a hearing problem; though surprisingly the call volume is not adjustable, though it is louder than the previous Verizon flip phone on max volume. I set this up with Cricket service. The phone takes a MINI SIM CARD. Not a micro as someone else has stated. I had to buy another sim card because the micro I ordered was too small. Mini sims are the original, large cards. We live in the US. I don't know people are claiming this phone doesn't work in the US. I paid 30ish for this phone, but would have paid up to 100 for it because of the design and functionality. Side note: I was concerned that my grandmother would always accidentally hit the SOS button this phone has, but it can be disabled in the settings.\n"
spaCy is a robust natural language processing (NLP) library for Python. spaCy is highly optimized and comes with many ready-to-use functionalities, including tokenization, lemmatization, sentence boundary detection, etc. We'll use spaCy to normalize the reviews.
%%time
import spacy
# load english vocabulary and language models. This takes some time.
nlp = spacy.load('en')
CPU times: user 12.4 s, sys: 4.11 s, total: 16.5 s Wall time: 18.2 s
def lemmatize(line):
    """
    remove punctuation and whitespace.
    """
    return [token.lemma_ for token in line
            if not token.is_punct and not token.is_space]
Let's see how well spaCy did. Here's the normalized version of the sample review above. You can see that many words have been lowercased and lemmatized.
sample_review_normalized = lemmatize(nlp(sample_review))
' '.join(sample_review_normalized)
"as good as you can hope for in a phone for senior reminiscent of a cordless house phone make it easy to use for an 84-year old button light up bright ringtone volume be adjustable and on max -PRON- ' more than loud enough for anyone with a hearing problem though surprisingly the call volume be not adjustable though it be loud than the previous verizon flip phone on max volume i set this up with cricket service the phone take a mini sim card not a micro as someone else have state i have to buy another sim card because the micro i order be too small mini sims be the original large card we live in the us i do not know people be claim this phone do not work in the us i pay 30ish for this phone but would have pay up to 100 for it because of the design and functionality side note i be concerned that my grandmother would always accidentally hit the sos button this phone have but it can be disable in the setting"
We now perform normalization for all the reviews we have. This takes a while.
%%time
USE_PREMADE_SENTENCES_NORMALIZED = True
sentences_normalized_filepath = 'medium/sentences_normalized.txt'
if not USE_PREMADE_SENTENCES_NORMALIZED:
    with open(sentences_normalized_filepath, 'w') as f:
        for review_parsed in nlp.pipe(read_reviews(reviews_text_filepath)):
            for sentence_parsed in review_parsed.sents:
                lemmas = lemmatize(sentence_parsed)
                f.write(' '.join(lemmas) + '\n')
else:
    assert path.exists(sentences_normalized_filepath)
CPU times: user 21 µs, sys: 27 µs, total: 48 µs Wall time: 52.9 µs
There are words that are often used together and take on a special meaning when combined. We call these 'phrases'. We're now going to find bigram/trigram phrases in the reviews.
To do so, we turn to gensim, the famous Python NLP library; in particular, its Phrases class.
from gensim.models import Phrases
We take the normalized texts from the previous section, and build a bigram model upon them.
%%time
USE_PREMADE_BIGRAM_MODEL = False
bigram_model_filepath = 'medium/bigram_model'
# gensim's LineSentence provides a convenient way to iterate over lines in a text file.
# it yields one line at a time, saving memory, and it works well with other gensim components.
from gensim.models.word2vec import LineSentence
# we take the normalized sentences as unigram sentences, meaning no phrase modeling has been applied yet.
unigram_sentences = LineSentence(sentences_normalized_filepath)
if not USE_PREMADE_BIGRAM_MODEL:
    bigram_model = Phrases(unigram_sentences)
    bigram_model.save(bigram_model_filepath)
else:
    bigram_model = Phrases.load(bigram_model_filepath)
CPU times: user 45.9 s, sys: 1.24 s, total: 47.1 s Wall time: 48.6 s
Let's see how it worked. You can see that some common two-word phrases got glued together (with underscores).
sample_review_bigram = bigram_model[sample_review_normalized]
' '.join(sample_review_bigram)
"as good as you can hope for in a phone for senior reminiscent of a cordless house phone make it easy to use for an 84-year old button light up bright ringtone volume be adjustable and on max -PRON- ' more than loud_enough for anyone with a hearing problem though surprisingly the call volume be not adjustable though it be loud than the previous verizon flip phone on max_volume i set this up with cricket service the phone take a mini sim_card not a micro as someone_else have state i have to buy another sim_card because the micro i order be too small mini sims be the original large card we live in the us i do not know people be claim this phone do not work in the us i pay 30ish for this phone but would have pay up to 100 for it because of the design and functionality side note i be concerned that my grandmother would always accidentally_hit the sos_button this phone have but it can be disable in the setting"
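Under the hood, Phrases glues a word pair together when a co-occurrence score clears a threshold: pairs that appear together far more often than their individual frequencies would suggest score high. Here's a minimal pure-Python sketch of the default scoring formula, as I understand it (the toy sentences and min_count are illustrative only; real training additionally applies a threshold cut-off):

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score each adjacent word pair the way gensim's default scorer does:
    (pair_count - min_count) * vocab_size / (count_a * count_b)."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
    vocab_size = len(word_counts)
    return {(a, b): (n - min_count) * vocab_size / (word_counts[a] * word_counts[b])
            for (a, b), n in pair_counts.items() if n > min_count}

sentences = [['i', 'love', 'new', 'york'],
             ['new', 'york', 'be', 'big'],
             ['i', 'love', 'big', 'phone'],
             ['i', 'love', 'phone']]
scores = bigram_scores(sentences)
# 'new' and 'york' always co-occur, so the pair outranks the more diluted ('i', 'love')
best = max(scores, key=scores.get)
```

A pair like ('new', 'york') scores high because the two words rarely appear apart, which is exactly why it gets glued into 'new_york'.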
We process all the normalized texts in the same way.
%%time
USE_PREMADE_BIGRAM_SENTENCES = False
bigram_sentences_filepath = 'medium/bigram_sentences.txt'
if not USE_PREMADE_BIGRAM_SENTENCES:
    with open(bigram_sentences_filepath, 'w') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = bigram_model[unigram_sentence]
            f.write(' '.join(bigram_sentence) + '\n')
else:
    assert path.exists(bigram_sentences_filepath)
CPU times: user 1min 38s, sys: 999 ms, total: 1min 39s Wall time: 1min 41s
Let's take one step further. We're going to build a trigram phrase model on top of the bigram model. This means the second pass can glue a bigram phrase to a neighboring unigram (or even to another bigram phrase).
%%time
USE_PREMADE_TRIGRAM_MODEL = False
trigram_model_filepath = 'medium/trigram_model'
from gensim.models.word2vec import LineSentence
from gensim.models import Phrases
# bigram_sentences is defined outside the if block because we'll reuse it later,
# even when loading a premade trigram model.
bigram_sentences = LineSentence(bigram_sentences_filepath)
if not USE_PREMADE_TRIGRAM_MODEL:
    trigram_model = Phrases(bigram_sentences)
    trigram_model.save(trigram_model_filepath)
else:
    trigram_model = Phrases.load(trigram_model_filepath)
CPU times: user 42.6 s, sys: 679 ms, total: 43.3 s Wall time: 43.8 s
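Chaining the two models is what builds trigrams: the bigram pass has already glued pairs like 'sim_card' into single tokens, so the second pass can join such a token with a neighbor. A toy sketch of the gluing mechanism with hardcoded phrase sets (the phrases here are illustrative, not learned):

```python
def glue_phrases(tokens, phrases):
    """Greedy left-to-right pass: join adjacent tokens that form a known phrase."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + '_' + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# the first pass learns two-word phrases; the second pass sees 'sim_card'
# as one token and can pair it with a neighboring unigram
bigram_phrases = {('sim', 'card')}
trigram_phrases = {('micro', 'sim_card')}

tokens = ['buy', 'a', 'micro', 'sim', 'card']
after_bigram = glue_phrases(tokens, bigram_phrases)
after_trigram = glue_phrases(after_bigram, trigram_phrases)
```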
The preprocessing step is almost done. All that's left is to prepare the reviews and sentences for later use.
%%time
USE_PREMADE_REVIEWS_FOR_LDA = True
reviews_for_lda_filepath = 'medium/reviews_for_lda.txt'
if not USE_PREMADE_REVIEWS_FOR_LDA:
    with open(reviews_for_lda_filepath, 'w') as f:
        for review_parsed in nlp.pipe(read_reviews(reviews_text_filepath)):
            unigram_review = lemmatize(review_parsed)
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            # remove stop words
            trimmed_review = [lemma for lemma in trigram_review
                              if lemma not in spacy.en.STOP_WORDS and lemma != '-PRON-']
            f.write(' '.join(trimmed_review) + '\n')
else:
    assert path.exists(reviews_for_lda_filepath)
CPU times: user 34min 16s, sys: 10.1 s, total: 34min 26s Wall time: 34min 44s
%%time
USE_PREMADE_SENTENCES_FOR_WORD2VEC = False
sentences_for_word2vec_filepath = 'medium/sentences_for_word2vec.txt'
if not USE_PREMADE_SENTENCES_FOR_WORD2VEC:
    with open(sentences_for_word2vec_filepath, 'w') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = trigram_model[bigram_sentence]
            # remove stop words
            trimmed_sentence = [lemma for lemma in trigram_sentence
                                if lemma not in spacy.en.STOP_WORDS and lemma != '-PRON-']
            f.write(' '.join(trimmed_sentence) + '\n')
else:
    assert path.exists(sentences_for_word2vec_filepath)
CPU times: user 1min 35s, sys: 452 ms, total: 1min 36s Wall time: 1min 36s
Topic modeling automatically finds topics in a collection of documents (reviews, in this case). We'll now run LDA, the most basic topic modeling method, on our reviews.
from gensim.corpora import Dictionary, MmCorpus
First, we need to compile our dictionary.
%%time
USE_PREMADE_DICTIONARY = False
dictionary_filepath = 'medium/dictionary.dict'
if not USE_PREMADE_DICTIONARY:
    reviews_for_lda = LineSentence(reviews_for_lda_filepath)
    dictionary = Dictionary(reviews_for_lda)
    dictionary.filter_extremes(no_below=10, no_above=0.4)
    dictionary.compactify()
    dictionary.save(dictionary_filepath)
else:
    dictionary = Dictionary.load(dictionary_filepath)
CPU times: user 19.2 s, sys: 154 ms, total: 19.3 s Wall time: 19.4 s
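For intuition, filter_extremes(no_below=10, no_above=0.4) keeps only tokens that appear in at least 10 reviews but in no more than 40% of all reviews, dropping both rare noise and near-ubiquitous words. The idea can be sketched in plain Python (the toy documents and thresholds below are for illustration only):

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above):
    """Keep tokens that appear in at least `no_below` documents
    and in at most a `no_above` fraction of all documents."""
    doc_freq = Counter(t for doc in docs for t in set(doc))
    n_docs = len(docs)
    return {t for t, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}

docs = [['phone', 'good'], ['phone', 'bad'], ['phone', 'good', 'screen']]
kept = filter_extremes_sketch(docs, no_below=2, no_above=0.9)
# 'phone' appears in 100% of documents (above 0.9) and is dropped;
# 'bad' and 'screen' appear only once (below 2) and are dropped; 'good' survives
```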
Then, we build a corpus which we'll use when performing LDA.
%%time
USE_PREMADE_CORPUS = False
corpus_filepath = 'medium/corpus.mm'
if not USE_PREMADE_CORPUS:
    def make_bow_corpus(filepath):
        """
        generator function to read in reviews from the file
        and output a bag-of-words representation of the text
        """
        for review in LineSentence(filepath):
            yield dictionary.doc2bow(review)
    MmCorpus.serialize(corpus_filepath, make_bow_corpus(reviews_for_lda_filepath))
review_corpus = MmCorpus(corpus_filepath)
CPU times: user 27.4 s, sys: 510 ms, total: 27.9 s Wall time: 28 s
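For reference, doc2bow just maps each in-vocabulary token to its dictionary id and counts occurrences; anything not in the dictionary is silently dropped. A minimal pure-Python sketch (the ids below are made up; gensim assigns its own):

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count tokens and map them to (id, count) pairs,
    skipping out-of-vocabulary words, like gensim's Dictionary.doc2bow."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

token2id = {'phone': 0, 'great': 1, 'battery': 2}
bow = doc2bow_sketch(['great', 'phone', 'great', 'screen'], token2id)
# 'screen' is not in the dictionary, so it is dropped
```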
Finally, we can turn to gensim's LdaMulticore class for parallelized LDA, which is claimed to be faster.
from gensim.models import LdaMulticore
%%time
USE_PREMADE_LDA = False
lda_filepath = 'medium/lda'
if not USE_PREMADE_LDA:
    # the number of workers should be set to your number of physical cores minus one
    lda = LdaMulticore(review_corpus,
                       num_topics=20,
                       id2word=dictionary,
                       workers=2)
    lda.save(lda_filepath)
else:
    lda = LdaMulticore.load(lda_filepath)
CPU times: user 4min 54s, sys: 34.3 s, total: 5min 28s Wall time: 5min 37s
You can inspect a specific topic from the model by its index. There are no names for topics, though, because LDA is an unsupervised learning algorithm. Instead, you can look at the words associated with each topic.
lda.show_topic(0)
[('issue', 0.026725212643087325), ('bad', 0.024668221610121457), ('time', 0.014440138006607392), ('screen', 0.012994522786482046), ('use', 0.011791822927948966), ('wifi', 0.011791142072353969), ('keyboard', 0.0089068062466145462), ('app', 0.0087798956129946908), ('update', 0.0077374814098902641), ('drop', 0.007385674542713527)]
Manually inspecting topics by index is painful. Let's plot the LDA results with pyLDAvis, a great Python visualization library.
import pyLDAvis
import pyLDAvis.gensim
import pickle
%%time
USE_PREMADE_LDAVIS = False
ldavis_filepath = 'medium/ldavis'
if not USE_PREMADE_LDAVIS:
    ldavis = pyLDAvis.gensim.prepare(topic_model=lda,
                                     corpus=review_corpus,
                                     dictionary=dictionary)
    with open(ldavis_filepath, 'wb') as f:
        pickle.dump(ldavis, f)
else:
    with open(ldavis_filepath, 'rb') as f:
        ldavis = pickle.load(f)
CPU times: user 8min 8s, sys: 2.41 s, total: 8min 10s Wall time: 6min 50s
pyLDAvis.display(ldavis)
You can find some interesting patterns in this plot. First, the topics on the right side (13, 15, 17) are associated with positive terms (good, great, awesome). Conversely, the topics on the left side are about technical terms, and some of them are about issues and problems. For example, the words from topic 5 clearly represent customer dissatisfaction and complaints (bad, issue, heat-up).
Based on these findings, I've given each topic a name and a sentiment. For example, topic 5 is negative and topic 13 is positive, while topic 6 is neutral. Some are ambiguous, of course, but this gives us more insight into the topics.
Note: Unfortunately, gensim and pyLDAvis don't use the same topic index numbers. In the dictionary below, I used gensim's index numbers and provided the matching pyLDAvis index numbers in the comments.
# (topic_name, sentiment) # pyLDAvis index
# sentiment = 1 : positive, -1 : negative, 0 : neutral
topic_sentiments = {0: ('bad', -1), # 5
1: ('fine', 1), # 16
2: ('charger & cables', 0), # 11
3: ('model upgrade', 0), # 12
4: ('present', 1), # 15
5: ('battery problem', -1), # 2
6: ('memory problem', -1), # 10
7: ('nice size', 1), # 19
8: ('international use', 0), # 6
9: ('awesome', 1), # 20
10: ('meet expectation', 1), # 14
11: ('sim card', 0), # 4
12: ('functionality', 0), # 1
13: ('speaker', 0), # 18
14: ('ram', 0), # 3
15: ('touch & buttons', 0), # 7
16: ('excellent', 1), # 13
17: ('look', 0), # 8
18: ('great', 1), # 9
19: ('good', 1), # 17
}
Given a review, we can now tell which topics it covers and how positive it is. Here are some helper functions to do so.
def get_review_lda(review):
    review_parsed = nlp(review)
    review_lemmatized = lemmatize(review_parsed)
    review_bigram = bigram_model[review_lemmatized]
    review_trigram = trigram_model[review_bigram]
    review_trimmed = [lemma for lemma in review_trigram
                      if lemma not in spacy.en.STOP_WORDS and lemma != '-PRON-']
    review_bow = dictionary.doc2bow(review_trimmed)
    review_lda = lda[review_bow]
    return review_lda
def get_review_sentiment(review):
    if pd.isnull(review):
        return 0
    review_lda = get_review_lda(review)
    sentiments = [(topic_sentiments[topic][1] * frequency, frequency)
                  for topic, frequency in review_lda
                  if topic_sentiments[topic][1] != 0]
    if not sentiments:
        return 0
    else:
        return sum(senti for senti, freq in sentiments) / sum(freq for senti, freq in sentiments)
def describe_review(review_num, min_frequency=0.1):
    review = retrieve_review(review_num)
    review_lda = get_review_lda(review)
    sentiment = get_review_sentiment(review)
    for topic, frequency in sorted(review_lda, key=lambda x: x[1], reverse=True):
        if frequency < min_frequency:
            continue
        print('{:25} {}'.format(topic_sentiments[topic][0], round(frequency, 3)))
    print()
    print("positiveness : {}".format(sentiment))
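The positiveness score computed by get_review_sentiment is a sentiment-weighted average over the non-neutral topics a review touches. Here's a worked toy example (the topic distribution and sentiment labels below are made up for illustration):

```python
# hypothetical LDA output for one review: (topic index, weight) pairs
review_lda = [(0, 0.6), (9, 0.3), (2, 0.1)]
# made-up sentiment labels: topic 0 negative, topic 9 positive, topic 2 neutral
sentiment_of = {0: -1, 9: 1, 2: 0}

# neutral topics are excluded; each remaining topic contributes its weight,
# signed by the topic's sentiment
weighted = [(sentiment_of[t] * w, w) for t, w in review_lda if sentiment_of[t] != 0]
score = sum(s for s, _ in weighted) / sum(w for _, w in weighted)
# (-0.6 + 0.3) / (0.6 + 0.3) = -1/3, i.e. mildly negative overall
```

The score always falls between -1 (purely negative topics) and 1 (purely positive topics).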
Let's take a look at a few examples.
sample_num = 4
print(retrieve_review(sample_num))
Great phone to replace my lost phone. The only thing is the volume up button does not work, but I can still go into settings to adjust. Other than that, it does the job until I am eligible to upgrade my phone again.Thaanks!
describe_review(sample_num)
model upgrade             0.273
speaker                   0.268
great                     0.208
touch & buttons           0.189

positiveness : 1.0
This review is 27.3% about model upgrade, 26.8% about speaker, 20.8% about how great it is, and 18.9% about touch & buttons. And it's all positive! Looks pretty good.
sample_num = 600
print(retrieve_review(sample_num))
Didnt offer clear and concise instructions on how to put in a sim card. I got bad eyesight so i put in a microsim in the sim slot. Tried to remove it with tweezers now the phone wont power on
describe_review(sample_num)
bad                       0.771
sim card                  0.169

positiveness : -1.0
This review is 77.1% about how bad the phone is, and the reason is its sim card. It's a negative review.
sample_num = 500
print(retrieve_review(sample_num))
The phone is an unlocked phone and the support network of the phone is 3G (WCDMA 850/2100) and 2G Quad Band(GSM 850/900/1800/1900) network ,If you have to buy a sim card buy the GSM/WCDMA SIM card. Do not buy the CDMA SIM card from them.The support language of the phone is English, Bahasa Indonesia, Bahasa Malaya, Burmese, Cestina, Deutsch, Espanola, French, Italiano, Nederland's, Portuguese, Vietnamese, Turkish, Greek, Russian, Hebrew, Arabic, Persian, Thai, Simplified/Traditional Chinese.If your service support those network, and those language, you can use this phone.It supports two different sims cards and a memory card I put a 64 gig memory card in it and it supported it.The internet on this phone is just as fast as 4g I was very impressed with the speed of the phone and clarity.The phone is very well made and durable you will not be disappointed with this phone.I received this phone at no cost/free to test it and try it out in exchange for my honest and unbiased opinion/review of the Juning cell phone.
describe_review(sample_num)
international use         0.492
sim card                  0.148
meet expectation          0.146
functionality             0.133

positiveness : 1.0
This customer talks a lot about international use, the sim card, and the functionality of the phone. Overall, the phone met the customer's expectations.
Let's expand our sentiment analysis to all of the reviews. How about grouping them by phone brand? Which brand got the most positive reviews?
%%time
USE_PREMADE_BRAND_SENTIMENT = False
brand_reviews_sentiment_filepath = 'medium/brand_reviews_sentiment.csv'
brand_sentiment_filepath = 'medium/brand_sentiment.csv'
if not USE_PREMADE_BRAND_SENTIMENT:
    brand_reviews = data[['Brand Name', 'Reviews']]
    brand_reviews_sentiment = brand_reviews.assign(Sentiment=lambda x: x['Reviews'].map(get_review_sentiment))
    brand_reviews_sentiment.to_csv(brand_reviews_sentiment_filepath, index=False)
    brand_sentiment = brand_reviews_sentiment.groupby('Brand Name')[['Sentiment']].mean()
    brand_sentiment.to_csv(brand_sentiment_filepath)
else:
    brand_sentiment = pd.read_csv(brand_sentiment_filepath).set_index('Brand Name')
CPU times: user 1h 37s, sys: 1min 18s, total: 1h 1min 55s Wall time: 1h 10min 54s
You can plot the result with any plotting library. Here we use Plotly (via cufflinks).
import cufflinks
cufflinks.go_offline()
brand_sentiment.loc[['Apple', 'Samsung', 'LG', 'BlackBerry', 'Nokia', 'Motorola', 'HTC', 'BLU', 'Sony', 'Huawei', 'ZTE']].iplot(kind='bar')
Apple has a high level of positive reviews - as they claim - followed by BlackBerry, Samsung, etc.
Word vector modeling (word embedding, put another way) is a method that transforms words into vectors, which enables arithmetic with them. Word2Vec was proposed by Google in 2013, and you can find a Python implementation of the model in ... gensim (of course!)
from gensim.models import Word2Vec
%%time
USE_PREMADE_WORD2VEC = False
word2vec_filepath = 'medium/word2vec_model'
if not USE_PREMADE_WORD2VEC:
    sentences_for_word2vec = LineSentence(sentences_for_word2vec_filepath)
    # initialize the model with 100-dimensional vectors, a window of 5 words
    # before and after each focus word, etc., and perform the first epoch of training
    phone2vec = Word2Vec(sentences_for_word2vec, size=100, window=5, min_count=5, sg=1)
    # perform another 9 epochs of training, for 10 in total
    for _ in range(9):
        phone2vec.train(sentences_for_word2vec)
    phone2vec.save(word2vec_filepath)
else:
    phone2vec = Word2Vec.load(word2vec_filepath)
phone2vec.init_sims()
CPU times: user 51min 8s, sys: 18.9 s, total: 51min 27s Wall time: 20min 17s
print('{} training epochs so far.'.format(phone2vec.train_count))
10 training epochs so far.
We transformed each word in our reviews into a 100-dimensional vector. Wondering how they look? Here are the word vectors in pandas DataFrame form.
# take word vectors of most frequent words.
num_words = 2000
word_embeddings = pd.DataFrame(phone2vec.syn0norm[:num_words, :], index=phone2vec.index2word[:num_words])
word_embeddings.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
phone | -0.127532 | 0.020481 | -0.052120 | 0.071198 | 0.100685 | 0.124345 | 0.017325 | -0.024921 | -0.126637 | -0.031508 | ... | 0.064698 | -0.181563 | -0.134739 | 0.070984 | 0.091476 | 0.253143 | 0.190031 | -0.091487 | 0.067340 | -0.054588 |
good | -0.058733 | -0.021988 | -0.065400 | 0.106013 | 0.071648 | 0.219430 | -0.079847 | -0.121834 | 0.046186 | -0.091939 | ... | -0.051095 | -0.160097 | -0.159881 | 0.119198 | 0.027544 | -0.061379 | 0.069744 | -0.032230 | -0.185933 | -0.048367 |
work | -0.002126 | 0.028364 | 0.103640 | 0.083105 | 0.079385 | -0.047276 | 0.001662 | -0.015813 | -0.159563 | -0.060560 | ... | 0.204926 | -0.160150 | -0.093795 | 0.111234 | 0.037255 | 0.019713 | 0.099565 | -0.039645 | -0.022610 | 0.053408 |
great | 0.001687 | -0.017033 | 0.004800 | 0.074285 | 0.012848 | 0.165874 | -0.135741 | -0.085240 | 0.068658 | -0.069331 | ... | -0.021366 | -0.264300 | -0.038069 | 0.152700 | 0.070082 | -0.106169 | 0.068651 | -0.025668 | -0.083072 | -0.023489 |
use | 0.038651 | 0.038483 | 0.029484 | 0.003339 | 0.022917 | 0.101063 | 0.011964 | -0.087735 | -0.191269 | -0.115036 | ... | 0.138858 | -0.000513 | -0.159838 | 0.103710 | 0.145529 | 0.085408 | 0.126005 | -0.117311 | -0.091205 | 0.044889 |
5 rows × 100 columns
Word vectors capture the 'meaning' of a word in some way, so you can find some cool stuff by playing with them.
phone2vec.most_similar(positive=['verizon'], topn=5)
[('at&t', 0.7858468294143677), ('sprint', 0.7684697508811951), ('att', 0.7366464138031006), ('t_mobile', 0.7208009958267212), ('at&t.', 0.6923530101776123)]
Word2Vec effectively figured out that verizon is similar to at&t, sprint, and t_mobile, in that they're all phone carriers.
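most_similar ranks words by the cosine similarity of their (unit-normalized) vectors. Here's a minimal sketch of that ranking with made-up 3-dimensional vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product of the two vectors divided by their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# toy embeddings: the two carriers point in roughly the same direction
vectors = {
    'verizon': [0.9, 0.1, 0.0],
    'at&t':    [0.8, 0.2, 0.1],
    'banana':  [0.0, 0.1, 0.9],
}
ranked = sorted((w for w in vectors if w != 'verizon'),
                key=lambda w: cosine(vectors['verizon'], vectors[w]),
                reverse=True)
# 'at&t' ranks first because its vector is nearly parallel to verizon's
```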
What do you get when you subtract 'coolness' from iphone?
phone2vec.most_similar(positive=['iphone'], negative=['cool'], topn=1)
[('galaxy_note', 0.47213214635849)]
You can even plot the word vectors if you get tired of inspecting their relations by hand. To plot 100-dimensional vectors in a 2-dimensional space (which our eyes can see), we need to employ some kind of dimensionality reduction technique. Here we're going to use t-SNE, which is known to capture non-linear patterns in the data better than PCA.
from sklearn.manifold import TSNE
%%time
USE_PREMADE_TSNE = False
tsne_filepath = 'medium/tsne.pkl'
if not USE_PREMADE_TSNE:
    tsne = TSNE(random_state=0)
    tsne_points = tsne.fit_transform(word_embeddings.values)
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne_points, f)
else:
    with open(tsne_filepath, 'rb') as f:
        tsne_points = pickle.load(f)
tsne_df = pd.DataFrame(tsne_points, index=word_embeddings.index, columns=['x_coord', 'y_coord'])
tsne_df['word'] = tsne_df.index
CPU times: user 11.5 s, sys: 1.21 s, total: 12.7 s Wall time: 12.6 s
Now the t-SNE result is ready. Let's plot it with Bokeh, Python's interactive visualization library.
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value
output_notebook()
# prepare the data in a form suitable for bokeh.
plot_data = ColumnDataSource(tsne_df)
# create the plot and configure it
tsne_plot = figure(title='t-SNE Word Embeddings',
plot_width = 800,
plot_height = 800,
active_scroll='wheel_zoom'
)
# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@word'))
tsne_plot.circle('x_coord', 'y_coord', source=plot_data,
color='red', line_alpha=0.2, fill_alpha=0.1,
size=10, hover_line_color='orange')
# adjust visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None
# show time!
show(tsne_plot);
dreamgonfly@gmail.com