LSA: a method that finds topic vectors using linear algebra

LDA: a method that computes topic vectors probabilistically

In [1]:

import numpy as np
import warnings
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer  
from sklearn.decomposition import TruncatedSVD  # truncated SVD
warnings.filterwarnings('ignore')

1. Latent Semantic Analysis (LSA):

In [3]:

# The data.
my_docs = ["The economic slowdown is becoming more severe",
           "The movie was simply awesome",
           "I like cooking my own food",
           "Samsung is announcing a new technology",
           "Machine Learning is an example of awesome technology",
           "All of us were excited at the movie",
           "We have to do more to reverse the economic slowdown"]

1.1. Create a TF-IDF representation:

TfidfVectorizer() arguments:

max_features : maximum number of features (distinct words).
min_df : the minimum document frequency (DF). An integer is an absolute document count; a float between 0 and 1 is a proportion of documents.
max_df : the maximum document frequency (DF). An integer is an absolute document count; a float between 0 and 1 is a proportion of documents. Helps to filter out stop words.
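
To make the count-versus-proportion distinction concrete, here is a minimal sketch of document-frequency filtering in plain Python. It is a simplified illustration of the idea, not scikit-learn's implementation; the helper `filter_by_df` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_df(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies in [min_df, max_df].

    Hypothetical helper: an int bound is an absolute document count,
    a float bound is a proportion of the number of documents.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc.split()))
    lo = min_df if isinstance(min_df, int) else min_df * n_docs
    hi = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if lo <= count <= hi)

toy_docs = ["the movie was awesome",
            "the food was awesome",
            "the economy is slowing",
            "the movie was long"]

kept_by_count = filter_by_df(toy_docs, min_df=2, max_df=3)            # bounds as counts
kept_by_proportion = filter_by_df(toy_docs, min_df=0.5, max_df=0.75)  # bounds as proportions
print(kept_by_count)
print(kept_by_proportion)
```

With four documents, the bounds 2 and 3 as counts are equivalent to 0.5 and 0.75 as proportions, so both calls keep the same terms. The word "the" occurs in every document and is dropped by the upper bound, which is how max_df can act as a stop-word filter.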

In [4]:

my_docs = [x.lower() for x in my_docs]  # convert to lowercase

In [5]:

my_stop_words = ['us', 'like']

In [6]:

# Extend NLTK's English stop words with my_stop_words using list concatenation (+).
vectorizer = TfidfVectorizer(max_features = 15, min_df = 1, max_df = 3, stop_words = stopwords.words('english') + my_stop_words)
X = vectorizer.fit_transform(my_docs).toarray()

In [8]:

X

Out[8]:

array([[0.        , 0.        , 0.        , 0.53828134, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.64846464, 0.        , 0.53828134, 0.        ],
       [0.        , 0.53828134, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.53828134, 0.        , 0.        ,
        0.        , 0.        , 0.64846464, 0.        , 0.        ],
       [0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.52064676, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.52064676, 0.        ,
        0.52064676, 0.        , 0.        , 0.        , 0.43218152],
       [0.        , 0.53828134, 0.        , 0.        , 0.64846464,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.53828134],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.76944876, 0.        , 0.63870855, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.53828134, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.64846464,
        0.        , 0.        , 0.        , 0.53828134, 0.        ]])
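
As a sanity check, the nonzero entries of the first row can be reproduced by hand. The sketch below assumes the vectorizer's default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), under which idf = ln((1 + n) / (1 + df)) + 1 and each row is scaled to unit length. The document frequencies are read off the seven documents: 'economic' and 'slowdown' each appear in two documents, 'severe' in one.

```python
import numpy as np

n_docs = 7
# Document frequencies of the three terms of document 1 that survive feature selection.
df = np.array([2, 1, 2])            # economic, severe, slowdown
tf = np.array([1, 1, 1])            # each term occurs once in document 1

idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
weights = tf * idf
row = weights / np.linalg.norm(weights)     # L2 normalization

print(row)   # approximately [0.5383, 0.6485, 0.5383]
```

The rarer term ('severe') receives the larger weight, which matches the 0.64846464 entry in the output above.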

In [6]:

# Shape of X is (m, n): m = number of documents = 7, n = number of features.
X.shape   # 7 rows (documents), 15 columns (features)

Out[6]:

(7, 15)

In [7]:

# View the features (the column names).
features = vectorizer.get_feature_names_out().tolist()   # get_feature_names() was removed in scikit-learn 1.2
print(features)

['announcing', 'awesome', 'cooking', 'economic', 'example', 'excited', 'food', 'movie', 'new', 'reverse', 'samsung', 'severe', 'simply', 'slowdown', 'technology']

1.2. Apply the truncated SVD:

In [8]:

# Create the object and fit it.
n_topics = 4  # goal: extract 4 topics, i.e. reduce the 15-dimensional feature space to 4 dimensions
svd = TruncatedSVD(n_components=n_topics, n_iter=100)  # n_iter=100 overrides the default of 5
svd.fit(X)

Out[8]:

TruncatedSVD(algorithm='randomized', n_components=4, n_iter=100,
             random_state=None, tol=0.0)

In [9]:

# Get the V^T matrix.
vt = svd.components_  # the transpose of V: one row per topic
vtabs = np.abs(vt)    # magnitudes, used to rank features regardless of sign

In [10]:

# Check the size of V^T.
vt.shape    # transposed, so there are 4 rows (one per topic)

Out[10]:

(4, 15)
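
For reference, the relationship between `components_` and the SVD can be sketched with plain NumPy. This is an illustration on a small made-up matrix using an exact SVD, not the randomized solver that `TruncatedSVD` uses by default; the matrix `A` and the variable names are assumptions.

```python
import numpy as np

# Toy document-term matrix: 3 documents x 4 features (made-up values).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                      # number of topics to keep
Vt_k = Vt[:k, :]           # top-k right singular vectors, shape (k, n_features)

# Rank-k approximation of A built from the truncated factors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt_k

print(Vt_k.shape)                      # (2, 4)
print(np.round(Vt_k @ Vt_k.T, 6))      # rows are orthonormal, so this is close to the identity
```

Up to sign flips, the rows of `Vt_k` play the role that the rows of `svd.components_` play in this notebook: each row is one topic vector expressed as weights over the features.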

In [15]:

vt

# Each row is printed across 5 lines.
# In the first row, the first element (4.09648554e-17) corresponds to 'announcing' and the second (-3.60448192e-17) to 'awesome'.

Out[15]:

array([[ 4.09648554e-17, -3.60448192e-17, -1.64706896e-18,
         6.05710899e-01, -3.14797383e-17,  1.41315903e-17,
        -1.64706896e-18,  1.68987806e-18,  1.51580645e-17,
         3.64848333e-01,  1.51580645e-17,  3.64848333e-01,
        -1.20346441e-17,  6.05710899e-01, -1.34983170e-17],
       [ 1.09328020e-01,  5.24120449e-01,  9.83780423e-16,
        -7.70860566e-17,  2.79641926e-01,  3.00367284e-01,
         9.83780423e-16,  5.41324276e-01,  1.09328020e-01,
        -2.12425741e-16,  1.09328020e-01,  1.27976612e-16,
         3.51763163e-01, -7.70867745e-17,  3.22878460e-01],
       [ 3.17757779e-01,  1.09242359e-01, -3.28091723e-16,
        -4.77055574e-18,  2.84805383e-01, -3.73322920e-01,
        -4.04664825e-16, -4.37060653e-01,  3.17757779e-01,
        -2.37628237e-17,  3.17757779e-01,  7.91995645e-18,
        -1.53201700e-01, -4.77057326e-18,  5.00179172e-01],
       [ 6.99407274e-16, -1.07768538e-15,  7.07106781e-01,
         3.12368259e-17, -9.61233971e-16, -2.53713733e-16,
         7.07106781e-01, -5.54089684e-16,  6.90182980e-16,
         6.04864874e-19,  6.90182980e-16, -5.77947367e-18,
        -3.31494761e-16, -2.42743250e-17, -2.59157645e-16]])

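1.3. From each topic, extract the top features (naming the topic vectors):

Apply SVD to the matrix and take the topic vectors.

Each component of a topic vector is the weight of one feature.

Use these components to pick the top 3 features per topic.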

In [11]:

n_top = 3
topic_matrix = []                                   # list of lists: top features per topic
for i in range(n_topics):                           # take topic vectors 0 through 3 one at a time
    # np.argsort() sorts in ascending order and returns the indices, not the values.
    # Negating vtabs gives descending order, so the largest weights come first.
    topic_features = [features[idx] for idx in np.argsort(-vtabs[i, :])]
    topic_matrix.append(topic_features[:n_top])     # keep only the top 3
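
The argsort idiom used in this cell can be checked on a toy vector. The weights and feature names below are made up for illustration.

```python
import numpy as np

weights = np.array([0.1, -0.7, 0.0, 0.4])   # made-up topic weights
names = ['a', 'b', 'c', 'd']                # made-up feature names

ascending = np.argsort(np.abs(weights))     # indices, smallest magnitude first
descending = np.argsort(-np.abs(weights))   # indices, largest magnitude first

top2 = [names[i] for i in descending[:2]]
print(ascending.tolist())    # [2, 0, 3, 1]
print(descending.tolist())   # [1, 3, 0, 2]
print(top2)                  # ['b', 'd']
```

Taking the absolute value first means a strongly negative weight such as -0.7 still ranks as a top feature, which is why the notebook sorts `vtabs` rather than `vt`.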

In [12]:

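# Show the top features for each topic.
topic_matrix

# The top 3 features of the first topic vector are ['economic', 'slowdown', 'severe'].
# These are the features that contribute most to each topic.
# Note: 'awesome' in the fourth topic has a weight of about 1e-15 in vt, i.e. effectively zero.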

Out[12]:

[['economic', 'slowdown', 'severe'],
 ['movie', 'awesome', 'simply'],
 ['technology', 'movie', 'excited'],
 ['cooking', 'food', 'awesome']]

In [13]:

# In view of the top features, we can name the topics.
topic_names = ['Economy', 'Movie', 'Technology', 'Cuisine']
# For example, the first topic vector is named Economy.

1.4. Label each document with the most predominant topic:

For each document, count how many of each topic's keywords occur in it.

In [14]:

n_docs = len(my_docs)
for i in range(n_docs):
    score_pick = 0
    topic_pick = 0
    tokenized_doc = nltk.word_tokenize(my_docs[i])   # requires NLTK's Punkt tokenizer data
    for j in range(n_topics):
        found = [ x in topic_matrix[j] for x in tokenized_doc ]
        score = np.sum(found)
        if (score > score_pick):
            score_pick = score
            topic_pick = j
    print("Document " + str(i+1) + " = " + topic_names[topic_pick])

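# For each document, check how many of each topic's top-3 keywords it contains; the topic with the largest overlap becomes the document's label.
# On a tie, the topic with the lowest index wins because the comparison is a strict >.
# The first document is labeled Economy.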

Document 1 = Economy
Document 2 = Movie
Document 3 = Cuisine
Document 4 = Technology
Document 5 = Movie
Document 6 = Technology
Document 7 = Economy

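NOTE: Some documents appear to be labeled correctly and others do not. For example, Document 5 (about machine learning) is labeled Movie, and Document 6 (about a movie) is labeled Technology.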
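
Documents 5 and 6 are instructive, and their labels can be traced by recomputing the keyword-overlap scores by hand. The sketch below copies the top-3 keywords from the output of section 1.3 and uses a plain whitespace split as a stand-in for `nltk.word_tokenize`, which is an assumption that holds for these punctuation-free sentences.

```python
# Top-3 keywords per topic, copied from the output of section 1.3.
topic_matrix = [['economic', 'slowdown', 'severe'],
                ['movie', 'awesome', 'simply'],
                ['technology', 'movie', 'excited'],
                ['cooking', 'food', 'awesome']]
topic_names = ['Economy', 'Movie', 'Technology', 'Cuisine']

docs = {5: "machine learning is an example of awesome technology",
        6: "all of us were excited at the movie"}

results = {}
for doc_id, text in docs.items():
    tokens = text.split()   # whitespace split stands in for nltk.word_tokenize
    # Overlap score per topic: how many tokens are among that topic's keywords.
    scores = [sum(tok in keywords for tok in tokens) for keywords in topic_matrix]
    # list.index(max(...)) returns the first maximum, like the strict > test in the loop.
    picked = topic_names[scores.index(max(scores))]
    results[doc_id] = (scores, picked)
    print(doc_id, scores, picked)
```

Document 5 scores 1 for Movie, Technology and Cuisine alike (through 'awesome' and 'technology'), so the tie goes to the lowest index, Movie. Document 6 scores 2 for Technology because 'movie' and 'excited' both landed in that topic's top 3. Both outcomes follow from overlapping keyword lists rather than from the scoring loop itself.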
