Notebook

8장. 감성 분석에 머신 러닝 적용하기¶

In [1]:

%load_ext watermark
%watermark -u -d -v -p numpy,pandas,sklearn,nltk

/home/haesun/anaconda3/envs/python-ml/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)

last updated: 2019-03-11 

CPython 3.7.2
IPython 7.3.0

numpy 1.16.1
pandas 0.24.1
sklearn 0.20.2
nltk 3.4

텍스트 처리용 IMDb 영화 리뷰 데이터 준비¶

영화 리뷰 데이터셋 구하기¶

다음처럼 파이썬에서 직접 압축을 풀 수도 있습니다:

In [1]:

# import os, tarfile
# if not os.path.isdir('aclImdb'):
#     with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
#         tar.extractall()

영화 리뷰 데이터셋을 더 간편한 형태로 전처리하기¶

In [2]:

import os, pyprind
import pandas as pd

# `basepath`를 압축 해제된 영화 리뷰 데이터셋이 있는 디렉토리로 바꾸세요
basepath = 'aclImdb'
labels   = {'pos': 1, 'neg': 0}
# 50,000 : for 반복문 작업횟수
pbar = pyprind.ProgBar(50000) 
df   = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 
                      'r', encoding='utf-8') as infile:
                txt = infile.rejad()
            df = df.append([[txt, labels[l]]], 
                           ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:53

In [4]:

# 데이터프레임을 섞습니다:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

선택사항: 만들어진 데이터를 CSV 파일로 저장합니다:

In [6]:

import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

In [7]:

df.shape

Out[7]:

(50000, 2)

BoW 모델 소개¶

단어를 특성 벡터로 변환하기¶

CountVectorizer의 fit_transform 메서드를 호출하여 BoW 모델의 어휘사전을 만들고 다음 세 문장을 희소한 특성 벡터로 변환합니다:

The sun is shining
The weather is sweet
The sun is shining, the weather is sweet, and one and one is two

In [9]:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer() # 카운트 인스턴스를 생성
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
bag.toarray()

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}

Out[9]:

array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]], dtype=int64)

어휘 사전의 내용을 출력해 보면 BoW 모델의 개념을 이해하는 데 도움이 됩니다:

In [10]:

print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}

이전 결과에서 볼 수 있듯이 어휘 사전은 고유 단어와 정수 인덱스가 매핑된 파이썬 딕셔너리에 저장되어 있습니다. 그다음 만들어진 특성 벡터를 출력해 봅시다:

특성 벡터의 각 인덱스는 CountVectorizer의 어휘 사전 딕셔너리에 저장된 정수 값에 해당됩니다. 예를 들어 인덱스 0에 있는 첫 번째 특성은 ‘and’ 단어의 카운트를 의미합니다. 이 단어는 마지막 문서에만 나타나네요. 인덱스 1에 있는 (특성 벡터의 두 번째 열) 단어 ‘is’는 세 문장에 모두 등장합니다. 특성 벡터의 이런 값들을 단어 빈도(term frequency) 라고도 부릅니다. 문서 d에 등장한 단어 t의 횟수를 tf (t,d)와 같이 씁니다.

In [5]:

print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]

tf-idf를 사용해 단어 적합성 평가하기¶

In [6]:

np.set_printoptions(precision=2)

텍스트 데이터를 분석할 때 클래스 레이블이 다른 문서에 같은 단어들이 나타나는 경우를 종종 보게 됩니다. 일반적으로 자주 등장하는 단어는 유용하거나 판별에 필요한 정보를 가지고 있지 않습니다. 이 절에서 특성 벡터에서 자주 등장하는 단어의 가중치를 낮추는 기법인 tf-idf(term frequency-inverse document frequency)에 대해 배우겠습니다. tf-idf는 단어 빈도와 역문서 빈도(inverse document frequency)의 곱으로 정의됩니다:

$$\text{tf-idf}(t,d)=\text{tf (t,d)}\times \text{idf}(t,d)$$

여기에서 tf(t, d)는 이전 절에서 보았던 단어 빈도입니다. idf(t, d)는 역문서 빈도로 다음과 같이 계산합니다:

$$\text{idf}(t,d) = \text{log}\frac{n_d}{1+\text{df}(d, t)},$$

여기에서 $n_d$는 전체 문서 개수이고 df(d, t)는 단어 t가 포함된 문서 d의 개수입니다. 분모에 상수 1을 추가하는 것은 선택 사항입니다. 훈련 샘플에 한 번도 등장하지 않는 단어가 있는 경우 분모가 0이 되지 않게 만듭니다. log는 문서 빈도 df(d, t)가 낮을 때 역문서 빈도 값이 너무 커지지 않도록 만듭니다.

사이킷런 라이브러리에는 CountVectorizer 클래스에서 만든 단어 빈도를 입력받아 tf-idf로 변환하는 TfidfTransformer 클래스가 구현되어 있습니다:

In [12]:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]

이전 절에서 보았듯이 세 번째 문서에서 단어 ‘is’가 가장 많이 나타났기 때문에 단어 빈도가 가장 컸습니다. 동일한 특성 벡터를 tf-idf로 변환하면 단어 ‘is’는 비교적 작은 tf-idf를 가집니다(0.45). 이 단어는 첫 번째와 두 번째 문서에도 나타나므로 판별에 유용한 정보를 가지고 있지 않을 것입니다.

수동으로 특성 벡터에 있는 각 단어의 tf-idf를 계산해 보면 TfidfTransformer가 앞서 정의한 표준 tf-idf 공식과 조금 다르게 계산한다는 것을 알 수 있습니다. 사이킷런에 구현된 역문서 빈도 공식은 다음과 같습니다.

$$\text{idf} (t,d) = log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

비슷하게 사이킷런에서 계산하는 tf-idf는 앞서 정의한 공식과 조금 다릅니다:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

일반적으로 tf-idf를 계산하기 전에 단어 빈도(tf)를 정규화하지만 TfidfTransformer 클래스는 tf-idf를 직접 정규화합니다. 사이킷런의 TfidfTransformer는 기본적으로 L2 정규화를 적용합니다(norm=’l2’). 정규화되지 않은 특성 벡터 v를 L2-노름으로 나누면 길이가 1인 벡터가 반환됩니다:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

TfidfTransformer의 작동 원리를 이해하기 위해 세 번째 문서에 있는 단어 ‘is'의 tf-idf를 예로 들어 계산해 보죠.

세 번째 문서에서 단어 ‘is’의 단어 빈도는 3입니다(tf=3). 이 단어는 세 개 문서에 모두 나타나기 때문에 문서 빈도가 3입니다(df=3). 따라서 역문서 빈도는 다음과 같이 계산됩니다:

$$\text{idf}("is", d3) = log \frac{1+3}{1+3} = 0$$

이제 tf-idf를 계산하기 위해 역문서 빈도에 1을 더하고 단어 빈도를 곱합니다:

$$\text{tf-idf}("is",d3)= 3 \times (0+1) = 3$$

In [13]:

tf_is    = 3
n_docs   = 3
idf_is   = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)

tf-idf of term "is" = 3.00

세 번째 문서에 있는 모든 단어에 대해 이런 계산을 반복하면 tf-idf 벡터 [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]를 얻습니다. 이 특성 벡터의 값은 앞서 사용했던 TfidfTransformer에서 얻은 값과 다릅니다. tf-idf 계산에서 빠트린 마지막 단계는 다음과 같은 L2-정규화입니다::

$$\text{tfi-df}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]}{\sqrt{[3.39^2, 3.0^2, 3.39^2, 1.29^2, 1.29^2, 1.29^2, 2.0^2 , 1.69^2, 1.29^2]}}$$$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$$$\Rightarrow \text{tfi-df}_{norm}("is", d3) = 0.45$$

결과에서 보듯이 사이킷런의 TfidfTransformer에서 반환된 결과와 같아졌습니다. tf-idf 계산 방법을 이해했으므로 다음 절로 넘어가 이 개념을 영화 리뷰 데이터셋에 적용해 보죠.

In [14]:

tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 

Out[14]:

array([3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29])

In [15]:

l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

Out[15]:

array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])

텍스트 데이터 정제¶

In [12]:

df.loc[0, 'review'][-50:]

Out[12]:

'and I suggest that you go see it before you judge.'

In [23]:

import re
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub('[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return emoticons, text

preprocessor(df.loc[0, 'review'][-50:])

Out[23]:

([], 'and i suggest that you go see it before you judge ')

In [24]:

preprocessor("</a>This :) is :( a test :-)! ;) ::)")

Out[24]:

([':)', ':(', ':-)', ';)', ':)'], 'this is a test :) :( :) ;) :)')

In [20]:

df['review'] = df['review'].apply(preprocessor)

In [21]:

df['review'].map(preprocessor)

Out[21]:

0        in 1974 the teenager martha moxley maggie grac...
1        ok so i really like kris kristofferson and his...
2         spoiler do not read this if you think about w...
3        hi for all the people who have seen this wonde...
4        i recently bought the dvd forgetting just how ...
5        leave it to braik to put on a good show finall...
6        nathan detroit frank sinatra is the manager of...
7        to understand crash course in the right contex...
8        i ve been impressed with chavez s stance again...
9        this movie is directed by renny harlin the fin...
10       i once lived in the u p and let me tell you wh...
11       hidden frontier is notable for being the longe...
12       it s a while ago that i have seen sleuth 1972 ...
13       what is it about the french first they apparen...
14       this very strange movie is unlike anything mad...
15       i saw this movie on the strength of the single...
16       there are some great philosophical questions w...
17       i was cast as the surfer dude in the beach sce...
18       i had high hopes for this one until they chang...
19       set in and near a poor working class town in t...
20       opulent sets and sumptuous costumes well photo...
21       i saw the film and i got screwed because the f...
22       i m getting a little tired of people misusing ...
23       how offensive those who liked this movie have ...
24       what else can you say about this movie except ...
25       certain aspects of punishment park are less th...
26       first of all i d like to tell you that i m int...
27       you should not take what i am about to say lig...
28       i love the jurassic park movies they are three...
29       the first series of lost kicked off with a ban...
                               ...                        
49970    tom fontana s unforgettable oz is hands down o...
49971    last weekend i bought this zombie movie from t...
49972    i watched the first few moments on tcm a few y...
49973    i saw this movie for the first time in 1988 wh...
49974    al pacino kim basinger tea leoni ryan o neal r...
49975    stanwyck at her villainous best robinson her e...
49976    an allegation of aggravated sexual assault alo...
49977    i thought this movie was wonderfully plotted i...
49978    just like most people i couldn t wait to see t...
49979    dark comedy gallows humor how does one make a ...
49980     probably will contain spoilers after a succes...
49981    i must be that one guy in america that didn t ...
49982    the plot a trucker kristofferson battles a cor...
49983    the ladies man is laugh out loud funny with a ...
49984    well the artyfartyrati of cannes may have like...
49985    the director was probably still in his early l...
49986    you know when you re on the bus and someone de...
49987    five minutes into this movie you realize that ...
49988     if you ve been laughing too much for a long t...
49989    i love dissing this movie my peers always try ...
49990    ok i think the tv show is kind of cute and it ...
49991    big disappointment clash by night is much to t...
49992    cassidy kacia brady puts a gun in her mouth bl...
49993    with rapid intercutting of scenes of insane pe...
49994    when girlfight came out the reviews praised it...
49995    ok lets start with the best the building altho...
49996    the british heritage film industry is out of c...
49997    i don t even know where to begin on this one i...
49998    richard tyler is a little boy who is scared of...
49999    i waited long to watch this movie also because...
Name: review, Length: 50000, dtype: object

문서를 토큰으로 나누기¶

In [22]:

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [23]:

tokenizer('runners like running and thus they run')

Out[23]:

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [24]:

tokenizer_porter('runners like running and thus they run')

Out[24]:

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [25]:

import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/haesun/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Out[25]:

True

In [26]:

from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
if w not in stop]

Out[26]:

['runner', 'like', 'run', 'run', 'lot']

문서 분류를 위한 로지스틱 회귀 모델 훈련하기¶

In [27]:

X_train = df.loc[:1500, 'review'].values
y_train = df.loc[:1500, 'sentiment'].values
X_test = df.loc[1500:, 'review'].values
y_test = df.loc[1500:, 'sentiment'].values

In [28]:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear', random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=1)

n_jobs 매개변수에 대하여

앞의 코드 예제에서 컴퓨터에 있는 모든 CPU 코어를 사용해 그리드 서치의 속도를 높이려면 (n_jobs=1 대신) n_jobs=-1로 지정하는 것이 좋습니다. 일부 시스템에서는 멀티프로세싱을 위해 n_jobs=-1로 지정할 때 tokenizer 와 tokenizer_porter 함수의 직렬화에 문제가 발생할 수 있습니다. 이런 경우 [tokenizer, tokenizer_porter]를 [str.split]로 바꾸어 문제를 해결할 수 있습니다. 다만 str.split로 바꾸면 어간 추출을 하지 못합니다.

코드 실행 시간에 대하여

다음 코드 셀을 실행하면 시스템에 따라 30~60분 정도 걸릴 수 있습니다. 매개변수 그리드에서 정의한 대로 22235 + 22235 = 240개의 모델을 훈련하기 때문입니다.

너무 오래 기다리기 어렵다면 데이터셋의 훈련 샘플의 수를 다음처럼 줄일 수 있습니다:

X_train = df.loc[:2500, 'review'].values
y_train = df.loc[:2500, 'sentiment'].values

훈련 세트 크기를 줄이는 것은 모델 성능을 감소시킵니다. 그리드에 지정한 매개변수를 삭제하면 훈련한 모델 수를 줄일 수 있습니다. 예를 들면 다음과 같습니다:

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0]},
              ]

In [29]:

gs_lr_tfidf.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

Fitting 5 folds for each of 48 candidates, totalling 240 fits

/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 236.0min finished

Out[29]:

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...nalty='l2', random_state=0, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [30]:

print('최적의 매개변수 조합: %s ' % gs_lr_tfidf.best_params_)
print('CV 정확도: %.3f' % gs_lr_tfidf.best_score_)

최적의 매개변수 조합: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f4387eb0950>} 
CV 정확도: 0.897

In [31]:

clf = gs_lr_tfidf.best_estimator_
print('테스트 정확도: %.3f' % clf.score(X_test, y_test))

테스트 정확도: 0.899

대용량 데이터 처리-온라인 알고리즘과 외부 메모리 학습¶

In [30]:

# 이 셀의 코드는 책에 포함되어 있지 않습니다. This cell is not contained in the book but
# 이전 코드를 실행하지 않고 바로 시작할 수 있도록 편의를 위해 추가했습니다.
import os, gzip
if not os.path.isfile('movie_data.csv'):
    if not os.path.isfile('movie_data.csv.gz'):
        print('Please place a copy of the movie_data.csv.gz'
              'in this directory. You can obtain it by'
              'a) executing the code in the beginning of this'
              'notebook or b) by downloading it from GitHub:'
              'https://github.com/rasbt/python-machine-learning-'
              'book-2nd-edition/blob/master/code/ch08/movie_data.csv.gz')
    else:
        in_f = gzip.open('movie_data.csv.gz', 'rb')
        out_f = open('movie_data.csv', 'wb')
        out_f.write(in_f.read())
        in_f.close()
        out_f.close()

In [31]:

import numpy as np
import re
from nltk.corpus import stopwords

# `stop` 객체를 앞에서 정의했지만 이전 코드를 실행하지 않고
# 편의상 여기에서부터 코드를 실행하기 위해 다시 만듭니다.
stop = stopwords.words('english')

def tokenizer(text):
    text      = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text      = re.sub('[\W]+', ' ', text.lower()) +\
                    ' '.join(emoticons).replace('-', '')
    tokenized = [w     for w in text.split()    if w not in stop]
    return tokenized

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # 헤더 넘기기
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

next(stream_docs(path='movie_data.csv'))

Out[31]:

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich family used their influence to cover the murder for more than twenty years. However, a snoopy detective and convicted perjurer in disgrace was able to disclose how the hideous crime was committed. The screenplay shows the investigation of Mark and the last days of Martha in parallel, but there is a lack of the emotion in the dramatization. My vote is seven.<br /><br />Title (Brazil): Not Available"',
 1)

In [36]:

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass
    return docs, y

In [37]:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error = 'ignore', 
                         n_features   = 2**21,
                         preprocessor = None, 
                         tokenizer    = tokenizer)

In [38]:

from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

clf = SGDClassifier(loss='log', random_state=1, max_iter=1)

doc_stream = stream_docs(path='movie_data.csv')

In [35]:

import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

/home/markbaum/Python/python/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.
  FutureWarning)
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:23

In [39]:

X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('정확도: %.3f' % clf.score(X_test, y_test))

정확도: 0.867

In [40]:

clf = clf.partial_fit(X_test, y_test)

토픽 모델링¶

사이킷런의 LDA¶

In [1]:

import pandas as pd
df = pd.read_csv('data/movie_data.csv', encoding='utf-8')
df.head(3)

Out[1]:

	review	sentiment
0	In 1974, the teenager Martha Moxley (Maggie Gr...	1
1	OK... so... I really like Kris Kristofferson a...	0
2	*SPOILER* Do not read this, if you think a...	0

In [2]:

%%time
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)

CPU times: user 7.4 s, sys: 335 ms, total: 7.73 s
Wall time: 7.45 s

In [5]:

%%time
from sklearn.decomposition import LatentDirichletAllocation
#       LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch',
                                n_jobs=-1) # 4분 11초로 병렬의 효과가 별로 없음 ..ㅠㅠ
X_topics = lda.fit_transform(X)
lda.components_.shape

CPU times: user 4.01 s, sys: 216 ms, total: 4.22 s
Wall time: 4min 11s

In [4]:

n_top_words = 5
feature_names = count.get_feature_names()

for topic_idx, topic in enumerate(lda.components_):
    print("토픽 %d:" % (topic_idx + 1))
    print(" ".join([feature_names[i]
                    for i in topic.argsort()\
                        [:-n_top_words - 1:-1]]))

토픽 1:
worst minutes awful script stupid
토픽 2:
family mother father children girl
토픽 3:
american war dvd music tv
토픽 4:
human audience cinema art sense
토픽 5:
police guy car dead murder
토픽 6:
horror house sex girl woman
토픽 7:
role performance comedy actor performances
토픽 8:
series episode war episodes tv
토픽 9:
book version original read novel
토픽 10:
action fight guy guys cool

각 토픽에서 가장 중요한 단어 다섯 개를 기반으로 LDA가 다음 토픽을 구별했다고 추측할 수 있습니다.

대체적으로 형편없는 영화(실제 토픽 카테고리가 되지 못함)
가족 영화
전쟁 영화
예술 영화
범죄 영화
공포 영화
코미디 영화
TV 쇼와 관련된 영화
소설을 원작으로 한 영화
액션 영화

카테고리가 잘 선택됐는지 확인하기 위해 공포 영화 카테고리에서 3개 영화의 리뷰를 출력해 보죠(공포 영화는 카테고리 6이므로 인덱스는 5입니다):

In [46]:

horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\n공포 영화 #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')

공포 영화 #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

공포 영화 #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

공포 영화 #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...

앞 코드에서 공포 영화 카테고리 중 최상위 3개의 리뷰에서 300자씩 출력했습니다. 정확히 어떤 영화에 속한 리뷰인지는 모르지만 공포 영화의 리뷰임을 알 수 있습니다.