Notebook

8장. 감성 분석에 머신 러닝 적용하기¶

아래 링크를 통해 이 노트북을 주피터 노트북 뷰어(nbviewer.jupyter.org)로 보거나 구글 코랩(colab.research.google.com)에서 실행할 수 있습니다.

watermark는 주피터 노트북에 사용하는 파이썬 패키지를 출력하기 위한 유틸리티입니다. watermark 패키지를 설치하려면 다음 셀의 주석을 제거한 뒤 실행하세요.

In [1]:

#!pip install watermark

In [2]:

%load_ext watermark
%watermark -u -d -v -p numpy,pandas,sklearn,nltk

last updated: 2020-05-22 

CPython 3.7.3
IPython 7.5.0

numpy 1.18.4
pandas 1.0.3
sklearn 0.23.1
nltk 3.4.1

텍스트 처리용 IMDb 영화 리뷰 데이터 준비¶

영화 리뷰 데이터셋 구하기¶

IMDB 영화 리뷰 데이터셋은 http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz에서 내려받을 수 있습니다. 다운로드된 후 파일 압축을 해제합니다.

A) 리눅스(Linux)나 macOS를 사용하면 새로운 터미널(Terminal) 윈도우를 열고 cd 명령으로 다운로드 디렉터리로 이동하여 다음 명령을 실행하세요.

tar -zxf aclImdb_v1.tar.gz

B) 윈도(Windows)를 사용하면 7Zip(http://www.7-zip.org) 같은 무료 압축 유틸리티를 설치하여 다운로드한 파일의 압축을 풀 수 있습니다.

코랩이나 리눅스에서 직접 다운로드하려면 다음 셀의 주석을 제거하고 실행하세요.

In [3]:

#!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

다음처럼 파이썬에서 직접 압축을 풀 수도 있습니다:

In [4]:

import os
import tarfile

if not os.path.isdir('aclImdb'):

    with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
        tar.extractall()

영화 리뷰 데이터셋을 더 간편한 형태로 전처리하기¶

pyprind는 주피터 노트북에서 진행바를 출력하기 위한 유틸리티입니다. pyprind 패키지를 설치하려면 다음 셀의 주석을 제거한 뒤 실행하세요.

In [5]:

#!pip install pyprind

In [6]:

import pyprind
import pandas as pd
import os

# `basepath`를 압축 해제된 영화 리뷰 데이터셋이 있는
# 디렉토리로 바꾸세요

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]], 
                           ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:21

데이터프레임을 섞습니다:

In [7]:

import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

선택사항: 만들어진 데이터를 CSV 파일로 저장합니다:

In [8]:

df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [9]:

import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

Out[9]:

	review	sentiment
0	In 1974, the teenager Martha Moxley (Maggie Gr...	1
1	OK... so... I really like Kris Kristofferson a...	0
2	*SPOILER* Do not read this, if you think a...	0

In [10]:

df.shape

Out[10]:

(50000, 2)

BoW 모델 소개¶

단어를 특성 벡터로 변환하기¶

CountVectorizer의 fit_transform 메서드를 호출하여 BoW 모델의 어휘사전을 만들고 다음 세 문장을 희소한 특성 벡터로 변환합니다:

The sun is shining
The weather is sweet
The sun is shining, the weather is sweet, and one and one is two

In [11]:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

어휘 사전의 내용을 출력해 보면 BoW 모델의 개념을 이해하는 데 도움이 됩니다:

In [12]:

print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}

이전 결과에서 볼 수 있듯이 어휘 사전은 고유 단어와 정수 인덱스가 매핑된 파이썬 딕셔너리에 저장되어 있습니다. 그다음 만들어진 특성 벡터를 출력해 봅시다:

특성 벡터의 각 인덱스는 CountVectorizer의 어휘 사전 딕셔너리에 저장된 정수 값에 해당됩니다. 예를 들어 인덱스 0에 있는 첫 번째 특성은 ‘and’ 단어의 카운트를 의미합니다. 이 단어는 마지막 문서에만 나타나네요. 인덱스 1에 있는 (특성 벡터의 두 번째 열) 단어 ‘is’는 세 문장에 모두 등장합니다. 특성 벡터의 이런 값들을 단어 빈도(term frequency) 라고도 부릅니다. 문서 d에 등장한 단어 t의 횟수를 tf (t,d)와 같이 씁니다.

In [13]:

print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]

tf-idf를 사용해 단어 적합성 평가하기¶

In [14]:

np.set_printoptions(precision=2)

텍스트 데이터를 분석할 때 클래스 레이블이 다른 문서에 같은 단어들이 나타나는 경우를 종종 보게 됩니다. 일반적으로 자주 등장하는 단어는 유용하거나 판별에 필요한 정보를 가지고 있지 않습니다. 이 절에서 특성 벡터에서 자주 등장하는 단어의 가중치를 낮추는 기법인 tf-idf(term frequency-inverse document frequency)에 대해 배우겠습니다. tf-idf는 단어 빈도와 역문서 빈도(inverse document frequency)의 곱으로 정의됩니다:

$$\text{tf-idf}(t,d)=\text{tf (t,d)}\times \text{idf}(t,d)$$

여기에서 tf(t, d)는 이전 절에서 보았던 단어 빈도입니다. idf(t, d)는 역문서 빈도로 다음과 같이 계산합니다:

$$\text{idf}(t,d) = \text{log}\frac{n_d}{1+\text{df}(d, t)},$$

여기에서 $n_d$는 전체 문서 개수이고 df(d, t)는 단어 t가 포함된 문서 d의 개수입니다. 분모에 상수 1을 추가하는 것은 선택 사항입니다. 훈련 샘플에 한 번도 등장하지 않는 단어가 있는 경우 분모가 0이 되지 않게 만듭니다. log는 문서 빈도 df(d, t)가 낮을 때 역문서 빈도 값이 너무 커지지 않도록 만듭니다.

사이킷런 라이브러리에는 CountVectorizer 클래스에서 만든 단어 빈도를 입력받아 tf-idf로 변환하는 TfidfTransformer 클래스가 구현되어 있습니다:

In [15]:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]

이전 절에서 보았듯이 세 번째 문서에서 단어 ‘is’가 가장 많이 나타났기 때문에 단어 빈도가 가장 컸습니다. 동일한 특성 벡터를 tf-idf로 변환하면 단어 ‘is’는 비교적 작은 tf-idf를 가집니다(0.45). 이 단어는 첫 번째와 두 번째 문서에도 나타나므로 판별에 유용한 정보를 가지고 있지 않을 것입니다.

수동으로 특성 벡터에 있는 각 단어의 tf-idf를 계산해 보면 TfidfTransformer가 앞서 정의한 표준 tf-idf 공식과 조금 다르게 계산한다는 것을 알 수 있습니다. 사이킷런에 구현된 역문서 빈도 공식은 다음과 같습니다.

$$\text{idf} (t,d) = log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

비슷하게 사이킷런에서 계산하는 tf-idf는 앞서 정의한 공식과 조금 다릅니다:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

일반적으로 tf-idf를 계산하기 전에 단어 빈도(tf)를 정규화하지만 TfidfTransformer 클래스는 tf-idf를 직접 정규화합니다. 사이킷런의 TfidfTransformer는 기본적으로 L2 정규화를 적용합니다(norm=’l2’). 정규화되지 않은 특성 벡터 v를 L2-노름으로 나누면 길이가 1인 벡터가 반환됩니다:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

TfidfTransformer의 작동 원리를 이해하기 위해 세 번째 문서에 있는 단어 ‘is'의 tf-idf를 예로 들어 계산해 보죠.

세 번째 문서에서 단어 ‘is’의 단어 빈도는 3입니다(tf=3). 이 단어는 세 개 문서에 모두 나타나기 때문에 문서 빈도가 3입니다(df=3). 따라서 역문서 빈도는 다음과 같이 계산됩니다:

$$\text{idf}("is", d3) = log \frac{1+3}{1+3} = 0$$

이제 tf-idf를 계산하기 위해 역문서 빈도에 1을 더하고 단어 빈도를 곱합니다:

$$\text{tf-idf}("is",d3)= 3 \times (0+1) = 3$$

In [16]:

tf_is = 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)

tf-idf of term "is" = 3.00

세 번째 문서에 있는 모든 단어에 대해 이런 계산을 반복하면 tf-idf 벡터 [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]를 얻습니다. 이 특성 벡터의 값은 앞서 사용했던 TfidfTransformer에서 얻은 값과 다릅니다. tf-idf 계산에서 빠트린 마지막 단계는 다음과 같은 L2-정규화입니다::

$$\text{tfi-df}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]}{\sqrt{[3.39^2, 3.0^2, 3.39^2, 1.29^2, 1.29^2, 1.29^2, 2.0^2 , 1.69^2, 1.29^2]}}$$$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$$$\Rightarrow \text{tfi-df}_{norm}("is", d3) = 0.45$$

결과에서 보듯이 사이킷런의 TfidfTransformer에서 반환된 결과와 같아졌습니다. tf-idf 계산 방법을 이해했으므로 다음 절로 넘어가 이 개념을 영화 리뷰 데이터셋에 적용해 보죠.

In [17]:

tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf 

Out[17]:

array([3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29])

In [18]:

l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

Out[18]:

array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])

텍스트 데이터 정제¶

In [19]:

df.loc[0, 'review'][-50:]

Out[19]:

'is seven.<br /><br />Title (Brazil): Not Available'

In [20]:

import re
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = (re.sub('[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

In [21]:

preprocessor(df.loc[0, 'review'][-50:])

Out[21]:

'is seven title brazil not available'

In [22]:

preprocessor("</a>This :) is :( a test :-)!")

Out[22]:

'this is a test :) :( :)'

In [23]:

df['review'] = df['review'].apply(preprocessor)

In [24]:

df['review'].map(preprocessor)

Out[24]:

0        in 1974 the teenager martha moxley maggie grac...
1        ok so i really like kris kristofferson and his...
2         spoiler do not read this if you think about w...
3        hi for all the people who have seen this wonde...
4        i recently bought the dvd forgetting just how ...
                               ...                        
49995    ok lets start with the best the building altho...
49996    the british heritage film industry is out of c...
49997    i don t even know where to begin on this one i...
49998    richard tyler is a little boy who is scared of...
49999    i waited long to watch this movie also because...
Name: review, Length: 50000, dtype: object

문서를 토큰으로 나누기¶

In [25]:

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [26]:

tokenizer('runners like running and thus they run')

Out[26]:

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [27]:

tokenizer_porter('runners like running and thus they run')

Out[27]:

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [28]:

import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/haesun/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Out[28]:

True

In [29]:

from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
if w not in stop]

Out[29]:

['runner', 'like', 'run', 'run', 'lot']

문서 분류를 위한 로지스틱 회귀 모델 훈련하기¶

In [30]:

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

In [31]:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear', random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=1)

n_jobs 매개변수에 대하여

앞의 코드 예제에서 컴퓨터에 있는 모든 CPU 코어를 사용해 그리드 서치의 속도를 높이려면 (n_jobs=1 대신) n_jobs=-1로 지정하는 것이 좋습니다. 일부 시스템에서는 멀티프로세싱을 위해 n_jobs=-1로 지정할 때 tokenizer 와 tokenizer_porter 함수의 직렬화에 문제가 발생할 수 있습니다. 이런 경우 [tokenizer, tokenizer_porter]를 [str.split]로 바꾸어 문제를 해결할 수 있습니다. 다만 str.split로 바꾸면 어간 추출을 하지 못합니다.

코드 실행 시간에 대하여

다음 코드 셀을 실행하면 시스템에 따라 30~60분 정도 걸릴 수 있습니다. 매개변수 그리드에서 정의한 대로 22235 + 22235 = 240개의 모델을 훈련하기 때문입니다.

코랩을 사용할 경우에도 CPU 코어가 많지 않기 때문에 실행 시간이 오래 걸릴 수 있습니다.

너무 오래 기다리기 어렵다면 데이터셋의 훈련 샘플의 수를 다음처럼 줄일 수 있습니다:

X_train = df.loc[:2500, 'review'].values
y_train = df.loc[:2500, 'sentiment'].values

훈련 세트 크기를 줄이는 것은 모델 성능을 감소시킵니다. 그리드에 지정한 매개변수를 삭제하면 훈련한 모델 수를 줄일 수 있습니다. 예를 들면 다음과 같습니다:

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0]},
              ]

In [32]:

gs_lr_tfidf.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

Fitting 5 folds for each of 48 candidates, totalling 240 fits

/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
/home/haesun/anaconda3/envs/python-ml/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:386: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', "it'", 'onc', 'onli', 'ourselv', "she'", "should'v", 'themselv', 'thi', 'veri', 'wa', 'whi', "you'r", "you'v", 'yourselv'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 145.7min finished

Out[32]:

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(lowercase=False)),
                                       ('clf',
                                        LogisticRegression(random_state=0,
                                                           solver='liblinear'))]),
             n_jobs=1,
             param_grid=[{'clf__C': [1.0, 10.0, 100.0],
                          'clf__penalty': ['l1', 'l2'],
                          'vect__ngram_range': [(1, 1)],
                          'vect__stop_words': [['i', 'me', 'my', 'myself', 'we',
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've"...
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've",
                                                "you'll", "you'd", 'your',
                                                'yours', 'yourself',
                                                'yourselves', 'he', 'him',
                                                'his', 'himself', 'she',
                                                "she's", 'her', 'hers',
                                                'herself', 'it', "it's", 'its',
                                                'itself', ...],
                                               None],
                          'vect__tokenizer': [<function tokenizer at 0x7fc16c4e9730>,
                                              <function tokenizer_porter at 0x7fc16c4e97b8>],
                          'vect__use_idf': [False]}],
             scoring='accuracy', verbose=1)

In [33]:

print('최적의 매개변수 조합: %s ' % gs_lr_tfidf.best_params_)
print('CV 정확도: %.3f' % gs_lr_tfidf.best_score_)

최적의 매개변수 조합: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7fc16c4e9730>} 
CV 정확도: 0.897

In [34]:

clf = gs_lr_tfidf.best_estimator_
print('테스트 정확도: %.3f' % clf.score(X_test, y_test))

테스트 정확도: 0.899

대용량 데이터 처리-온라인 알고리즘과 외부 메모리 학습¶

In [35]:

# 이 셀의 코드는 책에 포함되어 있지 않습니다. This cell is not contained in the book but
# 이전 코드를 실행하지 않고 바로 시작할 수 있도록 편의를 위해 추가했습니다.

import os
import gzip


if not os.path.isfile('movie_data.csv'):
    if not os.path.isfile('movie_data.csv.gz'):
        print('Please place a copy of the movie_data.csv.gz'
              'in this directory. You can obtain it by'
              'a) executing the code in the beginning of this'
              'notebook or b) by downloading it from GitHub:'
              'https://github.com/rasbt/python-machine-learning-'
              'book-2nd-edition/blob/master/code/ch08/movie_data.csv.gz')
    else:
        in_f = gzip.open('movie_data.csv.gz', 'rb')
        out_f = open('movie_data.csv', 'wb')
        out_f.write(in_f.read())
        in_f.close()
        out_f.close()

In [36]:

import numpy as np
import re
from nltk.corpus import stopwords


# `stop` 객체를 앞에서 정의했지만 이전 코드를 실행하지 않고
# 편의상 여기에서부터 코드를 실행하기 위해 다시 만듭니다.
stop = stopwords.words('english')


def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # 헤더 넘기기
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [37]:

next(stream_docs(path='movie_data.csv'))

Out[37]:

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich family used their influence to cover the murder for more than twenty years. However, a snoopy detective and convicted perjurer in disgrace was able to disclose how the hideous crime was committed. The screenplay shows the investigation of Mark and the last days of Martha in parallel, but there is a lack of the emotion in the dramatization. My vote is seven.<br /><br />Title (Brazil): Not Available"',
 1)

In [38]:

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass
    return docs, y

In [39]:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

In [40]:

from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

clf = SGDClassifier(loss='log', random_state=1, max_iter=1)

doc_stream = stream_docs(path='movie_data.csv')

In [41]:

import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:21

In [42]:

X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('정확도: %.3f' % clf.score(X_test, y_test))

정확도: 0.868

In [43]:

clf = clf.partial_fit(X_test, y_test)

토픽 모델링¶

사이킷런의 LDA¶

In [44]:

import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

Out[44]:

	review	sentiment
0	In 1974, the teenager Martha Moxley (Maggie Gr...	1
1	OK... so... I really like Kris Kristofferson a...	0
2	*SPOILER* Do not read this, if you think a...	0

In [45]:

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)

In [46]:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

In [47]:

lda.components_.shape

Out[47]:

(10, 5000)

In [48]:

n_top_words = 5
feature_names = count.get_feature_names()

for topic_idx, topic in enumerate(lda.components_):
    print("토픽 %d:" % (topic_idx + 1))
    print(" ".join([feature_names[i]
                    for i in topic.argsort()\
                        [:-n_top_words - 1:-1]]))

토픽 1:
worst minutes awful comedy money
토픽 2:
sex women woman men black
토픽 3:
comedy musical music song dance
토픽 4:
family father mother wife son
토픽 5:
war book american documentary japanese
토픽 6:
episode guy series house girl
토픽 7:
role performance actor john book
토픽 8:
war music beautiful cinema history
토픽 9:
horror budget gore killer effects
토픽 10:
action original series animation disney

각 토픽에서 가장 중요한 단어 다섯 개를 기반으로 LDA가 다음 토픽을 구별했다고 추측할 수 있습니다.

대체적으로 형편없는 영화(실제 토픽 카테고리가 되지 못함)
가족 영화
전쟁 영화
예술 영화
범죄 영화
공포 영화
코미디 영화
TV 쇼와 관련된 영화
소설을 원작으로 한 영화
액션 영화

카테고리가 잘 선택됐는지 확인하기 위해 공포 영화 카테고리에서 3개 영화의 리뷰를 출력해 보죠(공포 영화는 카테고리 6이므로 인덱스는 5입니다):

In [49]:

horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\n공포 영화 #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')

공포 영화 #1:
This is one of the funniest movies I've ever saw. A 14-year old boy Jason Shepherd wrote an English paper called "Big Fat Liar". When his skateboard was taken, he had to use his sister's bike to get to the college on time and he hit a limo. When he went into the limo, he met a famous producer from H ...

공포 영화 #2:
Where to start? Some guy has some Indian pot that he's cleaning, and suddenly Skeletor attacks. He hits a woman in the neck with an axe, she falls down, but then gets up and is apparently uninjured. She runs into the woods, and it turns out there's the basement of a shopping center out there in the  ...

공포 영화 #3:
***SPOILERS*** ***SPOILERS*** Some bunch of Afrikkaner-Hillbilly types are out in the desert looking for Diamonds when they find a hard mound in the middle of a sandy desert area. Spoilers: The dumbest one starts hitting the mound with a pick, and cracks it open. Then he looks into the hole and stic ...

앞 코드에서 공포 영화 카테고리 중 최상위 3개의 리뷰에서 300자씩 출력했습니다. 정확히 어떤 영화에 속한 리뷰인지는 모르지만 공포 영화의 리뷰임을 알 수 있습니다.