Notebook

Deep Learning with Python ３章勉強ノート（その１）¶

ManningのMEAP（出版前の本のできた部分から読めるサービス）でTensorFlowとTheanoの両方に対応したラッパーツールkerasの作者Francois Chollet氏の「Deep Learning with Python」が発売されました。

３章では、機械学習でよく使われる以下の手法をニューラルネットワークを使って実現する方法を紹介しています。

二値分類（binary classification）
マルチクラス分類(multi-class classification)
線形回帰(regression)

ここでは、最初の二値分類の例題をSageのjupyterノートブックで試しながら、Deep Learningへの理解を深めたいと思います。

kerasのインストール方法¶

本文ではkerasのインストール方法が詳しく紹介されていますが、GPUを使わないkerasをdockerのイメージにインストール済みです。 takepwave/sagemathのdockerイメージを使えば、すぐに二値分類を試すことができます。

前準備¶

jupyterノートブックで使用するnumpy、pandasとmatplotlibをインポートします。

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

%matplotlib inline

kerasのパッケージのインポート¶

二値分類では、Sequential, Dense, to_categoricalを使用するため、最初にインポートしておきます。

In [2]:

from keras.models import Sequential
from keras.layers import Dense
from keras.utils.np_utils import to_categorical

Using TensorFlow backend.

使用するデータについて¶

kerasにいくつかのデータセットが組み込まれています。この内、二値分類ではIMDBのデータセットを使用します。

IMDBは、映画に対する感想を肯定的と否定的にラベル付けされたデータセットです。50,000のIMDB映画レビューのデータセット（単語をインデックス変換して加工済み）をkeras.datasets.imdbパッケージのload_data関数を使って取り込みます。25,000が学習用（train_data）、残りの25,000がテスト用（test_data）として使用します。

nb_wordsで使用する単語数を良く出てくる上位10000語に限定して、分析を行います。

In [3]:

from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(nb_words=10000)

train_dataの最初のデータを見てみましょう。型はリストで、長さが218、リストの内容は単語のインデックスが整数値で入っています。

In [4]:

print len(train_data), len(test_data)
print type(train_data[0]), len(train_data[0])
np.array(train_data[0]).flatten()

25000 25000
<type 'list'> 218

Out[4]:

array([   1,   14,   22,   16,   43,  530,  973, 1622, 1385,   65,  458,
       4468,   66, 3941,    4,  173,   36,  256,    5,   25,  100,   43,
        838,  112,   50,  670,    2,    9,   35,  480,  284,    5,  150,
          4,  172,  112,  167,    2,  336,  385,   39,    4,  172, 4536,
       1111,   17,  546,   38,   13,  447,    4,  192,   50,   16,    6,
        147, 2025,   19,   14,   22,    4, 1920, 4613,  469,    4,   22,
         71,   87,   12,   16,   43,  530,   38,   76,   15,   13, 1247,
          4,   22,   17,  515,   17,   12,   16,  626,   18,    2,    5,
         62,  386,   12,    8,  316,    8,  106,    5,    4, 2223, 5244,
         16,  480,   66, 3785,   33,    4,  130,   12,   16,   38,  619,
          5,   25,  124,   51,   36,  135,   48,   25, 1415,   33,    6,
         22,   12,  215,   28,   77,   52,    5,   14,  407,   16,   82,
          2,    8,    4,  107,  117, 5952,   15,  256,    4,    2,    7,
       3766,    5,  723,   36,   71,   43,  530,  476,   26,  400,  317,
         46,    7,    4,    2, 1029,   13,  104,   88,    4,  381,   15,
        297,   98,   32, 2071,   56,   26,  141,    6,  194, 7486,   18,
          4,  226,   22,   21,  134,  476,   26,  480,    5,  144,   30,
       5535,   18,   51,   36,   28,  224,   92,   25,  104,    4,  226,
         65,   16,   38, 1334,   88,   12,   16,  283,    5,   16, 4472,
        113,  103,   32,   15,   16, 5345,   19,  178,   32])

ラベルには、肯定の場合1、否定の場合0がセットされています。またインデックスの最大値が9999ですから10000個の単語用のインデックスが入っていることも確認できました。

In [5]:

print train_labels[0:10]

[1 0 0 1 0 0 1 0 1 0]

In [6]:

max([max(sequence) for sequence in train_data])

Out[6]:

単語インデックス¶

どのような単語が映画のレビューに含まれていたかを確認するために、単語インデックスもロードしておきます。

In [7]:

# word_index is a dictionary mapping words to an integer indice
word_index = imdb.get_word_index()
# we reverse it, mapping integer indices to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# we decode the review; note that our indices were offset by 3
# because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])

最初のレビューの単語を表示みると、以下のようになりました。

In [8]:

print [reverse_word_index[idx] for idx in train_data[0]]

['the', 'as', 'you', 'with', 'out', 'themselves', 'powerful', 'lets', 'loves', 'their', 'becomes', 'reaching', 'had', 'journalist', 'of', 'lot', 'from', 'anyone', 'to', 'have', 'after', 'out', 'atmosphere', 'never', 'more', 'room', 'and', 'it', 'so', 'heart', 'shows', 'to', 'years', 'of', 'every', 'never', 'going', 'and', 'help', 'moments', 'or', 'of', 'every', 'chest', 'visual', 'movie', 'except', 'her', 'was', 'several', 'of', 'enough', 'more', 'with', 'is', 'now', 'current', 'film', 'as', 'you', 'of', 'mine', 'potentially', 'unfortunately', 'of', 'you', 'than', 'him', 'that', 'with', 'out', 'themselves', 'her', 'get', 'for', 'was', 'camp', 'of', 'you', 'movie', 'sometimes', 'movie', 'that', 'with', 'scary', 'but', 'and', 'to', 'story', 'wonderful', 'that', 'in', 'seeing', 'in', 'character', 'to', 'of', '70s', 'musicians', 'with', 'heart', 'had', 'shadows', 'they', 'of', 'here', 'that', 'with', 'her', 'serious', 'to', 'have', 'does', 'when', 'from', 'why', 'what', 'have', 'critics', 'they', 'is', 'you', 'that', "isn't", 'one', 'will', 'very', 'to', 'as', 'itself', 'with', 'other', 'and', 'in', 'of', 'seen', 'over', 'landed', 'for', 'anyone', 'of', 'and', 'br', "show's", 'to', 'whether', 'from', 'than', 'out', 'themselves', 'history', 'he', 'name', 'half', 'some', 'br', 'of', 'and', 'odd', 'was', 'two', 'most', 'of', 'mean', 'for', '1', 'any', 'an', 'boat', 'she', 'he', 'should', 'is', 'thought', 'frog', 'but', 'of', 'script', 'you', 'not', 'while', 'history', 'he', 'heart', 'to', 'real', 'at', 'barrel', 'but', 'when', 'from', 'one', 'bit', 'then', 'have', 'two', 'of', 'script', 'their', 'with', 'her', 'nobody', 'most', 'that', 'with', "wasn't", 'to', 'with', 'armed', 'acting', 'watch', 'an', 'for', 'with', 'heartfelt', 'film', 'want', 'an']

二値分類を試す¶

レビューに含まれている単語のインデックスを、インデックスが含まれている場合１、それ以外は０の長さ10000のベクトルに変えます。それをすべてのレビューに適応してレビュー数x１００００のマトリックスを作ります。

kerasへの入力データは、astype関数でfloat32の実数に変換します。

In [9]:

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    # create an all-zero matrix of shape (len(sequences), dimension)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results
# our vectorized training data
x_train = vectorize_sequences(train_data)
# our vectorized test data
x_test = vectorize_sequences(test_data)

In [10]:

# our vectorized labels:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

モデルの定義¶

モデルは3層から成り、最初の2層に16個の隠しユニットを設け、活性化関数をReLuとし、最後に二値化します。

In [11]:

model = Sequential()
model.add(Dense(16, activation='relu', input_dim=10000))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [12]:

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

モデルができたので、２５，０００のデータの内１０，０００を検証用に使用します。

In [13]:

x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

モデルを学習用データで訓練します。最初のエポックには時間が掛かりますが、それ以降は短時間で計算します。

In [14]:

history = model.fit(partial_x_train, partial_y_train, nb_epoch=20, batch_size=512, validation_data=(x_val, y_val))

Train on 15000 samples, validate on 10000 samples
Epoch 1/20
15000/15000 [==============================] - 124s - loss: 0.5038 - acc: 0.7877 - val_loss: 0.3864 - val_acc: 0.8678
Epoch 2/20
15000/15000 [==============================] - 2s - loss: 0.3013 - acc: 0.9051 - val_loss: 0.3278 - val_acc: 0.8730
Epoch 3/20
15000/15000 [==============================] - 2s - loss: 0.2232 - acc: 0.9259 - val_loss: 0.2768 - val_acc: 0.8940
Epoch 4/20
15000/15000 [==============================] - 2s - loss: 0.1754 - acc: 0.9455 - val_loss: 0.2729 - val_acc: 0.8920
Epoch 5/20
15000/15000 [==============================] - 2s - loss: 0.1450 - acc: 0.9543 - val_loss: 0.2801 - val_acc: 0.8890
Epoch 6/20
15000/15000 [==============================] - 2s - loss: 0.1166 - acc: 0.9648 - val_loss: 0.3092 - val_acc: 0.8833
Epoch 7/20
15000/15000 [==============================] - 2s - loss: 0.1018 - acc: 0.9691 - val_loss: 0.3046 - val_acc: 0.8843
Epoch 8/20
15000/15000 [==============================] - 2s - loss: 0.0829 - acc: 0.9768 - val_loss: 0.3360 - val_acc: 0.8758
Epoch 9/20
15000/15000 [==============================] - 2s - loss: 0.0724 - acc: 0.9799 - val_loss: 0.3573 - val_acc: 0.8804
Epoch 10/20
15000/15000 [==============================] - 2s - loss: 0.0579 - acc: 0.9861 - val_loss: 0.3692 - val_acc: 0.8799
Epoch 11/20
15000/15000 [==============================] - 2s - loss: 0.0500 - acc: 0.9878 - val_loss: 0.3944 - val_acc: 0.8772
Epoch 12/20
15000/15000 [==============================] - 2s - loss: 0.0402 - acc: 0.9919 - val_loss: 0.4238 - val_acc: 0.8761
Epoch 13/20
15000/15000 [==============================] - 2s - loss: 0.0325 - acc: 0.9943 - val_loss: 0.4497 - val_acc: 0.8749
Epoch 14/20
15000/15000 [==============================] - 2s - loss: 0.0272 - acc: 0.9955 - val_loss: 0.4645 - val_acc: 0.8719
Epoch 15/20
15000/15000 [==============================] - 2s - loss: 0.0231 - acc: 0.9964 - val_loss: 0.5504 - val_acc: 0.8667
Epoch 16/20
15000/15000 [==============================] - 2s - loss: 0.0151 - acc: 0.9985 - val_loss: 0.5236 - val_acc: 0.8720
Epoch 17/20
15000/15000 [==============================] - 2s - loss: 0.0176 - acc: 0.9969 - val_loss: 0.5571 - val_acc: 0.8702
Epoch 18/20
15000/15000 [==============================] - 2s - loss: 0.0092 - acc: 0.9995 - val_loss: 0.6059 - val_acc: 0.8588
Epoch 19/20
15000/15000 [==============================] - 2s - loss: 0.0110 - acc: 0.9984 - val_loss: 0.6006 - val_acc: 0.8679
Epoch 20/20
15000/15000 [==============================] - 2s - loss: 0.0067 - acc: 0.9997 - val_loss: 0.6302 - val_acc: 0.8680

In [15]:

history_dict = history.history
history_dict.keys()

Out[15]:

['acc', 'loss', 'val_acc', 'val_loss']

結果の可視化¶

例題ではmatplotlibを使って収束の様子を可視化していますが、ここではpandasのグラフ機能を使って学習用データの損失関数の値lossと検証用データの損失関数の値val_lossをプロットしてみます。

グラフをカッコよく表示するために、seabornをインポートしています。

In [16]:

import seaborn
# history_dictをpandasnのデータフレームに変換
d = pd.DataFrame(history_dict)
# loss, val_lossをプロット
d[['loss', 'val_loss']].plot()
plt.show()

同様に精度（acc, val_acc）についてもプロットしてみます。

In [17]:

d[['acc', 'val_acc']].plot()
plt.show()

最適なモデル¶

loss, val_lossのグラフからエポック値が４を超えると検証用データの損失関数の値が大きくなり、過学習の様相を呈してきます。 acc, val_accからはエポック値が４でも88%程度の精度が確保できそうだと期待できます。

テスト用のデータでモデルを性能を検証するために、nb_epoch=4で計算したモデルを準備し、evaluateメソッドを使ってテスト用データについて肯定／否定の評価を予想してみましょう。

In [20]:

model = Sequential()

In [21]:

model.add(Dense(16, activation='relu', input_dim=10000))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, nb_epoch=4, batch_size=512)
results = model.evaluate(x_test, y_test)

Epoch 1/4
25000/25000 [==============================] - 107s - loss: 0.4709 - acc: 0.8180   
Epoch 2/4
25000/25000 [==============================] - 17s - loss: 0.2707 - acc: 0.9079    
Epoch 3/4
25000/25000 [==============================] - 3s - loss: 0.2057 - acc: 0.9286     
Epoch 4/4
25000/25000 [==============================] - 3s - loss: 0.1722 - acc: 0.9393     
25000/25000 [==============================] - 78s

In [22]:

results

Out[22]:

[0.28535971707344054, 0.88688]

芸術的な域では精度は９５％になることから、８８％の精度は、簡単な例としてはまあまあの値だと思います。

In [23]:

model.predict(x_test)

Out[23]:

array([[ 0.90993786],
       [ 0.90398455],
       [ 0.9996897 ],
       ..., 
       [ 0.52131325],
       [ 0.00504617],
       [ 0.75605333]], dtype=float32)

Scikit-Learnのロジスティック回帰で試す¶

同じ処理をScikit-Learnのロジスティック回帰（LogisticRegression）で計算してみましょう。

最初に、sklearn.linear_modelからLogisticRegressionをモデル（Model）としてインポートします。次に、modelを生成して、学習用データで訓練します。計算はkerasの場合よりも長く掛かります。

In [24]:

from sklearn.linear_model import LogisticRegression as Model

In [25]:

model = Model()
model.fit(x_train, y_train)

Out[25]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

計算の尺度は、metricsパッケージを使って求めます。

精度は、86%とニューラルネットと同等の結果となりました。

In [26]:

import sklearn.metrics as metrics

predicted = model.predict(x_test)
print metrics.confusion_matrix(y_test, predicted)
print metrics.accuracy_score(y_test, predicted)

[[10812  1688]
 [ 1761 10739]]
0.86204

In [ ]: