Principal Component Analysis (PCA)

Reducing the four-dimensional iris data to two dimensions

Loading the libraries and inspecting the data

In [92]:

# Load the numerical, visualization, and machine-learning libraries, then the iris dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [93]:

# Check the keys of the iris dataset
print(iris_dataset.keys())

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [94]:

# Check the species names in the iris dataset
print(iris_dataset['target_names'])

['setosa' 'versicolor' 'virginica']

In [95]:

# Check the feature names in the iris dataset
print(iris_dataset['feature_names'])

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

In [96]:

# Show the features as a pandas DataFrame
df = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
df.head()

Out[96]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

Standardizing the data (mean: 0, variance: 1)

In [100]:

# Standardize each feature
stdsc = StandardScaler()
iris_dataset_std = stdsc.fit_transform(iris_dataset.data)
dfs = pd.DataFrame(iris_dataset_std, columns=iris_dataset.feature_names)
dfs.head()

Out[100]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	-0.900681	1.019004	-1.340227	-1.315444
1	-1.143017	-0.131979	-1.340227	-1.315444
2	-1.385353	0.328414	-1.397064	-1.315444
3	-1.506521	0.098217	-1.283389	-1.315444
4	-1.021849	1.249201	-1.340227	-1.315444
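
As a quick sanity check, the standardization can be reproduced by hand. The sketch below assumes `StandardScaler`'s default settings, under which each column is centered on its mean and divided by its population standard deviation (`ddof=0`). The names `X_std`, `X_manual`, `col_means`, and `col_vars` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize with scikit-learn (default settings)
X_std = StandardScaler().fit_transform(X)

# Standardize by hand: subtract the column mean and divide by the
# population standard deviation (NumPy's default, ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# After standardization every column should have mean 0 and variance 1
col_means = X_std.mean(axis=0)
col_vars = X_std.var(axis=0)

print(np.allclose(X_std, X_manual))
print(np.allclose(col_means, 0.0), np.allclose(col_vars, 1.0))
```

Note that `dfs.var()` in pandas uses `ddof=1` by default, so it reports values slightly above 1 for the same data.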

Principal component analysis

In [101]:

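# Run the principal component analysis
pca = PCA()

# The outputs below correspond to the unstandardized measurements (df);
# pass dfs instead to run the analysis on the standardized data
features = df

# Fit the PCA and convert the dataset to principal component scores
transformed = pca.fit_transform(features)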

In [102]:

# Principal component scores
pd.DataFrame(transformed, columns=["PC{}".format(x + 1) for x in range(len(dfs.columns))]).head()

Out[102]:

	PC1	PC2	PC3	PC4
0	-2.684126	0.319397	-0.027915	-0.002262
1	-2.714142	-0.177001	-0.210464	-0.099027
2	-2.888991	-0.144949	0.017900	-0.019968
3	-2.745343	-0.318299	0.031559	0.075576
4	-2.728717	0.326755	0.090079	0.061259
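
The scores are the mean-centered data projected onto the principal axes. The following is a minimal sketch of that relationship, assuming a default `PCA()` (no whitening) fitted on the raw iris measurements. `pca_check`, `scores`, and `manual_scores` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca_check = PCA()
scores = pca_check.fit_transform(X)

# Each row of components_ is one principal axis (a unit vector in feature space).
# Projecting the centered data onto those axes gives the scores.
manual_scores = (X - pca_check.mean_) @ pca_check.components_.T

print(np.allclose(scores, manual_scores))
```

The sign of each axis is arbitrary, so individual columns may come out negated under a different scikit-learn version or solver without changing the analysis.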

In [103]:

# Plot the first principal component against the second, colored by species
plt.scatter(transformed[:, 0], transformed[:, 1], alpha=0.8, c=iris_dataset.target)

# Label and show the figure
plt.title('principal component')
plt.xlabel('pc1')
plt.ylabel('pc2')

plt.show()

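The data, which consists of four features, is reduced to two dimensions and plotted on a plane.
The colors indicate the species.
The x-axis is the first principal component (pc1).
The y-axis is the second principal component (pc2).

The plot suggests that the first two principal components (up to PC2) are likely enough to tell the three species apart.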

Checking the contribution rate and cumulative contribution rate

In [104]:

# Contribution rate (explained variance ratio)
pd.DataFrame(pca.explained_variance_ratio_, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))])

Out[104]:

	0
PC1	0.924619
PC2	0.053066
PC3	0.017103
PC4	0.005212

In [105]:

# Visualize the cumulative contribution rate
import matplotlib.ticker as ticker
plt.gca().get_xaxis().set_major_locator(ticker.MaxNLocator(integer=True))
plt.plot([0] + list( np.cumsum(pca.explained_variance_ratio_)), "-o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative contribution rate")
plt.grid()
plt.show()

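This shows that the first two principal components (up to PC2) account for about 97.8% of the variance in the original data (0.924619 + 0.053066).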
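
The cumulative contribution rate can also be used to choose how many components to keep. The sketch below assumes the raw iris measurements and a hypothetical threshold of 0.95; `threshold`, `n_needed`, and `X_reduced` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Cumulative contribution rate over all four components
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# Smallest number of components whose cumulative rate reaches the threshold
# (0.95 is an arbitrary value chosen for illustration)
threshold = 0.95
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# Refit, keeping only that many components
X_reduced = PCA(n_components=n_needed).fit_transform(X)

print(n_needed, X_reduced.shape)
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which selects the number of components needed to explain at least that share of the variance.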

Checking the eigenvalues and eigenvectors

In [106]:

# Eigenvalue of each principal component
pd.DataFrame(pca.explained_variance_, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))])

Out[106]:

	0
PC1	4.228242
PC2	0.242671
PC3	0.078210
PC4	0.023835

In [108]:

# Eigenvector of each principal component
pd.DataFrame(pca.components_, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))])

Out[108]:

	0	1	2	3
PC1	0.361387	-0.084523	0.856671	0.358289
PC2	0.656589	0.730161	-0.173373	-0.075481
PC3	-0.582030	0.597911	0.076236	0.545831
PC4	-0.315487	0.319723	0.479839	-0.753657
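
These eigenvalues and eigenvectors are those of the sample covariance matrix of the features. The following sketch checks this with NumPy, assuming a default `PCA()` fitted on the raw iris measurements. `pca_check`, `cov`, `eigvals`, and `eigvecs` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca_check = PCA().fit(X)

# Sample covariance matrix of the four features (np.cov divides by n - 1)
cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order, so reverse both outputs
# to match PCA's descending order
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

# Eigenvalues match explained_variance_
print(np.allclose(eigvals, pca_check.explained_variance_))

# Eigenvectors are defined only up to sign, so compare absolute values;
# eigh stores them as columns, components_ stores them as rows
print(np.allclose(np.abs(eigvecs.T), np.abs(pca_check.components_), atol=1e-6))
```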

Contribution of the observed variables

In [110]:

# Plot the contribution of each observed variable to the first and second principal components
for x, y, name in zip(pca.components_[0], pca.components_[1], dfs.columns):
    plt.text(x, y, name)
plt.scatter(pca.components_[0], pca.components_[1], alpha=0.8)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid()
plt.show()

This plot gives a hint as to which observed variables each principal component reflects.
