Machine learning: working with the iris data provided by sklearn - Learning Python from scratch to become able to do data analysis

The iris data I looked at previously was the version provided by seaborn.

[Take1] Load the sklearn iris data

【Syntax】

datasets.load_iris()

【Code】

from sklearn import datasets

iris=datasets.load_iris()

print(iris)

【Result】

{'data': array([[5.1, 3.5, 1.4, 0.2],

[4.9, 3. , 1.4, 0.2],

[4.7, 3.2, 1.3, 0.2],

[4.6, 3.1, 1.5, 0.2],

[5. , 3.6, 1.4, 0.2],

[5.4, 3.9, 1.7, 0.4],

[4.6, 3.4, 1.4, 0.3],

[5. , 3.4, 1.5, 0.2],

[4.4, 2.9, 1.4, 0.2],

[4.9, 3.1, 1.5, 0.1],

[5.4, 3.7, 1.5, 0.2],

[4.8, 3.4, 1.6, 0.2],

[4.8, 3. , 1.4, 0.1],

[4.3, 3. , 1.1, 0.1],

[5.8, 4. , 1.2, 0.2],

[5.7, 4.4, 1.5, 0.4],

[5.4, 3.9, 1.3, 0.4],

[5.1, 3.5, 1.4, 0.3],

[5.7, 3.8, 1.7, 0.3],

[5.1, 3.8, 1.5, 0.3],

[5.4, 3.4, 1.7, 0.2],

[5.1, 3.7, 1.5, 0.4],

[4.6, 3.6, 1. , 0.2],

[5.1, 3.3, 1.7, 0.5],

[4.8, 3.4, 1.9, 0.2],

[5. , 3. , 1.6, 0.2],

[5. , 3.4, 1.6, 0.4],

[5.2, 3.5, 1.5, 0.2],

[5.2, 3.4, 1.4, 0.2],

[4.7, 3.2, 1.6, 0.2],

[4.8, 3.1, 1.6, 0.2],

[5.4, 3.4, 1.5, 0.4],

[5.2, 4.1, 1.5, 0.1],

[5.5, 4.2, 1.4, 0.2],

[4.9, 3.1, 1.5, 0.2],

[5. , 3.2, 1.2, 0.2],

[5.5, 3.5, 1.3, 0.2],

[4.9, 3.6, 1.4, 0.1],

[4.4, 3. , 1.3, 0.2],

[5.1, 3.4, 1.5, 0.2],

[5. , 3.5, 1.3, 0.3],

[4.5, 2.3, 1.3, 0.3],

[4.4, 3.2, 1.3, 0.2],

[5. , 3.5, 1.6, 0.6],

[5.1, 3.8, 1.9, 0.4],

[4.8, 3. , 1.4, 0.3],

[5.1, 3.8, 1.6, 0.2],

[4.6, 3.2, 1.4, 0.2],

[5.3, 3.7, 1.5, 0.2],

[5. , 3.3, 1.4, 0.2],

[7. , 3.2, 4.7, 1.4],

[6.4, 3.2, 4.5, 1.5],

[6.9, 3.1, 4.9, 1.5],

[5.5, 2.3, 4. , 1.3],

[6.5, 2.8, 4.6, 1.5],

[5.7, 2.8, 4.5, 1.3],

[6.3, 3.3, 4.7, 1.6],

[4.9, 2.4, 3.3, 1. ],

[6.6, 2.9, 4.6, 1.3],

[5.2, 2.7, 3.9, 1.4],

[5. , 2. , 3.5, 1. ],

[5.9, 3. , 4.2, 1.5],

[6. , 2.2, 4. , 1. ],

[6.1, 2.9, 4.7, 1.4],

[5.6, 2.9, 3.6, 1.3],

[6.7, 3.1, 4.4, 1.4],

[5.6, 3. , 4.5, 1.5],

[5.8, 2.7, 4.1, 1. ],

[6.2, 2.2, 4.5, 1.5],

[5.6, 2.5, 3.9, 1.1],

[5.9, 3.2, 4.8, 1.8],

[6.1, 2.8, 4. , 1.3],

[6.3, 2.5, 4.9, 1.5],

[6.1, 2.8, 4.7, 1.2],

[6.4, 2.9, 4.3, 1.3],

[6.6, 3. , 4.4, 1.4],

[6.8, 2.8, 4.8, 1.4],

[6.7, 3. , 5. , 1.7],

[6. , 2.9, 4.5, 1.5],

[5.7, 2.6, 3.5, 1. ],

[5.5, 2.4, 3.8, 1.1],

[5.5, 2.4, 3.7, 1. ],

[5.8, 2.7, 3.9, 1.2],

[6. , 2.7, 5.1, 1.6],

[5.4, 3. , 4.5, 1.5],

[6. , 3.4, 4.5, 1.6],

[6.7, 3.1, 4.7, 1.5],

[6.3, 2.3, 4.4, 1.3],

[5.6, 3. , 4.1, 1.3],

[5.5, 2.5, 4. , 1.3],

[5.5, 2.6, 4.4, 1.2],

[6.1, 3. , 4.6, 1.4],

[5.8, 2.6, 4. , 1.2],

[5. , 2.3, 3.3, 1. ],

[5.6, 2.7, 4.2, 1.3],

[5.7, 3. , 4.2, 1.2],

[5.7, 2.9, 4.2, 1.3],

[6.2, 2.9, 4.3, 1.3],

[5.1, 2.5, 3. , 1.1],

[5.7, 2.8, 4.1, 1.3],

[6.3, 3.3, 6. , 2.5],

[5.8, 2.7, 5.1, 1.9],

[7.1, 3. , 5.9, 2.1],

[6.3, 2.9, 5.6, 1.8],

[6.5, 3. , 5.8, 2.2],

[7.6, 3. , 6.6, 2.1],

[4.9, 2.5, 4.5, 1.7],

[7.3, 2.9, 6.3, 1.8],

[6.7, 2.5, 5.8, 1.8],

[7.2, 3.6, 6.1, 2.5],

[6.5, 3.2, 5.1, 2. ],

[6.4, 2.7, 5.3, 1.9],

[6.8, 3. , 5.5, 2.1],

[5.7, 2.5, 5. , 2. ],

[5.8, 2.8, 5.1, 2.4],

[6.4, 3.2, 5.3, 2.3],

[6.5, 3. , 5.5, 1.8],

[7.7, 3.8, 6.7, 2.2],

[7.7, 2.6, 6.9, 2.3],

[6. , 2.2, 5. , 1.5],

[6.9, 3.2, 5.7, 2.3],

[5.6, 2.8, 4.9, 2. ],

[7.7, 2.8, 6.7, 2. ],

[6.3, 2.7, 4.9, 1.8],

[6.7, 3.3, 5.7, 2.1],

[7.2, 3.2, 6. , 1.8],

[6.2, 2.8, 4.8, 1.8],

[6.1, 3. , 4.9, 1.8],

[6.4, 2.8, 5.6, 2.1],

[7.2, 3. , 5.8, 1.6],

[7.4, 2.8, 6.1, 1.9],

[7.9, 3.8, 6.4, 2. ],

[6.4, 2.8, 5.6, 2.2],

[6.3, 2.8, 5.1, 1.5],

[6.1, 2.6, 5.6, 1.4],

[7.7, 3. , 6.1, 2.3],

[6.3, 3.4, 5.6, 2.4],

[6.4, 3.1, 5.5, 1.8],

[6. , 3. , 4.8, 1.8],

[6.9, 3.1, 5.4, 2.1],

[6.7, 3.1, 5.6, 2.4],

[6.9, 3.1, 5.1, 2.3],

[5.8, 2.7, 5.1, 1.9],

[6.8, 3.2, 5.9, 2.3],

[6.7, 3.3, 5.7, 2.5],

[6.7, 3. , 5.2, 2.3],

[6.3, 2.5, 5. , 1.9],

[6.5, 3. , 5.2, 2. ],

[6.2, 3.4, 5.4, 2.3],

[5.9, 3. , 5.1, 1.8]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), 'frame': None, 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n :Number of Instances: 150 (50 in each of three classes)\n :Number of Attributes: 4 numeric, predictive attributes and the class\n :Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm\n - class:\n - Iris-Setosa\n - Iris-Versicolour\n - Iris-Virginica\n \n :Summary Statistics:\n\n ============== ==== ==== ======= ===== ====================\n Min Max Mean SD Class Correlation\n ============== ==== ==== ======= ===== ====================\n sepal length: 4.3 7.9 5.84 0.83 0.7826\n sepal width: 2.0 4.4 3.05 0.43 -0.4194\n petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\n petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n ============== ==== ==== ======= ===== ====================\n\n :Missing Attribute Values: None\n :Class Distribution: 33.3% for each of 3 classes.\n :Creator: R.A. Fisher\n :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n :Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature. Fisher\'s paper is a classic in the field and\nis referenced frequently to this day. (See Duda & Hart, for example.) The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant. One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n|details-start|\n**References**\n|details-split|\n\n- Fisher, R.A. 
"The use of multiple measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n Mathematical Statistics" (John Wiley, NY, 1950).\n- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n Structure and Classification Rule for Recognition in Partially Exposed\n Environments". IEEE Transactions on Pattern Analysis and Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information Theory, May 1972, 431-433.\n- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3 classes in the data.\n- Many, many more ...\n\n|details-end|', 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 'filename': 'iris.csv', 'data_module': 'sklearn.datasets.data'}

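The seaborn version was a two-dimensional table (a DataFrame), but this one is a dictionary-like object whose values are mostly NumPy arrays.

So for now I will treat the two as different things.

Written in simplified form, the data looks like the block further below.

Structurally, each key seems to mean the following.

data: 150 lists, each holding 4 feature values

target: the class (0-2) assigned to each of those 150 rows

frame: None here (it holds a DataFrame only when the data is loaded with as_frame=True)

target_names: the names corresponding to labels 0-2

DESCR: a long string describing the dataset (short for "description")

feature_names: the names of the features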

# Written out in tidied form

{

'data': array([

[5.1, 3.5, 1.4, 0.2],

[4.9, 3. , 1.4, 0.2],

[4.7, 3.2, 1.3, 0.2],

[4.6, 3.1, 1.5, 0.2],

[5. , 3.6, 1.4, 0.2],

~ omitted: 150 lists of 4 elements in total ~

[6.2, 3.4, 5.4, 2.3],

[5.9, 3. , 5.1, 1.8]

]),

'target': array(

[0, 0, 0, 0, ~ omitted: 150 values of 0, 1, or 2 in total ~ 2, 2]),

'frame': None,

'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),

'DESCR': ~ omitted (a long string describing the dataset) ~,

'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],

'filename': 'iris.csv',

'data_module': 'sklearn.datasets.data'

}
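The structure sketched above can be checked directly. This is a minimal sketch, assuming the same scikit-learn version that produced the output above; the name label_names is introduced here only for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object is dictionary-like, so its keys can be listed
print(list(iris.keys()))

# Each entry can be read as a dict item or as an attribute
print(iris["data"].shape)   # dict-style access
print(iris.data.shape)      # attribute-style access, same array

# target holds 0, 1, 2; target_names[i] is the name for label i
# (label_names is a hypothetical helper name, not part of sklearn)
label_names = {i: str(name) for i, name in enumerate(iris.target_names)}
print(label_names)
```

Both access styles return the same array, which is why the later steps can write iris.data and iris.target.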

[Take1-2] Check the data type and size of each item

【Syntax】

type(data)

array.shape

【Code】

from sklearn import datasets

iris=datasets.load_iris()

# Type of each item

print("Data types")

print("data=",type(iris.data))

print("target=",type(iris.target))

print("target_names=",type(iris.target_names))

print("\n")

# Size of each item

print("Data sizes")

print("data=",iris.data.shape)

print("target=",iris.target.shape)

print("target_names=",iris.target_names.shape)

【Result】

Data types

data= <class 'numpy.ndarray'>

target= <class 'numpy.ndarray'>

target_names= <class 'numpy.ndarray'>

Data sizes

data= (150, 4)

target= (150,)

target_names= (3,)

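data and target each hold 150 entries, while target_names has shape (3,).

So it should be possible to turn data into a DataFrame (df) and add the matching target as a column.

After that, I want to link the added column to target_names so that each row can be read by name.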
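One way to link the numeric labels to their names, before any DataFrame is involved, is to index target_names with the target array itself. This is a minimal sketch using NumPy integer-array indexing; the variable name species is my own choice.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Indexing the 3-element name array with the 150 labels
# returns one name per row (NumPy integer-array indexing)
species = iris.target_names[iris.target]

print(species.shape)
print(species[0], species[50], species[100])
```

Because target only contains 0, 1, and 2, every label is a valid index into the 3-element target_names array.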

[Take2] Put the iris data into DataFrames

【Syntax】

df=pd.DataFrame(source)

【Code】

import pandas as pd

from sklearn import datasets

# Put "data" into df

print("data")

iris=datasets.load_iris()

df=pd.DataFrame(iris.data)

print(df.head())

# Put "target" into df2

print("\n","target")

df2=pd.DataFrame(iris.target)

print(df2.head())

# Put "target_names" into df3

print("\n","target_names")

df3=pd.DataFrame(iris.target_names)

print(df3.head())

# Put "feature_names" into df4

print("\n","feature_names")

df4=pd.DataFrame(iris.feature_names)

print(df4.head())

【Result】

data

0 1 2 3

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

target

0 0

1 0

2 0

3 0

4 0

target_names

0 setosa

1 versicolor

2 virginica

feature_names

0 sepal length (cm)

1 sepal width (cm)

2 petal length (cm)

3 petal width (cm)

The first five rows printed are the same as the first five lists in data from the previous step.

Next, I want to use feature_names as the column names and add a target column.

[Take3] Give the DataFrame readable column names & add a target column

【Syntax】

df.columns=list of column names

df['column name']=values

【Code】

import pandas as pd

from sklearn import datasets

iris=datasets.load_iris()

# Store the "data" part of the original dataset in a DataFrame

df=pd.DataFrame(iris.data)

# Set the column names

df.columns=iris.feature_names

# Add target as a column

df['target']=iris.target

# Output

print(df.head())

【Result】

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

target

0 0

1 0

2 0

3 0

4 0

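The first five target values match the first five values of the same item printed in Take1.

In the end, this looks like it will be the same data as the one I handled with seaborn.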
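As a cross-check of the steps above, load_iris can also return the table directly when called with as_frame=True (the parameter behind the 'frame': None entry seen in Take1). This is a minimal sketch; the column name species is my own addition for illustration, chosen to resemble the name column in the seaborn version.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects
iris = datasets.load_iris(as_frame=True)

# iris.frame is a DataFrame: the four feature columns plus "target"
df = iris.frame.copy()

# Add a readable name column next to the numeric label
# ("species" is a hypothetical column name, not created by sklearn)
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.shape)
print(df.head())
```

The manual route in Take3 and this shortcut should give the same feature and target columns; the manual route is still useful for seeing how the pieces fit together.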

[Take4] Get an overview of the whole dataset with a scatter plot matrix

【Syntax】

Libraries: pandas, matplotlib, seaborn

Syntax

sns.pairplot(data=df)

plt.show()

Options

hue="column name" : column used to color the points

【Code】

import matplotlib.pyplot as plt

import pandas as pd

import seaborn as sns

from sklearn import datasets

iris=datasets.load_iris()

df=pd.DataFrame(iris.data)

# Set the column names

df.columns=iris.feature_names

# Add target as a column

df['target']=iris.target

# Output (colored by target)

sns.pairplot(data=df,hue="target")

plt.show()

【Result】

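[Take5] Split the DataFrame into a separate DataFrame for each species

Think of it as something like a filter in Access.

【Syntax】

Syntax: filter

df1=df[df['target']==value to match]

Syntax: unique values in a given df column

df["column name"].unique()

Syntax: number of rows (records) in df

len(df)

【Code】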

import pandas as pd

from sklearn import datasets

iris=datasets.load_iris()

df=pd.DataFrame(iris.data)

# Set the column names

df.columns=iris.feature_names

# Add target as a column

# df -> all data

df['target']=iris.target

# Create a separate DataFrame for each target value

# df0 -> 0 (setosa)

df0=df[df['target']==0]

# df1 -> 1 (versicolor)

df1=df[df['target']==1]

# df2 -> 2 (virginica)

df2=df[df['target']==2]

# Output

print("Number of rows in df:",len(df))

print("Number of rows in df0:",len(df0))

print("Number of rows in df1:",len(df1))

print("Number of rows in df2:",len(df2))

print("\n")

print("target values in df:",df["target"].unique())

print("target values in df0:",df0["target"].unique())

print("target values in df1:",df1["target"].unique())

print("target values in df2:",df2["target"].unique())

【Result】

Number of rows in df: 150

Number of rows in df0: 50

Number of rows in df1: 50

Number of rows in df2: 50

target values in df: [0 1 2]

target values in df0: [0]

target values in df1: [1]

target values in df2: [2]

In other words, the dataset holds 50 rows for each species.
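The same per-species counts can also be obtained without creating three separate DataFrames. This is a minimal sketch using value_counts and groupby, offered as an alternative to the filter approach rather than a replacement for it.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Count the rows for each label in one call
counts = df["target"].value_counts().sort_index()
print(counts)

# groupby yields (label, sub-DataFrame) pairs, one per species
for label, group in df.groupby("target"):
    print(label, iris.target_names[label], len(group))
```

The filter syntax is still handy when only one species is needed; groupby fits better when the same operation is repeated for every species.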

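[Take6] Draw a histogram for each species

【Syntax】

Syntax: create a plotting area of width (inches) x height (inches)

plt.figure(figsize=(width,height))

Syntax: define a histogram

df["column name"].hist(bins=number of bins, color="color",alpha=opacity)

* bins defaults to 10 (can be omitted)

* alpha: a value from 0 (transparent) to 1 (opaque)

【Code】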

%matplotlib inline

import matplotlib.pyplot as plt

import pandas as pd

from sklearn import datasets

iris=datasets.load_iris()

df=pd.DataFrame(iris.data)

# Set the column names

df.columns=iris.feature_names

# Add target as a column

df['target']=iris.target

# Create a separate DataFrame for each target value

df0=df[df['target']==0]

df1=df[df['target']==1]

df2=df[df['target']==2]

# Draw a histogram from each DataFrame

# Define the plotting area

plt.figure(figsize=(5 ,5))

# Name of the column to plot

xx="sepal width (cm)"

# Define the graphs

df0[xx].hist(bins=10, color="b",alpha=0.5)

df1[xx].hist(bins=10, color="g",alpha=0.5)

df2[xx].hist(bins=10, color="r",alpha=0.5)

plt.xlabel(xx)

plt.ylabel("count")

# Draw the graph

plt.show()

【Result】
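To see the numbers behind the bars, the bin counts can be computed without plotting. This is a minimal sketch that uses numpy.histogram as a stand-in for the plotting call, assuming the same column and bins=10 as above.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

xx = "sepal width (cm)"

# numpy.histogram returns the per-bin counts and the bin edges
for label in range(3):
    values = df.loc[df["target"] == label, xx]
    counts, edges = np.histogram(values, bins=10)
    print(iris.target_names[label], counts, edges.round(2))
```

Each species gets its own 10 bins spanning its own minimum and maximum, so the bin edges differ between species. As far as I understand, the three separate hist calls in the code above behave the same way.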

[Take7] Draw a scatter plot of sepal length against sepal width

【Syntax】

Syntax: scatter plot

plt.scatter(x-axis data,y-axis data,color="color",alpha=opacity)

* alpha: a value from 0 (transparent) to 1 (opaque)

【Code】

%matplotlib inline

import matplotlib.pyplot as plt

import pandas as pd

from sklearn import datasets

iris=datasets.load_iris()

df=pd.DataFrame(iris.data)

# Set the column names

df.columns=iris.feature_names

# Add target as a column

df['target']=iris.target

# Create a separate DataFrame for each target value

df0=df[df['target']==0]

df1=df[df['target']==1]

df2=df[df['target']==2]

# Draw a scatter plot from each DataFrame

# Define the plotting area

plt.figure(figsize=(5 ,5))

# Names of the columns to plot

xx="sepal width (cm)"

yy="sepal length (cm)"

# Define the graphs

plt.scatter(df0[xx],df0[yy],color="b",alpha=0.5)

plt.scatter(df1[xx],df1[yy],color="g",alpha=0.5)

plt.scatter(df2[xx],df2[yy],color="r",alpha=0.5)

plt.xlabel(xx)

plt.ylabel(yy)

# Draw grid lines

plt.grid()

# Draw the graph

plt.show()

【Result】

The points roughly split into an upper-left group and a lower-right group. This seems to be the starting point for machine learning.
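To put rough numbers on that impression, the mean position of each species on the two axes can be compared. This is a minimal sketch, assuming the same columns as the scatter plot above; it only summarizes the data and is not a classifier.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

xx = "sepal width (cm)"
yy = "sepal length (cm)"

# Mean position of each species on the scatter plot axes
means = df.groupby("target")[[xx, yy]].mean()

# Replace the numeric labels in the index with the species names
means.index = iris.target_names[means.index.to_numpy()]
print(means)
```

If the visual reading is right, setosa should show the largest mean sepal width and the smallest mean sepal length (the lower-right group), with the other two species sitting closer together in the upper left.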