Converting string data to a numeric type - Learning Python from scratch to be able to do data analysis

[Take0] Creating the base DataFrame

【Syntax】

【Code】

import pandas as pd

data = {
    "A": [100, 200, 300],
    "B": [400, 500, 600],
    "C": [400, 500, 600],
    "D": [700, 800, 900],
    "E": ["1,000", "1,100", "1,200"],  # quoted, so these are strings
}
dfA = pd.DataFrame(data)
print(dfA)
print("\n")
print("Data type of each column")
print(dfA.dtypes)

【Result】

     A    B    C    D      E
0  100  400  400  700  1,000
1  200  500  500  800  1,100
2  300  600  600  900  1,200


Data type of each column
A     int64
B     int64
C     int64
D     int64
E    object
dtype: object

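Column E's values are written inside double quotes, so they were presumably stored as strings, which pandas reports as dtype object.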
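One way to check this guess is to look at the Python type of each element rather than only the column dtype. Below is a minimal sketch that reuses the values of column E; the variable names are just for illustration. Depending on the pandas version, the column dtype itself may be shown as object or as a dedicated string dtype, but each element should still be a str.

```python
import pandas as pd

# Same values as column E above: numbers written as quoted text.
s = pd.Series(["1,000", "1,100", "1,200"], name="E")

# Map every element to its Python type and list the distinct types found.
element_types = s.map(type)
print(element_types.unique())

# True only if every element is a Python str.
all_strings = bool(s.map(lambda v: isinstance(v, str)).all())
print(all_strings)
```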

[Take1] Converting the data type to integer

【Syntax】

df["column_name"] = df["column_name"].astype(int)
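For reference, `astype(int)` does convert strings as long as they contain only digits. A minimal sketch, using hypothetical values without commas (not the data from this post):

```python
import pandas as pd

# Hypothetical column of digit-only strings (no thousands separators).
s = pd.Series(["1000", "1100", "1200"])

# Each string is parsed as an integer; the result is an integer column.
as_int = s.astype(int)
print(as_int.tolist())
print(as_int.dtype)
```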

【Code】

import pandas as pd

data = {
    "A": [100, 200, 300],
    "B": [400, 500, 600],
    "C": [400, 500, 600],
    "D": [700, 800, 900],
    "E": ["1,000", "1,100", "1,200"],
}
dfA = pd.DataFrame(data)
print(dfA)
# Convert the data type of column E to int
dfA["E"] = dfA["E"].astype(int)
print("\n")
print("Data type of each column")
print(dfA.dtypes)

【Result】

     A    B    C    D      E
0  100  400  400  700  1,000
1  200  500  500  800  1,100
2  300  600  600  900  1,200
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in ()
     12 # Convert the data type of column E to int
---> 13 dfA["E"] = dfA["E"].astype(int)
     14 print("\n")
     15 print("Data type of each column")

6 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/dtypes/astype.py in _astype_nansafe(arr, dtype, copy, skipna)
    136     if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
    137         # Explicit copy, or required since NumPy can't view from / to object.
--> 138         return arr.astype(dtype, copy=True)
    139
    140     return arr.astype(dtype, copy=copy)

ValueError: invalid literal for int() with base 10: '1,000'

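An error...

It seems the "," inside "1,000" cannot be parsed, so the value cannot be converted to an integer.

So some preprocessing appears to be needed before the conversion.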
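Before preprocessing, it can help to see which values would actually fail. The sketch below uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN instead of raising. The mixed sample values are hypothetical and chosen only to show one value that converts and two that do not.

```python
import pandas as pd

# Hypothetical mix: two values with commas, one plain digit string.
s = pd.Series(["1,000", "250", "1,200"], name="E")

# errors="coerce" replaces values that cannot be parsed with NaN.
converted = pd.to_numeric(s, errors="coerce")

# Keep only the original strings whose conversion failed.
failed = s[converted.isna()]
print(failed.tolist())
```

Listing the failing values this way shows what the preprocessing step has to handle, here the thousands separator.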

[Take2] Comma-formatted strings cannot be converted as they are, so remove the commas first and then convert to integer

【Syntax】

df["column_name"] = df["column_name"].str.replace(",", "")

【Code】

import pandas as pd

data = {
    "A": [100, 200, 300],
    "B": [400, 500, 600],
    "C": [400, 500, 600],
    "D": [700, 800, 900],
    "E": ["1,000", "1,100", "1,200"],
}
dfA = pd.DataFrame(data)
print("Original data")
print(dfA)
# In column E, remove the commas and then convert to int
dfA["E"] = dfA["E"].str.replace(",", "").astype(int)
print("Data after conversion")
print(dfA)
print("\n")
print("Data type of each column")
print(dfA.dtypes)

【Result】

Original data
     A    B    C    D      E
0  100  400  400  700  1,000
1  200  500  500  800  1,100
2  300  600  600  900  1,200
Data after conversion
     A    B    C    D     E
0  100  400  400  700  1000
1  200  500  500  800  1100
2  300  600  600  900  1200


Data type of each column
A    int64
B    int64
C    int64
D    int64
E    int64
dtype: object

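df["column_name"] = df["column_name"].str.replace(",", "").astype(int)

So the two steps can apparently be chained into a single expression like this.

Changing the order of the chained steps looks like it could lead to some interesting uses.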
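As one hedged illustration of how the order of chained steps matters, the sketch below cleans a hypothetical column whose values carry surrounding spaces and a yen sign in addition to commas. The sample values and the name `cleaned` are assumptions for illustration, not part of the data above. Every string-cleaning step has to come before `astype(int)`, since the conversion only succeeds once the strings contain digits alone.

```python
import pandas as pd

# Hypothetical raw values: spaces, a currency symbol, and thousands separators.
raw = pd.Series([" ¥1,000", "¥1,100 ", "¥1,200"], name="E")

cleaned = (
    raw.str.strip()                        # drop surrounding whitespace
    .str.replace("¥", "", regex=False)     # drop the currency symbol
    .str.replace(",", "", regex=False)     # drop thousands separators
    .astype(int)                           # convert once only digits remain
)

print(cleaned.tolist())
```

Passing `regex=False` makes each replacement a plain literal match, so the behavior does not depend on the pandas version's default for that parameter.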