I'm working in Python and trying to apply scaling to a DataFrame:
subject_id  hour_measure  urinecolor  blood pressure
3           1.00          red         40
3           1.15          red         high
4           2.00          yellow      low
Because it contains both numeric and text columns, the code below gives me an error:
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler for data
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(X)
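It raises an error because the DataFrame contains strings. How can I tell Python to scale only the columns that contain numbers, and also to scale the numeric values inside the string column?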
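Convert the non-numeric values to missing values, scale with an alternative (plain pandas) min-max formula, and finally replace the missing values with the original ones: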
print (df)
   subject_id  hour_measure urinecolor blood pressure
0           3          1.00        red             40
1           3          1.15        red           high
2           4          2.00     yellow            low
3           5          5.00     yellow            100
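df = df.set_index('subject_id')
# non-numeric cells become NaN
df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
# min-max scaling per column; NaN cells stay NaN
df2 = (df1 - df1.min()) / (df1.max() - df1.min())
# fill the NaN cells with the original (text) values
df = df2.combine_first(df)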
print (df)
            hour_measure urinecolor blood pressure
subject_id
3                 0.0000        red              0
3                 0.0375        red           high
4                 0.2500     yellow            low
5                 1.0000     yellow              1
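For reference, the steps above could be wrapped into a small helper. This is only a sketch: `scale_numeric_cells` is a hypothetical name, not a pandas or scikit-learn function, and the sample frame is rebuilt by hand with blood pressure stored as strings, which is an assumption about the original data.

```python
import pandas as pd

# Sample data mirroring the frame printed above (string storage is an assumption).
df = pd.DataFrame({
    "subject_id": [3, 3, 4, 5],
    "hour_measure": [1.00, 1.15, 2.00, 5.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": ["40", "high", "low", "100"],
}).set_index("subject_id")


def scale_numeric_cells(frame):
    """Min-max scale the numeric cells of each column, keep other cells as-is.

    Hypothetical helper for illustration only.
    """
    # Anything that cannot be parsed as a number becomes NaN.
    numeric = frame.apply(pd.to_numeric, errors="coerce")
    # Column-wise min-max formula; NaN cells are ignored by min() and max().
    scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
    # Wherever the scaled frame is NaN, fall back to the original cell.
    return scaled.combine_first(frame)


result = scale_numeric_cells(df)
print(result)
```

A column whose numeric values are all equal would divide zero by zero here; the resulting NaN then falls back to the original value through `combine_first`, so such a column is effectively left unscaled.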
First solution:
I suggest replacing the text values with numbers by means of a dictionary, for example:
dbp = {'high': 150, 'low': 60}
df['blood pressure'] = df['blood pressure'].replace(dbp)
All together:
# if subject_id values are numeric, move them to the index
df = df.set_index('subject_id')
dbp = {'high': 150, 'low': 60}
# replace text with numbers and convert to integers
df['blood pressure'] = df['blood pressure'].replace(dbp).astype(int)
print (df)
            hour_measure urinecolor  blood pressure
subject_id
3                   1.00        red              40
3                   1.15        red             150
4                   2.00     yellow              60
print (df.dtypes)
hour_measure      float64
urinecolor         object
blood pressure      int32
dtype: object
import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
# select only numeric columns
X = scaler.fit_transform(df.select_dtypes(np.number))
print (X)
[[0.         0.        ]
 [0.15       1.        ]
 [1.         0.18181818]]
Details:
print (df.select_dtypes(np.number))
            hour_measure  blood pressure
subject_id
3                   1.00              40
3                   1.15             150
4                   2.00              60
Another approach is shown below (I added a new row; see the scaled values in blood pressure):
   hour_measure urinecolor blood pressure temp_column
0          1.00        red             40          40
1          1.15        red           high           0
2          2.00     yellow            low           0
3          3.00     yellow             20          20
# copy blood pressure, replacing purely alphabetic entries such as 'high' with 0
df['temp_column'] = df['blood pressure'].apply(lambda x: 0 if str(x).isalpha() else x)
This creates a new temp_column that holds the numeric values of the blood pressure column.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
df['hour_measure'] = scaler.fit_transform(df['hour_measure'].values.reshape(-1, 1))
df['temp_column'] = scaler.fit_transform(df['temp_column'].values.reshape(-1, 1))
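I applied MinMaxScaler to temp_column, which holds the numeric blood pressure values. Then I simply put the scaled values back into the blood pressure column.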
# row labels where blood pressure holds a numeric value
numeric_rows = pd.to_numeric(df['blood pressure'], errors='coerce').dropna().index
print('Index of numeric values in blood pressure column: ', numeric_rows.tolist())
# write the scaled values back with one .loc assignment instead of chained indexing
df.loc[numeric_rows, 'blood pressure'] = df.loc[numeric_rows, 'temp_column']
df = df.drop(columns=['temp_column'])
Result:
   hour_measure urinecolor blood pressure
0         0.000        red              1
1         0.075        red           high
2         0.500     yellow            low
3         1.000     yellow            0.5
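Note that in this approach the placeholder 0 takes part in the minimum, which is why 20 is scaled to 0.5 rather than 0. If only the genuine numeric entries should define the range, a possible variant (a sketch only, assuming the blood pressure values are stored as strings and using plain pandas instead of MinMaxScaler) is to coerce the text to NaN so it is ignored:

```python
import pandas as pd

# Hand-built copy of the example frame (string storage is an assumption).
df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00, 3.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": ["40", "high", "low", "20"],
})

# Text entries become NaN, so they do not influence min() and max().
bp = pd.to_numeric(df["blood pressure"], errors="coerce")
bp_scaled = (bp - bp.min()) / (bp.max() - bp.min())

# Keep the original text where no number was found, else take the scaled value.
df["blood pressure"] = df["blood pressure"].astype(object).where(bp.isna(), bp_scaled)
print(df)
```

With this variant 40 maps to 1.0 and 20 maps to 0.0, while 'high' and 'low' are left untouched.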