Asked by: 小点点

Scale only the numeric values in a DataFrame that contains strings


I am working in Python and trying to apply scaling to a DataFrame:

subject_id hour_measure         urinecolor   blood pressure                  
3          1.00                 red          40
           1.15                 red          high
4          2.00              yellow          low

Because it contains both numeric and text columns, the code below gives me an error:

# MinMaxScaler for the data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(X)

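It fails because the DataFrame contains strings. How can I tell Python to scale only the columns that contain numbers, and also to scale the numeric values inside the string column?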


2 answers

Anonymous user

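Convert the non-numeric values to missing values, scale with a manual min-max formula as an alternative to MinMaxScaler, and finally fill the missing values back in with the original values: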

print (df)
   subject_id  hour_measure urinecolor blood pressure
0           3          1.00        red             40
1           3          1.15        red           high
2           4          2.00     yellow            low
3           5          5.00     yellow            100

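# Move subject_id into the index so it is not scaled
df = df.set_index('subject_id')

# Non-numeric cells become NaN
df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
# Column-wise min-max scaling; NaN cells stay NaN
df2 = (df1 - df1.min()) / (df1.max() - df1.min())

# Fill the NaN cells with the original text values
df = df2.combine_first(df)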
print (df)
            hour_measure urinecolor blood pressure
subject_id                                        
3                 0.0000        red              0
3                 0.0375        red           high
4                 0.2500     yellow            low
5                 1.0000     yellow              1
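
For anyone who wants to run the idea end to end, here is a minimal self-contained sketch of the same three steps (coerce, scale, restore). The helper name `scale_numeric_cells` and its `columns` parameter are my own, not part of pandas, and I keep `subject_id` as an ordinary column so the default index stays unique. One assumption goes beyond the answer: a column whose numeric cells are all equal would give 0/0, so the sketch replaces a zero span with 1, which maps such a column to 0.

```python
import pandas as pd

# Sample frame mirroring the one printed above
df = pd.DataFrame({
    "subject_id": [3, 3, 4, 5],
    "hour_measure": [1.00, 1.15, 2.00, 5.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": [40, "high", "low", 100],
})


def scale_numeric_cells(frame, columns):
    """Min-max scale the numeric cells of `columns`; text cells are kept as they are.

    Hypothetical helper written for this illustration.
    """
    original = frame[columns]
    # Text such as 'high' cannot be parsed as a number, so it becomes NaN
    numeric = original.apply(pd.to_numeric, errors="coerce")
    # Assumption: a constant column has span 0; use 1 so it maps to 0 instead of NaN
    span = (numeric.max() - numeric.min()).replace(0, 1)
    scaled = (numeric - numeric.min()) / span
    # NaN cells (the former text) are filled back in from the original values
    restored = scaled.combine_first(original)
    out = frame.copy()
    for col in columns:
        out[col] = restored[col]
    return out


result = scale_numeric_cells(df, ["hour_measure", "blood pressure"])
print(result)
```

Passing the column list explicitly keeps identifier columns such as `subject_id` out of the scaling without having to move them into the index.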

First solution:

I suggest replacing the text values with numbers by means of a dictionary, for example:

dbp = {'high': 150, 'low': 60}

df['blood pressure'] = df['blood pressure'].replace(dbp)

All together:

# if subject_id is numeric, move it into the index so it is not scaled
df = df.set_index('subject_id')

dbp = {'high': 150, 'low': 60}
#replace to numbers and convert to integers
df['blood pressure'] = df['blood pressure'].replace(dbp).astype(int)

print (df)
            hour_measure urinecolor  blood pressure
subject_id                                         
3                   1.00        red              40
3                   1.15        red             150
4                   2.00     yellow              60

print (df.dtypes)
hour_measure      float64
urinecolor         object
blood pressure      int32
dtype: object

import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
#select only numeric columns
X = scaler.fit_transform(df.select_dtypes(np.number))
print (X)
[[0.         0.        ]
 [0.15       1.        ]
 [1.         0.18181818]]

Details:

print (df.select_dtypes(np.number))
            hour_measure  blood pressure
subject_id                              
3                   1.00              40
3                   1.15             150
4                   2.00              60
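
The snippet above returns a bare NumPy array, so here is a rough sketch of how the scaled values could be written back into the DataFrame next to the untouched text column. The values 150 and 60 are just the example encoding from the dictionary above, not medically meaningful numbers. I use `Series.map` with a dictionary lookup instead of `replace`; that is a substitution on my part so that numeric cells pass through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "subject_id": [3, 3, 4],
    "hour_measure": [1.00, 1.15, 2.00],
    "urinecolor": ["red", "red", "yellow"],
    "blood pressure": [40, "high", "low"],
}).set_index("subject_id")

# Example encoding for the text labels (assumed values, as in the answer)
dbp = {"high": 150, "low": 60}
# Look each cell up in the dictionary; cells that are not keys are kept as they are
df["blood pressure"] = df["blood pressure"].map(lambda v: dbp.get(v, v)).astype(int)

# Scale only the numeric columns
numeric_cols = list(df.select_dtypes(include=np.number).columns)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(df[numeric_cols])

# Write each scaled column back; urinecolor is left untouched
for i, col in enumerate(numeric_cols):
    df[col] = scaled[:, i]

print(df)
```

Keeping the fitted `scaler` around allows the same transformation to be applied to new rows later with `scaler.transform`.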

Anonymous user

Another approach is as follows (I added a new row; see the scaled values in the blood pressure column):

       hour_measure urinecolor blood pressure  temp_column
0          1.00        red             40           40
1          1.15        red           high            0
2          2.00     yellow            low            0
3          3.00     yellow             20           20

df['temp_column'] = df['blood pressure'].values
df['temp_column'] = df['temp_column'].apply(lambda x: 0 if str(x).isalpha() else x)

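This creates a new temp_column that holds the numeric values of the blood pressure column, with 0 in place of the text values.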

scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
df['hour_measure'] = scaler.fit_transform(df['hour_measure'].values.reshape(-1, 1))
df['temp_column'] = scaler.fit_transform(df['temp_column'].values.reshape(-1 ,1))

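I applied MinMaxScaler to temp_column, which contains the numeric blood pressure values. Then I simply put the scaled values back into the blood pressure column.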

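# Index labels of the rows whose blood pressure value is numeric
numeric_rows = pd.to_numeric(df['blood pressure'], errors='coerce').dropna().index.tolist()
print('Index of numeric values in blood pressure column: ', numeric_rows)
# Assign through .loc in one step; chained df[col].iloc[i] = ... may not write back to df
df.loc[numeric_rows, 'blood pressure'] = df.loc[numeric_rows, 'temp_column']
df = df.drop(['temp_column'], axis=1)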

Result:

   hour_measure urinecolor blood pressure
0         0.000        red              1
1         0.075        red           high
2         0.500     yellow            low
3         1.000     yellow            0.5
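
Note that with the placeholder approach the 0 written for the text cells takes part in the fit, which is why 40 and 20 come out as 1 and 0.5: the minimum seen by the scaler is 0, not 20. If that is not what you want, a possible variant is to fit the scaler on the numeric cells only and write them back through a boolean mask. This is only a sketch under that assumption, not the method used above; the names `numeric` and `mask` are mine.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00, 3.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": [40, "high", "low", 20],
})

col = "blood pressure"
numeric = pd.to_numeric(df[col], errors="coerce")  # 'high' / 'low' become NaN
mask = numeric.notna()                             # True where the cell is a number

# Fit on the numeric cells only, so text cells do not influence min or max
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(numeric[mask].to_numpy().reshape(-1, 1)).ravel()

# Make sure the column can hold both floats and text, then write back in one step
df[col] = df[col].astype(object)
df.loc[mask, col] = scaled

print(df)
```

Here 40 and 20 become 1 and 0 because they are the only values the scaler sees, and no temporary column is needed.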