根据这里的说法,sklearn不能处理分类变量。建议使用一种热编码来处理这些特性。然而,我不明白一个热编码怎么能有帮助?例如,country=USA或China或England被转换为country=USA是真是假,新功能“country==USA”仍然是分类的(只能取0或1)。这不会改变任何事情。Sklearn仍然将0或1视为数值。
对于这里的一个真实示例,我转换了数据:
human,warm-blooded,hair,yes,no,no,yes,no,mammal
python,cold-blooded,scales,no,no,no,no,yes,reptile
salmon,cold-blooded,scales,no,yes,no,no,no,fish
whale,warm-blooded,hair,yes,yes,no,no,no,mammal
frog,cold-blooded,none,no,semi,no,yes,yes,amphibian
komodo dragon,cold-blooded,scales,no,no,no,yes,no,reptile
bat,warm-blooded,hair,yes,no,yes,yes,yes,mammal
pigeon,warm-blooded,feathers,no,no,yes,yes,no,bird
cat,warm-blooded,fur,yes,no,no,yes,no,mammal
leopard shark,cold-blooded,scales,yes,yes,no,no,no,fish
turtle,cold-blooded,scales,no,semi,no,yes,no,reptile
penguin,warm-blooded,feathers,no,semi,no,yes,no,bird
porcupine,warm-blooded,quills,yes,no,no,yes,yes,mammal
eel,cold-blooded,scales,no,yes,no,no,no,fish
salamander,cold-blooded,none,no,semi,no,yes,yes,amphibian
gila monster,cold-blooded,scales,no,no,no,yes,yes,
进入
[[1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.]
[1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1.]
[1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
[1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
[1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
[1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1.]
[0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0.]
[0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]
[1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0.]
[1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
[1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1.]
[1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]
[1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0.]
[1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
[1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
[1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.]]
然后建立一个决策树,就像
决策树(点击打开)分割点仍然很可笑(比如分娩=否)
首先,我们还必须记住,SKLearning只能构建二叉树。例如,有一个颜色特征需要0,1,2,3,4,5不同的颜色。我们用颜色分割颜色特征
一个热编码就好像Sklearn可以处理分类数据一样。例如,如果数据中有一个真实的特征“country==USA”,它取0或1,并用作分割特征,则叶为country==USA为0,country==USA为1。尽管Sklearn仍然使用数字分割点,如0.5(分割点必须介于0和1之间,否则分割不好),但如果country==USA