One-Hot Encoding

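One-hot encoding creates a set of new dummy features, one column for each distinct value of a nominal (categorical) variable. For example, suppose a color feature takes four values: green, blue, red, and black. One-hot encoding represents each value as a four-element binary vector whose positions correspond to [green, blue, red, black]. The position matching the sample's color is 1 and all others are 0, so green is encoded as [1,0,0,0].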
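As a minimal sketch of the mapping described above, the following pure-Python helper builds the vector by hand. The function name `one_hot` and the fixed category order are assumptions made for illustration; they do not come from any library.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` in `categories`, 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


# The category order fixes which position stands for which color
categories = ["green", "blue", "red", "black"]
print(one_hot("green", categories))  # [1, 0, 0, 0]
print(one_hot("red", categories))    # [0, 0, 1, 0]
```

Each vector contains exactly one 1, which is where the name "one-hot" comes from.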

Two ways to implement it

1. OneHotEncoder^[1]:

import numpy as np
import sklearn.preprocessing as pre_processing

# LabelEncoder maps each string label to an integer code
label = pre_processing.LabelEncoder()
labels = label.fit_transform(['中国', '美国', '法国', '德国'])
print(labels)

# out:
# [0 3 2 1]

# Reshape into the (n_samples, n_features) layout that OneHotEncoder expects
labels = np.array(labels).reshape(len(labels), 1)

onehot = pre_processing.OneHotEncoder()
onehot_label = onehot.fit_transform(labels)
# fit_transform returns a sparse matrix by default; toarray() makes it dense for printing
print(onehot_label.toarray())

# out:
# [[1. 0. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]]
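In recent scikit-learn versions, `OneHotEncoder` can also be fitted on string categories directly, which makes the `LabelEncoder` step optional. A minimal sketch, assuming the same four country names; the variable names `countries`, `encoder`, `encoded`, and `restored` are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same four country names as above, shaped as (n_samples, n_features)
countries = np.array(['中国', '美国', '法国', '德国']).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_)  # the learned categories, in sorted order
print(encoded)

# inverse_transform maps the one-hot rows back to the original labels
restored = encoder.inverse_transform(encoded)
print(restored.ravel())
```

The columns follow the sorted order stored in `categories_`, so the result should match the matrix produced by the two-step version.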

2. pd.get_dummies^[2]

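By default, get_dummies() produces bool columns containing True and False (this is the default in pandas 2.0 and later). If you want numeric 0 and 1 instead, pass the dtype parameter to request an integer type.

onedata = pd.get_dummies(data, columns=['SoilType'], dtype=int)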
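To make the effect of `dtype=int` concrete, here is a minimal sketch. The DataFrame below is a hypothetical stand-in: the `Elevation` column and the soil names are invented for illustration, and only the `SoilType` column name comes from the call above.

```python
import pandas as pd

# Hypothetical stand-in for the `data` frame used above
data = pd.DataFrame({
    'SoilType': ['clay', 'sand', 'clay', 'loam'],
    'Elevation': [120, 95, 130, 110],
})

# columns=[...] encodes only the listed columns and leaves the rest untouched
onedata = pd.get_dummies(data, columns=['SoilType'], dtype=int)
print(onedata)
print(onedata.dtypes)
```

The original `SoilType` column is replaced by one `SoilType_<value>` column per category, while `Elevation` is passed through unchanged.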

Example

First, our data:

import numpy as np
import pandas as pd

# Create the color data (红色 = red, 蓝色 = blue, 绿色 = green)
colors = ['红色', '蓝色', '绿色']
color_data = np.random.choice(colors, size=100, p=[0.4, 0.4, 0.2])
# Create the target variable
value_data = np.random.randn(100) * 10 + 50
# Combine both into a single dataset
data = pd.DataFrame({'color': color_data, 'value': value_data})
print("data:{}\n".format(data))

data:   color      value
0     蓝色  64.725733
1     蓝色  34.763842
2     红色  43.883487
3     红色  42.106679
4     红色  48.876011
..   ...        ...
95    蓝色  64.521983
96    蓝色  57.808631
97    绿色  44.561463
98    蓝色  61.872843
99    蓝色  49.776371

[100 rows x 2 columns]

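The data after one-hot encoding:

# One-hot encode the color column with pandas' get_dummies
# dtype=int yields 0/1 columns; without it, pandas 2.0+ produces True/False
onehot = pd.get_dummies(data['color'], prefix='color', dtype=int)
# Concatenate the encoded features with the target variable
data_onehot = pd.concat([data['value'], onehot], axis=1)
print("Data after one-hot encoding:\n{}".format(data_onehot))

Data after one-hot encoding: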
        value  color_红色  color_绿色  color_蓝色
0   64.725733         0         0         1
1   34.763842         0         0         1
2   43.883487         1         0         0
3   42.106679         1         0         0
4   48.876011         1         0         0
..        ...       ...       ...       ...
95  64.521983         0         0         1
96  57.808631         0         0         1
97  44.561463         0         1         0
98  61.872843         0         0         1
99  49.776371         0         0         1

[100 rows x 4 columns]

Split into training and test sets, then fit and evaluate a linear regression model:

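from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Split the dataset into training and test sets
X = data_onehot.drop('value', axis=1)
y = data_onehot['value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model and fit it on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set and evaluate performance
y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)
print("R-squared score:", score)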

R-squared score: -0.12562920003687705
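A negative R² can look surprising. One plausible reading is that, because `value` was drawn independently of `color` in the data above, the color columns carry no signal, and the fitted model does slightly worse on the test set than simply predicting the test-set mean. A minimal sketch of the usual definition, R² = 1 - SS_res / SS_tot, written by hand (the helper name `r2_by_hand` and the toy numbers are illustrative only):

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the usual coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [40.0, 50.0, 60.0]

# Predicting the mean of y_true everywhere gives exactly 0
print(r2_by_hand(y_true, [50.0, 50.0, 50.0]))  # 0.0

# Predictions that are further off than the mean give a negative score
print(r2_by_hand(y_true, [55.0, 50.0, 45.0]))  # -1.25
```

Under this reading, a score close to zero or slightly negative is what one would expect when the features are unrelated to the target.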

Jay's Blog
