Machine Learning (2): The sklearn Data Preprocessing Library

稻城亚丁途 · Published 2022-03-16 22:11:51

Data preprocessing library: sklearn

import sklearn.preprocessing as sp

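scikit-learn is a machine learning library for Python that offers:
	simple and efficient tools for data analysis
	reusability across a variety of contexts
	a foundation on NumPy, SciPy, and matplotlib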

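1. Mean removal

api: sp.scale()
Mean removal, also known as standardization, rescales every column of the sample matrix to a mean of 0 and a standard deviation of 1. This reduces the differences in scale between features, which makes learning easier.
Code example: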

import numpy as np
import sklearn.preprocessing as sp

"""
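Mean removal (standardization)
Mean of an array: np.mean(x)
Mean of a matrix: np.mean(x, axis=0). Omitting axis averages all elements; axis=0 gives the mean of each column; axis=1 gives the mean of each row.
Standard deviation of an array: np.std(x, ddof=1). ddof=0 (the default when ddof is omitted) gives the population standard deviation; ddof=1 gives the sample standard deviation.
Standard deviation of a matrix: np.std(x, axis=1). Omitting axis gives the global standard deviation; axis=0 gives the standard deviation of each column; axis=1 gives that of each row.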
"""
# Shift the elements so that their mean is 0
ages = [17, 20, 23]
mean = np.mean(ages)
ages1 = [17 - mean, 20 - mean, 23 - mean]
# Rescale the elements so that their standard deviation is 1
std = np.std(ages1, ddof=0)  # population standard deviation
ages2 = [(17 - mean) / std, (20 - mean) / std, (23 - mean) / std]  # [-1.2247448713915892, 0.0, 1.2247448713915892]
print(np.mean(ages2), np.std(ages2, ddof=0))
# sp.scale() wraps the steps above
ages3 = sp.scale(ages)
print(ages3)  # [-1.22474487  0.          1.22474487]

# Matrix
"""
sp.scale(matrix, axis=0): axis=0 operates on columns; axis=1 operates on rows
"""
matrix1 = np.array([[12, 3, 32], [23, 44, 15], [8, 43, 1]])
matrixMean1 = np.mean(matrix1, axis=0)
matrixStd1 = np.std(matrix1, axis=0)
meanStd = sp.scale(matrix1)
matrixMean2 = meanStd.mean(axis=0)
matrixStd2 = meanStd.std(axis=0)
print(matrixMean1, '\n', matrixStd1, '\n', matrixMean2, '\n', matrixStd2)
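
The standardization described above can also be written as one vectorized expression and compared against sp.scale(). The following is a minimal sketch, not part of the original post; the names `manual` and `scaled` are illustrative. It also shows how ddof changes the divisor used by np.std().

```python
import numpy as np
import sklearn.preprocessing as sp

matrix = np.array([[12, 3, 32], [23, 44, 15], [8, 43, 1]], dtype=float)

# Column-wise standardization by hand: subtract each column's mean and
# divide by each column's population standard deviation (ddof=0).
manual = (matrix - matrix.mean(axis=0)) / matrix.std(axis=0, ddof=0)

# sp.scale() with its default axis=0 should produce the same matrix.
scaled = sp.scale(matrix)
print(np.allclose(manual, scaled))

# ddof sets the divisor: N for ddof=0 (NumPy's default), N - 1 for ddof=1.
ages = np.array([17, 20, 23])
population_std = np.std(ages, ddof=0)
sample_std = np.std(ages, ddof=1)
print(population_std, sample_std)
```

Because sp.scale() divides by the population standard deviation, the manual version has to use ddof=0 to match it.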

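2. Range scaling

api: sp.MinMaxScaler()
Maps the minimum and maximum of every column of the sample matrix onto the same interval, so that all features share one range. The usual target is [0, 1].
Code example: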

import sklearn.preprocessing as sp
import numpy as np

"""
Min-max scaler: MinMaxScaler()
sp.MinMaxScaler(feature_range=(num1, num2)): num1 and num2 are the bounds of the target range; feature_range may be omitted and defaults to (0, 1)
sp.MinMaxScaler(feature_range=(num1, num2)).fit_transform(original sample matrix)
"""

matrix = np.array([[12, 3, 32], [23, 44, 15], [8, 43, 1]])
mms = sp.MinMaxScaler(feature_range=(0, 1))
matrix1 = mms.fit_transform(matrix)
print(matrix1)

# Manual calculation
new_arr = []
for col in matrix.T:  # each row of matrix.T is a column of matrix
    col_min = col.min()
    col_max = col.max()
    # Linear system for the scaling x -> k*x + b:
    # k*col_min + b = 0 and k*col_max + b = 1
    A = np.array([[col_min, 1], [col_max, 1]])
    B = np.array([0, 1])
    # x = np.linalg.lstsq(A, B)[0]
    # x = np.linalg.solve(A, B)
    k, b = np.linalg.solve(A, B)
    new_col = col*k + b
    new_arr.append(new_col)
print(np.array(new_arr).T)
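
For the [0, 1] target range, solving the linear system per column is equivalent to the closed form (x - min) / (max - min). A minimal sketch of that shortcut, assuming no column is constant (otherwise the denominator is 0); the variable names are illustrative and not from the original post:

```python
import numpy as np
import sklearn.preprocessing as sp

matrix = np.array([[12, 3, 32], [23, 44, 15], [8, 43, 1]], dtype=float)

# Per-column minimum and maximum; broadcasting applies them to every row.
col_min = matrix.min(axis=0)
col_max = matrix.max(axis=0)
manual = (matrix - col_min) / (col_max - col_min)

# Compare against MinMaxScaler with the same target range.
scaled = sp.MinMaxScaler(feature_range=(0, 1)).fit_transform(matrix)
print(np.allclose(manual, scaled))
```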

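3. Normalization

api: sp.normalize()
Use this when the proportion each feature contributes within a sample matters more than its absolute value. With the l1 norm, every feature value of a sample is divided by the sum of the absolute values of that sample's features. It is commonly used when comparing samples for similarity.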

import sklearn.preprocessing as sp
import numpy as np

"""
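sp.normalize(array, norm=)
    norm: which norm to use
        l1 norm: the sum of the absolute values of the vector's elements
        l2 norm: the square root of the sum of the squares of the vector's elements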
"""
arr = np.array([[12, 3, 32], [23, 44, 15], [8, 43, 1]])
arr1 = sp.normalize(arr, norm='l1')
print(arr1)
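
To make the two norms concrete, the row-wise division can be reproduced by hand and checked against sp.normalize(), which normalizes each sample (row) by default. This is a minimal sketch with illustrative variable names, not code from the original post:

```python
import numpy as np
import sklearn.preprocessing as sp

arr = np.array([[12, 3, 32], [23, 44, 15], [8, 43, 1]], dtype=float)

# l1: divide each row by the sum of the absolute values in that row.
l1_manual = arr / np.abs(arr).sum(axis=1, keepdims=True)

# l2: divide each row by the square root of the sum of squares in that row.
l2_manual = arr / np.sqrt((arr ** 2).sum(axis=1, keepdims=True))

print(np.allclose(l1_manual, sp.normalize(arr, norm='l1')))
print(np.allclose(l2_manual, sp.normalize(arr, norm='l2')))
```

After l1 normalization the absolute values in each row sum to 1; after l2 normalization each row has unit Euclidean length.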

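4. Binarization

Pick a threshold and map every value to 0 or 1: values above the threshold become 1 and all others become 0. This simplifies the model and is commonly used in image processing.
Numeric code example: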

import sklearn.preprocessing as sp
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as image

import os, sys

arr = np.array([[12, 3, 32], [23, 44, 15], [8, 43, 1]])
b = sp.Binarizer(threshold=16)
arr2 = b.transform(arr)
print(arr2)

Output:
[[0 0 1]
[1 1 0]
[0 1 0]]
The threshold can be adjusted to suit the data.
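
The thresholding rule can be checked against a plain NumPy comparison. A minimal sketch, with illustrative variable names; note that the comparison is strict, so a value equal to the threshold maps to 0:

```python
import numpy as np
import sklearn.preprocessing as sp

arr = np.array([[12, 3, 32], [23, 44, 15], [8, 43, 1]])

# Values strictly greater than the threshold become 1, the rest become 0.
manual = np.where(arr > 16, 1, 0)

binarized = sp.Binarizer(threshold=16).fit_transform(arr)
print(np.array_equal(manual, binarized))
```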

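Image-processing code example:

# Binarize an image
# Binarizer expects a 2D array, so this assumes a single-channel (grayscale) image
filename = '../../picture/blackAndWhitePicture.jpg'
picture = image.imread(filename)
b1 = sp.Binarizer(threshold=120)
res = b1.transform(picture)
print(res)
# Display the result
plt.imshow(res, cmap='gray')  # cmap selects the color map
plt.show()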

Original image: (image not included)
Output image: (image not included)

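5. One-hot encoding

api: sp.OneHotEncoder()
One-hot encoding builds, for every value of a feature, a sequence made of a single 1 and otherwise 0s. For each column, count the distinct values; if there are n of them, the first value is encoded as a 1 followed by n-1 zeros, the second as 01 followed by n-2 zeros, and so on.
It suits text-like matrices whose values are awkward to handle as plain numbers.
Code example: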

import sklearn.preprocessing as sp
import numpy as np

"""
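Option 1:
ohe = sp.OneHotEncoder(sparse=, dtype=)     create the one-hot encoder
res = ohe.fit_transform(original sample matrix)   return the one-hot encoded matrix
sparse selects the compact format, i.e. a sparse matrix; dtype sets the data type
Option 2:
ohe = sp.OneHotEncoder(sparse=, dtype=)     create the one-hot encoder
encode_dict = ohe.fit(original sample matrix)     fit on the original sample matrix; returns the fitted encoder, which holds the encoding dictionary
res = encode_dict.transform(original sample matrix)   return the one-hot encoded matrix
sparse selects the compact format, i.e. a sparse matrix; dtype sets the data type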
"""

arr = np.array([[1, 3, 2],
                [7, 5, 4],
                [1, 8, 6],
                [7, 3, 9]])

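# Create a one-hot encoder that returns a dense array
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions
ohe1 = sp.OneHotEncoder(sparse=False, dtype='int32')
res1 = ohe1.fit_transform(arr)
# Create a one-hot encoder that returns a sparse matrix
ohe2 = sp.OneHotEncoder(sparse=True, dtype='int32')
# fit() returns the fitted encoder, which holds the encoding dictionary
encode_dict = ohe2.fit(arr)
res2 = encode_dict.transform(arr)
print(res1, '\n', encode_dict, '\n', res2)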
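
To see which distinct values the encoder found in each column, the fitted encoder exposes them through its categories_ attribute. The sketch below is not from the original post; it leaves the sparse option at its default and converts the result with .toarray(), which sidesteps the parameter rename between scikit-learn versions.

```python
import numpy as np
import sklearn.preprocessing as sp

arr = np.array([[1, 3, 2],
                [7, 5, 4],
                [1, 8, 6],
                [7, 3, 9]])

# Default settings return a SciPy sparse matrix; toarray() makes it dense.
ohe = sp.OneHotEncoder(dtype=np.int32)
encoded = ohe.fit_transform(arr).toarray()

# categories_ holds the sorted distinct values found in each column.
for column_values in ohe.categories_:
    print(column_values)

# First sample [1, 3, 2]: 2 + 3 + 4 = 9 output columns in total.
print(encoded[0])
```

The three columns have 2, 3, and 4 distinct values, so each sample is encoded into 9 columns with exactly one 1 per original feature.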

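6. Label encoding

api: sp.LabelEncoder()
Assigns each string-valued feature a numeric label according to its position in the sorted sequence of distinct values, so that models built on numeric algorithms can use it.
Code example: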

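# Label encoding
import sklearn.preprocessing as sp
import numpy as np

"""
lc = sp.LabelEncoder()
res = lc.fit_transform(original feature array)      fit and encode
original feature array = lc.inverse_transform(res)  invert: look the codes up in the fitted mapping to recover the original feature array
"""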

arr1 = np.array(['audi', 'ford', 'audi', 'toyota', 'ford'])
lc = sp.LabelEncoder()
res = lc.fit_transform(arr1)
arr2 = lc.inverse_transform(res)
print(res, '\n', arr2)
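
The mapping the encoder learned can be read from its classes_ attribute, where the index of each entry is the numeric label. A minimal sketch with illustrative names (`codes`, `mapping`), not code from the original post:

```python
import numpy as np
import sklearn.preprocessing as sp

arr = np.array(['audi', 'ford', 'audi', 'toyota', 'ford'])

lc = sp.LabelEncoder()
codes = lc.fit_transform(arr)

# classes_ is sorted; the position of each class is its numeric label.
mapping = {str(label): code for code, label in enumerate(lc.classes_)}
print(mapping)
print(codes)
```

Because the labels follow sorted order, they carry no meaning beyond identity; a model that treats them as magnitudes may read an ordering into them that the data does not have.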