8. Clustering Algorithms
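Introduction

Applications


Concepts

The main difference between clustering and classification
Clustering is an unsupervised learning method, whereas classification is a supervised learning method.
API

Case study


Workflow analysis

Import dependencies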
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
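Create the data
# x holds the sample features and y the true cluster label of each sample:
# 1000 samples, 2 features each, 4 clusters.
# Cluster centers are [-1,-1], [0,0], [1,1], [2,2]; cluster_std gives the
# standard deviation of each cluster: [0.4, 0.2, 0.2, 0.2].
x, y = make_blobs(n_samples=1000,
                  n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2],
                  random_state=9)
x, y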

Visualize the data
plt.scatter(x[:,0],x[:,1],marker="o")
plt.show()

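Train K-means with 2 clusters and visualize the result
y_pre = KMeans(n_clusters=2, random_state=9).fit_predict(x)
# Visualize the predicted clusters
plt.scatter(x[:, 0], x[:, 1], marker="o", c=y_pre)
plt.show()
# Evaluate the clustering with the Calinski-Harabasz (CH) score
print(calinski_harabasz_score(x, y_pre))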

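Train K-means with 3 clusters and visualize the result
y_pre = KMeans(n_clusters=3, random_state=9).fit_predict(x)
# Visualize the predicted clusters
plt.scatter(x[:, 0], x[:, 1], marker="o", c=y_pre)
plt.show()
# Evaluate the clustering with the Calinski-Harabasz (CH) score
print(calinski_harabasz_score(x, y_pre))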

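Train K-means with 4 clusters and visualize the result
y_pre = KMeans(n_clusters=4, random_state=9).fit_predict(x)
# Visualize the predicted clusters
plt.scatter(x[:, 0], x[:, 1], marker="o", c=y_pre)
plt.show()
# Evaluate the clustering with the Calinski-Harabasz (CH) score
print(calinski_harabasz_score(x, y_pre))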

Clustering algorithm workflow

K-means workflow
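As a rough illustration of the K-means workflow, the sketch below uses the common assign-then-update loop in plain NumPy. The function name kmeans_sketch, the toy data, and the stopping rule are assumptions made for this example; scikit-learn's KMeans is more elaborate.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain K-means loop: assign points, recompute centers, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): two well-separated groups of three points.
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels_demo, centers_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centers_demo)
```

Which group receives label 0 depends on the random initial centers, so only the grouping itself is meaningful.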


Model evaluation
Sum of squared errors (SSE, the sum of squares due to error)
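To make the definition concrete, here is a minimal sketch that computes SSE as the sum of squared distances from each sample to the center of its assigned cluster. The helper name sse and the four toy points are assumptions for illustration.

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

# Toy data (hypothetical): two clusters with two points each.
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels_demo = np.array([0, 0, 1, 1])
# Each center is the mean of its cluster: (1, 0) and (10, 2).
centers_demo = np.array([X_demo[labels_demo == j].mean(axis=0) for j in range(2)])

print(sse(X_demo, labels_demo, centers_demo))  # 10.0
```

A fitted scikit-learn KMeans model exposes the same quantity as its inertia_ attribute.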




The "elbow" method for choosing K
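One way to apply the elbow method is to fit K-means for a range of K values, record the SSE of each fit, and look for the point where the curve stops dropping sharply. The sketch below assumes synthetic data similar to the case study above; the range of K, the sample count, and n_init=10 are arbitrary choices for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four true clusters (assumed for illustration).
X, _ = make_blobs(n_samples=300,
                  n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2],
                  random_state=9)

sse_by_k = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=9).fit(X)
    # inertia_ is the SSE of the fitted model.
    sse_by_k[k] = model.inertia_

for k, value in sse_by_k.items():
    print(k, round(value, 2))
```

Plotting sse_by_k against K (for example with plt.plot) makes the bend easier to spot than reading the printed numbers.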

Silhouette coefficient
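For a single sample, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The average over all samples lies in [-1, 1], and higher is better. The sketch below compares several K values with scikit-learn's silhouette_score; the data and the range of K are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four true clusters (assumed for illustration).
X, _ = make_blobs(n_samples=300,
                  n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2],
                  random_state=9)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=9).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("K with the highest silhouette score:", best_k)
```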




CH index (Calinski-Harabasz index)
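The CH index is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom: (B / (k - 1)) / (W / (n - k)). A larger value indicates tighter, better-separated clusters. The sketch below computes it by hand on toy data; the helper name ch_index and the data are assumptions for this example.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    cluster_ids = np.unique(labels)
    n, k = len(X), len(cluster_ids)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for j in cluster_ids:
        members = X[labels == j]
        center = members.mean(axis=0)
        # Between-cluster term: size-weighted squared distance to the overall mean.
        between += len(members) * np.sum((center - overall_mean) ** 2)
        # Within-cluster term: squared distances to the cluster's own center.
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Toy data (hypothetical): two clusters with two points each.
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels_demo = np.array([0, 0, 1, 1])
print(ch_index(X_demo, labels_demo))  # 17.0
```

Here B = 85 and W = 10, so the index is (85 / 1) / (10 / 2) = 17. This is the ratio that calinski_harabasz_score from sklearn.metrics reports.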


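Summary

Algorithm optimization
Advantages of K-means

Disadvantages of K-means

Canopy algorithm for initial clustering
Workflow of Canopy combined with initial clustering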
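A rough sketch of the usual Canopy procedure follows, assuming two distance thresholds T1 > T2: points within T1 of a chosen center join its canopy, and points within T2 are removed from further consideration. The function name canopy_sketch, the thresholds, and the choice of always taking the first remaining point as the next center are assumptions for this example (many descriptions pick the center at random).

```python
import numpy as np

def canopy_sketch(X, t1, t2):
    """Return a list of (center_index, member_indices) canopies."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(range(len(X)))
    canopies = []
    while remaining:
        center = remaining[0]  # assumption: take the first remaining point
        dists = np.linalg.norm(X[remaining] - X[center], axis=1)
        # Loose threshold t1: every remaining point this close joins the canopy.
        members = [i for i, d in zip(remaining, dists) if d <= t1]
        canopies.append((center, members))
        # Tight threshold t2: points this close can no longer start a canopy.
        remaining = [i for i, d in zip(remaining, dists) if d > t2]
    return canopies

# Toy data (hypothetical): two well-separated groups of three points.
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
canopies_demo = canopy_sketch(X_demo, t1=3.0, t2=1.5)
print(canopies_demo)
```

The number of canopies can then serve as K, and the canopy centers as initial centers for K-means.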

Advantages and disadvantages

K-means++
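K-means++ changes only how the initial centers are chosen: the first center is picked uniformly at random, and each later center is sampled with probability proportional to its squared distance from the nearest center already chosen. The sketch below shows that seeding step alone; the function name kmeans_pp_init and the toy data are assumptions for illustration.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers with K-means++ style seeding."""
    rng = np.random.default_rng(seed)
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Far-away points are more likely to become the next center;
        # already chosen points have probability 0.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# Toy data (hypothetical): two well-separated groups of three points.
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
init_centers = kmeans_pp_init(X_demo, k=2)
print(init_centers)
```

scikit-learn's KMeans uses this style of seeding by default (init="k-means++").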


Bisecting k-means
Workflow
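A minimal sketch of one common bisecting k-means variant: start with all points in one cluster, then repeatedly split one cluster in two with ordinary 2-means until K clusters exist. Splitting the cluster with the largest SSE is an assumption of this example; other descriptions pick the split that lowers the total SSE the most. The function name and data are also hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_kmeans_sketch(X, k, seed=0):
    """Split the cluster with the largest SSE until k clusters exist."""
    labels = np.zeros(len(X), dtype=int)  # everything starts in cluster 0
    while labels.max() + 1 < k:
        n_clusters = labels.max() + 1
        # SSE of every current cluster.
        sse = [np.sum((X[labels == j] - X[labels == j].mean(axis=0)) ** 2)
               for j in range(n_clusters)]
        target = int(np.argmax(sse))
        idx = np.where(labels == target)[0]
        # Bisect the chosen cluster with plain 2-means.
        halves = KMeans(n_clusters=2, n_init=10,
                        random_state=seed).fit_predict(X[idx])
        # One half keeps the old label, the other gets a new one.
        labels[idx[halves == 1]] = n_clusters
    return labels

X, _ = make_blobs(n_samples=200,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=0.2,
                  random_state=9)
labels_demo = bisecting_kmeans_sketch(X, k=4)
print(np.bincount(labels_demo))
```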



k-medoids (k-medoid clustering)

Kernel k-means (for reference)

ISODATA (for reference)

Mini Batch K-Means (for reference)

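Feature dimensionality reduction

Dimensionality reduction: feature selection

Methods

Low-variance feature filtering

API

threshold: the variance threshold; features whose variance is not above it are removed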
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
data = pd.read_csv("factor_returns.csv")
print(data.head())
print(data.shape)
# Instantiate the transformer
transfer = VarianceThreshold(threshold=2)
# Fit and transform columns 1 through 9
transfer_data = transfer.fit_transform(data.iloc[:, 1:10])
print(transfer_data)
print(data.iloc[:,1:10].shape)
print(transfer_data.shape)
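The snippet above depends on factor_returns.csv. As a self-contained illustration of the same API, the sketch below applies VarianceThreshold to a small made-up array in which the first and last columns are constant; the array and the threshold of 0.0 are assumptions for this example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: columns 0 and 3 never change, so their variance is 0.
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

selector = VarianceThreshold(threshold=0.0)
X_kept = selector.fit_transform(X)

print(selector.variances_)     # per-column variance seen during fit
print(selector.get_support())  # boolean mask of the columns that were kept
print(X_kept)
```

Only the two middle columns survive, so X_kept has shape (3, 2).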

Pearson correlation coefficient

Example


Characteristics

API

from scipy.stats import pearsonr
x1=[12.5,15.3,23.2,26.4,33.5,34.4,39.4,45.2,55.4,60.9]
x2=[21.2,23.9,32.9,34.1,42.5,43.2,49.0,52.8,59.4,63.5]
ret = pearsonr(x1,x2)
print("Pearson correlation coefficient result:\n", ret)
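To show what pearsonr measures, the sketch below recomputes the coefficient from its definition, the covariance of the two variables divided by the product of their standard deviations, and cross-checks it against np.corrcoef. The variable names r_manual and r_numpy are introduced here for illustration only.

```python
import numpy as np

x1 = np.array([12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9])
x2 = np.array([21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5])

# Deviations from the mean of each variable.
dx, dy = x1 - x1.mean(), x2 - x2.mean()
# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
# Cross-check with NumPy's correlation matrix.
r_numpy = np.corrcoef(x1, x2)[0, 1]

print(r_manual, r_numpy)
```

The result is close to 1, which indicates a strong positive linear relationship between the two series.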

Spearman rank correlation coefficient (Rank IC)

Example

Characteristics

API

Case study
from scipy.stats import spearmanr
x1=[12.5,15.3,23.2,26.4,33.5,34.4,39.4,45.2,55.4,60.9]
x2=[21.2,23.9,32.9,34.1,42.5,43.2,49.0,52.8,59.4,63.5]
ret = spearmanr(x1,x2)
print("Spearman correlation coefficient result:\n", ret)
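The Spearman coefficient depends only on the rank order of the values. When there are no ties it can be written as 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between the two ranks of each sample. The sketch below applies that formula to the same data; the helper name ranks is an assumption for this example, and it does not handle ties.

```python
import numpy as np

def ranks(values):
    """Rank the values from 1 (smallest) to n (largest); assumes no ties."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

x1 = np.array([12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9])
x2 = np.array([21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5])

d = ranks(x1) - ranks(x2)  # rank difference for each sample
n = len(x1)
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(rho)  # 1.0
```

Both series are strictly increasing, so every rank difference is zero and the coefficient is exactly 1, even though the relationship is not perfectly linear.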

Dimensionality reduction: principal component analysis (can be viewed as feature extraction)

API

from sklearn.decomposition import PCA
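data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
# A float n_components keeps enough components to retain that fraction of the variance
transfer = PCA(n_components=0.9)
trans_data = transfer.fit_transform(data)
print("Data after keeping 90% of the variance:\n", trans_data)
# An integer n_components keeps that many components
transfer = PCA(n_components=3)
trans_data = transfer.fit_transform(data)
print("Data after reducing to 3 components:\n", trans_data)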
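To see why a float n_components leads to a particular number of output columns, it can help to inspect the fitted model. The sketch below reuses the same three-sample data and prints the number of components kept and the share of variance each one explains; the variable name pca is introduced here for illustration.

```python
from sklearn.decomposition import PCA

data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

# Keep the smallest number of components that retains at least 90% of the variance.
pca = PCA(n_components=0.9).fit(data)

print(pca.n_components_)                   # how many components were kept
print(pca.explained_variance_ratio_)       # variance share of each kept component
print(pca.explained_variance_ratio_.sum()) # total share retained
```

With only three samples the centered data has rank at most 2, so no more than two components can carry variance.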

Case study: segmenting users by product-category preference with PCA and K-means

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
Load the data
order_product = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
aisles = pd.read_csv("aisles.csv")
order_product.head()

products.head()

orders.head()

aisles.head()

Basic data processing
# Merge the tables on their shared key columns
table1 = pd.merge(order_product, products, on="product_id")
table2 = pd.merge(table1, orders, on="order_id")
table = pd.merge(table2, aisles, on="aisle_id")
table.shape

table.head()

# Cross-tabulate users against aisles (one row per user, one column per aisle)
data = pd.crosstab(table["user_id"],table["aisle"])
data.shape

data.head()

# Keep only the first 1000 users
new_data = data[:1000]
Feature engineering: PCA
transfer = PCA(n_components=0.9)
trans_data = transfer.fit_transform(new_data)
trans_data.shape

Machine learning (K-means)
estimator = KMeans(n_clusters=5)
pre_data = estimator.fit_predict(trans_data)
Model evaluation
silhouette_score(trans_data, pre_data)
