K-means是一种聚类算法,用于将一组样本分成预定数量的簇。它通过计算样本之间的距离,将它们分配到最近的簇中,然后根据分配的结果,更新簇的中心位置。这个过程迭代进行,直到簇的中心位置不再变化或达到预定的迭代次数。

K-means算法的主要步骤包括:

  1. 随机选择K个簇中心点(代表每个簇的点)
  2. 对每个样本,计算其与每个簇中心点之间的距离,并将其分配到最近的簇中
  3. 根据分配的结果,更新每个簇的中心位置(为该簇中所有样本的平均值)
  4. 重复步骤2和3,直到簇的中心位置不再变化或达到预定的迭代次数

K-means算法的目标是最小化簇内样本之间的方差,同时最大化簇与簇之间的距离,以达到有效的聚类效果。它是一种简单且高效的聚类算法,常用于数据挖掘、图像处理和模式识别等领域。

一、数据 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import folium #visualize map
import time
from sklearn.cluster import KMeans #k-means clustering
from yellowbrick.cluster import KElbowVisualizer #Elbow visualize K-means计算肘部那个点可视化
import warnings
warnings.filterwarnings('ignore')##忽略警告

df_ori = pd.read_csv('uber-raw-data-apr14.csv')

 

df_ori.info()
'''结果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564516 entries, 0 to 564515
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date/Time  564516 non-null  object 
 1   Lat        564516 non-null  float64
 2   Lon        564516 non-null  float64
 3   Base       564516 non-null  object 
dtypes: float64(2), object(2)
memory usage: 17.2+ MB
'''

df_ori['Base'].unique()
#结果:array(['B02512', 'B02598', 'B02617', 'B02682', 'B02764'], dtype=object)

clus_k_ori = df_ori[['Lat', 'Lon']]

二、找最佳k

start = time.time()
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model_ori = KMeans()
visualizer = KElbowVisualizer(model_ori, k = (1, 18)) #k = 1 to 17
visualizer.fit(clus_k_ori)
visualizer.show()
time2=time

end = time.time()
end-start

三、建模

kmeans_ori = KMeans(n_clusters = 6, random_state = 0) #k = 6

centroids_k_ori = kmeans_ori.cluster_centers_

clocation_k_ori = pd.DataFrame(centroids_k_ori, columns = ['Latitude', 'Longitude'])
'''结果:热点中心点经纬度
 	Latitude 	Longitude
0 	40.731107 	-73.998625
1 	40.659619 	-73.774090
2 	40.688595 	-73.965538
3 	40.765525 	-73.972847
4 	40.700541 	-74.201673
5 	40.798126 	-73.869032
'''

plt.scatter(clocation_k_ori['Latitude'], clocation_k_ori['Longitude'], marker = "x", color = 'r', s = 200)

 ​​​​​​​

label_k_ori = kmeans_ori.labels_
df_new_k = df_ori.copy()
df_new_k['Clusters'] = label_k_ori

count_2 = 0
count_0 = 0
for value in df_new_k['Clusters']:
    if value == 2:
        count_2 += 1
    if value == 0:
        count_0 += 1
print(count_0, count_2)
#结果:249352 59942

sns.catplot(data = df_new_k, x = "Clusters", kind = "count", height = 7, aspect = 2)

##可以看出就算是热点,但是数量不一定一致,差异也可能很大

new_location_ori = [(40.76, -73.99)]#拿一个点预测
kmeans_ori.predict(new_location_ori)
#结果:array([3])

修改时间格式:

df_ori.columns = ['timestamp', 'lat', 'lon', 'base']#修改时间格式

import time
ti = time.time()

df_ori['timestamp'] = pd.to_datetime(df_ori['timestamp'])

tf = time.time()
print(tf-ti,' seconds.')

df_ori['weekday'] = df_ori.timestamp.dt.weekday
df_ori['month'] = df_ori.timestamp.dt.month
df_ori['day'] = df_ori.timestamp.dt.day
df_ori['hour'] = df_ori.timestamp.dt.hour
df_ori['minute'] = df_ori.timestamp.dt.minute

根据时间分布 :

## Hourly Ride Data每小时派单量
## groupby operation
hourly_ride_data = df_ori.groupby(['day','hour','weekday'])['timestamp'].count()

## reset index
hourly_ride_data = hourly_ride_data.reset_index()

## rename column
hourly_ride_data = hourly_ride_data.rename(columns = {'timestamp':'ride_count'})

## ocular analysis
hourly_ride_data

 

## Weekday Hourly Averages 每周每小时平均派单量
## groupby operation
weekday_hourly_avg = hourly_ride_data.groupby(['weekday','hour'])['ride_count'].mean()

## reset index
weekday_hourly_avg = weekday_hourly_avg.reset_index()

## rename column
weekday_hourly_avg = weekday_hourly_avg.rename(columns = {'ride_count':'average_rides'})

## sort by categorical index
weekday_hourly_avg = weekday_hourly_avg.sort_index()

## ocular analysis
weekday_hourly_avg

 

##Define Color Palette
tableau_color_blind = [(0, 107, 164), (255, 128, 14), (171, 171, 171), (89, 89, 89),
             (95, 158, 209), (200, 82, 0), (137, 137, 137), (163, 200, 236),
             (255, 188, 121), (207, 207, 207)]

for i in range(len(tableau_color_blind)):  
    r, g, b = tableau_color_blind[i]  
    tableau_color_blind[i] = (r / 255., g / 255., b / 255.)
## create figure
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)

## set palette   
current_palette = sns.color_palette(tableau_color_blind)

## plot data
sns.pointplot(ax=ax, x='hour',y='average_rides',hue='weekday', 
              palette = current_palette, data = weekday_hourly_avg)

## clean up the legend
l = ax.legend()
l.set_title('')

## format plot labels
ax.set_title('Weekday Averages for April 2014', fontsize=30)
ax.set_ylabel('Rides per Hour', fontsize=20)
ax.set_xlabel('Hour', fontsize=20)

散点图分布: 

%matplotlib inline

plt.figure(figsize=(16, 12))

plt.plot(df_ori.lon, df_ori.lat, '.', ms=.8, alpha=.5)

plt.ylim(bottom=40.5,top=41)
plt.xlim(left=-74.4,right=-73.5)

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('New York Uber Pickups 2014')

plt.show()

 

Logo

CSDN联合极客时间,共同打造面向开发者的精品内容学习社区,助力成长!

更多推荐