groupby和聚合函数进行分组与聚合

分组与聚合通常是分析数据的一种方式，通常与一些统计函数一起使用，查看数据的分组情况什么是分组与聚合？分组APIDataFrame.groupby(key, as_index=False)key:分组的列数据，可以多个案例:不同颜色的不同笔的价格数据import pandas as pdimport numpy as npimport matplotlib.pyplot as pltcol =pd.

缘源园

1006人浏览 · 2021-04-06 20:03:12

缘源园 · 2021-04-06 20:03:12 发布

分组与聚合通常是分析数据的一种方式，通常与一些统计函数一起使用，查看数据的分组情况

1、什么是分组与聚合？

分组聚合原理

2、分组API

DataFrame.groupby(key, as_index=False)
- key:分组的列数据，可以多个
案例:不同颜色的不同笔的价格数据

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

col =pd.DataFrame({'color': ['white','red','green','red','green'],
 'object'：['pen','pencil','pencil','ashtray','pen'],
'price1':[5.56,4.20,1.30,0.56,2.75],
'price2':[4.75,4.12,1.60,0.75,3.15]})

col

	color	object	price1	price2
0	white	pen	5.56	4.75
1	red	pencil	4.20	4.12
2	green	pencil	1.30	1.60
3	red	ashtray	0.56	0.75
4	green	pen	2.75	3.15

进行分组，对颜色分组，price进行聚合

# 分组，求平均值
col.groupby(['color'])['price1'].mean()
'''
color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64
'''

col['price1'].groupby(col['color']).mean()
'''
color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64
'''

# 分组，数据的结构不变
col.groupby(['color'], as_index=False)['price1'].mean()
	color	price1
0	green	2.025
1	red	2.380
2	white	5.560

3、星巴克零售店铺数据

现在我们有一组关于全球星巴克店铺的统计数据，如果我想知道美国的星巴克数量和中国的哪个多，或者我想知道中国每个省份星巴克的数量的情况，那么应该怎么办？数据来源：https://www.kaggle.com/starbucks/store-locations/data

3.1 数据获取

# 导入星巴克店的数据
starbucks = pd.read_csv("./data/starbucks/directory.csv")

starbucks.head()

	Brand	Store Number	Store Name	Ownership Type	Street Address	City	State/Province	Country	Postcode	Phone Number	Timezone	Longitude	Latitude
0	Starbucks	47370-257954	Meritxell, 96	Licensed	Av. Meritxell, 96	Andorra la Vella	7	AD	AD500	376818720	GMT+1:00 Europe/Andorra	1.53	42.51
1	Starbucks	22331-212325	Ajman Drive Thru	Licensed	1 Street 69, Al Jarf	Ajman	AJ	AE	NaN	NaN	GMT+04:00 Asia/Dubai	55.47	25.42
2	Starbucks	47089-256771	Dana Mall	Licensed	Sheikh Khalifa Bin Zayed St.	Ajman	AJ	AE	NaN	NaN	GMT+04:00 Asia/Dubai	55.47	25.39
3	Starbucks	22126-218024	Twofour 54	Licensed	Al Salam Street	Abu Dhabi	AZ	AE	NaN	NaN	GMT+04:00 Asia/Dubai	54.38	24.48
4	Starbucks	17127-178586	Al Ain Tower	Licensed	Khaldiya Area, Abu Dhabi Island	Abu Dhabi	AZ	AE	NaN	NaN	GMT+04:00 Asia/Dubai	54.54	24.51

3.2 进行分组聚合

# 按照国家分组，求出每个国家的星巴克零售店数量
count = starbucks.groupby(['Country']).count()

画图显示结果

count['Brand'].plot(kind='bar', figsize=(20, 8))  #计算brand的数据
plt.show()

count['Brand'].sort_values(ascending=False)[:20]
'''
Country
US    13608
CN     2734
CA     1468
JP     1237
KR      993
GB      901
MX      579
TW      394
TR      326
PH      298
TH      289
ID      268
MY      234
DE      160
AE      144
FR      132
SG      130
RU      109
AR      108
KW      106
Name: Brand, dtype: int64
'''

count['Brand'].sort_values().plot(kind='bar', figsize=(20, 8))  #计算brand的数据
plt.show()

假设我们加入省市一起进行分组

# 设置多个索引，set_index()
starbucks.groupby(['Country', 'State/Province']).count()

Brand	Store Number	Store Name	Ownership Type	Street Address	City	Postcode	Phone Number	Timezone	Longitude	Latitude
Country	State/Province
AD	7	1	1	1	1	1	1	1	1	1	1	1
AE	AJ	2	2	2	2	2	2	0	0	2	2	2
	AZ	48	48	48	48	48	48	7	20	48	48	48
	DU	82	82	82	82	82	82	16	50	82	82	82
	FU	2	2	2	2	2	2	1	0	2	2	2
...	...	...	...	...	...	...	...	...	...	...	...	...
US	WV	25	25	25	25	25	25	25	23	25	25	25
US	WY	23	23	23	23	23	23	23	22	23	23	23
VN	HN	6	6	6	6	6	6	6	6	6	6	6
VN	SG	19	19	19	19	19	19	19	17	19	19	19
ZA	GT	3	3	3	3	3	3	3	2	3	3	3