Data Aggregation and Grouping in Python
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns  # import the seaborn library under the alias sns
# IPython/Jupyter magic: render plots inline, so the plt.show() step can be omitted
%matplotlib inline
In [2]: pd.set_option("mode.chained_assignment", None)  # turn off the chained-assignment warning
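To load the example data from a local copy:
1. Download the official example datasets from GitHub: https://github.com/mwaskom/seaborn-data/
2. Find the local directory that load_dataset() reads from. get_data_home() returns that directory: call sns.utils.get_data_home() and it shows a path of the following form:
<your drive>:\Users\<your username>\seaborn-data
For example: 'C:\Users\user1\seaborn-data'
3. Unzip the downloaded archive and copy its contents into that directory.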
In [3]: tips = sns.load_dataset("tips")  # load_dataset("tips") looks for tips.csv in the local data directory first
tips.head()
Out[3]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Data grouping
Grouping with groupby
In [4]: grouped = tips["tip"].groupby(tips["sex"])
grouped  # grouped is a GroupBy object: it only stores intermediate grouping data, nothing is computed yet
Out[4]:
In [5]: grouped.mean()  # calling mean() on the GroupBy object returns the per-group result
Out[5]:
sex
Male      3.089618
Female    2.833448
Name: tip, dtype: float64
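The split-then-aggregate pattern above can be tried without downloading anything. The sketch below uses a small hand-made table; the values are invented for illustration and are not taken from tips.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "tip": [3.0, 2.0, 5.0, 4.0],
})

# Step 1: split. Nothing is computed yet; this only records the grouping.
grouped = toy["tip"].groupby(toy["sex"])

# Step 2: aggregate. The result is a Series indexed by the group keys.
tip_mean = grouped.mean()
print(tip_mean)
```

Because the keys here are plain strings, the result is ordered alphabetically by key; with categorical keys, as in tips, the category order is used instead.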
In [7]: date_mean = tips["tip"].groupby([tips["day"], tips["time"]]).mean()  # group by several keys (day and time) and compute the mean tip
date_mean
Out[7]:
day   time
Thur  Lunch     2.767705
      Dinner    3.000000
Fri   Lunch     2.382857
      Dinner    2.940000
Sat   Dinner    2.993103
Sun   Dinner    3.255132
Name: tip, dtype: float64
In [8]: date_mean.plot(kind="barh")  # barh draws a horizontal bar chart
Out[8]:
In [9]: tips.dtypes
Out[9]:
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
In [14]: for name, group in tips.groupby(tips["sex"]):
    print(name)
    print(group)
Male
     total_bill   tip   sex smoker  day    time  size
1         10.34  1.66  Male     No  Sun  Dinner     3
2         21.01  3.50  Male     No  Sun  Dinner     3
3         23.68  3.31  Male     No  Sun  Dinner     2
5         25.29  4.71  Male     No  Sun  Dinner     4
6          8.77  2.00  Male     No  Sun  Dinner     2
..          ...   ...   ...    ...  ...     ...   ...
236       12.60  1.00  Male    Yes  Sat  Dinner     2
237       32.83  1.17  Male    Yes  Sat  Dinner     2
239       29.03  5.92  Male     No  Sat  Dinner     3
241       22.67  2.00  Male    Yes  Sat  Dinner     2
242       17.82  1.75  Male     No  Sat  Dinner     2

[157 rows x 7 columns]
Female
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
11        35.26  5.00  Female     No   Sun  Dinner     4
14        14.83  3.02  Female     No   Sun  Dinner     2
16        10.33  1.67  Female     No   Sun  Dinner     3
..          ...   ...     ...    ...   ...     ...   ...
226       10.09  2.00  Female    Yes   Fri   Lunch     2
229       22.12  2.88  Female    Yes   Sat  Dinner     2
238       35.83  4.67  Female     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[87 rows x 7 columns]
In [15]: tips.groupby(tips["sex"]).size()  # size() returns the number of rows in each group
Out[15]:
sex
Male      157
Female     87
dtype: int64
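size() and count() (shown next) give the same numbers for tips only because the dataset has no missing values. A minimal sketch of the difference, on invented data with one missing tip:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing tip
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female"],
    "tip": [3.0, np.nan, 2.0],
})

sizes = toy.groupby("sex").size()           # rows per group, NaN included
counts = toy.groupby("sex")["tip"].count()  # non-null tips per group
print(sizes)
print(counts)
```

Here the Male group has two rows but only one non-null tip, so the two results differ for that group.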
In [16]: tips.groupby(tips["sex"]).count()
Out[16]:
        total_bill  tip  smoker  day  time  size
sex
Male           157  157     157  157   157   157
Female          87   87      87   87    87    87

Grouping by column name
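In [19]: smoker_mean = tips.groupby("smoker").mean(numeric_only=True)  # numeric_only=True skips the categorical columns, which recent pandas versions refuse to average
smoker_mean
Out[19]:
        total_bill       tip      size
smoker
Yes      20.756344  3.008710  2.408602
No       19.188278  2.991854  2.668874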
In [21]: smoker_mean["tip"].plot(kind="bar")
Out[21]:
In [24]: size_mean1 = tips["tip"].groupby(tips["size"]).mean()
size_mean1
Out[24]:
size
1    1.437500
2    2.582308
3    3.393158
4    4.135405
5    4.028000
6    5.225000
Name: tip, dtype: float64
In [25]: size_mean2 = tips.groupby("size")["tip"].mean()  # syntactic sugar: group by column name, then select the column
size_mean2
Out[25]:
size
1    1.437500
2    2.582308
3    3.393158
4    4.135405
5    4.028000
6    5.225000
Name: tip, dtype: float64
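The two spellings above produce the same Series. A small check on invented data (the toy values are assumptions, not rows from tips):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"size": [1, 2, 2, 3], "tip": [1.0, 2.0, 4.0, 3.5]})

by_series = toy["tip"].groupby(toy["size"]).mean()  # group a Series by another Series
by_name = toy.groupby("size")["tip"].mean()         # group by column name, then select the column
same = by_series.equals(by_name)
print(same)
```

The column-name form is usually easier to read, and it extends naturally to selecting several columns at once.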
In [27]: size_mean2.plot()
Out[27]:
In [29]: df = DataFrame(np.arange(16).reshape(4, 4))
df
Out[29]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

Grouping by list or tuple
In [30]: list1 = ["a","b","a","b"]
In [32]: df.groupby(list1).sum()
Out[32]:
    0   1   2   3
a   8  10  12  14
b  16  18  20  22

Grouping by dict
In [33]: df = DataFrame(np.random.normal(size=(6, 6)), index=["a", "b", "c", "A", "B", "C"])
df
Out[33]:
          0         1         2         3         4         5
a  0.031512 -0.896280 -0.000981  0.558886 -1.574150  0.030435
b  0.774907  0.020968  0.575220 -0.566894  1.326251  0.775521
c  1.437972 -0.699240 -1.064924  0.235661  1.841803  1.238480
A -1.756554  0.652186  1.149668  0.192652  2.202044  0.366539
B -0.575227  0.299196 -0.120483 -2.665255  0.432872  1.627597
C  0.481407 -0.983928  1.270371 -1.581129 -1.568339 -2.122324
In [34]: dict1 = {
    "a": "one",
    "A": "one",
    "b": "two",
    "B": "two",
    "c": "three",
    "C": "three"
}
In [35]: df.groupby(dict1).sum()
Out[35]:
              0         1         2         3         4         5
one   -1.725042 -0.244095  1.148687  0.751538  0.627894  0.396974
three  1.919380 -1.683169  0.205448 -1.345468  0.273464 -0.883844
two    0.199680  0.320164  0.454738 -3.232148  1.759122  2.403117

Grouping by function
In [37]: df = DataFrame(np.random.randn(4, 4))
df
Out[37]:
          0         1         2         3
0  0.803694 -1.242886  0.393840 -1.137829
1  1.048137 -0.931402 -0.262153  0.609839
2  0.135432  0.739250 -1.685265  1.562063
3 -0.863777 -0.687589  1.901485 -0.224359
In [38]: def jug(x):
    if x >= 0:
        return "a"
    else:
        return "b"
In [41]: df[3].groupby(df[3].map(jug)).sum()
Out[41]:
3
a    2.171902
b   -1.362188
Name: 3, dtype: float64
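The three kinds of key (list, dict, function) can be compared side by side. In this sketch the function is passed directly to groupby, in which case pandas calls it on each index label rather than on the column values; the data and group names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with string index labels
df = pd.DataFrame({"x": [1, 2, 3, 4]}, index=["a", "B", "c", "D"])

# A list of keys, one per row
by_list = df.groupby(["u", "v", "u", "v"]).sum()

# A dict mapping index labels to group names
by_dict = df.groupby({"a": "lower", "c": "lower", "B": "upper", "D": "upper"}).sum()

# A function: called once per index label, its return value is the group key
by_func = df.groupby(lambda label: "lower" if label.islower() else "upper").sum()

print(by_list, by_dict, by_func, sep="\n\n")
```

To group on the values of a column instead, map the function over that column first, as the jug example does.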
In [42]: df = DataFrame(np.arange(16).reshape(4, 4),
               index=[["one", "one", "two", "two"], ["a", "b", "a", "b"]],
               columns=[["apple", "apple", "orange", "orange"], ["red", "green", "red", "green"]])
# With a hierarchical index you can group by level: pass the level number or name through the level parameter
df
Out[42]:
      apple       orange
        red green    red green
one a     0     1      2     3
    b     4     5      6     7
two a     8     9     10    11
    b    12    13     14    15
In [43]: df.groupby(level=1).sum()
Out[43]:
  apple       orange
    red green    red green
a     8    10     12    14
b    16    18     20    22
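Levels can also be referenced by name, and column levels can be grouped by transposing first. The sketch below rebuilds the same 4x4 frame; the level names outer, inner, fruit and color are assumptions added for illustration (the original index is unnamed).

```python
import numpy as np
import pandas as pd

# Same values as the frame above, with hypothetical level names added
df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([["one", "two"], ["a", "b"]], names=["outer", "inner"]),
    columns=pd.MultiIndex.from_product([["apple", "orange"], ["red", "green"]], names=["fruit", "color"]),
)

# Rows: group by the inner index level, referenced by name instead of number
by_inner = df.groupby(level="inner").sum()

# Columns: transpose, group by the column level, transpose back
by_color = df.T.groupby(level="color").sum().T

print(by_inner)
print(by_color)
```

The transpose route avoids the axis argument of groupby, which recent pandas releases deprecate.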
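In [44]: df.groupby(level=1, axis=1).sum()  # group along the columns (axis=1); note that the axis argument of groupby is deprecated in recent pandas releases
Out[44]:
       green  red
one a      4    2
    b     12   10
two a     20   18
    b     28   26

Aggregation
Aggregate functions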
In [47]: max_tip = tips.groupby("sex")["tip"].max()  # group by sex and compute the largest tip
max_tip
Out[47]:
sex
Male      10.0
Female     6.5
Name: tip, dtype: float64
In [48]: max_tip.plot(kind="bar")
Out[48]:
In [50]: df = DataFrame(np.arange(16).reshape(4, 4))
df
Out[50]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
In [53]: list1 = ["a", "b", "a", "b"]
df.groupby(list1).quantile(0.5)  # quantile computes quantiles; 0.5 is the median
Out[53]:
0.5    0    1     2     3
a    4.0  5.0   6.0   7.0
b    8.0  9.0  10.0  11.0
In [4]: def get_range(x):
    return x.max() - x.min()
In [5]: tips_range = tips.groupby("sex")["tip"].agg(get_range)
# agg is typically called after groupby() to aggregate each group, using sum, min, max, other built-in aggregations, or a custom function
tips_range
Out[5]:
sex
Male      9.0
Female    5.5
Name: tip, dtype: float64
In [6]: tips_range.plot(kind="bar")
Out[6]:
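A custom function passed to agg is called once per group and should return a single value. A minimal sketch on invented data, checking the custom range against the equivalent built-in aggregations:

```python
import pandas as pd

def get_range(x):
    """Spread of a group: largest value minus smallest value."""
    return x.max() - x.min()

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "tip": [1.0, 10.0, 1.0, 6.5],
})

grouped = toy.groupby("sex")["tip"]
custom = grouped.agg(get_range)          # custom function, called once per group
builtin = grouped.max() - grouped.min()  # same result from built-in aggregations
print(custom)
```

Built-in aggregations are generally faster than Python-level functions, so a custom function is best kept for logic that has no built-in equivalent.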
Applying multiple functions
In [13]: tips.groupby(["sex", "smoker"])["tip"].agg(["mean", "std", get_range])  # pass a list of functions to agg to apply several functions to one column
Out[13]:
                   mean       std  get_range
sex    smoker
Male   Yes     3.051167  1.500120       9.00
       No      3.113402  1.489559       7.75
Female Yes     2.931515  1.219916       5.50
       No      2.773519  1.128425       4.20
In [15]: tips.groupby(["sex", "smoker"])["tip"].agg([("tip_mean", "mean"), ("Range", get_range)])  # to replace the default column names, pass (name, function) tuples: the name first, then the aggregation
Out[15]:
               tip_mean  Range
sex    smoker
Male   Yes     3.051167   9.00
       No      3.113402   7.75
Female Yes     2.931515   5.50
       No      2.773519   4.20
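Besides (name, function) tuples, pandas also supports keyword-style named aggregation, where each keyword is the output column name and the value is a (column, function) pair. A sketch on invented data; the output names tip_mean and tip_range are chosen here for illustration:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "tip": [1.0, 3.0, 2.0, 6.0],
})

summary = toy.groupby("sex").agg(
    tip_mean=("tip", "mean"),                        # output name = (input column, function)
    tip_range=("tip", lambda s: s.max() - s.min()),  # a custom function works too
)
print(summary)
```

This form keeps the result columns flat even when several input columns are aggregated.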
In [16]: tips.groupby(["day", "time"])[["total_bill", "tip"]].agg([("tip_mean", "mean"), ("Range", get_range)])  # select several columns with a list (double brackets); applying several functions to several columns produces hierarchical column labels
Out[16]:
            total_bill             tip
              tip_mean  Range tip_mean Range
day  time
Thur Lunch   17.664754  35.60 2.767705  5.45
     Dinner  18.780000   0.00 3.000000  0.00
Fri  Lunch   12.845714   7.69 2.382857  1.90
     Dinner  19.663333  34.42 2.940000  3.73
Sat  Dinner  20.441379  47.74 2.993103  9.00
Sun  Dinner  21.410000  40.92 3.255132  5.49
In [17]: tips.groupby(["day", "time"])[["total_bill", "tip"]].agg({"total_bill": "sum", "tip": "mean"})  # to apply a different function to each column, pass a dict that maps column names to functions
Out[17]:
             total_bill       tip
day  time
Thur Lunch      1077.55  2.767705
     Dinner       18.78  3.000000
Fri  Lunch        89.92  2.382857
     Dinner      235.96  2.940000
Sat  Dinner     1778.40  2.993103
Sun  Dinner     1627.16  3.255132
In [18]: tips.groupby(["day", "time"])[["total_bill", "tip"]].agg({"total_bill": ["sum", "mean"], "tip": "mean"})
Out[18]:
            total_bill                  tip
                   sum       mean      mean
day  time
Thur Lunch     1077.55  17.664754  2.767705
     Dinner      18.78  18.780000  3.000000
Fri  Lunch       89.92  12.845714  2.382857
     Dinner     235.96  19.663333  2.940000
Sat  Dinner    1778.40  20.441379  2.993103
Sun  Dinner    1627.16  21.410000  3.255132
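Two-level column labels like the ones above can be awkward to work with downstream. One possible follow-up, sketched on invented data, is to join the two levels into single strings; the underscore naming is an arbitrary choice made here, not something the original prescribes.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri"],
    "total_bill": [10.0, 20.0, 30.0],
    "tip": [1.0, 3.0, 5.0],
})

# Different functions per column; lists of functions yield two-level column labels
result = toy.groupby("day")[["total_bill", "tip"]].agg(
    {"total_bill": ["sum", "mean"], "tip": ["mean"]}
)

# Flatten the two levels into single strings such as "total_bill_sum"
flat = result.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat)
```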
In [23]: no_index = tips.groupby(["sex", "smoker"], as_index=False)["tip"].mean()  # pass as_index=False when the group keys should stay ordinary columns instead of becoming the index
no_index
Out[23]:
      sex smoker       tip
0    Male    Yes  3.051167
1    Male     No  3.113402
2  Female    Yes  2.931515
3  Female     No  2.773519
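The same flat layout can be reached by aggregating normally and then calling reset_index(). A sketch on invented data comparing the two routes:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female"],
    "smoker": ["Yes", "No", "No"],
    "tip": [3.0, 2.0, 4.0],
})

# Keys kept as columns during the aggregation
flat = toy.groupby(["sex", "smoker"], as_index=False)["tip"].mean()

# Keys moved out of the index afterwards
via_reset = toy.groupby(["sex", "smoker"])["tip"].mean().reset_index()

print(flat)
print(via_reset)
```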
In [24]: tips
Out[24]:
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

244 rows × 7 columns

Group-wise operations
The transform method
In [28]: df = DataFrame(tips.groupby("sex")["tip"].mean())
df
Out[28]:
             tip
sex
Male    3.089618
Female  2.833448
In [29]: new_tips = pd.merge(tips, df, left_on="sex", right_index=True)  # aggregate first, then merge the result back onto the original rows
new_tips.head()
Out[29]:
    total_bill  tip_x     sex smoker  day    time  size     tip_y
0        16.99   1.01  Female     No  Sun  Dinner     2  2.833448
4        24.59   3.61  Female     No  Sun  Dinner     4  2.833448
11       35.26   5.00  Female     No  Sun  Dinner     4  2.833448
14       14.83   3.02  Female     No  Sun  Dinner     2  2.833448
16       10.33   1.67  Female     No  Sun  Dinner     3  2.833448
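The aggregate-then-merge pattern above can be checked against groupby(...).transform("mean"), which attaches the group statistic to every row in a single step. A sketch on invented data; the column name tip_mean is an assumption chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "tip": [2.0, 3.0, 5.0, 4.0],
})

# Aggregate, then merge the per-group mean back onto every row
means = toy.groupby("sex")["tip"].mean().rename("tip_mean")
merged = toy.merge(means, left_on="sex", right_index=True)

# transform returns a result aligned with the original rows
toy["tip_mean"] = toy.groupby("sex")["tip"].transform("mean")

print(merged)
print(toy)
```

Both routes attach the same values; transform additionally keeps the original row order and column names.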
In [32]: tips.groupby("sex")["tip"].transform("mean")  # transform broadcasts the group result back to every row
Out[32]:
0      2.833448
1      3.089618
2      3.089618
3      3.089618
4      2.833448
         ...
239    3.089618
240    2.833448
241    3.089618
242    3.089618
243    2.833448
Name: tip, Length: 244, dtype: float64

The apply method
In [10]: def top(x, n=5):
    # rows are sorted by tip in descending order, so the last n rows are the n smallest tips; use [:n] for the n largest
    return x.sort_values(by="tip", ascending=False)[-n:]
In [11]: tips.groupby("sex").apply(top)
Out[11]:
            total_bill   tip     sex smoker  day    time  size
sex
Male   43         9.68  1.32    Male     No  Sun  Dinner     2
       235       10.07  1.25    Male     No  Sat  Dinner     2
       75        10.51  1.25    Male     No  Sat  Dinner     2
       237       32.83  1.17    Male    Yes  Sat  Dinner     2
       236       12.60  1.00    Male    Yes  Sat  Dinner     2
Female 215       12.90  1.10  Female    Yes  Sat  Dinner     2
       0         16.99  1.01  Female     No  Sun  Dinner     2
       111        7.25  1.00  Female     No  Sat  Dinner     1
       67         3.07  1.00  Female    Yes  Sat  Dinner     1
       92         5.75  1.00  Female    Yes  Fri  Dinner     2
In [12]: tips.groupby("sex", group_keys=False).apply(top)  # pass group_keys=False when the group keys should not be added to the index of the result
Out[12]:
     total_bill   tip     sex smoker  day    time  size
43         9.68  1.32    Male     No  Sun  Dinner     2
235       10.07  1.25    Male     No  Sat  Dinner     2
75        10.51  1.25    Male     No  Sat  Dinner     2
237       32.83  1.17    Male    Yes  Sat  Dinner     2
236       12.60  1.00    Male    Yes  Sat  Dinner     2
215       12.90  1.10  Female    Yes  Sat  Dinner     2
0         16.99  1.01  Female     No  Sun  Dinner     2
111        7.25  1.00  Female     No  Sat  Dinner     1
67         3.07  1.00  Female    Yes  Sat  Dinner     1
92         5.75  1.00  Female    Yes  Fri  Dinner     2
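Selecting the first or last n rows per group does not strictly need apply. One alternative, sketched here on invented data, is to sort once and then take head(n) within each group; this is a different technique from the top function above, shown only for comparison.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female"],
    "tip": [1.0, 5.0, 3.0, 2.0, 4.0],
})

# Sort once by tip (largest first), then keep the first 2 rows of each group
largest = toy.sort_values("tip", ascending=False).groupby("sex").head(2)

# Sorting in ascending order gives the 2 smallest tips per group instead
smallest = toy.sort_values("tip").groupby("sex").head(2)

print(largest)
print(smallest)
```

head keeps the original row labels and does not add the group key to the index, similar to the group_keys=False result above.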
In [18]: data = {
    "name": ["张三", "李四", "peter", "王五", "小明", "小红"],
    "sex": ["female", "female", "male", "male", "male", "female"],
    "math": [67, 72, np.nan, 82, 90, np.nan]
}
df = DataFrame(data)
df["math"] = df["math"]
df
Out[18]:
   math   name     sex
0  67.0     张三  female
1  72.0     李四  female
2   NaN  peter    male
3  82.0     王五    male
4  90.0     小明    male
5   NaN     小红  female
In [19]: df.fillna(df["math"].mean())  # fill the missing values with the overall mean
Out[19]:
    math   name     sex
0  67.00     张三  female
1  72.00     李四  female
2  77.75  peter    male
3  82.00     王五    male
4  90.00     小明    male
5  77.75     小红  female
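The overall mean (77.75) ignores differences between groups. A per-group fill can be sketched with transform, using the same sex and math values as above; the column name math_filled is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "math": [67, 72, np.nan, 82, 90, np.nan],
})

# Mean of each group, broadcast to the shape of the original column
group_mean = scores.groupby("sex")["math"].transform("mean")

# Fill each missing score with the mean of its own group
scores["math_filled"] = scores["math"].fillna(group_mean)
print(scores)
```

The missing female score becomes 69.5 (the mean of 67 and 72) and the missing male score becomes 86.0 (the mean of 82 and 90).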
In [20]: f = lambda x: x.fillna(x.mean(numeric_only=True))  # anonymous function: after grouping, fill the missing values with each group's own mean
df.groupby("sex").apply(f)
Out[20]:
          math   name     sex
sex
female 0  67.0     张三  female
       1  72.0     李四  female
       5  69.5     小红  female
male   2  86.0  peter    male
       3  82.0     王五    male
       4  90.0     小明    male

Data pivot tables
Pivot tables
In [25]: # show the help text for pivot_table
tips.pivot_table?
In [22]: tips.pivot_table(values="tip", index="sex", columns="smoker")  # values is the column to aggregate, index gives the rows, columns gives the columns; the default aggregation is the mean
Out[22]:
smoker       Yes        No
sex
Male    3.051167  3.113402
Female  2.931515  2.773519
In [23]: tips.pivot_table(values="tip", index="sex", columns="smoker", aggfunc="sum")  # the aggfunc parameter selects the aggregation
Out[23]:
smoker     Yes      No
sex
Male    183.07  302.00
Female   96.74  149.77
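A pivot table is closely related to a two-key groupby followed by unstack. The sketch below, on invented data, builds the same table both ways:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "smoker": ["Yes", "No", "Yes", "No", "Yes"],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0],
})

pivoted = toy.pivot_table(values="tip", index="sex", columns="smoker", aggfunc="sum")

# Group by both keys, aggregate, then move the smoker level into the columns
unstacked = toy.groupby(["sex", "smoker"])["tip"].sum().unstack("smoker")

print(pivoted)
print(unstacked)
```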
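In [24]: tips.pivot_table(values="tip", index="sex", columns="smoker", aggfunc="sum", margins=True)  # margins=True adds subtotals (the All row and column)
Out[24]:
smoker     Yes      No     All
sex
Male    183.07  302.00  485.07
Female   96.74  149.77  246.51
All     279.81  451.77  731.58

Crosstabs
A crosstab (cross-tabulation) is a special kind of pivot table used to compute group frequencies.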
In [33]: cross_table = pd.crosstab(index=tips["day"], columns=tips["size"])
cross_table
Out[33]:
size  1   2   3   4  5  6
day
Thur  1  48   4   5  1  3
Fri   1  16   1   1  0  0
Sat   2  53  18  13  1  0
Sun   0  39  15  18  3  1
In [36]: df = cross_table.div(cross_table.sum(1), axis=0)  # div divides each row by its row total, so every row sums to 1 (row-wise frequencies)
df
Out[36]:
size         1         2         3         4         5         6
day
Thur  0.016129  0.774194  0.064516  0.080645  0.016129  0.048387
Fri   0.052632  0.842105  0.052632  0.052632  0.000000  0.000000
Sat   0.022989  0.609195  0.206897  0.149425  0.011494  0.000000
Sun   0.000000  0.513158  0.197368  0.236842  0.039474  0.013158
In [37]: df.plot(kind="bar", stacked=True)  # with stacked=True the bar chart is drawn as a stacked bar chart
Out[37]:
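Row-wise frequencies can also be requested from crosstab directly through its normalize parameter. A sketch on invented data comparing that with the manual division by row totals:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri"],
    "size": [2, 2, 4, 2],
})

counts = pd.crosstab(index=toy["day"], columns=toy["size"])

# Divide each row by its row total
by_div = counts.div(counts.sum(axis=1), axis=0)

# Or let crosstab normalise over each row directly
by_norm = pd.crosstab(index=toy["day"], columns=toy["size"], normalize="index")

print(by_div)
print(by_norm)
```

normalize="columns" and normalize="all" give column-wise and overall frequencies in the same way.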