import pandas as pd
import numpy as np
n = 1000
df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
'Revenue': (np.random.random(n)*50+10).round(2),
'Quantity': np.random.randint(1, 10, size=n)})
これまでだと↓のような感じで書く必要があったのが、(もしかするともっと良い書き方があるかも…?)
(
df
.groupby(['Store', 'Product'])
.sum()
.assign(rpq = lambda d: d.Revenue / d.Quantity)
.rpq
.unstack()
.round(2)
)
Product | Product_1 | Product_2 | Product_3 |
---|---|---|---|
Store | |||
Store_1 | 7.53 | 7.07 | 6.93 |
Store_2 | 7.00 | 7.52 | 6.83 |
↓だけでよくなった
(
df
.groupby(['Store', 'Product'])
.pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())
.unstack()
.round(2)
)
Product | Product_1 | Product_2 | Product_3 |
---|---|---|---|
Store | |||
Store_1 | 7.53 | 7.07 | 6.93 |
Store_2 | 7.00 | 7.52 | 6.83 |
count
ではなく size
を使う¶見逃しててずっと count
で頑張っていたが、
df.groupby(['Store', 'Product']).Product.count()
Store Product Store_1 Product_1 163 Product_2 188 Product_3 153 Store_2 Product_1 163 Product_2 173 Product_3 160 Name: Product, dtype: int64
↓ こう書けば良い
df.groupby(['Store', 'Product']).size()
Store Product Store_1 Product_1 163 Product_2 188 Product_3 153 Store_2 Product_1 163 Product_2 173 Product_3 160 dtype: int64
assign
内部で前に(定義|更新)したカラムを参照できるようになった(v0.23より)¶(
df
.assign(
rpq = lambda d: d.Revenue / d.Quantity,
rpq_squared = lambda d: d.rpq ** 2
)
.head()
)
Store | Product | Revenue | Quantity | rpq | rpq_squared | |
---|---|---|---|---|---|---|
0 | Store_1 | Product_2 | 24.76 | 1 | 24.760000 | 613.057600 |
1 | Store_2 | Product_2 | 53.47 | 4 | 13.367500 | 178.690056 |
2 | Store_1 | Product_2 | 21.24 | 4 | 5.310000 | 28.196100 |
3 | Store_1 | Product_2 | 51.53 | 9 | 5.725556 | 32.781986 |
4 | Store_2 | Product_2 | 50.65 | 8 | 6.331250 | 40.084727 |
assign と同じことを inplace でやりたいなら eval
で似たようなことができる
c.f. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.eval.html
df2 = df
df2.eval('rpq = Revenue / Quantity', inplace = True)
df2.eval('rpq_squared = rpq * rpq', inplace = True)
df2.head()
Store | Product | Revenue | Quantity | rpq | rpq_squared | |
---|---|---|---|---|---|---|
0 | Store_1 | Product_2 | 24.76 | 1 | 24.760000 | 613.057600 |
1 | Store_2 | Product_2 | 53.47 | 4 | 13.367500 | 178.690056 |
2 | Store_1 | Product_2 | 21.24 | 4 | 5.310000 | 28.196100 |
3 | Store_1 | Product_2 | 51.53 | 9 | 5.725556 | 32.781986 |
4 | Store_2 | Product_2 | 50.65 | 8 | 6.331250 | 40.084727 |