A common need is bound to arise where you will need to look at an aggregate view of a
DataFrame by a certain value. This is where grouping comes in.
CashBox has asked that we produce a list of the top 10 users who have been on the receiving side of transactions the most. They would like to see the user's first and last name, their email, and the total number of transactions where the user was the receiver.
We can acheive this by grouping our data. We need to take a look at the
DataFrame and group it by the
receiver field, which is the username.
Grouping on a value returns a new type of object called the
Let's first explore how to create one of these, and then how to wield it's power.
# Setup import os import pandas as pd pd.options.display.max_rows = 10 users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0) transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0) # Sanity check (len(users), len(transactions))
Let's remind ourselves about the types of data we have in the
sender object receiver object amount float64 sent_date object dtype: object
Grouping by a specific column is pretty straight forward. We want to group by the receiver, so we use the
grouped_by_receiver = transactions.groupby('receiver') # Let's see what type of object we got back type(grouped_by_receiver)
# Returns a Series of total number of rows grouped_by_receiver.size()
receiver aaron 6 acook 1 adam.saunders 2 adrian 3 adrian.blair 7 .. wilson 2 wking 2 wright3590 4 young 2 zachary.neal 4 Length: 410, dtype: int64
Similarly, we can use the
DataFrameGroupBy.count method to see counts of how many non missing data points we have across each column in our group across the columns of our
410 rows × 3 columns
GroupBy object provides aggregate functions that makes getting calculations quick and seamless. For instance, if we use the
GroupBy.sum method we can see each numeric column summed up for each grouping. In our case there is only one numeric column
410 rows × 1 columns
Now where were we? Oh right, we're trying to figure out the people who was received the most transactions. So why don't we use this group to create a new column on our
# Create a new column in users called transaction count, and set the values to the size of the matching group users['transaction_count'] = grouped_by_receiver.size() # Not every user has made a transaction, let's see what kind of missing data we are dealing with len(users[users.transaction_count.isna()])
Since we don't have a transaction record for everyone, not every user will be in our grouping. So when we created the new column, we ended up adding some
np.nan. Let's fix that.
# Set all missing data to 0, since in reality, there have been 0 received transactions for this user users.transaction_count.fillna(0, inplace=True) users
475 rows × 8 columns
Check it out! There's our column, but it's a floating point number, we don't need that. Let's convert it!
# Convert from the default type of float64 to int64 (no precision needed) users.transaction_count = users.transaction_count.astype('int64')
Finally we want to get the user with the highest transaction count, so let's sort by that descending.
# Sort our values by the new field descending (so the largest comes first), and then by first name ascending users.sort_values( ['transaction_count', 'first_name'], ascending=[False, True], inplace=True ) # Take a look at our top 10 receivers, showing only the columns we want users.loc[:, ['first_name', 'last_name', 'email', 'transaction_count']].head(10)
Here they are, the Top 10 Receivers! Nice work putting all those skills together!