Imagine you're a Data Analyst at a bank, and your manager just dropped a bombshell on you. It seems that the bank's financial reports are a mess, and they need your expertise to sort it out. The problem is that the data is scattered across multiple files, with different formats and inconsistencies. Your task is to use your Pandas skills to wrangle the data, identify the issues, and provide insights to help the bank get back on track.
Data Unification: Combine the 'transactions.csv' and 'customers.xlsx' files into a single dataframe, handling missing values and data type inconsistencies. Ensure that the 'customer_id' column is of numeric type (int or float) and the 'date' column is in datetime format.
Transaction Categorization: Create a new column 'transaction_category' and categorize each transaction as either 'deposit', 'withdrawal', or 'transfer' based on the 'transaction_type' column. Calculate the total value and average amount for each transaction category.
Customer Insights: Group by 'gender' and 'location', and calculate the average account balance (sum of 'amount' column) and transaction frequency (count of transactions) for each group. Output the results as a new dataframe.
# import libraries
import pandas as pd
import numpy as np
import sys
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)] Pandas version 2.2.1 Numpy version 1.26.4
The dataset contains financial transaction records for 1,000 bank customers, including 10,000 transactions (deposits, withdrawals, and transfers) across two files (transactions.csv and customers.xlsx), with demographic information about each customer, such as age, gender, and location.
# set the seed
np.random.seed(0)
transactions = pd.DataFrame({
'customer_id': np.random.randint(1, 1000, 10000),
'transaction_type': np.random.choice(['deposit', 'withdrawal', 'transfer'], 10000),
'amount': np.random.uniform(100, 1000, 10000),
'date': pd.date_range('2024-01-01', periods=10000)
})
# select 100 unique customer IDs
missing_ids = np.random.choice(transactions['customer_id'].unique(), 100, replace=False)
# for each of the selected customer IDs, set a random 'amount' value to null
for customer_id in missing_ids:
idx = np.random.choice(transactions[transactions['customer_id'] == customer_id].index)
transactions.loc[idx, 'amount'] = np.nan
customers = pd.DataFrame({
'customer_id': np.arange(1, 1001),
'age': np.random.randint(25, 65, 1000),
'gender': np.random.choice(['male', 'female'], 1000),
'location': np.random.choice(['urban', 'rural'], 1000)
})
# save data to disc
transactions.to_csv('transactions.csv', index=False)
customers.to_excel('customers.xlsx', index=False)
Let us start by reading in the CSV and Excel file. We will then look at their data types to ensure they are correct.
transactions_df = pd.read_csv('transactions.csv')
transactions_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customer_id 10000 non-null int64 1 transaction_type 10000 non-null object 2 amount 9900 non-null float64 3 date 10000 non-null object dtypes: float64(1), int64(1), object(2) memory usage: 312.6+ KB
From the table above, we can see we need to update the date
column so it is represented as a date object. Currently it is represented as a string object.
The easiest way is to use the pd.to_datetime
method.
transactions_df['date'] = pd.to_datetime(transactions_df['date'])
transactions_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customer_id 10000 non-null int64 1 transaction_type 10000 non-null object 2 amount 9900 non-null float64 3 date 10000 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(1), int64(1), object(1) memory usage: 312.6+ KB
Let us now import the Excel file and perform the same process as we did with the CSV import.
I am not seeing any issues with the datatypes after importing from Excel. Yay!
customers_df = pd.read_excel('customers.xlsx')
customers_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customer_id 1000 non-null int64 1 age 1000 non-null int64 2 gender 1000 non-null object 3 location 1000 non-null object dtypes: int64(2), object(2) memory usage: 31.4+ KB
Now we can try to merge the dataframes. The common key between them is customer_id
and we will use the .merge()
method to accomplish this portion of the task.
df = pd.merge(transactions_df, customers_df, on='customer_id')
df.head()
customer_id | transaction_type | amount | date | age | gender | location | |
---|---|---|---|---|---|---|---|
0 | 685 | withdrawal | 189.852038 | 2024-01-01 | 37 | female | rural |
1 | 560 | withdrawal | 626.720139 | 2024-01-02 | 36 | male | urban |
2 | 630 | withdrawal | 657.841588 | 2024-01-03 | 28 | male | rural |
3 | 193 | deposit | 327.129016 | 2024-01-04 | 48 | male | rural |
4 | 836 | transfer | 798.239145 | 2024-01-05 | 41 | male | urban |
Using the 'transaction_type' column, calculate the total value and average amount for each transaction category.
The transaction_type column contains the strings 'deposit', 'withdrawal', and 'transfer'. Since we are looking for total values in addition to the averages of these three categories, let us see if a pivot table would work for this task.
Notes on the df.pivot_table
:
values
index
with columns
df.pivot_table(index=['transaction_type'], values=['amount'], aggfunc=['sum','mean'])
sum | mean | |
---|---|---|
amount | amount | |
transaction_type | ||
deposit | 1.794246e+06 | 547.694277 |
transfer | 1.843336e+06 | 548.611931 |
withdrawal | 1.785630e+06 | 547.067914 |
Group by 'gender' and 'location', and calculate the average account balance (sum of 'amount' column) and transaction frequency (count of transactions) for each group. Output the results as a new dataframe.
The agg()
method gives us a really nice API to work with. We not only can use multiple aggregate functions, but Pandas allows us to choose and name the column.
# create group object
group = df.groupby(['gender','location'])
# get the total and average of the group
group.agg(
amount_sum=pd.NamedAgg(column="amount", aggfunc="sum"),
amount_count=pd.NamedAgg(column="amount", aggfunc="count")
)
amount_sum | amount_count | ||
---|---|---|---|
gender | location | ||
female | rural | 1.447087e+06 | 2620 |
urban | 1.280509e+06 | 2298 | |
male | rural | 1.312961e+06 | 2419 |
urban | 1.382655e+06 | 2563 |
In this tutorial, you have completed three tasks and learned valuable skills along the way. You unified data by combining multiple files into a single DataFrame, categorized transactions to calculate totals and averages, and gained customer insights by grouping data by gender and location. Through these tasks, you have learned how to work with Pandas, including saving and loading data, checking and converting data types, merging DataFrames, using pivot tables, and grouping data to calculate aggregate values.
Here are the important skills you learned in this tutorial:
This tutorial was created by HEDARO