You are a fintech professional working for a bank. Your team has collected a dataset of credit card transactions from various customers, including transaction dates, amounts, categories, and merchant names. However, the data is messy and needs to be cleaned and analyzed to extract valuable insights.
Use pandas to clean and analyze the dataset, answering the following questions:
Identify the top 3 categories with the highest average transaction amount on weekends (Saturday and Sunday).
Answer: Travel, Entertainment, Food (in that order)
# import libraries
import pandas as pd
import numpy as np
The code below generates a sample dataset with four columns (date, amount, category, and merchant). It starts with 100 random rows, then introduces a few missing values and appends 5 duplicate rows, leaving 105 rows in total to simulate real data in the wild.
# set the seed
np.random.seed(0)
# hold the values to randomly pick from
categories = ['Food', 'Entertainment', 'Shopping', 'Travel']
merchants = ['McDonalds', 'Amazon', 'Netflix', 'Expedia', 'Starbucks']
dates = pd.date_range('2024-01-01', '2024-01-31')
# generate the data
data = {
'date': np.random.choice(dates, size=100),
'amount': np.random.uniform(10, 100, size=100),
'category': np.random.choice(categories, size=100),
'merchant': np.random.choice(merchants, size=100)
}
# add the data to our dataframe
df = pd.DataFrame(data)
# introduce some missing values
# we randomly select 5 index values and set the column "amount" to null
df.loc[np.random.choice(df.index, size=5), 'amount'] = np.nan
# introduce some duplicates
# we select the first 5 rows and append them to the dataframe
dupes = df.iloc[:5]
df = pd.concat([df, dupes])
df.head()
 | date | amount | category | merchant
---|---|---|---|---
0 | 2024-01-13 | 67.084665 | Food | Starbucks
1 | 2024-01-16 | 96.305434 | Shopping | Expedia
2 | 2024-01-22 | 68.751129 | Travel | Expedia
3 | 2024-01-01 | 67.155299 | Travel | Amazon
4 | 2024-01-04 | NaN | Entertainment | Amazon
The data types of the columns look good.
Do you see the Non-Null Count column and the value of 99 for amount? With 105 entries in total, that tells us the amount column contains null values.
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 105 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      105 non-null    datetime64[ns]
 1   amount    99 non-null     float64
 2   category  105 non-null    object
 3   merchant  105 non-null    object
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 4.1+ KB
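If all you want is the null counts, isna() combined with sum() gives a quick per-column tally (a convenience on top of what info() already shows):
# count nulls per column; amount is the only one with missing values
df.isna().sum()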
Let's take care of the duplicates in the data.
# here is how we can identify the duplicates
df[df.duplicated()]
 | date | amount | category | merchant
---|---|---|---|---
0 | 2024-01-13 | 67.084665 | Food | Starbucks
1 | 2024-01-16 | 96.305434 | Shopping | Expedia
2 | 2024-01-22 | 68.751129 | Travel | Expedia
3 | 2024-01-01 | 67.155299 | Travel | Amazon
4 | 2024-01-04 | NaN | Entertainment | Amazon
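One thing to keep in mind: duplicated() flags only the second occurrence of each repeated row by default. If you want to inspect both copies of every duplicate pair, pass keep=False:
# mark every member of a duplicate group, not just the later copies
df[df.duplicated(keep=False)]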
# here is how we can pinpoint one of the duplicates
df[np.isclose(df['amount'], 67.084665)]
 | date | amount | category | merchant
---|---|---|---|---
0 | 2024-01-13 | 67.084665 | Food | Starbucks
0 | 2024-01-13 | 67.084665 | Food | Starbucks
# pandas makes it very easy to get rid of the duplicates
df = df.drop_duplicates()
df.head()
 | date | amount | category | merchant
---|---|---|---|---
0 | 2024-01-13 | 67.084665 | Food | Starbucks
1 | 2024-01-16 | 96.305434 | Shopping | Expedia
2 | 2024-01-22 | 68.751129 | Travel | Expedia
3 | 2024-01-01 | 67.155299 | Travel | Amazon
4 | 2024-01-04 | NaN | Entertainment | Amazon
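As a quick sanity check, we can count how many duplicate rows are left; after drop_duplicates() this should come back as 0:
# verify the cleanup worked; expect 0
df.duplicated().sum()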
My plan is to fill any missing values with the previous known value (a forward fill). If that doesn't make sense for a row, I will simply get rid of it.
# these are the null values
df[df.loc[:,'amount'].isna()]
 | date | amount | category | merchant
---|---|---|---|---
4 | 2024-01-04 | NaN | Entertainment | Amazon
23 | 2024-01-15 | NaN | Food | Expedia
24 | 2024-01-25 | NaN | Travel | McDonalds
49 | 2024-01-08 | NaN | Entertainment | McDonalds
87 | 2024-01-24 | NaN | Shopping | McDonalds
Let's look at each null row next to the row just before it, so we can judge whether a forward fill would work here.
As you can see from the table below, every previous row is an unrelated transaction with a different category and merchant, so carrying its amount forward would just be copying a random number. We can't make a sensible guess at the missing values, so we'll remove those rows instead.
# make sure data is ordered
df = df.sort_values(by=['date','category','merchant'])
# make sure index values are in order
# drop=True discards the old index instead of adding it back as a column
df = df.reset_index(drop=True)
# get null rows
null_rows = df[df['amount'].isnull()]
# get null indexes
null_indices = null_rows.index
# get previous rows
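# note: this assumes the first row is not null; otherwise index -1 would raise a KeyError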
previous_rows = df.loc[null_indices - 1]
# combine them
pd.concat([null_rows,previous_rows]).sort_index()
 | date | amount | category | merchant
---|---|---|---|---
11 | 2024-01-03 | 70.789517 | Food | Netflix
12 | 2024-01-04 | NaN | Entertainment | Amazon
28 | 2024-01-08 | 18.313650 | Entertainment | Amazon
29 | 2024-01-08 | NaN | Entertainment | McDonalds
47 | 2024-01-14 | 71.979506 | Travel | McDonalds
48 | 2024-01-15 | NaN | Food | Expedia
80 | 2024-01-24 | 50.604329 | Shopping | Expedia
81 | 2024-01-24 | NaN | Shopping | McDonalds
85 | 2024-01-25 | 77.425285 | Entertainment | Netflix
86 | 2024-01-25 | NaN | Travel | McDonalds
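As an aside, if the neighboring rows had been related transactions, the forward fill itself would be a one-liner. Here is what it would look like, written to a throwaway copy (df_filled is just an illustrative name, since we are not actually applying it):
# forward fill sketch: copy the previous known amount into each null
df_filled = df.assign(amount=df['amount'].ffill())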
Let's remove the rows that have a null value in the amount column.
# drop nulls
df = df.dropna()
# verify
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 95 entries, 0 to 99
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      95 non-null     datetime64[ns]
 1   amount    95 non-null     float64
 2   category  95 non-null     object
 3   merchant  95 non-null     object
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 3.7+ KB
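A small note on dropna(): with no arguments it drops rows containing a null in any column. That is fine here because amount is the only column with nulls, but passing subset makes the intent explicit (an equivalent variation, not what we ran above):
# equivalent in this dataset, but explicit about which column to check
df.dropna(subset=['amount'])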
# What is the total transaction amount for each category (e.g., Food, Entertainment, Shopping)?
group = df.groupby('category')
group.sum(numeric_only=True)
 | amount
---|---
category | 
Entertainment | 1576.840628
Food | 1063.905275
Shopping | 1060.962022
Travel | 1822.804450
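Since amount is the only column we care about, another option is to select it before aggregating, which avoids the numeric_only flag altogether (same totals, slightly tighter code):
# same result as above, written as a single chain
df.groupby('category')['amount'].sum()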
# Which merchant has the highest total transaction amount?
group = df.groupby('merchant')
group.sum(numeric_only=True).sort_values(by='amount', ascending=False).head(1)
 | amount
---|---
merchant | 
McDonalds | 1453.475029
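If you only need the merchant's name rather than the whole row, idxmax() returns the index label of the largest value (a small shortcut on the same aggregation):
# returns just the label, e.g. 'McDonalds' for this dataset
df.groupby('merchant')['amount'].sum().idxmax()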
# What is the average transaction amount per day for each category?
group = df.groupby(['date','category'])
# show top 7 rows
group.mean(numeric_only=True).head(7)
 | | amount
---|---|---
date | category | 
2024-01-01 | Entertainment | 29.657538
 | Shopping | 73.336355
 | Travel | 46.258544
2024-01-02 | Entertainment | 19.257702
 | Food | 35.962879
 | Shopping | 89.060697
 | Travel | 89.037412
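The long multi-index output above can be easier to scan as a grid, with one row per date and one column per category. unstack() pivots the category level into columns; a NaN simply means that category had no transactions that day (a presentation tweak, not a different calculation):
# pivot categories into columns, one row per date
df.groupby(['date', 'category'])['amount'].mean().unstack().head(7)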
Identify the top 3 categories with the highest average transaction amount on weekends (Saturday and Sunday).
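Here is one way to work it out. The dt.dayofweek accessor maps Monday to 0 and Sunday to 6, so values of 5 and 6 are the weekend; the weekend variable below is just an illustrative name:
# keep only Saturday (5) and Sunday (6) transactions
weekend = df[df['date'].dt.dayofweek >= 5]
# average amount per category, largest first, top 3
weekend.groupby('category')['amount'].mean().sort_values(ascending=False).head(3)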
This tutorial was created by HEDARO