Budgeting App: An Exploratory Analysis

We used to write down all the transactions we made using a notebook or spreadsheet to understand how we spent our money. These approaches can be pretty time-consuming and require much effort. Budgeting apps help achieve financial goals by automating the tediousness of tracking personal budgets.

This notebook conducts an exploratory analysis to inform the business strategy of a financial firm that operates a budgeting app. The firm has a mission of making budgeting accessible and stress-free. The app helps achieve financial goals by automating the tediousness of tracking personal budgets.

I use a dataset of 634 recurrences (transactions) that come in two types: savings (N = 153) and monthly expenses (N = 481). The dataset contains:

information about 170 customers
amounts used for savings and expenses
days of the month when the money gets returned to customers to pay for expenses due

My findings indicate that most recurrences are expenses (76%), while the remaining (24%) are savings. The typical capacity of recurrences is about 1,420. But savings recurrences typically involve much larger capacity compared to expense recurrences. The most frequent and largest expense recurrences revolve around the beginning and end of the month, typically involving rent and car expenses. The most frequent savings recurrence involved emergency funds (22%). Interestingly, customers with the highest expense recurrences (top 10) are not the ones who have the highest savings recurrences. I hope these findings help further understand the app's partners and provide opportunities to improve the business strategy.

In [ ]:

#import Python libraries
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import json 
import numpy as np
from wordcloud import WordCloud

In [ ]:

#read in the dataset into dataframe and display randomly selected rows
data = pd.read_csv('data.csv') 
data.sample(3)

Out[ ]:

	id	partnerID	name	recurrence	capacity
482	555	574	BCU Loan	{"timing":3,"dayOfTheMonth":19}	{"cents":{"numberOfDimes":0,"numberOfPennies":...
412	451	387	Honda	{"timing":3,"dayOfTheMonth":20}	{"cents":{"numberOfDimes":0,"numberOfPennies":...
600	675	812	Target	{"timing":3,"dayOfTheMonth":16}	{"cents":{"numberOfDimes":0,"numberOfPennies":...

In [ ]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 634 entries, 0 to 633
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          634 non-null    int64 
 1   partnerID   634 non-null    int64 
 2   name        634 non-null    object
 3   recurrence  634 non-null    object
 4   capacity    634 non-null    object
dtypes: int64(2), object(3)
memory usage: 24.9+ KB

1. Investigation of Recurrences¶

To explore recurrences, I retrieved the type of recurrence from JSON strings in the recurrence column and saved it to a new type column. Using this new column, I created a pie plot to depict the frequency distribution of the recurrence types. The plot reveals that most recurrences are expenses (76%), the remaining (24%) are savings.

To obtain the day of the month, I split the dataframe by recurrence type into two dataframes: expenses and savings (since only expense recurrences have a value for the date). Then, for the expenses dataframe, I retrieved values for dayOfTheMonth into a new day column. I created a frequency table and a count plot using this new column. The table and plot reveal that the most significant number of expense recurrences occurs on the 1st day of the month (26%). The second most frequent day for expense recurrences is the 28th day of the month (9%), while the 15th day is the third most frequent day for expense recurrences.

The fact that a typical payday occurs during the beginning, end, and middle of the month may explain this observation. Also, the due date for housing rent is generally at the start of the month. Later in the notebook, I show that the average capacity for expense recurrences is the highest on the first day of the month. Also, rent is the most frequent type of expense.

In [ ]:

#parse JSON strings, convert them into Python dictionaries, and retrieve recurrence types from the recurrence column and values from the capacity column
data['type'] = data.recurrence.apply(lambda x: json.loads(x)['timing'])
data['capacity'] = data.capacity.apply(lambda x: json.loads(x)['dollars'])

In [ ]:

#create pie plot for the type column 
round(data.type.value_counts(normalize=True)*100,1).plot.pie(labels=['expenses', 'savings'],
                                                             figsize=(9,9),autopct='%.1f%%', 
                                                             fontsize=20)
plt.title('Relative Frequency of Recurrence by Type',fontsize=20)
plt.ylabel('')
plt.show()

In [ ]:

#split dataframe into two: expenses and savings
expenses = data[data.recurrence.str.contains('"timing":3')].copy()
savings = data[data.recurrence.str.contains('"timing":0')].copy()

#retrieve values for dayOfTheMonth into the day column
expenses['day'] = expenses.recurrence.apply(lambda x: json.loads(x)['dayOfTheMonth']).copy() 

In [ ]:

#create count plot to show counts of expenses recurrences in each day
plt.figure(figsize=(10, 7))
sns.countplot(x='day', data=expenses)
plt.title('Number of Expense Recurrences per Day', fontsize=20)
plt.show()

In [ ]:

#create a frequency distribution table for the day column
round(expenses.day.value_counts(normalize=True)*100, 1)

Out[ ]:

1     26.2
28     9.4
15     7.5
10     4.4
4      3.3
7      3.3
5      3.3
3      3.1
25     2.9
8      2.7
16     2.7
13     2.7
22     2.7
12     2.7
18     2.5
24     2.3
21     2.3
26     2.1
17     1.9
20     1.9
27     1.7
23     1.5
2      1.5
14     1.5
9      1.2
11     1.0
6      1.0
19     0.8
Name: day, dtype: float64

2. Relationship between Recurrence and Capacity¶

I created a categorical plot to investigate the relationship between recurrence and capacity. The plot shows that, on average, values for savings recurrences (mean=347) are much higher than those for expense recurrences (mean=4,792). I also computed a pairwise correlation with a significant negative Pearson correlation coefficient (r=-0.60; savings and expenses were coded 0 and 3, respectively). These observations indicate that savings recurrences have a larger capacity than expense recurrences.

In [ ]:

#create cat plot for capacity and type columns
g = sns.catplot(x='type',y='capacity', data=data)
g.set_xticklabels(['savings', 'expenses'], fontsize=15)
g.set_axis_labels('', 'capacity', fontsize=15)
plt.title('Capacity by Type of Recurrence', fontsize=20)

plt.axhline(data.capacity.mean(), label='mean')
plt.legend()
plt.gcf().set_size_inches(10, 8)
plt.show()

In [ ]:

#compute a pairwise correlation between type and capacity columns
data[['type','capacity']].corr()

Out[ ]:

	type	capacity
type	1.000000	-0.597794
capacity	-0.597794	1.000000

3. Exploration of Expense Purposes¶

To conduct an exploratory analysis using the name column, I converted all uppercase characters to lowercase and removed all emoji characters from text strings.

I used a frequency table to demonstrate frequencies of recurrence purposes. The frequency table indicates that rent (15%), emergency fund (5%), car payment (5%), car (4%), and car insurance (3%) are the most frequent description of the purpose for the envelope. To depict this pattern visually, I generated a word cloud where the text font size is associated with the frequency of the purpose (the larger the font size, the higher the frequency).

To dig a little deeper, I identified all card expense recurrences and their relative frequency. Interestingly, only 4% of expense recurrences contain a "card" to describe the name of an envelope. I expected this percentage to be much higher. However, other names were used that may refer to credit card names (e.g., "discover," "capital one") that were not included in the total count because they did contain the "card" keyword. As such, further data cleaning is necessary for more accurate results.

In [ ]:

#convert all uppercase characters to lowercase
data.name = data.name.str.lower()
expenses.name = expenses.name.str.lower()
savings.name = savings.name.str.lower()

#use remove_emoji and regex to remove emoji characters
import re                          
def remove_emoji(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # chinese char
                               u"\U00002702-\U000027B0"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # dingbats
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

data.name = data.name.apply(remove_emoji)
expenses.name = expenses.name.apply(remove_emoji)
savings.name = savings.name.apply(remove_emoji)

#remove white space 
data.name = data.name.str.strip()
expenses.name = expenses.name.str.strip()
savings.name = savings.name.str.strip()

In [ ]:

#create a frequency table for the ten most frequent strings in the name column
round(data.name.value_counts(normalize=True)[:10]*100,0)

Out[ ]:

rent              15.0
emergency fund     5.0
car payment        5.0
car                4.0
car insurance      3.0
internet           2.0
phone              2.0
electric           2.0
student loans      2.0
credit card        1.0
Name: name, dtype: float64

In [ ]:

# Generate a wordcloud
text = data.name.value_counts().to_dict()
wc = WordCloud(background_color="white",  max_font_size=100, max_words=100, 
               contour_width=1, contour_color='yellow').generate_from_frequencies(text)


# show
plt.figure(figsize=[10,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [ ]:

#identify all card payments
expenses[expenses.name.str.contains('card')]

Out[ ]:

	id	partnerID	name	recurrence	capacity	type	day
52	85	144	discover card	{"timing":3,"dayOfTheMonth":18}	100	3	18
53	86	144	capital card	{"timing":3,"dayOfTheMonth":3}	100	3	3
69	102	259	credit card	{"timing":3,"dayOfTheMonth":16}	100	3	16
70	103	259	credit card	{"timing":3,"dayOfTheMonth":23}	100	3	23
99	134	314	discover card	{"timing":3,"dayOfTheMonth":28}	300	3	28
100	135	314	apple card	{"timing":3,"dayOfTheMonth":28}	200	3	28
133	168	350	credit card	{"timing":3,"dayOfTheMonth":11}	500	3	11
240	278	489	citizens credit card	{"timing":3,"dayOfTheMonth":12}	75	3	12
242	280	489	amex card	{"timing":3,"dayOfTheMonth":12}	100	3	12
373	412	158	credit card	{"timing":3,"dayOfTheMonth":13}	300	3	13
377	416	33	apple card	{"timing":3,"dayOfTheMonth":28}	250	3	28
441	514	416	credit card	{"timing":3,"dayOfTheMonth":1}	350	3	1
464	537	221	credit card	{"timing":3,"dayOfTheMonth":20}	30	3	20
471	544	547	credit card debt	{"timing":3,"dayOfTheMonth":10}	600	3	10
501	576	640	zales credit card	{"timing":3,"dayOfTheMonth":13}	118	3	13
502	577	640	chase credit card	{"timing":3,"dayOfTheMonth":14}	40	3	14
503	578	640	discover credit card	{"timing":3,"dayOfTheMonth":15}	35	3	15
526	601	747	credit card	{"timing":3,"dayOfTheMonth":21}	250	3	21
549	624	728	credit card	{"timing":3,"dayOfTheMonth":9}	80	3	9
566	641	745	credit card	{"timing":3,"dayOfTheMonth":4}	180	3	4
627	702	798	macys credit card	{"timing":3,"dayOfTheMonth":10}	29	3	10

In [ ]:

#calculate an average capacity
card_percentage = round((len(expenses[expenses.name.str.contains('card')]) / len(expenses))*100, 0)
print("Percentage of card payments: {:.0f}".format(card_percentage))

Percentage of card payments: 4

4. Exploration of Capacity¶

The dataset has an average capacity of 1,420. Yet, savings capacity (mean=4,793 is substantially higher than expense capacity (mean=347).

The box plot shows that the capacity column has many outlying values with exceptionally high values, ranging from 10 to 23,000. The histogram indicates that the distribution of capacity values has a long right tail; this observation also occurs when histograms are created separately for expenses and savings recurrences.

It is interesting to note that the highest expense recurrences tend to occur on the 1st and 28th of the month, as shown by the strip plot. As discussed earlier, the frequency of recurrences is also the highest on those days.

I also explored expense and savings recurrences per customer. Interestingly, the frequency tables show that customers with the highest expense recurrences (top 10) are not the ones who have the highest savings recurrences.

In [ ]:

# generate descriptive statistics for capacity column
data.describe()['capacity']

Out[ ]:

count      634.000000
mean      1419.911672
std       3184.468184
min         10.000000
25%         95.000000
50%        300.000000
75%       1000.000000
max      23000.000000
Name: capacity, dtype: float64

In [ ]:

print("Overall average capacity: {:.0f}".format(data.capacity.mean()))
print("Average expenses capacity: {:.0f}".format(expenses.capacity.mean()))
print("Average savings capacity: {:.0f}".format(savings.capacity.mean()))

Overall average capacity: 1420
Average expenses capacity: 347
Average savings capacity: 4793

In [ ]:

#create box plot for capacity column
plt.figure(figsize=(10, 7))
sns.boxplot(data=data, y='capacity')
plt.title('Distribution of Capacity', fontsize=15)
plt.show()

In [ ]:

#create histogram for capacity column
data.capacity.plot.hist(figsize=(7,7), bins=10)
plt.title('', fontsize=16)
plt.title('Distribution of Capacity Using Bins', fontsize=15)
plt.xlabel('Capacity')
plt.show()

In [ ]:

#create histogram for capacity column for expense recurrences
expenses.capacity.plot.hist(figsize=(7,7), bins=10)
plt.title('', fontsize=16)
plt.title('Distribution of Capacity Using Bins for Expenses Recurrences', fontsize=15)
plt.xlabel('Capacity')
plt.show()

In [ ]:

#create histogram for capacity column for savings recurrences
savings.capacity.plot.hist(figsize=(7,7), bins=5)
plt.title('', fontsize=16)
plt.title('Distribution of Capacity Using Bins for Savings Recurrences', fontsize=15)
plt.xlabel('Capacity')
plt.show()

In [ ]:

#create strip plot for capacity per day
plt.figure(figsize=(12, 7))
sns.stripplot(x='day', y='capacity', data=expenses)
plt.title('Distribution of Capacity per Day', fontsize=15)
plt.show()

In [ ]:

#create frequency table of partners who have top 10 average expenses recurrences
partner_exp = expenses.groupby('partnerID')
partner_exp.capacity.mean().sort_values(ascending=False).head(10)

Out[ ]:

partnerID
410    2876.0
4      1650.0
275    1649.0
439    1617.0
206    1580.0
135    1536.0
46     1046.5
493    1040.0
271    1030.0
277    1000.0
Name: capacity, dtype: float64

In [ ]:

#create frequency table of partners who have top 10 average savings recurrences
partner_sav = savings.groupby('partnerID')
partner_sav.capacity.mean().sort_values(ascending=False).head(10)

Out[ ]:

partnerID
387    23000.0
477    20000.0
574    15642.0
723    15251.0
213    15000.0
852    10989.0
311    10000.0
33     10000.0
746     8799.0
249     8500.0
Name: capacity, dtype: float64

Conclusion¶

I explored the behaviors of 170 customers. Most recurrences are expenses (76%), while the remaining (24%) are savings. The typical capacity of recurrences is about 1,420. But savings recurrences typically involve much larger capacity compared to expense recurrences. The most frequent and largest expense recurrences revolve around the beginning and end of the month, typically involving rent and car expenses. The most frequent savings recurrence involved emergency funds (22%). Interestingly, customers with the highest expense recurrences (top 10) are not the ones who have the highest savings recurrences. I hope these findings help further understand the app's partners and provide opportunities to improve the business strategy.