Rocket League: Tracking 505 Matches

I've been playing Rocket League (on console) for more than 5 years. Am I a Pro? NOPE! I'm an Average Joe in terms of rankings, but I love the competitiveness. Most of my friends still play the game so the social aspect is another reason I continue to play.

The game offers stats but it's an aggregate of all my lifetime stats. It's doesn't offer any sort of trends. I decided to record at least 500 games to track my performance over time. After each match concluded, I took a photo with my phone of the scoreboard, and manually plugged in the stats into an excel spreadsheet. #automation

I wanted to answer questions as simple as:

  • After tracking Game 1 to Game 505, did I improve in terms of overall stats (Total Points)?
  • What is my win/loss percentage trend?
  • How has my offense improved (Goals and Assists per game)?
  • How has my defense improved (Saves per game)?
In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#show plots inline
%matplotlib inline
In [2]:
#read the file
rl_data = pd.read_csv('Rocket_League_Stats.csv')

Inspecting Data

We'll inspect the data using the following methods -

  • df.info()
  • df.describe()
  • df.shape
  • df['strings'].value_counts()
  • df.head() & df.tail()

The methods will be explained within the comments section below!

In [3]:
#summary information about our index, columns, and memory usage
rl_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Games         505 non-null    object
 1   Match Result  505 non-null    int64 
 2   Goals         505 non-null    int64 
 3   Assists       505 non-null    int64 
 4   Saves         505 non-null    int64 
 5   Shots         505 non-null    int64 
 6   Points        505 non-null    int64 
dtypes: int64(6), object(1)
memory usage: 27.7+ KB
In [4]:
#determine the number of rows and columns in our df 
rl_data.shape
Out[4]:
(505, 7)
In [5]:
#statistical information per column (by default this method only includes numeric columns)
rl_data.describe(include='all')
Out[5]:
Games Match Result Goals Assists Saves Shots Points
count 505 505.000000 505.000000 505.000000 505.000000 505.000000 505.000000
unique 505 NaN NaN NaN NaN NaN NaN
top Game 91 NaN NaN NaN NaN NaN NaN
freq 1 NaN NaN NaN NaN NaN NaN
mean NaN 0.552475 1.392079 0.629703 1.259406 3.045545 437.158416
std NaN 0.497732 1.264886 0.881641 1.167532 1.872923 202.399158
min NaN 0.000000 0.000000 0.000000 0.000000 0.000000 36.000000
25% NaN 0.000000 0.000000 0.000000 0.000000 2.000000 284.000000
50% NaN 1.000000 1.000000 0.000000 1.000000 3.000000 414.000000
75% NaN 1.000000 2.000000 1.000000 2.000000 4.000000 577.000000
max NaN 1.000000 7.000000 9.000000 9.000000 9.000000 1021.000000
In [6]:
#returns the first 5 rows of our data
rl_data.head()
Out[6]:
Games Match Result Goals Assists Saves Shots Points
0 Game 1 1 1 1 0 1 301
1 Game 2 0 0 2 2 5 553
2 Game 3 0 0 0 2 0 196
3 Game 4 1 0 0 2 1 210
4 Game 5 0 0 0 2 3 249
In [7]:
#returns the last 5 rows of our data
rl_data.tail()
Out[7]:
Games Match Result Goals Assists Saves Shots Points
500 Game 501 0 1 0 3 5 622
501 Game 502 1 3 1 1 8 806
502 Game 503 0 3 0 1 5 613
503 Game 504 0 2 0 0 3 416
504 Game 505 1 1 4 1 4 696
In [8]:
#returns a series with the frequency per value
rl_data['Match Result'].value_counts()
Out[8]:
1    279
0    226
Name: Match Result, dtype: int64

Data Cleaning

This is my personal dataset and I'm fairly confident it's "clean". However, I'm human prone to errors so it's best to confirm.

Removing Non-Digit Characters

Currently, in the Games column the values output reads as Game 1, Game 2, and so on. Luckily there is clear pattern - all values includes characters followed by digits. I will remove the non-digit characters, which starts with Game and keep the digit. Then, I will convert the column to numeric dtypes.

I'll use the Series.str.replace() method, which will remove all the quote characters. In addition, I'll use the Series.str.split() method to strip whitespace from each string.

In [9]:
rl_data['Games'] = rl_data['Games'].str.replace('Game','').str.strip()

Converting Columns to Numeric Dtypes

Now we can convert the columns to a numeric dtype. I'll use the Series.astype() method. We don't need to store decimals so I'll select int dtype.

Using df.head() we'll inspect the first 5 rows along with df.info() to confirm the column was successfully converted to int dtype.

In [10]:
rl_data['Games'] = rl_data['Games'].astype(int)
In [11]:
rl_data.head()
Out[11]:
Games Match Result Goals Assists Saves Shots Points
0 1 1 1 1 0 1 301
1 2 0 0 2 2 5 553
2 3 0 0 0 2 0 196
3 4 1 0 0 2 1 210
4 5 0 0 0 2 3 249
In [12]:
rl_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Games         505 non-null    int64
 1   Match Result  505 non-null    int64
 2   Goals         505 non-null    int64
 3   Assists       505 non-null    int64
 4   Saves         505 non-null    int64
 5   Shots         505 non-null    int64
 6   Points        505 non-null    int64
dtypes: int64(7)
memory usage: 27.7 KB

Success!

...for now

Create A New Column Based On Existing Column Values (Part 1)

The Match Result column is currently two values - 1 equals won and 0 which translates to loss. I want to create a new column, based on Match Result values, with new string values W and L. I'll keep the original for reference.

I'll use Series.map() method, but first I'll create a function. If the value is greater than 0, then return W else return L. I'll label the new column Results W/L.

I'll inspect using df.head() to confirm the returned results.

In [13]:
def label(element):
    if element > 0:
        return 'W'
    else:
        return 'L'
    
rl_data['Result_W/L'] = rl_data['Match Result'].map(label)
In [14]:
rl_data.head()
Out[14]:
Games Match Result Goals Assists Saves Shots Points Result_W/L
0 1 1 1 1 0 1 301 W
1 2 0 0 2 2 5 553 L
2 3 0 0 0 2 0 196 L
3 4 1 0 0 2 1 210 W
4 5 0 0 0 2 3 249 L

Create A New Column Based On Existing Column Values (Part 2)

Thinking ahead about how I want to explore the data using visualizations (more on that ahead), I want to create a new column that labels every 100 games in a certain bucket. For example, games 1-100 will be labeled as 0-100...and so on.

I'll use Series.map() method again, but I must create my function will mulitple if statements.

In [15]:
def fn(x):
    if x < 101:
        return '0-100'
    if 100 < x < 201:
        return '101-200'
    if 200 < x < 301:
        return '201-300'
    if 300 < x < 401:
        return '301-400'
    if 400 < x < 506:
        return '401-505'
    return 0

rl_data['Interval_Games'] = rl_data['Games'].map(fn)

I'll inspect using df[Series].value_counts() because I need to confirm there is a total of 100 games per bucket and 105 values in 401-505.

In [16]:
rl_data['Interval_Games'].value_counts()
Out[16]:
401-505    105
101-200    100
201-300    100
0-100      100
301-400    100
Name: Interval_Games, dtype: int64
In [17]:
rl_data.head()
Out[17]:
Games Match Result Goals Assists Saves Shots Points Result_W/L Interval_Games
0 1 1 1 1 0 1 301 W 0-100
1 2 0 0 2 2 5 553 L 0-100
2 3 0 0 0 2 0 196 L 0-100
3 4 1 0 0 2 1 210 W 0-100
4 5 0 0 0 2 3 249 L 0-100
In [18]:
rl_data.tail()
Out[18]:
Games Match Result Goals Assists Saves Shots Points Result_W/L Interval_Games
500 501 0 1 0 3 5 622 L 401-505
501 502 1 3 1 1 8 806 W 401-505
502 503 0 3 0 1 5 613 L 401-505
503 504 0 2 0 0 3 416 L 401-505
504 505 1 1 4 1 4 696 W 401-505

Great, this is working as intended!

Visualizing The Data

Point Distribution

In my opinion, scoring at least 300 points per game is good. So let us see how I faired in these 505 games!

In [19]:
plt.hist(rl_data['Points'])
plt.title('Distribution of Points')
plt.xlabel('Points')
plt.ylabel('Frequency')
plt.show()

Based on the frequency, looks like most of my games are between 350 and 500.

Earlier I crearted an interval for every 100 games. I wanted to see how my first 100 games compared to my last 105 (remember I logged 505 games in total), and everything in between. I'll continue using a histogram, but this time for each interval and total points.

In [20]:
g  =  sns.FacetGrid(rl_data , col = "Interval_Games", height=3.5, aspect=.75)
In [21]:
g.map(sns.histplot, "Points")
g.set_axis_labels("Points", "Frequency")
Out[21]:
<seaborn.axisgrid.FacetGrid at 0x7f907b5186d0>

Very interesting data! The closest chart to a symmertical distribution is interval 101-200. However, interval 401-505 stands out for a couple of reasons. First, sub 250 point games were more frequent. Second, higher scoring games were much more frequent as well.

In [22]:
sns.scatterplot(data=rl_data, x="Points", y="Goals", hue="Match Result", size="Match Result", sizes=(200,125))
Out[22]:
<AxesSubplot:xlabel='Points', ylabel='Goals'>
In [ ]: