I've been playing Rocket League (on console) for more than 5 years. Am I a Pro? NOPE! I'm an Average Joe in terms of rankings, but I love the competitiveness. Most of my friends still play the game so the social aspect is another reason I continue to play.
The game offers stats but it's an aggregate of all my lifetime stats. It's doesn't offer any sort of trends. I decided to record at least 500 games to track my performance over time. After each match concluded, I took a photo with my phone of the scoreboard, and manually plugged in the stats into an excel spreadsheet. #automation
I wanted to answer questions as simple as:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#show plots inline
%matplotlib inline
#read the file
rl_data = pd.read_csv('Rocket_League_Stats.csv')
Inspecting Data =
We'll inspect the data using the following methods -
df.info()
df.describe()
df.shape
df['strings'].value_counts()
df.head()
& df.tail()
The methods will be explained within the comments section below!
#summary information about our index, columns, and memory usage
rl_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 505 entries, 0 to 504 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Games 505 non-null object 1 Match Result 505 non-null int64 2 Goals 505 non-null int64 3 Assists 505 non-null int64 4 Saves 505 non-null int64 5 Shots 505 non-null int64 6 Points 505 non-null int64 dtypes: int64(6), object(1) memory usage: 27.7+ KB
#determine the number of rows and columns in our df
rl_data.shape
(505, 7)
#statistical information per column (by default this method only includes numeric columns)
rl_data.describe(include='all')
Games | Match Result | Goals | Assists | Saves | Shots | Points | |
---|---|---|---|---|---|---|---|
count | 505 | 505.000000 | 505.000000 | 505.000000 | 505.000000 | 505.000000 | 505.000000 |
unique | 505 | NaN | NaN | NaN | NaN | NaN | NaN |
top | Game 91 | NaN | NaN | NaN | NaN | NaN | NaN |
freq | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
mean | NaN | 0.552475 | 1.392079 | 0.629703 | 1.259406 | 3.045545 | 437.158416 |
std | NaN | 0.497732 | 1.264886 | 0.881641 | 1.167532 | 1.872923 | 202.399158 |
min | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 36.000000 |
25% | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 284.000000 |
50% | NaN | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 3.000000 | 414.000000 |
75% | NaN | 1.000000 | 2.000000 | 1.000000 | 2.000000 | 4.000000 | 577.000000 |
max | NaN | 1.000000 | 7.000000 | 9.000000 | 9.000000 | 9.000000 | 1021.000000 |
#returns the first 5 rows of our data
rl_data.head()
Games | Match Result | Goals | Assists | Saves | Shots | Points | |
---|---|---|---|---|---|---|---|
0 | Game 1 | 1 | 1 | 1 | 0 | 1 | 301 |
1 | Game 2 | 0 | 0 | 2 | 2 | 5 | 553 |
2 | Game 3 | 0 | 0 | 0 | 2 | 0 | 196 |
3 | Game 4 | 1 | 0 | 0 | 2 | 1 | 210 |
4 | Game 5 | 0 | 0 | 0 | 2 | 3 | 249 |
#returns the last 5 rows of our data
rl_data.tail()
Games | Match Result | Goals | Assists | Saves | Shots | Points | |
---|---|---|---|---|---|---|---|
500 | Game 501 | 0 | 1 | 0 | 3 | 5 | 622 |
501 | Game 502 | 1 | 3 | 1 | 1 | 8 | 806 |
502 | Game 503 | 0 | 3 | 0 | 1 | 5 | 613 |
503 | Game 504 | 0 | 2 | 0 | 0 | 3 | 416 |
504 | Game 505 | 1 | 1 | 4 | 1 | 4 | 696 |
#returns a series with the frequency per value
rl_data['Match Result'].value_counts()
1 279 0 226 Name: Match Result, dtype: int64
Data Cleaning =
This is my personal dataset and I'm fairly confident it's "clean". However, I'm human prone to errors so it's best to confirm.
Currently, in the Games
column the values output reads as Game 1
, Game 2
, and so on. Luckily there is clear pattern - all values includes characters followed by digits. I will remove the non-digit characters, which starts with Game
and keep the digit. Then, I will convert the column to numeric dtypes.
I'll use the Series.str.replace()
method, which will remove all the quote characters. In addition, I'll use the Series.str.split()
method to strip whitespace from each string.
rl_data['Games'] = rl_data['Games'].str.replace('Game','').str.strip()
Converting Columns to Numeric Dtypes
Now we can convert the columns to a numeric dtype. I'll use the Series.astype()
method. We don't need to store decimals so I'll select int
dtype.
Using df.head()
we'll inspect the first 5 rows along with df.info()
to confirm the column was successfully converted to int
dtype.
rl_data['Games'] = rl_data['Games'].astype(int)
rl_data.head()
Games | Match Result | Goals | Assists | Saves | Shots | Points | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | 0 | 1 | 301 |
1 | 2 | 0 | 0 | 2 | 2 | 5 | 553 |
2 | 3 | 0 | 0 | 0 | 2 | 0 | 196 |
3 | 4 | 1 | 0 | 0 | 2 | 1 | 210 |
4 | 5 | 0 | 0 | 0 | 2 | 3 | 249 |
rl_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 505 entries, 0 to 504 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Games 505 non-null int64 1 Match Result 505 non-null int64 2 Goals 505 non-null int64 3 Assists 505 non-null int64 4 Saves 505 non-null int64 5 Shots 505 non-null int64 6 Points 505 non-null int64 dtypes: int64(7) memory usage: 27.7 KB
Success!
...for now
Create A New Column Based On Existing Column Values (Part 1)
The Match Result
column is currently two values - 1
equals won and 0
which translates to loss. I want to create a new column, based on Match Result
values, with new string values W
and L
. I'll keep the original for reference.
I'll use Series.map()
method, but first I'll create a function. If the value is greater than 0, then return W
else return L
. I'll label the new column Results W/L
.
I'll inspect using df.head()
to confirm the returned results.
def label(element):
if element > 0:
return 'W'
else:
return 'L'
rl_data['Result_W/L'] = rl_data['Match Result'].map(label)
rl_data.head()
Games | Match Result | Goals | Assists | Saves | Shots | Points | Result_W/L | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | 0 | 1 | 301 | W |
1 | 2 | 0 | 0 | 2 | 2 | 5 | 553 | L |
2 | 3 | 0 | 0 | 0 | 2 | 0 | 196 | L |
3 | 4 | 1 | 0 | 0 | 2 | 1 | 210 | W |
4 | 5 | 0 | 0 | 0 | 2 | 3 | 249 | L |
Create A New Column Based On Existing Column Values (Part 2)
Thinking ahead about how I want to explore the data using visualizations (more on that ahead), I want to create a new column that labels every 100 games in a certain bucket. For example, games 1-100 will be labeled as 0-100...and so on.
I'll use Series.map()
method again, but I must create my function will mulitple if statements.
def fn(x):
if x < 101:
return '0-100'
if 100 < x < 201:
return '101-200'
if 200 < x < 301:
return '201-300'
if 300 < x < 401:
return '301-400'
if 400 < x < 506:
return '401-505'
return 0
rl_data['Interval_Games'] = rl_data['Games'].map(fn)
I'll inspect using df[Series].value_counts()
because I need to confirm there is a total of 100 games per bucket and 105 values in 401-505.
rl_data['Interval_Games'].value_counts()
401-505 105 101-200 100 201-300 100 0-100 100 301-400 100 Name: Interval_Games, dtype: int64
rl_data.head()
Games | Match Result | Goals | Assists | Saves | Shots | Points | Result_W/L | Interval_Games | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 1 | 0 | 1 | 301 | W | 0-100 |
1 | 2 | 0 | 0 | 2 | 2 | 5 | 553 | L | 0-100 |
2 | 3 | 0 | 0 | 0 | 2 | 0 | 196 | L | 0-100 |
3 | 4 | 1 | 0 | 0 | 2 | 1 | 210 | W | 0-100 |
4 | 5 | 0 | 0 | 0 | 2 | 3 | 249 | L | 0-100 |
rl_data.tail()
Games | Match Result | Goals | Assists | Saves | Shots | Points | Result_W/L | Interval_Games | |
---|---|---|---|---|---|---|---|---|---|
500 | 501 | 0 | 1 | 0 | 3 | 5 | 622 | L | 401-505 |
501 | 502 | 1 | 3 | 1 | 1 | 8 | 806 | W | 401-505 |
502 | 503 | 0 | 3 | 0 | 1 | 5 | 613 | L | 401-505 |
503 | 504 | 0 | 2 | 0 | 0 | 3 | 416 | L | 401-505 |
504 | 505 | 1 | 1 | 4 | 1 | 4 | 696 | W | 401-505 |
Great, this is working as intended!
Visualizing The Data =
In my opinion, scoring at least 300 points per game is good. So let us see how I faired in these 505 games!
plt.hist(rl_data['Points'])
plt.title('Distribution of Points')
plt.xlabel('Points')
plt.ylabel('Frequency')
plt.show()
Based on the frequency, looks like most of my games are between 350 and 500.
Earlier I crearted an interval for every 100 games. I wanted to see how my first 100 games compared to my last 105 (remember I logged 505 games in total), and everything in between. I'll continue using a histogram, but this time for each interval and total points.
g = sns.FacetGrid(rl_data , col = "Interval_Games", height=3.5, aspect=.75)
g.map(sns.histplot, "Points")
g.set_axis_labels("Points", "Frequency")
<seaborn.axisgrid.FacetGrid at 0x7f907b5186d0>
Very interesting data! The closest chart to a symmertical distribution is interval 101-200. However, interval 401-505 stands out for a couple of reasons. First, sub 250 point games were more frequent. Second, higher scoring games were much more frequent as well.
sns.scatterplot(data=rl_data, x="Points", y="Goals", hue="Match Result", size="Match Result", sizes=(200,125))
<AxesSubplot:xlabel='Points', ylabel='Goals'>