This dataset contains 1295 records of American colleges and their properties, collected by the US Department of Education.
import pandas as pd
import lux
df = pd.read_csv("../data/college.csv")
df
We see that ACTMedian and SATAverage are very strongly correlated. This means that we could probably keep just one of the columns and still retain about the same information. So let's drop the ACTMedian column.
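The correlation read off the chart can also be checked numerically. The sketch below is only an illustration: the `toy` frame and its values are invented stand-ins for the real data, and the 0.9 cutoff is an assumed threshold, not something the dataset or Lux prescribes.

```python
import pandas as pd

# Toy stand-in for the college data; these values are invented for illustration only.
toy = pd.DataFrame({
    "ACTMedian": [20, 22, 25, 28, 31, 33],
    "SATAverage": [950, 1030, 1120, 1250, 1380, 1470],
})

# Pearson correlation between the two columns (Series.corr defaults to Pearson).
r = toy["ACTMedian"].corr(toy["SATAverage"])

# Assumed cutoff: above it, treat the two columns as redundant and drop one.
REDUNDANCY_CUTOFF = 0.9
if r > REDUNDANCY_CUTOFF:
    toy = toy.drop(columns=["ACTMedian"])

print(round(r, 3), list(toy.columns))
```

On the real dataframe the same `corr` call would apply to `df["ACTMedian"]` and `df["SATAverage"]` before the column is dropped.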
df = df.drop(columns=["ACTMedian"])
df
From the Category tab, we see that there are few records where PredominantDegree is "Certificate". In addition, there are not a lot of colleges with "Private For-Profit" as FundingModel.
We can take a look at this by inspecting the Series corresponding to the column PredominantDegree. Note that Lux not only helps with visualizing dataframes, but also displays visualizations of Series objects.
df["PredominantDegree"]
df[df["PredominantDegree"]=="Certificate"].to_pandas()
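Besides reading the bar chart, the rare categories can be tabulated directly with `value_counts`. This is a minimal sketch on an invented `toy` frame (the rows are hypothetical, not taken from the dataset); `rare` is just an illustrative name for categories that occur exactly once.

```python
import pandas as pd

# Invented rows for illustration; the real column lives in df["PredominantDegree"].
toy = pd.DataFrame({
    "PredominantDegree": [
        "Bachelor", "Bachelor", "Associate",
        "Certificate", "Bachelor", "Associate",
    ]
})

# Number of records per category, most frequent first.
counts = toy["PredominantDegree"].value_counts()

# Categories that appear exactly once are candidates for manual inspection.
rare = counts[counts == 1].index.tolist()

print(counts)
print(rare)
```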
Upon inspection, there is only a single record for Certificate. Looking at the webpage for programs offered at Cleveland State Community College, we see that a large number of associate as well as certificate degrees are offered. So we decide that this record is more appropriately labelled as "Associate" for the PredominantDegree field.
df.loc[df["PredominantDegree"]=="Certificate","PredominantDegree"] = "Associate"
By inspecting the subset of 9 colleges that are "Private For-Profit", we do not find any commonalities across them, so we can just leave the data as-is for now.
df[df["FundingModel"]=="Private For-Profit"]
Back to looking at the entire dataset:
df
We are interested in picking a college to attend and want to understand the AverageCost of attending different colleges and how that relates to other information in the dataset.
df.intent = ["AverageCost"]
df
We see that there are a large number of colleges that cost around $20,000 per year. We also see that Bachelor degree colleges and colleges in New England and large cities tend to have a higher AverageCost than their counterparts.
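A comparison like "Bachelor degree colleges tend to cost more" can be cross-checked with a group-wise mean. The sketch below uses an invented `toy` frame whose numbers are purely illustrative; only the column names come from the dataset.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "PredominantDegree": ["Bachelor", "Bachelor", "Associate",
                          "Associate", "Bachelor", "Associate"],
    "AverageCost": [32000, 41000, 12000, 15000, 27000, 9000],
})

# Mean AverageCost per degree type, highest first.
mean_cost = (
    toy.groupby("PredominantDegree")["AverageCost"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_cost)
```

The same pattern should work for other categorical columns in the real dataframe by swapping the grouping key.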
We are interested in the trend of AverageCost vs. SATAverage, since there is a rough upward relationship above an AverageCost of $30,000, but below that the trend is less clear.
df.intent = ["AverageCost","SATAverage"]
df
By adding the FundingModel, we see that the cluster of points on the left can clearly be attributed to public colleges, whereas private colleges more or less follow a trend in which colleges with a higher SATAverage tend to have a higher AverageCost.
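One way to put numbers on this split is to compute the SATAverage/AverageCost correlation separately for each funding model. This is a rough sketch on invented data: the `toy` values and the group labels "Public" and "Private" are assumptions for illustration and may not match the labels in the real FundingModel column.

```python
import pandas as pd

# Invented values: public costs stay roughly flat, private costs rise with SAT.
toy = pd.DataFrame({
    "FundingModel": ["Public"] * 4 + ["Private"] * 4,
    "SATAverage": [1000, 1100, 1200, 1300, 1000, 1100, 1200, 1300],
    "AverageCost": [15000, 15500, 14800, 15200, 30000, 36000, 42000, 50000],
})

# Pearson correlation of SATAverage and AverageCost within each funding model.
by_model = {
    name: group["SATAverage"].corr(group["AverageCost"])
    for name, group in toy.groupby("FundingModel")
}

for name, r in by_model.items():
    print(name, round(r, 3))
```

A strong positive value for one group next to a value near zero for the other would be consistent with the pattern seen in the colored scatterplot.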