This dataset contains 1295 records of American colleges and their properties, collected by the US Department of Education.

In [ ]:

import pandas as pd
import lux

In [ ]:

df = pd.read_csv("../data/college.csv")
df

We see that ACTMedian and SATAverage are very strongly correlated, so keeping just one of the two columns should preserve about the same information. Let's drop the ACTMedian column.
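
To put a number on that visual impression, we can compute the Pearson correlation between the two columns before dropping one. The sketch below is only an illustration: `column_correlation` is a hypothetical helper, and the small `toy` frame stands in for college.csv with made-up values. On the real data, the same call would be made on `df` while ACTMedian is still present.

```python
import pandas as pd


def column_correlation(frame: pd.DataFrame, a: str, b: str) -> float:
    """Pearson correlation between two numeric columns (missing pairs are skipped)."""
    return float(frame[a].corr(frame[b]))


# Hypothetical stand-in for the college data; the values are made up.
toy = pd.DataFrame({
    "ACTMedian": [18, 21, 24, 27, 31],
    "SATAverage": [900, 1010, 1120, 1230, 1400],
})

r = column_correlation(toy, "ACTMedian", "SATAverage")
print(round(r, 3))
```

A value close to 1 would support treating the two columns as largely redundant.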

In [ ]:

df = df.drop(columns=["ACTMedian"])
df

From the Category tab, we see that very few records have "Certificate" as PredominantDegree. In addition, only a handful of colleges have "Private For-Profit" as FundingModel.

We can take a look at this by inspecting the Series corresponding to the column PredominantDegree. Note that Lux not only helps with visualizing dataframes, but also displays visualizations of Series objects.

In [ ]:

df["PredominantDegree"]
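
Alongside the chart, a plain frequency table makes rare categories easy to spot. This is a minimal sketch on a made-up Series (the `degrees` values are illustrative, not taken from the dataset), and the "at most one record" threshold is an assumption chosen for the example.

```python
import pandas as pd

# Hypothetical stand-in for df["PredominantDegree"]; the values are made up.
degrees = pd.Series(
    ["Bachelor", "Bachelor", "Associate", "Bachelor", "Certificate", "Associate"],
    name="PredominantDegree",
)

# Count how often each category occurs, most frequent first.
counts = degrees.value_counts()
print(counts)

# Categories with at most one record are candidates for a closer look.
rare = counts[counts <= 1].index.tolist()
print(rare)
```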

In [ ]:

df[df["PredominantDegree"]=="Certificate"].to_pandas()

Upon inspection, there is only a single record labelled "Certificate": Cleveland State Community College. Looking at the webpage for the programs it offers, we see a large number of associate degrees as well as certificates, so we decide that "Associate" is the more appropriate label for its PredominantDegree field.

In [ ]:

df.loc[df["PredominantDegree"]=="Certificate","PredominantDegree"] = "Associate"
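
The relabelling above uses a boolean mask to pick the rows and a column label to pick the field. As a small sanity check, we can confirm afterwards that no "Certificate" rows remain. The sketch below runs on a made-up three-row frame (the college names are placeholders), not on the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the college data; names and values are made up.
toy = pd.DataFrame({
    "Name": ["College A", "College B", "College C"],
    "PredominantDegree": ["Bachelor", "Certificate", "Associate"],
})

# Select the rows to change with a boolean mask, then assign to one column.
mask = toy["PredominantDegree"] == "Certificate"
toy.loc[mask, "PredominantDegree"] = "Associate"

# Count any rows that still carry the old label.
remaining = int((toy["PredominantDegree"] == "Certificate").sum())
print(remaining)
```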

By inspecting the subset of 9 colleges that are "Private For-Profit", we do not find any commonalities across them, so we can just leave the data as-is for now.

In [ ]:

df[df["FundingModel"]=="Private For-Profit"]

Back to looking at the entire dataset:

In [ ]:

df

We are interested in picking a college to attend and want to understand the AverageCost of attending different colleges and how that relates to other information in the dataset.

In [ ]:

df.intent = ["AverageCost"]
df

We see that a large number of colleges cost around $20000 per year. We also see that Bachelor degree colleges, and colleges in New England and in large cities, tend to have a higher AverageCost than their counterparts.
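
Group differences like these can be cross-checked with a grouped mean. The following is a sketch on made-up numbers, using only the PredominantDegree and AverageCost columns. The same pattern would apply to any other categorical column in the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the college data; the values are made up.
toy = pd.DataFrame({
    "PredominantDegree": ["Bachelor", "Bachelor", "Associate", "Associate"],
    "AverageCost": [32000, 28000, 12000, 16000],
})

# Mean cost per category, highest first.
mean_cost = (
    toy.groupby("PredominantDegree")["AverageCost"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_cost)
```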

We are interested in the trend of AverageCost vs. SATAverage, since there is a roughly upward relationship above an AverageCost of $30000, while below that the trend is less clear.

In [ ]:

df.intent = ["AverageCost","SATAverage"]
df

By adding the FundingModel, we see that the cluster of points on the left can clearly be attributed to public colleges, whereas private colleges more or less follow a trend in which colleges with a higher SATAverage tend to have a higher AverageCost.
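
One way to check this reading outside the chart is to compute the SATAverage and AverageCost correlation separately for each FundingModel. The sketch below is an assumption-laden illustration: the `toy` frame and its numbers are made up, and the labels "Public" and "Private" are placeholders rather than the dataset's actual category names.

```python
import pandas as pd

# Hypothetical stand-in for the college data; labels and values are made up.
toy = pd.DataFrame({
    "FundingModel": ["Public"] * 4 + ["Private"] * 4,
    "SATAverage": [950, 1050, 1150, 1250, 950, 1050, 1150, 1250],
    "AverageCost": [15000, 14000, 16000, 15000, 25000, 32000, 41000, 50000],
})

# Pearson correlation between SATAverage and AverageCost within each group.
per_group = {
    name: float(group["SATAverage"].corr(group["AverageCost"]))
    for name, group in toy.groupby("FundingModel")
}
for name, r in per_group.items():
    print(name, round(r, 2))
```

On the real data, a noticeably higher correlation for private colleges than for public ones would be consistent with what the scatterplot suggests.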