N.B. A Neural Network model would not be valid with such a small dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
github_data_url = "https://github.com/md12g12/ML_UK_university_funding_project/blob/main/funding_ML_df.csv?raw=true"
uni_df = pd.read_csv(github_data_url, index_col=[0])
uni_df.head()
Overall score | Entry standards | Student satisfaction | Research quality | Research intensity | Academic services spend | Facilities spend | Degree completion | Student -staff ratio | Graduate prospects – outcomes | Graduate prospects – on track | Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
oxford | 1000 | 200 | NaN | 3.34 | 0.87 | 2842 | 599 | 99.1 | 10.1 | 90.4 | 84.7 | 2.734851e+08 |
cambridge | 989 | 205 | NaN | 3.33 | 0.95 | 2718 | 1043 | 99.1 | 11.4 | 90.0 | 86.0 | 2.288673e+08 |
school economics and political science | 963 | 177 | 3.98 | 3.35 | 0.85 | 2051 | 853 | 96.5 | 12.4 | 90.6 | 83.3 | 3.923843e+07 |
st andrews | 947 | 208 | 4.30 | 3.13 | 0.82 | 2650 | 746 | 95.7 | 11.1 | 79.9 | 79.6 | 1.328509e+07 |
imperial | 895 | 194 | 3.99 | 3.36 | 0.92 | 2982 | 755 | 97.5 | 11.1 | 95.1 | 86.7 | 1.611291e+08 |
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5, weights="distance")
imputed_values = imputer.fit_transform(uni_df.drop(['Amount'], axis=1))
model_data = pd.DataFrame(imputed_values, index=uni_df.index, columns=uni_df.drop(["Amount"], axis=1).columns)
model_data["Amount"] = uni_df["Amount"]
model_data.dropna(inplace=True)
model_data.head()
Overall score | Entry standards | Student satisfaction | Research quality | Research intensity | Academic services spend | Facilities spend | Degree completion | Student -staff ratio | Graduate prospects – outcomes | Graduate prospects – on track | Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
oxford | 1000.0 | 200.0 | 4.054083 | 3.34 | 0.87 | 2842.0 | 599.0 | 99.1 | 10.1 | 90.4 | 84.7 | 2.734851e+08 |
cambridge | 989.0 | 205.0 | 4.062915 | 3.33 | 0.95 | 2718.0 | 1043.0 | 99.1 | 11.4 | 90.0 | 86.0 | 2.288673e+08 |
school economics and political science | 963.0 | 177.0 | 3.980000 | 3.35 | 0.85 | 2051.0 | 853.0 | 96.5 | 12.4 | 90.6 | 83.3 | 3.923843e+07 |
st andrews | 947.0 | 208.0 | 4.300000 | 3.13 | 0.82 | 2650.0 | 746.0 | 95.7 | 11.1 | 79.9 | 79.6 | 1.328509e+07 |
imperial | 895.0 | 194.0 | 3.990000 | 3.36 | 0.92 | 2982.0 | 755.0 | 97.5 | 11.1 | 95.1 | 86.7 | 1.611291e+08 |
total = model_data['Amount'].sum()
print("Total grants awarded in 2022 = £", f"{np.round(total):,}")
Total grants awarded in 2022 = £ 3,384,888,591.0
no_of_institutes = 10
top_grants = model_data[0:no_of_institutes]["Amount"].sum()
print("Total grants to the top", no_of_institutes, "institutes = £", f"{np.round(top_grants):,}",":", np.round((top_grants/total)*100), "% of the total value awarded")
Total grants to the top 10 institutes = £ 1,196,473,009.0 : 35.0 % of the total value awarded
Top 10% of institutions account for 35% of the grants awarded suggesting an exponential distribution.
plt.figure(figsize=(14, 6))
plt.bar(np.arange(0, len(model_data.index)), model_data['Amount'].sort_values(ascending=False).cumsum()/total*100)
plt.ylabel("Cumulative grant value / %")
plt.xlabel("Cumulitive number of institutes")
plt.title("Cumulitive grant distribution")
Text(0.5, 1.0, 'Cumulitive grant distribution')
model_data.corr()
Overall score | Entry standards | Student satisfaction | Research quality | Research intensity | Academic services spend | Facilities spend | Degree completion | Student -staff ratio | Graduate prospects – outcomes | Graduate prospects – on track | Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Overall score | 1.000000 | 0.830395 | -0.048085 | 0.791632 | 0.786027 | 0.464568 | 0.309395 | 0.860164 | -0.739297 | 0.778188 | 0.637087 | 0.692222 |
Entry standards | 0.830395 | 1.000000 | -0.025228 | 0.661894 | 0.662392 | 0.375319 | 0.044269 | 0.687244 | -0.517294 | 0.699312 | 0.539226 | 0.634056 |
Student satisfaction | -0.048085 | -0.025228 | 1.000000 | -0.344004 | -0.244334 | -0.182484 | 0.061939 | -0.103081 | 0.076510 | -0.225550 | -0.150424 | -0.210969 |
Research quality | 0.791632 | 0.661894 | -0.344004 | 1.000000 | 0.753589 | 0.398224 | 0.120818 | 0.643021 | -0.543658 | 0.659321 | 0.430653 | 0.578435 |
Research intensity | 0.786027 | 0.662392 | -0.244334 | 0.753589 | 1.000000 | 0.320625 | 0.255144 | 0.667069 | -0.526707 | 0.663537 | 0.470625 | 0.598534 |
Academic services spend | 0.464568 | 0.375319 | -0.182484 | 0.398224 | 0.320625 | 1.000000 | 0.028073 | 0.273375 | -0.464846 | 0.286751 | 0.199192 | 0.484833 |
Facilities spend | 0.309395 | 0.044269 | 0.061939 | 0.120818 | 0.255144 | 0.028073 | 1.000000 | 0.202390 | -0.129627 | 0.173368 | 0.276751 | 0.050622 |
Degree completion | 0.860164 | 0.687244 | -0.103081 | 0.643021 | 0.667069 | 0.273375 | 0.202390 | 1.000000 | -0.625222 | 0.708796 | 0.624710 | 0.592201 |
Student -staff ratio | -0.739297 | -0.517294 | 0.076510 | -0.543658 | -0.526707 | -0.464846 | -0.129627 | -0.625222 | 1.000000 | -0.446725 | -0.359070 | -0.561956 |
Graduate prospects – outcomes | 0.778188 | 0.699312 | -0.225550 | 0.659321 | 0.663537 | 0.286751 | 0.173368 | 0.708796 | -0.446725 | 1.000000 | 0.899597 | 0.638724 |
Graduate prospects – on track | 0.637087 | 0.539226 | -0.150424 | 0.430653 | 0.470625 | 0.199192 | 0.276751 | 0.624710 | -0.359070 | 0.899597 | 1.000000 | 0.506158 |
Amount | 0.692222 | 0.634056 | -0.210969 | 0.578435 | 0.598534 | 0.484833 | 0.050622 | 0.592201 | -0.561956 | 0.638724 | 0.506158 | 1.000000 |
sns.heatmap(model_data.corr(), cmap="viridis")
<AxesSubplot:>
model_data.describe().transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Overall score | 113.0 | 6.209469e+02 | 1.405445e+02 | 304.00 | 521.00 | 599.00 | 715.00 | 1.000000e+03 |
Entry standards | 113.0 | 1.343274e+02 | 2.679800e+01 | 95.00 | 115.00 | 126.00 | 149.00 | 2.080000e+02 |
Student satisfaction | 113.0 | 4.026168e+00 | 9.300077e-02 | 3.72 | 3.97 | 4.03 | 4.07 | 4.300000e+00 |
Research quality | 113.0 | 2.730442e+00 | 3.941387e-01 | 1.40 | 2.51 | 2.74 | 3.06 | 3.360000e+00 |
Research intensity | 113.0 | 4.950442e-01 | 2.791445e-01 | 0.08 | 0.24 | 0.39 | 0.78 | 9.500000e-01 |
Academic services spend | 113.0 | 1.749690e+03 | 4.289448e+02 | 884.00 | 1448.00 | 1729.00 | 2011.00 | 2.982000e+03 |
Facilities spend | 113.0 | 6.933805e+02 | 2.781958e+02 | 189.00 | 505.00 | 683.00 | 808.00 | 1.939000e+03 |
Degree completion | 113.0 | 8.556549e+01 | 7.217548e+00 | 53.10 | 81.00 | 85.00 | 91.30 | 9.910000e+01 |
Student -staff ratio | 113.0 | 1.592124e+01 | 2.752286e+00 | 10.10 | 14.20 | 15.50 | 17.30 | 2.350000e+01 |
Graduate prospects – outcomes | 113.0 | 7.201770e+01 | 9.008361e+00 | 53.10 | 66.00 | 71.00 | 79.00 | 9.510000e+01 |
Graduate prospects – on track | 113.0 | 7.669115e+01 | 5.157580e+00 | 60.50 | 73.60 | 77.00 | 80.00 | 9.000000e+01 |
Amount | 113.0 | 2.995477e+07 | 5.192786e+07 | 26623.80 | 2397769.16 | 9099444.15 | 34000459.19 | 2.734851e+08 |
sns.jointplot(x=model_data["Student satisfaction"], y=model_data["Amount"])
<seaborn.axisgrid.JointGrid at 0x1e1cdb42c70>
sns.jointplot(x=model_data["Facilities spend"], y=model_data["Amount"])
<seaborn.axisgrid.JointGrid at 0x1e1cdd45430>
sns.pairplot(model_data)
<seaborn.axisgrid.PairGrid at 0x1e1cddadd30>
sns.histplot(np.log(model_data["Amount"]))
<AxesSubplot:xlabel='Amount', ylabel='Count'>
sns.scatterplot(x=model_data["Overall score"]/1000, y=np.log(model_data["Amount"]), color="b", alpha=0.5)
<AxesSubplot:xlabel='Overall score', ylabel='Amount'>
N.B. With 2D data as in the first model a simple linear regression could be used, but ElasticNet was used for consistency.
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error, r2_score
X_train_lin, X_test_lin, y_train_lin, y_test_lin = train_test_split(model_data["Overall score"], np.log(model_data["Amount"]), test_size=0.2, random_state=0)
lin_EN = ElasticNetCV(cv=5, random_state=0, n_alphas=50)
lin_EN.fit(X_train_lin.values.reshape(-1, 1), y_train_lin)
print("R2 test score =", lin_EN.score(X_test_lin.values.reshape(-1, 1), y_test_lin))
print("MSE test value = ", mean_squared_error(lin_EN.predict(X_test_lin.values.reshape(-1, 1)), y_test_lin))
print("MSE final train value =", np.mean(lin_EN.mse_path_, axis=1)[-1])
R2 test score = 0.6699829208555639 MSE test value = 1.130851928183498 MSE final train value = 1.7579898886242507
sns.lineplot(x=np.arange(0,len(lin_EN.mse_path_)), y=(np.mean(lin_EN.mse_path_, axis=1)))
plt.xlabel("regularization iteration")
plt.ylabel("MSE")
plt.title("MSE vs iteration")
Text(0.5, 1.0, 'MSE vs iteration')
sns.scatterplot(x=model_data["Overall score"], y=np.log(model_data["Amount"]), color="b", alpha=0.5, label="raw data")
sns.scatterplot(x=X_test_lin, y= lin_EN.predict(X_test_lin.values.reshape(-1, 1)), color="r", label="predicted values")
sns.scatterplot(x=X_test_lin, y=y_test_lin, color="k", label="test values")
plt.xlabel("Overall score")
plt.ylabel("Predicted Amount values / log(£)")
plt.title("Pred vs Test")
Text(0.5, 1.0, 'Pred vs Test')
X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(model_data.drop(["Amount"], axis=1), np.log(model_data["Amount"]), test_size=0.2, random_state=0)
l1_ratio=[.01, .1, .5, .7, .9, .95, .99, 1]
lin_ENfeat = ElasticNetCV(cv=5, random_state=0, n_alphas=1000, l1_ratio=l1_ratio, eps=1e-3)
lin_ENfeat.fit(X_train_feat, y_train_feat)
print("R2 =", lin_ENfeat.score(X_test_feat, y_test_feat))
print("MSE test score =", mean_squared_error(y_test_feat, lin_ENfeat.predict(X_test_feat)))
print("MSE train score =", np.mean(np.mean(lin_ENfeat.mse_path_, axis=1), axis=1))
R2 = 0.5890819066709618 MSE test score = 1.4080711197472693 MSE train score = [2.23830911 2.15193097 2.13616893 2.13488121 2.13415446 2.13401964 2.13392145 2.13389809]
for l1_num, l1_results in enumerate(lin_ENfeat.mse_path_):
sns.lineplot(x=np.arange(0,len(l1_results)), y=(np.mean(l1_results, axis=1)), label=l1_ratio[l1_num])
plt.xlabel("regularization iteration")
plt.ylabel("MSE")
plt.title("MSE vs iteration for different l1 ratios")
Text(0.5, 1.0, 'MSE vs iteration for different l1 ratios')
plt.plot(np.arange(13,19,0.01), np.arange(13,19,0.01))
sns.scatterplot(x=y_test_feat, y=lin_ENfeat.predict(X_test_feat))
plt.xlabel("Actual Amount values / log(£)")
plt.ylabel("Predicted Amount values / log(£)")
plt.title("Pred vs Test")
Text(0.5, 1.0, 'Pred vs Test')
print("r2 and MSE for log(Amount) models")
print("MSE 1 feature :", mean_squared_error(y_test_lin, lin_EN.predict(X_test_lin.values.reshape(-1, 1))))
print("R2 1 feature", r2_score(y_test_lin, lin_EN.predict(X_test_lin.values.reshape(-1, 1))))
print("MSE all features :", mean_squared_error(y_test_feat, lin_ENfeat.predict(X_test_feat)))
print("R2 all features", r2_score(y_test_feat, lin_ENfeat.predict(X_test_feat)))
r2 and MSE for log(Amount) models MSE 1 feature : 1.130851928183498 R2 1 feature 0.6699829208555639 MSE all features : 1.4080711197472693 R2 all features 0.5890819066709618
print("r2 and MSE for exp(log(Amount)) models")
print("MSE 1 feature :", format(mean_squared_error(np.exp(y_test_lin), np.exp(lin_EN.predict(X_test_lin.values.reshape(-1, 1)))), '.1E'))
print("R2 1 feature", r2_score(np.exp(y_test_lin), np.exp(lin_EN.predict(X_test_lin.values.reshape(-1, 1)))))
print("MSE all features :", format(mean_squared_error(np.exp(y_test_feat), np.exp(lin_ENfeat.predict(X_test_feat))), '.1E'))
print("R2 all features", r2_score(np.exp(y_test_feat), np.exp(lin_ENfeat.predict(X_test_feat))))
r2 and MSE for exp(log(Amount)) models MSE 1 feature : 3.7E+15 R2 1 feature 0.022296716973354225 MSE all features : 4.7E+15 R2 all features -0.23935269846859497
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
X_train, X_test, y_train, y_test = train_test_split(model_data.drop(["Amount"], axis=1), model_data["Amount"], test_size=0.2, random_state=40)
parameters = {'svr__kernel':(['rbf']), 'svr__gamma': (['scale']), 'svr__C':[0.1, 1, 10, 100, 1000], 'svr__epsilon':[100, 10, 1, 0.1, 0.01]}
#change gamma to scale
svr = SVR()
scaler = StandardScaler()
pipe = Pipeline([("scaler", scaler), ("svr", svr)])
search = GridSearchCV(pipe, parameters, cv=2)
search.fit(X_train, y_train)
print(search.best_params_)
print("r2", r2_score(y_test, search.predict(X_test)))
print("MSE", format(mean_squared_error(y_test, search.predict(X_test)), '.1E'))
{'svr__C': 1000, 'svr__epsilon': 100, 'svr__gamma': 'scale', 'svr__kernel': 'rbf'} r2 -0.2829307124286562 MSE 2.3E+15
sns.scatterplot(x=model_data["Overall score"], y=(model_data["Amount"]), color="b", alpha=0.5, label="raw data")
sns.scatterplot(x=X_test["Overall score"], y= search.predict(X_test), color="r", label="predicted values")
sns.scatterplot(x=X_test["Overall score"], y=y_test, color="k", label="test values")
plt.xlabel("Overall score")
plt.ylabel("Predicted Amount values / log(£)")
plt.title("Pred vs Test")
Text(0.5, 1.0, 'Pred vs Test')
#plt.plot(np.arange(0, 1e7, 1e4), np.arange(0, 1e7, 1e4))
sns.scatterplot(x=y_test, y=search.predict(X_test))
plt.xlabel("Actual Amount values / £")
plt.ylabel("Predicted Amount values / £")
plt.title("Pred vs Test")
Text(0.5, 1.0, 'Pred vs Test')
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
X_train, X_test, y_train, y_test = train_test_split(model_data.drop(["Amount"], axis=1), model_data["Amount"], test_size=0.2, random_state=40)
forest_regr = RandomForestRegressor(n_estimators=25,
max_depth=2,
min_samples_split=2)
forest_regr.fit(X_train, y_train)
print("r2 test", r2_score(forest_regr.predict(X_test), y_test))
print("r2 train", r2_score(forest_regr.predict(X_train), y_train))
print("MSE test", mean_squared_error(forest_regr.predict(X_test), y_test))
print("MSE train", mean_squared_error(forest_regr.predict(X_train), y_train))
r2 test 0.7201305434966744 r2 train 0.7294809841686116 MSE test 597675257666315.1 MSE train 475993331595102.3
sns.scatterplot(x=y_test, y=forest_regr.predict(X_test), color="b", alpha=0.5, label="predicted vs actual")
plt.plot(np.arange(0,1.6e8, 1e5), np.arange(0,1.6e8, 1e5))
plt.xlabel("Test Amount values / £")
plt.ylabel("Predicted Amount values / £")
plt.title("Pred vs Test")
plt.legend()
<matplotlib.legend.Legend at 0x1e1d6b6b8e0>
It seems to perform very well, easily beating out the linear regression models, and it seems unlikely to be overfitting given the train and test R2 scores are the same, however some parameter tuning is still necessary.
parameters = {"n_estimators": [10, 25, 50, 75, 100, 125], "max_depth": [1,2,3,4,5,6,7,8,9], "min_samples_split": [2,3,4,5], "min_samples_leaf": [1,2,3,4,5,6]}
regr = RandomForestRegressor()
grid_regr = GridSearchCV(regr, parameters, cv=5)
grid_regr.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=RandomForestRegressor(), param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'min_samples_leaf': [1, 2, 3, 4, 5, 6], 'min_samples_split': [2, 3, 4, 5], 'n_estimators': [10, 25, 50, 75, 100, 125]})
print(grid_regr.best_params_)
print("r2 test :", grid_regr.score(X_test, y_test))
print("r2 train :", grid_regr.score(X_train, y_train))
{'max_depth': 8, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 10} 0.5113925206532943 0.9324792016795899
forest_regr = RandomForestRegressor(n_estimators=10,
max_depth=8,
min_samples_split=3,
min_samples_leaf=1)
forest_regr.fit(X_train, y_train)
print("r2 test :", r2_score(forest_regr.predict(X_test), y_test))
print("r2 train :", r2_score(forest_regr.predict(X_train), y_train))
print("MSE test :", mean_squared_error(forest_regr.predict(X_test), y_test))
print("MSE train :", mean_squared_error(forest_regr.predict(X_train), y_train))
r2 test : 0.7394697638101138 r2 train : 0.8893858787103363 MSE test : 945056703206409.2 MSE train : 276875372373278.2
sns.scatterplot(x=y_test, y=forest_regr.predict(X_test), color="b", alpha=0.5, label="predicted vs actual")
plt.plot(np.arange(0,1.6e8, 1e5), np.arange(0,1.6e8, 1e5))
plt.xlabel("Test Amount values / £")
plt.ylabel("Predicted Amount values / £")
plt.title("Pred vs Test")
plt.legend()
<matplotlib.legend.Legend at 0x1e1d7082a30>
Linear regression using the log("Amount") vales provides a quick and simple approximation, with a reasonable $R^2=0.67$.
Including all the features did not improve the model's results
When trying to predeict the actual values by taking the exponent on the results (exp(log(predicted amount)) these results are worsened significantly.
Due to the small dataset size these results are very dependent on the train/test splitting. A single entry at the top end of the "Amount" feature can significantly change the results leading to a different conclusion.
Support Vector Regression was used with a rbf kernel to try and use the raw "Amount" values. A working model could not be found through hyperparameter fitting, likely due to the inherent noise in the data.
Finally a Random Forest Regression model was optimized using a grid search. This proved to be the best model out of those discussed here, with $R^2=0.73$, better than the single feature regression on log("Amount"), and with a significantly smaller MSE ($3.7e^{14}$ vs $9.4e^{14}$).
Again, despite using cross validation, the small dataset size and range of expected results (spanning > 4 orders of magnitude) creates quite sensitive models. Running the random forest multiple times and averaging the results might provide the best solution, despite it already being an ensemble.