%matplotlib inline
import time
from pprint import pprint
import matplotlib.pyplot as plt
import nivapy3 as nivapy
import numpy as np
import pandas as pd
import seaborn as sn
import utils
from matplotlib import colors
from sklearn import model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
sn.set(style="ticks")
plt.style.use("ggplot")
This notebook provides a basic framework for building training and test datasets from the classified drone imagery, tuning hyperparameters for a range of Scikit-Learn classifiers, and comparing the tuned models against one another.
Scikit-Learn provides a wide range of algorithms for supervised classification (see here for an overview). It should be possible to use any of Sklearn's classification (not regression) algorithms in this notebook. I've imported several suggested algorithms in the code above, but feel free to add others if you wish.
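For example, adding a new algorithm is usually just a case of importing it and instantiating it alongside the others. A minimal sketch (using ExtraTreesClassifier purely as an illustration; it isn't used elsewhere in this notebook):

# Sketch: adding another Scikit-Learn classifier to the comparison.
# Any estimator exposing the standard fit/predict interface should slot in.
from sklearn.ensemble import ExtraTreesClassifier

extra_model = ExtraTreesClassifier(n_estimators=50, n_jobs=-1, random_state=42)
# Once the training data have been built in Section 2, fitting follows the
# usual Sklearn pattern, e.g. extra_model.fit(X_train, y_train)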
Tip: If you're trying out a new algorithm, start off with a small training dataset (maybe 10,000-ish rows) and, if it works, scale things up. If you go straight for millions of rows, you'll likely get lots of crashes (plus 5 to 10 minute delays as the cluster recovers before you can restart).
Note: To run this notebook in its present form, you will need to sign in to the JupyterHub on one of the larger machines: either "High memory" (48 GB RAM; 8 CPUs) or "High memory & high CPU" (60 GB RAM; 16 CPUs). For long-running computations, please remember to check progress periodically and shut down your server when you're finished. (This doesn't mean you need to be tied to your computer; you can shut down your server pretty easily from your phone).
# Specify area(s) from which to build train/test dataset
area_list = [1, 3, 6]
# Specify area for 'final' test image (should NOT be in the list above)
pred_area = 2
# Number of samples to use for training and testing
train_size = 1e4
test_size = 1e4
# For repeatability of random numbers
seed = 42
# Scoring metric for model comparison
metric = "accuracy"
# Number of folds for CV
n_folds = 10
# Whether to equalise images using linear histogram stretch
equalise = False
assert (
pred_area not in area_list
), "'pred_area' should be an independent test i.e. not included in 'area_list'."
# Define colourbar
# Class codes for "truth" datasets
class_codes = {
-1: "Other",
0: "No data",
1: "Brown algae",
2: "Green algae",
3: "Red algae",
4: "Eelgrass",
5: "Rock",
6: "Sand",
7: "Lichen",
8: "Lichen (2)",
9: "Terrestrial vegetation",
10: "Beach/shingle",
}
# Define colours for classes. Ordered -1 to 10 (i.e. same as in 'class_codes', above)
cmap = colors.ListedColormap(
[
"lightcyan",
"black",
"tan",
"lawngreen",
"red",
"none",
"lightgrey",
"gold",
"magenta",
"none",
"mediumseagreen",
"orange",
]
)
bounds = np.arange(-1.5, 11.5)
norm = colors.BoundaryNorm(bounds, cmap.N)
We begin by reading all the data from several different areas, so that our training dataset has a mixture of pixels from different parts of the study region. The three areas chosen above together have ~19 million pixels, which is too large and slow for the kind of experimentation we want to do here. I've therefore set both train_size and test_size to small subsets (10,000 rows each in the options above; scale these up towards a couple of million rows once everything is working). This should be more manageable, while still providing a realistic impression of performance.

We also read in a fourth image, which is kept separate and used for final testing and visualisation.

You are welcome to use data from other areas and to choose different scoring metrics (see Sklearn's documentation). Also consider setting equalise=True in the options above. A linear histogram stretch is a monotonic rescaling of each band, so it should have little effect on tree-based methods (like the random forest), but it could affect some of the other classifiers.
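For reference, a linear histogram stretch simply rescales each band so that a chosen low/high percentile pair maps onto the ends of the output range. The sketch below illustrates the idea with a 2nd-98th percentile stretch on a NumPy array; it is only meant to show the concept and is not necessarily identical to what utils.image_to_sample_df does internally when equalise=True.

# Sketch of a percentile-based linear stretch, for illustration only
import numpy as np

def linear_stretch(band, low_pct=2, high_pct=98):
    """Rescale 'band' so the [low_pct, high_pct] percentile range maps to [0, 1]."""
    lo, hi = np.nanpercentile(band, [low_pct, high_pct])
    stretched = (band - lo) / (hi - lo)
    return np.clip(stretched, 0, 1)

# Example with random data standing in for a single image band
band = np.random.lognormal(size=(100, 100))
band_eq = linear_stretch(band)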
# Read train/test image data
df_list = []
for area in area_list:
    df = utils.image_to_sample_df(area, equalise=equalise)
    df_list.append(df)
df = pd.concat(df_list, axis="rows")
del df_list
# Split into training and evaluation
cols = [str(i) for i in range(1, 9)]
X_train, X_eval, y_train, y_eval = train_test_split(
df[cols],
df["y"],
train_size=int(train_size),
test_size=int(test_size),
random_state=seed,
shuffle=True,
)
assert len(X_train) == len(y_train)
assert len(X_eval) == len(y_eval)
print(f"Total number of samples in dataset: {(len(df) / 1E6):.2f} million.")
print(f"Number of samples in training dataset: {(len(X_train) / 1E6):.2f} million.")
print(f"Number of samples in evaluation dataset: {(len(X_eval) / 1E6):.2f} million.")
Total number of samples in dataset: 18.87 million.
Number of samples in training dataset: 0.01 million.
Number of samples in evaluation dataset: 0.01 million.
# Read bands for 'final' test image
pred_df = utils.image_to_sample_df(pred_area, equalise=equalise, dropna=False)
del pred_df["y"]
# Read manually classified "truth" dataset
man_path = f"/home/jovyan/shared/drones/frisk_oslofjord/raster/ne_akeroya_5cm_area_{pred_area}_man_class.tif"
man_img, ndv, man_epsg, extent = nivapy.spatial.read_raster(man_path, band_no=1)
print(f"Total number of pixels in test image: {(len(pred_df) / 1E6):.2f} million.")
Total number of pixels in test image: 13.32 million.
At the top of this notebook I've imported seven classifiers as a starting point (feel free to change them if you like): LogisticRegression, LinearDiscriminantAnalysis, RidgeClassifier, GaussianNB, KNeighborsClassifier, DecisionTreeClassifier and RandomForestClassifier.

The first of these is essentially the same as the method you implemented with Guri (which I think was multinomial logistic regression?). This is a linear approach, as are LinearDiscriminantAnalysis and the RidgeClassifier. GaussianNB is a very simple method, but it can be effective in some circumstances. The KNeighborsClassifier is also conceptually simple, but often surprisingly powerful. The last two are tree-based classifiers, where the RandomForestClassifier is just an ensemble of DecisionTreeClassifiers. These two - together with KNN - are non-parametric methods, which makes them both flexible and powerful.
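You can see the "ensemble of trees" structure directly by fitting a small RandomForestClassifier and inspecting its estimators_ attribute. A quick sketch on dummy data:

# Sketch: a fitted RandomForestClassifier exposes its individual trees
# via 'estimators_' (each element is a DecisionTreeClassifier)
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_demo = np.random.rand(100, 8)          # dummy data: 100 samples, 8 "bands"
y_demo = np.random.randint(0, 3, 100)    # dummy labels with 3 classes
rf_demo = RandomForestClassifier(n_estimators=5).fit(X_demo, y_demo)

print(len(rf_demo.estimators_))          # 5 individual trees
print(type(rf_demo.estimators_[0]))      # each one is a DecisionTreeClassifier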
As a starting point, I recommend reading the online documentation for each of these models. Most of the algorithms have a general introduction (linked from here) plus a more detailed page describing the API and hyperparameter options (e.g. here for the RandomForestClassifier). For each algorithm, aim to:

Get a rough understanding of what it does and what assumptions it makes, and
Write down which hyperparameters you think might be sensible to tune (one quick way to list them programmatically is sketched below).
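Every Sklearn estimator exposes its hyperparameters (and their current/default values) via the standard get_params() method, which makes a handy checklist when deciding what to tune. For example:

# List the tunable hyperparameters and their defaults for any Sklearn estimator
from pprint import pprint
from sklearn.ensemble import RandomForestClassifier

pprint(RandomForestClassifier().get_params())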
Then, for each of the models in turn...
Change the code below to match the model you want to work with.
# Create the base model to tune
#model = RandomForestClassifier(n_jobs=-1)
model = SVC(kernel='rbf')
Change the code below to create a "grid" of hyperparameters to search. These hyperparameters must match the options documented in the API for each algorithm (see above; the commented-out example below applies only to the RandomForestClassifier).
Be conservative with the number of combinations you choose to begin with until you get an idea of how long each model takes to fit (e.g. start with a grid of just one or two parameter combinations and work from there). Bear in mind that the combinations multiply quickly and runtimes can become large.
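Two things can help here, sketched below: the number of combinations in a list-based grid is just the product of the list lengths, and RandomizedSearchCV also accepts continuous distributions (such as scipy.stats.loguniform, assuming a reasonably recent SciPy), which lets you explore parameters like C and gamma across several orders of magnitude without the grid exploding. The values below are illustrative only.

# Sketch: (1) count the combinations in a list-based grid before launching a
# search; (2) use a continuous (log-uniform) distribution instead of a list
import numpy as np
from scipy.stats import loguniform

example_grid = {"C": [10, 100, 200], "gamma": [5e-5, 1e-4, "auto", "scale"]}
n_combos = np.prod([len(v) for v in example_grid.values()])
print(f"Grid contains {n_combos} combinations.")

# loguniform samples evenly on a log scale; the ranges here are illustrative
example_grid_continuous = {"C": loguniform(1, 1000), "gamma": loguniform(1e-5, 1e-2)}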
## Number of trees in random forest
#n_estimators = [10]
#
## Number of features to consider at every split
#max_features = ["auto"] # "sqrt"
#
## Maximum number of levels in tree
#max_depth = [2, 6, 10, 20]
#
## Create the random grid
#random_grid = {
# "max_features": max_features,
# "max_depth": max_depth,
# "n_estimators": n_estimators,
#}
C = [10, 100, 200]
gamma = [0.00005, 0.0001, 'auto', 'scale']
random_grid = {
"C": C,
"gamma": gamma,
}
pprint(random_grid)
{'C': [10, 100, 200], 'gamma': [5e-05, 0.0001, 'auto', 'scale']}
The code below trains the model multiple times, exploring the hyperparameter grid.
%%time
# Random search of parameters, using cross-validation,
model_random = RandomizedSearchCV(
estimator=model,
param_distributions=random_grid,
n_iter=50,
cv=n_folds,
verbose=2,
random_state=seed,
n_jobs=-1,
)
# Fit the random search model
model_random.fit(X_train, y_train)
Fitting 10 folds for each of 12 candidates, totalling 120 fits
/opt/conda/lib/python3.8/site-packages/sklearn/model_selection/_search.py:278: UserWarning: The total space of parameters 12 is smaller than n_iter=50. Running 12 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   27.4s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  1.8min finished

CPU times: user 2.16 s, sys: 121 ms, total: 2.28 s
Wall time: 1min 52s
RandomizedSearchCV(cv=10, estimator=SVC(), n_iter=50, n_jobs=-1, param_distributions={'C': [10, 100, 200], 'gamma': [5e-05, 0.0001, 'auto', 'scale']}, random_state=42, verbose=2)
The code below picks the "best" model from among the ones specified in your hyperparameter grid and makes predictions for the testing dataset. Make a note of the "best" hyperparameter values for each model, so you can use them again later.
The code also prints model skill statistics. It's probably a good idea to modify this to save the output for each model.
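One simple way of saving the output is sketched below: dump the full cross-validation results and the best parameters to the ../results folder (the same folder used for model_comparison.csv later on; the file names here are just suggestions) once each search has finished.

# Sketch: save the search results for later comparison. Assumes 'model_random'
# is the fitted RandomizedSearchCV object from above and '../results' is writeable
import json
import pandas as pd

model_name = "svc"  # change this for each algorithm you tune

# All tested parameter combinations, with their mean/std CV scores
pd.DataFrame(model_random.cv_results_).to_csv(
    f"../results/{model_name}_cv_results.csv", index=False
)

# The single best parameter combination found by the search
with open(f"../results/{model_name}_best_params.json", "w") as f:
    json.dump(model_random.best_params_, f, indent=4, default=str)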
%%time
# Predict classes for the evaluation data using the 'best' model found
best = model_random.best_estimator_
best_preds = best.predict(X_eval)
print("The 'best' model identified has the following parameters:\n")
print(best, "\n")
The 'best' model identified has the following parameters:

SVC(C=200, gamma=5e-05)

CPU times: user 791 ms, sys: 1.12 ms, total: 792 ms
Wall time: 791 ms
# Skill summary
# Only use relevant labels from the training dataset
class_labels = [1, 2, 3, 5, 6, 7, 9, 10]
class_names = [class_codes[i] for i in class_labels]
utils.classification_report(
y_eval,
best_preds,
class_labels,
class_names,
)
Classification report:

                        precision    recall  f1-score   support

           Brown algae       0.94      0.95      0.95      3921
           Green algae       0.54      0.35      0.42        84
             Red algae       0.82      0.20      0.33        44
                  Rock       0.86      0.94      0.90      3416
                  Sand       0.86      0.88      0.87      1234
                Lichen       0.71      0.41      0.52        58
Terrestrial vegetation       0.87      0.80      0.83       652
         Beach/shingle       0.82      0.54      0.65       591

              accuracy                           0.89     10000
             macro avg       0.80      0.63      0.68     10000
          weighted avg       0.89      0.89      0.89     10000

Classification accuracy: 0.892400
# Tidy up to save memory
del model, model_random
Section 3 should give a good idea of which hyperparameters work best for each algorithm. In this section, you can take the best model of each kind and compare them against one another.
At present, the code below uses the default hyperparameters for each model (except for the random forest and the SVC, which we've already worked with and therefore know something about). Where possible, I've set up parallel processing (by setting n_jobs=-1), but note that not all algorithms support this.
Ideally, you should set the hyperparameters for each of the models below in a similar way, based on your results above.
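If you saved best_params_ for each search (see Section 3), one convenient pattern is to unpack the saved dictionaries straight into the constructors rather than typing the values in by hand. A sketch, with the dictionaries written out explicitly here for illustration:

# Sketch: build tuned models from saved 'best_params_' dictionaries.
# 'best_params' below is written out by hand - load it from wherever you
# saved each search's results (e.g. the JSON files suggested in Section 3)
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

best_params = {
    "RF": {"max_depth": 20, "n_estimators": 10},
    "SVC": {"C": 200, "gamma": 5e-05},
}

models_alt = [
    ("RF", RandomForestClassifier(n_jobs=-1, **best_params["RF"])),
    ("SVC", SVC(**best_params["SVC"])),
]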
# Prepare models
models = []
#models.append(("LogReg", LogisticRegression(n_jobs=-1)))
#models.append(("RidReg", RidgeClassifier()))
#models.append(("LDA", LinearDiscriminantAnalysis()))
#models.append(("KNN", KNeighborsClassifier(n_jobs=-1)))
#models.append(("CART", DecisionTreeClassifier()))
models.append(("RF", RandomForestClassifier(max_depth=20, n_estimators=10, n_jobs=-1)))
#models.append(("NB", GaussianNB()))
models.append(('SVC', SVC(C=200, gamma=5e-05)))
# Containers for results
df_list = []
pred_img_dict = {"Manual": man_img}
# Evaluate each model in turn
for name, model in models:
    print(f"Fitting {model}")
    print(" Cross-validating...")
    start = time.time()
    kfold = model_selection.KFold(
        n_splits=n_folds,
        random_state=seed,
        shuffle=True,
    )
    cv_results = model_selection.cross_val_score(
        model, X_train, y_train, cv=kfold, scoring=metric, n_jobs=-1
    )
    mid = time.time()

    # Fit the model to the full training dataset and then predict 'final' test image
    print(" Refitting and predicting test image...")
    model.fit(X_train, y_train)
    pred_img = model.predict(pred_df).reshape(man_img.shape)
    pred_img[man_img == 0] = 0
    pred_img_dict[name] = pred_img
    end = time.time()

    # Build results dataframe
    res_df = pd.DataFrame(
        {
            "model": name,
            "fold": range(1, n_folds + 1),
            metric: cv_results,
            "train_time_per_fold_s": (mid - start) / n_folds,
            "pred_time_s": (end - mid),
        }
    )
    df_list.append(res_df)

print("Done.")
res_df = pd.concat(df_list, axis="rows")
avg_df = (
res_df.groupby("model")
.mean()
.sort_values(metric, ascending=False)[
["accuracy", "train_time_per_fold_s", "pred_time_s"]
]
)
avg_df
Fitting RandomForestClassifier(max_depth=20, n_estimators=10, n_jobs=-1)
 Cross-validating...
 Refitting and predicting test image...
Fitting SVC(C=200, gamma=5e-05)
 Cross-validating...
 Refitting and predicting test image...
Done.
| model | accuracy | train_time_per_fold_s | pred_time_s |
|---|---|---|---|
| SVC | 0.8894 | 0.414523 | 1050.339782 |
| RF | 0.8644 | 0.037790 | 7.524469 |
# Boxplots of skill
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
sn.boxplot(data=res_df, x="model", y=metric, ax=ax, order=avg_df.index)
_ = ax.set_title("Algorithm skill comparison")
Note the log y-axis on the bar plot below - prediction times can differ by orders of magnitude between algorithms (KNN and the SVC are particularly slow to predict)!
# Prediction speed
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
sn.barplot(
data=avg_df,
x=avg_df.index,
y="pred_time_s",
ax=ax,
order=avg_df.sort_values("pred_time_s").index,
)
_ = ax.set_title("Algorithm prediciton speed")
ax.set_yscale('log')
# Compare images predicted by each algorithm
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 15))
axes = axes.flatten()
fig.delaxes(axes[-1])
for idx, name in enumerate(["Manual"] + list(avg_df.index)):
    img = axes[idx].imshow(
        pred_img_dict[name], cmap=cmap, norm=norm, interpolation="none"
    )
    axes[idx].set_title(name, fontsize=12)
    axes[idx].axis("off")
plt.tight_layout()
The output above compares the performance of the different models. It is also interesting to note how long it takes to fit each model (see the table above). This is important because we have a lot of data available in the full images (tens of millions of pixels) and we want to be able to use it effectively. If an algorithm is already slow when fitting the comparatively small training dataset used here, it is unlikely to scale well to the larger dataset.
We can view performance in two ways: model skill (e.g. accuracy) and computational speed. Different algorithms scale differently with the number of data samples, so it's worth exploring this further. The code below repeats the analysis above but for a range of different dataset sizes.
This will take a long time to run (several hours)!
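As a complement to the loop below, Scikit-Learn's learning_curve utility does something similar in a single call: it refits an estimator at a range of training-set sizes and returns the cross-validated scores for each size. A quick sketch using the random forest and the 10,000-row training set from Section 2 (the sizes and CV settings are illustrative; the loop below is still worth running because it also records training time explicitly):

# Sketch: cross-validated scores at several training-set sizes via learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(max_depth=20, n_estimators=10, n_jobs=-1),
    X_train,
    y_train,
    train_sizes=[0.1, 0.25, 0.5, 1.0],  # fractions of the available training data
    cv=5,
    scoring=metric,
    n_jobs=-1,
)
print(sizes)                       # absolute number of samples at each step
print(test_scores.mean(axis=1))    # mean CV score at each step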
# Number of CV folds
n_folds = 50
# Train sizes
train_sizes = [50e3, 200e3, 500e3, 1e6, 1.5e6, 2e6, 3e6]
# Container for results
df_list = []
# Prepare models
models = []
models.append(("LogReg", LogisticRegression(n_jobs=-1)))
models.append(("RidReg", RidgeClassifier()))
models.append(("LDA", LinearDiscriminantAnalysis()))
models.append(("KNN", KNeighborsClassifier(n_jobs=-1)))
models.append(("CART", DecisionTreeClassifier()))
models.append(("RF", RandomForestClassifier(max_depth=20, n_estimators=10, n_jobs=-1)))
models.append(("NB", GaussianNB()))
for train_size in train_sizes:
    print(f"Train size: {train_size}")

    # Split into training and evaluation
    cols = [str(i) for i in range(1, 9)]
    X_train, X_eval, y_train, y_eval = train_test_split(
        df[cols],
        df["y"],
        train_size=int(train_size),
        test_size=int(train_size),
        random_state=seed,
        shuffle=True,
    )

    # Evaluate each model in turn
    for name, model in models:
        print(f"  Fitting {model}")
        start = time.time()
        kfold = model_selection.KFold(
            n_splits=n_folds,
            random_state=seed,
            shuffle=True,
        )
        cv_results = model_selection.cross_val_score(
            model, X_train, y_train, cv=kfold, scoring=metric, n_jobs=-1
        )
        end = time.time()

        # Build results dataframe
        res_df = pd.DataFrame(
            {
                "model": name,
                "n_samples": len(X_train),
                "fold": range(1, n_folds + 1),
                metric: cv_results,
                "train_time_per_fold_s": (end - start) / n_folds,
            }
        )
        df_list.append(res_df)

print("Done.")
res_df = pd.concat(df_list, axis="rows")
res_df.to_csv(r"../results/model_comparison.csv", index=False)
Train size: 50000.0
  Fitting LogisticRegression(n_jobs=-1)

(Execution interrupted manually: KeyboardInterrupt raised inside model_selection.cross_val_score.)
time_df = res_df.groupby(["model", "n_samples"]).mean().reset_index()
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
sn.lineplot(
data=time_df,
x="n_samples",
y="train_time_per_fold_s",
hue="model",
marker="o",
markersize=10,
)
# Get the average scores for the largest training dataset
max_n_samples = res_df["n_samples"].max()
avg_df = res_df.query("n_samples == @max_n_samples")
avg_df = (
avg_df.groupby("model")
.mean()[[metric, "train_time_per_fold_s"]]
.sort_values(metric, ascending=False)
)
avg_df
# Boxplots over all training dataset sizes
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
sn.boxplot(data=res_df, x="model", y=metric, ax=ax, order=avg_df.index)
_ = ax.set_title("Algorithm Comparison")
I haven't done much experimenting with this myself yet, but my initial impression is that the RandomForestClassifier and the KNeighborsClassifier consistently provide good accuracy and computational performance. As the dataset size grows, I would expect the performance of the random forest to scale better than the KNN algorithm, but it will be interesting to see what you find.
Good luck! :-)