import os
import sys
sys.path.append("..")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
import utils
from sklearn.preprocessing import StandardScaler
plt.style.use("ggplot")
Exploring distributions for single parameters (as in notebook 2) is a reasonable starting point for quality assurance, but a more general approach is to look for "outliers" at the water sample level i.e. samples that are of questionable quality because one or more parameter values are unusual. If the suite of water quality parameters analysed is consistent (i.e. no data gaps), then each sample can be considered as a point in $n$ dimensional space, where $n$ is the number of parameters measured. Rather than looking for "outliers" along a single dimension (as in the distribution plots considered already), we can instead look for "outliers" in this higher dimensional space. A variety of algorithms are available to do this, some of which are explored below.
"Outlier" and "novelty" detection are two different kinds of anomaly detection i.e. where we are interested in detecting abnormal or unusual observations. The difference between the two is important, but often overlooked:
Outlier detection: We have a single dataset that is believed to contain "outliers", which are observations that are "far" from the others (for some chosen definition of "far"). Outlier detection estimators thus try to fit the regions where the data is most concentrated, ignoring the deviant observations.
Novelty detection: We have access to a reference dataset that is not polluted by outliers, and we are interested in detecting whether a new observation (from a second dataset) is an outlier - a "novelty" - or not
In the context of this project, we are primarily interested in novelty detection, because we have a reference historic dataset extracted from Vannmiljø and we would like to gauge whether observations in the "new" dataset are sufficiently unusual/unlikely to warrant further investigation and reanalysis. However, the summary in the previous notebook identified some possible issues in the Vannmiljø data, as well as in the "new" data. I am therefore reluctant to use the historic Vannmiljø dataset as a "reference" without additional cleaning. Instead, to begin with, at least, I will combine the "new" and "historic" results into a single dataset and then perform outlier detection (not novelty detection). This can be revised later if desired.
# Choose dataset to process
lab = "Eurofins"
year = 2022
qtr = 3
version = 1
fold_path = f"../../output/{lab.lower()}_{year}_q{qtr}_v{version}"
# Read from SQLite
stn_df, df = utils.read_data_from_sqlite(lab, year, qtr, version)
# # Subset data to just the quarter of interest
# months_dict = {
# "q1": [1, 2, 3],
# "q2": [4, 5, 6],
# "q3": [7, 8, 9],
# "q4": [10, 11, 12],
# }
# months = months_dict[qtr]
# df = df[df["sample_date"].dt.month.isin(months)]
df.head()
vannmiljo_code | sample_date | lab | period | depth1 | depth2 | ALK_mmol/l | ANC_µekv/l | CA_mg/l | CL_mg/l | ... | N-NO3_µg/l N | N-TOT_µg/l N | NA_mg/l | P-TOT_µg/l P | PH_<ubenevnt> | RAL_µg/l Al | SIO2_µg/l Si | SO4_mg/l | TEMP_°C | TOC_mg/l C | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 002-105961 | 2022-07-06 | Eurofins | new | 0.0 | 0.0 | 0.03 | 95.0 | 2.1 | 1.5 | ... | 5.0 | 260.0 | 0.47 | 17.0 | 5.5 | 67.0 | 701.314913 | 0.40 | NaN | 19.0 |
1 | 002-105961 | 2022-07-19 | Eurofins | new | 0.0 | 0.0 | NaN | NaN | 2.5 | NaN | ... | NaN | NaN | NaN | NaN | 5.9 | NaN | NaN | NaN | 15.0 | NaN |
2 | 002-105961 | 2022-08-03 | Eurofins | new | 0.0 | 0.0 | 0.06 | 130.0 | 2.7 | 1.4 | ... | 5.0 | 250.0 | 0.35 | 17.0 | 6.2 | 48.0 | 607.806258 | 0.27 | 15.0 | 16.0 |
3 | 002-105961 | 2022-08-12 | Eurofins | new | 0.0 | 0.0 | 0.04 | 110.0 | 2.3 | 1.5 | ... | 5.0 | 330.0 | 0.48 | 17.0 | 5.6 | 61.0 | 654.560586 | 0.32 | NaN | 21.0 |
4 | 002-105961 | 2022-08-30 | Eurofins | new | 0.0 | 0.0 | NaN | NaN | 2.6 | NaN | ... | NaN | NaN | NaN | NaN | 5.7 | NaN | NaN | NaN | 13.0 | NaN |
5 rows × 25 columns
In order to perform outlier detection in a multi-dimensional space, it is necessary that all water samples have a complete set of values for all parameters. This is because outlier detection algorithms work by calculating distance metrics between samples, and this is not possible if a sample can't be located along one or more of the dimension axes due to missing values. It is therefore necessary to choose a set of parameters where the data are complete.
The code below calculates the percentage of the time that each lab measures each parameter, where the percentage is calculated as:
100 * number_of_samples_for_par_X_from_lab_Y / total_number_of_samples_from_lab_Y
# Percentage of total samples analysed per parameter, split by lab
pct_df = df.groupby("lab").agg("count")
tot_samps = pct_df["period"].copy()
for col in pct_df.columns:
pct_df[col] = 100 * pct_df[col] / tot_samps
pct_df = pct_df.iloc[:, 6:]
pct_df
ANC_µekv/l | CA_mg/l | CL_mg/l | ILAL_µg/l Al | KOND_mS/m | K_mg/l | LAL_µg/l Al | MG_mg/l | N-NO3_µg/l N | N-TOT_µg/l N | NA_mg/l | P-TOT_µg/l P | PH_<ubenevnt> | RAL_µg/l Al | SIO2_µg/l Si | SO4_mg/l | TEMP_°C | TOC_mg/l C | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
lab | ||||||||||||||||||
Eurofins | 29.250457 | 100.000000 | 29.250457 | 57.678245 | 57.678245 | 29.250457 | 52.376600 | 29.250457 | 29.250457 | 29.250457 | 29.250457 | 29.250457 | 99.908592 | 57.678245 | 29.250457 | 29.250457 | 20.566728 | 57.678245 |
Eurofins (historic) | 27.477785 | 73.957621 | 26.930964 | 56.322625 | 56.664388 | 34.381408 | 0.000000 | 25.563910 | 27.136022 | 25.085441 | 25.153794 | 25.017088 | 93.438141 | 56.527683 | 0.000000 | 27.546138 | 27.614491 | 56.459330 |
NIVA (historic) | 15.176435 | 97.585227 | 15.759569 | 36.326256 | 97.547847 | 15.752093 | 0.052333 | 15.774522 | 15.737141 | 15.774522 | 15.774522 | 15.759569 | 97.742225 | 36.326256 | 11.819677 | 15.759569 | 5.210825 | 15.774522 |
VestfoldLAB (historic) | 17.496321 | 99.873142 | 17.821079 | 28.812097 | 17.816004 | 17.816004 | 28.355407 | 17.816004 | 3.004009 | 17.460801 | 17.816004 | 16.207439 | 99.873142 | 36.763587 | 0.000000 | 17.810930 | 18.176283 | 22.702593 |
# Plot
ax = pct_df.T.plot.bar(figsize=(15, 5))
ax.set_ylabel("Percent of samples")
ax.set_title("Percentage of total samples analysed per parameter, split by lab")
Text(0.5, 1.0, 'Percentage of total samples analysed per parameter, split by lab')
Isolation Forests are a type of random forest well suited to outlier detection. They have the advantage of making few assumptions about the underlying data distribution, which is useful in situations where - as here - most/all of the variables are strongly skewed.
Isolation forests have a contamination
parameter, which can be broadly interpreted as the "expected proportion of outliers in the dataset". In other words, setting contamination=0.01
roughly translates to finding the most unusual 1% of data values. Without a strong theoretical basis for setting the contamination
parameter, it must be found either by manual tuning or be fixed based on practical considerations (e.g. how many water samples can we realistically afford to reanalyse).
CA
and PH
only¶The code below applies the isolation forest algorithm to CA
and PH
. Since these are almost always measured, this includes virtually all water samples (both new and historic) in the dataset.
# Columns of interest
key_cols = ["vannmiljo_code", "sample_date", "lab", "period", "depth1", "depth2"]
par_cols = ["CA_mg/l", "PH_<ubenevnt>"]
# Run algorithm
data = df[key_cols + par_cols].dropna()
data = utils.isolation_forest(data, par_cols, contamination=0.01)
# Summarise results
all_out = data.query("pred == 'outlier'")
his_out = data.query("(pred == 'outlier') and (period == 'historic')")
new_out = data.query("(pred == 'outlier') and (period == 'new')")
csv_path = os.path.join(fold_path, "isoforest_ca_ph.csv")
new_out.to_csv(csv_path, index=False)
print(f"The total number of samples in the dataset is: {len(data)}.\n")
print(
f"The total number of outliers detected is {len(all_out)}:\n"
f" {len(his_out)} in the 'historic' period\n"
f" {len(new_out)} in the 'new' period\n"
)
new_out
/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names warnings.warn(
The total number of samples in the dataset is: 34852. The total number of outliers detected is 349: 320 in the 'historic' period 29 in the 'new' period
vannmiljo_code | sample_date | lab | period | depth1 | depth2 | CA_mg/l | PH_<ubenevnt> | pred | |
---|---|---|---|---|---|---|---|---|---|
3930 | 021-28450 | 2022-07-08 | Eurofins | new | 0.0 | 0.0 | 7.7 | 6.7 | outlier |
3931 | 021-28450 | 2022-07-28 | Eurofins | new | 0.0 | 0.0 | 7.6 | 6.8 | outlier |
3933 | 021-28450 | 2022-08-17 | Eurofins | new | 0.0 | 0.0 | 12.0 | 6.8 | outlier |
3935 | 021-28450 | 2022-09-09 | Eurofins | new | 0.0 | 0.0 | 8.6 | 6.6 | outlier |
4871 | 021-46373 | 2022-07-05 | Eurofins | new | 0.0 | 0.0 | 6.1 | 7.2 | outlier |
4872 | 021-46373 | 2022-07-19 | Eurofins | new | 0.0 | 0.0 | 7.9 | 7.4 | outlier |
4873 | 021-46373 | 2022-08-11 | Eurofins | new | 0.0 | 0.0 | 7.4 | 7.4 | outlier |
5752 | 022-32018 | 2022-09-23 | Eurofins | new | 0.0 | 0.0 | 9.6 | 7.5 | outlier |
5847 | 022-32019 | 2022-08-01 | Eurofins | new | 0.0 | 0.0 | 5.4 | 7.0 | outlier |
5849 | 022-32019 | 2022-09-23 | Eurofins | new | 0.0 | 0.0 | 7.4 | 6.9 | outlier |
5940 | 022-32020 | 2022-07-04 | Eurofins | new | 0.0 | 0.0 | 7.4 | 6.8 | outlier |
5941 | 022-32020 | 2022-07-22 | Eurofins | new | 0.0 | 0.0 | 6.0 | 7.0 | outlier |
5942 | 022-32020 | 2022-07-26 | Eurofins | new | 0.0 | 0.0 | 6.2 | 6.8 | outlier |
5943 | 022-32020 | 2022-08-01 | Eurofins | new | 0.0 | 0.0 | 6.5 | 6.7 | outlier |
6243 | 022-45769 | 2022-07-19 | Eurofins | new | 0.0 | 0.0 | 6.5 | 6.6 | outlier |
6245 | 022-45769 | 2022-08-17 | Eurofins | new | 0.0 | 0.0 | 5.9 | 6.5 | outlier |
6246 | 022-45769 | 2022-08-29 | Eurofins | new | 0.0 | 0.0 | 5.7 | 6.5 | outlier |
6762 | 022-58904 | 2022-07-04 | Eurofins | new | 0.0 | 0.0 | 6.0 | 6.7 | outlier |
6763 | 022-58904 | 2022-07-22 | Eurofins | new | 0.0 | 0.0 | 5.7 | 6.8 | outlier |
6764 | 022-58904 | 2022-07-26 | Eurofins | new | 0.0 | 0.0 | 5.4 | 6.8 | outlier |
6765 | 022-58904 | 2022-08-01 | Eurofins | new | 0.0 | 0.0 | 5.6 | 6.8 | outlier |
6770 | 022-58904 | 2022-09-29 | Eurofins | new | 0.0 | 0.0 | 5.8 | 6.8 | outlier |
14226 | 027-79278 | 2022-07-05 | Eurofins | new | 0.0 | 0.0 | 4.8 | 7.1 | outlier |
14227 | 027-79278 | 2022-08-02 | Eurofins | new | 0.0 | 0.0 | 4.9 | 7.2 | outlier |
17196 | 030-58838 | 2022-08-16 | Eurofins | new | 0.0 | 0.0 | 7.3 | 6.2 | outlier |
17197 | 030-58838 | 2022-08-30 | Eurofins | new | 0.0 | 0.0 | 5.6 | 6.4 | outlier |
19593 | 036-58751 | 2022-09-07 | Eurofins | new | 0.0 | 0.0 | 11.0 | 7.9 | outlier |
31621 | 079-58878 | 2022-09-13 | Eurofins | new | 0.0 | 0.0 | 6.6 | 6.2 | outlier |
33482 | 082-58870 | 2022-09-06 | Eurofins | new | 0.0 | 0.0 | 10.0 | 7.3 | outlier |
This initial approach identifies the strangest 1% of the dataset overall.
Most of the unusual values (300 out of 326) are actually in the historic dataset (i.e. they are already in Vannmiljø).
The 26 samples in the 'new' dataset that have been classified as outliers are predominantly those with unusually high concentrations of CA
(greater than around 5 mg/l; see plots below).
# Plot all samples
fig, ax = plt.subplots(figsize=(6, 6))
sn.scatterplot(
data=data,
x="PH_<ubenevnt>",
y="CA_mg/l",
hue="pred",
ax=ax,
hue_order=["outlier", "inlier"],
)
_ = ax.set_title("All data")
# Plot just the 'new' samples
data_new = data.query("period == 'new'")
fig, ax = plt.subplots(figsize=(6, 6))
sn.scatterplot(
data=data_new,
x="PH_<ubenevnt>",
y="CA_mg/l",
hue="pred",
ax=ax,
hue_order=["outlier", "inlier"],
)
_ = ax.set_title("'New' data only")
plt.tight_layout()
png_path = os.path.join(fold_path, "isoforest_ca_ph_plot.png")
plt.savefig(png_path, dpi=200)
CA
, PH
, ILAL
and RAL
¶# Columns of interest
key_cols = ["vannmiljo_code", "sample_date", "lab", "period", "depth1", "depth2"]
par_cols = ["CA_mg/l", "PH_<ubenevnt>", "ILAL_µg/l Al", "RAL_µg/l Al"]
# Run algorithm
data = df[key_cols + par_cols].dropna()
data = utils.isolation_forest(data, par_cols, contamination=0.01)
# Summarise results
all_out = data.query("pred == 'outlier'")
his_out = data.query("(pred == 'outlier') and (period == 'historic')")
new_out = data.query("(pred == 'outlier') and (period == 'new')")
print(f"The total number of samples in the dataset is: {len(data)}.\n")
print(
f"The total number of outliers detected is {len(all_out)}:\n"
f" {len(his_out)} in the 'historic' period\n"
f" {len(new_out)} in the 'new' period\n"
)
new_out
/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names warnings.warn(
The total number of samples in the dataset is: 11439. The total number of outliers detected is 115: 100 in the 'historic' period 15 in the 'new' period
vannmiljo_code | sample_date | lab | period | depth1 | depth2 | CA_mg/l | PH_<ubenevnt> | ILAL_µg/l Al | RAL_µg/l Al | pred | |
---|---|---|---|---|---|---|---|---|---|---|---|
1620 | 019-101022 | 2022-09-06 | Eurofins | new | 0.0 | 0.0 | 1.40 | 5.7 | 130.0 | 180.0 | outlier |
2205 | 019-79148 | 2022-09-06 | Eurofins | new | 0.0 | 0.0 | 1.70 | 5.8 | 150.0 | 190.0 | outlier |
4073 | 021-29246 | 2022-09-30 | Eurofins | new | 0.0 | 0.0 | 1.20 | 5.1 | 140.0 | 240.0 | outlier |
5752 | 022-32018 | 2022-09-23 | Eurofins | new | 0.0 | 0.0 | 9.60 | 7.5 | 9.7 | 11.0 | outlier |
5846 | 022-32019 | 2022-07-22 | Eurofins | new | 0.0 | 0.0 | 4.60 | 6.9 | 5.1 | 5.9 | outlier |
5847 | 022-32019 | 2022-08-01 | Eurofins | new | 0.0 | 0.0 | 5.40 | 7.0 | 5.4 | 9.2 | outlier |
5849 | 022-32019 | 2022-09-23 | Eurofins | new | 0.0 | 0.0 | 7.40 | 6.9 | 17.0 | 21.0 | outlier |
5941 | 022-32020 | 2022-07-22 | Eurofins | new | 0.0 | 0.0 | 6.00 | 7.0 | 6.2 | 6.7 | outlier |
6763 | 022-58904 | 2022-07-22 | Eurofins | new | 0.0 | 0.0 | 5.70 | 6.8 | 5.0 | 5.0 | outlier |
8335 | 024-58894 | 2022-09-07 | Eurofins | new | 0.0 | 0.0 | 4.20 | 6.9 | 5.0 | 5.0 | outlier |
14226 | 027-79278 | 2022-07-05 | Eurofins | new | 0.0 | 0.0 | 4.80 | 7.1 | 5.4 | 7.3 | outlier |
14227 | 027-79278 | 2022-08-02 | Eurofins | new | 0.0 | 0.0 | 4.90 | 7.2 | 5.0 | 6.9 | outlier |
14228 | 027-79278 | 2022-09-06 | Eurofins | new | 0.0 | 0.0 | 4.60 | 7.0 | 5.0 | 5.0 | outlier |
19873 | 036-58752 | 2022-08-23 | Eurofins | new | 0.0 | 0.0 | 0.64 | 5.3 | 140.0 | 170.0 | outlier |
34645 | 082-58874 | 2022-07-05 | Eurofins | new | 0.0 | 0.0 | 0.54 | 5.6 | 130.0 | 170.0 | outlier |
LAL
and TEMP
¶# Columns of interest
key_cols = ["vannmiljo_code", "sample_date", "lab", "period", "depth1", "depth2"]
excl_cols = ["TEMP_°C", "LAL_µg/l Al"]
par_cols = [col for col in df.columns if col not in (key_cols + excl_cols)]
# Run algorithm
data = df[key_cols + par_cols].dropna()
data = utils.isolation_forest(data, par_cols, contamination=0.01)
# Summarise results
all_out = data.query("pred == 'outlier'")
his_out = data.query("(pred == 'outlier') and (period == 'historic')")
new_out = data.query("(pred == 'outlier') and (period == 'new')")
print(f"The total number of samples in the dataset is: {len(data)}.\n")
print(
f"The total number of outliers detected is {len(all_out)}:\n"
f" {len(his_out)} in the 'historic' period\n"
f" {len(new_out)} in the 'new' period\n"
)
new_out
/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names warnings.warn(
The total number of samples in the dataset is: 1759. The total number of outliers detected is 18: 5 in the 'historic' period 13 in the 'new' period
vannmiljo_code | sample_date | lab | period | depth1 | depth2 | ALK_mmol/l | ANC_µekv/l | CA_mg/l | CL_mg/l | ... | N-NO3_µg/l N | N-TOT_µg/l N | NA_mg/l | P-TOT_µg/l P | PH_<ubenevnt> | RAL_µg/l Al | SIO2_µg/l Si | SO4_mg/l | TOC_mg/l C | pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3930 | 021-28450 | 2022-07-08 | Eurofins | new | 0.0 | 0.0 | 0.15 | 79.0 | 7.7 | 35.0 | ... | 270.0 | 530.0 | 16.0 | 14.0 | 6.7 | 11.0 | 1309.121172 | 4.66 | 4.6 | outlier |
3932 | 021-28450 | 2022-08-09 | Eurofins | new | 0.0 | 0.0 | 0.09 | 69.0 | 5.0 | 26.0 | ... | 270.0 | 3600.0 | 13.0 | 22.0 | 6.6 | 31.0 | 1215.612517 | 3.90 | 7.0 | outlier |
3935 | 021-28450 | 2022-09-09 | Eurofins | new | 0.0 | 0.0 | 0.16 | 110.0 | 8.6 | 33.0 | ... | 720.0 | 830.0 | 15.0 | 8.4 | 6.6 | 9.1 | 1402.629827 | 4.82 | 4.4 | outlier |
6879 | 022-97011 | 2022-07-06 | Eurofins | new | 0.0 | 0.0 | 0.25 | 270.0 | 4.2 | 6.6 | ... | 150.0 | 290.0 | 4.9 | 7.0 | 7.1 | 27.0 | 2477.979361 | 2.47 | 4.9 | outlier |
6881 | 022-97011 | 2022-08-03 | Eurofins | new | 0.0 | 0.0 | 0.25 | 270.0 | 4.1 | 7.1 | ... | 160.0 | 270.0 | 5.5 | 5.7 | 7.0 | 28.0 | 2805.259654 | 3.06 | 4.6 | outlier |
6884 | 022-97011 | 2022-09-07 | Eurofins | new | 0.0 | 0.0 | 0.24 | 230.0 | 3.9 | 7.5 | ... | 140.0 | 300.0 | 4.9 | 4.1 | 7.0 | 29.0 | 2805.259654 | 2.67 | 5.1 | outlier |
8333 | 024-58894 | 2022-07-06 | Eurofins | new | 0.0 | 0.0 | 0.15 | -0.3 | 3.8 | 13.0 | ... | 810.0 | 990.0 | 5.3 | 12.0 | 6.7 | 7.3 | 701.314913 | 3.98 | 3.0 | outlier |
8334 | 024-58894 | 2022-08-02 | Eurofins | new | 0.0 | 0.0 | 0.16 | 120.0 | 4.1 | 11.0 | ... | 1100.0 | 1200.0 | 6.6 | 6.9 | 7.0 | 8.9 | 1122.103862 | 3.93 | 2.5 | outlier |
8335 | 024-58894 | 2022-09-07 | Eurofins | new | 0.0 | 0.0 | 0.16 | 97.0 | 4.2 | 11.0 | ... | 1300.0 | 1600.0 | 6.6 | 4.8 | 6.9 | 5.0 | 1122.103862 | 4.66 | 2.1 | outlier |
14226 | 027-79278 | 2022-07-05 | Eurofins | new | 0.0 | 0.0 | 0.30 | 230.0 | 4.8 | 18.0 | ... | 530.0 | 740.0 | 9.6 | 37.0 | 7.1 | 7.3 | 1683.155792 | 3.67 | 3.6 | outlier |
14227 | 027-79278 | 2022-08-02 | Eurofins | new | 0.0 | 0.0 | 0.30 | 230.0 | 4.9 | 18.0 | ... | 810.0 | 950.0 | 10.0 | 31.0 | 7.2 | 6.9 | 2197.453395 | 3.87 | 3.3 | outlier |
14228 | 027-79278 | 2022-09-06 | Eurofins | new | 0.0 | 0.0 | 0.27 | 240.0 | 4.6 | 17.0 | ... | 590.0 | 790.0 | 9.3 | 33.0 | 7.0 | 5.0 | 1215.612517 | 3.17 | 4.1 | outlier |
17198 | 030-58838 | 2022-09-06 | Eurofins | new | 0.0 | 0.0 | 0.11 | 100.0 | 3.6 | 4.7 | ... | 1700.0 | 2000.0 | 2.6 | 110.0 | 6.2 | 9.1 | 794.823569 | 1.94 | 2.8 | outlier |
13 rows × 24 columns