In [1]:

import os
import sys

sys.path.append("..")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
import utils
from sklearn.preprocessing import StandardScaler

plt.style.use("ggplot")

Tiltaksovervakingen: opsjon for kvalitetskontroll av analysedata¶

Notebook 3: Outlier detection for whole water samples¶

Exploring distributions for single parameters (as in notebook 2) is a reasonable starting point for quality assurance, but a more general approach is to look for "outliers" at the water sample level i.e. samples that are of questionable quality because one or more parameter values are unusual. If the suite of water quality parameters analysed is consistent (i.e. no data gaps), then each sample can be considered as a point in $n$ dimensional space, where $n$ is the number of parameters measured. Rather than looking for "outliers" along a single dimension (as in the distribution plots considered already), we can instead look for "outliers" in this higher dimensional space. A variety of algorithms are available to do this, some of which are explored below.

1. Terminology: "outlier" versus "novelty" detection¶

"Outlier" and "novelty" detection are two different kinds of anomaly detection i.e. where we are interested in detecting abnormal or unusual observations. The difference between the two is important, but often overlooked:

Outlier detection: We have a single dataset that is believed to contain "outliers", which are observations that are "far" from the others (for some chosen definition of "far"). Outlier detection estimators thus try to fit the regions where the data is most concentrated, ignoring the deviant observations.
Novelty detection: We have access to a reference dataset that is not polluted by outliers, and we are interested in detecting whether a new observation (from a second dataset) is an outlier - a "novelty" - or not

In the context of this project, we are primarily interested in novelty detection, because we have a reference historic dataset extracted from Vannmiljø and we would like to gauge whether observations in the "new" dataset are sufficiently unusual/unlikely to warrant further investigation and reanalysis. However, the summary in the previous notebook identified some possible issues in the Vannmiljø data, as well as in the "new" data. I am therefore reluctant to use the historic Vannmiljø dataset as a "reference" without additional cleaning. Instead, to begin with, at least, I will combine the "new" and "historic" results into a single dataset and then perform outlier detection (not novelty detection). This can be revised later if desired.

2. Read data¶

In [2]:

# Choose dataset to process
lab = "Eurofins"
year = 2022
qtr = 3
version = 1

In [3]:

fold_path = f"../../output/{lab.lower()}_{year}_q{qtr}_v{version}"

In [4]:

# Read from SQLite
stn_df, df = utils.read_data_from_sqlite(lab, year, qtr, version)

# # Subset data to just the quarter of interest
# months_dict = {
#     "q1": [1, 2, 3],
#     "q2": [4, 5, 6],
#     "q3": [7, 8, 9],
#     "q4": [10, 11, 12],
# }
# months = months_dict[qtr]
# df = df[df["sample_date"].dt.month.isin(months)]

df.head()

Out[4]:

	vannmiljo_code	sample_date	lab	period	ALK_mmol/l	ANC_µekv/l	CA_mg/l	CL_mg/l	...	N-NO3_µg/l N	N-TOT_µg/l N	NA_mg/l	P-TOT_µg/l P	PH_<ubenevnt>	RAL_µg/l Al	SIO2_µg/l Si	SO4_mg/l	TEMP_°C	TOC_mg/l C
0	002-105961	2022-07-06	Eurofins	new	0.03	95.0	2.1	1.5	...	5.0	260.0	0.47	17.0	5.5	67.0	701.314913	0.40	NaN	19.0
1	002-105961	2022-07-19	Eurofins	new	NaN	NaN	2.5	NaN	...	NaN	NaN	NaN	NaN	5.9	NaN	NaN	NaN	15.0	NaN
2	002-105961	2022-08-03	Eurofins	new	0.06	130.0	2.7	1.4	...	5.0	250.0	0.35	17.0	6.2	48.0	607.806258	0.27	15.0	16.0
3	002-105961	2022-08-12	Eurofins	new	0.04	110.0	2.3	1.5	...	5.0	330.0	0.48	17.0	5.6	61.0	654.560586	0.32	NaN	21.0
4	002-105961	2022-08-30	Eurofins	new	NaN	NaN	2.6	NaN	...	NaN	NaN	NaN	NaN	5.7	NaN	NaN	NaN	13.0	NaN

5 rows × 25 columns

3. Select parameters¶

In order to perform outlier detection in a multi-dimensional space, it is necessary that all water samples have a complete set of values for all parameters. This is because outlier detection algorithms work by calculating distance metrics between samples, and this is not possible if a sample can't be located along one or more of the dimension axes due to missing values. It is therefore necessary to choose a set of parameters where the data are complete.

The code below calculates the percentage of the time that each lab measures each parameter, where the percentage is calculated as:

100 * number_of_samples_for_par_X_from_lab_Y / total_number_of_samples_from_lab_Y

In [5]:

# Percentage of total samples analysed per parameter, split by lab
pct_df = df.groupby("lab").agg("count")
tot_samps = pct_df["period"].copy()

for col in pct_df.columns:
    pct_df[col] = 100 * pct_df[col] / tot_samps

pct_df = pct_df.iloc[:, 6:]

pct_df

Out[5]:

	ANC_µekv/l	CA_mg/l	CL_mg/l	ILAL_µg/l Al	KOND_mS/m	K_mg/l	LAL_µg/l Al	MG_mg/l	N-NO3_µg/l N	N-TOT_µg/l N	NA_mg/l	P-TOT_µg/l P	PH_<ubenevnt>	RAL_µg/l Al	SIO2_µg/l Si	SO4_mg/l	TEMP_°C	TOC_mg/l C
lab
Eurofins	29.250457	100.000000	29.250457	57.678245	57.678245	29.250457	52.376600	29.250457	29.250457	29.250457	29.250457	29.250457	99.908592	57.678245	29.250457	29.250457	20.566728	57.678245
Eurofins (historic)	27.477785	73.957621	26.930964	56.322625	56.664388	34.381408	0.000000	25.563910	27.136022	25.085441	25.153794	25.017088	93.438141	56.527683	0.000000	27.546138	27.614491	56.459330
NIVA (historic)	15.176435	97.585227	15.759569	36.326256	97.547847	15.752093	0.052333	15.774522	15.737141	15.774522	15.774522	15.759569	97.742225	36.326256	11.819677	15.759569	5.210825	15.774522
VestfoldLAB (historic)	17.496321	99.873142	17.821079	28.812097	17.816004	17.816004	28.355407	17.816004	3.004009	17.460801	17.816004	16.207439	99.873142	36.763587	0.000000	17.810930	18.176283	22.702593

In [6]:

# Plot
ax = pct_df.T.plot.bar(figsize=(15, 5))
ax.set_ylabel("Percent of samples")
ax.set_title("Percentage of total samples analysed per parameter, split by lab")

Out[6]:

Text(0.5, 1.0, 'Percentage of total samples analysed per parameter, split by lab')

4. Isolation Forests¶

Isolation Forests are a type of random forest well suited to outlier detection. They have the advantage of making few assumptions about the underlying data distribution, which is useful in situations where - as here - most/all of the variables are strongly skewed.

Isolation forests have a contamination parameter, which can be broadly interpreted as the "expected proportion of outliers in the dataset". In other words, setting contamination=0.01 roughly translates to finding the most unusual 1% of data values. Without a strong theoretical basis for setting the contamination parameter, it must be found either by manual tuning or be fixed based on practical considerations (e.g. how many water samples can we realistically afford to reanalyse).

4.1. `CA` and `PH` only¶

The code below applies the isolation forest algorithm to CA and PH. Since these are almost always measured, this includes virtually all water samples (both new and historic) in the dataset.

In [7]:

# Columns of interest
key_cols = ["vannmiljo_code", "sample_date", "lab", "period", "depth1", "depth2"]
par_cols = ["CA_mg/l", "PH_<ubenevnt>"]

# Run algorithm
data = df[key_cols + par_cols].dropna()
data = utils.isolation_forest(data, par_cols, contamination=0.01)

# Summarise results
all_out = data.query("pred == 'outlier'")
his_out = data.query("(pred == 'outlier') and (period == 'historic')")
new_out = data.query("(pred == 'outlier') and (period == 'new')")

csv_path = os.path.join(fold_path, "isoforest_ca_ph.csv")
new_out.to_csv(csv_path, index=False)

print(f"The total number of samples in the dataset is: {len(data)}.\n")
print(
    f"The total number of outliers detected is {len(all_out)}:\n"
    f"    {len(his_out)} in the 'historic' period\n"
    f"    {len(new_out)} in the 'new' period\n"
)

new_out

/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names
  warnings.warn(

The total number of samples in the dataset is: 34852.

The total number of outliers detected is 349:
    320 in the 'historic' period
    29 in the 'new' period

Out[7]:

	vannmiljo_code	sample_date	lab	period	CA_mg/l	PH_<ubenevnt>	pred
3930	021-28450	2022-07-08	Eurofins	new	7.7	6.7	outlier
3931	021-28450	2022-07-28	Eurofins	new	7.6	6.8	outlier
3933	021-28450	2022-08-17	Eurofins	new	12.0	6.8	outlier
3935	021-28450	2022-09-09	Eurofins	new	8.6	6.6	outlier
4871	021-46373	2022-07-05	Eurofins	new	6.1	7.2	outlier
4872	021-46373	2022-07-19	Eurofins	new	7.9	7.4	outlier
4873	021-46373	2022-08-11	Eurofins	new	7.4	7.4	outlier
5752	022-32018	2022-09-23	Eurofins	new	9.6	7.5	outlier
5847	022-32019	2022-08-01	Eurofins	new	5.4	7.0	outlier
5849	022-32019	2022-09-23	Eurofins	new	7.4	6.9	outlier
5940	022-32020	2022-07-04	Eurofins	new	7.4	6.8	outlier
5941	022-32020	2022-07-22	Eurofins	new	6.0	7.0	outlier
5942	022-32020	2022-07-26	Eurofins	new	6.2	6.8	outlier
5943	022-32020	2022-08-01	Eurofins	new	6.5	6.7	outlier
6243	022-45769	2022-07-19	Eurofins	new	6.5	6.6	outlier
6245	022-45769	2022-08-17	Eurofins	new	5.9	6.5	outlier
6246	022-45769	2022-08-29	Eurofins	new	5.7	6.5	outlier
6762	022-58904	2022-07-04	Eurofins	new	6.0	6.7	outlier
6763	022-58904	2022-07-22	Eurofins	new	5.7	6.8	outlier
6764	022-58904	2022-07-26	Eurofins	new	5.4	6.8	outlier
6765	022-58904	2022-08-01	Eurofins	new	5.6	6.8	outlier
6770	022-58904	2022-09-29	Eurofins	new	5.8	6.8	outlier
14226	027-79278	2022-07-05	Eurofins	new	4.8	7.1	outlier
14227	027-79278	2022-08-02	Eurofins	new	4.9	7.2	outlier
17196	030-58838	2022-08-16	Eurofins	new	7.3	6.2	outlier
17197	030-58838	2022-08-30	Eurofins	new	5.6	6.4	outlier
19593	036-58751	2022-09-07	Eurofins	new	11.0	7.9	outlier
31621	079-58878	2022-09-13	Eurofins	new	6.6	6.2	outlier
33482	082-58870	2022-09-06	Eurofins	new	10.0	7.3	outlier

This initial approach identifies the strangest 1% of the dataset overall.

Most of the unusual values (300 out of 326) are actually in the historic dataset (i.e. they are already in Vannmiljø).
The 26 samples in the 'new' dataset that have been classified as outliers are predominantly those with unusually high concentrations of CA (greater than around 5 mg/l; see plots below).

In [8]:

# Plot all samples
fig, ax = plt.subplots(figsize=(6, 6))
sn.scatterplot(
    data=data,
    x="PH_<ubenevnt>",
    y="CA_mg/l",
    hue="pred",
    ax=ax,
    hue_order=["outlier", "inlier"],
)
_ = ax.set_title("All data")

In [9]:

# Plot just the 'new' samples
data_new = data.query("period == 'new'")
fig, ax = plt.subplots(figsize=(6, 6))
sn.scatterplot(
    data=data_new,
    x="PH_<ubenevnt>",
    y="CA_mg/l",
    hue="pred",
    ax=ax,
    hue_order=["outlier", "inlier"],
)
_ = ax.set_title("'New' data only")
plt.tight_layout()
png_path = os.path.join(fold_path, "isoforest_ca_ph_plot.png")
plt.savefig(png_path, dpi=200)

4.2. `CA`, `PH`, `ILAL` and `RAL`¶

In [10]:

# Columns of interest
key_cols = ["vannmiljo_code", "sample_date", "lab", "period", "depth1", "depth2"]
par_cols = ["CA_mg/l", "PH_<ubenevnt>", "ILAL_µg/l Al", "RAL_µg/l Al"]

# Run algorithm
data = df[key_cols + par_cols].dropna()
data = utils.isolation_forest(data, par_cols, contamination=0.01)

# Summarise results
all_out = data.query("pred == 'outlier'")
his_out = data.query("(pred == 'outlier') and (period == 'historic')")
new_out = data.query("(pred == 'outlier') and (period == 'new')")

print(f"The total number of samples in the dataset is: {len(data)}.\n")
print(
    f"The total number of outliers detected is {len(all_out)}:\n"
    f"    {len(his_out)} in the 'historic' period\n"
    f"    {len(new_out)} in the 'new' period\n"
)

new_out

/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names
  warnings.warn(

The total number of samples in the dataset is: 11439.

The total number of outliers detected is 115:
    100 in the 'historic' period
    15 in the 'new' period

Out[10]:

	vannmiljo_code	sample_date	lab	period	CA_mg/l	PH_<ubenevnt>	ILAL_µg/l Al	RAL_µg/l Al	pred
1620	019-101022	2022-09-06	Eurofins	new	1.40	5.7	130.0	180.0	outlier
2205	019-79148	2022-09-06	Eurofins	new	1.70	5.8	150.0	190.0	outlier
4073	021-29246	2022-09-30	Eurofins	new	1.20	5.1	140.0	240.0	outlier
5752	022-32018	2022-09-23	Eurofins	new	9.60	7.5	9.7	11.0	outlier
5846	022-32019	2022-07-22	Eurofins	new	4.60	6.9	5.1	5.9	outlier
5847	022-32019	2022-08-01	Eurofins	new	5.40	7.0	5.4	9.2	outlier
5849	022-32019	2022-09-23	Eurofins	new	7.40	6.9	17.0	21.0	outlier
5941	022-32020	2022-07-22	Eurofins	new	6.00	7.0	6.2	6.7	outlier
6763	022-58904	2022-07-22	Eurofins	new	5.70	6.8	5.0	5.0	outlier
8335	024-58894	2022-09-07	Eurofins	new	4.20	6.9	5.0	5.0	outlier
14226	027-79278	2022-07-05	Eurofins	new	4.80	7.1	5.4	7.3	outlier
14227	027-79278	2022-08-02	Eurofins	new	4.90	7.2	5.0	6.9	outlier
14228	027-79278	2022-09-06	Eurofins	new	4.60	7.0	5.0	5.0	outlier
19873	036-58752	2022-08-23	Eurofins	new	0.64	5.3	140.0	170.0	outlier
34645	082-58874	2022-07-05	Eurofins	new	0.54	5.6	130.0	170.0	outlier

4.3. All parameters except `LAL` and `TEMP`¶

In [11]:

# Columns of interest
key_cols = ["vannmiljo_code", "sample_date", "lab", "period", "depth1", "depth2"]
excl_cols = ["TEMP_°C", "LAL_µg/l Al"]
par_cols = [col for col in df.columns if col not in (key_cols + excl_cols)]

# Run algorithm
data = df[key_cols + par_cols].dropna()
data = utils.isolation_forest(data, par_cols, contamination=0.01)

# Summarise results
all_out = data.query("pred == 'outlier'")
his_out = data.query("(pred == 'outlier') and (period == 'historic')")
new_out = data.query("(pred == 'outlier') and (period == 'new')")

print(f"The total number of samples in the dataset is: {len(data)}.\n")
print(
    f"The total number of outliers detected is {len(all_out)}:\n"
    f"    {len(his_out)} in the 'historic' period\n"
    f"    {len(new_out)} in the 'new' period\n"
)

new_out

/opt/conda/lib/python3.10/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names
  warnings.warn(

The total number of samples in the dataset is: 1759.

The total number of outliers detected is 18:
    5 in the 'historic' period
    13 in the 'new' period

Out[11]:

	vannmiljo_code	sample_date	lab	period	ALK_mmol/l	ANC_µekv/l	CA_mg/l	CL_mg/l	...	N-NO3_µg/l N	N-TOT_µg/l N	NA_mg/l	P-TOT_µg/l P	PH_<ubenevnt>	RAL_µg/l Al	SIO2_µg/l Si	SO4_mg/l	TOC_mg/l C	pred
3930	021-28450	2022-07-08	Eurofins	new	0.15	79.0	7.7	35.0	...	270.0	530.0	16.0	14.0	6.7	11.0	1309.121172	4.66	4.6	outlier
3932	021-28450	2022-08-09	Eurofins	new	0.09	69.0	5.0	26.0	...	270.0	3600.0	13.0	22.0	6.6	31.0	1215.612517	3.90	7.0	outlier
3935	021-28450	2022-09-09	Eurofins	new	0.16	110.0	8.6	33.0	...	720.0	830.0	15.0	8.4	6.6	9.1	1402.629827	4.82	4.4	outlier
6879	022-97011	2022-07-06	Eurofins	new	0.25	270.0	4.2	6.6	...	150.0	290.0	4.9	7.0	7.1	27.0	2477.979361	2.47	4.9	outlier
6881	022-97011	2022-08-03	Eurofins	new	0.25	270.0	4.1	7.1	...	160.0	270.0	5.5	5.7	7.0	28.0	2805.259654	3.06	4.6	outlier
6884	022-97011	2022-09-07	Eurofins	new	0.24	230.0	3.9	7.5	...	140.0	300.0	4.9	4.1	7.0	29.0	2805.259654	2.67	5.1	outlier
8333	024-58894	2022-07-06	Eurofins	new	0.15	-0.3	3.8	13.0	...	810.0	990.0	5.3	12.0	6.7	7.3	701.314913	3.98	3.0	outlier
8334	024-58894	2022-08-02	Eurofins	new	0.16	120.0	4.1	11.0	...	1100.0	1200.0	6.6	6.9	7.0	8.9	1122.103862	3.93	2.5	outlier
8335	024-58894	2022-09-07	Eurofins	new	0.16	97.0	4.2	11.0	...	1300.0	1600.0	6.6	4.8	6.9	5.0	1122.103862	4.66	2.1	outlier
14226	027-79278	2022-07-05	Eurofins	new	0.30	230.0	4.8	18.0	...	530.0	740.0	9.6	37.0	7.1	7.3	1683.155792	3.67	3.6	outlier
14227	027-79278	2022-08-02	Eurofins	new	0.30	230.0	4.9	18.0	...	810.0	950.0	10.0	31.0	7.2	6.9	2197.453395	3.87	3.3	outlier
14228	027-79278	2022-09-06	Eurofins	new	0.27	240.0	4.6	17.0	...	590.0	790.0	9.3	33.0	7.0	5.0	1215.612517	3.17	4.1	outlier
17198	030-58838	2022-09-06	Eurofins	new	0.11	100.0	3.6	4.7	...	1700.0	2000.0	2.6	110.0	6.2	9.1	794.823569	1.94	2.8	outlier

13 rows × 24 columns