import os
import sqlite3
import sys
sys.path.append("..")
import nivapy3 as nivapy
import numpy as np
import pandas as pd
import utils
This notebook explores the 2023 Q1 data from Eurofins, sent by Marianne Isebakke on 23.04.2023 at 16:48. The code performs initial data exploration and cleaning, with the aim of creating a tidy dataset in SQLite that can be used for further analysis.
Note: This file uses an updated export from Vannmiljø reflecting corrections made to two station codes in August 2021 - see e-mail from Kjetil received 27.05.2021 at 17:15 for details.
# Choose dataset to process
lab = "Eurofins"
year = 2023
qtr = 1
version = 1
Using a database will provide basic checks on data integrity and consistency. For this project, three tables are sufficient: `stations`, `parameters_units` and `water_chemistry`.
The code below creates a basic database structure, which will be populated later.
# Create database
fold_path = f"../../output/{lab.lower()}_{year}_q{qtr}_v{version}"
if not os.path.exists(fold_path):
os.makedirs(fold_path)
db_path = os.path.join(fold_path, "kalk_data.db")
if os.path.exists(db_path):
os.remove(db_path)
eng = sqlite3.connect(db_path, detect_types=sqlite3.PARSE_DECLTYPES)
# Turn off journal mode for performance
eng.execute("PRAGMA synchronous = OFF")
eng.execute("PRAGMA journal_mode = OFF")
eng.execute("PRAGMA foreign_keys = ON")
# Create stations table
sql = (
"CREATE TABLE stations "
"( "
" fylke text NOT NULL, "
" vassdrag text NOT NULL, "
" station_name text NOT NULL, "
" station_number text, "
" vannmiljo_code text NOT NULL, "
" vannmiljo_name text, "
" utm_east real NOT NULL, "
" utm_north real NOT NULL, "
" utm_zone integer NOT NULL, "
" lon real NOT NULL, "
" lat real NOT NULL, "
" liming_status text NOT NULL, "
" comment text, "
" PRIMARY KEY (vannmiljo_code) "
")"
)
eng.execute(sql)
# Create parameters table
sql = (
"CREATE TABLE parameters_units "
"( "
" vannmiljo_name text NOT NULL UNIQUE, "
" vannmiljo_id text NOT NULL UNIQUE, "
" vannmiljo_unit text NOT NULL, "
" vestfoldlab_name text NOT NULL UNIQUE, "
" vestfoldlab_unit text NOT NULL, "
" vestfoldlab_to_vm_conv_fac real NOT NULL, "
" eurofins_name text NOT NULL UNIQUE, "
" eurofins_unit text NOT NULL, "
" eurofins_to_vm_conv_fac real NOT NULL, "
" min real NOT NULL, "
" max real NOT NULL, "
" PRIMARY KEY (vannmiljo_id) "
")"
)
eng.execute(sql)
# Create chemistry table
sql = (
"CREATE TABLE water_chemistry "
"( "
" vannmiljo_code text NOT NULL, "
" sample_date datetime NOT NULL, "
" lab text NOT NULL, "
" period text NOT NULL, "
" depth1 real, "
" depth2 real, "
" parameter text NOT NULL, "
" flag text, "
" value real NOT NULL, "
" unit text NOT NULL, "
" PRIMARY KEY (vannmiljo_code, sample_date, depth1, depth2, parameter), "
" CONSTRAINT vannmiljo_code_fkey FOREIGN KEY (vannmiljo_code) "
" REFERENCES stations (vannmiljo_code) "
" ON UPDATE NO ACTION ON DELETE NO ACTION, "
" CONSTRAINT parameter_fkey FOREIGN KEY (parameter) "
" REFERENCES parameters_units (vannmiljo_id) "
" ON UPDATE NO ACTION ON DELETE NO ACTION "
")"
)
eng.execute(sql)
Station details are stored in `../../data/active_stations_2020.xlsx`, which is a tidied version of Øyvind's original file here:

K:\Prosjekter\langtransporterte forurensninger\Kalk Tiltaksovervåking\12 KS vannkjemi\Vannlokaliteter koordinater_kun aktive stasj 2020.xlsx

Note that corrections (e.g. adjusted station co-ordinates) have been made to the tidied file, but not to the original on K:. The version in this repository should therefore be used as the "master" copy.
# Read station data
stn_df = pd.read_excel(r"../../data/active_stations_2020.xlsx", sheet_name="data")
stn_df = nivapy.spatial.utm_to_wgs84_dd(stn_df)
print("The following stations are missing spatial co-ordinates:")
stn_df.query("lat != lat")
The following stations are missing spatial co-ordinates:
*(empty DataFrame: no stations are missing spatial co-ordinates)*
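The `query("lat != lat")` call above exploits the fact that NaN is the only value that compares unequal to itself, so it selects exactly the rows with missing latitude. A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# NaN != NaN is True, so "lat != lat" selects rows with missing latitude
df = pd.DataFrame({"station": ["A", "B", "C"], "lat": [59.9, np.nan, 60.4]})
missing = df.query("lat != lat")
print(missing["station"].tolist())  # ['B']
```

An equivalent, arguably more explicit, form is `df[df["lat"].isna()]`.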
print("The following stations do not have a code in Vannmiljø:")
stn_df.query("vannmiljo_code != vannmiljo_code")
The following stations do not have a code in Vannmiljø:
*(empty DataFrame: all stations have a Vannmiljø code)*
# Map
stn_map = nivapy.spatial.quickmap(
stn_df.dropna(subset=["lat"]),
lat_col="lat",
lon_col="lon",
popup="station_name",
cluster=True,
kartverket=True,
aerial_imagery=True,
)
stn_map.save("../../pages/stn_map.html")
stn_map
# Add to database
stn_df.dropna(subset=["vannmiljo_code", "lat"], inplace=True)
stn_df.to_sql(name="stations", con=eng, if_exists="append", index=False)
222
The file ../../data/parameter_unit_mapping.xlsx
provides a lookup between parameter names & units used by the labs and those in Vannmiljø. It also contains plausible ranges (using Vannmiljø units) for each parameter. These ranges have been chosen by using the values already in Vannmiljø as a reference. However, it looks as though some of the data in Vannmiljø might also be spurious, so it would be good to refine these ranges based on domain knowledge, if possible.
Note: Concentrations reported as exactly zero are likely to be errors, because most (all?) lab methods should report an LOQ instead.
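The zero-concentration issue can be screened for directly. The sketch below uses hypothetical long-format data and column names (`flag`, `value`); labs normally report values below the limit of quantification with a `<` flag rather than an exact zero, so unflagged zeros are worth inspecting:

```python
import pandas as pd

# Hypothetical long-format chemistry records. An exact zero without a
# "<" (below-LOQ) flag is suspicious and worth querying with the lab.
df = pd.DataFrame(
    {
        "parameter": ["TOC", "CA", "SO4"],
        "flag": [None, None, "<"],
        "value": [4.2, 0.0, 0.5],
    }
)
zeros = df[(df["value"] == 0) & (df["flag"] != "<")]
print(zeros["parameter"].tolist())  # ['CA']
```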
# Read parameter mappings
par_df = utils.get_par_unit_mappings()
# Add to database
par_df.to_sql(name="parameters_units", con=eng, if_exists="append", index=False)
par_df
  | vannmiljo_name | vannmiljo_id | vannmiljo_unit | vestfoldlab_name | vestfoldlab_unit | vestfoldlab_to_vm_conv_fac | eurofins_name | eurofins_unit | eurofins_to_vm_conv_fac | min | max |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Temperatur | TEMP | °C | Temp | °C | 1.000000 | Temp | °C | 1.000000 | -10 | 30 |
1 | pH | PH | <ubenevnt> | pH | enh | 1.000000 | pH | enh | 1.000000 | 1 | 10 |
2 | Konduktivitet | KOND | mS/m | Kond | mS/m | 1.000000 | Kond | ms/m | 1.000000 | 0 | 100 |
3 | Total alkalitet | ALK | mmol/l | Alk | mmol/l | 1.000000 | Alk | mmol/l | 1.000000 | 0 | 2 |
4 | Totalfosfor | P-TOT | µg/l P | Tot-P | µg/l | 1.000000 | Tot-P | µg/l | 1.000000 | 0 | 500 |
5 | Totalnitrogen | N-TOT | µg/l N | Tot-N | µg/l | 1.000000 | Tot-N | µg/l | 1.000000 | 0 | 4000 |
6 | Nitrat | N-NO3 | µg/l N | NO3 | µg/l | 1.000000 | NO3 | µg/l | 1.000000 | 0 | 2000 |
7 | Totalt organisk karbon (TOC) | TOC | mg/l C | TOC | mg/l | 1.000000 | TOC | mg/l | 1.000000 | 0 | 100 |
8 | Reaktivt aluminium | RAL | µg/l Al | RAl | µg/l | 1.000000 | RAl | µg/l | 1.000000 | 0 | 500 |
9 | Ikke-labilt aluminium | ILAL | µg/l Al | ILAl | µg/l | 1.000000 | ILAl | µg/l | 1.000000 | 0 | 500 |
10 | Labilt aluminium | LAL | µg/l Al | LAl | µg/l | 1.000000 | LAl | µg/l | 1.000000 | 0 | 500 |
11 | Klorid | CL | mg/l | Cl | mg/l | 1.000000 | Cl | mg/l | 1.000000 | 0 | 100 |
12 | Sulfat | SO4 | mg/l | SO4 | mg/l | 1.000000 | SO4 | mg/l | 1.000000 | 0 | 20 |
13 | Kalsium | CA | mg/l | Ca | mg/l | 1.000000 | Ca | mg/l | 1.000000 | 0 | 500 |
14 | Kalium | K | mg/l | K | mg/l | 1.000000 | K | mg/l | 1.000000 | 0 | 10 |
15 | Magnesium | MG | mg/l | Mg | mg/l | 1.000000 | Mg | mg/l | 1.000000 | 0 | 100 |
16 | Natrium | NA | mg/l | Na | mg/l | 1.000000 | Na | mg/l | 1.000000 | 0 | 50 |
17 | Totalt silikat | SIO2 | µg/l Si | SIO2 | mg/l | 467.543276 | SIO2 | µg/l | 0.467543 | 0 | 7000 |
18 | Syrenøytraliserende kapasitet (ANC) | ANC | µekv/l | ANC | µekv/l | 1.000000 | ANC | µekv/l | 1.000000 | -1000 | 6000 |
The Vannmiljø dataset is large and reading from Excel is slow; the code below takes a couple of minutes to run.
Note from the output below that there are more than 1600 "duplicated" samples in the Vannmiljø dataset, i.e. where the station code, sample date, sample depth, lab and parameter name are all the same, but a different value is reported. It would be helpful to know why these duplicates exist, e.g. are they reanalysis values, where only one of the duplicates should be used, or are they genuine (in which case, should they be averaged or kept separate)? For the moment, I will ignore these values.
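The internals of `utils.handle_duplicates` are not shown here, but "duplicates" in the sense above (identical key columns, differing values) can be identified with `DataFrame.duplicated`. A hedged sketch with hypothetical data, dropping all copies of a duplicated key (which appears to be what `action="drop"` does, given the log message below):

```python
import pandas as pd

# Hypothetical records: the first two rows share all key columns but
# report different values, so they count as "duplicates".
df = pd.DataFrame(
    {
        "vannmiljo_code": ["019-1", "019-1", "019-2"],
        "sample_date": ["2015-01-02"] * 3,
        "depth1": [0.0, 0.0, 0.0],
        "par_unit": ["PH_<ubenevnt>", "PH_<ubenevnt>", "PH_<ubenevnt>"],
        "value": [6.1, 6.4, 5.9],
    }
)
key_cols = ["vannmiljo_code", "sample_date", "depth1", "par_unit"]
dups = df[df.duplicated(subset=key_cols, keep=False)]      # all offending rows
clean = df.drop_duplicates(subset=key_cols, keep=False)    # drop every copy
print(len(dups), len(clean))  # 2 1
```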
# Read historic data from Vannmiljø
his_df = utils.read_historic_data(
r"../../data/vannmiljo_export_2012-20_2022-08-19.xlsx",
st_yr=2012,
end_yr=2020,
)
# Tidy lab names for clarity
his_df["lab"].replace(
{
"NIVA": "NIVA (historic)",
"VestfoldLAB AS": "VestfoldLAB (historic)",
"Eurofins Environment Testing Norway AS (Moss)": "Eurofins (historic)",
},
inplace=True,
)
# Add label for data period
his_df["period"] = "historic"
# Print summary
n_stns = len(his_df["vannmiljo_code"].unique())
print(f"The number of unique stations with data is: {n_stns}.\n")
# Handle duplicates
his_dup_csv = r"../../output/vannmiljo_historic/vannmiljo_duplicates.csv"
his_df = utils.handle_duplicates(his_df, his_dup_csv, action="drop")
his_df.head()
The number of unique stations with data is: 221.

There are 2578 duplicated records (same station_code-date-depth-parameter, but different value). These will be dropped.
  | vannmiljo_code | sample_date | lab | depth1 | depth2 | par_unit | flag | value | period |
---|---|---|---|---|---|---|---|---|---|
532 | 019-44498 | 2012-01-02 | NIVA (historic) | 0.0 | 0.0 | K_mg/l | = | 0.19 | historic |
533 | 019-44498 | 2012-01-02 | NIVA (historic) | 0.0 | 0.0 | KOND_mS/m | = | 1.71 | historic |
534 | 019-44498 | 2012-02-15 | NIVA (historic) | 0.0 | 0.0 | CL_mg/l | = | 1.15 | historic |
535 | 019-44498 | 2012-02-15 | NIVA (historic) | 0.0 | 0.0 | KOND_mS/m | = | 1.40 | historic |
536 | 019-44498 | 2012-03-05 | NIVA (historic) | 0.0 | 0.0 | CL_mg/l | = | 1.83 | historic |
The code below reads the Excel template provided by Eurofins and reformats it to the same structure (parameter names, units etc.) as the data in Vannmiljø.
# Read new data
new_df = utils.read_data_template_to_wide(
f"../../data/{lab.lower()}_data_{year}_q{qtr}_v{version}.xlsx",
sheet_name="results",
lab=lab,
)
utils.perform_basic_checks(new_df)
new_df = utils.wide_to_long(new_df, lab)
# Add label for data period
new_df["period"] = "new"
# Handle duplicates
dup_csv = os.path.join(
fold_path, f"{lab.lower()}_{year}_q{qtr}_v{version}_duplicates.csv"
)
new_df = utils.handle_duplicates(new_df, dup_csv, action="drop")
new_df.head()
Checking stations:
    The following location IDs have inconsistent names within this template:
    The following location names have multiple IDs within this template:

Checking sample dates:
    The file contains samples from several year quarters.

Checking for non-numeric data: Done.

Checking for values less than zero: Done.

Checking for consistent LOD values: Done.

Checking NO3 and TOTN:
    The following samples have nitrate greater than total nitrogen:

  | vannmiljo_code | sample_date | depth1 | depth2 | NO3_µg/l | Tot-N_µg/l |
---|---|---|---|---|---|---|
79 | 024-58894 | 2023-02-06 | 0 | 0 | 500.0 | 490 |
511 | 030-30839 | 2023-02-27 | 0 | 0 | 270.0 | 250 |
593 | 030-58838 | 2023-01-02 | 0 | 0 | 630.0 | 600 |
596 | 030-58838 | 2023-01-30 | 0 | 0 | 830.0 | 680 |
597 | 030-58838 | 2023-02-06 | 0 | 0 | 1900.0 | 1800 |
612 | 027-79279 | 2023-02-27 | 0 | 0 | 790.0 | 730 |
626 | 032-58839 | 2023-02-06 | 0 | 0 | 330.0 | 310 |
658 | 031-58843 | 2023-01-02 | 0 | 0 | 270.0 | 250 |
660 | 031-58843 | 2023-01-09 | 0 | 0 | 300.0 | 270 |
661 | 031-58843 | 2023-01-30 | 0 | 0 | 210.0 | 200 |
665 | 031-58843 | 2023-02-27 | 0 | 0 | 170.0 | 150 |
677 | 031-38523 | 2023-02-27 | 0 | 0 | 160.0 | 140 |
740 | 027-58845 | 2023-01-30 | 0 | 0 | 610.0 | 600 |
781 | 038-58854 | 2023-02-06 | 0 | 0 | 480.0 | 430 |
782 | 038-58854 | 2023-02-27 | 0 | 0 | 450.0 | 430 |
1002 | 038-58868 | 2023-01-30 | 0 | 0 | 270.0 | 260 |
1042 | 063-58804 | 2023-02-06 | 0 | 0 | 240.0 | 230 |
1159 | 082-58875 | 2023-01-02 | 0 | 0 | 630.0 | 580 |
1202 | 067-82874 | 2023-03-01 | 0 | 0 | 230.0 | 210 |
1205 | 067-82877 | 2023-03-01 | 0 | 0 | 160.0 | 150 |
1271 | 067-82880 | 2023-03-01 | 0 | 0 | 86.0 | 79 |
1277 | 067-82878 | 2023-03-01 | 0 | 0 | 130.0 | 120 |
1278 | 064-82800 | 2023-01-03 | 0 | 0 | 320.0 | 280 |
1280 | 064-82800 | 2023-01-30 | 0 | 0 | 190.0 | 180 |
1283 | 064-82800 | 2023-02-27 | 0 | 0 | 220.0 | 200 |
1296 | 064-28997 | 2023-02-27 | 0 | 0 | 130.0 | 120 |
1313 | 064-81557 | 2023-01-30 | 0 | 0 | 130.0 | 120 |
1361 | 045-58816 | 2023-01-03 | 0 | 0 | 590.0 | 570 |
1364 | 045-58816 | 2023-02-02 | 0 | 0 | 460.0 | 420 |
1367 | 045-58816 | 2023-02-28 | 0 | 0 | 460.0 | 420 |
1372 | 045-58817 | 2023-01-03 | 0 | 0 | 370.0 | 340 |
1375 | 045-58817 | 2023-02-02 | 0 | 0 | 290.0 | 270 |
1378 | 045-58817 | 2023-02-28 | 0 | 0 | 240.0 | 210 |

Checking Al fractions:
    The following samples have LAl != RAl - ILAl:

  | vannmiljo_code | sample_date | depth1 | depth2 | RAl_µg/l | ILAl_µg/l | LAl_µg/l | LAl_Calc_µg/l |
---|---|---|---|---|---|---|---|---|
334 | 022-32018 | 2023-01-05 | 0 | 0 | 53.0 | 52.0 | 5.0 | 1.0 |
336 | 022-32020 | 2022-10-31 | 0 | 0 | 34.0 | 34.0 | 5.0 | 0.0 |
340 | 022-32019 | 2022-10-31 | 0 | 0 | 33.0 | 32.0 | 5.0 | 1.0 |
529 | 030-47471 | 2023-02-06 | 0 | 0 | 9.5 | 5.4 | 5.0 | 4.1 |
616 | 027-79278 | 2023-02-27 | 0 | 0 | 14.0 | 8.8 | 5.0 | 5.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1409 | 062-58821 | 2023-01-02 | 0 | 0 | 11.0 | 10.0 | 5.0 | 1.0 |
1411 | 062-58821 | 2023-02-28 | 0 | 0 | 13.0 | 10.0 | 5.0 | 3.0 |
1412 | 062-58822 | 2023-01-02 | 0 | 0 | 14.0 | 12.0 | 5.0 | 2.0 |
1414 | 062-58822 | 2023-02-28 | 0 | 0 | 6.3 | 5.0 | 5.0 | 1.3 |
1417 | 062-58823 | 2023-02-28 | 0 | 0 | 11.0 | 7.7 | 5.0 | 3.3 |

[61 rows x 8 columns]

There are 620 duplicated records (same station_code-date-depth-parameter, but different value). These will be dropped.
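The aluminium mass-balance check above compares the reported labile fraction with the difference between reactive and non-labile aluminium (LAl = RAl − ILAl). The actual implementation in `utils` is not shown here; the sketch below, with hypothetical data and an assumed tolerance of 0.5 µg/l, illustrates the idea:

```python
import pandas as pd

# Hypothetical wide-format samples. Labile Al should satisfy
# LAl = RAl - ILAl; rows where the reported LAl deviates are flagged.
df = pd.DataFrame(
    {
        "RAl_µg/l": [53.0, 34.0, 20.0],
        "ILAl_µg/l": [52.0, 34.0, 15.0],
        "LAl_µg/l": [5.0, 5.0, 5.0],
    }
)
df["LAl_Calc_µg/l"] = df["RAl_µg/l"] - df["ILAl_µg/l"]
# Compare with a small tolerance (assumed here) rather than exact equality
bad = df[(df["LAl_µg/l"] - df["LAl_Calc_µg/l"]).abs() > 0.5]
print(len(bad))  # 2
```

Note that many of the flagged rows in the output report LAl as exactly 5.0 while the calculated value is smaller; this could be an LOQ substitution and may be worth raising with the lab.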
/home/jovyan/projects/tiltaksovervakingen/notebooks/eurofins_2023_q1/../utils.py:162: FutureWarning: Passing unit-less datetime64 dtype to .astype is deprecated and will raise in a future version. Pass 'datetime64[ns]' instead df = df.astype(
  | vannmiljo_code | sample_date | lab | depth1 | depth2 | par_unit | flag | value | period |
---|---|---|---|---|---|---|---|---|---|
0 | 019-58793 | 2023-01-02 | Eurofins | 0.0 | 0.0 | TEMP_°C | nan | 3.0 | new |
1 | 019-58793 | 2023-01-30 | Eurofins | 0.0 | 0.0 | TEMP_°C | nan | 3.0 | new |
2 | 019-58793 | 2023-02-27 | Eurofins | 0.0 | 0.0 | TEMP_°C | nan | 3.0 | new |
3 | 019-101022 | 2023-01-02 | Eurofins | 0.0 | 0.0 | TEMP_°C | nan | 3.0 | new |
4 | 019-101022 | 2023-01-30 | Eurofins | 0.0 | 0.0 | TEMP_°C | nan | 2.0 | new |
Combine the `historic` and `new` datasets into a single dataframe in "long" format.
# Combine
df = pd.concat([his_df, new_df], axis="rows")
# Separate par and unit
df[["parameter", "unit"]] = df["par_unit"].str.split("_", n=1, expand=True)
del df["par_unit"]
df.reset_index(drop=True, inplace=True)
df.head()
  | vannmiljo_code | sample_date | lab | depth1 | depth2 | flag | value | period | parameter | unit |
---|---|---|---|---|---|---|---|---|---|---|
0 | 019-44498 | 2012-01-02 | NIVA (historic) | 0.0 | 0.0 | = | 0.19 | historic | K | mg/l |
1 | 019-44498 | 2012-01-02 | NIVA (historic) | 0.0 | 0.0 | = | 1.71 | historic | KOND | mS/m |
2 | 019-44498 | 2012-02-15 | NIVA (historic) | 0.0 | 0.0 | = | 1.15 | historic | CL | mg/l |
3 | 019-44498 | 2012-02-15 | NIVA (historic) | 0.0 | 0.0 | = | 1.40 | historic | KOND | mS/m |
4 | 019-44498 | 2012-03-05 | NIVA (historic) | 0.0 | 0.0 | = | 1.83 | historic | CL | mg/l |
# Apply correction to historic SIO2
df["value"] = np.where(
(df["lab"] == "VestfoldLAB (historic)") & (df["parameter"] == "SIO2"),
df["value"] * 467.5432,
df["value"],
)
# Reclassify (nitrate + nitrite) to nitrate
df["parameter"].replace({"N-SNOX": "N-NO3"}, inplace=True)
A simple method for preliminary quality control is to check whether parameter values are within sensible ranges (as defined in the `parameters_units` table; see Section 4 above). I believe this screening should be implemented differently for the `historic` (i.e. Vannmiljø) and `new` datasets, as follows:

  * For the `historic` data in Vannmiljø, values outside the plausible ranges should be removed from the dataset entirely. This is because we intend to use the Vannmiljø data as a reference against which new values will be compared, so it is important the dataset does not contain anything too strange. Ideally, the reference dataset should be carefully manually curated to ensure it is as good as possible, but I'm not sure we have the resources in this project to thoroughly quality-assess the data already in Vannmiljø. Dealing with any obvious issues is a good start, though.
  * For the `new` data, values outside the plausible ranges should be highlighted and checked with the reporting lab.

Note: At present, my code will remove any concentration values of exactly zero from the historic dataset. Check with Øyvind whether this is too strict.
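The two-tier screening described above could be sketched as follows. This is not the actual `utils.check_data_ranges` implementation; the data, column names and inclusive limits (values equal to `min`/`max` are treated as out of range, matching the log messages below) are assumptions for illustration:

```python
import pandas as pd

# Hypothetical plausible ranges (as in the parameters_units table)
limits = pd.DataFrame(
    {"parameter": ["PH", "SO4"], "min": [1.0, 0.0], "max": [10.0, 20.0]}
)

# Hypothetical combined long-format data
df = pd.DataFrame(
    {
        "parameter": ["PH", "PH", "SO4"],
        "value": [0.1, 6.5, 58.0],
        "period": ["historic", "new", "new"],
    }
)

df = df.merge(limits, on="parameter", how="left")
out_of_range = (df["value"] <= df["min"]) | (df["value"] >= df["max"])

# Historic: drop out-of-range rows. New: keep them, but flag for follow-up.
df = df[~(out_of_range & (df["period"] == "historic"))].copy()
df["range_ok"] = ~((df["value"] <= df["min"]) | (df["value"] >= df["max"]))
print(df[["parameter", "value", "period", "range_ok"]].to_string(index=False))
```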
# Check ranges
df = utils.check_data_ranges(df)
Checking data ranges for the 'historic' period.
    TEMP: Maximum value of 30.00 is greater than or equal to upper limit (30.00).
    PH: Minimum value of 0.10 is less than or equal to lower limit (1.00).
    KOND: Maximum value of 309.00 is greater than or equal to upper limit (100.00).
    P-TOT: Maximum value of 1200.00 is greater than or equal to upper limit (500.00).
    N-NO3: Maximum value of 2700.00 is greater than or equal to upper limit (2000.00).
    LAL: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    SO4: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    SO4: Maximum value of 58.00 is greater than or equal to upper limit (20.00).
    CA: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    CA: Maximum value of 4800.00 is greater than or equal to upper limit (500.00).
    K: Maximum value of 23.00 is greater than or equal to upper limit (10.00).
    SIO2: Maximum value of 2763180.31 is greater than or equal to upper limit (7000.00).
    ANC: Minimum value of -1000.00 is less than or equal to lower limit (-1000.00).

Checking data ranges for the 'new' period.

Dropping problem rows from historic data.
    Dropping rows for TEMP.
    Dropping rows for PH.
    Dropping rows for KOND.
    Dropping rows for P-TOT.
    Dropping rows for N-NO3.
    Dropping rows for LAL.
    Dropping rows for SO4.
    Dropping rows for CA.
    Dropping rows for K.
    Dropping rows for SIO2.
    Dropping rows for ANC.
# Add to database
df.to_sql(
name="water_chemistry",
con=eng,
if_exists="append",
index=False,
method="multi",
chunksize=1000,
)
197683
eng.close()
Points to note from the initial data exploration:

The following issues were highlighted by the app and fixed:

  * Column `Temp_°C` contains non-numeric values: `["'-1", "'-6"]`
  * Column `Tot-P_µg/l` contains multiple LOD values: `['<2,0', '<2']`

The following issues were highlighted by the app and need checking:

  * The template contains data from 2022 Q4: `WARNING: The file contains samples from several year quarters (quarters: [1 4])`
  * There are several samples where NO3 > TOTN (see app for details)
  * There are 6 samples where LAl > 20 µg/l and pH > 6.4, which is considered unlikely (see app for details)
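The LAl/pH plausibility rule (high labile aluminium is unlikely at pH above ~6.4) needs pH and LAl from the same sample on one row before it can be applied. One way to do this from the long-format table, sketched here with hypothetical data (the actual check lives in the app, not shown):

```python
import pandas as pd

# Hypothetical long-format data: pivot so pH and LAl for the same
# sample sit on one row, then apply the plausibility rule.
df = pd.DataFrame(
    {
        "vannmiljo_code": ["A", "A", "B", "B"],
        "sample_date": ["2023-01-02"] * 4,
        "parameter": ["PH", "LAL", "PH", "LAL"],
        "value": [6.8, 35.0, 5.9, 40.0],
    }
)
wide = df.pivot_table(
    index=["vannmiljo_code", "sample_date"],
    columns="parameter",
    values="value",
).reset_index()
suspect = wide[(wide["LAL"] > 20) & (wide["PH"] > 6.4)]
print(suspect["vannmiljo_code"].tolist())  # ['A']
```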