In [1]:

import os
import sqlite3
import sys

sys.path.append("..")

import nivapy3 as nivapy
import numpy as np
import pandas as pd
import utils

Tiltaksovervakingen: opsjon for kvalitetskontroll av analysedata¶

Eurofins 2020 Q4¶

Notebook 1: Initial exploration and data cleaning¶

1. Initial data check¶

Note: This notebook uses an updated export from Vannmiljø reflecting corrections made to two station codes in August 2021 - see e-mail from Kjetil received 27.05.2021 at 17:15 for details.

Initial screening of the Eurofins Excel file shows the following:

The data matches the VestfoldLAB submission template, with the exception that SiO2 is reported by Eurofins in ug/l, not mg/l. I have added additional columns to ../../data/parameter_unit_mapping.xlsx in order to correctly convert the units reported by Eurofins
There are 61 records without a valid Lokalitets ID. These records are associated with 6 stations. 5 of these stations were newly added in 2020 and the correct IDs were only supplied by Kjetil in October 2020, so it makes sense that these should be missing. However, one is an old station (Barstadvassdraget v Liland (1); 026-30849) so it's strange that the code for this was missing. I have added the missing codes based on data in ../../data/active_stations_2020.xlsx, but Eurofins will need to do this too
Some cells are marked ., others -, others N/A and others left blank. Is there any significance to these characters? For now, I have assumed they are all equivalent ways of indicating "no data" and have deleted these entries accordingly. Check with Eurofins
The Eurofins data includes < characters, whereas VestfoldLAB removed these and replaced the original value by half the LOQ. I will keep the < characters, and simply use the LOQ value itself (rather than half the value) in any subsequent analysis
The Excel file mixes . and , as the decimal separator. I have replaced all occurrences of , as a decimal separator with .
Inconsistent flag combinations for Al fractions. Usually LAl is simply calculated as (RAl - ILAl), which leads to negative values being reported in some cases. However, in one example, RAl is reported as 5.3 ug/l and ILAl as < 5 ug/l. For this sample, LAl is reported as < 0,3 ug/l, which doesn't seem correct. Valid values for LAl in this case lie in the range $0.3 < LAl <= 5.3$ ug/l. For now, I have converted this value to 0.3 ug/l (but we could choose the middle of the range

In [2]:

# Choose dataset to process
lab = "Eurofins"
year = 2020
qtr = 4
version = 1

2. Create SQLite database to store results¶

Using a database will provide basic checks on data integrity and consistency. For this project, three tables will be sufficient:

Station locations and metadata
Parameters and units used by VestfoldLAB, Eurofins and Vannmiljø, and conversion factors between these
Water chemistry data

The code below creates a basic database structure, which will be populated later.

In [3]:

# Create database
fold_path = f"../../output/{lab.lower()}_{year}_q{qtr}_v{version}"
if not os.path.exists(fold_path):
    os.makedirs(fold_path)

db_path = os.path.join(fold_path, "kalk_data.db")
if os.path.exists(db_path):
    os.remove(db_path)
eng = sqlite3.connect(db_path, detect_types=sqlite3.PARSE_DECLTYPES)

# Turn off journal mode for performance
eng.execute("PRAGMA synchronous = OFF")
eng.execute("PRAGMA journal_mode = OFF")
eng.execute("PRAGMA foreign_keys = ON")

# Create stations table
sql = (
    "CREATE TABLE stations "
    "( "
    "  fylke text NOT NULL, "
    "  vassdrag text NOT NULL, "
    "  station_name text NOT NULL, "
    "  station_number text, "
    "  vannmiljo_code text NOT NULL, "
    "  vannmiljo_name text, "
    "  utm_east real NOT NULL, "
    "  utm_north real NOT NULL, "
    "  utm_zone integer NOT NULL, "
    "  lon real NOT NULL, "
    "  lat real NOT NULL, "
    "  liming_status text NOT NULL, "
    "  comment text, "
    "  PRIMARY KEY (vannmiljo_code) "
    ")"
)
eng.execute(sql)

# Create parameters table
sql = (
    "CREATE TABLE parameters_units "
    "( "
    "  vannmiljo_name text NOT NULL UNIQUE, "
    "  vannmiljo_id text NOT NULL UNIQUE, "
    "  vannmiljo_unit text NOT NULL, "
    "  vestfoldlab_name text NOT NULL UNIQUE, "
    "  vestfoldlab_unit text NOT NULL, "
    "  vestfoldlab_to_vm_conv_fac real NOT NULL, "
    "  eurofins_name text NOT NULL UNIQUE, "
    "  eurofins_unit text NOT NULL, "
    "  eurofins_to_vm_conv_fac real NOT NULL, "
    "  min real NOT NULL, "
    "  max real NOT NULL, "
    "  PRIMARY KEY (vannmiljo_id) "
    ")"
)
eng.execute(sql)

# Create chemistry table
sql = (
    "CREATE TABLE water_chemistry "
    "( "
    "  vannmiljo_code text NOT NULL, "
    "  sample_date datetime NOT NULL, "
    "  lab text NOT NULL, "
    "  period text NOT NULL, "
    "  depth1 real, "
    "  depth2 real, "
    "  parameter text NOT NULL, "
    "  flag text, "
    "  value real NOT NULL, "
    "  unit text NOT NULL, "
    "  PRIMARY KEY (vannmiljo_code, sample_date, depth1, depth2, parameter), "
    "  CONSTRAINT vannmiljo_code_fkey FOREIGN KEY (vannmiljo_code) "
    "      REFERENCES stations (vannmiljo_code) "
    "      ON UPDATE NO ACTION ON DELETE NO ACTION, "
    "  CONSTRAINT parameter_fkey FOREIGN KEY (parameter) "
    "      REFERENCES parameters_units (vannmiljo_id) "
    "      ON UPDATE NO ACTION ON DELETE NO ACTION "
    ")"
)
eng.execute(sql)

Out[3]:

<sqlite3.Cursor at 0x7fc3687b4810>

3. Explore station data¶

Station details are stored in ../../data/active_stations_2020.xlsx, which is a tidied version of Øyvind's original file here:

K:\Prosjekter\langtransporterte forurensninger\Kalk Tiltaksovervåking\12 KS vannkjemi\Vannlokaliteter koordinater_kun aktive stasj 2020.xlsx

Note that corrections (e.g. adjusted station co-ordinates) have been made to the tidied file, but not the original on K:. The version in this repository should therefore been used as the "master" copy.

In [4]:

# Read station data
stn_df = pd.read_excel(r"../../data/active_stations_2020.xlsx", sheet_name="data")
stn_df = nivapy.spatial.utm_to_wgs84_dd(stn_df)

print("The following stations are missing spatial co-ordinates:")
stn_df.query("lat != lat")

The following stations are missing spatial co-ordinates:

Out[4]:

	fylke	vassdrag	station_name	station_number	vannmiljo_code	vannmiljo_name	utm_east	utm_north	utm_zone	liming_status	comment	lat	lon

In [5]:

print("The following stations do not have a code in Vannmiljø:")
stn_df.query("vannmiljo_code != vannmiljo_code")

The following stations do not have a code in Vannmiljø:

Out[5]:

	fylke	vassdrag	station_name	station_number	vannmiljo_code	vannmiljo_name	utm_east	utm_north	utm_zone	liming_status	comment	lat	lon

In [6]:

# Map
stn_map = nivapy.spatial.quickmap(
    stn_df.dropna(subset=["lat"]),
    lat_col="lat",
    lon_col="lon",
    popup="station_name",
    cluster=True,
    kartverket=True,
    aerial_imagery=True,
)

stn_map.save("../../pages/stn_map.html")

stn_map

Out[6]:

Make this Notebook Trusted to load map: File -> Trust Notebook

In [7]:

# Add to database
stn_df.dropna(subset=["vannmiljo_code", "lat"], inplace=True)
stn_df.to_sql(name="stations", con=eng, if_exists="append", index=False)

Out[7]:

4. Parameters and units of interest¶

The file ../../data/parameter_unit_mapping.xlsx provides a lookup between parameter names & units used by the labs and those in Vannmiljø. It also contains plausible ranges (using Vannmiljø units) for each parameter. These ranges have been chosen by using the values already in Vannmiljø as a reference. However, it looks as though some of the data in Vannmiljø might also be spurious, so it would be good to refine these ranges based on domain knowledge, if possible.

Note: Concentrations reported as exactly zero are likely to be errors, because most (all?) lab methods should report an LOQ instead.

In [8]:

# Read parameter mappings
par_df = utils.get_par_unit_mappings()

# Add to database
par_df.to_sql(name="parameters_units", con=eng, if_exists="append", index=False)

par_df

Out[8]:

	vannmiljo_name	vannmiljo_id	vannmiljo_unit	vestfoldlab_name	vestfoldlab_unit	vestfoldlab_to_vm_conv_fac	eurofins_name	eurofins_unit	eurofins_to_vm_conv_fac	min	max
0	Temperatur	TEMP	°C	Temp	°C	1.000000	Temp	°C	1.000000	-10	30
1	pH	PH	<ubenevnt>	pH	enh	1.000000	pH	enh	1.000000	1	10
2	Konduktivitet	KOND	mS/m	Kond	mS/m	1.000000	Kond	ms/m	1.000000	0	100
3	Total alkalitet	ALK	mmol/l	Alk	mmol/l	1.000000	Alk	mmol/l	1.000000	0	2
4	Totalfosfor	P-TOT	µg/l P	Tot-P	µg/l	1.000000	Tot-P	µg/l	1.000000	0	500
5	Totalnitrogen	N-TOT	µg/l N	Tot-N	µg/l	1.000000	Tot-N	µg/l	1.000000	0	4000
6	Nitrat	N-NO3	µg/l N	NO3	µg/l	1.000000	NO3	µg/l	1.000000	0	2000
7	Totalt organisk karbon (TOC)	TOC	mg/l C	TOC	mg/l	1.000000	TOC	mg/l	1.000000	0	100
8	Reaktivt aluminium	RAL	µg/l Al	RAl	µg/l	1.000000	RAl	µg/l	1.000000	0	500
9	Ikke-labilt aluminium	ILAL	µg/l Al	ILAl	µg/l	1.000000	ILAl	µg/l	1.000000	0	500
10	Labilt aluminium	LAL	µg/l Al	LAl	µg/l	1.000000	LAl	µg/l	1.000000	0	500
11	Klorid	CL	mg/l	Cl	mg/l	1.000000	Cl	mg/l	1.000000	0	100
12	Sulfat	SO4	mg/l	SO4	mg/l	1.000000	SO4	mg/l	1.000000	0	20
13	Kalsium	CA	mg/l	Ca	mg/l	1.000000	Ca	mg/l	1.000000	0	500
14	Kalium	K	mg/l	K	mg/l	1.000000	K	mg/l	1.000000	0	10
15	Magnesium	MG	mg/l	Mg	mg/l	1.000000	Mg	mg/l	1.000000	0	100
16	Natrium	NA	mg/l	Na	mg/l	1.000000	Na	mg/l	1.000000	0	50
17	Totalt silikat	SIO2	µg/l Si	SIO2	mg/l	467.543276	SIO2	µg/l	0.467543	0	7000
18	Syrenøytraliserende kapasitet (ANC)	ANC	µekv/l	ANC	µekv/l	1.000000	ANC	µekv/l	1.000000	-1000	6000

5. Historic data from Vannmiljø¶

The Vannmiljø dataset is large and reading from Excel is slow; the code below takes a couple of minutes to run.

Note from the output below that there are more than 1600 "duplicated" samples in the Vannmiljø dataset i.e. where the station code, sample date, sample depth, lab and parameter name are all the same, but a different value is reported. It would be helpful to know why these duplicates were collected e.g. are these reanalysis values, where only one of the duplicates should be used, or are they genuine (in which case should they be averaged or kept separate?). For the moment, I will ignore these values.

In [9]:

# Read historic data from Vannmiljø
his_df = utils.read_historic_data(
    r"../../data/vannmiljo_export_2012-19_2021-08-16.xlsx"
)

# Tidy lab names for clarity
his_df["lab"].replace(
    {"NIVA": "NIVA (historic)", "VestfoldLAB AS": "VestfoldLAB (historic)"},
    inplace=True,
)

# Add label for data period
his_df["period"] = "historic"

# Print summary
n_stns = len(his_df["vannmiljo_code"].unique())
print(f"The number of unique stations with data is: {n_stns}.\n")

# Handle duplicates
his_dup_csv = r"../../output/vannmiljo_historic/vannmiljo_duplicates.csv"
his_df = utils.handle_duplicates(his_df, his_dup_csv, action="drop")

his_df.head()

The number of unique stations with data is: 211.


There are 1620 duplicated records (same station_code-date-depth-parameter, but different value).
These will be dropped.

Out[9]:

	vannmiljo_code	sample_date	lab	par_unit	flag	value	period
0	027-28435	2012-01-02	NIVA (historic)	ALK_mmol/l	=	0.06	historic
1	027-28435	2012-01-02	NIVA (historic)	CA_mg/l	=	1.23	historic
2	027-28435	2012-01-02	NIVA (historic)	ILAL_µg/l Al	=	18.00	historic
3	027-28435	2012-01-02	NIVA (historic)	KOND_mS/m	=	3.80	historic
4	027-28435	2012-01-02	NIVA (historic)	PH_<ubenevnt>	=	6.24	historic

6. New data from lab¶

The code below reads the Excel template provided by Eurofins and reformats it to the same structure (parameter names, units etc.) as the data in Vannmiljø.

In [10]:

# Read new data
new_df = utils.read_data_template_to_wide(
    f"../../data/{lab.lower()}_data_{year}_q{qtr}_v{version}.xlsx",
    sheet_name="results",
    lab=lab,
)
utils.perform_basic_checks(new_df)
new_df = utils.wide_to_long(new_df, lab)

# Add label for data period
new_df["period"] = "new"

# Handle duplicates
dup_csv = os.path.join(
    fold_path, f"{lab.lower()}_{year}_q{qtr}_v{version}_duplicates.csv"
)
new_df = utils.handle_duplicates(new_df, dup_csv, action="drop")

new_df.head()

The following location IDs in the new data are not in the definitive station list.
{'062-58824', '062-58827', '062-58828', '062-58826', '021-46388 ', '030-58776', '027-47427', '062-58825'}

The following location IDs have inconsistent names within this template:
     025-45779 ['Litlåna oppstrøms doserer (5)']   ==>   ['Logen nedstrøms doserer ' 'Litlåna oppstrøms doserer ']
     025-81555 ['Stakkeland bru (Rv. 42) (9)']   ==>   ['Stakkeland bru (Rv. 42)' 'Stallemobekken v/Underåsen']
     022-54615 ['Logna v/Kyrkjebygda']   ==>   ['Logna v/Kyrkjebygda' 'Logna v/Kyrkjebygda ']
     022-90731 ['Entredalsbekken']   ==>   ['Entredalsbekken' 'Entredalsbekken ']
     022-97011 ['Stemkjerrbekken ved bru']   ==>   ['Stemkjerrbekken ved bru' 'Stemkjerrbekken ved bru ']
     019-58794 ['Fyresvatn 4 dyp (5)']   ==>   ['Fyresvatn ' 'Fyresvatn']
     021-101031 ['Otra oppstrøms doserer Iveland (Dalanekilen inntak)']   ==>   ['Otra nedstrøms doserer Iveland' 'Otra oppstrøms doserer Iveland']
     036-58752 ['Mosåna oppstrøms doserer (23.1)']   ==>   ['Monebekken oppstrøms doserer' 'Monebekken oppstrøms doserer '
 'Mosåna oppstrøms doserer ']
     027-58846 ['Eikeland nedstrøms doserer (51)']   ==>   ['Eikeland oppstrøms doserer ' 'Eikeland nedstrøms doserer ']
     062-58820 ['Raundalselva (1)']   ==>   ['Evanger kraftstasjon nedstrøms' 'Raundalselva ']
     062-58821 ['Strandaelva (2)']   ==>   ['Evanger kraftstasjon nedstrøms' 'Strandaelva ']
     062-58822 ['Teigdalselva (6)']   ==>   ['Evanger kraftstasjon nedstrøms' 'Teigdalselva ']
     062-58823 ['Vossedalselva (18)']   ==>   ['Evanger kraftstasjon nedstrøms' 'Vossedalselva ']

The following location names have multiple IDs within this template:
     Logen nedstrøms doserer  []   ==>   ['025-45779' '067-58826']
     Stallemobekken v/Underåsen ['022-32019']   ==>   ['025-81555' '022-32019']
     Eikeland oppstrøms doserer  []   ==>   ['027-58846' '027-58847']
     Samnanger []   ==>   ['055-58811' '055-58812' '055-58813']
     Evanger kraftstasjon nedstrøms []   ==>   ['062-58819' '062-58820' '062-58821' '062-58822' '062-58823' '062-58824'
 '062-58825' '062-58826' '062-58827' '062-58828']

WARNING: File contains samples from several year quarters.

The following samples have nitrate greater than total nitrogen:
     vannmiljo_code         sample_date depth1 depth2  NO3_µg/l  Tot-N_µg/l
513       027-38543 2020-12-08 12:50:27      0      0     380.0         360
574       030-58776 2020-12-08 07:20:51      0      0     290.0         280
634       030-58838 2020-10-13 07:20:17      0      0     730.0         680
650       027-79278 2020-12-08 07:20:36      0      0     760.0         750
809       038-58854 2020-12-08 12:50:42      0      0     290.0         280
892       036-58748 2020-12-08 07:20:42      0      0     260.0         250
1016      038-58868 2020-12-08 12:50:42      0      0     300.0         290
1349      064-82800 2020-12-08 07:20:42      0      0     150.0         140
1424      045-58816 2020-12-08 12:50:45      0      0     410.0         380
1429      045-58817 2020-12-08 12:50:45      0      0     240.0         220

The following samples have LAl != RAl - ILAl:
     vannmiljo_code         sample_date depth1 depth2  RAl_µg/l  ILAl_µg/l  \
633       030-58838 2020-10-06 07:20:52      0      0      68.0        5.0   
646       027-79278 2020-09-03 07:10:51      0      0       5.2        5.0   
889       036-58748 2020-10-06 11:45:47      0      0       6.6        5.0   
920       036-58752 2020-10-06 11:45:47      0      0       6.6        5.0   
1221      045-58809 2020-09-08 07:15:11      0      0       5.9        5.0   
1326      064-28998 2020-09-09 07:00:55      0      0       8.8        5.0   
1400      045-58814 2020-10-06 07:20:23      0      0      10.0        5.0   
1412      045-58815 2020-10-06 07:20:23      0      0      11.0        5.0   
1446      062-58819 2020-09-09 07:00:55      0      0       5.5        5.0   
1452      062-58825 2020-11-27 07:15:33      0      0       5.2        5.0   
1467      062-58822 2020-10-07 07:20:55      0      0       6.4        5.0   

      LAl_µg/l  LAl_Calc_µg/l  
633        0.0           63.0  
646        0.0            0.2  
889        0.0            1.6  
920        0.0            1.6  
1221       0.0            0.9  
1326       0.0            3.8  
1400       0.0            5.0  
1412       0.0            6.0  
1446       0.0            0.5  
1452       0.0            0.2  
1467       0.0            1.4  

There are 499 duplicated records (same station_code-date-depth-parameter, but different value).
These will be dropped.

Out[10]:

	vannmiljo_code	sample_date	lab	par_unit	flag	value	period
0	025-45779	2020-09-10 00:00:00	Eurofins	TEMP_°C	nan	14.0	new
1	019-44498	2020-09-04 07:00:20	Eurofins	TEMP_°C	nan	17.3	new
2	019-44498	2020-09-10 07:20:15	Eurofins	TEMP_°C	nan	16.0	new
3	019-44498	2020-09-15 07:00:19	Eurofins	TEMP_°C	nan	15.0	new
4	019-44498	2020-10-07 07:20:07	Eurofins	TEMP_°C	nan	13.0	new

7 . Combine¶

Combine the historic and new datasets into a single dataframe in "long" format.

In [11]:

# Combine
df = pd.concat([his_df, new_df], axis="rows")

# Separate par and unit
df[["parameter", "unit"]] = df["par_unit"].str.split("_", n=1, expand=True)
del df["par_unit"]

df.reset_index(drop=True, inplace=True)

df.head()

Out[11]:

	vannmiljo_code	sample_date	lab	flag	value	period	parameter	unit
0	027-28435	2012-01-02	NIVA (historic)	=	0.06	historic	ALK	mmol/l
1	027-28435	2012-01-02	NIVA (historic)	=	1.23	historic	CA	mg/l
2	027-28435	2012-01-02	NIVA (historic)	=	18.00	historic	ILAL	µg/l Al
3	027-28435	2012-01-02	NIVA (historic)	=	3.80	historic	KOND	mS/m
4	027-28435	2012-01-02	NIVA (historic)	=	6.24	historic	PH	<ubenevnt>

In [12]:

# Apply correction to historic SIO2
df["value"] = np.where(
    (df["lab"] == "VestfoldLAB (historic)") & (df["parameter"] == "SIO2"),
    df["value"] * 467.5432,
    df["value"],
)

# Reclassify (nitrate + nitrite) to nitrate
df["parameter"].replace({"N-SNOX": "N-NO3"}, inplace=True)

8. Check data ranges¶

A simple method for preliminary quality control is to check whether parameter values are within sensible ranges (as defined in the parameters_units table; see Section 4 above). I believe this screening should be implemented differently for the historic (i.e. Vannmiljø) and new datsets, as follows:

For the historic data in Vannmiljø, values outside the plausible ranges should be removed from the dataset entirely. This is because we intend to use the Vannmiljø data as a reference against which new values will be compared, so it is important the dataset does not contain anything too strange. Ideally, the reference dataset should be carefully manually curated to ensure it is as good as possible, but I'm not sure we have the resouces in this project to thoroughly quality assess the data already in Vannmiljø. Dealing with any obvious issues is a good start, though
For the new data, values outside the plausible ranges should be highlighted and checked with the reporting lab

Note: At present, my code will remove any concentration values of exactly zero from the historic dataset. Check with Øyvind whether this is too strict.

In [13]:

# Check ranges
df = utils.check_data_ranges(df)

Checking data ranges for the 'historic' period.
    KOND: Maximum value of 309.00 is greater than or equal to upper limit (100.00).
    LAL: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    SO4: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    CA: Minimum value of 0.00 is less than or equal to lower limit (0.00).
    SIO2: Maximum value of 51897.30 is greater than or equal to upper limit (7000.00).

Checking data ranges for the 'new' period.
    TEMP: Maximum value of 30.00 is greater than or equal to upper limit (30.00).
    N-NO3: Maximum value of 2700.00 is greater than or equal to upper limit (2000.00).
    LAL: Minimum value of -1.00 is less than or equal to lower limit (0.00).
    SO4: Maximum value of 58.00 is greater than or equal to upper limit (20.00).
    ANC: Minimum value of -1000.00 is less than or equal to lower limit (-1000.00).

Dropping problem rows from historic data.
    Dropping rows for KOND.
    Dropping rows for LAL.
    Dropping rows for SO4.
    Dropping rows for CA.
    Dropping rows for SIO2.

In [14]:

# Add to database
df.to_sql(
    name="water_chemistry",
    con=eng,
    if_exists="append",
    index=False,
    method="multi",
    chunksize=1000,
)

---------------------------------------------------------------------------
IntegrityError                            Traceback (most recent call last)
Input In [14], in <cell line: 2>()
      1 # Add to database
----> 2 df.to_sql(
      3     name="water_chemistry",
      4     con=eng,
      5     if_exists="append",
      6     index=False,
      7     method="multi",
      8     chunksize=1000,
      9 )

File /opt/conda/lib/python3.9/site-packages/pandas/core/generic.py:2963, in NDFrame.to_sql(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
   2806 """
   2807 Write records stored in a DataFrame to a SQL database.
   2808 
   (...)
   2959 [(1,), (None,), (2,)]
   2960 """  # noqa:E501
   2961 from pandas.io import sql
-> 2963 return sql.to_sql(
   2964     self,
   2965     name,
   2966     con,
   2967     schema=schema,
   2968     if_exists=if_exists,
   2969     index=index,
   2970     index_label=index_label,
   2971     chunksize=chunksize,
   2972     dtype=dtype,
   2973     method=method,
   2974 )

File /opt/conda/lib/python3.9/site-packages/pandas/io/sql.py:697, in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method, engine, **engine_kwargs)
    692 elif not isinstance(frame, DataFrame):
    693     raise NotImplementedError(
    694         "'frame' argument should be either a Series or a DataFrame"
    695     )
--> 697 return pandas_sql.to_sql(
    698     frame,
    699     name,
    700     if_exists=if_exists,
    701     index=index,
    702     index_label=index_label,
    703     schema=schema,
    704     chunksize=chunksize,
    705     dtype=dtype,
    706     method=method,
    707     engine=engine,
    708     **engine_kwargs,
    709 )

File /opt/conda/lib/python3.9/site-packages/pandas/io/sql.py:2190, in SQLiteDatabase.to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method, **kwargs)
   2180 table = SQLiteTable(
   2181     name,
   2182     self,
   (...)
   2187     dtype=dtype,
   2188 )
   2189 table.create()
-> 2190 return table.insert(chunksize, method)

File /opt/conda/lib/python3.9/site-packages/pandas/io/sql.py:950, in SQLTable.insert(self, chunksize, method)
    947     break
    949 chunk_iter = zip(*(arr[start_i:end_i] for arr in data_list))
--> 950 num_inserted = exec_insert(conn, keys, chunk_iter)
    951 if num_inserted is None:
    952     total_inserted = None

File /opt/conda/lib/python3.9/site-packages/pandas/io/sql.py:1902, in SQLiteTable._execute_insert_multi(self, conn, keys, data_iter)
   1900 data_list = list(data_iter)
   1901 flattened_data = [x for row in data_list for x in row]
-> 1902 conn.execute(self.insert_statement(num_rows=len(data_list)), flattened_data)
   1903 return conn.rowcount

IntegrityError: FOREIGN KEY constraint failed

In [ ]:

eng.close()

9. Version 1 summary¶

Points to note from the initial data exploration:

9.1. Data formatting and missing data¶

The data submitted matches the VestfoldLAB submission template, with the exception that SiO2 is reported by Eurofins in ug/l, not mg/l. This is fine (I have added additional columns to ../../data/parameter_unit_mapping.xlsx in order to correctly convert the units reported by Eurofins)
There are 61 records in the Eurofins dataset without a valid Lokalitets ID. These records are associated with 6 stations. 5 of these were newly added in 2020 and the correct IDs were only supplied by Kjetil in October 2020, so perhaps it makes sense that these should be missing. However, one is an old station (Barstadvassdraget v Liland (1); 026-30849) so it's strange that the code for this was missing. I have added the missing codes based on data in ../../data/active_stations_2020.xlsx, but Eurofins will need to do this too
Some cells are marked ., others -, others N/A and others left blank. Is there any significance to these characters? For now, I have assumed they are all equivalent ways of indicating "no data" and have deleted these entries accordingly. Check with Eurofins
The Eurofins data includes < characters, whereas VestfoldLAB removed these and replaced the original value by half the LOQ. This is fine; I will keep the < characters, and simply use the LOQ value itself (rather than half the value) in subsequent analyses
The Excel file from Eurofins mixes . and , as the decimal separator. I have replaced all occurrences of , as a decimal separator with .. Eurofins will probably need to do the same before submitting to Vannmiljø
Inconsistent flag combinations for Al fractions. LAl is simply calculated as (RAl - ILAl), which leads to negative values being reported in some cases (is this OK?). However, for one sample, RAl is reported as 5.3 ug/l and ILAl as < 5 ug/l, but LAl is reported as < 0,3 ug/l, which doesn't seem correct. Valid values for LAl in this case lie in the range $0.3 < LAl <= 5.3$ ug/l (i.e. LAl is greater than 0.3 ug/l). For now, I have converted this value to 0.3 ug/l, but we could choose the middle of the valid range instead

9.2. Duplicates¶

There are 499 duplicated records in the Eurofins data submission, where same station_code-date-depth-parameter has been reported more than once (see here for a list). It looks as though these are genuine duplicates (i.e. these samples have been analysed more than once), because they have different Labreferanse codes in the spreadsheet. Are these reanalysis values (in which cases only the best/most recent value should probably be included), or do they have another significance? Check with Eurofins

9.3. Stations¶

The following location IDs have inconsistent names within this template:

 025-45779 ['Litlåna oppstrøms doserer (5)']   ==>   ['Logen nedstrøms doserer ' 'Litlåna oppstrøms doserer ']
 025-81555 ['Stakkeland bru (Rv. 42) (9)']   ==>   ['Stakkeland bru (Rv. 42)' 'Stallemobekken v/Underåsen']
 022-54615 ['Logna v/Kyrkjebygda']   ==>   ['Logna v/Kyrkjebygda' 'Logna v/Kyrkjebygda ']
 022-90731 ['Entredalsbekken']   ==>   ['Entredalsbekken' 'Entredalsbekken ']
 022-97011 ['Stemkjerrbekken ved bru']   ==>   ['Stemkjerrbekken ved bru' 'Stemkjerrbekken ved bru ']
 019-58794 ['Fyresvatn 4 dyp (5)']   ==>   ['Fyresvatn ' 'Fyresvatn']
 021-101031 ['Otra oppstrøms doserer Iveland (Dalanekilen inntak)']   ==>   ['Otra nedstrøms doserer Iveland' 'Otra oppstrøms doserer Iveland']
 036-58752 ['Mosåna oppstrøms doserer (23.1)']   ==>   ['Monebekken oppstrøms doserer' 'Monebekken oppstrøms doserer ' 'Mosåna oppstrøms doserer '] 
 027-58846 ['Eikeland nedstrøms doserer (51)']   ==>   ['Eikeland oppstrøms doserer ' 'Eikeland nedstrøms doserer ']
 062-58820 ['Raundalselva (1)']   ==>   ['Evanger kraftstasjon nedstrøms' 'Raundalselva ']
 062-58821 ['Strandaelva (2)']   ==>   ['Evanger kraftstasjon nedstrøms' 'Strandaelva ']
 062-58822 ['Teigdalselva (6)']   ==>   ['Evanger kraftstasjon nedstrøms' 'Teigdalselva ']
 062-58823 ['Vossedalselva (18)']   ==>   ['Evanger kraftstasjon nedstrøms' 'Vossedalselva ']

The following location names have multiple IDs within this template:

 Logen nedstrøms doserer  []   ==>   ['025-45779' '067-58826']
 Stallemobekken v/Underåsen ['022-32019']   ==>   ['025-81555' '022-32019']
 Eikeland oppstrøms doserer  []   ==>   ['027-58846' '027-58847']
 Samnanger []   ==>   ['055-58811' '055-58812' '055-58813']
 Evanger kraftstasjon nedstrøms []   ==>   ['062-58819' '062-58820' '062-58821' '062-58822' '062-58823' '062-58824' '062-58825' '062-58826' '062-58827' '062-58828']

9.4. Range checks¶

Note: Some of the points identified below are explored more thoroughly in the time series analysis.

Calculating LAl as (RAl - ILAl) gives negative values in some cases. Should these be reported "as is", or should these values be set to zero? Check with Kjetil and Eurofins
Values for nitrate > 2000 ug/l seem very high?
An SO4 concentration of 58 mg/l seems very high?
ANC of -1000 uekv/l seems implausibly low?

The following samples have nitrate greater than total nitrogen:

     vannmiljo_code         sample_date depth1 depth2  NO3_µg/l  Tot-N_µg/l
513       027-38543 2020-12-08 12:50:27      0      0     380.0         360
574       030-58776 2020-12-08 07:20:51      0      0     290.0         280
634       030-58838 2020-10-13 07:20:17      0      0     730.0         680
650       027-79278 2020-12-08 07:20:36      0      0     760.0         750
809       038-58854 2020-12-08 12:50:42      0      0     290.0         280
892       036-58748 2020-12-08 07:20:42      0      0     260.0         250
1016      038-58868 2020-12-08 12:50:42      0      0     300.0         290
1349      064-82800 2020-12-08 07:20:42      0      0     150.0         140
1424      045-58816 2020-12-08 12:50:45      0      0     410.0         380
1429      045-58817 2020-12-08 12:50:45      0      0     240.0         220