#!/usr/bin/env python
# coding: utf-8

# # Sorting pre-requisits for ibicus: downloading and preprocessing data

# This notebook shows how to download and preprocess climate model data for bias correction and further use. To apply a bias adjustment method, three datasets are needed: 1) observation or reanalysis data; 2) historical climate model data over the same reference period that observations are available for; and 3) climate model data for a future, or more generally, application, period that is to be bias corrected. 
# 
# Here we will download and preprocess CMIP6 data from the Climate Data Store (CDS) as climate model input and two reanalysis datasets: 1) ERA5 from the CDS and 2) NCEP/DOE Reanalysis II from the PSL datastore (NOAA).
# 
# There are many ways to access climate data on different temporal or spatial resolutions. This notebook is meant to illustrate one possible way to download data at daily resolution which is currently the primary temporal resolution supported in ibicus, although some can be applied at monthly resolution. 

# In[1]:


import os
import urllib

# Scientific computing
import iris
import xarray
import numpy as np


# ## 1. Download data

# Let's create a data-directory where our inputs will go, if it does not yet exist:

# In[2]:


DATADIR = "data_download_and_preprocessing"

if not os.path.exists(DATADIR):
    os.mkdir(DATADIR)


# ### 1.1. Download climate model data

# To request climate data from the Climate Data Store (CDS) we will use the CDS API. Run the following cell if you have not yet istalled it:

# In[3]:


#!pip install cdsapi
import cdsapi

# We disable urllib3 (used by cdsapi) warning
import urllib3
urllib3.disable_warnings()


# We make use of the option to manually set the CDS API credentials. First, you have to set two variables: URL and KEY which build together your CDS API key. The string of characters that make up your KEY include your personal User ID and CDS API key. To obtain these, first register or login to the CDS (http://cds.climate.copernicus.eu), then visit https://cds.climate.copernicus.eu/api-how-to and copy the string of characters listed after "key:". Replace the ######### below with your key.

# In[4]:


URL = 'https://cds.climate.copernicus.eu/api/v2'
KEY = '########################################' # enter your key instead


# Let's choose a model and variable of interest, and fix some meta-paramters. If we are interested in multiple variable we can just iterate the code below:

# In[5]:


# choose model
model = 'mpi_esm1_2_lr'

# choose variables to extract (not all variables available at daily resolution for all cmip6 models at the moment)
variable = 'precipitation'

# choose area to extract
area = [80, 3, 20, 30]

# choose a historical period to extract
period_hist = '1979-01-01/2005-12-31' 

# choose a future period to extract:
period_fut = '2050-01-01/2070-12-31'

# choose a filename for the historical cm data
fname_cm_hist = f"cmip6_daily_1979-2015_ipsl_historical_{variable}.zip"

# choose a filename for the future cm data
fname_cm_future = f"cmip6_daily_2050-2070_ipsl_ssp5_8_5_{variable}.zip"


# Both datasets will be in `DATADIR` under `fname_cm_hist` and `fname_cm_future`.

# #### 1.1.1. Download historical climate model data

# Executing the following cell will retrieve historical climate model data:

# In[6]:


# download historical climate model data

c = cdsapi.Client(url=URL, key=KEY)

c.retrieve(
    'projections-cmip6',
    {
        'temporal_resolution': 'daily',
        'experiment': 'historical',
        'level': 'single_levels',
        'variable': variable,
        'model': model,
        'date': period_hist,
        'area': area,
        'format': 'zip',
    },
    f'{DATADIR}/{fname_cm_hist}')


# After unzipping the folder...

# In[7]:


import zipfile

with zipfile.ZipFile(f'{DATADIR}/{fname_cm_hist}', 'r') as zip_ref:
    zip_ref.extractall(DATADIR)


# ...the file is now in `DATADIR/pr_day_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_19790101-20141231_*.nc`

# #### 1.1.2. Download future climate model data

# Now we go through the same steps to download climate data in the future or application period:

# In[8]:


# download future climate model data

c = cdsapi.Client(url=URL, key=KEY)

c.retrieve(
    'projections-cmip6',
    {
        'temporal_resolution': 'daily',
        'experiment': 'ssp5_8_5',
        'level': 'single_levels',
        'variable': variable,
        'model': model,
        'date': period_fut,
        'area': area,
        'format': 'zip',
    },
    f'{DATADIR}/{fname_cm_future}')


# Again, we need to unzip the folder:

# In[9]:


import zipfile

with zipfile.ZipFile(f'{DATADIR}/{fname_cm_future}', 'r') as zip_ref:
    zip_ref.extractall(DATADIR)


# The file is now in `DATADIR/pr_day_MPI-ESM1-2-LR_ssp585_r1i1p1f1_gn_20500101-20701231_*.nc`

# ### 1.2. Download observations

# Now we will download observations. We will first download ERA5 data from the CDS and afterwards the NCEP/DOE II Reanalysis from the PSL.

# #### 1.2.1. Download ERA5

# We will download ERA5 on daily temporal resolution (if the climate model were on other temporal resolutions we would also need a different one for ERA5). The script is inspired by [this discussion](https://confluence.ecmwf.int/pages/viewpage.action?pageId=228867588) and uses the [
# Daily statistics calculated from ERA5 data
# ](https://cds.climate.copernicus.eu/cdsapp#!/software/app-c3s-daily-era5-statistics?tab=overview) application. The output of this application is a separate netCDF file for chosen daily statistic for each month for each year. We concatenate those files then manually. First we need to make some selections (make sure the data chosen here is consistent with the cm data pulled above):

# In[10]:


# choose years to request (this should overlap with the `period_hist` chosen for the cm data)
# this is chosen shorter for demonstration purposes
years =  list(map(str, range(1979, 1981)))

# choose months to request
months = list(map(str, range(10, 12)))

# choose a variable (must be a valid ERA5 CDS API name)
variable = "total_precipitation"

# choose a required statistic (valid names given in the application description above)
statistic = "daily_mean"

# choose an area (should be the same as above)
area = {"lat": [30, 80], "lon": [3, 20]}

# choose a filename
fname_era5 = f"era5_{variable}_{statistic}_1979_1981.nc"


# And now we can request the data:

# In[11]:


c = cdsapi.Client(url=URL, key=KEY, timeout=300)

# Loop over years and months
filenames_for_cleanup= []
for yr in years:
    print(f"----- Requesting year: {yr} -----")
    for mn in months:
        result = c.service(
            "tool.toolbox.orchestrator.workflow",
            params={
                 "realm": "user-apps",
                 "project": "app-c3s-daily-era5-statistics",
                 "version": "master",
                 "kwargs": {
                     "dataset": "reanalysis-era5-single-levels",
                     "product_type": "reanalysis",
                     "variable": variable,
                     "statistic": statistic,
                     "year": yr,
                     "month": mn,
                     "time_zone": "UTC+00:0",
                     "frequency": "1-hourly",
                     "grid": "1.0/1.0",
                     "area": area,
                     },
            "workflow_name": "application"
        })  

 
        filename = f"{DATADIR}/era5_{variable}_{statistic}_{yr}_{mn}.nc"
        url = result[0]['location']

        # Download nc file
        urllib.request.urlretrieve(url, filename)
        # Append filename to list of filenames to cleanup
        filenames_for_cleanup.append(filename)

# Combine monthly data
combined_data = xarray.open_mfdataset(f"{DATADIR}/era5_{variable}_{statistic}_*.nc", combine = 'nested', concat_dim="time")
combined_data.to_netcdf(f"{DATADIR}/{fname_era5}")

# Cleanup
for filename in filenames_for_cleanup:
    os.remove(filename)


# #### 1.2.2. Download NCEP/DOE II

# We now download the [NCEP/DOE II data](https://psl.noaa.gov/data/gridded/data.ncep.reanalysis2.html). [Here is an overview](https://psl.noaa.gov/data/gridded/data.ncep.reanalysis2.html) of the possible variables and we take the data from [the datastore here](https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis2/Dailies/gaussian_grid/). 

# In[12]:


# Variable name. Needs to be one of the NCEP-names in https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis2/Dailies/gaussian_grid/. 
variable = "prate.sfc.gauss"

# choose years to request (this should overlap with the `period_hist` chosen for the cm data)*
years = map(str, range(1979, 2005))

# choose an area (should be the same as above)
area = {"lat": [30, 80], "lon": [3, 20]}

# choose a filename
fname_ncep_doe = f"ncep_doe_{variable}_1979_2005.nc"


# Now we can download the data:

# In[13]:


# Download data year by year
filenames_for_cleanup = []
for year in years:
    url = f"https://downloads.psl.noaa.gov/Datasets/ncep.reanalysis2/Dailies/gaussian_grid/{variable}.{year}.nc"
    filename = f"{DATADIR}/{variable}_{year}.nc"
    # Download nc file
    urllib.request.urlretrieve(url, filename)
    # Append filename to list of filenames to cleanup
    filenames_for_cleanup.append(filename)

# Combine data for variable
combined_data = xarray.open_mfdataset(f"{DATADIR}/{variable}_*.nc", combine = 'nested', concat_dim="time")
# Select area
combined_data = combined_data.sel(lon=slice(min(area["lon"]), max(area["lon"])),lat=slice(max(area["lat"]), min(area["lat"])))
# Write to file                                                                                        
combined_data.to_netcdf(f"{DATADIR}/{fname_ncep_doe}")
# Cleanup
for filename in filenames_for_cleanup:
    os.remove(filename)


# It is also possible (and probably easier) to download the data via ftp through the same links, or via the visual interface accessible via the graph-icon next to a variable in the [NCEP/DOE 2 overview page](https://psl.noaa.gov/data/gridded/data.ncep.reanalysis2.html). The latter also provides an option to select a range of dates and access merged data for that range that can directly be used for the further preprocessing steps.

# ## 2. Convert and regrid data

# Now that we have downloaded the data we need to make sure that observations and climate model data are:
# 
# - on the same temporal resolution: this is covered because we downloaded the data on daily resolution.
# - on the same spatial resolution: we need to regrid the data.
# - in the same units: we may need to convert units.
# 
# Furthermore we might want to extract additional information and need to get the numpy arrays corresponding to the data. In the numpy arrays we need to make sure that they have the form `[t,x,y]`.
# 

# ### 2.1. Regrid data 

# Now that we have data on the same temporal resolution for both the climate model and observations we need to make sure they are also on the same spatial one and regrid the datasets. The climate model data is on a coarser grid, therefore we will regrid the observational data onto this resolution. However there are also other solutions, where the [climate model data is regridded onto the resolution of the observations](https://esd.copernicus.org/articles/9/313/2018/).
# 
# We will use iris for the regridding, however there are also xarray solutions. Different variables might require different regridding schemes: [a list of ones available in iris is given here](https://scitools-iris.readthedocs.io/en/latest/userguide/interpolation_and_regridding.html?highlight=interpolate#regridding). For precipitation we choose a regridder based on Nearest values. Other regridders like linear ones *might* introduce negative values.

# In[14]:


cm_hist = iris.load_cube(f"{DATADIR}/pr_day_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_19790101-20051231_v20190710.nc", "precipitation_flux")
cm_future = iris.load_cube(f"{DATADIR}/pr_day_MPI-ESM1-2-LR_ssp585_r1i1p1f1_gn_20500101-20701231_v20190710.nc", "precipitation_flux")


# First let's take care of the ERA5 reanalysis:

# In[15]:


obs_era5 = iris.load_cube(f"{DATADIR}/{fname_era5}")
obs_era5 = obs_era5.regrid(cm_hist, iris.analysis.Nearest())


# And now of the NCEP/DOE II data:

# In[16]:


obs_ncep_doe = iris.load_cube(f"{DATADIR}/{fname_ncep_doe}")
obs_ncep_doe = obs_ncep_doe.regrid(cm_hist, iris.analysis.Nearest())


# ### 2.1. Extract additional information

# The data objects are now all at the same temporal and spatial resolution. Because some debiasers need the dates as input, it is useful to extract them here:

# In[17]:


def get_dates(x):
    time_dimension = x.coords()[0]
    dates = time_dimension.units.num2date(time_dimension.points)
    return dates

get_dates = np.vectorize(get_dates)

dates_cm_hist = get_dates(cm_hist)
dates_cm_future = get_dates(cm_future)

dates_obs_era5 = get_dates(obs_era5)
dates_obs_ncep_doe = get_dates(obs_ncep_doe)


# ### 2.3. Get numpy arrays

# In order to start working with ibicus, we need to get the numpy arrays associated with the data from the iris cubes:

# In[18]:


cm_hist = cm_hist.data
cm_future = cm_future.data

obs_era5 = obs_era5.data
obs_ncep_doe = obs_ncep_doe.data


# We look at the shapes to make sure they are all in the form `[t, x, y]`:

# In[19]:


print(f"Shape cm_hist: {cm_hist.shape}")
print(f"Shape cm_future: {cm_future.shape}")

print(f"Shape obs_era5: {obs_era5.shape}")
print(f"Shape obs_ncep_doe: {obs_ncep_doe.shape}")


# ### 2.4. Convert units

# From the [ERA5 documentation](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation) we can see that the precipitation is measured in m, whilst in `cm_hist` and `cm_future` it is measured as flux (m / s^-1). To convert we need to divide the ERA5-values by 86 400 (the number of seconds per day):

# In[20]:


obs_era5 = obs_era5/86400


# The values in the NCEP/DOE II reanalysis are in the same units.

# ## 3. Apply debiaser

# After these preparations we can finally apply a bias adjustment method. For a detailed introduction into the actual application of bias correction using ibicus, we refer you to the other notebooks.
# 
# For illustrative purposes we give one example here using a simple quantile mapping methodology that we apply to both ERA5 and NCEP/DOE II data.

# In[21]:


from ibicus.debias import QuantileMapping

debiaser = QuantileMapping.from_variable("pr")

debiased_cm_future_era5 = debiaser.apply(obs_era5, cm_hist, cm_future)
debiased_cm_future_ncep_doe = debiaser.apply(obs_ncep_doe, cm_hist, cm_future)