import xarray as xr
niter = 100
In this case, all my ensemble members are named "hydro_ensemble_LHC_X" where X is the member number
path = "/glade/scratch/kdagon/archive/"
PPE = "hydro_ensemble_LHC_"
var = ['FPSN', 'EFLX_LH_TOT']
Here is where we use Unix wildcards (e.g., *) to get all the files in each directory
Note that not all wildcard functionality works the same way as it does in the command line
#full_paths = [path+PPE+str(i+1)+"/lnd/hist/*{001[6-9],20-}*" for i in range(niter)] # specific to years 16-20; NOTE: this wildcard doesn't work with open_mfdataset
full_paths = [path+PPE+str(i+1)+"/lnd/hist/*" for i in range(niter)] # all history files in the folder
full_paths[:10] # look at the first 10 paths
['/glade/scratch/kdagon/archive/hydro_ensemble_LHC_1/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_2/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_3/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_4/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_5/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_6/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_7/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_8/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_9/lnd/hist/*', '/glade/scratch/kdagon/archive/hydro_ensemble_LHC_10/lnd/hist/*']
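Since shell-style brace wildcards (like the years 16-20 pattern above) don't work with open_mfdataset, one workaround is to expand a plain wildcard with glob and filter the file list in Python before opening. A minimal sketch, assuming the usual CLM monthly history naming convention `<case>.clm2.h0.YYYY-MM.nc` (the helper name is ours, not part of the original):

```python
import glob
import re

def filter_years(files, y0, y1):
    """Keep only monthly h0 history files whose 4-digit year is in [y0, y1]."""
    keep = []
    for f in files:
        # Assumes filenames end in .h0.YYYY-MM.nc (standard CLM history naming)
        m = re.search(r'\.h0\.(\d{4})-\d{2}\.nc$', f)
        if m and y0 <= int(m.group(1)) <= y1:
            keep.append(f)
    return keep

# Expand the wildcard ourselves, then filter to years 16-20, e.g.:
# files = filter_years(sorted(glob.glob(path + PPE + "1/lnd/hist/*.nc")), 16, 20)
```

The filtered list can then be passed directly to xr.open_mfdataset, which accepts an explicit list of paths as well as a glob string.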
def preprocess(ds):
return ds[var]
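open_mfdataset applies preprocess to each file before combining, so trimming each dataset down to just the variables of interest keeps the merge cheap. A small self-contained illustration with an in-memory stand-in for one history file (the extra variable name here is made up for the example; definitions are repeated so the snippet runs on its own):

```python
import numpy as np
import xarray as xr

var = ['FPSN', 'EFLX_LH_TOT']

def preprocess(ds):
    return ds[var]

# Stand-in for one history file: the two variables we want plus an
# extra one (hypothetical name) that preprocess should drop.
ds = xr.Dataset({
    "FPSN": ("time", np.zeros(3)),
    "EFLX_LH_TOT": ("time", np.zeros(3)),
    "TSOI": ("time", np.zeros(3)),
})

trimmed = preprocess(ds)
```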
%%time
da_model = xr.open_mfdataset(full_paths[0], combine='by_coords', preprocess=preprocess)
CPU times: user 9.21 s, sys: 51.6 ms, total: 9.26 s Wall time: 11.5 s
da_model
<xarray.Dataset> Dimensions: (lat: 46, lon: 72, time: 60) Coordinates: * lon (lon) float32 0.0 5.0 10.0 15.0 ... 340.0 345.0 350.0 355.0 * lat (lat) float32 -90.0 -86.0 -82.0 -78.0 ... 78.0 82.0 86.0 90.0 * time (time) object 0016-02-01 00:00:00 ... 0021-01-01 00:00:00 Data variables: FPSN (time, lat, lon) float32 dask.array<shape=(60, 46, 72), chunksize=(1, 46, 72)> EFLX_LH_TOT (time, lat, lon) float32 dask.array<shape=(60, 46, 72), chunksize=(1, 46, 72)> Attributes: title: CLM History file information comment: NOTE: None of the variables ar... Conventions: CF-1.0 history: created on 05/28/18 20:36:54 source: Community Land Model CLM4.0 hostname: cheyenne username: kdagon version: unknown revision_id: $Id: histFileMod.F90 42903 201... case_title: UNSET case_id: hydro_ensemble_LHC_1 Surface_dataset: surfdata_4x5_16pfts_Irrig_CMIP... Initial_conditions_dataset: finidat_interp_dest.nc PFT_physiological_constants_dataset: hydro_ensemble_LHC_1.nc ltype_vegetated_or_bare_soil: 1 ltype_crop: 2 ltype_UNUSED: 3 ltype_landice_multiple_elevation_classes: 4 ltype_deep_lake: 5 ltype_wetland: 6 ltype_urban_tbd: 7 ltype_urban_hd: 8 ltype_urban_md: 9 ctype_vegetated_or_bare_soil: 1 ctype_crop: 2 ctype_crop_noncompete: 2*100+m, m=cft_lb,cft_ub ctype_landice: 3 ctype_landice_multiple_elevation_classes: 4*100+m, m=1,glcnec ctype_deep_lake: 5 ctype_wetland: 6 ctype_urban_roof: 71 ctype_urban_sunwall: 72 ctype_urban_shadewall: 73 ctype_urban_impervious_road: 74 ctype_urban_pervious_road: 75 cft_c3_crop: 1 cft_c3_irrigated: 2 time_period_freq: month_1 Time_constant_3Dvars_filename: ./hydro_ensemble_LHC_1.clm2.h0... Time_constant_3Dvars: ZSOI:DZSOI:WATSAT:SUCSAT:BSW:H...
Start with the first 10 ensemble members to test functionality
%%time
da_model = [xr.open_mfdataset(p, combine='by_coords', preprocess=preprocess) for p in full_paths[:10]]
CPU times: user 1min 33s, sys: 1.17 s, total: 1min 34s Wall time: 1min 55s
Note: this takes ~2 min to read 10 ensemble members, each with 5 years of monthly history files (4x5 resolution)
Again, start with the first 10 members
ensdim = xr.DataArray(list(range(1,11)), dims='ens', name='ens') # or can use np.arange
ensdim
<xarray.DataArray 'ens' (ens: 10)> array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) Dimensions without coordinates: ens
da_model_concat = xr.concat(da_model, dim=ensdim)
da_model_concat
<xarray.Dataset> Dimensions: (ens: 10, lat: 46, lon: 72, time: 60) Coordinates: * lon (lon) float32 0.0 5.0 10.0 15.0 ... 340.0 345.0 350.0 355.0 * lat (lat) float32 -90.0 -86.0 -82.0 -78.0 ... 78.0 82.0 86.0 90.0 * time (time) object 0016-02-01 00:00:00 ... 0021-01-01 00:00:00 * ens (ens) int64 1 2 3 4 5 6 7 8 9 10 Data variables: FPSN (ens, time, lat, lon) float32 dask.array<shape=(10, 60, 46, 72), chunksize=(1, 1, 46, 72)> EFLX_LH_TOT (ens, time, lat, lon) float32 dask.array<shape=(10, 60, 46, 72), chunksize=(1, 1, 46, 72)> Attributes: title: CLM History file information comment: NOTE: None of the variables ar... Conventions: CF-1.0 history: created on 05/28/18 20:36:54 source: Community Land Model CLM4.0 hostname: cheyenne username: kdagon version: unknown revision_id: $Id: histFileMod.F90 42903 201... case_title: UNSET case_id: hydro_ensemble_LHC_1 Surface_dataset: surfdata_4x5_16pfts_Irrig_CMIP... Initial_conditions_dataset: finidat_interp_dest.nc PFT_physiological_constants_dataset: hydro_ensemble_LHC_1.nc ltype_vegetated_or_bare_soil: 1 ltype_crop: 2 ltype_UNUSED: 3 ltype_landice_multiple_elevation_classes: 4 ltype_deep_lake: 5 ltype_wetland: 6 ltype_urban_tbd: 7 ltype_urban_hd: 8 ltype_urban_md: 9 ctype_vegetated_or_bare_soil: 1 ctype_crop: 2 ctype_crop_noncompete: 2*100+m, m=cft_lb,cft_ub ctype_landice: 3 ctype_landice_multiple_elevation_classes: 4*100+m, m=1,glcnec ctype_deep_lake: 5 ctype_wetland: 6 ctype_urban_roof: 71 ctype_urban_sunwall: 72 ctype_urban_shadewall: 73 ctype_urban_impervious_road: 74 ctype_urban_pervious_road: 75 cft_c3_crop: 1 cft_c3_irrigated: 2 time_period_freq: month_1 Time_constant_3Dvars_filename: ./hydro_ensemble_LHC_1.clm2.h0... Time_constant_3Dvars: ZSOI:DZSOI:WATSAT:SUCSAT:BSW:H...
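With the members stacked along the new 'ens' dimension, ensemble statistics become one-line reductions. A sketch of the same concat-then-reduce pattern using tiny synthetic datasets in place of the real history files (sizes and values are illustrative only):

```python
import numpy as np
import xarray as xr

# Three toy "ensemble members": member i is filled with the constant i.
members = [
    xr.Dataset({"FPSN": (("time", "lat", "lon"),
                         np.full((2, 3, 4), float(i)))})
    for i in (1, 2, 3)
]

# Same pattern as above: concatenate along a new 'ens' coordinate...
ensdim = xr.DataArray([1, 2, 3], dims='ens', name='ens')
ds = xr.concat(members, dim=ensdim)

# ...then reduce across members, e.g. the ensemble mean and spread.
ens_mean = ds["FPSN"].mean(dim="ens")
ens_std = ds["FPSN"].std(dim="ens")
```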
file_path = "/glade/scratch/nanr/forKatie/daily/"
This is 1/4 degree atmosphere daily output, RCP8.5 from 2070-2100
file = "b.e13.BRCP85C5CN.ne120_g16.001.cam.h1.PRECT.20700101-21001231.FV.nc"
You need to know your approximate dimension sizes to select sensible chunk sizes
For example, here I am splitting lat and lon into 2 chunks each (total lat = 768, total lon = 1152)
And I am chunking time (the largest dimension) into chunks of size 100
ds = xr.open_dataset(file_path+file, chunks={'time': 100, 'lat': 384, 'lon': 576})
More about how to choose chunk sizes:
https://docs.dask.org/en/latest/array-best-practices.html
https://examples.dask.org/xarray.html
PRECT = ds.PRECT
PRECT
<xarray.DataArray 'PRECT' (time: 11315, lat: 768, lon: 1152)> dask.array<shape=(11315, 768, 1152), dtype=float32, chunksize=(100, 384, 576)> Coordinates: * lat (lat) float64 -90.0 -89.77 -89.53 -89.3 ... 89.3 89.53 89.77 90.0 * lon (lon) float64 0.0 0.3125 0.625 0.9375 ... 358.8 359.1 359.4 359.7 * time (time) object 2070-01-02 00:00:00 ... 2101-01-01 00:00:00 Attributes: units: m/s long_name: Total (convective and large-scale) precipitation rate (li... cell_methods: time: mean cell_measures: area: area
PRECT.data
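Because the variable is dask-backed, operations on it only build a task graph; nothing is read or computed until you ask for it. A minimal sketch of that laziness using a small in-memory array chunked the same way open_dataset(..., chunks=...) would chunk data read from disk (sizes here are toy values, not the real PRECT dimensions; requires dask):

```python
import numpy as np
import xarray as xr

# Chunk an in-memory stand-in for PRECT into dask blocks.
prect = xr.DataArray(
    np.ones((10, 8, 12), dtype=np.float32),
    dims=("time", "lat", "lon"),
).chunk({"time": 4, "lat": 4, "lon": 6})

# This only builds a lazy dask graph; no computation happens yet.
clim = prect.mean(dim="time")

# .compute() (or .load()) triggers the actual chunked computation.
result = clim.compute()
```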