*Purpose: Your data has labels; you should use them*
Scientific data is inherently labeled. For example, time series data includes timestamps that label individual periods or points in time, spatial data has coordinates (e.g. longitude, latitude, elevation), and model or laboratory experiments are often identified by unique identifiers. The figure above provides an example of a labeled dataset. In this case the data is a map of global air temperature from a numeric weather model. The labels on this particular dataset are time (e.g. “2016-05-01”), longitude (x-axis), and latitude (y-axis).
10 minutes
Xarray Documentation on Indexing: http://xarray.pydata.org/en/latest/indexing.html
import xarray as xr
# load a sample dataset
ds = xr.tutorial.load_dataset('air_temperature')
ds
<xarray.Dataset> Dimensions: (lat: 25, lon: 53, time: 2920) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ... * lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ... * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ... Data variables: air (time, lat, lon) float32 241.2 242.5 243.5 244.0 244.1 243.89 ... Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
When working with numpy, indexing is done by position (slices/ranges/scalars).
t = ds['air'].data # numpy array
t
array([[[ 241.19999695, 242.5 , 243.5 , ..., 232.79998779, 235.5 , 238.59999084], [ 243.79998779, 244.5 , 244.69999695, ..., 232.79998779, 235.29998779, 239.29998779], [ 250. , 249.79998779, 248.88999939, ..., 233.19999695, 236.38999939, 241.69999695], ..., [ 296.6000061 , 296.19998169, 296.3999939 , ..., 295.3999939 , 295.1000061 , 294.69998169], [ 295.8999939 , 296.19998169, 296.79000854, ..., 295.8999939 , 295.8999939 , 295.19998169], [ 296.29000854, 296.79000854, 297.1000061 , ..., 296.8999939 , 296.79000854, 296.6000061 ]], [[ 242.09999084, 242.69999695, 243.09999084, ..., 232. , 233.59999084, 235.79998779], [ 243.59999084, 244.09999084, 244.19999695, ..., 231. , 232.5 , 235.69999695], [ 253.19999695, 252.88999939, 252.09999084, ..., 230.79998779, 233.38999939, 238.5 ], ..., [ 296.3999939 , 295.8999939 , 296.19998169, ..., 295.3999939 , 295.1000061 , 294.79000854], [ 296.19998169, 296.69998169, 296.79000854, ..., 295.6000061 , 295.5 , 295.1000061 ], [ 296.29000854, 297.19998169, 297.3999939 , ..., 296.3999939 , 296.3999939 , 296.6000061 ]], [[ 242.29998779, 242.19999695, 242.29998779, ..., 234.29998779, 236.09999084, 238.69999695], [ 244.59999084, 244.38999939, 244. , ..., 230.29998779, 232. , 235.69999695], [ 256.19998169, 255.5 , 254.19999695, ..., 231.19999695, 233.19999695, 238.19999695], ..., [ 295.6000061 , 295.3999939 , 295.3999939 , ..., 296.29000854, 295.29000854, 295. ], [ 296.19998169, 296.5 , 296.29000854, ..., 296.3999939 , 296. , 295.6000061 ], [ 296.3999939 , 296.29000854, 296.3999939 , ..., 297. , 297. , 296.79000854]], ..., [[ 243.48999023, 242.98999023, 242.08999634, ..., 244.18998718, 244.48999023, 244.88999939], [ 249.08999634, 248.98999023, 248.58999634, ..., 240.58999634, 241.28999329, 242.68998718], [ 262.69000244, 262.19000244, 261.69000244, ..., 239.38999939, 241.68998718, 245.18998718], ..., [ 294.79000854, 295.29000854, 297.48999023, ..., 295.48999023, 295.38998413, 294.69000244], [ 296.79000854, 297.88998413, 298.29000854, ..., 295.48999023, 295.48999023, 294.79000854], [ 298.19000244, 299.19000244, 298.79000854, ..., 296.08999634, 295.79000854, 295.79000854]], [[ 245.78999329, 244.78999329, 243.48999023, ..., 243.28999329, 243.98999023, 244.78999329], [ 249.88999939, 249.28999329, 248.48999023, ..., 241.28999329, 242.48999023, 244.28999329], [ 262.38998413, 261.79000854, 261.29000854, ..., 240.48999023, 243.08999634, 246.88999939], ..., [ 293.69000244, 293.88998413, 295.38998413, ..., 295.08999634, 294.69000244, 294.29000854], [ 296.29000854, 297.19000244, 297.58999634, ..., 295.29000854, 295.08999634, 294.38998413], [ 297.79000854, 298.38998413, 298.48999023, ..., 295.69000244, 295.48999023, 295.19000244]], [[ 245.08999634, 244.28999329, 243.28999329, ..., 241.68998718, 241.48999023, 241.78999329], [ 249.88999939, 249.28999329, 248.38999939, ..., 239.58999634, 240.28999329, 241.68998718], [ 262.98999023, 262.19000244, 261.38998413, ..., 239.88999939, 242.58999634, 246.28999329], ..., [ 293.79000854, 293.69000244, 295.08999634, ..., 295.29000854, 295.08999634, 294.69000244], [ 296.08999634, 296.88998413, 297.19000244, ..., 295.69000244, 295.69000244, 295.19000244], [ 297.69000244, 298.08999634, 298.08999634, ..., 296.48999023, 296.19000244, 295.69000244]]], dtype=float32)
t.shape
(2920, 25, 53)
# extract a time-series for one spatial location
t[:, 10, 20]
array([ 269.79000854, 267.69998169, 269.8999939 , ..., 266.19000244, 270.88998413, 268.38998413], dtype=float32)
but wait, what labels go with 10
and 20
? Was that lat/lon or lon/lat? Where are the timestamps that go along with this time-series?
xarray offers extremely flexible indexing routines that combine the best features of NumPy and pandas for data selection.
da = ds['air']
# numpy style indexing still works (but preserves the labels/metadata)
da[:, 10, 20]
<xarray.DataArray 'air' (time: 2920)> array([ 269.790009, 267.699982, 269.899994, ..., 266.190002, 270.889984, 268.389984], dtype=float32) Coordinates: lat float32 50.0 lon float32 250.0 * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ... Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [ 185.16000366 322.1000061 ]
# Positional indexing using dimension names
da.isel(lat=10, lon=20)
<xarray.DataArray 'air' (time: 2920)> array([ 269.790009, 267.699982, 269.899994, ..., 266.190002, 270.889984, 268.389984], dtype=float32) Coordinates: lat float32 50.0 lon float32 250.0 * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ... Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [ 185.16000366 322.1000061 ]
# Label-based indexing
da.sel(lat=50., lon=250.)
<xarray.DataArray 'air' (time: 2920)> array([ 269.790009, 267.699982, 269.899994, ..., 266.190002, 270.889984, 268.389984], dtype=float32) Coordinates: lat float32 50.0 lon float32 250.0 * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ... Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [ 185.16000366 322.1000061 ]
# Nearest neighbor lookups
da.sel(lat=52.25, lon=251.8998, method='nearest')
<xarray.DataArray 'air' (time: 2920)> array([ 262.699982, 263.199982, 270.899994, ..., 264.190002, 265.190002, 266.98999 ], dtype=float32) Coordinates: lat float32 52.5 lon float32 252.5 * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ... Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [ 185.16000366 322.1000061 ]
# all of these indexing methods work on the dataset too, e.g.:
ds.sel(lat=52.25, lon=251.8998, method='nearest')
<xarray.Dataset> Dimensions: (time: 2920) Coordinates: lat float32 52.5 lon float32 252.5 * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ... Data variables: air (time) float32 262.7 263.2 270.9 274.1 273.29 270.6 266.7 263.4 ... Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
Like numpy and pandas, xarray supports indexing many array elements at once in a vectorized manner.
# generate a coordinates for a transect of points
lat_points = xr.DataArray([52, 52.5, 53], dims='points')
lon_points = xr.DataArray([250, 250, 250], dims='points')
# nearest neighbor selection along the transect
da.sel(lat=lat_points, lon=lon_points, method='nearest')
<xarray.DataArray 'air' (time: 2920, points: 3)> array([[ 269.5 , 269.5 , 269.5 ], [ 269.290009, 269.290009, 269.290009], [ 273.699982, 273.699982, 273.699982], ..., [ 267.48999 , 267.48999 , 267.48999 ], [ 269.290009, 269.290009, 269.290009], [ 268.690002, 268.690002, 268.690002]], dtype=float32) Coordinates: lat (points) float32 52.5 52.5 52.5 lon (points) float32 250.0 250.0 250.0 * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ... Dimensions without coordinates: points Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [ 185.16000366 322.1000061 ]
xarray enforces alignment between index Coordinates (that is, coordinates with the same name as a dimension, marked by *) on objects used in binary operations.
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> array([[[ 241.199997, 242.5 , ..., 235.5 , 238.599991], [ 243.799988, 244.5 , ..., 235.299988, 239.299988], ..., [ 295.899994, 296.199982, ..., 295.899994, 295.199982], [ 296.290009, 296.790009, ..., 296.790009, 296.600006]], [[ 242.099991, 242.699997, ..., 233.599991, 235.799988], [ 243.599991, 244.099991, ..., 232.5 , 235.699997], ..., [ 296.199982, 296.699982, ..., 295.5 , 295.100006], [ 296.290009, 297.199982, ..., 296.399994, 296.600006]], ..., [[ 245.789993, 244.789993, ..., 243.98999 , 244.789993], [ 249.889999, 249.289993, ..., 242.48999 , 244.289993], ..., [ 296.290009, 297.190002, ..., 295.089996, 294.389984], [ 297.790009, 298.389984, ..., 295.48999 , 295.190002]], [[ 245.089996, 244.289993, ..., 241.48999 , 241.789993], [ 249.889999, 249.289993, ..., 240.289993, 241.689987], ..., [ 296.089996, 296.889984, ..., 295.690002, 295.190002], [ 297.690002, 298.089996, ..., 296.190002, 295.690002]]], dtype=float32) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ... * lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ... * time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ... Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [ 185.16000366 322.1000061 ]
arr = da.isel(time=0, lat=slice(5, 10), lon=slice(7, 11))
arr
<xarray.DataArray 'air' (lat: 5, lon: 4)> array([[ 265.100006, 264.899994, 265.5 , 266.5 ], [ 272.899994, 272. , 271.790009, 271.100006], [ 279.399994, 278.899994, 278.399994, 276.600006], [ 279.5 , 279.899994, 280.5 , 279.899994], [ 280.100006, 280.199982, 280.790009, 281.399994]], dtype=float32) Coordinates: * lat (lat) float32 62.5 60.0 57.5 55.0 52.5 * lon (lon) float32 217.5 220.0 222.5 225.0 time datetime64[ns] 2013-01-01 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [ 185.16000366 322.1000061 ]
part = arr[:-1]
part
<xarray.DataArray 'air' (lat: 4, lon: 4)> array([[ 265.100006, 264.899994, 265.5 , 266.5 ], [ 272.899994, 272. , 271.790009, 271.100006], [ 279.399994, 278.899994, 278.399994, 276.600006], [ 279.5 , 279.899994, 280.5 , 279.899994]], dtype=float32) Coordinates: * lat (lat) float32 62.5 60.0 57.5 55.0 * lon (lon) float32 217.5 220.0 222.5 225.0 time datetime64[ns] 2013-01-01 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [ 185.16000366 322.1000061 ]
# default behavior is an "inner join"
(arr + part) / 2
<xarray.DataArray 'air' (lat: 4, lon: 4)> array([[ 265.100006, 264.899994, 265.5 , 266.5 ], [ 272.899994, 272. , 271.790009, 271.100006], [ 279.399994, 278.899994, 278.399994, 276.600006], [ 279.5 , 279.899994, 280.5 , 279.899994]], dtype=float32) Coordinates: * lat (lat) float64 62.5 60.0 57.5 55.0 * lon (lon) float32 217.5 220.0 222.5 225.0 time datetime64[ns] 2013-01-01
# we can also use an outer join
with xr.set_options(arithmetic_join="outer"):
print((arr + part) / 2)
# notice that missing values (nan) were inserted
<xarray.DataArray 'air' (lat: 5, lon: 4)> array([[ nan, nan, nan, nan], [ 279.5 , 279.899994, 280.5 , 279.899994], [ 279.399994, 278.899994, 278.399994, 276.600006], [ 272.899994, 272. , 271.790009, 271.100006], [ 265.100006, 264.899994, 265.5 , 266.5 ]], dtype=float32) Coordinates: * lat (lat) float64 52.5 55.0 57.5 60.0 62.5 * lon (lon) float32 217.5 220.0 222.5 225.0 time datetime64[ns] 2013-01-01
DataArray objects are automatically align themselves (“broadcasting” in the numpy parlance) by dimension name instead of axis order. With xarray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as commonly done in numpy with np.reshape() or np.newaxis.