Most of the exciting developments in seamless data analysis, particularly for large Earth Science datasets, are happening in Python. In recent years, robust and open ecosystems have been built around tools like xarray and Pandas, which make it very easy to load large datasets, manipulate them in various ways, and produce beautiful plots. This notebook serves as a basic overview of netCDF manipulation in Python with xarray, and of the Pandas/Geopandas packages. Most of this notebook is based on Project Pythia by Brian Rose and the Earth and Environmental Data Science course. If you'd like to learn more, the tutorials at the above link are highly recommended.
from datetime import timedelta
import numpy as np
import pandas as pd
import xarray as xr
from matplotlib import pyplot as plt
data = 283 + 5 * np.random.randn(5, 3, 4) #make a random numpy array
The primary data type in xarray is the "DataArray". It works just like a numpy array, but also:

- Coordinate names and values are stored with the data, making slicing and indexing much more powerful.
- Attributes, similar to those in netCDF files, can be stored in a container built into the DataArray.
temp = xr.DataArray(data) #turn numpy array into DataArray
temp = xr.DataArray(data, dims=['time', 'lat', 'lon']) #assigns names to the three dimensions
Most modern datasets that include temporal information (the time and date when an observation was made) use a format for time called "datetime" (see Project Pythia for details). What you should know is that the datetime format stores information as YYYY-MM-DD HH:MM:SS. This can be shortened if further precision (e.g. hours or minutes) is not available, or can include arbitrarily high precision (decimal seconds). Converting from other time formats to datetime can make many kinds of data processing much easier (especially when using Pandas, see later).
times = pd.date_range('2018-01-01', periods=5)
times
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'], dtype='datetime64[ns]', freq='D')
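Beyond `pd.date_range`, other time representations can usually be converted with `pd.to_datetime`. A small sketch (the serial-date layout `YYYYMMDD` is a common one in station files):

```python
import pandas as pd

# ISO-style strings parse directly
dates = pd.to_datetime(['2018-01-01', '2018-01-02'])

# serial dates like 20180101 need an explicit format string
serial = pd.to_datetime(['20180101', '20180102'], format='%Y%m%d')

print(dates.equals(serial))  # both give the same datetime64 values
```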
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[284.3731919 , 280.69406396, 272.63458131, 288.861379  ],
        [286.08371827, 293.92505478, 279.98117705, 284.09732089],
        [280.48544704, 276.70461341, 280.8361101 , 292.87753072]],

       [[285.18374146, 284.77539863, 277.84078267, 287.28971632],
        [288.62604636, 279.36089958, 280.94742249, 284.81648292],
        [284.3972094 , 285.39161043, 287.75959911, 278.08258108]],

       [[284.20285521, 273.86316932, 283.40986702, 273.06314051],
        [280.92411653, 281.67024583, 284.94829422, 282.34354215],
        [283.78733794, 273.81960044, 286.70009539, 284.55548309]],

       [[278.56755805, 269.94938932, 285.02154146, 285.52978898],
        [279.26592939, 284.28803678, 289.08551781, 273.03007754],
        [283.63122401, 291.03504506, 288.71806512, 286.0750452 ]],

       [[293.04821027, 284.54705147, 290.26563039, 276.23482923],
        [278.11692433, 290.20971507, 285.2343849 , 283.74721148],
        [278.20545212, 276.78642027, 284.64727083, 281.20318262]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Just like we saw with netcdf files, attributes are useful information about variables (like units and names). We can add attributes to the DataArray to convey important metadata about the variables.
temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'
When an operation is performed on a DataArray, the result loses its attributes, since they may no longer be accurate after the operation (the attributes stay attached to the original data).
temp_in_celsius = temp - 273.15
temp_in_celsius
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[11.2231919 ,  7.54406396, -0.51541869, 15.711379  ],
        [12.93371827, 20.77505478,  6.83117705, 10.94732089],
        [ 7.33544704,  3.55461341,  7.6861101 , 19.72753072]],

       [[12.03374146, 11.62539863,  4.69078267, 14.13971632],
        [15.47604636,  6.21089958,  7.79742249, 11.66648292],
        [11.2472094 , 12.24161043, 14.60959911,  4.93258108]],

       [[11.05285521,  0.71316932, 10.25986702, -0.08685949],
        [ 7.77411653,  8.52024583, 11.79829422,  9.19354215],
        [10.63733794,  0.66960044, 13.55009539, 11.40548309]],

       [[ 5.41755805, -3.20061068, 11.87154146, 12.37978898],
        [ 6.11592939, 11.13803678, 15.93551781, -0.11992246],
        [10.48122401, 17.88504506, 15.56806512, 12.9250452 ]],

       [[19.89821027, 11.39705147, 17.11563039,  3.08482923],
        [ 4.96692433, 17.05971507, 12.0843849 , 10.59721148],
        [ 5.05545212,  3.63642027, 11.49727083,  8.05318262]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
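If you want to carry the attributes through an operation, one way is `xr.set_options(keep_attrs=True)`. A minimal sketch on a synthetic array (note that xarray will not update the units string for you, so this can leave stale metadata):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.array([280.0, 290.0]), dims=['x'], attrs={'units': 'kelvin'})

diff = da - 273.15
print(diff.attrs)  # {} -- attrs are dropped by default

with xr.set_options(keep_attrs=True):
    diff = da - 273.15
print(diff.attrs)  # attrs carried over, but 'units' still says kelvin!
```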
A Dataset is an object type in xarray that can hold many different variables (sort of like a netCDF file, but within Python's working memory). xarray automatically figures out whether the variables in a Dataset share the same dimensions. Here we make one.
pressure_data = 1000.0 + 5 * np.random.randn(5, 3, 4)
pressure = xr.DataArray(
pressure_data, coords=[times, lats, lons], dims=['time', 'lat', 'lon']
)
pressure.attrs['units'] = 'hPa'
pressure.attrs['standard_name'] = 'air_pressure'
ds = xr.Dataset(data_vars={'Temperature': temp, 'Pressure': pressure})
ds
#ds.Pressure #to select a variable
<xarray.Dataset>
Dimensions:      (time: 5, lat: 3, lon: 4)
Coordinates:
  * time         (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
  * lat          (lat) float64 25.0 40.0 55.0
  * lon          (lon) float64 -120.0 -100.0 -80.0 -60.0
Data variables:
    Temperature  (time, lat, lon) float64 284.4 280.7 272.6 ... 284.6 281.2
    Pressure     (time, lat, lon) float64 995.1 1.01e+03 ... 1.004e+03 998.7
We can index and select data as in a numpy array using [], but also in other clever ways based on the dimensions.
named_selection = ds.Temperature.sel(time='2018-01-02')
named_selection
<xarray.DataArray 'Temperature' (lat: 3, lon: 4)>
array([[285.18374146, 284.77539863, 277.84078267, 287.28971632],
       [288.62604636, 279.36089958, 280.94742249, 284.81648292],
       [284.3972094 , 285.39161043, 287.75959911, 278.08258108]])
Coordinates:
    time     datetime64[ns] 2018-01-02
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature
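`.sel()` selects by coordinate labels; its positional counterpart is `.isel()`, which selects by integer index along a named dimension. A minimal sketch on a synthetic array:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(60).reshape(5, 3, 4), dims=['time', 'lat', 'lon'])

# integer position along a named dimension, like numpy's da[0]
first = da.isel(time=0)
print(first.dims)  # ('lat', 'lon')
```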
In this example, we are trying to sample a temporal data point within 2 days of the date 2018-01-07. Since the final date on our DataArray’s temporal axis is 2018-01-05, we can use .sel() to perform nearest-neighbor sampling, by setting the method keyword argument to ‘nearest’.
temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))
<xarray.DataArray (lat: 3, lon: 4)>
array([[293.04821027, 284.54705147, 290.26563039, 276.23482923],
       [278.11692433, 290.20971507, 285.2343849 , 283.74721148],
       [278.20545212, 276.78642027, 284.64727083, 281.20318262]])
Coordinates:
    time     datetime64[ns] 2018-01-05
  * lat      (lat) float64 25.0 40.0 55.0
  * lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature
In this example, we are trying to extract a timeseries at 40°N latitude and 105°W longitude. Our DataArray does not contain a longitude data value of -105, so in order to retrieve this timeseries, we must interpolate between data points. .interp() allows us to retrieve data from any latitude and longitude by means of interpolation. This method uses coordinate-value selection, similarly to .sel().
temp.interp(lon=-105, lat=40)
<xarray.DataArray (time: 5)>
array([291.96472065, 281.67718627, 281.48371351, 283.03250993, 287.18651738])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
    lon      int64 -105
    lat      int64 40
Attributes:
    units:          kelvin
    standard_name:  air_temperature
A slice is a range of data along one or more coordinates. To slice data, you create a slice object and then use it to index a dataset.
temp.sel(time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45))
<xarray.DataArray (time: 3, lat: 2, lon: 2)>
array([[[280.69406396, 272.63458131],
        [293.92505478, 279.98117705]],

       [[284.77539863, 277.84078267],
        [279.36089958, 280.94742249]],

       [[273.86316932, 283.40986702],
        [281.67024583, 284.94829422]]])
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 2018-01-03
  * lat      (lat) float64 25.0 40.0
  * lon      (lon) float64 -100.0 -80.0
Attributes:
    units:          kelvin
    standard_name:  air_temperature
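One caveat with label-based slicing: the slice bounds must follow the stored coordinate order. Latitude in many reanalysis products is stored descending (90 down to -90), in which case the slice must go high-to-low. Illustrated on a synthetic array:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(5.0),
                  coords=[('lat', np.array([90., 45., 0., -45., -90.]))])

# ascending-style bounds return an empty selection on a descending axis
print(da.sel(lat=slice(-45, 45)).size)  # 0

# high-to-low bounds work
subset = da.sel(lat=slice(45, -45))
print(subset.lat.values)  # [ 45.   0. -45.]
```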
Xarray can easily open netCDF datasets, provided they conform to certain limitations (for example, one-dimensional coordinates).
ds = xr.open_dataset('ERA5_2mtemp_SLP_2022.nc')
ds
<xarray.Dataset>
Dimensions:    (longitude: 1440, latitude: 721, time: 6)
Coordinates:
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * time       (time) datetime64[ns] 2022-01-01 2022-02-01 ... 2022-06-01
Data variables:
    t2m        (time, latitude, longitude) float32 ...
    sp         (time, latitude, longitude) float32 ...
Attributes:
    Conventions:  CF-1.6
    history:      2023-08-16 15:10:53 GMT by grib_to_netcdf-2.25.1: /opt/ecmw...
We can select slices based on coordinates as before, and xarray will apply the selection to all DataArrays within the Dataset.
ds_feb = ds.sel(time='2022-02-01')
ds_feb
<xarray.Dataset>
Dimensions:    (longitude: 1440, latitude: 721)
Coordinates:
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
    time       datetime64[ns] 2022-02-01
Data variables:
    t2m        (latitude, longitude) float32 ...
    sp         (latitude, longitude) float32 ...
Attributes:
    Conventions:  CF-1.6
    history:      2023-08-16 15:10:53 GMT by grib_to_netcdf-2.25.1: /opt/ecmw...
You can use named dimensions in an xarray Dataset to manually slice and index data. However, these dimension names also serve an additional purpose: you can use them to specify the dimensions to aggregate over. There are many different aggregation operations available (mean, std, min, max, sum, and so on). Here we first select one variable (temperature) and then take the mean over time, giving back a 2D DataArray with the temporal mean at each lat/lon coordinate.
t2m = ds['t2m']
t2m.mean(dim=['time'])
<xarray.DataArray 't2m' (latitude: 721, longitude: 1440)>
array([[256.01334, 256.01334, 256.01334, ..., 256.01334, 256.01334, 256.01334],
       [256.00876, 256.00906, 256.00876, ..., 256.00876, 256.00876, 256.00906],
       [256.06802, 256.06827, 256.06854, ..., 256.06775, 256.06802, 256.06827],
       ...,
       [228.11365, 228.11528, 228.11772, ..., 228.10957, 228.11177, 228.11339],
       [228.01544, 228.0168 , 228.01843, ..., 228.0125 , 228.01355, 228.0141 ],
       [227.6132 , 227.6132 , 227.6132 , ..., 227.6132 , 227.6132 , 227.6132 ]],
      dtype=float32)
Coordinates:
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.0 359.2 359.5 359.8
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
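Other reductions follow the same pattern; a sketch on a synthetic array, collapsing either time or space (the spatial mean here is unweighted — a proper area average would weight by the cosine of latitude):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(283 + 5 * np.random.randn(5, 3, 4), dims=['time', 'lat', 'lon'])

std_map = da.std(dim='time')          # 2D map of temporal standard deviation
series = da.mean(dim=['lat', 'lon'])  # 1D time series of spatial means
print(std_map.dims, series.dims)
```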
Xarray greatly simplifies plotting of data stored as DataArrays and Datasets. One advantage of this is that many common plot elements, such as axis labels, are automatically generated and optimized for the data being plotted. Xarray includes a built-in plotting interface, which makes use of Matplotlib behind the scenes. In order to use this interface, you can call the .plot() method, which is included in every DataArray.
ds.t2m.sel(time='2022-02-01').plot()
<matplotlib.collections.QuadMesh at 0x7f7f81de5790>
t2m.mean(dim=['time']).plot()
<matplotlib.collections.QuadMesh at 0x7f7f90faa580>
ds.t2m.sel(latitude=33.0).plot()
<matplotlib.collections.QuadMesh at 0x7f7f90e490d0>
ds.t2m.sel(latitude=33.0,time='2022-02-01').plot()
[<matplotlib.lines.Line2D at 0x7f7f81f9dac0>]
Pandas is a very powerful library for working with tabular data (e.g., spreadsheets, comma-separated-value files, or database printouts; all of these are quite common for geoscientific data). It allows us to use labels for our data; this, in turn, allows us to write expressive and robust code to manipulate the data.
Key features of Pandas are the abilities to read in tabular data and to slice and dice data, as well as exploratory analysis tools native to the library.
A Series represents a one-dimensional array of data. The main difference between a Series and a numpy array is that a Series has an index. The index contains the labels that we use to access the data.
names = ['Mercury', 'Venus', 'Earth']
values = [0.3e24, 4.87e24, 5.97e24]
masses = pd.Series(values, index=names)
masses
Mercury    3.000000e+23
Venus      4.870000e+24
Earth      5.970000e+24
dtype: float64
Like in xarray, Series have built-in plotting methods
masses.plot(kind='bar')
<AxesSubplot:>
Arithmetic operations and most numpy functions can be applied to Series. An important point is that a Series keeps its index during such operations.
np.log(masses) / masses**2
Mercury    6.006452e-46
Venus      2.396820e-48
Earth      1.600655e-48
dtype: float64
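That index-keeping has a powerful consequence: operations between two Series match entries by label, not by position, and unmatched labels come out as NaN. A small sketch (the radii here are illustrative values in meters):

```python
import numpy as np
import pandas as pd

masses = pd.Series([0.3e24, 4.87e24, 5.97e24], index=['Mercury', 'Venus', 'Earth'])
# different order, and no Mercury entry at all
radii = pd.Series([6.371e6, 6.052e6], index=['Earth', 'Venus'])

ratio = masses / radii**2  # aligned on index labels, not positions
print(ratio)               # Venus and Earth computed; Mercury is NaN
```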
We can index the Series using strings, which is very handy
masses.loc['Earth']
5.97e+24
masses.loc[['Venus', 'Earth']]
Venus    4.870000e+24
Earth    5.970000e+24
dtype: float64
masses.loc['Mercury':'Earth'] #slicing
Mercury    3.000000e+23
Venus      4.870000e+24
Earth      5.970000e+24
dtype: float64
There is a lot more to Series, but they are limited to a single "column". A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of Series that share the same index. It's a lot like a table in a spreadsheet.
# first we create a dictionary
data = {'mass': [0.3e24, 4.87e24, 5.97e24], # kg
'diameter': [4879e3, 12_104e3, 12_756e3], # m
'rotation_period': [1407.6, np.nan, 23.9] # h
}
df = pd.DataFrame(data, index=['Mercury', 'Venus', 'Earth'])
df
mass | diameter | rotation_period | |
---|---|---|---|
Mercury | 3.000000e+23 | 4879000.0 | 1407.6 |
Venus | 4.870000e+24 | 12104000.0 | NaN |
Earth | 5.970000e+24 | 12756000.0 | 23.9 |
We can do many different arithmetic or statistical operations on a DataFrame
df.min()
mass               3.000000e+23
diameter           4.879000e+06
rotation_period    2.390000e+01
dtype: float64
df.mean()
mass               3.713333e+24
diameter           9.913000e+06
rotation_period    7.157500e+02
dtype: float64
df.describe()
mass | diameter | rotation_period | |
---|---|---|---|
count | 3.000000e+00 | 3.000000e+00 | 2.000000 |
mean | 3.713333e+24 | 9.913000e+06 | 715.750000 |
std | 3.006765e+24 | 4.371744e+06 | 978.423653 |
min | 3.000000e+23 | 4.879000e+06 | 23.900000 |
25% | 2.585000e+24 | 8.491500e+06 | 369.825000 |
50% | 4.870000e+24 | 1.210400e+07 | 715.750000 |
75% | 5.420000e+24 | 1.243000e+07 | 1061.675000 |
max | 5.970000e+24 | 1.275600e+07 | 1407.600000 |
Indexing is made very easy with dataframes
df['mass']
Mercury    3.000000e+23
Venus      4.870000e+24
Earth      5.970000e+24
Name: mass, dtype: float64
df.loc['Earth']
mass               5.970000e+24
diameter           1.275600e+07
rotation_period    2.390000e+01
Name: Earth, dtype: float64
df.loc['Earth', 'mass']
5.97e+24
Adding a new column is as easy as assigning to an undefined column name
df['density'] = df.mass / (4/3 * np.pi * (df.diameter/2)**3)
df
mass | diameter | rotation_period | density | |
---|---|---|---|---|
Mercury | 3.000000e+23 | 4879000.0 | 1407.6 | 4933.216530 |
Venus | 4.870000e+24 | 12104000.0 | NaN | 5244.977070 |
Earth | 5.970000e+24 | 12756000.0 | 23.9 | 5493.285577 |
New Series or DataFrames can be joined to existing DataFrames
temperature = pd.Series([167, 464, 15, -65],
index=['Mercury', 'Venus', 'Earth', 'Mars'],
name='temperature')
df.join(temperature, how='right')
mass | diameter | rotation_period | density | temperature | |
---|---|---|---|---|---|
Mercury | 3.000000e+23 | 4879000.0 | 1407.6 | 4933.216530 | 167 |
Venus | 4.870000e+24 | 12104000.0 | NaN | 5244.977070 | 464 |
Earth | 5.970000e+24 | 12756000.0 | 23.9 | 5493.285577 | 15 |
Mars | NaN | NaN | NaN | NaN | -65 |
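The `how` keyword controls which index labels survive the join, analogous to database joins ('left', 'right', 'inner', 'outer'). A sketch with a hypothetical miniature of the data above:

```python
import pandas as pd

df = pd.DataFrame({'mass': [0.3e24, 4.87e24]}, index=['Mercury', 'Venus'])
temperature = pd.Series([464, 15], index=['Venus', 'Earth'], name='temperature')

inner = df.join(temperature, how='inner')  # only labels present in both
outer = df.join(temperature, how='outer')  # union of labels, NaN where missing
print(list(inner.index), sorted(outer.index))
```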
everyone = df.reindex(['Mercury', 'Venus', 'Earth', 'Mars'])
Indexing with a boolean mask is easy
bigguys = df[df.mass > 4e24]
bigguys
mass | diameter | rotation_period | density | |
---|---|---|---|---|
Venus | 4.870000e+24 | 12104000.0 | NaN | 5244.977070 |
Earth | 5.970000e+24 | 12756000.0 | 23.9 | 5493.285577 |
As in xarray, plotting functions from matplotlib are built directly into Pandas
df.plot(kind='scatter', x='mass', y='diameter', grid=True)
<AxesSubplot:xlabel='mass', ylabel='diameter'>
df.plot(kind='bar')
<AxesSubplot:>
Also like xarray, we can use datetime to index data. Plotting with a datetime index is very convenient
two_years = pd.date_range(start='2014-01-01', end='2016-01-01', freq='D')
timeseries = pd.Series(np.sin(2 * np.pi * two_years.dayofyear / 365),
index=two_years)
timeseries.plot()
<AxesSubplot:>
timeseries.loc['2015-01-01':'2015-07-01'].plot() #slicing
<AxesSubplot:>
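With a datetime index, Pandas can also aggregate in time with `.resample()`. A sketch computing monthly means of the daily series above (using the 'MS' month-start frequency):

```python
import numpy as np
import pandas as pd

two_years = pd.date_range(start='2014-01-01', end='2016-01-01', freq='D')
timeseries = pd.Series(np.sin(2 * np.pi * two_years.dayofyear / 365),
                       index=two_years)

monthly = timeseries.resample('MS').mean()  # one value per calendar month
print(len(monthly))  # 25 months: Jan 2014 through Jan 2016
```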
Loading plain-text data directly into a DataFrame makes manipulating the data much easier. Here we need to know that the file is space-delimited, not comma-delimited, which is the default assumption of the read_csv function.
df = pd.read_csv('station_data.txt', sep=r'\s+') #raw string so backslash-s is read as a regex, not an escape
df
WBANNO | LST_DATE | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_5_DAILY | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 64756 | 20170101 | 2.422 | -73.74 | 41.79 | 6.6 | -5.4 | 0.6 | 2.2 | 0.0 | ... | -99.000 | -99.000 | 0.207 | 0.152 | 0.175 | -0.1 | 0.0 | 0.6 | 1.5 | 3.4 |
1 | 64756 | 20170102 | 2.422 | -73.74 | 41.79 | 4.0 | -6.8 | -1.4 | -1.2 | 0.0 | ... | -99.000 | -99.000 | 0.205 | 0.151 | 0.173 | -0.2 | 0.0 | 0.6 | 1.5 | 3.3 |
2 | 64756 | 20170103 | 2.422 | -73.74 | 41.79 | 4.9 | 0.7 | 2.8 | 2.7 | 13.1 | ... | -99.000 | -99.000 | 0.205 | 0.150 | 0.173 | -0.1 | 0.0 | 0.5 | 1.5 | 3.3 |
3 | 64756 | 20170104 | 2.422 | -73.74 | 41.79 | 8.7 | -1.6 | 3.6 | 3.5 | 1.3 | ... | -99.000 | -99.000 | 0.215 | 0.153 | 0.174 | -0.1 | 0.0 | 0.5 | 1.5 | 3.2 |
4 | 64756 | 20170105 | 2.422 | -73.74 | 41.79 | -0.5 | -4.6 | -2.5 | -2.8 | 0.0 | ... | -99.000 | -99.000 | 0.215 | 0.154 | 0.177 | -0.1 | 0.0 | 0.5 | 1.4 | 3.1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
360 | 64756 | 20171227 | 2.622 | -73.74 | 41.79 | -6.7 | -19.3 | -13.0 | -12.9 | 0.0 | ... | 0.258 | 0.238 | 0.215 | 0.166 | 0.170 | 0.8 | 1.0 | 1.7 | 3.1 | 4.6 |
361 | 64756 | 20171228 | 2.622 | -73.74 | 41.79 | -10.3 | -21.5 | -15.9 | -15.8 | 0.0 | ... | 0.235 | 0.244 | 0.211 | 0.165 | 0.168 | 0.4 | 0.6 | 1.4 | 2.8 | 4.5 |
362 | 64756 | 20171229 | 2.622 | -73.74 | 41.79 | -9.4 | -19.0 | -14.2 | -14.7 | 0.0 | ... | -99.000 | -99.000 | 0.207 | 0.163 | 0.167 | 0.1 | 0.4 | 1.1 | 2.6 | 4.3 |
363 | 64756 | 20171230 | 2.622 | -73.74 | 41.79 | -7.1 | -18.9 | -13.0 | -13.6 | 1.1 | ... | -99.000 | -99.000 | 0.203 | 0.161 | 0.166 | 0.0 | 0.2 | 0.9 | 2.4 | 4.1 |
364 | 64756 | 20171231 | 2.622 | -73.74 | 41.79 | -12.3 | -21.8 | -17.0 | -16.7 | 0.0 | ... | -99.000 | -99.000 | 0.200 | 0.160 | 0.165 | -0.2 | 0.1 | 0.8 | 2.2 | 4.0 |
365 rows × 28 columns
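The same parsing options can be tried out on an inline string, without the data file, via io.StringIO (a hypothetical three-column miniature of the station format):

```python
from io import StringIO

import pandas as pd

raw = StringIO(
    "WBANNO LST_DATE T_DAILY_MAX\n"
    "64756 20170101 6.6\n"
    "64756 20170102 -99.0\n"
)
mini = pd.read_csv(raw, sep=r'\s+', na_values=[-9999.0, -99.0])
print(mini['T_DAILY_MAX'].isna().sum())  # the -99.0 sentinel became NaN
```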
df = pd.read_csv('station_data.txt', sep=r'\s+', na_values=[-9999.0, -99.0]) #we tell it what missing values look like
df.info() #dump info about dataframe
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   WBANNO                   365 non-null    int64
 1   LST_DATE                 365 non-null    int64
 2   CRX_VN                   365 non-null    float64
 3   LONGITUDE                365 non-null    float64
 4   LATITUDE                 365 non-null    float64
 5   T_DAILY_MAX              364 non-null    float64
 6   T_DAILY_MIN              364 non-null    float64
 7   T_DAILY_MEAN             364 non-null    float64
 8   T_DAILY_AVG              364 non-null    float64
 9   P_DAILY_CALC             364 non-null    float64
 10  SOLARAD_DAILY            364 non-null    float64
 11  SUR_TEMP_DAILY_TYPE      365 non-null    object
 12  SUR_TEMP_DAILY_MAX       364 non-null    float64
 13  SUR_TEMP_DAILY_MIN       364 non-null    float64
 14  SUR_TEMP_DAILY_AVG       364 non-null    float64
 15  RH_DAILY_MAX             364 non-null    float64
 16  RH_DAILY_MIN             364 non-null    float64
 17  RH_DAILY_AVG             364 non-null    float64
 18  SOIL_MOISTURE_5_DAILY    317 non-null    float64
 19  SOIL_MOISTURE_10_DAILY   317 non-null    float64
 20  SOIL_MOISTURE_20_DAILY   336 non-null    float64
 21  SOIL_MOISTURE_50_DAILY   364 non-null    float64
 22  SOIL_MOISTURE_100_DAILY  359 non-null    float64
 23  SOIL_TEMP_5_DAILY        364 non-null    float64
 24  SOIL_TEMP_10_DAILY       364 non-null    float64
 25  SOIL_TEMP_20_DAILY       364 non-null    float64
 26  SOIL_TEMP_50_DAILY       364 non-null    float64
 27  SOIL_TEMP_100_DAILY      364 non-null    float64
dtypes: float64(25), int64(2), object(1)
memory usage: 80.0+ KB
df = pd.read_csv('station_data.txt', sep=r'\s+', na_values=[-9999.0, -99.0], parse_dates=[1]) #make sure it knows which column is datetime-formatted
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   WBANNO                   365 non-null    int64
 1   LST_DATE                 365 non-null    datetime64[ns]
 2   CRX_VN                   365 non-null    float64
 3   LONGITUDE                365 non-null    float64
 4   LATITUDE                 365 non-null    float64
 5   T_DAILY_MAX              364 non-null    float64
 6   T_DAILY_MIN              364 non-null    float64
 7   T_DAILY_MEAN             364 non-null    float64
 8   T_DAILY_AVG              364 non-null    float64
 9   P_DAILY_CALC             364 non-null    float64
 10  SOLARAD_DAILY            364 non-null    float64
 11  SUR_TEMP_DAILY_TYPE      365 non-null    object
 12  SUR_TEMP_DAILY_MAX       364 non-null    float64
 13  SUR_TEMP_DAILY_MIN       364 non-null    float64
 14  SUR_TEMP_DAILY_AVG       364 non-null    float64
 15  RH_DAILY_MAX             364 non-null    float64
 16  RH_DAILY_MIN             364 non-null    float64
 17  RH_DAILY_AVG             364 non-null    float64
 18  SOIL_MOISTURE_5_DAILY    317 non-null    float64
 19  SOIL_MOISTURE_10_DAILY   317 non-null    float64
 20  SOIL_MOISTURE_20_DAILY   336 non-null    float64
 21  SOIL_MOISTURE_50_DAILY   364 non-null    float64
 22  SOIL_MOISTURE_100_DAILY  359 non-null    float64
 23  SOIL_TEMP_5_DAILY        364 non-null    float64
 24  SOIL_TEMP_10_DAILY       364 non-null    float64
 25  SOIL_TEMP_20_DAILY       364 non-null    float64
 26  SOIL_TEMP_50_DAILY       364 non-null    float64
 27  SOIL_TEMP_100_DAILY      364 non-null    float64
dtypes: datetime64[ns](1), float64(25), int64(1), object(1)
memory usage: 80.0+ KB
df = df.set_index('LST_DATE') #use the date column as the index
df
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_5_DAILY | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2017-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 6.6 | -5.4 | 0.6 | 2.2 | 0.0 | 8.68 | ... | NaN | NaN | 0.207 | 0.152 | 0.175 | -0.1 | 0.0 | 0.6 | 1.5 | 3.4 |
2017-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 4.0 | -6.8 | -1.4 | -1.2 | 0.0 | 2.08 | ... | NaN | NaN | 0.205 | 0.151 | 0.173 | -0.2 | 0.0 | 0.6 | 1.5 | 3.3 |
2017-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 4.9 | 0.7 | 2.8 | 2.7 | 13.1 | 0.68 | ... | NaN | NaN | 0.205 | 0.150 | 0.173 | -0.1 | 0.0 | 0.5 | 1.5 | 3.3 |
2017-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 8.7 | -1.6 | 3.6 | 3.5 | 1.3 | 2.85 | ... | NaN | NaN | 0.215 | 0.153 | 0.174 | -0.1 | 0.0 | 0.5 | 1.5 | 3.2 |
2017-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -0.5 | -4.6 | -2.5 | -2.8 | 0.0 | 4.90 | ... | NaN | NaN | 0.215 | 0.154 | 0.177 | -0.1 | 0.0 | 0.5 | 1.4 | 3.1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2017-12-27 | 64756 | 2.622 | -73.74 | 41.79 | -6.7 | -19.3 | -13.0 | -12.9 | 0.0 | 8.36 | ... | 0.258 | 0.238 | 0.215 | 0.166 | 0.170 | 0.8 | 1.0 | 1.7 | 3.1 | 4.6 |
2017-12-28 | 64756 | 2.622 | -73.74 | 41.79 | -10.3 | -21.5 | -15.9 | -15.8 | 0.0 | 8.46 | ... | 0.235 | 0.244 | 0.211 | 0.165 | 0.168 | 0.4 | 0.6 | 1.4 | 2.8 | 4.5 |
2017-12-29 | 64756 | 2.622 | -73.74 | 41.79 | -9.4 | -19.0 | -14.2 | -14.7 | 0.0 | 7.09 | ... | NaN | NaN | 0.207 | 0.163 | 0.167 | 0.1 | 0.4 | 1.1 | 2.6 | 4.3 |
2017-12-30 | 64756 | 2.622 | -73.74 | 41.79 | -7.1 | -18.9 | -13.0 | -13.6 | 1.1 | 1.10 | ... | NaN | NaN | 0.203 | 0.161 | 0.166 | 0.0 | 0.2 | 0.9 | 2.4 | 4.1 |
2017-12-31 | 64756 | 2.622 | -73.74 | 41.79 | -12.3 | -21.8 | -17.0 | -16.7 | 0.0 | 3.77 | ... | NaN | NaN | 0.200 | 0.160 | 0.165 | -0.2 | 0.1 | 0.8 | 2.2 | 4.0 |
365 rows × 27 columns
df.loc['2017-08-07']
WBANNO                     64756
CRX_VN                     2.422
LONGITUDE                 -73.74
LATITUDE                   41.79
T_DAILY_MAX                 19.3
T_DAILY_MIN                 12.3
T_DAILY_MEAN                15.8
T_DAILY_AVG                 16.3
P_DAILY_CALC                 4.9
SOLARAD_DAILY               3.93
SUR_TEMP_DAILY_TYPE            C
SUR_TEMP_DAILY_MAX          22.3
SUR_TEMP_DAILY_MIN          11.9
SUR_TEMP_DAILY_AVG          17.7
RH_DAILY_MAX                94.7
RH_DAILY_MIN                76.4
RH_DAILY_AVG                89.5
SOIL_MOISTURE_5_DAILY      0.148
SOIL_MOISTURE_10_DAILY     0.113
SOIL_MOISTURE_20_DAILY     0.094
SOIL_MOISTURE_50_DAILY     0.114
SOIL_MOISTURE_100_DAILY    0.151
SOIL_TEMP_5_DAILY           21.4
SOIL_TEMP_10_DAILY          21.7
SOIL_TEMP_20_DAILY          22.1
SOIL_TEMP_50_DAILY          22.2
SOIL_TEMP_100_DAILY         21.5
Name: 2017-08-07 00:00:00, dtype: object
df.loc['2017-07-01':'2017-07-31'] #slicing
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_5_DAILY | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2017-07-01 | 64756 | 2.422 | -73.74 | 41.79 | 28.0 | 19.7 | 23.9 | 23.8 | 0.2 | 19.28 | ... | 0.157 | 0.136 | 0.144 | 0.129 | 0.163 | 25.7 | 25.4 | 23.7 | 21.9 | 19.9 |
2017-07-02 | 64756 | 2.422 | -73.74 | 41.79 | 29.8 | 18.4 | 24.1 | 23.7 | 4.0 | 27.67 | ... | 0.146 | 0.135 | 0.143 | 0.129 | 0.162 | 26.8 | 26.4 | 24.5 | 22.3 | 20.1 |
2017-07-03 | 64756 | 2.422 | -73.74 | 41.79 | 28.3 | 15.0 | 21.7 | 21.4 | 0.0 | 27.08 | ... | 0.141 | 0.132 | 0.139 | 0.128 | 0.162 | 26.4 | 26.3 | 24.8 | 22.8 | 20.3 |
2017-07-04 | 64756 | 2.422 | -73.74 | 41.79 | 26.8 | 12.6 | 19.7 | 20.0 | 0.0 | 29.45 | ... | 0.131 | 0.126 | 0.136 | 0.126 | 0.161 | 25.9 | 25.8 | 24.6 | 22.9 | 20.6 |
2017-07-05 | 64756 | 2.422 | -73.74 | 41.79 | 28.0 | 11.9 | 20.0 | 20.7 | 0.0 | 26.90 | ... | 0.116 | 0.114 | 0.131 | 0.125 | 0.161 | 25.3 | 25.3 | 24.2 | 22.8 | 20.7 |
2017-07-06 | 64756 | 2.422 | -73.74 | 41.79 | 25.7 | 14.3 | 20.0 | 20.3 | 0.0 | 19.03 | ... | 0.105 | 0.104 | 0.126 | 0.124 | 0.160 | 24.7 | 24.7 | 23.9 | 22.7 | 20.9 |
2017-07-07 | 64756 | 2.422 | -73.74 | 41.79 | 25.8 | 16.8 | 21.3 | 20.0 | 11.5 | 13.88 | ... | 0.114 | 0.100 | 0.123 | 0.123 | 0.160 | 24.2 | 24.2 | 23.4 | 22.4 | 20.8 |
2017-07-08 | 64756 | 2.422 | -73.74 | 41.79 | 29.0 | 15.3 | 22.1 | 21.5 | 0.0 | 21.92 | ... | 0.130 | 0.106 | 0.122 | 0.123 | 0.159 | 25.5 | 25.3 | 23.9 | 22.4 | 20.8 |
2017-07-09 | 64756 | 2.422 | -73.74 | 41.79 | 26.3 | 10.9 | 18.6 | 19.4 | 0.0 | 29.72 | ... | 0.119 | 0.103 | 0.119 | 0.121 | 0.158 | 24.8 | 24.8 | 23.8 | 22.5 | 20.8 |
2017-07-10 | 64756 | 2.422 | -73.74 | 41.79 | 27.6 | 11.8 | 19.7 | 21.3 | 0.0 | 23.67 | ... | 0.105 | 0.096 | 0.113 | 0.120 | 0.158 | 24.7 | 24.7 | 23.6 | 22.5 | 20.9 |
2017-07-11 | 64756 | 2.422 | -73.74 | 41.79 | 27.4 | 19.2 | 23.3 | 22.6 | 8.5 | 17.79 | ... | 0.106 | 0.093 | 0.110 | 0.120 | 0.157 | 25.6 | 25.4 | 24.1 | 22.6 | 20.9 |
2017-07-12 | 64756 | 2.422 | -73.74 | 41.79 | 29.4 | 18.5 | 23.9 | 23.1 | 1.9 | 16.27 | ... | 0.108 | 0.094 | 0.108 | 0.118 | 0.157 | 25.8 | 25.6 | 24.2 | 22.8 | 21.0 |
2017-07-13 | 64756 | 2.422 | -73.74 | 41.79 | 29.5 | 18.3 | 23.9 | 23.4 | 23.3 | 13.61 | ... | 0.134 | 0.110 | 0.108 | 0.118 | 0.156 | 25.7 | 25.7 | 24.4 | 23.0 | 21.0 |
2017-07-14 | 64756 | 2.422 | -73.74 | 41.79 | 18.5 | 15.9 | 17.2 | 17.5 | 4.1 | 5.36 | ... | 0.194 | 0.151 | 0.114 | 0.120 | 0.155 | 23.0 | 23.3 | 23.4 | 22.9 | 21.2 |
2017-07-15 | 64756 | 2.422 | -73.74 | 41.79 | 26.6 | 16.5 | 21.5 | 21.0 | 0.8 | 21.13 | ... | 0.190 | 0.163 | 0.119 | 0.122 | 0.155 | 24.6 | 24.4 | 23.2 | 22.2 | 21.2 |
2017-07-16 | 64756 | 2.422 | -73.74 | 41.79 | 27.9 | 13.3 | 20.6 | 21.0 | 0.0 | 27.03 | ... | 0.171 | 0.154 | 0.123 | 0.123 | 0.155 | 25.4 | 25.3 | 23.9 | 22.6 | 21.1 |
2017-07-17 | 64756 | 2.422 | -73.74 | 41.79 | 29.2 | 16.1 | 22.6 | 22.9 | 0.0 | 20.47 | ... | 0.155 | 0.143 | 0.124 | 0.122 | 0.156 | 25.7 | 25.6 | 24.4 | 22.9 | 21.2 |
2017-07-18 | 64756 | 2.422 | -73.74 | 41.79 | 30.3 | 19.3 | 24.8 | 24.7 | 0.0 | 24.99 | ... | 0.142 | 0.132 | 0.122 | 0.122 | 0.156 | 27.0 | 26.7 | 24.9 | 23.2 | 21.3 |
2017-07-19 | 64756 | 2.422 | -73.74 | 41.79 | 31.2 | 19.1 | 25.1 | 25.0 | 0.0 | 27.69 | ... | 0.126 | 0.118 | 0.118 | 0.122 | 0.156 | 27.6 | 27.4 | 25.6 | 23.7 | 21.5 |
2017-07-20 | 64756 | 2.422 | -73.74 | 41.79 | 31.8 | 16.6 | 24.2 | 23.4 | 0.7 | 21.53 | ... | 0.111 | 0.103 | 0.114 | 0.121 | 0.156 | 27.0 | 27.0 | 25.6 | 24.0 | 21.7 |
2017-07-21 | 64756 | 2.422 | -73.74 | 41.79 | 30.6 | 16.6 | 23.6 | 23.6 | 0.0 | 25.55 | ... | 0.100 | 0.093 | 0.108 | 0.120 | 0.155 | 27.1 | 27.0 | 25.5 | 24.0 | 21.9 |
2017-07-22 | 64756 | 2.422 | -73.74 | 41.79 | 27.7 | 15.6 | 21.7 | 21.2 | 0.5 | 16.04 | ... | 0.092 | 0.086 | 0.104 | 0.119 | 0.156 | 25.9 | 26.1 | 25.3 | 24.1 | 22.0 |
2017-07-23 | 64756 | 2.422 | -73.74 | 41.79 | 26.4 | 18.5 | 22.5 | 22.2 | 0.0 | 19.03 | ... | 0.087 | 0.082 | 0.100 | 0.118 | 0.155 | 26.0 | 26.0 | 24.9 | 23.8 | 22.1 |
2017-07-24 | 64756 | 2.422 | -73.74 | 41.79 | 19.4 | 14.8 | 17.1 | 16.7 | 29.2 | 9.10 | ... | 0.145 | 0.118 | 0.102 | 0.117 | 0.154 | 23.1 | 23.6 | 23.9 | 23.5 | 22.1 |
2017-07-25 | 64756 | 2.422 | -73.74 | 41.79 | 18.6 | 13.7 | 16.2 | 16.2 | 0.0 | 7.35 | ... | 0.167 | 0.133 | 0.107 | 0.116 | 0.153 | 21.9 | 22.2 | 22.4 | 22.5 | 21.9 |
2017-07-26 | 64756 | 2.422 | -73.74 | 41.79 | 24.7 | 11.2 | 18.0 | 18.3 | 0.0 | 22.22 | ... | 0.155 | 0.128 | 0.108 | 0.118 | 0.152 | 22.9 | 23.0 | 22.3 | 22.0 | 21.7 |
2017-07-27 | 64756 | 2.422 | -73.74 | 41.79 | 24.2 | 15.2 | 19.7 | 19.5 | 0.0 | 8.28 | ... | 0.144 | 0.122 | 0.109 | 0.118 | 0.154 | 22.5 | 22.7 | 22.4 | 22.0 | 21.4 |
2017-07-28 | 64756 | 2.422 | -73.74 | 41.79 | 26.5 | 16.9 | 21.7 | 20.9 | 0.0 | 21.06 | ... | 0.137 | 0.117 | 0.110 | 0.119 | 0.154 | 24.1 | 24.1 | 22.8 | 22.0 | 21.3 |
2017-07-29 | 64756 | 2.422 | -73.74 | 41.79 | 24.2 | 10.4 | 17.3 | 18.1 | 0.0 | 21.28 | ... | 0.126 | 0.108 | 0.108 | 0.118 | 0.154 | 23.3 | 23.6 | 23.0 | 22.2 | 21.3 |
2017-07-30 | 64756 | 2.422 | -73.74 | 41.79 | 25.5 | 8.2 | 16.8 | 17.3 | 0.0 | 27.68 | ... | 0.113 | 0.099 | 0.104 | 0.117 | 0.154 | 22.8 | 23.0 | 22.4 | 22.0 | 21.3 |
2017-07-31 | 64756 | 2.422 | -73.74 | 41.79 | 29.4 | 10.1 | 19.7 | 20.1 | 0.0 | 25.49 | ... | 0.101 | 0.090 | 0.099 | 0.116 | 0.153 | 23.8 | 23.8 | 22.7 | 21.9 | 21.2 |
31 rows × 27 columns
df.describe()
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_5_DAILY | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 365.0 | 365.000000 | 3.650000e+02 | 3.650000e+02 | 364.000000 | 364.000000 | 364.000000 | 364.000000 | 364.000000 | 364.000000 | ... | 317.000000 | 317.000000 | 336.000000 | 364.000000 | 359.000000 | 364.000000 | 364.000000 | 364.000000 | 364.000000 | 364.000000 |
mean | 64756.0 | 2.470767 | -7.374000e+01 | 4.179000e+01 | 15.720055 | 4.037912 | 9.876374 | 9.990110 | 2.797802 | 13.068187 | ... | 0.189498 | 0.183991 | 0.165470 | 0.140192 | 0.160630 | 12.312637 | 12.320604 | 12.060165 | 11.978022 | 11.915659 |
std | 0.0 | 0.085997 | 5.265234e-13 | 3.842198e-13 | 10.502087 | 9.460676 | 9.727451 | 9.619168 | 7.238628 | 7.953074 | ... | 0.052031 | 0.054113 | 0.043989 | 0.020495 | 0.016011 | 9.390034 | 9.338176 | 8.767752 | 8.078346 | 7.187317 |
min | 64756.0 | 2.422000 | -7.374000e+01 | 4.179000e+01 | -12.300000 | -21.800000 | -17.000000 | -16.700000 | 0.000000 | 0.100000 | ... | 0.075000 | 0.078000 | 0.087000 | 0.101000 | 0.117000 | -0.700000 | -0.400000 | 0.200000 | 0.900000 | 1.900000 |
25% | 64756.0 | 2.422000 | -7.374000e+01 | 4.179000e+01 | 6.900000 | -2.775000 | 2.100000 | 2.275000 | 0.000000 | 6.225000 | ... | 0.152000 | 0.139000 | 0.118750 | 0.118000 | 0.154000 | 2.225000 | 2.000000 | 2.475000 | 3.300000 | 4.100000 |
50% | 64756.0 | 2.422000 | -7.374000e+01 | 4.179000e+01 | 17.450000 | 4.350000 | 10.850000 | 11.050000 | 0.000000 | 12.865000 | ... | 0.192000 | 0.198000 | 0.183000 | 0.147500 | 0.165000 | 13.300000 | 13.350000 | 13.100000 | 12.850000 | 11.600000 |
75% | 64756.0 | 2.422000 | -7.374000e+01 | 4.179000e+01 | 24.850000 | 11.900000 | 18.150000 | 18.450000 | 1.400000 | 19.740000 | ... | 0.234000 | 0.227000 | 0.203000 | 0.157000 | 0.173000 | 21.025000 | 21.125000 | 20.400000 | 19.800000 | 19.325000 |
max | 64756.0 | 2.622000 | -7.374000e+01 | 4.179000e+01 | 33.400000 | 20.700000 | 25.700000 | 26.700000 | 65.700000 | 29.910000 | ... | 0.296000 | 0.321000 | 0.235000 | 0.182000 | 0.192000 | 27.600000 | 27.400000 | 25.600000 | 24.100000 | 22.100000 |
8 rows × 26 columns
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(14,14))
df.iloc[:, 4:8].boxplot(ax=ax[0,0])
df.iloc[:, 10:14].boxplot(ax=ax[0,1])
df.iloc[:, 14:17].boxplot(ax=ax[1,0])
df.iloc[:, 18:22].boxplot(ax=ax[1,1])
ax[1, 1].set_xticklabels(ax[1, 1].get_xticklabels(), rotation=90);
df[['T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_MAX']].plot()
<AxesSubplot:xlabel='LST_DATE'>
Pandas can quickly resample time series data to a different frequency:
df_mm = df.resample('MS').mean() #resample monthly and take a mean
df_mm[['T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_MAX']].plot()
<AxesSubplot:xlabel='LST_DATE'>
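The resampling pattern above generalizes to any frequency string. Here is a minimal sketch on a synthetic daily series (not the USCRN data above), showing monthly means and weekly maxima:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for illustration: 59 days starting Jan 1
idx = pd.date_range('2021-01-01', periods=59, freq='D')
ts = pd.Series(np.arange(59, dtype=float), index=idx)

# 'MS' anchors each bin at the month start; 'W' gives weekly bins
monthly_mean = ts.resample('MS').mean()
weekly_max = ts.resample('W').max()

print(monthly_mean)  # one row per month (January and February)
```

The frequency string (`'MS'`, `'W'`, `'D'`, ...) controls the bin edges, and the chained method (`.mean()`, `.max()`, `.sum()`, ...) controls how the values within each bin are combined.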
One of the most powerful functions of Pandas is groupby, which allows you to perform operations on data grouped by different attributes. It takes practice to get "good" at groupby, but it's worth it. First, let's pull in some large earthquake catalog data.
df = pd.read_csv('http://www.ldeo.columbia.edu/~rpa/usgs_earthquakes_2014.csv', parse_dates=['time'], index_col='id')
df['country'] = df.place.str.split(', ').str[-1]
df_small = df[df.mag<4]
df = df[df.mag>4]
df
time | latitude | longitude | depth | mag | magType | nst | gap | dmin | rms | net | updated | place | type | country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||
usc000mqlp | 2014-01-31 23:08:03.660 | -4.9758 | 153.9466 | 110.18 | 4.2 | mb | NaN | 98.0 | 1.940 | 0.61 | us | 2014-04-08T01:43:19.000Z | 115km ESE of Taron, Papua New Guinea | earthquake | Papua New Guinea |
usc000mqln | 2014-01-31 22:54:32.970 | -28.1775 | -177.9058 | 95.84 | 4.3 | mb | NaN | 104.0 | 1.063 | 1.14 | us | 2014-04-08T01:43:19.000Z | 120km N of Raoul Island, New Zealand | earthquake | New Zealand |
usc000mqls | 2014-01-31 22:49:49.740 | -23.1192 | 179.1174 | 528.34 | 4.4 | mb | NaN | 80.0 | 5.439 | 0.95 | us | 2014-04-08T01:43:19.000Z | South of the Fiji Islands | earthquake | South of the Fiji Islands |
usc000mf1x | 2014-01-31 22:19:44.330 | 51.1569 | -178.0910 | 37.50 | 4.2 | mb | NaN | NaN | NaN | 0.83 | us | 2014-04-08T01:43:19.000Z | 72km E of Amatignak Island, Alaska | earthquake | Alaska |
usc000mqlm | 2014-01-31 21:56:44.320 | -4.8800 | 153.8434 | 112.66 | 4.3 | mb | NaN | 199.0 | 1.808 | 0.79 | us | 2014-04-08T01:43:19.000Z | 100km ESE of Taron, Papua New Guinea | earthquake | Papua New Guinea |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
usc000t6yh | 2014-12-01 02:56:07.950 | 21.2031 | 143.5484 | 11.05 | 4.4 | mb | NaN | 107.0 | 5.996 | 0.87 | us | 2015-02-24T00:35:14.040Z | 158km WNW of Farallon de Pajaros, Northern Mar... | earthquake | Northern Mariana Islands |
usc000t6y2 | 2014-12-01 01:50:23.380 | -7.8798 | 106.4275 | 52.10 | 4.3 | mb | NaN | 119.0 | 1.412 | 1.50 | us | 2015-02-24T00:35:14.040Z | 57km SSW of Cibungur, Indonesia | earthquake | Indonesia |
usc000t6y1 | 2014-12-01 01:04:17.890 | 7.1429 | 126.8844 | 176.67 | 4.3 | mb | NaN | 134.0 | 1.297 | 0.87 | us | 2015-02-24T00:35:14.040Z | 37km ESE of Santiago, Philippines | earthquake | Philippines |
usb000t1gp | 2014-12-01 00:40:02.720 | 37.2096 | 71.9458 | 95.57 | 4.2 | mb | NaN | 125.0 | 1.097 | 0.91 | us | 2015-02-24T00:35:14.040Z | 11km ESE of Roshtqal'a, Tajikistan | earthquake | Tajikistan |
usc000t6yn | 2014-12-01 00:24:33.140 | -24.6340 | -179.6018 | 470.86 | 4.5 | mb | NaN | 131.0 | 10.547 | 0.74 | us | 2015-02-24T00:35:14.040Z | South of the Fiji Islands | earthquake | South of the Fiji Islands |
16371 rows × 15 columns
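The two preprocessing lines above combine a `.str` accessor (to pull the country out of the `place` string) with boolean masking (to keep only larger events). A minimal sketch of both on a toy catalog, not the USGS data:

```python
import pandas as pd

# Toy catalog for illustration only
toy = pd.DataFrame({
    'mag': [3.5, 4.2, 5.1, 6.0],
    'place': ['10km N of Town, Chile', 'Fiji region',
              '5km S of City, Japan', '20km E of Village, Chile'],
})

# The last comma-separated token of `place` approximates the country/region
toy['country'] = toy.place.str.split(', ').str[-1]

# A boolean mask keeps only the rows where the condition is True
big = toy[toy.mag > 4]
print(big.country.tolist())
```

Note that when `place` has no comma (like `'Fiji region'`), the whole string becomes the "country", which is why region names appear alongside countries in the groupby keys below.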
The workflow with groupby can be divided into three general steps:
Split: Partition the data into different groups based on some criterion.
Apply: Do some calculation within each group. Different types of “apply” steps might be:
(a) Aggregation: Get the mean or max within the group.
(b) Transformation: Normalize all the values within a group.
(c) Filtration: Eliminate some groups based on a criterion.
Combine: Put the results back together into a single object.
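The three steps above can be sketched on toy data in one line, since pandas performs the split, apply, and combine for you:

```python
import pandas as pd

# Toy data (not the earthquake catalog)
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                    'value': [1.0, 3.0, 2.0, 4.0, 6.0]})

# Split by 'group', apply .mean() within each group,
# combine the results into a Series indexed by the group keys
means = toy.groupby('group')['value'].mean()
print(means)  # a -> 2.0, b -> 4.0
```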
Here we split by country name
gb = df.groupby('country')
list(gb.groups.keys())
['', 'Afghanistan', 'Alaska', 'Albania', 'Algeria', 'American Samoa', 'Angola', 'Anguilla', 'Antarctica', 'Argentina', 'Arizona', 'Aruba', 'Ascension Island region', 'Australia', 'Azerbaijan', 'Azores Islands region', 'Azores-Cape St. Vincent Ridge', 'Balleny Islands region', 'Banda Sea', 'Bangladesh', 'Barbados', 'Barbuda', 'Bay of Bengal', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Bouvet Island', 'Bouvet Island region', 'Brazil', 'British Indian Ocean Territory', 'British Virgin Islands', 'Burma', 'Burundi', 'California', 'Canada', 'Cape Verde', 'Carlsberg Ridge', 'Cayman Islands', 'Celebes Sea', 'Central East Pacific Rise', 'Central Mid-Atlantic Ridge', 'Chagos Archipelago region', 'Chile', 'China', 'Christmas Island', 'Colombia', 'Comoros', 'Cook Islands', 'Costa Rica', 'Crozet Islands region', 'Cuba', 'Cyprus', 'Davis Strait', 'Democratic Republic of the Congo', 'Djibouti', 'Dominica', 'Dominican Republic', 'Drake Passage', 'East Timor', 'East of Severnaya Zemlya', 'East of the Kuril Islands', 'East of the North Island of New Zealand', 'East of the Philippine Islands', 'East of the South Sandwich Islands', 'Easter Island region', 'Eastern Greenland', 'Ecuador', 'Ecuador region', 'Egypt', 'El Salvador', 'Eritrea', 'Ethiopia', 'Falkland Islands region', 'Federated States of Micronesia region', 'Fiji', 'Fiji region', 'France', 'French Polynesia', 'French Southern Territories', 'Galapagos Triple Junction region', 'Georgia', 'Greece', 'Greenland', 'Greenland Sea', 'Guadeloupe', 'Guam', 'Guatemala', 'Gulf of Alaska', 'Haiti', 'Hawaii', 'Honduras', 'Iceland', 'Idaho', 'India', 'India region', 'Indian Ocean Triple Junction', 'Indonesia', 'Iran', 'Iraq', 'Italy', 'Japan', 'Japan region', 'Jordan', 'Kansas', 'Kazakhstan', 'Kermadec Islands region', 'Kosovo', 'Kuril Islands', 'Kyrgyzstan', 'Labrador Sea', 'Laptev Sea', 'Macedonia', 'Macquarie Island region', 'Malawi', 'Malaysia', 'Mariana Islands region', 'Martinique', 'Mauritania', 'Mauritius', 
'Mauritius - Reunion region', 'Mexico', 'Micronesia', 'Mid-Indian Ridge', 'Molucca Sea', 'Mongolia', 'Montana', 'Montenegro', 'Morocco', 'Mozambique', 'Mozambique Channel', 'Nepal', 'New Caledonia', 'New Mexico', 'New Zealand', 'Nicaragua', 'Niue', 'North Atlantic Ocean', 'North Indian Ocean', 'North Korea', 'North of Ascension Island', 'North of Franz Josef Land', 'North of New Zealand', 'North of Severnaya Zemlya', 'North of Svalbard', 'Northern East Pacific Rise', 'Northern Mariana Islands', 'Northern Mid-Atlantic Ridge', 'Northwest of Australia', 'Norway', 'Norwegian Sea', 'Off the coast of Central America', 'Off the coast of Ecuador', 'Off the coast of Oregon', 'Off the east coast of the North Island of New Zealand', 'Off the south coast of Australia', 'Off the west coast of northern Sumatra', 'Oklahoma', 'Oman', 'Oregon', 'Owen Fracture Zone region', 'Pacific-Antarctic Ridge', 'Pakistan', 'Palau', 'Palau region', 'Panama', 'Papua New Guinea', 'Peru', 'Peru-Ecuador border region', 'Philippine Islands region', 'Philippines', 'Poland', 'Portugal', 'Portugal region', 'Prince Edward Islands', 'Prince Edward Islands region', 'Puerto Rico', 'Republic of the Congo', 'Reykjanes Ridge', 'Romania', 'Russia', 'Russia region', 'Saint Helena', 'Saint Lucia', 'Saint Vincent and the Grenadines', 'Samoa', 'Santa Cruz Islands region', 'Saudi Arabia', 'Scotia Sea', 'Sea of Okhotsk', 'Serbia', 'Slovenia', 'Socotra region', 'Solomon Islands', 'Somalia', 'South Africa', 'South Atlantic Ocean', 'South Carolina', 'South Georgia Island region', 'South Georgia and the South Sandwich Islands', 'South Indian Ocean', 'South Napa Earthquake', 'South Sandwich Islands', 'South Sandwich Islands region', 'South Shetland Islands', 'South Sudan', 'South of Africa', 'South of Australia', 'South of Panama', 'South of Tasmania', 'South of Tonga', 'South of the Fiji Islands', 'South of the Kermadec Islands', 'South of the Mariana Islands', 'Southeast Indian Ridge', 'Southeast central Pacific 
Ocean', 'Southeast of Easter Island', 'Southern East Pacific Rise', 'Southern Mid-Atlantic Ridge', 'Southern Pacific Ocean', 'Southwest Indian Ridge', 'Southwest of Africa', 'Southwest of Australia', 'Southwestern Atlantic Ocean', 'Spain', 'Sudan', 'Svalbard and Jan Mayen', 'Sweden', 'Syria', 'Taiwan', 'Tajikistan', 'Tanzania', 'Thailand', 'Tonga', 'Tonga region', 'Trinidad and Tobago', 'Tristan da Cunha region', 'Turkey', 'Turkmenistan', 'Uganda', 'Ukraine', 'United Kingdom', 'Utah', 'Uzbekistan', 'Vanuatu', 'Vanuatu region', 'Venezuela', 'Vietnam', 'Wallis and Futuna', 'West Chile Rise', 'West of Australia', 'West of Macquarie Island', 'West of Vancouver Island', 'West of the Galapagos Islands', 'Western Australia', 'Western Indian-Antarctic Ridge', 'Yemen', 'Zambia', 'north of Ascension Island', 'northern Mid-Atlantic Ridge', 'south of Panama', 'western Xizang']
Then we aggregate: in this case we take the magnitude and find the ten countries with the largest earthquakes. Doing groupby first ensures that we don't just get the ten largest earthquakes overall. We group by country, find the max for each group, then take the ten largest.
gb.mag.max().nlargest(10)
country
Chile                         8.2
Alaska                        7.9
Solomon Islands               7.6
Papua New Guinea              7.5
El Salvador                   7.3
Mexico                        7.2
Fiji                          7.1
Indonesia                     7.1
Southern East Pacific Rise    7.0
                              6.9
Name: mag, dtype: float64
Multiple aggregation functions can also be applied at once:
gb.mag.aggregate([np.min, np.max, np.mean])
amin | amax | mean | |
---|---|---|---|
country | |||
4.1 | 6.9 | 4.582544 | |
Afghanistan | 4.1 | 5.6 | 4.410656 |
Alaska | 4.1 | 7.9 | 4.515025 |
Albania | 4.1 | 5.0 | 4.391667 |
Algeria | 4.1 | 5.5 | 4.583333 |
... | ... | ... | ... |
Zambia | 4.1 | 5.3 | 4.457143 |
north of Ascension Island | 4.4 | 4.4 | 4.400000 |
northern Mid-Atlantic Ridge | 4.7 | 4.7 | 4.700000 |
south of Panama | 4.1 | 4.1 | 4.100000 |
western Xizang | 4.5 | 4.5 | 4.500000 |
262 rows × 3 columns
This can be combined with a plotting function all in a single line of code
df.groupby('country').mag.count().nlargest(20).plot(kind='bar', figsize=(12,6)) #bar graph of countries by number of earthquakes
<AxesSubplot:xlabel='country'>
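The chain above (groupby, count, nlargest, plot) reads left to right as a small pipeline. A plot-free sketch of the same chain on toy data:

```python
import pandas as pd

# Toy catalog for illustration only
toy = pd.DataFrame({
    'country': ['Chile', 'Chile', 'Japan', 'Fiji', 'Chile', 'Japan'],
    'mag': [5.0, 6.1, 4.5, 4.2, 7.0, 5.5],
})

# Count events per country, then keep the two busiest countries
counts = toy.groupby('country').mag.count().nlargest(2)
print(counts)  # Chile 3, Japan 2
```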
The key difference between aggregation and transformation is that aggregation returns a smaller object than the original, indexed by the group keys, while transformation returns an object with the same index (and same size) as the original. Groupby + transformation is useful when applying an operation that requires information about the whole group. Here we normalize earthquake magnitudes within each country group:
def normalize(x):
return (x - x.mean())/x.std()
mag_normalized_by_country = gb.mag.transform(normalize)
mag_normalized_by_country
id
usc000mqlp   -0.915774
usc000mqln   -0.675696
usc000mqls   -0.282385
usc000mf1x   -0.684915
usc000mqlm   -0.666807
                ...
usc000t6yh   -0.281723
usc000t6y2   -0.451617
usc000t6y1   -0.627247
usb000t1gp   -0.629262
usc000t6yn    0.043277
Name: mag, Length: 16371, dtype: float64
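The size contrast between the two "apply" styles is easy to verify on toy data (not the earthquake catalog): aggregation yields one row per group, transformation yields one row per original row.

```python
import pandas as pd

# Toy data: two groups, three rows
toy = pd.DataFrame({'group': ['a', 'a', 'b'],
                    'value': [1.0, 3.0, 5.0]})
gb_toy = toy.groupby('group')['value']

agg_result = gb_toy.mean()                                # length = number of groups
trans_result = gb_toy.transform(lambda x: x - x.mean())   # length = len(toy)

print(len(agg_result), len(trans_result))  # 2 3
```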