This notebook will introduce you to the Python programming language and useful Python libraries for processing and visualising data that we will use during this training school. You will learn about:
Are you new to programming with Python? Don't worry, there are many good resources online for you to catch up on before the start of the training school. Having a basic understanding of Python is all you need and there will be plenty of time to ask questions.
The most important things to understand are linked below:
If you would like to dive deeper into Python, you can check out the following free courses:
Python libraries (also known as packages) are similar to a toolbox or a drawer of utensils you find in your kitchen. They contain a set of tools for a specific purpose. For example, you might use one set of tools for cooking a meal and another for replacing a tire on a car. You would not use your cooking utensils to replace a tire, nor the other way around. Thus, finding the right set of tools, or the right libraries, for what you want to do is important. In this training school, you will use Python libraries for processing data and for visualising data. In this section, you will learn about:
In order to use the tools in a library, you need to first import it at the start of your Jupyter notebook. In fact, the first code cell in the workflow notebooks imports the libraries you will need in the workflow. The benefit of importing all the libraries upfront is that you can see immediately which libraries are required for the entire workflow to run.
The simplest way to import a library is to use the command `import` followed by the name of the library. In the code cell below, the library `numpy` is imported.
```python
import numpy
```
If you want to give the library a different name to reduce the number of letters you need to type when using a function from that library, you can specify an alternate name (or alias) with `as`.
```python
import numpy as np
```
A function is like a tool from the toolbox, such as a wrench, or a measuring cup from your cooking drawer. To use a function, you need to first specify the library name, as different libraries may have the same function name. This would be similar to `toolbox.scissors` versus `kitchen.scissors`, where you specify which pair of scissors you want to use: the one from the toolbox or the one from the kitchen. If you didn't want to type out `toolbox` each time, you could use `tb` as an alias.
In the case of the library `numpy`, it now has the alias `np`. Instead of having to type `numpy.arange(6)` to use the `arange` function, you can just type `np.arange(6)`.
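As a quick check that both import styles behave identically, the following sketch calls `arange` through both names:

```python
import numpy
import numpy as np

# Both names refer to the same library; only the spelling differs.
a = numpy.arange(6)
b = np.arange(6)
print(b)  # [0 1 2 3 4 5]
```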
You can also import just specific parts of a library. These parts are also known as modules; a library can be made up of several modules. For example, we can import just the `colors` module from `matplotlib` as follows:
```python
import matplotlib.colors
```
Another way to import just a part of a library is to use `from ... import ...`. This imports a single object from a module. In the code cell below, the class `Axes` is imported from the `matplotlib.axes` module.
```python
from matplotlib.axes import Axes
```
Note: In order to import a library, you need to download and install the library beforehand. In the JupyterLab environment prepared for you, the libraries have been preinstalled. If you want to replicate this environment on your own desktop or laptop to work locally, please follow the instructions in this notebook.
Often, the data you download may need to be processed into a different form to answer the questions you have. For example, you may need to filter out missing data values or combine data from different satellite instruments into a single time series. This is where data processing libraries come in handy. The following data processing libraries are useful for reading and manipulating various data types:
The NumPy library provides a "universal standard for working with numerical data in Python" (Source). The library contains multidimensional array and matrix data structures as well as a wide variety of mathematical operations for arrays and matrices. Numerous other libraries including Pandas and Matplotlib use the data structures and operations from NumPy.
Fun fact: NumPy's full name is "Numerical Python".
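As a brief illustration (the temperature values here are made up), NumPy arrays support element-wise arithmetic on whole arrays at once:

```python
import numpy as np

# A 2 x 3 array of (made-up) temperatures in degrees Celsius.
temperatures_celsius = np.array([[10.0, 15.0, 20.0],
                                 [25.0, 30.0, 35.0]])

# Adding a scalar applies the operation to every element (broadcasting).
temperatures_kelvin = temperatures_celsius + 273.15

print(temperatures_kelvin.shape)  # (2, 3)
```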
The Pandas library is commonly used for working with tabular data, that is, data that has rows and columns, such as data stored in a Comma-Separated Values (CSV) file.
The two core data structures of Pandas are:

- `Series`, which "is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)" (Source)
- `DataFrame`, which "is a 2-dimensional labeled data structure with columns of potentially different types", similar to a spreadsheet or table. (Source)
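A minimal sketch of both structures, using made-up aerosol optical depth (AOD) values and station names:

```python
import pandas as pd

# A Series: a one-dimensional labeled array (here labeled by date strings).
aod = pd.Series([0.12, 0.34, 0.08],
                index=["2024-01-01", "2024-01-02", "2024-01-03"])

# A DataFrame: a 2-dimensional labeled structure whose columns can hold
# different types (here strings and floats).
stations = pd.DataFrame({
    "station": ["Athens", "Berlin", "Cairo"],
    "aod_550nm": [0.34, 0.12, 0.51],
})

print(stations["aod_550nm"].max())  # 0.51
```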
The xarray library is very useful for working with multi-dimensional data, for example data stored in a Network Common Data Form (NetCDF) file.
A few things to note:

- An xarray `DataArray` is a "multi-dimensional array with labeled or named dimensions." (Source) Examples of dimensions include “time”, “latitude” or “longitude”.
- An xarray `Dataset` is like a bag or container which can hold multiple xarray `DataArray` objects.
- Metadata in the form of attributes can be found in both `DataArray` and `Dataset` objects.
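The points above can be sketched with a small synthetic example (the values, coordinates and attribute contents are made up):

```python
import numpy as np
import xarray as xr

# A DataArray: values plus named dimensions, labeled coordinates and attributes.
field = xr.DataArray(
    np.zeros((2, 3)),
    dims=("latitude", "longitude"),
    coords={"latitude": [10.0, 20.0], "longitude": [0.0, 5.0, 10.0]},
    attrs={"long_name": "example field"},  # metadata lives in .attrs
)

# A Dataset: a container holding one or more DataArray objects.
ds = xr.Dataset({"example": field})

print(ds["example"].dims)  # ('latitude', 'longitude')
```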
The Satpy library enables you to read, manipulate and write data from satellite instruments. It is also useful for visualising this data in combination with other libraries.
The following are some things that you can do with Satpy (Source):

- Read data from various satellite file formats into xarray `DataArray` and `Dataset` classes
- Create Red-Green-Blue (RGB) or other types of image composites
- Apply atmospheric correction or visual enhancement to images
- Save output images to file formats such as PNG, GeoTIFF and NetCDF
- Resample data to geographic projected grids
Another data format you may encounter is a Hierarchical Data Format (HDF) file, especially for data from the United States. This data file format is "designed by the National Center for Supercomputing Applications (NCSA) to assist users in the storage and manipulation of scientific data across diverse operating systems and machines." (Source) The pyhdf library is useful for working with data in the HDF file format.
Some things to note about the HDF data format (Source):
- HDF supports a variety of data types including scientific data arrays, tables, text annotations and raster images
- It is self-describing, which means that it contains metadata describing the data in the file
After processing the data into a suitable form, you may wish to visualise this data in the form of a graph, map or animation using a visualisation library. Note that as many libraries use American English, the spelling used for "visualise" is often "visualize".
The Matplotlib library enables you to create Python visualisations that are static, animated, and interactive. Common uses of Matplotlib you may encounter in this training include generating figures such as line graphs, bar charts, scatterplots, maps or animations. You can also use Matplotlib to generate customised color ramps for your figures.
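A minimal line-graph sketch is shown below. The Agg backend is selected so the figure can be written to a file without a display; the data and filename are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to file, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.1)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
fig.savefig("sine.png")  # save the figure as a PNG image
```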
The Cartopy library is essential for plotting maps. As it is built on top of Matplotlib, you will often see code which uses both these libraries in the creation of a map.
Some things you can use Cartopy for:
- Defining projections of data
- Adding and customising coastlines and country borders on maps
- Adding and customising gridlines that show latitude and longitude on maps
In the training material, you may encounter a few other miscellaneous libraries. This section explains what these are used for.
The glob module is part of the standard Python library. It is used to search for files whose filenames match a certain pattern, for example, to return all filenames that contain a certain date or instrument name. Wildcard characters are often used for these searches.
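A minimal sketch, creating temporary files with made-up names and then matching them with the `*` wildcard:

```python
import glob
import os
import tempfile

# Create a few empty files in a temporary folder to search through.
folder = tempfile.mkdtemp()
for name in ["aod_20240101.nc", "aod_20240102.nc", "readme.txt"]:
    open(os.path.join(folder, name), "w").close()

# The * wildcard matches any sequence of characters, so this pattern
# matches the two NetCDF files but not the text file.
matches = sorted(glob.glob(os.path.join(folder, "aod_*.nc")))
print(len(matches))  # 2
```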
The warnings module is a part of the standard Python library. As the documentation describes, warnings are "typically issued in situations where it is useful to alert the user of some condition in a program, where that condition (normally) doesn’t warrant raising an exception and terminating the program." This means, warnings do not result in the Python program terminating but an error or exception does.
In general, the warnings filter can be used to show, ignore or convert warnings into errors. The following code block shows how you can use the warnings filter to ignore all warnings.
```python
import warnings
warnings.filterwarnings('ignore')
```
You can also use `simplefilter` to apply an action to a whole category of warnings. In the code block below, the filter ignores all runtime warnings.
```python
warnings.simplefilter(action="ignore", category=RuntimeWarning)
```
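To see the filter in action, the sketch below records emitted warnings and confirms that only the non-runtime warning gets through (the warning messages are made up):

```python
import warnings

# Record emitted warnings so we can inspect which ones pass the filter.
with warnings.catch_warnings(record=True) as caught:
    # Ignore only RuntimeWarning; other categories are unaffected.
    warnings.simplefilter(action="ignore", category=RuntimeWarning)
    warnings.warn("suppressed", RuntimeWarning)
    warnings.warn("still visible", UserWarning)

print(len(caught))  # 1
```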
The training school materials include a functions library, which is a collection of pre-defined functions that support the learner with data loading, pre-processing and visualisation. Learners with basic Python knowledge can apply these functions directly, providing only keyword arguments. Learners with more Python experience can examine the functions in a separate notebook or build their own functions.
For example, the following code block shows the function `generate_geographical_subset`, which is used to crop (subset) an xarray `DataArray` to a specific geographical region.
```python
def generate_geographical_subset(xarray, latmin, latmax, lonmin, lonmax, reassign=False):
    """
    Generates a geographical subset of an xarray.DataArray and, if the kwarg
    reassign=True, shifts the longitude grid from a 0-360 to a -180 to 180 deg grid.

    Parameters:
        xarray (xarray.DataArray): an xarray DataArray with latitude and longitude coordinates
        latmin, latmax, lonmin, lonmax (int): lat/lon boundaries of the geographical subset
        reassign (boolean): default is False

    Returns:
        Geographical subset of an xarray.DataArray.
    """
    if reassign:
        xarray = xarray.assign_coords(longitude=(((xarray.longitude + 180) % 360) - 180))
    return xarray.where((xarray.latitude < latmax) & (xarray.latitude > latmin) &
                        (xarray.longitude < lonmax) & (xarray.longitude > lonmin), drop=True)
```
To use this function, you need to provide keyword arguments such as the minimum and maximum longitudes (`lonmin` and `lonmax`), the minimum and maximum latitudes (`latmin` and `latmax`), as well as the variable that the xarray `DataArray` is stored in. In the following code block, which generates a geographical subset for Europe, the xarray `DataArray` is stored in the variable `dust_aod`. Thus, the keyword argument for the `xarray` parameter is `dust_aod`.
```python
europe_subset = generate_geographical_subset(xarray=dust_aod,
                                             latmin=28,
                                             latmax=71,
                                             lonmin=-22,
                                             lonmax=43)
```
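As a sketch of what this call does, the block below repeats the function and applies it to a small synthetic global `DataArray` standing in for `dust_aod` (which in the workflows is loaded from a real data file):

```python
import numpy as np
import xarray as xr

def generate_geographical_subset(xarray, latmin, latmax, lonmin, lonmax, reassign=False):
    # Same function as shown earlier, repeated so this example is self-contained.
    if reassign:
        xarray = xarray.assign_coords(longitude=(((xarray.longitude + 180) % 360) - 180))
    return xarray.where((xarray.latitude < latmax) & (xarray.latitude > latmin) &
                        (xarray.longitude < lonmax) & (xarray.longitude > lonmin), drop=True)

# Synthetic global field on a coarse 10-degree grid (values are placeholders).
lats = np.arange(-80.0, 81.0, 10.0)
lons = np.arange(-180.0, 180.0, 10.0)
dust_aod = xr.DataArray(
    np.zeros((len(lats), len(lons))),
    dims=("latitude", "longitude"),
    coords={"latitude": lats, "longitude": lons},
)

# Crop to the European bounding box used above.
europe_subset = generate_geographical_subset(xarray=dust_aod,
                                             latmin=28, latmax=71,
                                             lonmin=-22, lonmax=43)

# Only grid points strictly inside the boundaries remain.
print(float(europe_subset.latitude.min()), float(europe_subset.latitude.max()))  # 30.0 70.0
```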
The functions library is stored in a Jupyter notebook. You can import the functions library using code which specifies the path to the functions notebook. Note that you may have to change the path depending on where the functions notebook is stored. If it is stored in the same folder as the notebook you are working on, you can use:
```
%run ./functions.ipynb
```
If it is stored one folder or level above, you can use:
```
%run ../functions.ipynb
```
If it is stored two folders or levels above, you can use:
```
%run ../../functions.ipynb
```
This project is licensed under GNU General Public License v3.0 only and is developed under a Copernicus contract.