This tutorial gives an overview of two methods of preparing sample data for use in the tutorial notebooks which demonstrate each scoring function. The result of this tutorial will be two files - "forecast_grid.nc" and "analysis_grid.nc". The two methods covered are (a) downloading data from the Australian National Computational Infrastructure, and (b) generating synthetic data.
Numerical weather prediction data is large in size, and comes in many formats and structures. The data in these tutorials starts with typical model data outputs and ensembles; this will be made clearer with an example.
Downloaded data is usually covered by a license agreement. You should check the license agreements for the downloaded data in these examples.
import hashlib # Used to check the downloaded file is what is expected
import matplotlib # Used to improve plot outputs
import numpy # Used if generating synthetic data
import os # Used to check if files are already downloaded
import pandas # Used if generating synthetic data
import requests # Used to retrieve files
import xarray # Used for opening and inspecting the data
# This is a basic file fetch only. Do not rely on it for untrusted data. A basic check has been included to make sure the
# downloaded file matches what was expected.
def basic_fetch(url, filename, expected_hash):
    if os.path.exists(filename):
        print("File already exists, skipping download")
        return
    response = requests.get(url, allow_redirects=True)
    if not response.ok:
        print("Download failed with status code", response.status_code)
        return
    # This will write the file out to the current directory, typically where this notebook is being run
    with open(filename, 'wb') as outfile:
        outfile.write(response.content)
    with open(filename, 'rb') as infile:
        found_hash = hashlib.sha256(infile.read()).hexdigest()
    if found_hash != expected_hash:
        os.remove(filename)
        print("File has unexpected contents. The file has been removed - please download manually and check the data carefully")
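If you fetch a different file and want to record its expected hash for `basic_fetch`, you can compute the digest the same way the function checks it. A minimal sketch, using a throwaway temporary file (the contents here are purely illustrative):

```python
import hashlib
import os
import tempfile

# Compute the sha256 hexdigest of a file - the same value that
# basic_fetch compares against expected_hash
def sha256_of_file(path):
    with open(path, 'rb') as infile:
        return hashlib.sha256(infile.read()).hexdigest()

# Demonstrate on a throwaway temporary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'example bytes')
    path = tmp.name
digest = sha256_of_file(path)
print(digest)
os.remove(path)
```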
forecast_url = 'https://dapds00.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/20221120/0000/fc/sfc/temp_scrn.nc'
forecast_hash = '7956d95ea3a7edee2a01c989b1f9e089199da5b1924b4c2d4611088713fbcb44' # Recorded on 13/5/2023
analysis_url = 'https://dapds00.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/20221124/0000/an/sfc/temp_scrn.nc'
analysis_hash = '163c5de55e721ad2a76518242120044bedfec805e3397cfb0008435521630042' # Recorded on 13/5/2023
%%time
basic_fetch(forecast_url, 'forecast_grid.nc', forecast_hash)
CPU times: user 3.83 s, sys: 1.94 s, total: 5.77 s Wall time: 13.5 s
%%time
basic_fetch(analysis_url, 'analysis_grid.nc', analysis_hash)
CPU times: user 28.2 ms, sys: 4.87 ms, total: 33 ms Wall time: 285 ms
forecast = xarray.open_dataset('forecast_grid.nc')
forecast.temp_scrn
<xarray.DataArray 'temp_scrn' (time: 240, lat: 1536, lon: 2048)>
[754974720 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 2022-11-20T01:00:00 ... 2022-11-30
  * lat      (lat) float64 89.94 89.82 89.71 89.59 ... -89.71 -89.82 -89.94
  * lon      (lon) float64 0.08789 0.2637 0.4395 0.6152 ... 359.6 359.7 359.9
Attributes:
    grid_type:   spatial
    level_type:  single
    units:       K
    long_name:   screen level temperature
    stash_code:  3236
    accum_type:  instantaneous
forecast.temp_scrn[0].plot()
<matplotlib.collections.QuadMesh at 0x7fdc28ecf7c0>
# The forecast contains 240 hours of forecast data, spanning 1am on 20 November 2022 to midnight on 30 November 2022.
print(len(forecast.time.values))
print(min(forecast.time.values))
print(max(forecast.time.values))
240
2022-11-20T01:00:00.000000000
2022-11-30T00:00:00.000000000
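The time axis described above can be sanity-checked without the downloaded file: an hourly series over that span should contain exactly 240 steps, each one hour apart. A minimal sketch, rebuilding the time axis in memory rather than reading it from "forecast_grid.nc":

```python
import numpy
import pandas

# Rebuild the forecast time axis described above and confirm it is hourly
times = pandas.date_range(start='2022-11-20T01:00:00',
                          end='2022-11-30T00:00:00', freq='h')
steps = numpy.diff(times.values)
print(len(times))                                  # number of forecast steps
print((steps == numpy.timedelta64(1, 'h')).all())  # evenly spaced, one hour apart
```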
analysis = xarray.open_dataset('analysis_grid.nc')
analysis.temp_scrn
<xarray.DataArray 'temp_scrn' (time: 1, lat: 1536, lon: 2048)>
[3145728 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 2022-11-24
  * lat      (lat) float64 89.94 89.82 89.71 89.59 ... -89.71 -89.82 -89.94
  * lon      (lon) float64 0.08789 0.2637 0.4395 0.6152 ... 359.6 359.7 359.9
Attributes:
    grid_type:   spatial
    level_type:  single
    units:       K
    long_name:   screen level temperature
    stash_code:  3236
    accum_type:  instantaneous
analysis.temp_scrn.plot()
<matplotlib.collections.QuadMesh at 0x7fdc1fdf4fa0>
# This is the analysis grid for midnight on 24 November 2022
# It represents what actually happened, four days after the prediction was first made, and can be used to check accuracy
analysis.temp_scrn.time.values
array(['2022-11-24T00:00:00.000000000'], dtype='datetime64[ns]')
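To compare the forecast with this analysis, you need the forecast step valid at the same time. A minimal sketch of that selection with xarray's `.sel`, using a tiny in-memory stand-in (a 2x2 grid with placeholder lat/lon values) rather than the downloaded files:

```python
import numpy
import pandas
import xarray

# A tiny synthetic stand-in for the real forecast grid (2x2 instead of 1536x2048)
times = pandas.date_range(start='2022-11-20T01:00:00',
                          end='2022-11-30T00:00:00', freq='h')
forecast = xarray.Dataset(
    {'temp_scrn': (('time', 'lat', 'lon'), numpy.zeros((240, 2, 2)))},
    coords={'time': times, 'lat': [-45.0, 45.0], 'lon': [90.0, 270.0]},
)

# Select the forecast step valid at the same time as the analysis grid
matching = forecast.temp_scrn.sel(time='2022-11-24T00:00:00')
print(matching.time.values)
```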
# This data extends from -90 degrees latitude to +90 degrees latitude (with grid points just inside the poles),
# and from 0 to 360 degrees longitude.
# There are 1536 points of latitude, and 2048 points of longitude.
print(min(analysis.temp_scrn.lat.values))
print(max(analysis.temp_scrn.lat.values))
print(len(analysis.temp_scrn.lat.values))
print(min(analysis.temp_scrn.lon.values))
print(max(analysis.temp_scrn.lon.values))
print(len(analysis.temp_scrn.lon.values))
-89.94140600000904
89.941406
1536
0.087891
359.912109
2048
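The bounds printed above are consistent with a regular grid whose points sit at cell centres: half a grid spacing in from the poles and from 0 degrees longitude. A quick sketch of that arithmetic (the cell-centre layout is an inference from the printed values, not something stated in the file):

```python
# Assume a regular grid with points at cell centres
n_lat, n_lon = 1536, 2048
lat_spacing = 180.0 / n_lat   # 0.1171875 degrees
lon_spacing = 360.0 / n_lon   # 0.17578125 degrees

# The first grid points sit half a cell in from -90 latitude and 0 longitude
first_lat = -90.0 + lat_spacing / 2
first_lon = 0.0 + lon_spacing / 2
print(first_lat, first_lon)   # close to the -89.9414... and 0.087891 printed above
```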
Synthetic data is easy to generate quickly, can be customised to focus on a specific example, and doesn't require any time to download. However, it can be less realistic and therefore may not capture the complexity of a real-world situation.
This method will generate an array of shape 240 x 1536 x 2048, closely matching the domain of the ACCESS-G model, and write it to a file called "forecast_grid.nc". It will then generate another array of shape 1 x 1536 x 2048 and write it to a file called "analysis_grid.nc". These files can then stand in for the real-world data shown in the previous step. There may be some differences, but they should be sufficient for demonstration purposes.
lat = numpy.linspace(-90, 90, 1536)
lon = numpy.linspace(0, 360, 2048)
slow_decrease = numpy.linspace(26, 18, 240)
time_series = pandas.date_range(start='2022-11-20T01:00:00.000000000',
end='2022-11-30T00:00:00.000000000',
periods=240)
analysis_time = [time_series[24*4-1].to_pydatetime()] # November 24th at midnight
# We will use random noise to provide some variation to the data
random_number_generator = numpy.random.default_rng(seed=42)
random = random_number_generator.random((1, 1536, 2048))
%%time
# The analysis field is around 24 degrees, plus some random noise. Ideally this would include some more structure.
analysis_values = random + 24
# The forecast contains a slowly decreasing trend on average which will give us some difference for comparison
forecast_values = numpy.array([random[0] + trend for trend in slow_decrease])
CPU times: user 857 ms, sys: 606 ms, total: 1.46 s Wall time: 1.47 s
analysis_array = xarray.DataArray(coords={'time': analysis_time, 'lat': lat, 'lon': lon, }, data=analysis_values)
analysis = xarray.Dataset(data_vars={'temp_scrn': analysis_array})
analysis.to_netcdf('analysis_grid.nc')
analysis
<xarray.Dataset>
Dimensions:    (time: 1, lat: 1536, lon: 2048)
Coordinates:
  * time       (time) datetime64[ns] 2022-11-24
  * lat        (lat) float64 -90.0 -89.88 -89.77 -89.65 ... 89.77 89.88 90.0
  * lon        (lon) float64 0.0 0.1759 0.3517 0.5276 ... 359.6 359.8 360.0
Data variables:
    temp_scrn  (time, lat, lon) float64 24.77 24.44 24.86 ... 24.07 24.39 24.2
analysis.temp_scrn.plot()
<matplotlib.collections.QuadMesh at 0x7fdc16cd8d00>
forecast_array = xarray.DataArray(coords={'time': time_series, 'lat': lat, 'lon': lon, }, data=forecast_values)
forecast = xarray.Dataset(data_vars={'temp_scrn': forecast_array})
forecast.to_netcdf('forecast_grid.nc')
forecast
<xarray.Dataset>
Dimensions:    (time: 240, lat: 1536, lon: 2048)
Coordinates:
  * time       (time) datetime64[ns] 2022-11-20T01:00:00 ... 2022-11-30
  * lat        (lat) float64 -90.0 -89.88 -89.77 -89.65 ... 89.77 89.88 90.0
  * lon        (lon) float64 0.0 0.1759 0.3517 0.5276 ... 359.6 359.8 360.0
Data variables:
    temp_scrn  (time, lat, lon) float64 26.77 26.44 26.86 ... 18.07 18.39 18.2
forecast.temp_scrn[0].plot(cmap=matplotlib.pyplot.cm.Blues, vmin=22, vmax=27)
forecast.temp_scrn[0].mean()
<xarray.DataArray 'temp_scrn' ()>
array(26.49989251)
Coordinates:
    time     datetime64[ns] 2022-11-20T01:00:00
forecast.temp_scrn[24*4-1].plot(cmap=matplotlib.pyplot.cm.Blues, vmin=22, vmax=27)
forecast.temp_scrn[24*4-1].mean()
<xarray.DataArray 'temp_scrn' ()>
array(23.31997619)
Coordinates:
    time     datetime64[ns] 2022-11-24
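Because the synthetic forecast and analysis share the same noise field, their difference at the matching time step is a uniform bias equal to `slow_decrease[95] - 24`. A small sketch of that comparison, recomputing the synthetic recipe in memory on a reduced 16x16 grid for speed:

```python
import numpy

# Recreate the synthetic recipe on a small grid
rng = numpy.random.default_rng(seed=42)
random = rng.random((1, 16, 16))
slow_decrease = numpy.linspace(26, 18, 240)

analysis_values = random[0] + 24
forecast_step = random[0] + slow_decrease[24 * 4 - 1]  # step valid at the analysis time

# The shared noise cancels, leaving a uniform bias
bias = forecast_step - analysis_values
print(round(float(bias.mean()), 4))   # slow_decrease[95] - 24, about -1.18
```

This is the kind of forecast-minus-analysis comparison the scoring tutorials build on.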