WIP: Making a ColocatedData object with Pyaerocom๏
Lewis Blake lewisb@met.no
This is a tutorial showing how the current version of pyaerocom can be used for creating colocated data objects which minimizes the risk of processing errors. It is a preliminary tutorial, and therefore subject to change as development continues. As the interface becomes better defined, this should simplify the process and changes will be reflected in updated versions of this tutorial. Feedback welcome.
Getting started๏
You may want to start by creating a Python virtual environment before begining and installing pyaerocom into it.
python3 -m venv --prompt cams2_82 .venv
source .venv/bin/activate
pip install pyaerocom
pya --help
[1]:
import os
import pandas as pd
import pyaerocom
import xarray as xr
Downloading example data๏
Letโs say we have done the collocation and stored the results in some object. As an example we will assume this is a netcdf file called my_colocated_data.nc
. For simplcity of the tutorial, such a colocated data object file is included with pyaerocom. In fact there are a few, however there is a my_colocated_data.nc
specifically included for this tutorial. Pyaerocom has a command line interface which is usefule for this. On the command line there are several functions available.
Our goal is to create a ColocatedData
object named od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS.nc
[2]:
! pya --help
Usage: pya [OPTIONS] COMMAND [ARGS]...
๐ฆ Pyaerocom Command Line Interface
โญโ Options โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ --version -V โ
โ --install-completion Install completion for the current shell. โ
โ --show-completion Show completion for the current shell, to โ
โ copy it or customize the installation. โ
โ --help Show this message and exit. โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
โญโ Commands โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โ aeroval Run an AeroVal experiment as described in a json config file โ
โ browse Browse database (e.g., browse <DATABASE>) โ
โ clearcache Delete cached data objects โ
โ getsampledata Downloads a minimal sample dataset. โ
โ init init ~/MyPyaerocom directory and copy the default paths.ini โ
โ there โ
โ listcache List cached data objects โ
โ ppiaccess Check if MET Norway's PPI can be accessed โ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
Ensure that pya/pyaerocom is at least version 0.25
[3]:
! pya -V
๐ฆ pyaerocom 0.25.0.dev0
The getsampledata
command will download the neccessary files into ./data/
.
[4]:
! pya getsampledata
[5]:
filepath = "./data/testdata-minimal/modeldata/colocated_netcdf/my_colocated_data.nc"
data = xr.open_dataarray(filepath)
Converting to ColocatedData๏
Note that while included in pyaerocom, this data is not actually CF-compliant. For example, latitude and longitude have units โdegreesโ, but really should be โdegrees_northโ and โdegrees_eastโ respectively. Letโs fix that.
[6]:
data.latitude.attrs["units"] = "degrees_north"
data.longitude.attrs["units"] = "degrees_east"
The general format for ColocateData
objects file name is "{obs_var_name}_{mod_var_name}_MOD-{model_name}_REF-{obs_name}_{date_start]_{date_end}_{ts_type}_{filter_name}.nc"
obs_var_name
: Aerocom name for the obs variable. Can be found by searching here: https://github.com/metno/pyaerocom/blob/main-dev/pyaerocom/data/variables.inimod_var_name
: Aerocom name for model variable.model_name
: Name of the modelobs_name
: Name of the observation networkdate_start
: Start of the timerseries in the colocated array formated YYYYMMDDdate_end
: End of the timeseries in the colocated array formated YYYYMMDDts_type
: Time series type (frequency), for example โdailyโ or โmonthlyโ.filter_name
: name of the region filter applied. If no spatial filering applied, should be the string โALL-wMOUNTAINSโ.
Note that the fields provided here in the name must match their associated values in the metadata below. For exmaple, the model_name
here in the name needs to match the model_name
in the metadata.
A valid (and encouraged) way to create a ColocatedData
object is to convert it into an xarray DataArray with the correct structure, dimensions, and metadata. You will have your object stored in some format. The task of CAMS2_82 partners will then be to do translation of their colocated data into this format.
In the current design, such A ColocatedData
object comprises 3 or 4 dimensions.
The first dimension (
data_source
, index 0) is ALWAYS length 2 and specifies the two datasets that were co-located (index 0 is obs, index 1 is model).The second dimension is
time
.3rd dimension is
station_name
in case of 3D colocated data. For 4D colocated data the 3rd and 4th dimension are latitude and longitude, respectively.
3D colocated data is typically created when a model is colocated with station based ground based observations. 4D colocated data is created when a model is colocated with another model or satellite observations, that cover large parts of Earthโs surface (other than discrete lat/lon pairs in the case of ground based station locations).
We may be curious about the valid potential options available to us. We can check the fields in a few relevant classes to determine what these options are and to check which data types they are allowed to be.
The documentation for the ColocationSetup
class can be found here: https://pyaerocom.readthedocs.io/en/latest/api.html#pyaerocom.colocation.colocation_setup.ColocationSetup
[7]:
from pyaerocom import ColocationSetup, ColocatedData
[8]:
ColocationSetup.model_fields
[8]:
{'model_id': FieldInfo(annotation=Union[str, NoneType], required=True),
'obs_id': FieldInfo(annotation=Union[str, NoneType], required=True),
'obs_vars': FieldInfo(annotation=Union[tuple[str, ...], str], required=True),
'ts_type': FieldInfo(annotation=str, required=True),
'start': FieldInfo(annotation=Union[Timestamp, int, str, NoneType], required=True),
'stop': FieldInfo(annotation=Union[Timestamp, int, str, NoneType], required=True),
'obs_config': FieldInfo(annotation=Union[PyaroConfig, NoneType], required=False, default=None),
'OBS_VERT_TYPES_ALT': FieldInfo(annotation=dict[str, str], required=False, default={'Surface': 'ModelLevel', '2D': '2D'}),
'CRASH_ON_INVALID': FieldInfo(annotation=bool, required=False, default=False),
'FORBIDDEN_KEYS': FieldInfo(annotation=list[str], required=False, default=['var_outlier_ranges', 'var_ref_outlier_ranges', 'remove_outliers']),
'filter_name': FieldInfo(annotation=str, required=False, default='ALL-wMOUNTAINS'),
'basedir_coldata': FieldInfo(annotation=Union[str, Path], required=False, default='/home/heikok/MyPyaerocom/colocated_data', validate_default=True),
'save_coldata': FieldInfo(annotation=bool, required=False, default=False),
'obs_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None),
'obs_data_dir': FieldInfo(annotation=Union[Path, str, NoneType], required=False, default=None),
'obs_use_climatology': FieldInfo(annotation=bool, required=False, default=False),
'obs_cache_only': FieldInfo(annotation=bool, required=False, default=False),
'obs_vert_type': FieldInfo(annotation=Union[str, NoneType], required=False, default=None),
'obs_ts_type_read': FieldInfo(annotation=Union[str, dict, NoneType], required=False, default=None),
'obs_filters': FieldInfo(annotation=dict, required=False, default={}),
'colocation_layer_limits': FieldInfo(annotation=Union[tuple[LayerLimits, ...], NoneType], required=False, default=None),
'profile_layer_limits': FieldInfo(annotation=Union[tuple[LayerLimits, ...], NoneType], required=False, default=None),
'read_opts_ungridded': FieldInfo(annotation=Union[dict, NoneType], required=False, default={}),
'model_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None),
'model_data_dir': FieldInfo(annotation=Union[Path, str, NoneType], required=False, default=None),
'model_read_opts': FieldInfo(annotation=Union[dict, NoneType], required=False, default={}),
'model_use_vars': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default={}),
'model_rename_vars': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default={}),
'model_add_vars': FieldInfo(annotation=Union[dict[str, tuple[str, ...]], NoneType], required=False, default={}),
'model_to_stp': FieldInfo(annotation=bool, required=False, default=False),
'model_ts_type_read': FieldInfo(annotation=Union[str, dict, NoneType], required=False, default=None),
'model_read_aux': FieldInfo(annotation=Union[dict[str, dict[Literal['vars_required', 'fun'], Union[list[str], Callable]]], NoneType], required=False, default={}),
'model_use_climatology': FieldInfo(annotation=bool, required=False, default=False),
'gridded_reader_id': FieldInfo(annotation=dict[str, str], required=False, default={'model': 'ReadGridded', 'obs': 'ReadGridded'}),
'flex_ts_type': FieldInfo(annotation=bool, required=False, default=True),
'min_num_obs': FieldInfo(annotation=Union[dict, int, NoneType], required=False, default=None),
'resample_how': FieldInfo(annotation=Union[str, dict, NoneType], required=False, default='mean'),
'obs_remove_outliers': FieldInfo(annotation=bool, required=False, default=False),
'model_remove_outliers': FieldInfo(annotation=bool, required=False, default=False),
'obs_outlier_ranges': FieldInfo(annotation=Union[dict[str, tuple[float, float]], NoneType], required=False, default={}),
'model_outlier_ranges': FieldInfo(annotation=Union[dict[str, tuple[float, float]], NoneType], required=False, default={}),
'zeros_to_nan': FieldInfo(annotation=bool, required=False, default=False),
'harmonise_units': FieldInfo(annotation=bool, required=False, default=False),
'regrid_res_deg': FieldInfo(annotation=Union[float, RegridResDeg, NoneType], required=False, default=None),
'colocate_time': FieldInfo(annotation=bool, required=False, default=False),
'reanalyse_existing': FieldInfo(annotation=bool, required=False, default=True),
'raise_exceptions': FieldInfo(annotation=bool, required=False, default=False),
'keep_data': FieldInfo(annotation=bool, required=False, default=True),
'add_meta': FieldInfo(annotation=Union[dict, NoneType], required=False, default={}),
'model_kwargs': FieldInfo(annotation=dict, required=False, default={}),
'main_freq': FieldInfo(annotation=str, required=False, default='monthly'),
'freqs': FieldInfo(annotation=list[str], required=False, default=['monthly', 'yearly'])}
The documentation for the ColocatedData
object can be found here: https://pyaerocom.readthedocs.io/en/latest/api.html#pyaerocom.colocation.colocated_data.ColocatedData
[9]:
ColocatedData.model_fields
[9]:
{'data': FieldInfo(annotation=Union[Path, str, DataArray, ndarray, NoneType], required=False, default=None)}
Letโs build this ColocatedData
oibject by hand. We start with the array: arr
, assuming its dimension matches the structure described above.
[10]:
arr = data.data
[11]:
arr.shape
[11]:
(2, 365, 665)
Now the coordinatesโฆ
[12]:
data_source = ['AeronetSunV3Lev1.5.daily', 'ECMWF_OSUITE']
time = pd.date_range(start="2013-01-01", end="2013-12-31", freq = "D") # daily frequency
station_names = list(data.coords["station_name"].data)
[13]:
lats = data.latitude.data
lons = data.longitude.data
alts = data.altitude.data
[14]:
coords = {
"data_source": data_source,
"time": time,
"station_name": station_names,
"latitude": ("station_name", lats),
"longitude": ("station_name", lons),
"altitude": ("station_name", alts),
}
Here we cheat a bit with the station names as it is a very long list, so we get it directly from the data
object, but in principle this is a list of string that you have somewhere. We also cheat a bit with the latitudes, longitudes, and altitudes, but again these are arrays that you have somewhere.
In order to ensure safe processing, we require all the following metadata fields. Here is an example of a dictionary containing some example values.
[15]:
metadata = {
"obs_vars": "od550aer", # must match boths obs_var_name and mod_var_name in filename
"ts_type": "daily",
"filter_name": "ALL-wMOUNTAINS",
"ts_type_src": ["hourly", "hourly"], # must be a list of strings, a ts type for each data_source
"var_units": ["1", "1"], # must have units for each data_source
"data_level": 3,
"revision_ref": "20221027",
"from_files": ['ECMWF_OSUITE.daily.od550aer.2013.nc', 'ECMWF_OSUITE.daily.od550aer.2014.nc', 'ECMWF_OSUITE.daily.od550aer.2015.nc', 'ECMWF_OSUITE.daily.od550aer.2016.nc', 'ECMWF_OSUITE.daily.od550aer.2017.nc', 'ECMWF_OSUITE.daily.od550aer.2018.nc', 'ECMWF_OSUITE.daily.od550aer.2019.nc', 'ECMWF_OSUITE.daily.od550aer.2020.nc', 'ECMWF_OSUITE.daily.od550aer.2021.nc', 'ECMWF_OSUITE.daily.od550aer.2022.nc'],
"from_files_ref": "None",
"colocate_time": 0, # default is 0
"obs_is_clim": 0, # default is 0
"pyaerocom": pyaerocom.__version__,
"CONV!min_num_obs": str(dict(monthly=dict(daily=3))), # yes, needed
"resample_how": "mean", # default
"obs_name": "AeronetL1.5-d", # must match obs_name in filename
"vert_code": "Column", # Literal["Column", "Profile", "Surface", "ModelLevel"]. default is "Surface"
"diurnal_only": 0, # default is 0
"zeros_to_nan": 1, # default is 0
}
We can check that we have the right data types by passing this metadata object to Pyaerocomโs ColocationSetup
class. Under the hood Pyaerocom has checked that we provided the correct data types.
[16]:
col_stp = ColocationSetup(**metadata)
Now we put it all in an xarray DataArrayโฆ
[17]:
da = xr.DataArray(
data = arr,
dims = ["data_source", "time", "station_name"],
coords = coords,
attrs = metadata
)
da.name = "od550aer" # Important to give the DataArray the name of the variable.
And we put that DataArray into Pyaerocom ColocatedData
โฆ
[18]:
coldata = ColocatedData(da)
Under the hood Pyaerocom has checked that we have the correct dimensions on this object.
Saving the ColocatedData object๏
We can save this ColocatedData object.
[19]:
savename = "od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS.nc"
out_dir = "./data/coldata/"
os.makedirs(out_dir, exist_ok=True)
[20]:
coldata.to_netcdf(out_dir=out_dir, savename=savename)
[20]:
'./data/coldata/od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS.nc'
Verify the contents of the saved ColocatedData object๏
When everything is done we should we able to do a ncdump -h
and verify its contents.
netcdf od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS {
dimensions:
data_source = 2 ;
time = 3652 ;
station_name = 665 ;
variables:
string data_source(data_source) ;
int64 time(time) ;
time:units = "days since 2013-01-01 12:00:00" ;
time:calendar = "proleptic_gregorian" ;
string station_name(station_name) ;
double latitude(station_name) ;
latitude:_FillValue = NaN ;
latitude:standard_name = "latitude" ;
latitude:units = "degrees_north" ;
double longitude(station_name) ;
longitude:_FillValue = NaN ;
longitude:standard_name = "longitude" ;
longitude:units = "degrees_east" ;
double altitude(station_name) ;
altitude:_FillValue = NaN ;
double od550aer(data_source, time, station_name) ;
od550aer:_FillValue = NaN ;
string od550aer:data_source = "AeronetSunV3Lev1.5.daily", "ECMWF_OSUITE" ;
string od550aer:var_name = "od550aer", "od550aer" ;
string od550aer:var_name_input = "od550aer", "od550aer" ;
od550aer:ts_type = "daily" ;
od550aer:filter_name = "ALL-wMOUNTAINS" ;
string od550aer:ts_type_src = "daily", "daily" ;
string od550aer:var_units = "1", "1" ;
od550aer:data_level = 3LL ;
od550aer:revision_ref = "20221027" ;
string od550aer:from_files = "ECMWF_OSUITE.daily.od550aer.2013.nc", "ECMWF_OSUITE.daily.od550aer.2014.nc", "ECMWF_OSUITE.daily.od550aer.2015.nc", "ECMWF_OSUITE.daily.od550aer.2016.nc", "ECMWF_OSUITE.daily.od550aer.2017.nc", "ECMWF_OSUITE.daily.od550aer.2018.nc", "ECMWF_OSUITE.daily.od550aer.2019.nc", "ECMWF_OSUITE.daily.od550aer.2020.nc", "ECMWF_OSUITE.daily.od550aer.2021.nc", "ECMWF_OSUITE.daily.od550aer.2022.nc" ;
od550aer:from_files_ref = "None" ;
od550aer:colocate_time = 0LL ;
od550aer:obs_is_clim = 0LL ;
od550aer:pyaerocom = "0.13.2.dev0" ;
od550aer:CONV\!min_num_obs = "{\'monthly\': {\'daily\': 3}}" ;
od550aer:resample_how = "mean" ;
od550aer:model_name = "IFS-OSUITE" ;
od550aer:obs_name = "AeronetL1.5-d" ;
od550aer:vert_code = "Column" ;
od550aer:diurnal_only = 0LL ;
od550aer:coordinates = "altitude longitude latitude" ;
[21]:
! ncdump -h "./data/coldata/od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS.nc"
netcdf od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS {
dimensions:
data_source = 2 ;
time = 365 ;
station_name = 665 ;
variables:
string data_source(data_source) ;
int64 time(time) ;
time:units = "days since 2013-01-01 00:00:00" ;
time:calendar = "proleptic_gregorian" ;
string station_name(station_name) ;
double latitude(station_name) ;
latitude:_FillValue = NaN ;
double longitude(station_name) ;
longitude:_FillValue = NaN ;
double altitude(station_name) ;
altitude:_FillValue = NaN ;
double od550aer(data_source, time, station_name) ;
od550aer:_FillValue = NaN ;
od550aer:obs_vars = "od550aer" ;
od550aer:ts_type = "daily" ;
od550aer:filter_name = "ALL-wMOUNTAINS" ;
string od550aer:ts_type_src = "hourly", "hourly" ;
string od550aer:var_units = "1", "1" ;
od550aer:data_level = 3LL ;
od550aer:revision_ref = "20221027" ;
string od550aer:from_files = "ECMWF_OSUITE.daily.od550aer.2013.nc", "ECMWF_OSUITE.daily.od550aer.2014.nc", "ECMWF_OSUITE.daily.od550aer.2015.nc", "ECMWF_OSUITE.daily.od550aer.2016.nc", "ECMWF_OSUITE.daily.od550aer.2017.nc", "ECMWF_OSUITE.daily.od550aer.2018.nc", "ECMWF_OSUITE.daily.od550aer.2019.nc", "ECMWF_OSUITE.daily.od550aer.2020.nc", "ECMWF_OSUITE.daily.od550aer.2021.nc", "ECMWF_OSUITE.daily.od550aer.2022.nc" ;
od550aer:from_files_ref = "None" ;
od550aer:colocate_time = 0LL ;
od550aer:obs_is_clim = 0LL ;
od550aer:pyaerocom = "0.25.0.dev0" ;
od550aer:CONV\!min_num_obs = "{\'monthly\': {\'daily\': 3}}" ;
od550aer:resample_how = "mean" ;
od550aer:obs_name = "AeronetL1.5-d" ;
od550aer:vert_code = "Column" ;
od550aer:diurnal_only = 0LL ;
od550aer:zeros_to_nan = 1LL ;
od550aer:coordinates = "altitude latitude longitude" ;
}