WIP: Making a ColocatedData object with Pyaerocom๏ƒ

Lewis Blake lewisb@met.no

This is a tutorial showing how the current version of pyaerocom can be used for creating colocated data objects which minimizes the risk of processing errors. It is a preliminary tutorial, and therefore subject to change as development continues. As the interface becomes better defined, this should simplify the process and changes will be reflected in updated versions of this tutorial. Feedback welcome.

Getting started๏ƒ

You may want to start by creating a Python virtual environment before begining and installing pyaerocom into it.

python3 -m venv --prompt cams2_82 .venv
source .venv/bin/activate
pip install pyaerocom
pya --help
[1]:
import os
import pandas as pd
import pyaerocom
import xarray as xr

Downloading example data๏ƒ

Letโ€™s say we have done the collocation and stored the results in some object. As an example we will assume this is a netcdf file called my_colocated_data.nc. For simplcity of the tutorial, such a colocated data object file is included with pyaerocom. In fact there are a few, however there is a my_colocated_data.nc specifically included for this tutorial. Pyaerocom has a command line interface which is usefule for this. On the command line there are several functions available.

Our goal is to create a ColocatedData object named od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS.nc

[2]:
! pya --help
                                                                                
 Usage: pya [OPTIONS] COMMAND [ARGS]...                                         
                                                                                
 ๐Ÿฆ„ Pyaerocom Command Line Interface

โ•ญโ”€ Options โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ --version             -V                                                     โ”‚
โ”‚ --install-completion            Install completion for the current shell.    โ”‚
โ”‚ --show-completion               Show completion for the current shell, to    โ”‚
โ”‚                                 copy it or customize the installation.       โ”‚
โ”‚ --help                          Show this message and exit.                  โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€ Commands โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ aeroval         Run an AeroVal experiment as described in a json config file โ”‚
โ”‚ browse          Browse database (e.g., browse <DATABASE>)                    โ”‚
โ”‚ clearcache      Delete cached data objects                                   โ”‚
โ”‚ getsampledata   Downloads a minimal sample dataset.                          โ”‚
โ”‚ init            init ~/MyPyaerocom directory and copy the default paths.ini  โ”‚
โ”‚                 there                                                        โ”‚
โ”‚ listcache       List cached data objects                                     โ”‚
โ”‚ ppiaccess       Check if MET Norway's PPI can be accessed                    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Ensure that pya/pyaerocom is at least version 0.25

[3]:
! pya -V
๐Ÿฆ„ pyaerocom 0.25.0.dev0

The getsampledata command will download the neccessary files into ./data/.

[4]:
!  pya getsampledata
[5]:
filepath = "./data/testdata-minimal/modeldata/colocated_netcdf/my_colocated_data.nc"
data = xr.open_dataarray(filepath)

Converting to ColocatedData๏ƒ

Note that while included in pyaerocom, this data is not actually CF-compliant. For example, latitude and longitude have units โ€œdegreesโ€, but really should be โ€œdegrees_northโ€ and โ€œdegrees_eastโ€ respectively. Letโ€™s fix that.

[6]:
data.latitude.attrs["units"] = "degrees_north"
data.longitude.attrs["units"] = "degrees_east"

The general format for ColocateData objects file name is "{obs_var_name}_{mod_var_name}_MOD-{model_name}_REF-{obs_name}_{date_start]_{date_end}_{ts_type}_{filter_name}.nc"

  • obs_var_name: Aerocom name for the obs variable. Can be found by searching here: https://github.com/metno/pyaerocom/blob/main-dev/pyaerocom/data/variables.ini

  • mod_var_name: Aerocom name for model variable.

  • model_name: Name of the model

  • obs_name: Name of the observation network

  • date_start: Start of the timerseries in the colocated array formated YYYYMMDD

  • date_end: End of the timeseries in the colocated array formated YYYYMMDD

  • ts_type: Time series type (frequency), for example โ€œdailyโ€ or โ€œmonthlyโ€.

  • filter_name: name of the region filter applied. If no spatial filering applied, should be the string โ€œALL-wMOUNTAINSโ€.

Note that the fields provided here in the name must match their associated values in the metadata below. For exmaple, the model_name here in the name needs to match the model_name in the metadata.

A valid (and encouraged) way to create a ColocatedData object is to convert it into an xarray DataArray with the correct structure, dimensions, and metadata. You will have your object stored in some format. The task of CAMS2_82 partners will then be to do translation of their colocated data into this format.

In the current design, such A ColocatedData object comprises 3 or 4 dimensions.

  • The first dimension (data_source, index 0) is ALWAYS length 2 and specifies the two datasets that were co-located (index 0 is obs, index 1 is model).

  • The second dimension is time.

  • 3rd dimension is station_name in case of 3D colocated data. For 4D colocated data the 3rd and 4th dimension are latitude and longitude, respectively.

3D colocated data is typically created when a model is colocated with station based ground based observations. 4D colocated data is created when a model is colocated with another model or satellite observations, that cover large parts of Earthโ€™s surface (other than discrete lat/lon pairs in the case of ground based station locations).

We may be curious about the valid potential options available to us. We can check the fields in a few relevant classes to determine what these options are and to check which data types they are allowed to be.

The documentation for the ColocationSetup class can be found here: https://pyaerocom.readthedocs.io/en/latest/api.html#pyaerocom.colocation.colocation_setup.ColocationSetup

[7]:
from pyaerocom import ColocationSetup, ColocatedData
[8]:
ColocationSetup.model_fields
[8]:
{'model_id': FieldInfo(annotation=Union[str, NoneType], required=True),
 'obs_id': FieldInfo(annotation=Union[str, NoneType], required=True),
 'obs_vars': FieldInfo(annotation=Union[tuple[str, ...], str], required=True),
 'ts_type': FieldInfo(annotation=str, required=True),
 'start': FieldInfo(annotation=Union[Timestamp, int, str, NoneType], required=True),
 'stop': FieldInfo(annotation=Union[Timestamp, int, str, NoneType], required=True),
 'obs_config': FieldInfo(annotation=Union[PyaroConfig, NoneType], required=False, default=None),
 'OBS_VERT_TYPES_ALT': FieldInfo(annotation=dict[str, str], required=False, default={'Surface': 'ModelLevel', '2D': '2D'}),
 'CRASH_ON_INVALID': FieldInfo(annotation=bool, required=False, default=False),
 'FORBIDDEN_KEYS': FieldInfo(annotation=list[str], required=False, default=['var_outlier_ranges', 'var_ref_outlier_ranges', 'remove_outliers']),
 'filter_name': FieldInfo(annotation=str, required=False, default='ALL-wMOUNTAINS'),
 'basedir_coldata': FieldInfo(annotation=Union[str, Path], required=False, default='/home/heikok/MyPyaerocom/colocated_data', validate_default=True),
 'save_coldata': FieldInfo(annotation=bool, required=False, default=False),
 'obs_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None),
 'obs_data_dir': FieldInfo(annotation=Union[Path, str, NoneType], required=False, default=None),
 'obs_use_climatology': FieldInfo(annotation=bool, required=False, default=False),
 'obs_cache_only': FieldInfo(annotation=bool, required=False, default=False),
 'obs_vert_type': FieldInfo(annotation=Union[str, NoneType], required=False, default=None),
 'obs_ts_type_read': FieldInfo(annotation=Union[str, dict, NoneType], required=False, default=None),
 'obs_filters': FieldInfo(annotation=dict, required=False, default={}),
 'colocation_layer_limits': FieldInfo(annotation=Union[tuple[LayerLimits, ...], NoneType], required=False, default=None),
 'profile_layer_limits': FieldInfo(annotation=Union[tuple[LayerLimits, ...], NoneType], required=False, default=None),
 'read_opts_ungridded': FieldInfo(annotation=Union[dict, NoneType], required=False, default={}),
 'model_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None),
 'model_data_dir': FieldInfo(annotation=Union[Path, str, NoneType], required=False, default=None),
 'model_read_opts': FieldInfo(annotation=Union[dict, NoneType], required=False, default={}),
 'model_use_vars': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default={}),
 'model_rename_vars': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default={}),
 'model_add_vars': FieldInfo(annotation=Union[dict[str, tuple[str, ...]], NoneType], required=False, default={}),
 'model_to_stp': FieldInfo(annotation=bool, required=False, default=False),
 'model_ts_type_read': FieldInfo(annotation=Union[str, dict, NoneType], required=False, default=None),
 'model_read_aux': FieldInfo(annotation=Union[dict[str, dict[Literal['vars_required', 'fun'], Union[list[str], Callable]]], NoneType], required=False, default={}),
 'model_use_climatology': FieldInfo(annotation=bool, required=False, default=False),
 'gridded_reader_id': FieldInfo(annotation=dict[str, str], required=False, default={'model': 'ReadGridded', 'obs': 'ReadGridded'}),
 'flex_ts_type': FieldInfo(annotation=bool, required=False, default=True),
 'min_num_obs': FieldInfo(annotation=Union[dict, int, NoneType], required=False, default=None),
 'resample_how': FieldInfo(annotation=Union[str, dict, NoneType], required=False, default='mean'),
 'obs_remove_outliers': FieldInfo(annotation=bool, required=False, default=False),
 'model_remove_outliers': FieldInfo(annotation=bool, required=False, default=False),
 'obs_outlier_ranges': FieldInfo(annotation=Union[dict[str, tuple[float, float]], NoneType], required=False, default={}),
 'model_outlier_ranges': FieldInfo(annotation=Union[dict[str, tuple[float, float]], NoneType], required=False, default={}),
 'zeros_to_nan': FieldInfo(annotation=bool, required=False, default=False),
 'harmonise_units': FieldInfo(annotation=bool, required=False, default=False),
 'regrid_res_deg': FieldInfo(annotation=Union[float, RegridResDeg, NoneType], required=False, default=None),
 'colocate_time': FieldInfo(annotation=bool, required=False, default=False),
 'reanalyse_existing': FieldInfo(annotation=bool, required=False, default=True),
 'raise_exceptions': FieldInfo(annotation=bool, required=False, default=False),
 'keep_data': FieldInfo(annotation=bool, required=False, default=True),
 'add_meta': FieldInfo(annotation=Union[dict, NoneType], required=False, default={}),
 'model_kwargs': FieldInfo(annotation=dict, required=False, default={}),
 'main_freq': FieldInfo(annotation=str, required=False, default='monthly'),
 'freqs': FieldInfo(annotation=list[str], required=False, default=['monthly', 'yearly'])}

The documentation for the ColocatedData object can be found here: https://pyaerocom.readthedocs.io/en/latest/api.html#pyaerocom.colocation.colocated_data.ColocatedData

[9]:
ColocatedData.model_fields
[9]:
{'data': FieldInfo(annotation=Union[Path, str, DataArray, ndarray, NoneType], required=False, default=None)}

Letโ€™s build this ColocatedData oibject by hand. We start with the array: arr, assuming its dimension matches the structure described above.

[10]:
arr = data.data
[11]:
arr.shape
[11]:
(2, 365, 665)

Now the coordinatesโ€ฆ

[12]:
data_source = ['AeronetSunV3Lev1.5.daily', 'ECMWF_OSUITE']
time = pd.date_range(start="2013-01-01", end="2013-12-31", freq = "D") # daily frequency
station_names = list(data.coords["station_name"].data)
[13]:
lats = data.latitude.data
lons = data.longitude.data
alts = data.altitude.data
[14]:
coords = {
    "data_source": data_source,
    "time": time,
    "station_name": station_names,
    "latitude": ("station_name", lats),
    "longitude": ("station_name", lons),
    "altitude": ("station_name", alts),
}

Here we cheat a bit with the station names as it is a very long list, so we get it directly from the data object, but in principle this is a list of string that you have somewhere. We also cheat a bit with the latitudes, longitudes, and altitudes, but again these are arrays that you have somewhere.

In order to ensure safe processing, we require all the following metadata fields. Here is an example of a dictionary containing some example values.

[15]:
metadata = {
    "obs_vars": "od550aer", # must match boths obs_var_name and mod_var_name in filename
    "ts_type": "daily",
    "filter_name": "ALL-wMOUNTAINS",
    "ts_type_src": ["hourly", "hourly"], # must be a list of strings, a ts type for each data_source
    "var_units": ["1", "1"], # must have units for each data_source
    "data_level": 3,
    "revision_ref": "20221027",
    "from_files": ['ECMWF_OSUITE.daily.od550aer.2013.nc', 'ECMWF_OSUITE.daily.od550aer.2014.nc', 'ECMWF_OSUITE.daily.od550aer.2015.nc', 'ECMWF_OSUITE.daily.od550aer.2016.nc', 'ECMWF_OSUITE.daily.od550aer.2017.nc', 'ECMWF_OSUITE.daily.od550aer.2018.nc', 'ECMWF_OSUITE.daily.od550aer.2019.nc', 'ECMWF_OSUITE.daily.od550aer.2020.nc', 'ECMWF_OSUITE.daily.od550aer.2021.nc', 'ECMWF_OSUITE.daily.od550aer.2022.nc'],
    "from_files_ref": "None",
    "colocate_time": 0, # default is 0
    "obs_is_clim": 0, # default is 0
    "pyaerocom": pyaerocom.__version__,
    "CONV!min_num_obs": str(dict(monthly=dict(daily=3))), # yes, needed
    "resample_how": "mean", # default
    "obs_name": "AeronetL1.5-d", # must match obs_name in filename
    "vert_code": "Column", # Literal["Column", "Profile", "Surface", "ModelLevel"]. default is "Surface"
    "diurnal_only": 0, # default is 0
    "zeros_to_nan": 1, # default is 0
}

We can check that we have the right data types by passing this metadata object to Pyaerocomโ€™s ColocationSetup class. Under the hood Pyaerocom has checked that we provided the correct data types.

[16]:
col_stp = ColocationSetup(**metadata)

Now we put it all in an xarray DataArrayโ€ฆ

[17]:
da = xr.DataArray(
    data = arr,
    dims = ["data_source", "time", "station_name"],
    coords = coords,
    attrs = metadata
)
da.name = "od550aer" # Important to give the DataArray the name of the variable.

And we put that DataArray into Pyaerocom ColocatedDataโ€ฆ

[18]:
coldata = ColocatedData(da)

Under the hood Pyaerocom has checked that we have the correct dimensions on this object.

Saving the ColocatedData object๏ƒ

We can save this ColocatedData object.

[19]:
savename = "od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS.nc"
out_dir = "./data/coldata/"
os.makedirs(out_dir, exist_ok=True)
[20]:
coldata.to_netcdf(out_dir=out_dir, savename=savename)
[20]:
'./data/coldata/od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS.nc'

Verify the contents of the saved ColocatedData object๏ƒ

When everything is done we should we able to do a ncdump -h and verify its contents.

netcdf od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS {
dimensions:
    data_source = 2 ;
    time = 3652 ;
    station_name = 665 ;
variables:
    string data_source(data_source) ;
    int64 time(time) ;
        time:units = "days since 2013-01-01 12:00:00" ;
        time:calendar = "proleptic_gregorian" ;
    string station_name(station_name) ;
    double latitude(station_name) ;
        latitude:_FillValue = NaN ;
        latitude:standard_name = "latitude" ;
        latitude:units = "degrees_north" ;
    double longitude(station_name) ;
        longitude:_FillValue = NaN ;
        longitude:standard_name = "longitude" ;
        longitude:units = "degrees_east" ;
    double altitude(station_name) ;
        altitude:_FillValue = NaN ;
    double od550aer(data_source, time, station_name) ;
        od550aer:_FillValue = NaN ;
        string od550aer:data_source = "AeronetSunV3Lev1.5.daily", "ECMWF_OSUITE" ;
        string od550aer:var_name = "od550aer", "od550aer" ;
        string od550aer:var_name_input = "od550aer", "od550aer" ;
        od550aer:ts_type = "daily" ;
        od550aer:filter_name = "ALL-wMOUNTAINS" ;
        string od550aer:ts_type_src = "daily", "daily" ;
        string od550aer:var_units = "1", "1" ;
        od550aer:data_level = 3LL ;
        od550aer:revision_ref = "20221027" ;
        string od550aer:from_files = "ECMWF_OSUITE.daily.od550aer.2013.nc", "ECMWF_OSUITE.daily.od550aer.2014.nc", "ECMWF_OSUITE.daily.od550aer.2015.nc", "ECMWF_OSUITE.daily.od550aer.2016.nc", "ECMWF_OSUITE.daily.od550aer.2017.nc", "ECMWF_OSUITE.daily.od550aer.2018.nc", "ECMWF_OSUITE.daily.od550aer.2019.nc", "ECMWF_OSUITE.daily.od550aer.2020.nc", "ECMWF_OSUITE.daily.od550aer.2021.nc", "ECMWF_OSUITE.daily.od550aer.2022.nc" ;
        od550aer:from_files_ref = "None" ;
        od550aer:colocate_time = 0LL ;
        od550aer:obs_is_clim = 0LL ;
        od550aer:pyaerocom = "0.13.2.dev0" ;
        od550aer:CONV\!min_num_obs = "{\'monthly\': {\'daily\': 3}}" ;
        od550aer:resample_how = "mean" ;
        od550aer:model_name = "IFS-OSUITE" ;
        od550aer:obs_name = "AeronetL1.5-d" ;
        od550aer:vert_code = "Column" ;
        od550aer:diurnal_only = 0LL ;
        od550aer:coordinates = "altitude longitude latitude" ;
[21]:
! ncdump -h "./data/coldata/od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS.nc"
netcdf od550aer_od550aer_MOD-IFS-OSUITE_REF-AeronetL1.5-d_20130101_20221231_daily_ALL-wMOUNTAINS {
dimensions:
        data_source = 2 ;
        time = 365 ;
        station_name = 665 ;
variables:
        string data_source(data_source) ;
        int64 time(time) ;
                time:units = "days since 2013-01-01 00:00:00" ;
                time:calendar = "proleptic_gregorian" ;
        string station_name(station_name) ;
        double latitude(station_name) ;
                latitude:_FillValue = NaN ;
        double longitude(station_name) ;
                longitude:_FillValue = NaN ;
        double altitude(station_name) ;
                altitude:_FillValue = NaN ;
        double od550aer(data_source, time, station_name) ;
                od550aer:_FillValue = NaN ;
                od550aer:obs_vars = "od550aer" ;
                od550aer:ts_type = "daily" ;
                od550aer:filter_name = "ALL-wMOUNTAINS" ;
                string od550aer:ts_type_src = "hourly", "hourly" ;
                string od550aer:var_units = "1", "1" ;
                od550aer:data_level = 3LL ;
                od550aer:revision_ref = "20221027" ;
                string od550aer:from_files = "ECMWF_OSUITE.daily.od550aer.2013.nc", "ECMWF_OSUITE.daily.od550aer.2014.nc", "ECMWF_OSUITE.daily.od550aer.2015.nc", "ECMWF_OSUITE.daily.od550aer.2016.nc", "ECMWF_OSUITE.daily.od550aer.2017.nc", "ECMWF_OSUITE.daily.od550aer.2018.nc", "ECMWF_OSUITE.daily.od550aer.2019.nc", "ECMWF_OSUITE.daily.od550aer.2020.nc", "ECMWF_OSUITE.daily.od550aer.2021.nc", "ECMWF_OSUITE.daily.od550aer.2022.nc" ;
                od550aer:from_files_ref = "None" ;
                od550aer:colocate_time = 0LL ;
                od550aer:obs_is_clim = 0LL ;
                od550aer:pyaerocom = "0.25.0.dev0" ;
                od550aer:CONV\!min_num_obs = "{\'monthly\': {\'daily\': 3}}" ;
                od550aer:resample_how = "mean" ;
                od550aer:obs_name = "AeronetL1.5-d" ;
                od550aer:vert_code = "Column" ;
                od550aer:diurnal_only = 0LL ;
                od550aer:zeros_to_nan = 1LL ;
                od550aer:coordinates = "altitude latitude longitude" ;
}