Core API

Documentation of the core API of pyaerocom.

Logging

pyaerocom initializes logging automatically on import in the following way.

  1. Messages of level INFO or worse are logged to logs/pyaerocom.log.$PID or (dynamic feature) to the file given in the environment variable PYAEROCOM_LOG_FILE. (Dynamic feature) These log files are deleted after 7 days.

  2. Messages of level WARNING or worse are also printed on stdout. (Dynamic feature) Output to stdout is disabled if the script is run non-interactively.

Besides the default record attributes defined in https://docs.python.org/3/library/logging.html#logrecord-attributes, pyaerocom also adds a special mem_usage keyword, which makes it possible to detect memory leaks of the Python process early.

If a file named logging.ini is put in the script's current working directory, that configuration is used instead of the default described above. An example logging.ini that does roughly the same as the default (except for the dynamic features) and enables debug logging for one package (pyaerocom.io.ungridded) is provided here:

[loggers]
keys=root,pyaerocom-ungridded

[handlers]
keys=console,file

[formatters]
keys=plain,detailed

[formatter_plain]
format=%(message)s

[formatter_detailed]
format=%(asctime)s:%(name)s:%(mem_usage)s:%(levelname)s:%(message)s
datefmt=%F %T

[handler_console]
class=StreamHandler
formatter=plain
args=(sys.stdout,)
level=WARN

[handler_file]
class=FileHandler
formatter=detailed
level=DEBUG
file_name=logs/pyaerocom.log.%(pid)s
args=('%(file_name)s', "w")


[logger_root]
handlers=file,console
level=INFO

[logger_pyaerocom-ungridded]
handlers=file
qualname=pyaerocom.io.readungriddedbase
level=DEBUG
propagate=0
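
The same per-package debug logging can also be enabled programmatically, without a logging.ini. A minimal sketch using only the standard library (the logger name is the qualname from the example above):

    import logging

    import pyaerocom  # importing pyaerocom triggers its automatic logging setup

    # raise verbosity for a single subpackage, keeping the defaults elsewhere
    logging.getLogger("pyaerocom.io.readungriddedbase").setLevel(logging.DEBUG)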

Data classes

Gridded data

class pyaerocom.griddeddata.GriddedData(input=None, var_name=None, check_unit=True, convert_unit_on_init=True, proj_info: ProjectionInformation | None = None, **meta)[source]

pyaerocom object representing gridded data (e.g. model diagnostics)

Gridded data refers to data that can be represented on a regular, multidimensional grid. In pyaerocom this comprises both model output and diagnostics as well as gridded level 3 satellite data, typically with dimensions latitude, longitude, time (for surface or columnar data) and an additional dimension lev (or similar) for vertically resolved data.

Under the hood, this data object is based on (but not inherited from) the iris.cube.Cube object and makes extensive use of the functionality implemented therein (many methods implemented here in GriddedData are simply wrappers around Cube methods).

Note

Note that the implemented functionality in this class is mostly limited to what is needed in the pyaerocom API (e.g. for pyaerocom.colocation routines or data import) and is not aimed at replacing or competing with similar data classes such as iris.cube.Cube or xarray.DataArray. Rather, dependent on the use case, one or another of such gridded data objects is needed for optimal processing, which is why GriddedData provides methods and / or attributes to convert to or from other such data classes (e.g. GriddedData.cube is an instance of iris.cube.Cube and method GriddedData.to_xarray() can be used to convert to xarray.DataArray). Thus, GriddedData can be considered rather high-level as compared to the other mentioned data classes from iris or xarray.

Note

Since the GriddedData object is based on the iris.cube.Cube object, it is optimised for netCDF files that follow the CF conventions and may not work out of the box for files that do not follow this standard.

Parameters:
  • input (str or Cube) – data input. Can be a single .nc file or a preloaded iris Cube.

  • var_name (str, optional) – variable name that is extracted if input is a file path. Irrelevant if input is a preloaded Cube

  • check_unit (bool) – if True, the assigned unit is checked and, if it is an alias to another unit, the unit string will be updated. A warning is printed if the unit is invalid or not equal to the associated AeroCom unit for the input variable. Set convert_unit_on_init to True if you want an automatic conversion to AeroCom units. Defaults to True.

  • convert_unit_on_init (bool) – if True and the unit check indicates non-conformity with the AeroCom unit, the data is converted automatically, and a warning is printed if that conversion fails. Defaults to True.
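
A minimal usage sketch (the file path and variable name below are placeholders, not shipped test data; any CF-compliant NetCDF file containing the requested variable should work):

    from pyaerocom import GriddedData

    data = GriddedData("path/to/od550aer_Column_2010_monthly.nc",
                       var_name="od550aer")
    print(data.var_name, data.units, data.shape)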

COORDS_ORDER_TSERIES = ['time', 'latitude', 'longitude']

Req. order of dimension coordinates for time-series computation

SUPPORTED_VERT_SCHEMES = ['mean', 'max', 'min', 'surface', 'altitude', 'profile']
property TS_TYPES

List with valid filename encodings specifying temporal resolution

aerocom_filename(at_stations=False)[source]

Filename of data following Aerocom 3 conventions

Parameters:

at_stations (bool) – if True, then an AtStations string will be included in the filename

Returns:

generated file name based on what is in this object

Return type:

str

aerocom_savename(data_id=None, var_name=None, vert_code=None, year=None, ts_type=None)[source]

Get filename for saving following AeroCom conventions

Parameters:
  • data_id (str, optional) – data ID used in output filename. Defaults to None, in which case data_id is used.

  • var_name (str, optional) – variable name used in output filename. Defaults to None, in which case var_name is used.

  • vert_code (str, optional) – vertical code used in output filename (e.g. Surface, Column, ModelLevel). Defaults to None, in which case assigned value in metadata is used.

  • year (str, optional) – year to be used in filename. If None, then it is attempted to be inferred from values in time dimension.

  • ts_type (str, optional) – frequency string to be used in filename. If None, then ts_type is used.

Raises:

ValueError – if vertical code is not provided and cannot be inferred or if year is not provided and data is not single year. Note that if year is provided, then no sanity checking is done against time dimension.

Returns:

output filename following AeroCom Phase 3 conventions.

Return type:

str

property altitude_access
apply_region_mask(region_id, thresh_coast=0.5, inplace=False)[source]

Apply a masked region filter

area_weighted_mean()[source]

Get area weighted mean

property area_weights

Area weights of lat / lon grid

property base_year

Base year of time dimension

Note

Changing this attribute will update the time-dimension.

calc_area_weights()[source]

Calculate area weights for grid

change_base_year(new_year, inplace=True)[source]

Changes base year of time dimension

Relevant, e.g. for climatological analyses.

Note

This method does not account for offsets arising from leap years (affecting daily or higher resolution data). It is thus recommended to use this method with care. E.g., if you use this method on a 2016 daily data object with a calendar that supports leap years, you will end up with 366 time stamps also in the new data object.

Parameters:
  • new_year (int) – new base year (can also be other than integer if it is convertible)

  • inplace (bool) – if True, modify this object, else, use a copy

Returns:

modified data object

Return type:

GriddedData
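
For illustration, a sketch assuming data is a GriddedData object holding one year of daily values:

    # shift the time dimension to base year 2020; the original object is untouched
    clim = data.change_base_year(2020, inplace=False)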

check_dimcoords_tseries() → None[source]

Check order of dimension coordinates for time series retrieval

For computation of time series at certain lon / lat coordinates, the data dimensions have to be in a certain order specified by COORDS_ORDER_TSERIES.

This method checks the current order (and dimensionality) of data and raises appropriate errors.

Raises:
check_frequency()[source]

Check if all datapoints are sampled at the same time frequency

check_lon_circular()[source]

Check if longitude coordinates are circular

check_unit(try_convert_if_wrong=False)[source]

Check if unit is correct

collapsed(coords, aggregator, **kwargs)[source]

Collapse cube

Reimplementation of method iris.cube.Cube.collapsed(); for details see the iris documentation

Parameters:
  • coords (str or list) – string IDs of coordinate(s) that are to be collapsed (e.g. ["longitude", "latitude"])

  • aggregator (str or Aggregator or WeightedAggretor) – the aggregator used. If input is string, it is converted into the corresponding iris Aggregator object, see str_to_iris() for valid strings

  • **kwargs – additional keyword args (e.g. weights)

Returns:

collapsed data object

Return type:

GriddedData
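
For example, a sketch that collapses the horizontal dimensions into an area-weighted global mean time series (weights computed via calc_area_weights()):

    weights = data.calc_area_weights()
    tseries = data.collapsed(["longitude", "latitude"], "mean",
                             weights=weights)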

property computed
property concatenated
convert_unit(new_unit, inplace=True)[source]

Convert unit of data to new unit

Parameters:
  • new_unit (str or cf_units.Unit) – new unit of data

  • inplace (bool) – convert in this instance or create a new one

property coord_names

List containing coordinate names

property coords_order

Array containing the order of coordinates

copy()[source]

Copy this data object

copy_coords(other, inplace=True)[source]

Copy all coordinates from other data object

Requires the underlying data to be the same shape.

Warning

This operation will delete all existing coordinates and auxiliary coordinates and will then copy the ones from the input data object. No checks of any kind will be performed

Parameters:
  • other (GriddedData or Cube) – other data object (needs to be same shape as this object)

  • inplace (bool) – if True, then this object will be modified and returned, else a copy.

Returns:

data object containing coordinates from other object

Return type:

GriddedData

crop(lon_range=None, lat_range=None, time_range=None, region=None)[source]

High level function that applies cropping along multiple axes

Note

1. For cropping of longitudes and latitudes, the method iris.cube.Cube.intersection() is used since it automatically accepts and understands longitude input based on definition 0 <= lon <= 360 as well as for -180 <= lon <= 180.

2. Time extraction may be provided directly as index or in form of pandas.Timestamp objects.

Parameters:
  • lon_range (tuple, optional) – 2-element tuple containing longitude range for cropping. If None, the longitude axis remains unchanged. Example input to crop around meridian: lon_range=(-30, 30)

  • lat_range (tuple, optional) – 2-element tuple containing latitude range for cropping. If None, the latitude axis remains unchanged

  • time_range (tuple, optional) –

    2-element tuple containing time range for cropping. Allowed data types for specifying the times are

    1. a combination of 2 pandas.Timestamp instances or

    2. a combination of two strings that can be directly converted into pandas.Timestamp instances (e.g. time_range=(“2010-1-1”, “2012-1-1”)) or

    3. directly a combination of indices (int).

    If None, the time axis remains unchanged.

  • region (str or Region, optional) – string ID of pyaerocom default region or directly an instance of the Region object. May be used instead of lon_range and lat_range, if these are unspecified.

Returns:

new data object containing cropped grid

Return type:

GriddedData
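
A sketch combining spatial and temporal cropping (the coordinate and time ranges are illustrative):

    sub = data.crop(lon_range=(-30, 60), lat_range=(30, 80),
                    time_range=("2010-1-1", "2011-1-1"))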

property cube

Instance of underlying cube object

property data

Data array (n-dimensional numpy array)

Note

This is a pointer to the data object of the underlying iris.Cube instance and will load the data into memory. Thus, in case of large datasets, this may lead to a memory error

property data_id

ID of data object (e.g. model run ID, obsnetwork ID)

Note

This attribute was formerly named name, which is also the corresponding attribute name in metadata

property data_revision

Revision string from file Revision.txt in the main data directory

delete_all_coords(inplace=True)[source]

Deletes all coordinates (dimension + auxiliary) in this object

delete_aux_vars()[source]

Delete auxiliary variables and iris AuxFactories

property delta_t

Array containing timedelta values for each time stamp

property dimcoord_names

List containing coordinate names

estimate_value_range_from_data(extend_percent=5)[source]

Estimate lower and upper end of value range for these data

Parameters:

extend_percent (int) – percentage specifying to which extent min and max values are to be extended to estimate the value range. Defaults to 5.

Returns:

  • float – lower end of estimated value range

  • float – upper end of estimated value range

extract(constraint, inplace=False)[source]

Extract subset

Parameters:

constraint (iris.Constraint) – constraint that is to be applied

Returns:

new data object containing cropped data

Return type:

GriddedData

extract_surface_level()[source]

Extract surface level from 4D field

filter_altitude(alt_range=None)[source]

Currently a dummy method that makes life easier in Filter

Returns:

current instance

Return type:

GriddedData

filter_region(region_id, inplace=False, **kwargs)[source]

Filter region based on ID

This works both for rectangular regions and mask regions

Parameters:
  • region_id (str) – name of region

  • inplace (bool) – if True, the current data object is modified, else a new object is returned

  • **kwargs – additional keyword args passed to apply_region_mask() if input region is a mask.

Returns:

filtered data object

Return type:

GriddedData

find_closest_index(**dimcoord_vals)[source]

Find the closest indices for dimension coordinate values

property from_files

List of file paths from which this data object was created

get_area_weighted_timeseries(region=None)[source]

Helper method to extract area weighted mean timeseries

Parameters:

region – optional, name of AeroCom default region for which the mean is to be calculated (e.g. EUROPE)

Returns:

station data containing area weighted mean

Return type:

StationData

property grid

Underlying grid data object

property has_data

True if sum of shape of underlying Cube instance is > 0, else False

property has_latlon_dims

Boolean specifying whether data has latitude and longitude dimensions

property has_time_dim

Boolean specifying whether data has a time dimension

infer_ts_type()[source]

Try to infer sampling frequency from time dimension data

Returns:

ts_type that was inferred (is assigned to metadata too)

Return type:

str

Raises:

DataDimensionError – if data object does not contain a time dimension

interpolate(sample_points=None, scheme='nearest', collapse_scalar=True, **coords)[source]

Interpolate cube at certain discrete points

Reimplementation of method iris.cube.Cube.interpolate(); for details see the iris documentation

Note

The input coordinates may also be provided using the input arg **coords, which provides a more intuitive option (e.g. input (sample_points=[("longitude", [10, 20]), ("latitude", [1, 2])]) is the same as input (longitude=[10, 20], latitude=[1, 2])).

Parameters:
  • sample_points (list) – sequence of coordinate pairs over which to interpolate. Sample coords should be sorted in ascending order without duplicates.

  • scheme (str or iris interpolator object) – interpolation scheme, pyaerocom default is nearest. If input is string, it is converted into the corresponding iris Interpolator object, see str_to_iris() for valid strings

  • collapse_scalar (bool) – Whether to collapse the dimension of scalar sample points in the resulting cube. Default is True.

  • **coords – additional keyword args that may be used to provide the interpolation coordinates in an easier way than using the Cube argument sample_points. May also be a combination of both.

Returns:

new data object containing interpolated data

Return type:

GriddedData

Examples

>>> from pyaerocom import GriddedData
>>> data = GriddedData()
>>> data._init_testdata_default()
>>> itp = data.interpolate([("longitude", (10)),
...                         ("latitude" , (35))])
>>> print(itp.shape)
(365, 1, 1)
intersection(*args, **kwargs)[source]

Extract subset using iris.cube.Cube.intersection()

See the iris documentation for details related to this method and its input parameters.

Note

Only works if underlying grid data type is iris.cube.Cube

Parameters:
  • *args – non-keyword args

  • **kwargs – keyword args

Returns:

new data object containing cropped data

Return type:

GriddedData

property is_climatology
property is_masked

Flag specifying whether data is masked or not

Note

This method only works if the data is loaded.

isel(**kwargs)[source]
property lat_res
load_input(input, var_name=None, perform_fmt_checks=None)[source]

Import input as cube

Parameters:
  • input (str or Cube) – data input. Can be a single .nc file or a preloaded iris Cube.

  • var_name (str, optional) – variable name that is extracted if input is a file path. Irrelevant if input is a preloaded Cube

  • perform_fmt_checks (bool, optional) – perform formatting checks based on information in filenames. Only relevant if input is a file

property lon_res
property long_name

Long name of variable

max()[source]

Maximum value

Return type:

float

mean(areaweighted=True)[source]

Mean value of data array

Note

If areaweighted is False, this corresponds to the numerical mean of the underlying N-dimensional numpy array, without area weights or any other advanced averaging.

mean_at_coords(latitude=None, longitude=None, time_resample_kwargs=None, **kwargs)[source]

Compute mean value at all input locations

Parameters:
  • latitude (1D list or similar) – list of latitude coordinates of coordinate locations. If None, please provide coords in iris style as list of (lat, lon) tuples via coords (handled via arg kwargs)

  • longitude (1D list or similar) – list of longitude coordinates of coordinate locations. If None, please provide coords in iris style as list of (lat, lon) tuples via coords (handled via arg kwargs)

  • time_resample_kwargs (dict, optional) – time resampling arguments passed to StationData.resample_time()

  • **kwargs – additional keyword args passed to to_time_series()

Returns:

mean value at coordinates over all times available in this object

Return type:

float

property metadata
min()[source]

Minimum value

Return type:

float

property name

ID of model to which data belongs

nanmax()[source]

Maximum value excluding NaNs

Return type:

float

nanmin()[source]

Minimum value excluding NaNs

Return type:

float

property ndim

Number of dimensions

property plot_settings

Variable instance that contains plot settings

The settings can be specified in the variables.ini file based on the unique var_name.

If no default settings can be found for this variable, all parameters will be initiated with None, in which case the AeroCom plot method uses its own default settings.

property proj_info: ProjectionInformation
quickplot_map(time_idx=0, xlim=(-180, 180), ylim=(-90, 90), add_mean=True, **kwargs)[source]

Make a quick plot onto a map

Parameters:
  • time_idx (int) – index in time to be plotted

  • xlim (tuple) – 2-element tuple specifying plotted longitude range

  • ylim (tuple) – 2-element tuple specifying plotted latitude range

  • add_mean (bool) – if True, the mean value over the region and period is inserted

  • **kwargs – additional keyword arguments passed to pyaerocom.quickplot.plot_map()

Returns:

matplotlib figure instance containing plot

Return type:

fig

property reader

Instance of reader class from which this object was created

Note

Currently only supports instances of ReadGridded.

register_var_glob(delete_existing=True)[source]
regrid(other=None, lat_res_deg=None, lon_res_deg=None, scheme='areaweighted', **kwargs)[source]

Regrid this grid to grid resolution of other grid

Parameters:
  • other (GriddedData or Cube, optional) – other data object to regrid to. If None, then input args lat_res and lon_res are used to regrid.

  • lat_res_deg (float or int, optional) – latitude resolution in degrees (is only used if input arg other is None)

  • lon_res_deg (float or int, optional) – longitude resolution in degrees (is only used if input arg other is None)

  • scheme (str) – regridding scheme (e.g. linear, nearest, areaweighted)

Returns:

regridded data object (new instance, this object remains unchanged)

Return type:

GriddedData

remove_outliers(low=None, high=None, inplace=True)[source]

Remove outliers from data

Parameters:
  • low (float) – lower end of valid range for input variable. If None, then the corresponding value from the default settings for this variable is used (cf. minimum attribute of available variables)

  • high (float) – upper end of valid range for input variable. If None, then the corresponding value from the default settings for this variable is used (cf. maximum attribute of available variables)

  • inplace (bool) – if True, this object is modified, else outliers are removed in a copy of this object

Returns:

modified data object

Return type:

GriddedData

reorder_dimensions_tseries() → None[source]

Transpose dimensions of data such that to_time_series() works

Raises:
resample_time(to_ts_type, how=None, min_num_obs=None, use_iris=False)[source]

Resample time to input resolution

Parameters:
  • to_ts_type (str) – either of the supported temporal resolutions (cf. IRIS_AGGREGATORS in helpers, e.g. “monthly”)

  • how (str) – string specifying how the data is to be aggregated, default is mean

  • min_num_obs (dict or int, optional) –

    integer or nested dictionary specifying minimum number of observations required to resample from higher to lower frequency. For instance, if input_data is hourly and to_ts_type is monthly, you may specify something like:

    min_num_obs =
        {'monthly'  :   {'daily'  : 7},
         'daily'    :   {'hourly' : 6}}
    

    to require at least 6 hours per day and 7 days per month.

  • use_iris (bool) – option to use resampling scheme from iris library rather than xarray.

Returns:

new data object containing downscaled data

Return type:

GriddedData

Raises:

TemporalResolutionError – if input resolution is not provided, or if it is higher temporal resolution than this object
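
For example, a sketch resampling daily data to monthly means while requiring at least 21 days per month (the threshold is illustrative):

    monthly = data.resample_time("monthly", how="mean",
                                 min_num_obs={"monthly": {"daily": 21}})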

search_other(var_name)[source]

Searches data for another variable

The search is constrained to the time period spanned by this object and it is attempted to load the same frequency. Uses reader (an instance of ReadGridded) to search for the other variable data.

Parameters:

var_name (str) – variable to be searched

Raises:

VariableNotFoundError – if data for input variable cannot be found.

Returns:

input variable data

Return type:

GriddedData

sel(use_neirest=True, **dimcoord_vals)[source]

Select subset by dimension names

Note

This is a BETA version, please use with care

Parameters:

**dimcoord_vals – key / value pairs specifying coordinate values to be extracted

Returns:

subset data object

Return type:

GriddedData

property shape
short_str()[source]

Short string representation

split_years(years=None)[source]

Generator to split data object into individual years

Note

This is a generator method and thus should be looped over

Parameters:

years (list, optional) – List of years that should be excluded. If None, it uses output from years_avail().

Yields:

GriddedData – single year data object

property standard_name

Standard name of variable

property start

Start time of dataset as datetime64 object

std()[source]

Standard deviation of values

property stop

Stop time of dataset as datetime64 object

property suppl_info
time_stamps()[source]

Convert time stamps into list of numpy datetime64 objects

The conversion is done using method cfunit_to_datetime64()

Returns:

list containing all time stamps as datetime64 objects

Return type:

list

to_netcdf(out_dir, savename=None, **kwargs)[source]

Save as NetCDF file

Parameters:
  • out_dir (str) – output directory (must exist)

  • savename (str, optional) – name of file. If None, aerocom_savename() is used which is generated automatically and may be modified via **kwargs

  • **kwargs – keyword args used for generating the output name (passed to aerocom_savename())

Returns:

list of output files created

Return type:

list

to_time_series(sample_points=None, scheme='nearest', vert_scheme=None, add_meta=None, use_iris=False, **coords)[source]

Extract time-series for provided input coordinates (lon, lat)

Extract time series for each lon / lat coordinate in this cube or at predefined sample points (e.g. station data). If sample points are provided, the cube is interpolated first onto the sample points.

Parameters:
  • sample_points (list) – coordinates (e.g. lon / lat) at which time series is supposed to be retrieved

  • scheme (str or iris interpolator object) – interpolation scheme (for details, see interpolate())

  • vert_scheme (str) – string specifying how to treat vertical coordinates. This is only relevant for data that contains vertical levels and is ignored otherwise. Note that if the input coordinate specifications contain altitude information, this parameter will be set automatically to ‘altitude’. Allowed inputs are all data collapse schemes that are supported by pyaerocom.helpers.str_to_iris() (e.g. mean, median, sum). Further valid schemes are altitude, surface, profile. If not otherwise specified and if altitude coordinates are provided via sample_points (or **coords parameters), then vert_scheme will be set to altitude. Else, profile is used.

  • add_meta (dict, optional) – dictionary specifying additional metadata for individual input coordinates. Keys are meta attribute names (e.g. station_name) and corresponding values are lists (with length of input coords) or single entries that are supposed to be assigned to each station. E.g. add_meta=dict(station_name=[<list_of_station_names>])).

  • **coords – additional keyword args that may be used to provide the interpolation coordinates (for details, see interpolate())

Returns:

list of result dictionaries for each coordinate. Dictionary keys are: longitude, latitude, var_name

Return type:

list
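
A sketch extracting time series at two illustrative coordinates, passed via **coords:

    result = data.to_time_series(longitude=[10.0, 25.0],
                                 latitude=[45.0, 60.0])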

to_xarray()[source]

Convert this object to an xarray.DataArray

Return type:

DataArray

transpose(new_order)[source]

Re-order data dimensions in object

Wrapper for iris.cube.Cube.transpose()

Note

Changes THIS object (i.e. no new instance of GriddedData will be created)

Parameters:

new_order (list) – new index order

property ts_type

Temporal resolution of data

property unit

Unit of data

property unit_ok

Boolean specifying if variable unit is AeroCom default

property units

Unit of data

update_meta(**kwargs)[source]

Update metadata dictionary

Parameters:

**kwargs – metadata to be added to metadata.

property var_info

Print information about variable

property var_name

Name of variable

property var_name_aerocom

AeroCom variable name

property vert_code

Vertical code of data (e.g. Column, Surface, ModelLevel)

years_avail()[source]

Generate list of years that are available in this dataset

Return type:

list

Ungridded data

class pyaerocom.ungriddeddata.UngriddedData(num_points=None, add_cols=None)[source]

Class representing point-cloud data (ungridded)

The data is organised in a 2-dimensional numpy array where the first axis (rows) corresponds to individual measurements (i.e. one timestamp of one variable) and the second dimension (containing 11 columns) stores the actual values (in column 6) along with additional information, such as metadata index (can be used as key in metadata to access additional information related to this measurement), timestamp, latitude, longitude, altitude of instrument, variable index and, in case of 3D data (e.g. LIDAR profiles), also the altitude corresponding to the data value.

Note

That said, let’s look at two examples.

Example 1: Suppose you load 3 variables from 5 files, each of which contains 30 timestamps. This corresponds to a total of 3*5*30=450 data points and hence, the shape of the underlying numpy array will be 450x11.

Example 2: 3 variables, 5 files, 30 timestamps, but each variable is height resolved, containing 100 altitudes => 3*5*30*100=4500 data points, thus, the final shape will be 4500x11.

metadata

dictionary containing meta information about the data. Keys are floating point numbers corresponding to each station, values are corresponding dictionaries containing station information.

Type:

dict[float, dict[str, Any]]

meta_idx

dictionary containing index mapping for each station and variable. Keys correspond to metadata key (float -> station, see metadata) and values are dictionaries containing keys specifying variable name and corresponding values are arrays or lists specifying indices (rows) of the station / variable information in _data. Note: this information is redundant and exists to accelerate station data extraction, since the data indices for a given metadata block do not need to be searched in the underlying numpy array.

Type:

dict[float, dict[str, list[int]]]

var_idx

mapping of variable name (keys, e.g. od550aer) to numerical variable index of this variable in data numpy array (in column specified by _VARINDEX)

Type:

dict[str, float]

Parameters:
  • num_points (int, optional) – initial number of total datapoints (number of rows in 2D dataarray)

  • add_cols (list, optional) – list of additional index column names of the 2D data array.
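
A minimal sketch instantiating an empty container (the column layout is as described above):

    from pyaerocom import UngriddedData

    data = UngriddedData(num_points=1000)
    print(data.shape)  # (1000, <number of columns described above>)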

ALLOWED_VERT_COORD_TYPES = ['altitude']
STANDARD_META_KEYS = ['filename', 'station_id', 'station_name', 'instrument_name', 'PI', 'country', 'country_code', 'ts_type', 'latitude', 'longitude', 'altitude', 'data_id', 'dataset_name', 'data_product', 'data_version', 'data_level', 'framework', 'instr_vert_loc', 'revision_date', 'website', 'ts_type_src', 'stat_merge_pref_attr']
add_chunk(size=None)[source]

Extend the size of the data array

Parameters:

size (int, optional) – number of additional rows. If None (default) or smaller than minimum chunksize specified in attribute _CHUNKSIZE, then the latter is used.

add_station_data(stat, meta_idx=None, data_idx=None, check_index=False)[source]
all_datapoints_var(var_name)[source]

Get array of all data values of input variable

Parameters:

var_name (str) – variable name

Returns:

1-d numpy array containing all values of this variable

Return type:

ndarray

Raises:

AttributeError – if variable name is not available

property altitude

Altitudes of stations

append(other)[source]

Append other instance of UngriddedData to this object

Note

Calls merge(other, new_obj=False)

Parameters:

other (UngriddedData) – other data object

Returns:

merged data object

Return type:

UngriddedData

Raises:

ValueError – if input object is not an instance of UngriddedData

apply_filters(var_outlier_ranges=None, **filter_attributes)[source]

Extended filtering method

Combines filter_by_meta() and adds option to also remove outliers (keyword remove_outliers), set flagged data points to NaN (keyword set_flags_nan) and to extract individual variables (keyword var_name).

Parameters:
  • var_outlier_ranges (dict, optional) – dictionary specifying custom outlier ranges for individual variables.

  • **filter_attributes (dict) – filters that are supposed to be applied to the data. To remove outliers, use keyword remove_outliers, to set flagged values to NaN, use keyword set_flags_nan, to extract single or multiple variables, use keyword var_name. Further filter keys are assumed to be metadata specific and are passed to filter_by_meta().

Returns:

filtered data object

Return type:

UngriddedData

apply_region_mask(region_id=None)[source]

TODO: Write documentation

Parameters:

region_id (str or list (of strings)) – ID of region or IDs of multiple regions to be combined

property available_meta_keys

List of all available metadata keys

Note

This is a list of all metadata keys that exist in this dataset, but it does not mean that all of the keys are registered in all metadata blocks, especially if the data is merged from different sources with different metadata availability

change_var_idx(var_name, new_idx)[source]

Change index that is assigned to variable

Each variable in this object has assigned a unique index that is stored in the dictionary var_idx and which is used internally to access data from a certain variable from the data array _data (the indices are stored in the data column specified by _VARINDEX, cf. class header).

This index thus needs to be unique for each variable and hence, may need to be updated, when two instances of UngriddedData are merged (cf. merge()).

And the latter is exactly what this function does.

Parameters:
  • var_name (str) – name of variable

  • new_idx (int) – new index of variable

Raises:

ValueError – if input new_idx already exist in this object as a variable index

check_convert_var_units(var_name, to_unit=None, inplace=True)[source]
check_set_country()[source]

Checks all metadata entries for availability of country information

Metadata blocks that are missing a country entry will be updated based on the country inferred from the corresponding lat / lon coordinate. Uses pyaerocom.geodesy.get_country_info_coords() (library reverse-geocode) to retrieve countries. This may be erroneous close to country borders as it uses euclidean distance based on a list of known locations.

Note

Metadata blocks that do not contain latitude and longitude entries are skipped.

Returns:

  • list – metadata entries where country was added

  • list – corresponding countries that were inferred from lat / lon

check_unit(var_name, unit=None)[source]

Check if variable unit corresponds to AeroCom unit

Parameters:
  • var_name (str) – variable name for which unit is to be checked

  • unit (str, optional) – unit to be checked, if None, AeroCom default unit is used

Raises:

MetaDataError – if unit information is not accessible for input variable name

clear_meta_no_data(inplace=True)[source]

Remove all metadata blocks that do not have data associated with them

Parameters:

inplace (bool) – if True, the changes are applied to this instance directly, else to a copy

Returns:

cleaned up data object

Return type:

UngriddedData

Raises:

DataCoverageError – if filtering results in empty data object

code_lat_lon_in_float()[source]

Method to code lat and lon into a single number so that np.unique can be used to determine unique locations

colocate_vardata(var1, data_id1=None, var2=None, data_id2=None, other=None, **kwargs)[source]
property contains_datasets

List of all datasets in this object

property contains_instruments

List of all instruments in this object

property contains_vars: list[str]

List of all variables in this dataset

copy()[source]

Make a copy of this object

Returns:

copy of this object

Return type:

UngriddedData

Raises:

MemoryError – if copy is too big to fit into memory together with existing instance

property countries_available

Alphabetically sorted list of country names available

decode_lat_lon_from_float()[source]

Method to decode lat and lon from a single number calculated by code_lat_lon_in_float

empty_trash()[source]

Set all values in trash column to NaN

extract_dataset(data_id)[source]

Extract single dataset into new instance of UngriddedData

Calls filter_by_meta().

Parameters:

data_id (str) – ID of dataset

Returns:

new instance of ungridded data containing only data from specified input network

Return type:

UngriddedData

extract_var(var_name, check_index=True)[source]

Split this object into single-var UngriddedData objects

Parameters:
  • var_name (str) – name of variable that is supposed to be extracted

  • check_index (bool) – Call _check_index() in the new data object.

Returns:

new data object containing only input variable data

Return type:

UngriddedData

extract_vars(var_names, check_index=True)[source]

Extract multiple variables from dataset

Loops over input variable names and calls extract_var() to retrieve single variable UngriddedData objects for each variable and then merges all of these into one object

Parameters:
  • var_names (list or str) – list of variables to be extracted

  • check_index (bool) – Call _check_index() in the new data object.

Returns:

new data object containing input variables

Return type:

UngriddedData

Raises:

VarNotAvailableError – if one of the input variables is not available in this data object

filter_altitude(alt_range)[source]

Filter altitude range

Parameters:

alt_range (list or tuple) – 2-element list specifying altitude range to be filtered in m

Returns:

filtered data object

Return type:

UngriddedData

filter_by_meta(negate=None, **filter_attributes)[source]

Flexible method to filter these data based on input meta specs

Parameters:
  • negate (list or str, optional) – specified meta key(s) provided via filter_attributes that are supposed to be treated as ‘not valid’. E.g. if station_name=”bad_site” is input in filter_attributes and if station_name is listed in negate, then all metadata blocks containing “bad_site” as station_name will be excluded in output data object.

  • **filter_attributes – valid meta keywords that are supposed to be filtered and the corresponding filter values (or value ranges) Only valid meta keywords are considered (e.g. data_id, longitude, latitude, altitude, ts_type)

Returns:

filtered ungridded data object

Return type:

UngriddedData

Raises:
  • NotImplementedError – if an attempt is made to filter by variables (not yet possible)

  • IOError – if any of the input keys is not a valid meta key

Example

>>> import pyaerocom as pya
>>> r = pya.io.ReadUngridded(['AeronetSunV2Lev2.daily',
...                           'AeronetSunV3Lev2.daily'], 'od550aer')
>>> data = r.read()
>>> data_filtered = data.filter_by_meta(data_id='AeronetSunV2Lev2.daily',
...                                     longitude=[-30, 30],
...                                     latitude=[20, 70],
...                                     altitude=[0, 1000])
filter_by_projection(projection, xrange: tuple[float, float], yrange: tuple[float, float])[source]

Filter the ungridded data to a horizontal bounding box given by a projection

Parameters:
  • projection – a function turning projection(lat, lon) -> (x, y)

  • xrange – x range (min/max included) in the projection plane

  • yrange – y range (min/max included) in the projection plane

filter_region(region_id, check_mask=True, check_country_meta=False, **kwargs)[source]

Filter object by a certain region

Parameters:
  • region_id (str) – name of region (must be valid AeroCom region name or HTAP region)

  • check_mask (bool) – if True and region_id a valid name for a binary mask, then the filtering is done based on that binary mask.

  • check_country_meta (bool) – if True, then the input region_id is first checked against available country names in metadata. If that fails, it is assumed that this regions is either a valid name for registered rectangular regions or for available binary masks.

  • **kwargs – currently not used in method (makes usage in higher level classes such as Filter easier as other data objects have the same method with possibly other input possibilities)

Returns:

filtered data object (containing only stations that fall into input region)

Return type:

UngriddedData

find_common_data_points(other, var_name, sampling_freq='daily')[source]
find_common_stations(other: UngriddedData, check_vars_available=None, check_coordinates: bool = True, max_diff_coords_km: float = 0.1) → dict[source]

Search common stations between two UngriddedData objects

This method loops over all stations that are stored within this object (using metadata) and checks if the corresponding station exists in a second instance of UngriddedData that is provided. The check is performed on the basis of the station name and, optionally, for each station name match, the lon / lat coordinates can be compared within a certain radius (default 0.1 km).

Note

This is a beta version and thus, to be treated with care.

Parameters:
  • other (UngriddedData) – other object of ungridded data

  • check_vars_available (list (or similar), optional) – list of variables that need to be available in stations of both datasets

  • check_coordinates (bool) – if True, check that lon and lat coordinates of station candidates match within a certain range, specified by input parameter max_diff_coords_km

Returns:

dictionary where keys are meta_indices of the common station in this object and corresponding values are meta indices of the station in the other object

Return type:

dict

find_station_meta_indices(station_name_or_pattern, allow_wildcards=True)[source]

Find indices of all metadata blocks matching input station name

You may also use a wildcard pattern as input (e.g. Potenza*)

Parameters:
  • station_name_or_pattern (str) – station name or wildcard pattern

  • allow_wildcards (bool) – if True, input station_pattern will be used as wildcard pattern and all matches are returned.

Returns:

list containing all metadata indices that match the input station name or pattern

Return type:

list

Raises:

StationNotFoundError – if no such station exists in this data object

property first_meta_idx
static from_cache(data_dir, file_name)[source]

Load pickled instance of UngriddedData

Parameters:
  • data_dir (str) – directory where pickled object is stored

  • file_name (str) – file name of pickled object (needs to end with .pkl)

Raises:

ValueError – if loading failed

Returns:

loaded UngriddedData object. If this method is called from an instance of UngriddedData, this instance remains unchanged. You may merge the returned reloaded instance using merge().

Return type:

UngriddedData

static from_station_data(stats, add_meta_keys=None)[source]

Create UngriddedData from input station data object(s)

Parameters:
  • stats (iterator or StationData) – input data object(s)

  • add_meta_keys (list, optional) – list of metadata keys that are supposed to be imported from the input StationData objects, in addition to the default metadata retrieved via StationData.get_meta().

Raises:

ValueError – if any of the input data objects is not an instance of StationData.

Returns:

ungridded data object created from input station data objects

Return type:

UngriddedData

get_variable_data(variables, start=None, stop=None, ts_type=None, **kwargs)[source]

Extract all data points of a certain variable

Parameters:

variables (str or list) – all variables that are supposed to be accessed

property has_flag_data

Boolean specifying whether this object contains flag data

property index
property is_empty

Boolean specifying whether this object contains data or not

property is_filtered

Boolean specifying whether this data object has been filtered

Note

Details about applied filtering can be found in filter_hist

property is_vertical_profile

Boolean specifying whether data is a vertical profile

last_filter_applied()[source]

Returns the last filter that was applied to this dataset

To see all filters, check out filter_hist

property last_meta_idx

Index of last metadata block

property latitude

Latitudes of stations

property longitude

Longitudes of stations

merge(other, new_obj=True)[source]

Merge another data object with this one

Parameters:
  • other (UngriddedData) – other data object

  • new_obj (bool) – if True, this object remains unchanged and the merged data objects are returned in a new instance of UngriddedData. If False, then this object is modified

Returns:

merged data object

Return type:

UngriddedData

Raises:

ValueError – if input object is not an instance of UngriddedData

merge_common_meta(ignore_keys=None)[source]

Merge all meta entries that are the same

Note

If there is an overlap in time between the data, the blocks are not merged

Parameters:

ignore_keys (list) – list containing meta keys that are supposed to be ignored

Returns:

merged data object

Return type:

UngriddedData

property nonunique_station_names

List of station names that occur more than once in metadata

num_obs_var_valid(var_name)[source]

Number of valid observations of variable in this dataset

Parameters:

var_name (str) – name of variable

Returns:

number of valid observations (all values that are not NaN)

Return type:

int

plot_station_coordinates(var_name=None, start=None, stop=None, ts_type=None, color='r', marker='o', markersize=8, fontsize_base=10, legend=True, add_title=True, **kwargs)[source]

Plot station coordinates on a map

All input parameters are optional and may be used to add constraints related to which stations are plotted. Default is all stations of all times.

Parameters:
  • var_name (str, optional) – name of variable to be retrieved

  • start – start time (optional)

  • stop – stop time (optional). If start time is provided and stop time not, then only the corresponding year inferred from start time will be considered

  • ts_type (str, optional) – temporal resolution

  • color (str) – color of stations on map

  • marker (str) – marker type of stations

  • markersize (int) – size of station markers

  • fontsize_base (int) – basic fontsize

  • legend (bool) – if True, legend is added

  • add_title (bool) – if True, title will be added

  • **kwargs – Additional keyword args passed to pyaerocom.plot.plot_coordinates()

Returns:

matplotlib axes instance

Return type:

axes

plot_station_timeseries(station_name, var_name, start=None, stop=None, ts_type=None, insert_nans=True, ax=None, **kwargs)[source]

Plot time series of station and variable

Parameters:
  • station_name (str or int) – station name or index of station in metadata dict

  • var_name (str) – name of variable to be retrieved

  • start – start time (optional)

  • stop – stop time (optional). If start time is provided and stop time not, then only the corresponding year inferred from start time will be considered

  • ts_type (str, optional) – temporal resolution

  • **kwargs – Additional keyword args passed to method pandas.Series.plot()

Returns:

matplotlib axes instance

Return type:

axes

remove_outliers(var_name, inplace=False, low=None, high=None, unit_ref=None, move_to_trash=True)[source]

Method that can be used to remove outliers from data

Parameters:
  • var_name (str) – variable name

  • inplace (bool) – if True, the outliers will be removed in this object, otherwise a new object will be created and returned

  • low (float) – lower end of valid range for input variable. If None, then the corresponding value from the default settings for this variable is used (cf. minimum attribute of available variables)

  • high (float) – upper end of valid range for input variable. If None, then the corresponding value from the default settings for this variable is used (cf. maximum attribute of available variables)

  • unit_ref (str) – reference unit for assessment of input outlier ranges: all data needs to be in that unit, else an Exception will be raised

  • move_to_trash (bool) – if True, then all detected outliers will be moved to the trash column of this data object (i.e. column no. specified at UngriddedData._TRASHINDEX).

Returns:

ungridded data object that has all outliers for this variable removed.

Return type:

UngriddedData

Raises:

ValueError – if move_to_trash is True and, for some of the measurements, there is already data in the trash.

save_as(file_name, save_dir)[source]

Save this object to disk

Note

So far, only storage as pickled object via CacheHandlerUngridded is supported, so input file_name must end with .pkl

Parameters:
  • file_name (str) – name of output file

  • save_dir (str) – name of output directory

Returns:

file path

Return type:

str

set_flags_nan(inplace=False)[source]

Set all flagged datapoints to NaN

Parameters:

inplace (bool) – if True, the flagged datapoints will be set to NaN in this object, otherwise a new object will be created and returned

Returns:

data object that has all flagged data values set to NaN

Return type:

UngriddedData

Raises:

AttributeError – if no flags are assigned

property shape

Shape of data array

property station_coordinates

dictionary with station coordinates

Returns:

dictionary containing station coordinates (latitude, longitude, altitude -> values) for all stations (keys) where these parameters are accessible.

Return type:

dict

property station_name

Station names of data

property time

Time dimension of data

to_station_data(meta_idx, vars_to_convert=None, start=None, stop=None, freq=None, ts_type_preferred=None, merge_if_multi=True, merge_pref_attr=None, merge_sort_by_largest=True, insert_nans=False, allow_wildcards_station_name=True, add_meta_keys=None, resample_how=None, min_num_obs=None)[source]

Convert data from one station to StationData

Parameters:
  • meta_idx (float) – index of station or name of station.

  • vars_to_convert (list or str, optional) – variables that are supposed to be converted. If None, use all variables that are available for this station

  • start – start time, optional (if not None, input must be convertible into pandas.Timestamp)

  • stop – stop time, optional (if not None, input must be convertible into pandas.Timestamp)

  • freq (str) – pandas frequency string (e.g. ‘D’ for daily, ‘M’ for month end) or valid pyaerocom ts_type

  • merge_if_multi (bool) – if True and if data request results in multiple instances of StationData objects, then these are attempted to be merged into one StationData object using merge_station_data()

  • merge_pref_attr – only relevant for merging of multiple matches: preferred attribute that is used to sort the individual StationData objects by relevance. Needs to be available in each of the individual StationData objects. For details cf. pref_attr in docstring of merge_station_data(). Example could be revision_date. If None, then the stations will be sorted based on the number of available data points (if merge_sort_by_largest is True, which is default).

  • merge_sort_by_largest (bool) – only relevant for merging of multiple matches: cf. prev. attr. and docstring of merge_station_data() method.

  • insert_nans (bool) – if True, then the retrieved StationData objects are filled with NaNs

  • allow_wildcards_station_name (bool) – if True and if input meta_idx is a string (i.e. a station name or pattern), metadata matches will be identified applying wildcard matches between input meta_idx and all station names in this object.

Returns:

StationData object(s) containing results. list is only returned if input for meta_idx is station name and multiple matches are detected for that station (e.g. data from different instruments), else single instance of StationData. All variable time series are inserted as pandas Series

Return type:

StationData or list
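
A sketch extracting a single site (the station name and variable are placeholders):

    stat = data.to_station_data("Hypothetical Station",
                                vars_to_convert="od550aer",
                                freq="monthly", insert_nans=True)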

to_station_data_all(vars_to_convert=None, start=None, stop=None, freq=None, ts_type_preferred=None, by_station_name=True, ignore_index=None, **kwargs)[source]

Convert all data to StationData objects

Creates one instance of StationData for each metadata block in this object.

Parameters:
  • vars_to_convert (list or str, optional) – variables that are supposed to be converted. If None, use all variables that are available for this station

  • start – start time, optional (if not None, input must be convertible into pandas.Timestamp)

  • stop – stop time, optional (if not None, input must be convertible into pandas.Timestamp)

  • freq (str) – pandas frequency string (e.g. ‘D’ for daily, ‘M’ for month end) or valid pyaerocom ts_type (e.g. ‘hourly’, ‘monthly’).

  • by_station_name (bool) – if True, then iter over unique_station_name (and merge multiple matches if applicable), else, iter over metadata index

  • **kwargs – additional keyword args passed to to_station_data() (e.g. merge_if_multi, merge_pref_attr, merge_sort_by_largest, insert_nans)

Returns:

4-element dictionary containing the following key / value pairs:

  • stats: list of StationData objects

  • station_name: list of corresponding station names

  • latitude: list of latitude coordinates

  • longitude: list of longitude coordinates

Return type:

dict
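
A sketch converting all sites at once and iterating over the documented return dictionary (the variable name is a placeholder):

    result = data.to_station_data_all(vars_to_convert="od550aer",
                                      freq="monthly")
    # the four lists are aligned, one entry per station
    for name, stat in zip(result["station_name"], result["stats"]):
        print(name, stat)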

property unique_station_names

List of unique station names

pyaerocom.ungriddeddata.reduce_array_closest(arr_nominal, arr_to_be_reduced)[source]

Co-located data

class pyaerocom.colocation.colocated_data.ColocatedData(data: Path | str | xr.DataArray | np.ndarray | None = None, **kwargs)[source]

Class representing colocated and unified data from two sources

Sources may be instances of UngriddedData or GriddedData that have been compared to each other.

Note

It is intended that this object can either be instantiated from scratch OR created in and returned by pyaerocom objects / methods that perform colocation. This is particularly true as pyaerocom will now be expected to read in colocated files created outside of pyaerocom. (Related CAMS2_82 development)

The purpose of this object is not the creation of colocated objects, but solely the analysis of such data as well as I/O features (e.g. save as / read from .nc files, convert to pandas.DataFrame, plot station time series overlays, scatter plots, etc.).

In the current design, such an object comprises 3 or 4 dimensions, where the first dimension (data_source, index 0) is ALWAYS length 2 and specifies the two datasets that were co-located (index 0 is obs, index 1 is model). The second dimension is time and in case of 3D colocated data the 3rd dimension is station_name while for 4D colocated data the 3rd and 4th dimension are latitude and longitude, respectively.

3D colocated data is typically created when a model is colocated with station based ground based observations (cf. pyaerocom.colocation.colocate_gridded_ungridded()) while 4D colocated data is created when a model is colocated with another model or satellite observations that cover large parts of Earth’s surface (other than discrete lat/lon pairs in the case of ground based station locations).

Parameters:
  • data (xarray.DataArray or numpy.ndarray or str, optional) – Colocated data. If str, then it is attempted to be loaded from file. Else, it is assumed that data is numpy array and that all further supplementary inputs (e.g. coords, dims) for the instantiation of DataArray is provided via **kwargs.

  • **kwargs – Additional keyword args that are passed to init of DataArray in case input data is numpy array.

Raises:

ValidationError – if init fails
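
A minimal sketch creating the object from an existing colocated NetCDF file (the path is a placeholder):

    from pyaerocom import ColocatedData

    coldata = ColocatedData("path/to/colocated_data_file.nc")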

apply_country_filter(region_id, use_country_code=False, inplace=False)[source]

Apply country filter

Parameters:
  • region_id (str) – country name or code.

  • use_country_code (bool, optional) – If True, input value for country is evaluated against country codes rather than country names. Defaults to False.

  • inplace (bool, optional) – Apply filter to this object directly or to a copy. The default is False.

Raises:

NotImplementedError – if data is 4D (i.e. it has latitude and longitude dimensions).

Returns:

filtered data object.

Return type:

ColocatedData

apply_latlon_filter(lat_range=None, lon_range=None, region_id=None, inplace=False)[source]

Apply rectangular latitude/longitude filter

Parameters:
  • lat_range (list, optional) – latitude range that is supposed to be applied. If specified, then also lon_range need to be specified, else, region_id is checked against AeroCom default regions (and used if applicable)

  • lon_range (list, optional) – longitude range that is supposed to be applied. If specified, then also lat_range need to be specified, else, region_id is checked against AeroCom default regions (and used if applicable)

  • region_id (str) – name of region to be applied. If provided (i.e. not None) then input args lat_range and lon_range are ignored

  • inplace (bool, optional) – Apply filter to this object directly or to a copy. The default is False.

Raises:

ValueError – if lower latitude bound exceeds upper latitude bound.

Returns:

filtered data object

Return type:

ColocatedData

apply_region_mask(region_id, inplace=False)[source]

Apply a binary regions mask filter to data object. Available binary regions IDs can be found at pyaerocom.const.HTAP_REGIONS.

Parameters:
  • region_id (str) – ID of binary regions.

  • inplace (bool, optional) – If True, the current instance is modified, else a new instance of ColocatedData is created and filtered. The default is False.

Raises:

DataCoverageError – if filtering results in empty data object.

Returns:

data – Filtered data object.

Return type:

ColocatedData

property area_weights

Wrapper for calc_area_weights()

calc_area_weights()[source]

Calculate area weights

Note

Only applies to colocated data that has latitude and longitude dimension.

Returns:

array containing weights for each datapoint (same shape as self.data[0])

Return type:

ndarray

calc_nmb_array()[source]

Calculate data array with normalised mean bias (NMB) values

Returns:

NMBs at each coordinate

Return type:

DataArray

calc_spatial_statistics(aggr=None, use_area_weights=False, **kwargs)[source]

Calculate spatial statistics from model and obs data

Spatial statistics are computed by first averaging over the time dimension and then, if data is 4D, flattening the lat / lon dimensions into a new station_name dimension, so that the resulting dimensions are data_source and station_name. These 2D data are then used to calculate standard statistics using pyaerocom.stats.stats.calculate_statistics().

See also calc_statistics() and calc_temporal_statistics().

Parameters:
  • aggr (str, optional) – aggregator to be used; currently only mean and median are supported. Defaults to mean.

  • use_area_weights (bool) – if True and if data is 4D (i.e. has lat and lon dimension), then area weights based on the coordinate cell sizes are applied when calculating the statistics. Defaults to False.

  • **kwargs – additional keyword args passed to pyaerocom.stats.stats.calculate_statistics()

Returns:

dictionary containing statistical parameters

Return type:

dict

calc_statistics(use_area_weights=False, **kwargs)[source]

Calculate statistics from model and obs data

Calculate standard statistics for model assessment. This is done by taking all model and obs data points in this object as input for pyaerocom.stats.stats.calculate_statistics(). For instance, if the object is 3D with dimensions data_source (obs, model), time (e.g. 12 monthly values) and station_name (e.g. 4 sites), then the input arrays for model and obs into pyaerocom.stats.stats.calculate_statistics() will be each of size 12x4.

See also calc_temporal_statistics() and calc_spatial_statistics().

Parameters:
  • use_area_weights (bool) – if True and if data is 4D (i.e. has lat and lon dimension), then area weights based on the coordinate cell sizes are applied when calculating the statistics. Defaults to False.

  • **kwargs – additional keyword args passed to pyaerocom.stats.stats.calculate_statistics()

Returns:

dictionary containing statistical parameters

Return type:

dict

calc_temporal_statistics(aggr=None, **kwargs)[source]

Calculate temporal statistics from model and obs data

Temporal statistics are computed by first averaging over the spatial dimension(s) (that is, station_name for 3D data, and latitude and longitude for 4D data), so that only data_source and time remain as dimensions. These 2D data are then used to calculate standard statistics using pyaerocom.stats.stats.calculate_statistics().

See also calc_statistics() and calc_spatial_statistics().

Parameters:
  • aggr (str, optional) – aggregator to be used; currently only mean and median are supported. Defaults to mean.

  • **kwargs – additional keyword args passed to pyaerocom.stats.stats.calculate_statistics()

Returns:

dictionary containing statistical parameters

Return type:

dict
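
The three statistics methods differ only in which dimensions are aggregated before pyaerocom.stats.stats.calculate_statistics() is called; a usage sketch (coldata is a ColocatedData instance, the key name in the last line is an assumption):

# Sketch: the three statistics flavours on a colocated data object
overall = coldata.calc_statistics()                       # all data points at once
spatial = coldata.calc_spatial_statistics(aggr="mean")    # time dimension averaged first
temporal = coldata.calc_temporal_statistics(aggr="mean")  # spatial dimension(s) averaged first
print(overall["nmb"])  # key name assumed; see calculate_statistics() for available keys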

check_set_countries(inplace=True, assign_to_dim=None)[source]

Checks if country information is available and assigns if not

If no country information is available, countries will be assigned for each lat / lon coordinate using pyaerocom.geodesy.get_country_info_coords().

Parameters:
  • inplace (bool, optional) – If True, modify and return this object, else a copy. The default is True.

  • assign_to_dim (str, optional) – name of dimension to which the country coordinate is assigned. Default is None, in which case station_name is used.

Raises:

DataDimensionError – If data is 4D (i.e. if latitude and longitude are orthogonal dimensions)

Returns:

data object with countries assigned

Return type:

ColocatedData

property coords

Coordinates of data array

copy()[source]

Copy this object

property countries_available

Alphabetically sorted list of country names available

Raises:

MetaDataError – if no country information is available

Returns:

list of countries available in these data

Return type:

list

property country_codes_available

Alphabetically sorted list of country codes available

Raises:

MetaDataError – if no country information is available

Returns:

list of countries available in these data

Return type:

list

data: Path | str | xr.DataArray | np.ndarray | None

property data_source

Coordinate array containing data sources (z-axis)

property dims

Names of dimensions

filter_altitude(alt_range, inplace=False)[source]

Apply altitude filter

Parameters:
  • alt_range (list or tuple) – altitude range to be applied to data (2-element list)

  • inplace (bool, optional) – Apply filter to this object directly or to a copy. The default is False.

Raises:

NotImplementedError – If data is 4D, i.e. it contains latitude and longitude dimensions.

Returns:

Filtered data object.

Return type:

ColocatedData

filter_region(region_id, check_mask=True, check_country_meta=False, inplace=False)[source]

Filter object by region

Parameters:
  • region_id (str) – ID of region

  • inplace (bool) – if True, the filtering is done directly in this instance, else a new instance is returned

  • check_mask (bool) – if True and region_id a valid name for a binary mask, then the filtering is done based on that binary mask.

  • check_country_meta (bool) – if True, then the input region_id is first checked against available country names in metadata. If that fails, it is assumed that this region is either a valid name of a registered rectangular region or of an available binary mask.

Returns:

filtered data object

Return type:

ColocatedData

flatten_latlondim_station_name()[source]

Stack (flatten) lat / lon dimension into new dimension station_name

Returns:

new colocated data object with dimension station_name and lat lon arrays as additional coordinates

Return type:

ColocatedData

from_csv(file_path)[source]

Read data from CSV file

static from_dataframe(df: DataFrame) → ColocatedData[source]

Create colocated Data object from dataframe

Note

This is intended to be used as back-conversion from to_dataframe() and methods that use the latter (e.g. to_csv()).

get_coords_valid_obs()[source]

Get latitude / longitude coordinates where obsdata is available

Returns:

  • list – latitude coordinates

  • list – longitude coordinates

get_country_codes()[source]

Get country names and codes for all locations contained in these data

Raises:

MetaDataError – if no country information is available

Returns:

dictionary of unique country names (keys) and corresponding country codes (values)

Return type:

dict

static get_meta_from_filename(file_path)[source]

Get meta information from file name

Note

This does not yet include IDs of model and obs data, as these should be included in the data anyway (e.g. column names in CSV file) and may include the delimiter _ in their name.

Returns:

dictionary with meta information

Return type:

dict

get_meta_item(key: str)[source]

Get metadata value

Parameters:

key (str) – meta item key.

Raises:

AttributeError – If key is not available.

Returns:

value of metadata.

Return type:

object

get_regional_timeseries(region_id, **filter_kwargs)[source]

Compute regional timeseries both for model and obs

Parameters:
  • region_id (str) – name of region for which regional timeseries is supposed to be retrieved

  • **filter_kwargs – additional keyword args passed to filter_region().

Returns:

dictionary containing regional timeseries for model (key mod) and obsdata (key obs) and name of region.

Return type:

dict
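
For example (region name assumed to be a valid AeroCom region, coldata a ColocatedData instance):

# Sketch: regional timeseries for model and obs (region name assumed valid)
tseries = coldata.get_regional_timeseries("EUROPE")
mod_ts, obs_ts = tseries["mod"], tseries["obs"]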

get_time_resampling_settings()[source]

Returns a dictionary with relevant settings for temporal resampling

Return type:

dict

property has_latlon_dims

Boolean specifying whether data has latitude and longitude dimensions

property has_time_dim

Boolean specifying whether data has a time dimension

property lat_range

Latitude range covered by this data object

property latitude

Array of latitude coordinates

property lon_range

Longitude range covered by this data object

property longitude

Array of longitude coordinates

max()[source]

Wrapper for xarray.DataArray.max() called from data

Returns:

maximum of data

Return type:

xarray.DataArray

property metadata

Metadata dictionary (wrapper for data.attrs)

min()[source]

Wrapper for xarray.DataArray.min() called from data

Returns:

minimum of data

Return type:

xarray.DataArray

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'allow', 'protected_namespaces': (), 'validate_assignment': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

property model_name

property ndim

Dimension of data array

property num_coords

Total number of lat/lon coordinate pairs

property num_coords_with_data

Number of lat/lon coordinate pairs that contain at least one datapoint

Note

Occurrence of valid data is only checked for obsdata (first index in data_source dimension).

property obs_name

open(file_path)[source]

High level helper for reading from supported file sources

Parameters:

file_path (str) – file path

plot_coordinates(marker='x', markersize=12, fontsize_base=10, **kwargs)[source]

Plot station coordinates

Uses pyaerocom.plot.plotcoordinates.plot_coordinates().

Parameters:
  • marker (str, optional) – matplotlib marker name used to plot site locations. The default is ‘x’.

  • markersize (int, optional) – Size of site markers. The default is 12.

  • fontsize_base (int, optional) – Basic fontsize. The default is 10.

  • **kwargs – additional keyword args passed to pyaerocom.plot.plotcoordinates.plot_coordinates()

Return type:

matplotlib.axes.Axes

plot_scatter(**kwargs)[source]

Create scatter plot of data

Parameters:

**kwargs – keyword args passed to pyaerocom.plot.plotscatter.plot_scatter()

Returns:

matplotlib axes instance

Return type:

Axes

read_netcdf(file_path)[source]

Read data from NetCDF file

Parameters:

file_path (str) – file path

rename_variable(var_name, new_var_name, data_source, inplace=True)[source]

Rename a variable in this object

Parameters:
  • var_name (str) – current variable name

  • new_var_name (str) – new variable name

  • data_source (str) – name of data source (along data_source dimension)

  • inplace (bool) – replace here or create new instance

Returns:

instance with renamed variable

Return type:

ColocatedData

resample_time(to_ts_type, how=None, min_num_obs=None, colocate_time=False, settings_from_meta=False, inplace=False, **kwargs)[source]

Resample time dimension

The temporal resampling is done using TimeResampler

Parameters:
  • to_ts_type (str) – desired output frequency.

  • how (str or dict, optional) – aggregator used for resampling (e.g. max, min, mean, median). Can also be hierarchical scheme via dict, similar to min_num_obs. The default is None.

  • min_num_obs (int or dict, optional) – Minimum number of observations required to resample from current frequency (ts_type) to desired output frequency.

  • colocate_time (bool, optional) – If True, the model data is invalidated where obs is NaN before resampling. The default is False (updated in v0.11.0, before was True).

  • settings_from_meta (bool) – if True, then input args how, min_num_obs and colocate_time are ignored and instead the corresponding values set in metadata are used. Defaults to False.

  • inplace (bool, optional) – If True, modify this object directly, else make a copy and resample that one. The default is False (updated in v0.11.0, before was True).

  • **kwargs – Additional keyword args passed to TimeResampler.resample().

Returns:

Resampled colocated data object.

Return type:

ColocatedData
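
A sketch of resampling with a hierarchical min_num_obs constraint (all numbers arbitrary; coldata is a ColocatedData instance in daily resolution):

# Sketch: resample daily colocated data to monthly resolution, requiring
# at least 3 daily values per week and 4 weekly values per month
monthly = coldata.resample_time(
    "monthly",
    how="mean",
    min_num_obs=dict(monthly=dict(weekly=4), weekly=dict(daily=3)),
)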

property savename_aerocom

Default save name for data object following AeroCom convention

set_zeros_nan(inplace=True)[source]

Replace all 0’s with NaN in data

Parameters:

inplace (bool) – Whether to modify this object or a copy. The default is True.

Returns:

cd – modified data object

Return type:

ColocatedData

property shape

Shape of data array

stack(inplace=False, **kwargs)[source]

Stack one or more dimensions

For details see xarray.DataArray.stack().

Parameters:
  • inplace (bool) – modify this object or a copy.

  • **kwargs – input arguments passed to DataArray.stack()

Returns:

stacked data object

Return type:

ColocatedData

property start

Start datetime of data

property start_str

Start date of data as str with format YYYYMMDD

Type:

str

property stop

Stop datetime of data

property stop_str

Stop date of data as str with format YYYYMMDD

Type:

str

property time

Array containing time stamps

to_csv(out_dir, savename=None)[source]

Save data object as .csv file

Converts data to pandas.DataFrame and then saves as csv

Parameters:
  • out_dir (str) – output directory

  • savename (str, optional) – name of file, if None, the default save name is used (cf. savename_aerocom)

to_dataframe()[source]

Convert this object into pandas.DataFrame

The resulting DataFrame will have the following columns:

  • time: Time.

  • station_name: Station name.

  • data_source_obs: Observation data source (e.g. EBASMC).

  • data_source_mod: Model data source (e.g. EMEP).

  • latitude.

  • longitude.

  • altitude.

  • {var_name}_obs: Variable value of observation.

  • {var_name}_mod: Variable value of model.

{var_name} is the AeroCom variable name of the variable.
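
For example (variable name hypothetical, coldata a ColocatedData instance):

# Sketch: convert to a pandas.DataFrame and compare obs and model columns
df = coldata.to_dataframe()
print(df.columns)  # includes e.g. od550aer_obs and od550aer_mod
bias = df["od550aer_mod"] - df["od550aer_obs"]  # variable name hypothetical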

to_netcdf(out_dir, savename=None, **kwargs)[source]

Save data object as NetCDF file

Wrapper for method xarray.DataArray.to_netcdf()

Parameters:
  • out_dir (str) – output directory

  • savename (str, optional) – name of file, if None, the default save name is used (cf. savename_aerocom)

  • **kwargs – additional, optional keyword arguments passed to xarray.DataArray.to_netcdf()

Returns:

file path of stored object.

Return type:

str
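
Saving and re-reading is a simple round trip; a sketch (output directory hypothetical, coldata a ColocatedData instance):

# Sketch: save to NetCDF and load again (output directory hypothetical)
from pyaerocom import ColocatedData

fp = coldata.to_netcdf("/tmp/coldata")  # returns the file path of the stored object
reloaded = ColocatedData(fp)            # str input is loaded from file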

property ts_type

String specifying temporal resolution of data

property units

Unit of data

property unitstr

String representation of obs and model units in this object

unstack(inplace=False, **kwargs)[source]

Unstack one or more dimensions

For details see xarray.DataArray.unstack().

Parameters:
  • inplace (bool) – modify this object or a copy.

  • **kwargs – input arguments passed to DataArray.unstack()

Returns:

unstacked data object

Return type:

ColocatedData

validate_data()[source]

property var_name

Variable name(s) of the data (as stored in metadata)

pyaerocom.colocation.colocated_data.ensure_correct_dimensions(data: DataArray)[source]

Ensure correct dimensions on an xarray.DataArray passed to ColocatedData. This check is needed if a ColocatedData object is created outside of pyaerocom. The function is used as part of the model validator.

Station data

class pyaerocom.stationdata.StationData(**meta_info)[source]

Dict-like base class for single station data

ToDo: write more detailed introduction

Note

Variable data (e.g. numpy array or pandas Series) can be directly assigned to the object. When assigning variable data it is recommended to add variable metadata (e.g. unit, ts_type) in var_info, where key is variable name and value is dict with metadata entries.
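
A minimal sketch of this pattern (all values invented for illustration; the var_info key names follow the conventions described above but are assumptions):

# Sketch: directly assign variable data and register metadata in var_info
# (all values invented; var_info key names assumed)
import numpy as np
from pyaerocom import StationData

stat = StationData(station_name="ExampleSite", latitude=60.0,
                   longitude=10.0, altitude=100.0)
stat.dtime = np.array(["2010-01-01", "2010-01-02"], dtype="datetime64[D]")
stat.od550aer = np.array([0.1, 0.2])
stat.var_info["od550aer"] = dict(units="1", ts_type="daily")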

dtime

list / array containing time index values

Type:

list

var_info

dictionary containing information about each variable

Type:

dict

data_err

dictionary that may be used to store uncertainty timeseries or data arrays associated with the different variable data.

Type:

dict

overlap

dictionary that may be filled to store overlapping timeseries data associated with one variable. This is, for instance, used in merge_vardata() to store overlapping data from another station.

Type:

dict

PROTECTED_KEYS = ['dtime', 'var_info', 'station_coords', 'data_err', 'overlap', 'numobs', 'data_flagged']

Keys that are ignored when accessing metadata

STANDARD_COORD_KEYS = ['latitude', 'longitude', 'altitude']

List of keys that specify standard metadata attribute names. This is used e.g. in get_meta()

STANDARD_META_KEYS = ['filename', 'station_id', 'station_name', 'instrument_name', 'PI', 'country', 'country_code', 'ts_type', 'latitude', 'longitude', 'altitude', 'data_id', 'dataset_name', 'data_product', 'data_version', 'data_level', 'framework', 'instr_vert_loc', 'revision_date', 'website', 'ts_type_src', 'stat_merge_pref_attr']

VALID_TS_TYPES = ['minutely', 'hourly', 'daily', 'weekly', 'monthly', 'yearly', 'native', 'coarsest']

calc_climatology(var_name, start=None, stop=None, min_num_obs=None, clim_mincount=None, clim_freq=None, set_year=None, resample_how=None)[source]

Calculate climatological timeseries for input variable

Parameters:
  • var_name (str) – name of data variable

  • start – start time of data used to compute climatology

  • stop – stop time of data used to compute climatology

  • min_num_obs (dict or int, optional) – minimum number of observations required per period (when downsampling). For details see pyaerocom.time_resampler.TimeResampler.resample()

  • clim_mincount (int, optional) – minimum number of monthly values required per month of climatology

  • set_year (int, optional) – if specified, the output data will be assigned the input year. Else the middle year of the climatological interval is used.

  • resample_how (str) – how should the resampled data be averaged (e.g. mean, median)

  • **kwargs – Additional keyword args passed to pyaerocom.time_resampler.TimeResampler.resample()

Returns:

new instance of StationData containing climatological data

Return type:

StationData
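
For instance (variable name, years and constraint arbitrary; stat a StationData instance):

# Sketch: compute a climatology over 2005-2015 for one variable
clim = stat.calc_climatology("od550aer", start=2005, stop=2015, min_num_obs=5)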

check_dtime()[source]

Checks if dtime attribute is array or list

check_if_3d(var_name)[source]

Checks if altitude data is available in this object

check_unit(var_name, unit=None)[source]

Check if variable unit corresponds to a certain unit

Parameters:
  • var_name (str) – variable name for which unit is to be checked

  • unit (str, optional) – unit to be checked, if None, AeroCom default unit is used

Raises:
  • MetaDataError – if unit information is not accessible for input variable name

  • UnitConversionError – if current unit cannot be converted into specified unit (e.g. 1 vs m-1)

  • DataUnitError – if current unit is not equal to input unit but can be converted (e.g. 1/Mm vs 1/m)

check_var_unit_aerocom(var_name)[source]

Check if unit of input variable is AeroCom default, if not, convert

Parameters:

var_name (str) – name of variable

Raises:
  • MetaDataError – if unit information is not accessible for input variable name

  • UnitConversionError – if current unit cannot be converted into specified unit (e.g. 1 vs m-1)

  • DataUnitError – if current unit is not equal to AeroCom default and cannot be converted.

convert_unit(var_name, to_unit)[source]

Try to convert unit of data

Requires that unit of input variable is available in var_info

Parameters:
  • var_name (str) – name of variable

  • to_unit (str) – new unit

copy()[source]

property default_vert_grid

AeroCom default grid for vertical regridding

For details, see DEFAULT_VERT_GRID_DEF in Config

Returns:

numpy array specifying default coordinates

Return type:

ndarray

dist_other(other)[source]

Distance to other station in km

Parameters:

other (StationData) – other data object

Returns:

distance between this and other station in km

Return type:

float

get_meta(force_single_value=True, quality_check=True, add_none_vals=False, add_meta_keys=None)[source]

Return meta-data as dictionary

By default, only default metadata keys are considered, use parameter add_meta_keys to add additional metadata.

Parameters:
  • force_single_value (bool) – if True, then each meta value that is a list or array is converted to a single value.

  • quality_check (bool) – if True and coordinate values are lists or arrays, then the standard deviation of the values is compared to the upper limits allowed for local variation. The upper limits are specified in attr. COORD_MAX_VAR.

  • add_none_vals (bool) – Add metadata keys which have value set to None.

  • add_meta_keys (str or list, optional) – Add non-standard metadata keys.

Returns:

dictionary containing the retrieved meta-data

Return type:

dict

Raises:
  • AttributeError – if one of the meta entries is invalid

  • MetaDataError – in case of inconsistencies in metadata between individual time stamps

get_station_coords(force_single_value=True)[source]

Return coordinates as dictionary

This method uses the standard coordinate names defined in STANDARD_COORD_KEYS (latitude, longitude and altitude) to get the station coordinates. For each of these parameters it first looks in station_coords to see if the parameter is defined (i.e. it is not None), and if not, it checks whether this object has an attribute of that name and uses that one.

Parameters:

force_single_value (bool) – if True and coordinate values are lists or arrays, then they are collapsed to a single value using the mean

Returns:

dictionary containing the retrieved coordinates

Return type:

dict

Raises:
  • AttributeError – if one of the coordinate values is invalid

  • CoordinateError – if local variation in either of the three spatial coordinates is found too large

get_unit(var_name)[source]

Get unit of variable data

Parameters:

var_name (str) – name of variable

Returns:

unit of variable

Return type:

str

Raises:

MetaDataError – if unit cannot be accessed for variable

get_var_ts_type(var_name, try_infer=True)[source]

Get ts_type for a certain variable

Note

Converts to ts_type string if assigned ts_type is in pandas format

Parameters:
  • var_name (str) – data variable name for which the ts_type is supposed to be retrieved

  • try_infer (bool) – if ts_type is not available, try inferring it from data

Returns:

the corresponding data time resolution

Return type:

str

Raises:

MetaDataError – if no metadata is available for this variable (e.g. if var_name cannot be found in var_info)

has_var(var_name)[source]

Checks if input variable is available in data object

Parameters:

var_name (str) – name of variable

Returns:

True, if variable data is available, else False

Return type:

bool

insert_nans_timeseries(var_name)[source]

Fill up missing values with NaNs in an existing time series

Note

This method resamples the data onto a regular grid. Thus, if the input ts_type differs from the actual current ts_type of the data, this method will not only insert NaNs but also resample the data at the same time.

Parameters:
  • var_name (str) – variable name

  • inplace (bool) – if True, the actual data in this object will be overwritten with the new data that contains NaNs

Returns:

the modified station data object

Return type:

StationData

merge_meta_same_station(other, coord_tol_km=None, check_coords=True, inplace=True, add_meta_keys=None, raise_on_error=False)[source]

Merge meta information from other object

Note

Coordinate attributes (latitude, longitude and altitude) are not copied as they are required to be the same in both stations. The latter can be checked and ensured using input argument check_coords

Parameters:
  • other (StationData) – other data object

  • coord_tol_km (float) – maximum distance in km between coordinates of input StationData object and self. Only relevant if check_coords is True. If None, then _COORD_MAX_VAR is used which is defined in the class header.

  • check_coords (bool) – if True, the coordinates are compared and checked if they are lying within a certain distance to each other (cf. coord_tol_km).

  • inplace (bool) – if True, the metadata from the other station is added to the metadata of this station, else, a new station is returned with the merged attributes.

  • add_meta_keys (str or list, optional) – additional non-standard metadata keys that are supposed to be considered for merging.

  • raise_on_error (bool) – if True, then an Exception will be raised in case one of the metadata items cannot be merged, which is most often due to unresolvable type differences of metadata values between the two objects

merge_other(other, var_name, add_meta_keys=None, **kwargs)[source]

Merge other station data object

Parameters:
  • other (StationData) – other data object

  • var_name (str) – variable name for which info is to be merged (needs to be both available in this object and the provided other object)

  • add_meta_keys (str or list, optional) – additional non-standard metadata keys that are supposed to be considered for merging.

  • kwargs – keyword args passed on to merge_vardata() (e.g time resampling settings)

Returns:

this object that has merged the other station

Return type:

StationData
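
A usage sketch, assuming both objects contain od550aer and the corresponding metadata in var_info:

# Sketch: merge the od550aer timeseries of another station into this one
merged = stat.merge_other(other_stat, "od550aer")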

merge_vardata(other, var_name, **kwargs)[source]

Merge variable data from other object into this object

Note

This merges also the information about this variable in the dict var_info. It is required, that variable meta-info is specified in both StationData objects.

Note

This method removes NaNs from the existing time series in the data objects. In order to fill up the time-series with NaNs again after merging, call insert_nans_timeseries()

Parameters:
  • other (StationData) – other data object

  • var_name (str) – variable name for which info is to be merged (needs to be both available in this object and the provided other object)

  • kwargs – keyword args passed on to _merge_vardata_2d()

Returns:

this object merged with other object

Return type:

StationData

merge_varinfo(other, var_name)[source]

Merge variable specific meta information from other object

Parameters:
  • other (StationData) – other data object

  • var_name (str) – variable name for which info is to be merged (needs to be both available in this object and the provided other object)

plot_timeseries(var_name, add_overlaps=False, legend=True, tit=None, **kwargs)[source]

Plot timeseries for variable

Note

If you set input arg add_overlaps = True the overlapping timeseries data - if it exists - will be plotted on top of the actual timeseries using red colour and dashed line. As the overlapping data may be identical with the actual data, you might want to increase the line width of the actual timeseries using an additional input argument lw=4, or similar.

Parameters:
  • var_name (str) – name of variable (e.g. “od550aer”)

  • add_overlaps (bool) – if True and if overlapping data exists for this variable, it will be added to the plot.

  • tit (str, optional) – title of plot, if None, default title is used

  • **kwargs – additional keyword args passed to matplotlib plot method

Returns:

matplotlib.axes instance of plot

Return type:

axes

Raises:
  • KeyError – if variable key does not exist in this dictionary

  • ValueError – if length of data array does not equal the length of the time array
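
Following the note above, a plotting sketch (variable name arbitrary; stat a StationData instance):

# Sketch: plot overlapping data on top of a thickened base timeseries
ax = stat.plot_timeseries("od550aer", add_overlaps=True, lw=4)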

remove_outliers(var_name, low=None, high=None, check_unit=True)[source]

Remove outliers from one of the variable timeseries

Parameters:
  • var_name (str) – variable name

  • low (float) – lower end of valid range for input variable. If None, the corresponding value from the default settings for this variable is used (cf. minimum attribute of available variables)

  • high (float) – upper end of valid range for input variable. If None, the corresponding value from the default settings for this variable is used (cf. maximum attribute of available variables)

  • check_unit (bool) – if True, the unit of the data is checked against AeroCom default

remove_variable(var_name)[source]

Remove variable data

Parameters:

var_name (str) – name of variable that is to be removed

Returns:

current instance of this object, with data removed

Return type:

StationData

Raises:

VarNotAvailableError – if the input variable is not available in this object

resample_time(var_name, ts_type, how=None, min_num_obs=None, inplace=False, **kwargs)[source]

Resample one of the time-series in this object

Parameters:
  • var_name (str) – name of data variable

  • ts_type (str) – new frequency string (can be pyaerocom ts_type or valid pandas frequency string)

  • how (str) – how should the resampled data be averaged (e.g. mean, median)

  • min_num_obs (dict or int, optional) – minimum number of observations required per period (when downsampling). For details see pyaerocom.time_resampler.TimeResampler.resample()

  • inplace (bool) – if True, then the current data object stored in self, will be overwritten with the resampled time-series

  • **kwargs – Additional keyword args passed to pyaerocom.time_resampler.TimeResampler.resample()

Returns:

with resampled variable timeseries

Return type:

StationData

resample_timeseries(var_name, **kwargs)[source]

Wrapper for resample_time() (for backwards compatibility)

Note

For backwards compatibility, this method will return a pandas Series instead of the actual StationData object

same_coords(other, tol_km=None)[source]

Compare station coordinates of other station with this station

Parameters:
  • other (StationData) – other data object

  • tol_km (float) – distance tolerance in km

Returns:

if True, the two objects are located within the specified tolerance range

Return type:

bool

select_altitude(var_name, altitudes)[source]

Extract variable data within certain altitude range

Note

Beta version

Parameters:
  • var_name (str) – name of variable for which data is supposed to be extracted

  • altitudes (list) – altitude range in m, e.g. [0, 1000]

Returns:

data object within input altitude range

Return type:

pandas.Series or xarray.DataArray

to_timeseries(var_name, **kwargs)[source]

Get pandas.Series object for one of the data columns

Parameters:

var_name (str) – name of variable (e.g. “od550aer”)

Returns:

time series object

Return type:

Series

Raises:
  • KeyError – if variable key does not exist in this dictionary

  • ValueError – if length of data array does not equal the length of the time array

property units

Dictionary containing units of all variables in this object

property vars_available

List of variables available in this data object

Other data classes

class pyaerocom.vertical_profile.VerticalProfile(data: ArrayLike, altitude: ArrayLike, dtime, var_name: str, data_err: ArrayLike | None, var_unit: str, altitude_unit: str)[source]

Object representing single variable profile data

property altitude

Array containing altitude values corresponding to data

property data

Array containing data values corresponding to data

property data_err

Array containing uncertainty values corresponding to data

plot(plot_errs=True, whole_alt_range=False, rot_xlabels=30, errs_shaded=True, errs_alpha=0.1, add_vertbar_zero=True, figsize=None, ax=None, **kwargs)[source]

Simple plot method for vertical profile

Co-location routines

High-level co-location engine

Classes and methods to perform high-level colocation.

class pyaerocom.colocation.colocator.Colocator(colocation_setup: ColocationSetup | dict, **kwargs)[source]

High level class for running co-location

Note

This object requires an instance of ColocationSetup.

get_model_name()[source]

Get name of model

Note

Not to be confused with model_id which is always the database ID of the model, while model_name can differ from that and is used for output files, etc.

Raises:

AttributeError – If neither model_id nor model_name is set

Returns:

preferably model_name, else model_id

Return type:

str

get_nc_files_in_coldatadir()[source]

Get list of NetCDF files in colocated data directory

Returns:

list of NetCDF file paths found

Return type:

list

get_obs_name()[source]

Get name of obsdata source

Note

Not to be confused with obs_id which is always the database ID of the observation dataset, while obs_name can differ from that and is used for output files, etc.

Raises:

AttributeError – If neither obs_id nor obs_name is set

Returns:

preferably obs_name, else obs_id

Return type:

str

property model_reader

Model data reader

property model_vars

List of all model variables specified in config

Note

This method does not check if the variables are valid or available.

Returns:

list of all model variables specified in this setup.

Return type:

list

property obs_is_ungridded

True if obs_id refers to an ungridded observation, else False

Type:

bool

property obs_is_vertical_profile

True if obs_id refers to a VerticalProfile, else False

Type:

bool

property obs_reader

Observation data reader

property output_dir

Output directory for colocated data NetCDF files

Type:

str

prepare_run(var_list: list | None = None) → dict[source]

Prepare colocation run for current setup.

Parameters:

var_list (list, optional) – list of variables to be analysed. The default is None, in which case all defined variables are attempted to be colocated.

Raises:

AttributeError – If no observation variables are defined (obs_vars empty).

Returns:

vars_to_process – Mapping of variables to be processed, keys are model vars, values are obs vars.

Return type:

dict

run(var_list: list | None = None)[source]

Perform colocation for current setup

See also prepare_run().

Parameters:

var_list (list, optional) – list of variables supposed to be analysed. The default is None, in which case all defined variables are attempted to be colocated.

Returns:

nested dictionary, where keys are model variables, values are dictionaries comprising key / value pairs of obs variables and associated instances of ColocatedData.

Return type:

dict

class pyaerocom.colocation.colocation_setup.ColocationSetup(model_id: str | None = None, pyaro_config: PyaroConfig | None = None, obs_id: str | None = None, obs_vars: tuple[str, ...] | None = (), ts_type: str = 'monthly', start: Timestamp | int | None = None, stop: Timestamp | int | None = None, basedir_coldata: str = '/home/docs/MyPyaerocom/colocated_data', save_coldata: bool = False, *, OBS_VERT_TYPES_ALT: dict[str, str] = {'2D': '2D', 'Surface': 'ModelLevel'}, CRASH_ON_INVALID: bool = False, FORBIDDEN_KEYS: list[str] = ['var_outlier_ranges', 'var_ref_outlier_ranges', 'remove_outliers'], filter_name: str = 'ALL-wMOUNTAINS', obs_name: str | None = None, obs_data_dir: Path | str | None = None, obs_use_climatology: bool = False, obs_cache_only: bool = False, obs_vert_type: str | None = None, obs_ts_type_read: str | dict | None = None, obs_filters: dict = {}, colocation_layer_limits: tuple[LayerLimits, ...] | None = None, profile_layer_limits: tuple[LayerLimits, ...] | None = None, read_opts_ungridded: dict | None = {}, model_name: str | None = None, model_data_dir: Path | str | None = None, model_read_opts: dict | None = {}, model_use_vars: dict[str, str] | None = {}, model_rename_vars: dict[str, str] | None = {}, model_add_vars: dict[str, tuple[str, ...]] | None = {}, model_to_stp: bool = False, model_ts_type_read: str | dict | None = None, model_read_aux: dict[str, dict[Literal['vars_required', 'fun'], list[str] | Callable]] | None = {}, model_use_climatology: bool = False, gridded_reader_id: dict[str, str] = {'model': 'ReadGridded', 'obs': 'ReadGridded'}, flex_ts_type: bool = True, min_num_obs: dict | int | None = None, resample_how: str | dict | None = 'mean', obs_remove_outliers: bool = False, model_remove_outliers: bool = False, obs_outlier_ranges: dict[str, tuple[float, float]] | None = {}, model_outlier_ranges: dict[str, tuple[float, float]] | None = {}, zeros_to_nan: bool = False, harmonise_units: bool = False, regrid_res_deg: float | RegridResDeg | None = None, colocate_time: bool = False, reanalyse_existing: bool = True, raise_exceptions: bool = False, keep_data: bool = True, add_meta: dict | None = {}, model_kwargs: dict = {}, main_freq: str = 'monthly', freqs: list[str] = ['monthly', 'yearly'])[source]

Setup class for high-level model / obs co-location.

An instance of this setup class can be used to run a colocation analysis between a model and an observation network and will create a number of pya.ColocatedData instances, which can be saved automatically as NetCDF files.

Apart from co-location, this class also handles reading of the input data for co-location. Supported co-location options are:

1. gridded vs. ungridded data: for instance, 3D model data (instance of GriddedData) with lat, lon and time dimensions that is co-located with station-based observations, which are represented in pyaerocom through UngriddedData objects. The co-location function used is pyaerocom.colocation.colocate_gridded_ungridded(). For this type of co-location, the output co-located data object will be 3-dimensional, with dimensions data_source (index 0: obs, index 1: model), time and station_name.

2. gridded vs. gridded data: for instance, 3D model data that is co-located with 3D satellite data (both instances of GriddedData), both objects with lat, lon and time dimensions. The co-location function used is pyaerocom.colocation.colocate_gridded_gridded(). For this type of co-location, the output co-located data object will be 4-dimensional, with dimensions data_source (index 0: obs, index 1: model), time, latitude and longitude.
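
A minimal sketch tying this setup class to Colocator (model ID, obs ID and vertical type code are hypothetical and must match datasets available in your database):

# Minimal sketch: configure and run a gridded vs. ungridded colocation.
# model_id, obs_id and obs_vert_type are assumptions for illustration.
from pyaerocom.colocation.colocation_setup import ColocationSetup
from pyaerocom.colocation.colocator import Colocator

setup = ColocationSetup(
    model_id="ExampleModel",
    obs_id="AeronetSunV3Lev2.daily",
    obs_vars=("od550aer",),
    obs_vert_type="Column",
    ts_type="monthly",
    start=2010,
    save_coldata=False,
)
results = Colocator(setup).run()  # {model_var: {obs_var: ColocatedData}}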

model_id

ID of model to be used.

Type:

str

pyaro_config

In case Pyaro is used, a config must be provided. In that case, obs_id (see below) is ignored and only the config is used.

Type:

PyaroConfig

obs_id

ID of observation network to be used.

Type:

str

obs_vars

Variables to be analysed (need to be available in input obs dataset). Variables that are not available in the model data output will be skipped. Alternatively, model variables to be used for a given obs variable can also be specified via attributes model_use_vars and model_add_vars.

Type:

tuple[str, …]

ts_type

String specifying colocation output frequency.

Type:

str

start

Start time of colocation. Input can be an integer denoting the year or anything that can be converted into pandas.Timestamp using pyaerocom.helpers.to_pandas_timestamp(). If None, then the first available date in the model data is used.

Type:

pandas._libs.tslibs.timestamps.Timestamp | int | str | None

stop

stop time of colocation. int or anything that can be converted into pandas.Timestamp using pyaerocom.helpers.to_pandas_timestamp() or None. If None and if start is on resolution of year (e.g. start=2010) then stop will be automatically set to the end of that year. Else, it will be set to the last available timestamp in the model data.

Type:

pandas._libs.tslibs.timestamps.Timestamp | int | str | None

filter_name

name of filter to be applied. If None, then pyaerocom.const.DEFAULT_REG_FILTER is used, which should default to ALL-wMOUNTAINS, that is, no filtering.

Type:

str

basedir_coldata

Base directory for storing of colocated data files.

Type:

str | Path

save_coldata

if True, colocated data objects are saved as NetCDF file.

Type:

bool

obs_name

if provided, this string will be used in colocated data filename to specify obsnetwork, else obs_id will be used.

Type:

str, optional

obs_data_dir

location of obs data. If None, attempt to infer obs location based on obs ID.

Type:

str, optional

obs_use_climatology

BETA: if True, the pyaerocom default climatology is computed from observation stations (so far only possible for ungridded / gridded colocation).

Type:

bool

obs_vert_type

AeroCom vertical code encoded in the model filenames (only AeroCom 3 and later). Specifies which model file should be read in case there are multiple options (e.g. surface level data can be read from a Surface.nc file as well as from a ModelLevel.nc file). If input is string (e.g. ‘Surface’), then the corresponding vertical type code is used for reading of all variables that are colocated (i.e. that are specified in obs_vars).

Type:

str

obs_ts_type_read

may be specified to explicitly define the reading frequency of the observation data (so far, this only applies to gridded obsdata such as satellites), either as str (same for all obs variables) or variable-specific as dict. For ungridded reading, the frequency may be specified via obs_id, where applicable (e.g. AeronetSunV3Lev2.daily). Not to be confused with ts_type, which specifies the frequency used for colocation.

Type:

str or dict, optional

obs_filters

filters applied to the observational dataset before co-location. In case of gridded / gridded colocation, these are filters that can be passed to pyaerocom.io.ReadGridded.read_var(), for instance, flex_ts_type, or constraints. In case the obsdata is ungridded (gridded / ungridded co-location) these are filters that are handled through keyword filter_post in pyaerocom.io.ReadUngridded.read(). These filters are applied to the UngriddedData objects after reading and caching the data, so changing them will not invalidate the latest cache of the UngriddedData.

Type:

dict

read_opts_ungridded

dictionary that specifies reading constraints for ungridded reading, passed as **kwargs to pyaerocom.io.ReadUngridded.read(). Note that, unlike obs_filters, these filters are applied during the reading of the UngriddedData objects, and specifying them will deactivate caching.

Type:

dict, optional

model_name

if provided, this string will be used in the colocated data filename to specify the model, else model_id will be used.

Type:

str, optional

model_data_dir

Location of model data. If None, attempt to infer model location based on model ID.

Type:

str, optional

model_read_opts

options for model reading (passed as keyword args to pyaerocom.io.ReadUngridded.read()).

Type:

dict, optional

model_use_vars

dictionary that specifies mapping of model variables. Keys are observation variables, values are the corresponding model variables (e.g. model_use_vars=dict(od550aer=’od550csaer’)). Example: your observation has var od550aer but your model uses a different variable name for that variable, say od550. Then, you can specify this via model_use_vars = {‘od550aer’ : ‘od550’}. NOTE: in this case, a model variable od550aer will be ignored, even if it exists (cf model_add_vars).

Type:

dict, optional

model_rename_vars

rename certain model variables after co-location, before storing the associated ColocatedData object on disk. Keys are model variables, values are new names (e.g. model_rename_vars={‘od550aer’:’MyAOD’}). Note: this does not impact which variables are read from the model.

Type:

dict, optional

model_add_vars

additional model variables to be processed for one obs variable. E.g. model_add_vars={‘od550aer’: [‘od550so4’, ‘od550gt1aer’]} would co-locate both model SO4 AOD (od550so4) and model coarse mode AOD (od550gt1aer) with total AOD (od550aer) from obs (in addition to od550aer vs od550aer if applicable).

Type:

dict, optional

model_to_stp

ALPHA (please do not use): convert model data values to STP conditions after co-location. Note: this only works for very particular settings at the moment and needs revision, as it relies on access to meteorological data.

Type:

bool

model_ts_type_read

may be specified to explicitly define the reading frequency of the model data, either as str (same for all obs variables) or variable specific as dict. Not to be confused with ts_type, which specifies the output frequency of the co-located data.

Type:

str or dict, optional

model_read_aux

may be used to specify additional computation methods of variables from models. Keys are variables to be computed; values are dictionaries with keys vars_required (list of variables required for the computation of var) and fun (method that takes a list of read data objects and computes and returns var).

Type:

dict, optional

model_use_climatology

if True, attempt to use climatological model data fields. Note: this only works if model data is in AeroCom conventions (climatological fields are indicated with 9999 as year in the filename) and, if this is active, only single-year analyses are supported (i.e. provide an int to start to specify the year and leave stop empty).

Type:

bool

model_kwargs

Keyword arguments passed to the model reader class’s read_var and init functions.

Type:

dict

gridded_reader_id

BETA: dictionary specifying which gridded reader is supposed to be used for model (and gridded obs) reading. Note: this is a workaround solution and will likely be removed in the future when the gridded reading API is more harmonised (see https://github.com/metno/pyaerocom/issues/174).

Type:

dict

flex_ts_type

Boolean specifying whether the reading frequency of gridded data is allowed to be flexible. This includes all gridded data, whether it is model or gridded observation (e.g. satellites). Defaults to True.

Type:

bool

min_num_obs

time resampling constraints applied, defaults to None, in which case no constraints are applied. For instance, say your input is in daily resolution and you want output in monthly and you want to make sure to have roughly 50% daily coverage for the monthly averages. Then you may specify min_num_obs=15 which will ensure that at least 15 daily averages are available to compute a monthly average. However, you may also define a hierarchical scheme that first goes from daily to weekly and then from weekly to monthly, via a dict. E.g. min_num_obs=dict(monthly=dict(weekly=4), weekly=dict(daily=3)) would ensure that each week has at least 3 daily values, as well as that each month has at least 4 weekly values.

Type:

dict or int, optional

resample_how

string specifying how data should be aggregated when resampling in time. Default is “mean”. Can also be a nested dictionary, e.g. resample_how={‘conco3’: {‘daily’: {‘hourly’: ‘max’}}} would use the maximum value to aggregate from hourly to daily for variable conco3, rather than the mean.

Type:

str or dict, optional

obs_remove_outliers

if True, outliers are removed from obs data before colocation, else not. Default is False. Custom outlier ranges for each variable can be specified via obs_outlier_ranges, and for all other variables, the pyaerocom default outlier ranges are used. The latter are specified in variables.ini file via minimum and maximum attributes and can also be accessed through pyaerocom.variable.Variable.minimum and pyaerocom.variable.Variable.maximum, respectively.

Type:

bool

model_remove_outliers

if True, outliers are removed from model data (normally this should be set to False, as the models are supposed to be assessed, including outlier cases). Default is False. Custom outlier ranges for each variable can be specified via model_outlier_ranges, and for all other variables, the pyaerocom default outlier ranges are used. The latter are specified in variables.ini file via minimum and maximum attributes and can also be accessed through pyaerocom.variable.Variable.minimum and pyaerocom.variable.Variable.maximum, respectively.

Type:

bool

obs_outlier_ranges

dictionary specifying outlier ranges for individual obs variables. (e.g. dict(od550aer = [-0.05, 10], ang4487aer=[0,4])). Only relevant if obs_remove_outliers is True.

Type:

dict, optional

model_outlier_ranges

like obs_outlier_ranges but for model variables. Only relevant if model_remove_outliers is True.

Type:

dict, optional

zeros_to_nan

If True, zeros in the output co-located data object will be converted to NaN. Default is False.

Type:

bool

harmonise_units

if True, units are attempted to be harmonised during co-location (note: raises Exception if True and in case units cannot be harmonised).

Type:

bool

regrid_res_deg

regrid resolution in degrees. If specified, the input gridded data objects will be regridded in lon / lat dimension to the input resolution (if input is float, both lat and lon are regridded to that resolution, if input is dict, use keys lat_res_deg and lon_res_deg to specify regrid resolutions, respectively). Default is None.

Type:

float or dict, optional

colocate_time

if True and if obs and model sampling frequency (e.g. daily) are higher than output colocation frequency (e.g. monthly), then the datasets are first colocated in time (e.g. on a daily basis), before the monthly averages are calculated. Default is False.

Type:

bool

reanalyse_existing

if True, always redo co-location, even if there is already an existing co-located NetCDF file (under the output location specified by basedir_coldata ) for the given variable combination to be co-located. If False and output already exists, then co-location is skipped for the associated variable. This flag is also used for contour-plots. Default is True.

Type:

bool

raise_exceptions

if True, Exceptions that may occur for individual variables to be processed, are raised, else the analysis is skipped for such cases.

Type:

bool

keep_data

if True, then all colocated data objects computed when running run() will be stored in data. Defaults to True.

Type:

bool

add_meta

additional metadata that is supposed to be added to each output ColocatedData object.

Type:

dict

main_freq

Main output frequency for AeroVal (some of the AeroVal processing steps are only done for this resolution, since they would create too much output otherwise, such as statistics timeseries or scatter plot in “Overall Evaluation” tab on AeroVal). Note that this frequency needs to be included in next setting “freqs”.

Type:

str

freqs

Frequencies for which statistical parameters are computed

Type:

list[str]

CRASH_ON_INVALID: bool

do not raise Exception if invalid item is attempted to be assigned (Overwritten from base class)

OBS_VERT_TYPES_ALT: dict[str, str]

Dictionary specifying alternative vertical types that may be used to read model data. E.g. consider the variable is ec550aer, obs_vert_type=’Surface’ and obs_vert_type_alt=dict(Surface=’ModelLevel’). Now, if a model that is used for the analysis does not contain a data file for ec550aer at the surface (’ec550aer*Surface.nc’), then, the colocation routine will look for ‘ec550aer*ModelLevel.nc’ and if this exists, it will load it and extract the surface level.

add_glob_meta(**kwargs)[source]

Add global metadata to add_meta

Parameters:

kwargs – metadata to be added

Return type:

None

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'allow', 'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

Low-level co-location functions

Methods and / or classes to perform colocation

pyaerocom.colocation.colocation_utils.check_time_ival(data, start, stop)[source]

pyaerocom.colocation.colocation_utils.check_ts_type(data, ts_type)[source]

pyaerocom.colocation.colocation_utils.colocate_gridded_gridded(data, data_ref, ts_type=None, start=None, stop=None, filter_name=None, regrid_res_deg: float | RegridResDeg | None = None, harmonise_units=True, regrid_scheme: str = 'areaweighted', update_baseyear_gridded=None, min_num_obs=None, colocate_time=False, resample_how=None, **kwargs)[source]

Colocate 2 gridded data objects

Parameters:
  • data (GriddedData) – gridded data (e.g. model results)

  • data_ref (GriddedData) – reference data (e.g. gridded satellite observations or another model) that is co-located with data.

  • ts_type (str, optional) – desired temporal resolution of output colocated data (e.g. “monthly”). Defaults to None, in which case the highest possible resolution is used.

  • start (str or datetime64 or similar, optional) – start time for colocation, if None, the start time of the input GriddedData object is used

  • stop (str or datetime64 or similar, optional) – stop time for colocation, if None, the stop time of the input GriddedData object is used

  • filter_name (str, optional) – string specifying filter used (cf. pyaerocom.filter.Filter for details). If None, then it is set to ‘ALL-wMOUNTAINS’, which corresponds to no filtering (world with mountains). Use ALL-noMOUNTAINS to exclude mountain sites.

  • regrid_res_deg (int or dict, optional) – regrid resolution in degrees. If specified, the input gridded data objects will be regridded in lon / lat dimension to the input resolution (if input is integer, both lat and lon are regridded to that resolution, if input is dict, use keys lat_res_deg and lon_res_deg to specify regrid resolutions, respectively).

  • harmonise_units (bool) – if True, units are attempted to be harmonised (note: raises Exception if True and units cannot be harmonised). Defaults to True.

  • regrid_scheme (str) – iris scheme used for regridding (defaults to area weighted regridding)

  • update_baseyear_gridded (int, optional) – optional input that can be set in order to redefine the time dimension in the first gridded data object data to be analysed. E.g., if the data object is a climatology (one year of data) that has the base year of its time dimension set to a value other than the specified input start / stop time, this may be used to update the time in order to make co-location possible.

  • min_num_obs (int or dict, optional) – minimum number of observations for resampling of time

  • colocate_time (bool) – if True and if original time resolution of data is higher than desired time resolution (ts_type), then both datasets are colocated in time before resampling to lower resolution.

  • resample_how (str or dict) – string specifying how data should be aggregated when resampling in time. Default is “mean”. Can also be a nested dictionary, e.g. resample_how={‘daily’: {‘hourly’ : ‘max’}} would use the maximum value to aggregate from hourly to daily, rather than the mean.

  • **kwargs – additional keyword args (not used here, but included such that factory class can handle different methods with different inputs)

Returns:

instance of colocated data

Return type:

ColocatedData

pyaerocom.colocation.colocation_utils.colocate_gridded_ungridded(data, data_ref, ts_type=None, start=None, stop=None, filter_name=None, regrid_res_deg: float | RegridResDeg | None = None, harmonise_units=True, regrid_scheme: str = 'areaweighted', var_ref=None, update_baseyear_gridded=None, min_num_obs=None, colocate_time=False, use_climatology_ref=False, resample_how=None, **kwargs)[source]

Colocate gridded with ungridded data (low level method)

For high-level colocation see pyaerocom.colocation.Colocator and pyaerocom.ColocationSetup

Note

Uses the variable that is contained in the input GriddedData object (since these objects only contain a single variable). If this variable is not contained in the observation data (or contained but using a different variable name), you may specify the obs variable to be used via input arg var_ref.

Parameters:
  • data (GriddedData) – gridded data object (e.g. model results).

  • data_ref (UngriddedData) – ungridded data object (e.g. observations).

  • ts_type (str) – desired temporal resolution of colocated data (must be valid AeroCom ts_type str such as daily, monthly, yearly.).

  • start (str or datetime64 or similar, optional) – start time for colocation, if None, the start time of the input GriddedData object is used.

  • stop (str or datetime64 or similar, optional) – stop time for colocation, if None, the stop time of the input GriddedData object is used

  • filter_name (str) – string specifying filter used (cf. pyaerocom.filter.Filter for details). If None, then it is set to ‘ALL-wMOUNTAINS’, which corresponds to no filtering (world with mountains). Use ALL-noMOUNTAINS to exclude mountain sites.

  • regrid_res_deg (int or dict, optional) – regrid resolution in degrees. If specified, the input gridded data object will be regridded in lon / lat dimension to the input resolution (if input is integer, both lat and lon are regridded to that resolution, if input is dict, use keys lat_res_deg and lon_res_deg to specify regrid resolutions, respectively).

  • harmonise_units (bool) – if True, units are attempted to be harmonised (note: raises Exception if True and units cannot be harmonised).

  • var_ref (str, optional) – variable against which data in arg data is supposed to be compared. If None, then the same variable is used (i.e. data.var_name).

  • update_baseyear_gridded (int, optional) – optional input that can be set in order to re-define the time dimension in the gridded data object to be analysed. E.g., if the data object is a climatology (one year of data) that has set the base year of the time dimension to a value other than the specified input start / stop time this may be used to update the time in order to make colocation possible.

  • min_num_obs (int or dict, optional) – minimum number of observations for resampling of time

  • colocate_time (bool) – if True and if original time resolution of data is higher than desired time resolution (ts_type), then both datasets are colocated in time before resampling to lower resolution.

  • use_climatology_ref (bool) – if True, climatological timeseries are used from observations

  • resample_how (str or dict) – string specifying how data should be aggregated when resampling in time. Default is “mean”. Can also be a nested dictionary, e.g. resample_how={‘daily’: {‘hourly’ : ‘max’}} would use the maximum value to aggregate from hourly to daily, rather than the mean.

  • **kwargs – additional keyword args (passed to UngriddedData.to_station_data_all())

Returns:

instance of colocated data

Return type:

ColocatedData

Raises:
  • VarNotAvailableError – if grid data variable is not available in ungridded data object

  • AttributeError – if instance of input UngriddedData object contains more than one dataset

  • TimeMatchError – if gridded data time range does not overlap with input time range

  • ColocationError – if none of the data points in input UngriddedData matches the input colocation constraints
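
A minimal usage sketch is given below. The model ID is a placeholder and the observation ID is an example AERONET dataset name; both assume access to a correspondingly configured database. The nested min_num_obs dict follows the resampling-constraint convention referred to in the parameter description above.

import pyaerocom as pya
from pyaerocom.colocation.colocation_utils import colocate_gridded_ungridded

# hypothetical dataset IDs; replace with IDs available in your database
model = pya.io.ReadGridded("MY-MODEL-ID").read_var("od550aer", ts_type="monthly")
obs = pya.io.ReadUngridded().read("AeronetSunV3Lev2.daily", vars_to_retrieve="od550aer")

coldata = colocate_gridded_ungridded(
    model,
    obs,
    ts_type="monthly",
    min_num_obs={"monthly": {"daily": 21}},  # require >= 21 daily values per month
)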

pyaerocom.colocation.colocation_utils.correct_model_stp_coldata(coldata, p0=None, t0=273.15, inplace=False)[source]

Correct modeldata in colocated data object to STP conditions

Note

BETA version, quite inelegantly coded (at 8pm 3 weeks before IPCC deadline), but should do the job for 2010 monthly colocated data files (AND NOTHING ELSE)!

pyaerocom.colocation.colocation_utils.resolve_var_name(data)[source]

Check variable name of GriddedData against AeroCom default

Checks whether the variable name set in the data corresponds to the AeroCom variable name, or whether it is an alias. Returns both the variable name set and the AeroCom variable name.

Parameters:

data (GriddedData) – Data to be checked.

Returns:

  • str – variable name as set in data (may be an alias, but may also be the AeroCom variable name, in which case the first and second return parameters are the same).

  • str – corresponding AeroCom variable name
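
For illustration, a short sketch:

from pyaerocom.colocation.colocation_utils import resolve_var_name

# data: a GriddedData object whose var_name may be an alias
data_var, aerocom_var = resolve_var_name(data)
if data_var != aerocom_var:
    print(f"{data_var} is an alias for {aerocom_var}")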

Methods and / or classes to perform 3D colocation

class pyaerocom.colocation.colocation_3d.ColocatedDataLists(colocateddata_for_statistics, colocateddata_for_profile_viz)[source]
colocateddata_for_profile_viz: list[ColocatedData]

Alias for field number 1

colocateddata_for_statistics: list[ColocatedData]

Alias for field number 0

pyaerocom.colocation.colocation_3d.colocate_vertical_profile_gridded(data, data_ref, ts_type: str | None = None, start: str | None = None, stop: str | None = None, filter_name: str | None = None, regrid_res_deg: float | RegridResDeg | None = None, harmonise_units: bool = True, regrid_scheme: str = 'areaweighted', var_ref: str | None = None, update_baseyear_gridded: int | None = None, min_num_obs: int | dict | None = None, colocate_time: bool = False, use_climatology_ref: bool = False, resample_how: str | dict | None = None, colocation_layer_limits: tuple[LayerLimits, ...] | None = None, profile_layer_limits: tuple[LayerLimits, ...] | None = None, **kwargs) ColocatedDataLists[source]

Colocate vertical profile data with gridded (model) data

The guts of this function are placed in a helper function so as not to repeat the code. This is done because colocation must occur twice:

  1. at the vertical resolution at which the statistics are computed

  2. at a finer vertical resolution for profile visualization

Some things, however, should not be computed twice. Therefore (most of) the quantities that apply to both colocation instances are computed here and then passed to the helper function.

Returns:

colocated_data_lists

Return type:

ColocatedDataLists
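
A hedged sketch of a call (model is a vertically resolved GriddedData, obs e.g. lidar profiles as UngriddedData; the layer-limit entries are assumed to be mappings with start / end altitudes in m):

from pyaerocom.colocation.colocation_3d import colocate_vertical_profile_gridded

colocated_data_lists = colocate_vertical_profile_gridded(
    model,
    obs,
    ts_type="monthly",
    colocation_layer_limits=({"start": 0, "end": 6000},),
    profile_layer_limits=({"start": 0, "end": 2000}, {"start": 2000, "end": 4000}),
)
# ColocatedDataLists is a NamedTuple, so it can be unpacked in field order
colocateddata_for_statistics, colocateddata_for_profile_viz = colocated_data_lists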


Co-locating ungridded observations

pyaerocom.combine_vardata_ungridded.combine_vardata_ungridded(data_ids_and_vars, match_stats_how='closest', match_stats_tol_km=1, merge_how='combine', merge_eval_fun=None, var_name_out=None, data_id_out=None, var_unit_out=None, resample_how=None, min_num_obs=None, add_meta_keys=None)[source]

Combine and colocate different variables from UngriddedData

This method allows combining different variable timeseries from different ungridded observation records in multiple ways. The source data may all be included in a single instance of UngriddedData or in multiple instances; for details see the first input parameter data_ids_and_vars. Merging can be done in flexible ways, e.g. by combining measurements of the same variable from 2 different datasets or by computing new variables based on 2 measured variables (e.g. concox=concno2+conco3). Doing this requires colocation of site locations and timestamps of both input observation records, which is done in this method.

It comprises 2 major steps:

  1. Compute list of StationData objects for both input data combinations (data_id1 & var1; data_id2 & var2) and, based on these, find the coincident locations. Finding coincident sites can either be done based on site location name or based on their lat/lon locations. The method to use can be specified via input arg match_stats_how.

  2. For all coincident locations, a new instance of StationData is computed that has merged the 2 timeseries in the way specified through input args merge_how and merge_eval_fun. If the 2 original timeseries from both sites come in different temporal resolutions, they will be resampled to the lower of both resolutions. Resampling constraints that are supposed to be applied in that case can be provided via the respective input args for temporal resampling. Default is the pyaerocom default, which corresponds to a ~25% coverage constraint (as of 22.10.2020) for major resolution steps, such as daily->monthly.

Note

Currently, only 2 variables can be combined to a new one (e.g. concox=conco3+concno2).

Note

Be aware of unit conversion issues that may arise if your input data is not in AeroCom default units. For details see below.

Parameters:
  • data_ids_and_vars (list) – list of 3-element tuples, each containing, in the following order: 1. an instance of UngriddedData; 2. a dataset ID (remember that UngriddedData can contain more than one dataset); and 3. a variable name. Note that currently only 2 such tuples can be combined.

  • match_stats_how (str, optional) – String specifying how site locations are supposed to be matched. The default is ‘closest’. Supported are ‘closest’ and ‘station_name’.

  • match_stats_tol_km (float, optional) – radius tolerance in km for matching site locations when using ‘closest’ for site location matching. The default is 1.

  • merge_how (str, optional) – String specifying how to merge variable data at site locations. The default is ‘combine’. If both input variables are the same and ‘combine’ is used, then the first input variable will be preferred over the other. Supported are ‘combine’, ‘mean’ and ‘eval’; for the latter, merge_eval_fun needs to be specified explicitly.

  • merge_eval_fun (str, optional) – String specifying how var1 and var2 data should be evaluated (only relevant if merge_how=’eval’ is used). The default is None. E.g. if one wants to retrieve the column aerosol fine mode fraction at 550nm (fmf550aer) through AERONET, this could be done through the SDA product by providing data_id1 and var1 as ‘AeronetSDA’ and ‘od550aer’ and second input data_id2 and var2 as ‘AeronetSDA’ and ‘od550lt1aer’; merge_eval_fun could then be ‘fmf550aer=(AeronetSDA;od550lt1aer/AeronetSDA;od550aer)*100’. Note that the input variables will be converted to their AeroCom default units, so the specification of merge_eval_fun should take that into account in case the originally read obsdata is not in default units.

  • var_name_out (str, optional) – Name of output variable. Default is None, in which case it is attempted to be inferred.

  • data_id_out (str, optional) – data_id set in output StationData objects. Default is None, in which case it is inferred from input data_ids (e.g. in the above example of merge_eval_fun, the output data_id would be ‘AeronetSDA’ since both input IDs are the same).

  • var_unit_out (str) – unit of output variable.

  • resample_how (str, optional) – String specifying how temporal resampling should be done. The default is ‘mean’.

  • min_num_obs (int or dict, optional) – Minimum number of observations for temporal resampling. The default is None in which case pyaerocom default is used, which is available via pyaerocom.const.OBS_MIN_NUM_RESAMPLE.

  • add_meta_keys (list, optional) – additional metadata keys to be added to output StationData objects from input data. If None, then only the pyaerocom default keys are added (see StationData.STANDARD_META_KEYS).

Raises:
  • ValueError – If input for merge_how or match_stats_how is invalid.

  • NotImplementedError – If one of the input UngriddedData objects contains more than one dataset.

Returns:

merged_stats – list of StationData objects containing the colocated and combined variable data.

Return type:

list
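
A sketch of the fmf550aer example from the merge_eval_fun description above (‘AeronetSDA’ is the dataset ID used in that example and is assumed to be readable in your setup):

from pyaerocom.combine_vardata_ungridded import combine_vardata_ungridded

# data: UngriddedData containing od550aer and od550lt1aer from the AeronetSDA dataset
input_combis = [
    (data, "AeronetSDA", "od550aer"),
    (data, "AeronetSDA", "od550lt1aer"),
]
merged_stats = combine_vardata_ungridded(
    input_combis,
    merge_how="eval",
    merge_eval_fun="fmf550aer=(AeronetSDA;od550lt1aer/AeronetSDA;od550aer)*100",
    var_unit_out="%",
)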

Reading of gridded data

Gridded data specifies any dataset that can be represented and stored on a regular grid within a certain domain (e.g. lat, lon, time), for instance model output or level 3 satellite data, stored for instance as NetCDF files. In pyaerocom, the underlying data object is GriddedData and pyaerocom supports reading of such data for different file naming conventions.

Gridded data using AeroCom conventions

class pyaerocom.io.readgridded.ReadGridded(data_id=None, data_dir=None, file_convention='aerocom3')[source]

Class for reading gridded files using AeroCom file conventions

data_id

string ID for model or obsdata network (see e.g. Aerocom interface map plots lower left corner)

Type:

str

data

imported data object

Type:

GriddedData

data_dir

directory containing result files for this model

Type:

str

start

start time for data import

Type:

pandas.Timestamp

stop

stop time for data import

Type:

pandas.Timestamp

file_convention

class specifying details of the file naming convention for the model

Type:

FileConventionRead

files

list containing all filenames that were found. Filled, e.g. in ReadGridded.get_model_files()

Type:

list

from_files

List of all netCDF files that were used to concatenate the current data cube (i.e. that can be based on certain matching settings such as var_name or time interval).

Type:

list

ts_types

list of all sampling frequencies (e.g. hourly, daily, monthly) that were inferred from filenames (based on Aerocom file naming convention) of all files that were found

Type:

list

vars

list containing all variable names (e.g. od550aer) that were inferred from filenames based on Aerocom model file naming convention

Type:

list

years

list of available years as inferred from the filenames in the data directory.

Type:

list

Parameters:
  • data_id (str) – string ID of model (e.g. “AATSR_SU_v4.3”,”CAM5.3-Oslo_CTRL2016”)

  • data_dir (str, optional) – directory containing data files. If provided, only this directory is considered for data files, else the input data_id is used to search for the corresponding directory.

  • file_convention (str) – string ID specifying the file convention of this model (cf. installation file file_conventions.ini)

  • init (bool) – if True, the model directory is searched (search_data_dir()) on instantiation and if it is found, all valid files for this model are searched using search_all_files().
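
For example (using the data_id from the parameter description above; requires access to the corresponding database or an explicit data_dir):

from pyaerocom.io import ReadGridded

reader = ReadGridded("CAM5.3-Oslo_CTRL2016")
print(reader.vars_provided)  # variables available for reading or computation
data = reader.read_var("od550aer", ts_type="monthly")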

AUX_ADD_ARGS = {'concprcpoxn': {'prlim': 0.0001, 'prlim_set_under': nan, 'prlim_units': 'm d-1', 'ts_type': 'daily'}, 'concprcpoxs': {'prlim': 0.0001, 'prlim_set_under': nan, 'prlim_units': 'm d-1', 'ts_type': 'daily'}, 'concprcprdn': {'prlim': 0.0001, 'prlim_set_under': nan, 'prlim_units': 'm d-1', 'ts_type': 'daily'}}

Additional arguments passed to computation methods for auxiliary data. This is optional and defined per variable, like in AUX_FUNS.

AUX_ALT_VARS = {'ac550dryaer': ['ac550aer'], 'od440aer': ['od443aer'], 'od870aer': ['od865aer']}
AUX_FUNS = {'ang4487aer': <function compute_angstrom_coeff_cubes>, 'angabs4487aer': <function compute_angstrom_coeff_cubes>, 'conc*': <function multiply_cubes>, 'concNhno3': <function calc_concNhno3_from_vmr>, 'concNnh3': <function calc_concNnh3_from_vmr>, 'concNnh4': <function calc_concNnh4>, 'concNno3pm10': <function calc_concNno3pm10>, 'concNno3pm25': <function calc_concNno3pm25>, 'concNtnh': <function calc_concNtnh>, 'concNtno3': <function calc_concNtno3>, 'concno3': <function add_cubes>, 'concno3pm10': <function calc_concno3pm10>, 'concno3pm25': <function calc_concno3pm25>, 'concox': <function add_cubes>, 'concprcpoxn': <function compute_concprcp_from_pr_and_wetdep>, 'concprcpoxs': <function compute_concprcp_from_pr_and_wetdep>, 'concprcprdn': <function compute_concprcp_from_pr_and_wetdep>, 'concsspm10': <function add_cubes>, 'concsspm25': <function calc_sspm25>, 'dryoa': <function add_cubes>, 'fmf550aer': <function divide_cubes>, 'mmr*': <function mmr_from_vmr>, 'od550gt1aer': <function subtract_cubes>, 'sc550dryaer': <function subtract_cubes>, 'vmrox': <function add_cubes>, 'wetoa': <function add_cubes>}
AUX_REQUIRES = {'ang4487aer': ('od440aer', 'od870aer'), 'angabs4487aer': ('abs440aer', 'abs870aer'), 'conc*': ('mmr*', 'rho'), 'concNhno3': ('vmrhno3',), 'concNnh3': ('vmrnh3',), 'concNnh4': ('concnh4',), 'concNno3pm10': ('concno3f', 'concno3c'), 'concNno3pm25': ('concno3f', 'concno3c'), 'concNtnh': ('concnh4', 'vmrnh3'), 'concNtno3': ('concno3f', 'concno3c', 'vmrhno3'), 'concno3': ('concno3c', 'concno3f'), 'concno3pm10': ('concno3f', 'concno3c'), 'concno3pm25': ('concno3f', 'concno3c'), 'concox': ('concno2', 'conco3'), 'concprcpoxn': ('wetoxn', 'pr'), 'concprcpoxs': ('wetoxs', 'pr'), 'concprcprdn': ('wetrdn', 'pr'), 'concsspm10': ('concss25', 'concsscoarse'), 'concsspm25': ('concss25', 'concsscoarse'), 'dryoa': ('drypoa', 'drysoa'), 'fmf550aer': ('od550lt1aer', 'od550aer'), 'mmr*': ('vmr*',), 'od550gt1aer': ('od550aer', 'od550lt1aer'), 'rho': ('ts', 'ps'), 'sc550dryaer': ('ec550dryaer', 'ac550dryaer'), 'vmrox': ('vmrno2', 'vmro3'), 'wetoa': ('wetpoa', 'wetsoa')}
CONSTRAINT_OPERATORS = {'!=': <ufunc 'not_equal'>, '<': <ufunc 'less'>, '<=': <ufunc 'less_equal'>, '==': <ufunc 'equal'>, '>': <ufunc 'greater'>, '>=': <ufunc 'greater_equal'>}
property TS_TYPES

List of valid filename encodings specifying temporal resolution

Update 7.11.2019: not in use anymore due to improved handling of all possible frequencies now using TsType class.

VERT_ALT = {'Surface': 'ModelLevel'}
add_aux_compute(var_name, vars_required, fun)[source]

Register new variable to be computed

Parameters:
  • var_name (str) – variable name to be computed

  • vars_required (list) – list of variables to read, that are required to compute var_name

  • fun (callable) – function that takes a list of GriddedData objects as input and that are read using variable names specified by vars_required.
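
A sketch of registering a hypothetical custom variable (assuming the registered function receives the GriddedData objects corresponding to vars_required, in that order, and may return e.g. an iris cube):

from pyaerocom.io import ReadGridded

reader = ReadGridded("CAM5.3-Oslo_CTRL2016")  # data_id as in the example above

def calc_fmf(od550lt1aer, od550aer):
    # assumption: operate on the underlying iris cubes and return a cube
    return od550lt1aer.cube / od550aer.cube

reader.add_aux_compute(
    "fmf550aer",
    vars_required=["od550lt1aer", "od550aer"],
    fun=calc_fmf,
)
fmf = reader.compute_var("fmf550aer", ts_type="monthly")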

apply_read_constraint(data, constraint, **kwargs)[source]

Filter a GriddedData object by values of another variable

Note

BETA version, that was hacked down in a rush to be able to apply AOD>0.1 threshold when reading AE.

Parameters:
  • data (GriddedData) – data object to which constraint is applied

  • constraint (dict) – dictionary defining read constraint (see check_constraint_valid() for minimum requirements). If constraint contains key var_name (not mandatory), then the corresponding variable is attempted to be read and is used to evaluate the constraint, and the corresponding boolean mask is then applied to input data. Wherever this mask is True (i.e. the constraint is met), the current value in input data will be replaced with numpy.ma.masked or, if specified, with entry new_val in the input constraint dict.

  • **kwargs – reading arguments in case additional variable data needs to be loaded to determine the filter mask (i.e. if var_name is specified in the input constraint). Passed to read_var().

Raises:

ValueError – If constraint is invalid (cf. check_constraint_valid() for details).

Returns:

modified data object (all grid points that meet the constraint are replaced with either numpy.ma.masked or with a value that can be specified via key new_val in the input constraint).

Return type:

GriddedData

browser

This object can be used to browse the database for available data directories.

check_compute_var(var_name)[source]

Check if variable name belongs to family that can be computed

For instance, if input var_name is concdust this method will check AUX_REQUIRES to see if there is a variable family pattern (conc*) defined that specifies how to compute these variables. If a match is found, the required variables and computation method are added via add_aux_compute().

Parameters:

var_name (str) – variable name to be checked

Returns:

True if match is found, else False

Return type:

bool

check_constraint_valid(constraint)[source]

Check if reading constraint is valid

Parameters:

constraint (dict) – reading constraint. Requires at least entries for the following keys:

  • operator (str): for valid operators see CONSTRAINT_OPERATORS

  • filter_val (float): value against which data is evaluated wrt the operator

Raises:

ValueError – If constraint is invalid

Return type:

None.
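
For instance, a constraint that masks all grid points where co-read AOD exceeds 0.1 (a sketch based on the key requirements above) could look like:

constraint = {
    "var_name": "od550aer",  # optional: variable used to evaluate the constraint
    "operator": ">",         # one of CONSTRAINT_OPERATORS
    "filter_val": 0.1,       # value against which the data is evaluated
}
data = reader.apply_read_constraint(data, constraint)
# the same dict may also be passed to read_var via its constraints argument:
data = reader.read_var("ang4487aer", constraints=[constraint])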

compute_var(var_name, start=None, stop=None, ts_type=None, experiment=None, vert_which=None, flex_ts_type=True, prefer_longer=False, vars_to_read=None, aux_fun=None, try_convert_units=True, aux_add_args=None, rename_var=None, **kwargs)[source]

Compute auxiliary variable

Like read_var() but for auxiliary variables (cf. AUX_REQUIRES)

Parameters:
  • var_name (str) – variable that is supposed to be read

  • start (Timestamp or str, optional) – start time of data import (if valid input, then the current start will be overwritten)

  • stop (Timestamp or str, optional) – stop time of data import

  • ts_type (str) – string specifying temporal resolution (choose from hourly, 3hourly, daily, monthly). If None, the most prioritised of the available resolutions is used

  • experiment (str) – name of experiment (only relevant if this dataset contains more than one experiment)

  • vert_which (str) – valid AeroCom vertical info string encoded in name (e.g. Column, ModelLevel)

  • flex_ts_type (bool) – if True and if applicable, then another ts_type is used in case the input ts_type is not available for this variable

  • prefer_longer (bool) – if True and applicable, the ts_type resulting in the longer time coverage will be preferred over other possible frequencies that match the query.

  • try_convert_units (bool) – if True, units of GriddedData objects are attempted to be converted to AeroCom default. This applies both to the GriddedData objects being read for computation as well as to the variable computed from the former objects. This is, for instance, useful when computing concentration in precipitation from wet deposition and precipitation amount.

  • rename_var (str) – if this is set, the var_name attribute of the output GriddedData object will be updated accordingly.

  • **kwargs – additional keyword args passed to _load_var()

Returns:

loaded data object

Return type:

GriddedData

concatenate_cubes(cubes)[source]

Concatenate list of cubes into one cube

Parameters:

cubes (CubeList) – list of individual cubes

Returns:

Single cube that contains concatenated cubes from input list

Return type:

Cube

Raises:

iris.exceptions.ConcatenateError – if concatenation of all cubes failed

property data_dir: str

Directory where data files are located

property data_id: str

Data ID of dataset

property experiments: list

List of all experiments that are available in this dataset

property file_type

File type of data files

property files: list

List of data files

filter_files(var_name=None, ts_type=None, start=None, stop=None, experiment=None, vert_which=None, is_at_stations=False, df=None)[source]

Filter file database

Parameters:
  • var_name (str) – variable that is supposed to be read

  • ts_type (str) – string specifying temporal resolution (choose from “hourly”, “3hourly”, “daily”, “monthly”). If None, the most prioritised of the available resolutions is used

  • start (Timestamp or str, optional) – start time of data import

  • stop (Timestamp or str, optional) – stop time of data import

  • experiment (str) – name of experiment (only relevant if this dataset contains more than one experiment)

  • vert_which (str or dict, optional) – valid AeroCom vertical info string encoded in name (e.g. Column, ModelLevel) or dictionary containing var_name as key and vertical coded string as value, accordingly

  • flex_ts_type (bool) – if True and if applicable, then another ts_type is used in case the input ts_type is not available for this variable

  • prefer_longer (bool) – if True and applicable, the ts_type resulting in the longer time coverage will be preferred over other possible frequencies that match the query.

filter_query(var_name, ts_type=None, start=None, stop=None, experiment=None, vert_which=None, is_at_stations=False, flex_ts_type=True, prefer_longer=False)[source]

Filter files for read query based on input specs

Returns:

dataframe containing filtered dataset

Return type:

DataFrame

find_common_ts_type(vars_to_read, start=None, stop=None, ts_type=None, experiment=None, vert_which=None, flex_ts_type=True)[source]

Find common ts_type for list of variables to be read

Parameters:
  • vars_to_read (list) – list of variables that are supposed to be read

  • start (Timestamp or str, optional) – start time of data import (if valid input, then the current start will be overwritten)

  • stop (Timestamp or str, optional) – stop time of data import (if valid input, then the current stop will be overwritten)

  • ts_type (str) – string specifying temporal resolution (choose from hourly, 3hourly, daily, monthly). If None, the most prioritised of the available resolutions is used

  • experiment (str) – name of experiment (only relevant if this dataset contains more than one experiment)

  • vert_which (str) – valid AeroCom vertical info string encoded in name (e.g. Column, ModelLevel)

  • flex_ts_type (bool) – if True and if applicable, then another ts_type is used in case the input ts_type is not available for this variable

Returns:

common ts_type for input variables

Return type:

str

Raises:

DataCoverageError – if no match can be found

get_files(var_name, ts_type=None, start=None, stop=None, experiment=None, vert_which=None, is_at_stations=False, flex_ts_type=True, prefer_longer=False)[source]

Get data files based on input specs

get_var_info_from_files() dict[source]

Creates dictionary that contains variable-specific meta information

Returns:

dictionary where keys are available variables and values (for each variable) contain information about available ts_types, years, etc.

Return type:

dict

has_var(var_name)[source]

Check if variable is available

Parameters:

var_name (str) – variable to be checked

Return type:

bool

property name

Deprecated name of attribute data_id

read(vars_to_retrieve=None, start=None, stop=None, ts_type=None, experiment=None, vert_which=None, flex_ts_type=True, prefer_longer=False, require_all_vars_avail=False, **kwargs)[source]

Read all variables that could be found

Reads all variables that are available (i.e. in vars_filename)

Parameters:
  • vars_to_retrieve (list or str, optional) – variables that are supposed to be read. If None, all variables that are available are read.

  • start (Timestamp or str, optional) – start time of data import

  • stop (Timestamp or str, optional) – stop time of data import

  • ts_type (str, optional) – string specifying temporal resolution (choose from “hourly”, “3hourly”, “daily”, “monthly”). If None, the most prioritised of the available resolutions is used

  • experiment (str) – name of experiment (only relevant if this dataset contains more than one experiment)

  • vert_which (str or dict, optional) – valid AeroCom vertical info string encoded in name (e.g. Column, ModelLevel) or dictionary containing var_name as key and vertical coded string as value, accordingly

  • flex_ts_type (bool) – if True and if applicable, then another ts_type is used in case the input ts_type is not available for this variable

  • prefer_longer (bool) – if True and applicable, the ts_type resulting in the longer time coverage will be preferred over other possible frequencies that match the query.

  • require_all_vars_avail (bool) – if True, it is strictly required that all input variables are available.

  • **kwargs – optional; includes support for deprecated input args

Returns:

loaded data objects (type GriddedData)

Return type:

tuple

Raises:
  • IOError – if input variable names are not a list or string

  • VarNotAvailableError

    1. if require_all_vars_avail=True and one or more of the desired variables is not available in this class

    2. if require_all_vars_avail=True and if none of the input variables is available in this object
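
For example (a sketch; the variables must be available in the dataset, reader is a ReadGridded instance as above):

# returns a tuple of GriddedData objects, one per variable
od550aer, abs550aer = reader.read(
    vars_to_retrieve=["od550aer", "abs550aer"],
    ts_type="monthly",
)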

read_var(var_name, start=None, stop=None, ts_type=None, experiment=None, vert_which=None, flex_ts_type=True, prefer_longer=False, aux_vars=None, aux_fun=None, constraints=None, try_convert_units=True, rename_var=None, **kwargs)[source]

Read model data for a specific variable

This method searches all valid files for a given variable and for a provided temporal resolution (e.g. daily, monthly), optionally within a certain time window, that may be specified on class instantiation or using the corresponding input parameters provided in this method.

The individual NetCDF files for a given temporal period are loaded as instances of the iris.cube.Cube object and appended to an instance of the iris.cube.CubeList object. The latter is then used to concatenate the individual cubes in time into a single instance of the pyaerocom.GriddedData class. In order to ensure that this works, several things need to be ensured, which are listed in the following and which may be controlled within the global settings for NetCDF import using the attribute GRID_IO (instance of OnLoad) in the default instance of the pyaerocom.config.Config object accessible via pyaerocom.const.

Parameters:
  • var_name (str) – variable that is supposed to be read

  • start (Timestamp or str, optional) – start time of data import

  • stop (Timestamp or str, optional) – stop time of data import

  • ts_type (str) – string specifying temporal resolution (choose from “hourly”, “3hourly”, “daily”, “monthly”). If None, the most prioritised of the available resolutions is used

  • experiment (str) – name of experiment (only relevant if this dataset contains more than one experiment)

  • vert_which (str or dict, optional) – valid AeroCom vertical info string encoded in name (e.g. Column, ModelLevel) or dictionary containing var_name as key and vertical coded string as value, accordingly

  • flex_ts_type (bool) – if True and if applicable, then another ts_type is used in case the input ts_type is not available for this variable

  • prefer_longer (bool) – if True and applicable, the ts_type resulting in the longer time coverage will be preferred over other possible frequencies that match the query.

  • aux_vars (list) – only relevant if var_name is not available for reading but needs to be computed: list of variables that are required to compute var_name

  • aux_fun (callable) – only relevant if var_name is not available for reading but needs to be computed: custom method for computation (cf. add_aux_compute() for details)

  • constraints (list, optional) – list of reading constraints (dict type). See check_constraint_valid() and apply_read_constraint() for details related to format of the individual constraints.

  • try_convert_units (bool) – if True, then the unit of the variable data is checked against AeroCom default unit for that variable and if it deviates, it is attempted to be converted to the AeroCom default unit. Default is True.

  • rename_var (str) – if this is set, the var_name attribute of the output GriddedData object will be updated accordingly.

  • **kwargs – additional keyword args passed to _load_var()

Returns:

loaded data object

Return type:

GriddedData

property registered_var_patterns

List of string patterns for computation of variables

The information is extracted from AUX_REQUIRES

Returns:

list of variable patterns

Return type:

list

reinit()[source]

Reinit everything that is loaded specific to data_dir

search_all_files(update_file_convention=True)[source]

Search all valid model files for this model

This method browses the data directory and finds all valid files, that is, files that are named according to one of the aerocom file naming conventions. The file list is stored in files.

Note

It is presumed that naming conventions of files in the data directory are not mixed but all correspond to one of the conventions defined in file_conventions.ini.

Parameters:

update_file_convention (bool) – if True, the first file in data_dir is used to identify the file naming convention (cf. FileConventionRead)

Raises:

DataCoverageError – if no valid files could be found

search_data_dir()[source]

Search data directory based on model ID

Wrapper for method search_data_dir_aerocom()

Returns:

data directory

Return type:

str

Raises:

IOError – if directory cannot be found

property start

First available year in the dataset (inferred from filenames)

Note

This is not variable or ts_type specific, so it is not necessarily given that data from this year is available for all variables in vars or all frequencies listed in ts_types

property stop

Last available year in the dataset (inferred from filenames)

Note

This is not variable or ts_type specific, so it is not necessarily given that data from this year is available for all variables in vars or all frequencies listed in ts_types

property ts_types

Available frequencies

update(**kwargs)[source]

Update one or more valid parameters

Parameters:

**kwargs – keyword args that will be used to update (overwrite) valid class attributes such as data, data_dir, files

property vars
property vars_filename
property vars_provided

Variables provided by this dataset

property years_avail: list

Years available in dataset

pyaerocom.io.readgridded.is_3d(var_name: str) bool[source]

Gridded data using EMEP conventions

Reading of ungridded data

In contrast to gridded data, ungridded data represents data that is irregularly sampled in space and time, for instance observations at different locations around the globe. Such data is represented in pyaerocom by UngriddedData, which is essentially a point-cloud dataset. Reading of UngriddedData is typically specific to the individual observational data records, as these come in various data formats and use various metadata conventions that need to be harmonised; this harmonisation is done during the data import.

The following flowchart illustrates the architecture of ungridded reading in pyaerocom. Below is information about the individual reading classes for each dataset (blue in flowchart), the abstract template base classes the reading classes are based on (dark green) and the factory class ReadUngridded (orange), which has all individual reading classes registered. The data classes that are returned by the reading class are indicated in light green.

[Flowchart image: pyaerocom_ungridded_io_flowchart.png]

ReadUngridded factory class

Factory class with which the reading classes for the individual datasets are registered.

class pyaerocom.io.readungridded.ReadUngridded(data_ids=None, ignore_cache=False, data_dirs=None, configs: PyaroConfig | list[PyaroConfig] | None = None)[source]

Factory class for reading of ungridded data based on obsnetwork ID

This class also features reading functionality that goes beyond reading of individual observation datasets, including reading of multiple datasets and post-computation of new variables based on datasets that can be read.

Parameters:

COMING SOON

DONOTCACHE_NAME = 'DONOTCACHE'
property INCLUDED_DATASETS
INCLUDED_READERS = [<class 'pyaerocom.io.read_aeronet_invv3.ReadAeronetInvV3'>, <class 'pyaerocom.io.read_aeronet_sdav3.ReadAeronetSdaV3'>, <class 'pyaerocom.io.read_aeronet_sunv3.ReadAeronetSunV3'>, <class 'pyaerocom.io.read_earlinet.ReadEarlinet'>, <class 'pyaerocom.io.read_ebas.ReadEbas'>, <class 'pyaerocom.io.read_aasetal.ReadAasEtal'>, <class 'pyaerocom.io.read_airnow.ReadAirNow'>, <class 'pyaerocom.io.read_eea_aqerep.ReadEEAAQEREP'>, <class 'pyaerocom.io.read_eea_aqerep_v2.ReadEEAAQEREP_V2'>, <class 'pyaerocom.io.cams2_83.read_obs.ReadCAMS2_83'>, <class 'pyaerocom.io.gaw.reader.ReadGAW'>, <class 'pyaerocom.io.ghost.reader.ReadGhost'>, <class 'pyaerocom.io.cnemc.reader.ReadCNEMC'>, <class 'pyaerocom.io.icos.reader.ReadICOS'>, <class 'pyaerocom.io.icpforests.reader.ReadICPForest'>]
property SUPPORTED_DATASETS

Returns list of strings containing all supported dataset names

SUPPORTED_READERS = [<class 'pyaerocom.io.read_aeronet_invv3.ReadAeronetInvV3'>, <class 'pyaerocom.io.read_aeronet_sdav3.ReadAeronetSdaV3'>, <class 'pyaerocom.io.read_aeronet_sunv3.ReadAeronetSunV3'>, <class 'pyaerocom.io.read_earlinet.ReadEarlinet'>, <class 'pyaerocom.io.read_ebas.ReadEbas'>, <class 'pyaerocom.io.read_aasetal.ReadAasEtal'>, <class 'pyaerocom.io.read_airnow.ReadAirNow'>, <class 'pyaerocom.io.read_eea_aqerep.ReadEEAAQEREP'>, <class 'pyaerocom.io.read_eea_aqerep_v2.ReadEEAAQEREP_V2'>, <class 'pyaerocom.io.cams2_83.read_obs.ReadCAMS2_83'>, <class 'pyaerocom.io.gaw.reader.ReadGAW'>, <class 'pyaerocom.io.ghost.reader.ReadGhost'>, <class 'pyaerocom.io.cnemc.reader.ReadCNEMC'>, <class 'pyaerocom.io.icos.reader.ReadICOS'>, <class 'pyaerocom.io.icpforests.reader.ReadICPForest'>, <class 'pyaerocom.io.pyaro.read_pyaro.ReadPyaro'>]
add_config(config: PyaroConfig) None[source]

Adds single PyaroConfig to self.configs

Parameters:

config (PyaroConfig)

Raises:

ValueError – If config is not PyaroConfig

add_pyaro_reader(config: PyaroConfig) ReadUngriddedBase[source]
property configs

List configs

property data_dirs

Data directory(ies) for dataset(s) to read (keys are data IDs)

Type:

dict

property data_id

ID of dataset

Note

Only works if exactly one dataset is assigned to the reader, that is, length of data_ids is 1.

Raises:

AttributeError – if the number of items in data_ids is not exactly one.

Returns:

data ID

Return type:

str

property data_ids

List of datasets supposed to be read

dataset_provides_variables(data_id=None)[source]

List of variables provided by a certain dataset

get_lowlevel_reader(data_id: str | None = None) ReadUngriddedBase[source]

Helper method that returns an instantiated reader class for the input ID

Parameters:

data_id (str) – Name of dataset

Returns:

instance of the reading class (needs to be an implementation of the base class ReadUngriddedBase).

Return type:

ReadUngriddedBase

get_reader(data_id)[source]
get_vars_supported(obs_id: str, vars_desired: list[str])[source]

Filter input list of variables by supported ones for a certain data ID

Parameters:
  • obs_id (str) – ID of observation network

  • vars_desired (list) – List of variables that are desired

Returns:

list of variables that can be read through the input network

Return type:

list
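
For example (the observation network ID below is an assumption):

from pyaerocom.io import ReadUngridded

reader = ReadUngridded()
# returns the subset of the wish list that is readable through the network
vars_ok = reader.get_vars_supported("AeronetSunV3Lev2.daily", ["od550aer", "concpm10"])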

property ignore_cache

Boolean specifying whether caching is active or not

property post_compute

Information about datasets that can be computed in post

read(data_ids=None, vars_to_retrieve=None, only_cached=False, filter_post=None, configs: PyaroConfig | list[PyaroConfig] | None = None, **kwargs)[source]

Read observations

Iterates over all datasets in data_ids, calls read_dataset() and appends the result to the output data object

Parameters:
  • data_ids (str or list) – data ID or list of all datasets to be imported

  • vars_to_retrieve (str or list) – variable or list of variables to be imported

  • only_cached (bool) – if True, then nothing is reloaded but only data is loaded that is available as cached objects (not recommended to use but may be used if working offline without connection to database)

  • filter_post (dict, optional) – filters applied to UngriddedData object AFTER it is read into memory, via UngriddedData.apply_filters(). This option was introduced in pyaerocom version 0.10.0 and should preferably be used over **kwargs. There is a certain flexibility with respect to how these filters can be defined, for instance, sub dicts for each data_id. The most common way would be to directly provide the input needed for UngriddedData.apply_filters. If you want to read multiple variables from one or more datasets, and if you want to apply variable-specific filters, it is recommended to read the data individually for each variable and corresponding set of filters and then merge the individual filtered UngriddedData objects afterwards, e.g. using data_var1 & data_var2.

  • **kwargs – Additional input options for reading of data, which are applied WHILE the data is read. If any such additional options are provided that are applied during the reading, then automatic caching of the output UngriddedData object will be deactivated. Thus, it is recommended to handle data filtering via filter_post argument whenever possible, which will result in better performance as the unconstrained original data is read in and cached, and then the filtering is applied.

Example

>>> import pyaerocom.io.readungridded as pio
>>> from pyaerocom import const
>>> reader = pio.ReadUngridded(data_ids=const.AERONET_SUN_V3L15_AOD_ALL_POINTS_NAME)
>>> data = reader.read()
>>> print(data)
>>> print(data.metadata[0.]['latitude'])
read_dataset(data_id, vars_to_retrieve=None, only_cached=False, filter_post=None, **kwargs)[source]

Read dataset into an instance of UngriddedData

Parameters:
  • data_id (str) – name of dataset

  • vars_to_retrieve (list) – variable or list of variables to be imported

  • only_cached (bool) – if True, then nothing is reloaded but only data is loaded that is available as cached objects (not recommended to use but may be used if working offline without connection to database)

  • filter_post (dict, optional) – filters applied to UngriddedData object AFTER it is read into memory, via UngriddedData.apply_filters(). This option was introduced in pyaerocom version 0.10.0 and should preferably be used over **kwargs. There is a certain flexibility with respect to how these filters can be defined, for instance, sub dicts for each data_id. The most common way would be to directly provide the input needed for UngriddedData.apply_filters. If you want to read multiple variables from one or more datasets, and if you want to apply variable-specific filters, it is recommended to read the data individually for each variable and corresponding set of filters and then merge the individual filtered UngriddedData objects afterwards, e.g. using data_var1 & data_var2.

  • **kwargs – Additional input options for reading of data, which are applied WHILE the data is read. If any such additional options are provided that are applied during the reading, then automatic caching of the output UngriddedData object will be deactivated. Thus, it is recommended to handle data filtering via filter_post argument whenever possible, which will result in better performance as the unconstrained original data is read in and cached, and then the filtering is applied.

Returns:

data object

Return type:

UngriddedData
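
A sketch using filter_post (the filter key below is an assumption and must correspond to valid input for UngriddedData.apply_filters):

from pyaerocom.io import ReadUngridded

reader = ReadUngridded()
data = reader.read_dataset(
    "AeronetSunV3Lev2.daily",
    vars_to_retrieve="od550aer",
    filter_post={"altitude": [0, 1000]},  # assumed: keep only sites below 1000 m
)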

read_dataset_post(data_id, vars_to_retrieve, only_cached=False, filter_post=None, **kwargs)[source]

Read dataset into an instance of UngriddedData

Parameters:
  • data_id (str) – name of dataset

  • vars_to_retrieve (list) – variable or list of variables to be imported

  • only_cached (bool) – if True, then nothing is reloaded but only data is loaded that is available as cached objects (not recommended to use but may be used if working offline without connection to database)

  • filter_post (dict, optional) – filters applied to UngriddedData object AFTER it is read into memory, via UngriddedData.apply_filters(). This option was introduced in pyaerocom version 0.10.0 and should preferably be used over **kwargs. There is a certain flexibility with respect to how these filters can be defined, for instance, sub dicts for each data_id. The most common way would be to directly provide the input needed for UngriddedData.apply_filters. If you want to read multiple variables from one or more datasets, and if you want to apply variable-specific filters, it is recommended to read the data individually for each variable and corresponding set of filters and then merge the individual filtered UngriddedData objects afterwards, e.g. using data_var1 & data_var2.

  • **kwargs – Additional input options for reading of data, which are applied WHILE the data is read. If any such additional options are provided that are applied during the reading, then automatic caching of the output UngriddedData object will be deactivated. Thus, it is recommended to handle data filtering via filter_post argument whenever possible, which will result in better performance as the unconstrained original data is read in and cached, and then the filtering is applied.

Returns:

data object

Return type:

UngriddedData

property supported_datasets

Wrapper for SUPPORTED_DATASETS

ReadUngriddedBase template class

All ungridded reading routines are based on this template class.

class pyaerocom.io.readungriddedbase.ReadUngriddedBase(data_id: str | None = None, data_dir: str | None = None)[source]

TEMPLATE: Abstract base class template for reading of ungridded data

Note

The two dictionaries AUX_REQUIRES and AUX_FUNS can be filled with variables that are not contained in the original data files but are computed during the reading. The former specifies what additional variables are required to perform the computation and the latter specifies functions used to perform the computations of the auxiliary variables. See, for instance, the class ReadAeronetSunV3, which includes the computation of the AOD at 550nm and the Angstrom coefficient (in 440-870 nm range) from AODs measured at other wavelengths.
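
A minimal, non-functional subclass skeleton illustrating the abstract interface (all names below are hypothetical):

from pyaerocom.io.readungriddedbase import ReadUngriddedBase

class ReadMyNetwork(ReadUngriddedBase):
    _FILEMASK = "*.csv"  # file mask used by get_file_list()
    DATA_ID = "MyNetwork"
    SUPPORTED_DATASETS = ["MyNetwork"]
    PROVIDES_VARIABLES = ["od550aer"]
    DEFAULT_VARS = ["od550aer"]
    TS_TYPE = "daily"

    def read_file(self, filename, vars_to_retrieve=None):
        ...  # return e.g. a StationData object for one file

    def read(self, vars_to_retrieve=None, files=None, first_file=None, last_file=None):
        ...  # loop over files and append read_file() output to an UngriddedData object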

AUX_FUNS = {}

Functions that are used to compute additional variables (i.e. one for each variable defined in AUX_REQUIRES)

AUX_REQUIRES = {}

dictionary containing information about additionally required variables for each auxiliary variable (i.e. each variable that is not provided by the original data but computed on import)

property AUX_VARS

List of auxiliary variables (keys of attr. AUX_REQUIRES)

Auxiliary variables are those that are not included in original files but are computed from other variables during import

property DATASET_PATH

Wrapper for data_dir.

abstract property DATA_ID

Name of dataset (OBS_ID)

Note

  • May be implemented as global constant in header of derived class

  • May be multiple that can be specified on init (see example below)

abstract property DEFAULT_VARS

List containing default variables to read

IGNORE_META_KEYS = []
abstract property PROVIDES_VARIABLES

List of variables that are provided by this dataset

Note

May be implemented as global constant in header

property REVISION_FILE

Name of revision file located in data directory

abstract property SUPPORTED_DATASETS

List of all datasets supported by this interface

Note

  • best practice to specify in header of class definition

  • needless to mention that DATA_ID needs to be in this list

abstract property TS_TYPE

Temporal resolution of dataset

This should be defined in the header of an implementation class if it can be globally defined for the corresponding obs-network; in other cases it should be initiated as string "undefined" and then, if applicable, updated in the reading routine of a file.

The TS_TYPE information should ultimately be written into the meta-data of objects returned by the implementation of read_file() (e.g. instance of StationData or a normal dictionary) and the method read() (which should ALWAYS return an instance of the UngriddedData class).

Note

  • Please use "undefined" if the derived class is not sampled on a regular basis.

  • If applicable please use Aerocom ts_type (i.e. hourly, 3hourly, daily, monthly, yearly)

  • Note also, that the ts_type in a derived class may or may not be defined in a general case. For instance, in the EBAS database the resolution code can be found in the file header and may thus be initiated as "undefined" when the reading class is instantiated and then updated while the data is being read

  • For derived implementation classes that support reading of multiple network versions, you may also assign

check_vars_to_retrieve(vars_to_retrieve)[source]

Separate variables that are in file from those that are computed

Some of the provided variables by this interface are not included in the data files but are computed within this class during data import (e.g. od550aer, ang4487aer).

The latter may require additional parameters to be retrieved from the file, which is specified in the class header (cf. attribute AUX_REQUIRES).

This function checks the input list that specifies all required variables and separates them into two lists, one that includes all variables that can be read from the files and a second list that specifies all variables that are computed in this class.

Parameters:

vars_to_retrieve (list) – all parameter names that are supposed to be loaded

Returns:

2-element tuple, containing

  • list: list containing all variables to be read

  • list: list containing all variables to be computed

Return type:

tuple
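
For example, for a reader that computes ang4487aer from measured AODs (cf. the note above), a sketch:

# reader: an instance of a derived reading class (e.g. ReadAeronetSunV3)
vars_to_read, vars_to_compute = reader.check_vars_to_retrieve(["od440aer", "ang4487aer"])
# vars_to_read contains the directly readable variables,
# vars_to_compute the derived ones (here: ang4487aer)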

compute_additional_vars(data, vars_to_compute)[source]

Compute all additional variables

The computations for each additional parameter are done using the specified methods in AUX_FUNS.

Parameters:
  • data (dict-like) – data object containing data vectors for variables that are required for computation (cf. input param vars_to_compute)

  • vars_to_compute (list) – list of variable names that are supposed to be computed. Variables that are required for the computation of the variables need to be specified in AUX_VARS and need to be available as data vectors in the provided data dictionary (key is the corresponding variable name of the required variable).

Returns:

updated data object now containing also computed variables

Return type:

dict

property data_dir: str

Location of the dataset

Note

This can be set explicitly when instantiating the class (e.g. if data is available on local machine). If unspecified, the data location is attempted to be inferred via get_obsnetwork_dir()

Raises:

FileNotFoundError – if data directory does not exist or cannot be retrieved automatically

Type:

str

property data_id

ID of dataset

property data_revision

Revision string from file Revision.txt in the main data directory

find_in_file_list(pattern=None)[source]

Find all files that match a certain wildcard pattern

Parameters:

pattern (str, optional) – wildcard pattern that may be used to narrow down the search (e.g. use pattern=*Berlin* to find only files that contain Berlin in their filename)

Returns:

list containing all files in files that match pattern

Return type:

list

Raises:

IOError – if no matches can be found

get_file_list(pattern=None)[source]

Search all files to be read

Uses _FILEMASK (+ optional input search pattern, e.g. station_name) to find valid files for query.

Parameters:

pattern (str, optional) – file name pattern applied to search

Returns:

list containing retrieved file locations

Return type:

list

Raises:

IOError – if no files can be found

logger

The class's own instance of the logger class

abstract read(vars_to_retrieve=None, files=[], first_file=None, last_file=None)[source]

Method that reads list of files as instance of UngriddedData

Parameters:
  • vars_to_retrieve (list or similar, optional) – list containing variable IDs that are supposed to be read. If None, all variables in PROVIDES_VARIABLES are loaded

  • files (list, optional) – list of files to be read. If None, then the file list is used that is returned on get_file_list().

  • first_file (int, optional) – index of first file in file list to read. If None, the very first file in the list is used

  • last_file (int, optional) – index of last file in list to read. If None, the very last file in the list is used

Returns:

instance of ungridded data object containing data from all files.

Return type:

UngriddedData

abstract read_file(filename, vars_to_retrieve=None)[source]

Read single file

Parameters:
  • filename (str) – string specifying filename

  • vars_to_retrieve (list or similar, optional) – list containing variable IDs that are supposed to be read. If None, all variables in PROVIDES_VARIABLES are loaded

Returns:

imported data in a suitable format that can be handled by read(), which appends the loaded results from this method (reading one data file) to an instance of UngriddedData comprising all files.

Return type:

dict or StationData, or other…

read_first_file(**kwargs)[source]

Read first file returned from get_file_list()

Note

This method may be used for test purposes.

Parameters:

**kwargs – keyword args passed to read_file() (e.g. vars_to_retrieve)

Returns:

dictionary or similar containing loaded results from first file

Return type:

dict-like

read_station(station_id_filename, **kwargs)[source]

Read data from a single station into UngriddedData

Find all files that contain the station ID in their filename and then call read(), providing the reduced filelist as input, in order to read all files from this station into data object.

Parameters:
  • station_id_filename (str) – name of station (MUST be contained in the filename)

  • **kwargs – additional keyword args passed to read() (e.g. vars_to_retrieve)

Returns:

loaded data

Return type:

UngriddedData

Raises:

IOError – if no files can be found for this station ID

remove_outliers(data, vars_to_retrieve, **valid_rng_vars)[source]

Remove outliers from data

Parameters:
  • data (dict-like) – data object containing data vectors for variables that are required for computation (cf. input param vars_to_retrieve)

  • vars_to_retrieve (list) – list of variable names for which outliers will be removed from data

  • **valid_rng_vars – additional keyword args specifying variable name and corresponding min / max interval (list or tuple) that specifies the valid range for the variable. For each variable that is not explicitly defined here, the default minimum / maximum value is used (accessed via pyaerocom.const.VARS[var_name])
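
For example (the custom valid range below is an arbitrary illustration):

# reader: an instance of a derived reading class
data = reader.remove_outliers(
    data,
    vars_to_retrieve=["od550aer"],
    od550aer=(0, 10),  # custom valid range; otherwise defaults from pyaerocom.const.VARS apply
)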

var_supported(var_name)[source]

Check if input variable is supported

Parameters:

var_name (str) – AeroCom variable name or alias

Raises:

VariableDefinitionError – if input variable is not supported by pyaerocom

Returns:

True, if variable is supported by this interface, else False

Return type:

bool

property verbosity_level

Current level of verbosity of logger

AERONET

Aerosol Robotic Network (AERONET)

AERONET base class

All AERONET reading classes are based on the template ReadAeronetBase class which, in turn, inherits from ReadUngriddedBase.

class pyaerocom.io.readaeronetbase.ReadAeronetBase(data_id=None, data_dir=None)[source]

Bases: ReadUngriddedBase

TEMPLATE: Abstract base class template for reading of Aeronet data

Extended abstract base class, derived from low-level base class ReadUngriddedBase that contains some more functionality.

ALT_VAR_NAMES_FILE = {}

dictionary specifying alternative column names for variables defined in VAR_NAMES_FILE

Type:

OPTIONAL

AUX_FUNS = {}

Functions that are used to compute additional variables (i.e. one for each variable defined in AUX_REQUIRES)

AUX_REQUIRES = {}

dictionary containing information about additionally required variables for each auxiliary variable (i.e. each variable that is not provided by the original data but computed on import)

property AUX_VARS

List of auxiliary variables (keys of attr. AUX_REQUIRES)

Auxiliary variables are those that are not included in original files but are computed from other variables during import

COL_DELIM = ','

column delimiter in data block of files

property DATASET_PATH

Wrapper for data_dir.

abstract property DATA_ID

Name of dataset (OBS_ID)

Note

  • May be implemented as global constant in header of derived class

  • May be multiple that can be specified on init (see example below)

DEFAULT_UNIT = '1'

Default data unit that is assigned to all variables that are not specified in UNITS dictionary (cf. UNITS)

abstract property DEFAULT_VARS

List containing default variables to read

IGNORE_META_KEYS = ['date', 'time', 'day_of_year']
INSTRUMENT_NAME = 'sun_photometer'

name of measurement instrument

META_NAMES_FILE = {}

dictionary specifying the file column names (values) for each metadata key (cf. attributes of StationData, e.g. ‘station_name’, ‘longitude’, ‘latitude’, ‘altitude’)

META_NAMES_FILE_ALT = ({},)
abstract property PROVIDES_VARIABLES

List of variables that are provided by this dataset

Note

May be implemented as global constant in header

property REVISION_FILE

Name of revision file located in data directory

abstract property SUPPORTED_DATASETS

List of all datasets supported by this interface

Note

  • best practice to specify in header of class definition

  • needless to mention that DATA_ID needs to be in this list

property TS_TYPE

Default implementation of string for temporal resolution

TS_TYPES = {}

dictionary assigning temporal resolution flags for supported datasets that are provided in a defined temporal resolution. Key is the name of the dataset and value is the corresponding ts_type

UNITS = {}

Variable specific units, only required for variables that deviate from DEFAULT_UNIT (is irrelevant for all variables that are so far supported by the implemented Aeronet products, i.e. all variables are dimensionless as specified in DEFAULT_UNIT)

VAR_NAMES_FILE = {}

dictionary specifying the file column names (values) for each Aerocom variable (keys)

VAR_PATTERNS_FILE = {}

Mappings for identifying variables in file (may be specified in addition to explicit variable names specified in VAR_NAMES_FILE)
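
As an illustration, a derived reader may define the mappings along these lines (the column names are hypothetical examples of Aeronet file headers):

class ReadMyAeronetProduct(ReadAeronetBase):
    VAR_NAMES_FILE = {"od550aer": "AOD_550nm"}
    META_NAMES_FILE = {
        "station_name": "AERONET_Site",
        "latitude": "Site_Latitude(Degrees)",
        "longitude": "Site_Longitude(Degrees)",
        "altitude": "Site_Elevation(m)",
    }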

check_vars_to_retrieve(vars_to_retrieve)

Separate variables that are in file from those that are computed

Some of the provided variables by this interface are not included in the data files but are computed within this class during data import (e.g. od550aer, ang4487aer).

The latter may require additional parameters to be retrieved from the file, which is specified in the class header (cf. attribute AUX_REQUIRES).

This function checks the input list that specifies all required variables and separates them into two lists, one that includes all variables that can be read from the files and a second list that specifies all variables that are computed in this class.

Parameters:

vars_to_retrieve (list) – all parameter names that are supposed to be loaded

Returns:

2-element tuple, containing

  • list: list containing all variables to be read

  • list: list containing all variables to be computed

Return type:

tuple

property col_index

Dictionary that specifies the index for each data column

Note

Implementation depends on the data. For instance, if the variable information is provided in all files (of all stations) and always in the same column, then this can be set as a fixed dictionary in the __init__ function of the implementation (see e.g. class ReadAeronetSunV2). In other cases, it may not be ensured that each variable is available in all files or the column definition may differ between different stations. In the latter case you may automate the column index retrieval by providing the header names for each meta and data column you want to extract using the attribute dictionaries META_NAMES_FILE and VAR_NAMES_FILE and by calling _update_col_index() in your implementation of read_file() when you reach the line that contains the header information.

compute_additional_vars(data, vars_to_compute)

Compute all additional variables

The computations for each additional parameter are done using the specified methods in AUX_FUNS.

Parameters:
  • data (dict-like) – data object containing data vectors for variables that are required for computation (cf. input param vars_to_compute)

  • vars_to_compute (list) – list of variable names that are supposed to be computed. Variables that are required for the computation of the variables need to be specified in AUX_VARS and need to be available as data vectors in the provided data dictionary (key is the corresponding variable name of the required variable).

Returns:

updated data object now containing also computed variables

Return type:

dict

property data_dir: str

Location of the dataset

Note

This can be set explicitly when instantiating the class (e.g. if data is available on local machine). If unspecified, the data location is attempted to be inferred via get_obsnetwork_dir()

Raises:

FileNotFoundError – if data directory does not exist or cannot be retrieved automatically

Type:

str

property data_id

ID of dataset

property data_revision

Revision string from file Revision.txt in the main data directory

find_in_file_list(pattern=None)

Find all files that match a certain wildcard pattern

Parameters:

pattern (str, optional) – wildcard pattern that may be used to narrow down the search (e.g. use pattern=*Berlin* to find only files that contain Berlin in their filename)

Returns:

list containing all files in files that match pattern

Return type:

list

Raises:

IOError – if no matches can be found

get_file_list(pattern=None)

Search all files to be read

Uses _FILEMASK (+ optional input search pattern, e.g. station_name) to find valid files for query.

Parameters:

pattern (str, optional) – file name pattern applied to search

Returns:

list containing retrieved file locations

Return type:

list

Raises:

IOError – if no files can be found
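
Example (the path is illustrative; uses the concrete ReadAeronetSunV3 reader documented below):

from pyaerocom.io import ReadAeronetSunV3

reader = ReadAeronetSunV3(data_dir="/path/to/AeronetSunV3Lev2.daily/renamed")
files = reader.get_file_list()                 # all files matching _FILEMASK
berlin = reader.find_in_file_list("*Berlin*")  # narrow down via wildcard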

infer_wavelength_colname(colname, low=250, high=2000)[source]

Get variable wavelength from column name

Parameters:
  • colname (str) – string of column name

  • low (int) – lower limit of accepted value range

  • high (int) – upper limit of accepted value range

Returns:

wavelength in nm, as a string representing a floating point number

Return type:

str

Raises:

ValueError – if no number, or more than one number, is detected in the column name
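
Example (the second column name is hypothetical, shown only to illustrate the ValueError):

reader = ReadAeronetSunV3()
reader.infer_wavelength_colname("AOD_500nm")    # returns the wavelength as string, e.g. '500'
reader.infer_wavelength_colname("AOD_440_870")  # raises ValueError (two numbers detected)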

logger

Class's own logger instance

print_all_columns()[source]

read(vars_to_retrieve=None, files=None, first_file=None, last_file=None, file_pattern=None, common_meta=None)[source]

Method that reads list of files as instance of UngriddedData

Parameters:
  • vars_to_retrieve (list or similar, optional) – list containing variable IDs that are supposed to be read. If None, all variables in PROVIDES_VARIABLES are loaded

  • files (list, optional) – list of files to be read. If None, then the file list returned by get_file_list() is used.

  • first_file (int, optional) – index of the first file in the file list to read. If None, the very first file in the list is used. Note: ignored if input parameter file_pattern is specified.

  • last_file (int, optional) – index of the last file in the file list to read. If None, the very last file in the list is used. Note: ignored if input parameter file_pattern is specified.

  • file_pattern (str, optional) – string pattern for file search (cf get_file_list())

  • common_meta (dict, optional) – dictionary that contains additional metadata shared for this network (assigned to each metadata block of the UngriddedData object that is returned)

Returns:

data object

Return type:

UngriddedData
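
Example (illustrative; restricts the read to the first files for a quick test, using the concrete ReadAeronetSunV3 reader documented below):

from pyaerocom.io import ReadAeronetSunV3

reader = ReadAeronetSunV3()
data = reader.read(vars_to_retrieve=["od550aer", "ang4487aer"],
                   first_file=0, last_file=10)
print(type(data))  # <class 'pyaerocom.ungriddeddata.UngriddedData'>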

abstract read_file(filename, vars_to_retrieve=None)

Read single file

Parameters:
  • filename (str) – string specifying filename

  • vars_to_retrieve (list or similar, optional) – list containing variable IDs that are supposed to be read. If None, all variables in PROVIDES_VARIABLES are loaded

Returns:

imported data in a format that can be handled by read(), which appends the results of this method (each call reads one data file) to an instance of UngriddedData covering all files.

Return type:

dict or StationData, or other…
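
Example (a schematic sketch of a possible implementation in a hypothetical subclass; class name, dataset ID and file format are invented, and read() must be implemented as well unless inherited from an intermediate base class such as ReadAeronetBase):

from pyaerocom import StationData
from pyaerocom.io.readungriddedbase import ReadUngriddedBase

class ReadMyNetwork(ReadUngriddedBase):
    _FILEMASK = "*.csv"
    DATA_ID = "MyNetwork"
    SUPPORTED_DATASETS = ["MyNetwork"]
    PROVIDES_VARIABLES = ["od550aer"]
    DEFAULT_VARS = ["od550aer"]
    TS_TYPE = "daily"
    __version__ = "0.1"

    def read_file(self, filename, vars_to_retrieve=None):
        if vars_to_retrieve is None:
            vars_to_retrieve = self.DEFAULT_VARS
        station = StationData()
        # ... parse the file here and fill station metadata and data vectors ...
        return station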

read_first_file(**kwargs)

Read first file returned from get_file_list()

Note

This method may be used for test purposes.

Parameters:

**kwargs – keyword args passed to read_file() (e.g. vars_to_retrieve)

Returns:

dictionary or similar containing loaded results from first file

Return type:

dict-like

read_station(station_id_filename, **kwargs)

Read data from a single station into UngriddedData

Find all files that contain the station ID in their filename, then call read() with the reduced file list as input in order to read all files from this station into the data object.

Parameters:
  • station_id_filename (str) – name of station (MUST be contained in the filename)

  • **kwargs – additional keyword args passed to read() (e.g. vars_to_retrieve)

Returns:

loaded data

Return type:

UngriddedData

Raises:

IOError – if no files can be found for this station ID
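
Example (the station name is illustrative and must occur in the file names):

reader = ReadAeronetSunV3()
data = reader.read_station("Leipzig", vars_to_retrieve=["od550aer"])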

remove_outliers(data, vars_to_retrieve, **valid_rng_vars)

Remove outliers from data

Parameters:
  • data (dict-like) – data object containing data vectors for variables that are required for computation (cf. input param vars_to_retrieve)

  • vars_to_retrieve (list) – list of variable names for which outliers will be removed from data

  • **valid_rng_vars – additional keyword args specifying a variable name and the corresponding min / max interval (list or tuple) that defines the valid range for that variable. For each variable that is not explicitly defined here, the default minimum / maximum values are used (accessed via pyaerocom.const.VARS[var_name])
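
Example (the range is illustrative; the sketch assumes the updated data object is returned):

# restrict od550aer to [0, 2]; variables without an explicit range fall
# back to the defaults in pyaerocom.const.VARS
data = reader.remove_outliers(data, ["od550aer"], od550aer=[0, 2])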

var_supported(var_name)

Check if input variable is supported

Parameters:

var_name (str) – AeroCom variable name or alias

Raises:

VariableDefinitionError – if input variable is not supported by pyaerocom

Returns:

True if variable is supported by this interface, else False

Return type:

bool
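
Example (illustrative, using the concrete ReadAeronetSunV3 reader documented below):

reader = ReadAeronetSunV3()
reader.var_supported("od550aer")  # True (computable via AUX_FUNS)
reader.var_supported("concpm10")  # False, not provided by this interface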

property verbosity_level

Current level of verbosity of logger

AERONET Sun (V3)

class pyaerocom.io.read_aeronet_sunv3.ReadAeronetSunV3(data_id=None, data_dir=None)[source]

Bases: ReadAeronetBase

Interface for reading Aeronet direct sun version 3 Level 1.5 and 2.0 data

See also

Base classes ReadAeronetBase and ReadUngriddedBase
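
A minimal usage sketch (assumes the dataset can be located via the pyaerocom configuration or an explicit data_dir):

from pyaerocom.io import ReadAeronetSunV3

reader = ReadAeronetSunV3()  # defaults to DATA_ID 'AeronetSunV3Lev2.daily'
data = reader.read(vars_to_retrieve=["od550aer"])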

ALT_VAR_NAMES_FILE = {}

dictionary specifying alternative column names for variables defined in VAR_NAMES_FILE

Type:

OPTIONAL

AUX_FUNS = {'ang44&87aer': <function calc_ang4487aer>, 'od550aer': <function calc_od550aer>, 'od550lt1ang': <function calc_od550lt1ang>, 'proxyod550aerh2o': <function calc_od550aer>, 'proxyod550bc': <function calc_od550aer>, 'proxyod550dust': <function calc_od550aer>, 'proxyod550nh4': <function calc_od550aer>, 'proxyod550no3': <function calc_od550aer>, 'proxyod550oa': <function calc_od550aer>, 'proxyod550so4': <function calc_od550aer>, 'proxyod550ss': <function calc_od550aer>, 'proxyzaerosol': <function calc_od550aer>, 'proxyzdust': <function calc_od550aer>}

Functions that are used to compute additional variables (i.e. one for each variable defined in AUX_REQUIRES)

AUX_REQUIRES = {'ang44&87aer': ['od440aer', 'od870aer'], 'od550aer': ['od440aer', 'od500aer', 'ang4487aer'], 'od550lt1ang': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyod550aerh2o': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyod550bc': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyod550dust': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyod550nh4': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyod550no3': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyod550oa': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyod550so4': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyod550ss': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyzaerosol': ['od440aer', 'od500aer', 'ang4487aer'], 'proxyzdust': ['od440aer', 'od500aer', 'ang4487aer']}

dictionary containing information about additionally required variables for each auxiliary variable (i.e. each variable that is not provided by the original data but computed on import)
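
For instance, od550aer is not contained in the files but is derived from the listed input AODs via the Ångström power law. A sketch of the relation with illustrative values (the actual implementation is calc_od550aer):

# AOD(lambda1) = AOD(lambda2) * (lambda1 / lambda2) ** (-angstrom_exponent)
od500aer, ang4487aer = 0.2, 1.5
od550aer = od500aer * (550 / 500) ** -ang4487aer
print(round(od550aer, 3))  # 0.173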

property AUX_VARS

List of auxiliary variables (keys of attr. AUX_REQUIRES)

Auxiliary variables are those that are not included in original files but are computed from other variables during import

COL_DELIM = ','

column delimiter in data block of files

property DATASET_PATH

Wrapper for data_dir.

DATA_ID = 'AeronetSunV3Lev2.daily'

Name of dataset (OBS_ID)

DEFAULT_UNIT = '1'

Default data unit that is assigned to all variables that are not specified in UNITS dictionary (cf. UNITS)

DEFAULT_VARS = ['od550aer', 'ang4487aer']

default variables for read method

IGNORE_META_KEYS = ['date', 'time', 'day_of_year']

INSTRUMENT_NAME = 'sun_photometer'

name of measurement instrument

META_NAMES_FILE = {'altitude': 'Site_Elevation(m)', 'data_quality_level': 'Data_Quality_Level', 'date': 'Date(dd:mm:yyyy)', 'day_of_year': 'Day_of_Year', 'instrument_number': 'AERONET_Instrument_Number', 'latitude': 'Site_Latitude(Degrees)', 'longitude': 'Site_Longitude(Degrees)', 'station_name': 'AERONET_Site', 'time': 'Time(hh:mm:ss)'}

dictionary specifying the file column names (values) for each metadata key (cf. attributes of StationData, e.g. ‘station_name’, ‘longitude’, ‘latitude’, ‘altitude’)

META_NAMES_FILE_ALT = {'AERONET_Site': ['AERONET_Site_Name']}

NAN_VAL = -999.0

value corresponding to invalid measurement

PROVIDES_VARIABLES = ['od340aer', 'od440aer', 'od500aer', 'od870aer', 'ang4487aer']

List of variables that are provided by this dataset (will be extended by auxiliary variables on class init, for details see __init__ method of base class ReadUngriddedBase)

property REVISION_FILE

Name of revision file located in data directory

SUPPORTED_DATASETS = ['AeronetSunV3Lev1.5.daily', 'AeronetSunV3Lev1.5.AP', 'AeronetSunV3Lev2.daily', 'AeronetSunV3Lev2.AP']

List of all datasets supported by this interface

property TS_TYPE

Default implementation of string for temporal resolution

TS_TYPES = {'AeronetSunV3Lev1.5.daily': 'daily', 'AeronetSunV3Lev2.daily': 'daily'}

dictionary assigning temporal resolution flags for supported datasets that are provided in a defined temporal resolution

UNITS = {'proxyzaerosol': 'km', 'proxyzdust': 'km'}

Variable specific units, only required for variables whose unit deviates from DEFAULT_UNIT (here, the proxy aerosol / dust layer heights given in km; all other supported variables are dimensionless as specified in DEFAULT_UNIT)

VAR_NAMES_FILE = {'ang4487aer': '440-870_Angstrom_Exponent', 'od340aer': 'AOD_340nm', 'od440aer': 'AOD_440nm', 'od500aer': 'AOD_500nm', 'od870aer': 'AOD_870nm'}

dictionary specifying the file column names (values) for each Aerocom variable (keys)

VAR_PATTERNS_FILE = {'AOD_([0-9]*)nm': 'od*aer'}

Mappings for identifying variables in file
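
The keys are regular expressions applied to file column headers; a sketch of how such a pattern may be resolved into an AeroCom variable name (the resolution logic shown is illustrative):

import re

pattern, var_template = "AOD_([0-9]*)nm", "od*aer"
match = re.match(pattern, "AOD_675nm")
if match:
    var_name = var_template.replace("*", match.group(1))
    print(var_name)  # od675aer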

check_vars_to_retrieve(vars_to_retrieve)

Separate variables that are in file from those that are computed

Some of the variables provided by this interface are not included in the data files but are computed within this class during data import (e.g. od550aer, ang4487aer).

The latter may require additional parameters to be retrieved from the file; these are specified in the class header (cf. attribute AUX_REQUIRES).

This function checks the input list of all required variables and separates it into two lists: one containing all variables that can be read from the files, and a second containing all variables that are computed in this class.

Parameters:

vars_to_retrieve (list) – all parameter names that are supposed to be loaded

Returns:

2-element tuple, containing

  • list: list containing all variables to be read

  • list: list containing all variables to be computed

Return type:

tuple

property col_index

Dictionary that specifies the index for each data column

Note

Implementation depends on the data. For instance, if the variable information is provided in all files (of all stations) and always in the same column, then this can be set as a fixed dictionary in the __init__ function of the implementation (see e.g. class ReadAeronetSunV2). In other cases, it may not be ensured that each variable is available in all files, or the column definition may differ between stations. In the latter case, you may automate the column index retrieval by providing the header names for each meta and data column you want to extract via the attribute dictionaries META_NAMES_FILE and VAR_NAMES_FILE, and by calling _update_col_index() in your implementation of read_file() once you reach the line that contains the header information.

compute_additional_vars(data, vars_to_compute)

Compute all additional variables

The computations for each additional parameter are done using the specified methods in AUX_FUNS.

Parameters:
  • data (dict-like) – data object containing data vectors for variables that are required for computation (cf. input param vars_to_compute)

  • vars_to_compute (list) – list of variable names that are supposed to be computed. Variables required for the computation need to be specified in AUX_REQUIRES and need to be available as data vectors in the provided data dictionary (keyed by the name of the required variable).

Returns:

updated data object now containing also computed variables

Return type:

dict

property data_dir: str

Location of the dataset

Note

This can be set explicitly when instantiating the class (e.g. if data is available on the local machine). If unspecified, pyaerocom attempts to infer the data location via get_obsnetwork_dir()

Raises:

FileNotFoundError – if data directory does not exist or cannot be retrieved automatically

Type:

str

property data_id

ID of dataset

property data_revision

Revision string from file Revision.txt in the main data directory

find_in_file_list(pattern=None)

Find all files that match a certain wildcard pattern

Parameters:

pattern (str, optional) – wildcard pattern that may be used to narrow down the search (e.g. use pattern=*Berlin* to find only files that contain Berlin in their filename)

Returns:

list containing all files in the files attribute that match pattern

Return type:

list

Raises:

IOError – if no matches can be found

get_file_list(pattern=None)

Search all files to be read

Uses _FILEMASK (plus the optional input search pattern, e.g. a station_name) to find valid files for the query.

Parameters:

pattern (str, optional) – file name pattern applied to search

Returns:

list containing retrieved file locations

Return type:

list

Raises:

IOError – if no files can be found

infer_wavelength_colname(colname, low=250, high=2000)

Get variable wavelength from column name

Parameters:
  • colname (str) – string of column name

  • low (int) – lower limit of accepted value range

  • high (int) – upper limit of accepted value range

Returns:

wavelength in nm, as a string representing a floating point number

Return type:

str

Raises:

ValueError – if no number, or more than one number, is detected in the column name

logger

Class's own logger instance

print_all_columns()

read(vars_to_retrieve=None, files=None, first_file=None, last_file=None, file_pattern=None, common_meta=None)

Method that reads list of files as instance of UngriddedData

Parameters:
  • vars_to_retrieve (list or similar, optional) – list containing variable IDs that are supposed to be read. If None, all variables in PROVIDES_VARIABLES are loaded

  • files (list, optional) – list of files to be read. If None, then the file list returned by get_file_list() is used.

  • first_file (int, optional) – index of the first file in the file list to read. If None, the very first file in the list is used. Note: ignored if input parameter file_pattern is specified.

  • last_file (int, optional) – index of the last file in the file list to read. If None, the very last file in the list is used. Note: ignored if input parameter file_pattern is specified.

  • file_pattern (str, optional) – string pattern for file search (cf get_file_list())

  • common_meta (dict, optional) – dictionary that contains additional metadata shared for this network (assigned to each metadata block of the UngriddedData object that is returned)

Returns:

data object

Return type:

UngriddedData

read_file(filename, vars_to_retrieve=None, vars_as_series=False)[source]

Read Aeronet Sun V3 level 1.5 or 2 file

Parameters:
  • filename (str) – absolute path to filename to read

  • vars_to_retrieve (list, optional) – list of str with variable names to read. If None, use DEFAULT_VARS

  • vars_as_series (bool) – if True, the data columns of all variables in the result dictionary are converted into pandas Series objects

Returns:

dict-like object containing results

Return type:

StationData

read_first_file(**kwargs)

Read first file returned from get_file_list()

Note

This method may be used for test purposes.

Parameters:

**kwargs – keyword args passed to read_file() (e.g. vars_to_retrieve)

Returns:

dictionary or similar containing loaded results from first file

Return type:

dict-like

read_station(station_id_filename, **kwargs)

Read data from a single station into UngriddedData

Find all files that contain the station ID in their filename, then call read() with the reduced file list as input in order to read all files from this station into the data object.

Parameters:
  • station_id_filename (str) – name of station (MUST be contained in the filename)

  • **kwargs – additional keyword args passed to read() (e.g. vars_to_retrieve)

Returns:

loaded data

Return type:

UngriddedData

Raises:

IOError – if no files can be found for this station ID

remove_outliers(data, vars_to_retrieve, **valid_rng_vars)

Remove outliers from data

Parameters:
  • data (dict-like) – data object containing data vectors for variables that are required for computation (cf. input param vars_to_retrieve)

  • vars_to_retrieve (list) – list of variable names for which outliers will be removed from data

  • **valid_rng_vars – additional keyword args specifying a variable name and the corresponding min / max interval (list or tuple) that defines the valid range for that variable. For each variable that is not explicitly defined here, the default minimum / maximum values are used (accessed via pyaerocom.const.VARS[var_name])

var_supported(var_name)

Check if input variable is supported

Parameters:

var_name (str) – AeroCom variable name or alias

Raises:

VariableDefinitionError – if input variable is not supported by pyaerocom

Returns:

True if variable is supported by this interface, else False

Return type:

bool

property verbosity_level

Current level of verbosity of logger

AERONET SDA (V3)

class pyaerocom.io.read_aeronet_sdav3.ReadAeronetSdaV3(data_id=None, data_dir=None)[source]

Bases: ReadAeronetBase

Interface for reading Aeronet Sun SDA V3 Level 1.5 and 2.0 data

See also

Base classes ReadAeronetBase and ReadUngriddedBase
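
Usage mirrors the Sun V3 reader; for example, reading the fine- and coarse-mode 550 nm AODs (assumes the dataset location is configured):

from pyaerocom.io import ReadAeronetSdaV3

reader = ReadAeronetSdaV3()  # defaults to DATA_ID 'AeronetSDAV3Lev2.daily'
data = reader.read(vars_to_retrieve=["od550lt1aer", "od550gt1aer"])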

ALT_VAR_NAMES_FILE = {}

dictionary specifying alternative column names for variables defined in VAR_NAMES_FILE

Type:

OPTIONAL

AUX_FUNS = {'od550aer': <function calc_od550aer>, 'od550dust': <function calc_od550gt1aer>, 'od550gt1aer': <function calc_od550gt1aer>, 'od550lt1aer': <function calc_od550lt1aer>}

Functions that are used to compute additional variables (i.e. one for each variable defined in AUX_REQUIRES)

AUX_REQUIRES = {'od550aer': ['od500aer', 'ang4487aer'], 'od550dust': ['od500gt1aer', 'ang4487aer'], 'od550gt1aer': ['od500gt1aer', 'ang4487aer'], 'od550lt1aer': ['od500lt1aer', 'ang4487aer']}

dictionary containing information about additionally required variables for each auxiliary variable (i.e. each variable that is not provided by the original data but computed on import)

property AUX_VARS

List of auxiliary variables (keys of attr. AUX_REQUIRES)

Auxiliary variables are those that are not included in original files but are computed from other variables during import

COL_DELIM = ','

column delimiter in data block of files

property DATASET_PATH

Wrapper for data_dir.

DATA_ID = 'AeronetSDAV3Lev2.daily'

Name of dataset (OBS_ID)

DEFAULT_UNIT = '1'

Default data unit that is assigned to all variables that are not specified in UNITS dictionary (cf. UNITS)

DEFAULT_VARS = ['od550aer', 'od550gt1aer', 'od550lt1aer', 'od550dust']

default variables for read method

IGNORE_META_KEYS = ['date', 'time', 'day_of_year']

INSTRUMENT_NAME = 'sun_photometer'

name of measurement instrument

META_NAMES_FILE = {'altitude': 'Site_Elevation(m)', 'data_quality_level': 'Data_Quality_Level', 'date': 'Date_(dd:mm:yyyy)', 'day_of_year': 'Day_of_Year', 'instrument_number': 'AERONET_Instrument_Number', 'latitude': 'Site_Latitude(Degrees)', 'longitude': 'Site_Longitude(Degrees)', 'station_name': 'AERONET_Site', 'time': 'Time_(hh:mm:ss)'}

dictionary specifying the file column names (values) for each metadata key (cf. attributes of StationData, e.g. ‘station_name’, ‘longitude’, ‘latitude’, ‘altitude’)

META_NAMES_FILE_ALT = ({},)

NAN_VAL = -999.0

value corresponding to invalid measurement

PROVIDES_VARIABLES = ['od500gt1aer', 'od500lt1aer', 'od500aer', 'ang4487aer', 'od500dust']

List of variables that are provided by this dataset (will be extended by auxiliary variables on class init, for details see __init__ method of base class ReadUngriddedBase)

property REVISION_FILE

Name of revision file located in data directory

SUPPORTED_DATASETS = ['AeronetSDAV3Lev1.5.daily', 'AeronetSDAV3Lev2.daily']

List of all datasets supported by this interface

property TS_TYPE

Default implementation of string for temporal resolution

TS_TYPES = {'AeronetSDAV3Lev1.5.daily': 'daily', 'AeronetSDAV3Lev2.daily': 'daily'}

dictionary assigning temporal resolution flags for supported datasets that are provided in a defined temporal resolution

UNITS = {}

Variable specific units, only required for variables that deviate from DEFAULT_UNIT (irrelevant for all variables supported so far by this product, i.e. all variables are dimensionless as specified in DEFAULT_UNIT)

VAR_NAMES_FILE = {'ang4487aer': 'Angstrom_Exponent(AE)-Total_500nm[alpha]', 'od500aer': 'Total_AOD_500nm[tau_a]', 'od500dust': 'Coarse_Mode_AOD_500nm[tau_c]', 'od500gt1aer': 'Coarse_Mode_AOD_500nm[tau_c]', 'od500lt1aer': 'Fine_Mode_AOD_500nm[tau_f]'}

dictionary specifying the file column names (values) for each Aerocom variable (keys)

VAR_PATTERNS_FILE = {}

Mappings for identifying variables in file (may be specified in addition to explicit variable names specified in VAR_NAMES_FILE)

check_vars_to_retrieve(vars_to_retrieve)

Separate variables that are in file from those that are computed

Some of the variables provided by this interface are not included in the data files but are computed within this class during data import (e.g. od550aer, ang4487aer).

The latter may require additional parameters to be retrieved from the file; these are specified in the class header (cf. attribute AUX_REQUIRES).

This function checks the input list of all required variables and separates it into two lists: one containing all variables that can be read from the files, and a second containing all variables that are computed in this class.

Parameters:

vars_to_retrieve (list) – all parameter names that are supposed to be loaded

Returns:

2-element tuple, containing

  • list: list containing all variables to be read

  • list: list containing all variables to be computed

Return type:

tuple

property col_index

Dictionary that specifies the index for each data column

Note

Implementation depends on the data. For instance, if the variable information is provided in all files (of all stations) and always in the same column, then this can be set as a fixed dictionary in the __init__ function of the implementation (see e.g. class ReadAeronetSunV2). In other cases, it may not be ensured that each variable is available in all files, or the column definition may differ between stations. In the latter case, you may automate the column index retrieval by providing the header names for each meta and data column you want to extract via the attribute dictionaries META_NAMES_FILE and VAR_NAMES_FILE, and by calling _update_col_index() in your implementation of read_file() once you reach the line that contains the header information.

compute_additional_vars(data, vars_to_compute)

Compute all additional variables

The computations for each additional parameter are done using the specified methods in AUX_FUNS.

Parameters:
  • data (dict-like) – data object containing data vectors for variables that are required for computation (cf. input param vars_to_compute)

  • vars_to_compute (list) – list of variable names that are supposed to be computed. Variables required for the computation need to be specified in AUX_REQUIRES and need to be available as data vectors in the provided data dictionary (keyed by the name of the required variable).

Returns:

updated data object now containing also computed variables

Return type:

dict

property data_dir: str

Location of the dataset

Note

This can be set explicitly when instantiating the class (e.g. if data is available on the local machine). If unspecified, pyaerocom attempts to infer the data location via get_obsnetwork_dir()

Raises:

FileNotFoundError – if data directory does not exist or cannot be retrieved automatically

Type:

str

property data_id

ID of dataset

property data_revision

Revision string from file Revision.txt in the main data directory

find_in_file_list(pattern=None)

Find all files that match a certain wildcard pattern

Parameters:

pattern (str, optional) – wildcard pattern that may be used to narrow down the search (e.g. use pattern=*Berlin* to find only files that contain Berlin in their filename)

Returns:

list containing all files in the files attribute that match pattern

Return type:

list

Raises:

IOError – if no matches can be found

get_file_list(pattern=None)

Search all files to be read

Uses _FILEMASK (plus the optional input search pattern, e.g. a station_name) to find valid files for the query.

Parameters:

pattern (str, optional) – file name pattern applied to search

Returns:

list containing retrieved file locations

Return type:

list

Raises:

IOError – if no files can be found

infer_wavelength_colname(colname, low=250, high=2000)

Get variable wavelength from column name

Parameters:
  • colname (str) – string of column name

  • low (int) – lower limit of accepted value range

  • high (int) – upper limit of accepted value range

Returns:

wavelength in nm, as a string representing a floating point number

Return type:

str

Raises:

ValueError – if no number, or more than one number, is detected in the column name

logger

Class's own logger instance

print_all_columns()

read(vars_to_retrieve=None, files=None, first_file=None, last_file=None, file_pattern=None, common_meta=None)

Method that reads list of files as instance of UngriddedData

Parameters:
  • vars_to_retrieve (list or similar, optional) – list containing variable IDs that are supposed to be read. If None, all variables in PROVIDES_VARIABLES are loaded

  • files (list, optional) – list of files to be read. If None, then the file list returned by get_file_list() is used.

  • first_file (int, optional) – index of the first file in the file list to read. If None, the very first file in the list is used. Note: ignored if input parameter file_pattern is specified.

  • last_file (int, optional) – index of the last file in the file list to read. If None, the very last file in the list is used. Note: ignored if input parameter file_pattern is specified.

  • file_pattern (str, optional) – string pattern for file search (cf get_file_list())

  • common_meta (dict, optional) – dictionary that contains additional metadata shared for this network (assigned to each metadata block of the UngriddedData object that is returned)

Returns:

data object

Return type:

UngriddedData

read_file(filename, vars_to_retrieve=None, vars_as_series=False)[source]

Read Aeronet SDA V3 file and return it in a dictionary

Parameters:
  • filename (str) – absolute path to filename to read

  • vars_to_retrieve (list, optional) – list of str with variable names to read. If None, use DEFAULT_VARS

  • vars_as_series (bool) – if True, the data columns of all variables in the result dictionary are converted into pandas Series objects

Returns:

dict-like object containing results

Return type:

StationData

read_first_file