Merging of several StationData objects into one

This notebook illustrates to can merge several instances StationData objects into one objects. This merging only works for data from the same station and a typical case is if the data source files for one station are separated into many single files containing only parts of the data (e.g. if the files contain only one year of data).

In the following, the example of the EBAS database is used for illustration. In particular, we will focus on the retrieval of the aerosol light scattering coefficients at 550 nm (scatc550aer in AEROCOM naming convention) for the station Jungfraujoch, located in Germany.

[1]:
import pyaerocom as pya
Initating pyaerocom configuration
Checking database access...
Checking access to: /lustre/storeA
Access to lustre database: True
Init data paths for lustre
Expired time: 0.016 s

Get list of all files containing scattering data for EBAS station Jungfraujoch

[2]:
reader = pya.io.ReadEbas()
data = reader.read(vars_to_retrieve='scatc550aer',
                            datalevel=2, station_names='Jungfraujoch')
print(data)
Retrieving EBAS files for variables
['scatc550aer']
Reading files 1-3 of 28 (ReadEbas) | 18:10:51 (delta = 0 s')
Reading files 3-5 of 28 (ReadEbas) | 18:10:52 (delta = 0 s')
Reading files 5-7 of 28 (ReadEbas) | 18:10:52 (delta = 0 s')
Reading files 7-9 of 28 (ReadEbas) | 18:10:53 (delta = 0 s')
Reading files 9-11 of 28 (ReadEbas) | 18:10:53 (delta = 0 s')
Reading files 11-13 of 28 (ReadEbas) | 18:10:54 (delta = 0 s')
Reading files 13-15 of 28 (ReadEbas) | 18:10:55 (delta = 0 s')
Reading files 15-17 of 28 (ReadEbas) | 18:10:55 (delta = 0 s')
Reading files 17-19 of 28 (ReadEbas) | 18:10:56 (delta = 0 s')
Reading files 19-21 of 28 (ReadEbas) | 18:10:57 (delta = 0 s')
Reading files 21-23 of 28 (ReadEbas) | 18:10:58 (delta = 0 s')
Reading files 23-25 of 28 (ReadEbas) | 18:10:59 (delta = 1 s')
Reading files 25-27 of 28 (ReadEbas) | 18:11:00 (delta = 1 s')
Reading files 27-29 of 28 (ReadEbas) | 18:11:01 (delta = 1 s')

Pyaerocom UngriddedData
-----------------------
Contains networks: ['EBASMC']
Contains variables: ['scatc550aer']
Contains instruments: ['IN3563', 'TSI_3563_JFJ_dry', 'Ecotech_Aurora3000_JFJ_dry']
Total no. of meta-blocks: 28
Filters that were applied:
 Filter time log: 20191003181051
        prefer_statistics: ['arithmetic mean', 'median']
        wavelength_tol_nm: 50
        datalevel: 2
        station_names: Jungfraujoch
 Filter time log: 20191003181102
        Removed 0 metadata blocks that have no data assigned

As you can see, the data has successfully been imported into an instance of the UngriddedData class. This class is organised by file, that is, for each of the 26 files that were imported, there is one metadata dictionary assigned. Let’s look at the metadata from the first file:

[3]:
data.metadata[0]
[3]:
OrderedDict([('latitude', 46.5475),
             ('longitude', 7.985),
             ('altitude', 3578.0),
             ('filename',
              'CH0001G.19950101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas'),
             ('station_id', 'CH0001G'),
             ('station_name', 'Jungfraujoch'),
             ('instrument_name', 'IN3563'),
             ('PI', 'Baltensperger, Urs; Weingartner, Ernest'),
             ('country', None),
             ('ts_type', 'hourly'),
             ('data_id', 'EBASMC'),
             ('dataset_name', None),
             ('data_product', None),
             ('data_version', None),
             ('data_level', 2),
             ('revision_date', numpy.datetime64('2018-10-31T00:00:00')),
             ('website', None),
             ('ts_type_src', None),
             ('stat_merge_pref_attr', None),
             ('data_revision', '20190701'),
             ('var_info',
              OrderedDict([('scatc550aer',
                            OrderedDict([('name',
                                          'aerosol_light_scattering_coefficient'),
                                         ('units', '1/Mm'),
                                         ('wavelength', '550.0 nm'),
                                         ('method_ref', 'CH02L_scat_coef'),
                                         ('matrix', 'aerosol'),
                                         ('statistics',
                                          'arithmetic mean')]))])),
             ('variables', ['scatc550aer'])])

And the last one:

[4]:
data.metadata[25]
[4]:
OrderedDict([('latitude', 46.5475),
             ('longitude', 7.985),
             ('altitude', 3580.0),
             ('filename',
              'CH0001G.20170101000000.20190524143212.nephelometer..aerosol.1y.1h.CH02L_Ecotech_Aurora3000_JFJ_dry.CH02L_Neph_Aurora3000.lev2.nas'),
             ('station_id', 'CH0001G'),
             ('station_name', 'Jungfraujoch'),
             ('instrument_name', 'Ecotech_Aurora3000_JFJ_dry'),
             ('PI', 'Bukowiecki, Nicolas; Baltensperger, Urs'),
             ('country', None),
             ('ts_type', 'hourly'),
             ('data_id', 'EBASMC'),
             ('dataset_name', None),
             ('data_product', None),
             ('data_version', None),
             ('data_level', 2),
             ('revision_date', numpy.datetime64('2019-05-24T00:00:00')),
             ('website', None),
             ('ts_type_src', None),
             ('stat_merge_pref_attr', None),
             ('data_revision', '20190701'),
             ('var_info',
              OrderedDict([('scatc550aer',
                            OrderedDict([('name',
                                          'aerosol_light_scattering_coefficient'),
                                         ('units', '1/Mm'),
                                         ('wavelength', '525.0 nm'),
                                         ('statistics', 'arithmetic mean'),
                                         ('matrix', 'aerosol')]))])),
             ('variables', ['scatc550aer'])])

As you can see, both files contain scattering data but do not share all the same metadata attributes (e.g. instrument_name is different, which might be due to technological updates over time).

Let’s have a look at the respective time-series for both stations. First, convert into instance of StationData class and then plot.

[5]:
first_file = data.to_station_data(0, vars_to_convert='scatc550aer')
print(first_file)

Pyaerocom StationData
---------------------
var_info (BrowseDict):
  scatc550aer (OrderedDict):
    name: aerosol_light_scattering_coefficient
    units: 1/Mm
    wavelength: 550.0 nm
    method_ref: CH02L_scat_coef
    matrix: aerosol
    statistics: arithmetic mean
    overlap: False
station_coords (dict):
  latitude: 46.5475
  longitude: 7.985
  altitude: 3578.0
data_err (BrowseDict): <empty_dict>
overlap (BrowseDict): <empty_dict>
data_flagged (BrowseDict):
  scatc550aer (ndarray, 8760 items): [1.00, 1.00, ..., 0.0, 0.0]
filename: CH0001G.19950101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas
station_id: CH0001G
station_name: Jungfraujoch
instrument_name: IN3563
PI: Baltensperger, Urs; Weingartner, Ernest
country: None
ts_type: hourly
latitude: 46.5475
longitude: 7.985
altitude: 3578.0
data_id: EBASMC
dataset_name: None
data_product: None
data_version: None
data_level: 2
revision_date: 2018-10-31T00:00:00
website: None
ts_type_src: hourly
stat_merge_pref_attr: None
data_revision: 20190701

Data arrays
.................
dtime (ndarray, 8760 items): [1995-01-01T00:30:00.000000000, 1995-01-01T01:29:59.000000000, ..., 1995-12-31T22:29:59.000000000, 1995-12-31T23:29:59.000000000]
Pandas Series
.................
scatc550aer (Series, 8760 items)
[6]:
last_file = data.to_station_data(25)
print(last_file)

Pyaerocom StationData
---------------------
var_info (BrowseDict):
  scatc550aer (OrderedDict):
    name: aerosol_light_scattering_coefficient
    units: 1/Mm
    wavelength: 525.0 nm
    statistics: arithmetic mean
    matrix: aerosol
    overlap: False
station_coords (dict):
  latitude: 46.5475
  longitude: 7.985
  altitude: 3580.0
data_err (BrowseDict): <empty_dict>
overlap (BrowseDict): <empty_dict>
data_flagged (BrowseDict):
  scatc550aer (ndarray, 8760 items): [1.00, 1.00, ..., 1.00, 0.0]
filename: CH0001G.20170101000000.20190524143212.nephelometer..aerosol.1y.1h.CH02L_Ecotech_Aurora3000_JFJ_dry.CH02L_Neph_Aurora3000.lev2.nas
station_id: CH0001G
station_name: Jungfraujoch
instrument_name: Ecotech_Aurora3000_JFJ_dry
PI: Bukowiecki, Nicolas; Baltensperger, Urs
country: None
ts_type: hourly
latitude: 46.5475
longitude: 7.985
altitude: 3580.0
data_id: EBASMC
dataset_name: None
data_product: None
data_version: None
data_level: 2
revision_date: 2019-05-24T00:00:00
website: None
ts_type_src: hourly
stat_merge_pref_attr: None
data_revision: 20190701

Data arrays
.................
dtime (ndarray, 8760 items): [2017-01-01T00:30:00.000000000, 2017-01-01T01:29:59.000000000, ..., 2017-12-31T22:29:59.000000000, 2017-12-31T23:29:59.000000000]
Pandas Series
.................
scatc550aer (Series, 8760 items)
[7]:
first_file.plot_timeseries('scatc550aer');
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_11_0.png
[8]:
last_file.plot_timeseries('scatc550aer');
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_12_0.png

As you can see, the files contain data from different years. Now, how can we get these objects into one object that contains the timeseries of both files from this station?

This is actually very easy:

[9]:
merged = first_file.merge_other(last_file, 'scatc550aer')
print(merged)

Pyaerocom StationData
---------------------
var_info (BrowseDict):
  scatc550aer (OrderedDict):
    name: aerosol_light_scattering_coefficient
    units: 1/Mm
    wavelength: 550.0 nm;525.0 nm
    method_ref: CH02L_scat_coef
    matrix: aerosol
    statistics: arithmetic mean
    overlap: False
    ts_type: hourly
    apply_constraints: False
    min_num_obs: None
station_coords (dict):
  latitude: 46.5475
  longitude: 7.985
  altitude: 3578.0
data_err (BrowseDict): <empty_dict>
overlap (BrowseDict): <empty_dict>
data_flagged (BrowseDict):
  scatc550aer (ndarray, 8760 items): [1.00, 1.00, ..., 0.0, 0.0]
filename: CH0001G.19950101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas; CH0001G.20170101000000.20190524143212.nephelometer..aerosol.1y.1h.CH02L_Ecotech_Aurora3000_JFJ_dry.CH02L_Neph_Aurora3000.lev2.nas
station_id: CH0001G
station_name: Jungfraujoch
instrument_name: IN3563; Ecotech_Aurora3000_JFJ_dry
PI: Baltensperger, Urs; Weingartner, Ernest; Bukowiecki, Nicolas
country: None
ts_type: hourly
latitude: 46.5475
longitude: 7.985
altitude: 3578.0
data_id: EBASMC
dataset_name: None
data_product: None
data_version: None
data_level: 2
revision_date (list, 2 items): [numpy.datetime64('2018-10-31T00:00:00'), numpy.datetime64('2019-05-24T00:00:00')]
website: None
ts_type_src: hourly
stat_merge_pref_attr: None
data_revision: 20190701

Data arrays
.................
dtime (ndarray, 9794 items): [1995-07-08T23:00:00.000000000, 1995-07-09T00:00:00.000000000, ..., 2017-12-24T19:00:00.000000000, 2017-12-31T23:00:00.000000000]
Pandas Series
.................
scatc550aer (Series, 9794 items)

As you can see in the output, the merging comprises not only the data arrays but also registers any differences in the assoicated metdadata (cf. e.g., sampling wavelength 550 nm vs. 525 nm, instrument name, PI)

Now, have a look at the merged timeseries data.

[10]:
merged.plot_timeseries('scatc550aer');
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_17_0.png

Looks okay. Let’s merge all 26 files and see if we get a nice long time series.

Retrieve list of StationData objects:

[11]:
stats = data.to_station_data('Jungfraujoch', 'scatc550aer', merge_if_multi=False)
print('Number of StationData objects retrieved: {}'.format(len(stats)))
Number of StationData objects retrieved: 28

Now merge them into one long time series:

[12]:
merged = pya.helpers.merge_station_data(stats, var_name='scatc550aer')
print(merged)

Pyaerocom StationData
---------------------
var_info (BrowseDict):
  scatc550aer (OrderedDict):
    name: aerosol_light_scattering_coefficient
    units: 1/Mm
    wavelength: 550.0 nm;525.0 nm
    statistics: arithmetic mean
    matrix: aerosol
    overlap: False
    ts_type: hourly
    apply_constraints: False
    min_num_obs: None
    method_ref: CH02L_scat_coef
station_coords (dict):
  latitude: 46.5475
  longitude: 7.985
  altitude: 3580.0
data_err (BrowseDict): <empty_dict>
overlap (BrowseDict):
  scatc550aer: 2015-01-01 01:00:00    0.255198
2015-01-01 02:00:00   -0.284809
2015-01-01 03:00:00    0.123830
2015-01-01 04:00:00    0.127599
2015-01-01 05:00:00    0.052224
                         ...
2018-12-31 19:00:00   -0.030204
2018-12-31 20:00:00    0.625988
2018-12-31 21:00:00    0.178854
2018-12-31 22:00:00    0.175893
2018-12-31 23:00:00    0.329280
Length: 25787, dtype: float64
data_flagged (BrowseDict):
  scatc550aer (ndarray, 8760 items): [0.0, 0.0, ..., 0.0, 0.0]
filename: CH0001G.20010101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20100101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20060101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20180101000000.20190520125514.nephelometer..aerosol.1y.1h.CH02L_Ecotech_Aurora3000_JFJ_dry.CH02L_Neph_Aurora3000.lev2.nas; CH0001G.20040101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20180101000000.20190520124723.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20080101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20050101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20070101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20110101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20030101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20150101000000.20190524143212.nephelometer..aerosol.1y.1h.CH02L_Ecotech_Aurora3000_JFJ_dry.CH02L_Neph_Aurora3000.lev2.nas; CH0001G.20170101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20160101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20000101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas; CH0001G.20140101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20020101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20090101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20120101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20130101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.19980101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas; CH0001G.19960101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas; CH0001G.20170101000000.20190524143212.nephelometer..aerosol.1y.1h.CH02L_Ecotech_Aurora3000_JFJ_dry.CH02L_Neph_Aurora3000.lev2.nas; CH0001G.19990101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas; CH0001G.20150101000000.20190524142901.nephelometer..aerosol.1y.1h.CH02L_TSI_3563_JFJ_dry.CH02L_Neph_3563.lev2.nas; CH0001G.20160101000000.20190524143212.nephelometer..aerosol.1y.1h.CH02L_Ecotech_Aurora3000_JFJ_dry.CH02L_Neph_Aurora3000.lev2.nas; CH0001G.19970101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas; CH0001G.19950101000000.20181031145000.nephelometer..aerosol.1y.1h.CH02L_IN3563.CH02L_backscat_coef.lev2.nas
station_id: CH0001G
station_name: Jungfraujoch
instrument_name: TSI_3563_JFJ_dry; Ecotech_Aurora3000_JFJ_dry; IN3563
PI: Baltensperger, Urs; Weingartner, Ernest; Bukowiecki, Nicolas
country: None
ts_type: hourly
latitude: 46.5475
longitude: 7.985
altitude: 3580.0
data_id: EBASMC
dataset_name: None
data_product: None
data_version: None
data_level: 2
revision_date (list, 3 items): [numpy.datetime64('2019-05-24T00:00:00'), numpy.datetime64('2019-05-20T00:00:00'), numpy.datetime64('2018-10-31T00:00:00')]
website: None
ts_type_src: hourly
stat_merge_pref_attr: None
data_revision: 20190701

Data arrays
.................
dtime (ndarray, 205849 items): [1995-07-08T23:00:00.000000000, 1995-07-09T00:00:00.000000000, ..., 2018-12-31T22:00:00.000000000, 2018-12-31T23:00:00.000000000]
Pandas Series
.................
scatc550aer (Series, 205849 items)

And plot…

[13]:
ax = merged.plot_timeseries('scatc550aer')
merged.plot_timeseries('scatc550aer', freq='daily', ax=ax)
merged.plot_timeseries('scatc550aer', freq='monthly', lw=3, ax=ax)
merged.plot_timeseries('scatc550aer', freq='yearly', ls='none', marker='o', ms=10, ax=ax);
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_23_0.png

Comment for convenience….

Actually, in the default setup you do not really need to think about all this. As you might have recognised, when creating the list of StationData objects from the UngriddedData object (using method to_station_data) we parsed the argument merge_if_multi=False.

The default here is True, so you can just go ahead and do:

[14]:
data.to_station_data('Jungfraujoch', 'scatc550aer').plot_timeseries('scatc550aer')
[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d348975f8>
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_25_1.png

What’s happening here is, that to_station_data internally creates a list of StationData objects and uses the above illustrated method merge_station_data at the end if the input argument merge_if_multi=True.

What about overlapping data ??

In some situations, there may be overlapping conflicts when merging multiple time series into one long time-series. In the following, we illustrate how these overlaps are handled if they occur.

The method merge_station_data that is illustrated above has some features to handle overlapping data and in any case, all overlaps that were detected are stored in the overlap attribute of the merged StationData object. Let’s check first if there are any overlaps in the Jungfraujoch data:

[15]:
merged.overlap
[15]:
BrowseDict([('scatc550aer', 2015-01-01 01:00:00    0.255198
             2015-01-01 02:00:00   -0.284809
             2015-01-01 03:00:00    0.123830
             2015-01-01 04:00:00    0.127599
             2015-01-01 05:00:00    0.052224
                                      ...
             2018-12-31 19:00:00   -0.030204
             2018-12-31 20:00:00    0.625988
             2018-12-31 21:00:00    0.178854
             2018-12-31 22:00:00    0.175893
             2018-12-31 23:00:00    0.329280
             Length: 25787, dtype: float64)])

Apparently, there is. You can check out these data (in comparison with the retrieved time series) as follows:

[16]:
merged.plot_timeseries('scatc550aer', add_overlaps=True);
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_30_0.png

How to prioritise certain stations from others, when deciding what goes into overlap and what into the final timeseries?

The method merge_station_data provides a bunch of options to handle that. Things do not be written twice so please read the docstring of the method:

[17]:
help(pya.helpers.merge_station_data)
Help on function merge_station_data in module pyaerocom.helpers:

merge_station_data(stats, var_name, pref_attr=None, sort_by_largest=True, fill_missing_nan=True, **add_meta_keys)
    Merge multiple StationData objects (from one station) into one instance

    Note
    ----
    - all input :class:`StationData` objects need to have same attributes       ``station_name``, ``latitude``, ``longitude`` and ``altitude``

    Parameters
    ----------
    stats : list
        list containing :class:`StationData` objects (note: all of these
        objects must contain variable data for the specified input variable)
    var_name : str
        data variable name that is to be merged
    pref_attr
        optional argument that may be used to specify a metadata attribute
        that is available in all input :class:`StationData` objects and that
        is used to order the input stations by relevance. The associated values
        of this attribute need to be sortable (e.g. revision_date). This is
        only relevant in case overlaps occur. If unspecified the relevance of
        the stations is sorted based on the length of the associated data
        arrays.
    sort_by_largest : bool
        if True, the result from the sorting is inverted. E.g. if
        ``pref_attr`` is unspecified, then the stations will be sorted based on
        the length of the data vectors, starting with the shortest, ending with
        the longest. This sorting result will then be inverted, if
        ``sort_by_largest=True``, so that the longest time series get's highest
        importance. If, e.g. ``pref_attr='revision_date'``, then the stations
        are sorted by the associated revision date value, starting with the
        earliest, ending with the latest (which will also be inverted if
        this argument is set to True)
    fill_missing_nan : bool
        if True, the resulting time series is filled with NaNs. NOTE: this
        requires that information about the temporal resolution (ts_type) of
        the data is available in each of the StationData objects.

In particular, pref_attr and sort_by_largest are relevant here.

NOTE: if pref_attr is unspecified, then the stations are sorted based on the number of valid measurement points for the input variable. This was done in the merged time series that we retrieved above.

Now, in the following, let’s not use the number of available data points (to sort the stations by relevance) but prefer stations that have a more recent data revision date.

[18]:
try:
    merged_pref_awesomeness = pya.helpers.merge_station_data(stats, 'scatc550aer', pref_attr='awesomeness')
except pya.exceptions.MetaDataError as e:
    print('Failed merging, error: {}'.format(repr(e)))
Failed merging, error: MetaDataError('Cannot sort station relevance by attribute awesomeness. At least one of the input stations does not contain this attribute')

Unfortunately, the StationData objects do not contain an attribute awesomeness by which we could sort. Let’s go with revision_date instead:

[19]:
# recompute stations, since we overwrote one above
stats = data.to_station_data('Jungfraujoch', 'scatc550aer', merge_if_multi=False)
merged_pref_recent_revision = pya.helpers.merge_station_data(stats, 'scatc550aer', pref_attr='revision_date')
[20]:
merged_pref_recent_revision.plot_timeseries('scatc550aer', add_overlaps=True)
[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d341f8d68>
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_37_1.png

Let’s see if there is any difference to the previous method (resample to daily resolution):

[21]:
ax = merged.overlap['scatc550aer'].resample('D').mean().plot(style='x', label='Overlaps (prefer longer timeseries)',
                                                             figsize=(20, 8))

merged_pref_recent_revision.overlap['scatc550aer'].resample('D').mean().plot(style='o',
                                                                             label='Overlaps (prefer more recent)',
                                                                             ax=ax, markerfacecolor='none')
ax.legend();
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_39_0.png

As you can see, the merging strategy can make an impact and it is important to define a reasonable strategy. In this case, it is certainly more reliable to use the data with the more recent revision date.

[22]:
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 12))
ax = plt.scatter(merged_pref_recent_revision.overlap['scatc550aer'],
                 merged.overlap['scatc550aer'], marker='x')
../../_images/pyaerocom-tutorials_obsolete_add04_stationdata_merging_41_0.png