Clarify use cases and then go for performance
In !13 (closed) and in #11, there are many open issues concerning how exactly to open one or more data and mesh-mask files.
Xarray's and Dask's parallel performance is sensitive to how exactly a data set spanning multiple files is created: whether and when chunking is applied, whether and when coordinates and CF conventions are decoded, and so on.
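To illustrate the knobs involved, here is a minimal sketch of opening a multi-file data set (the file pattern, the `time_counter` dimension name, and the chunk size are hypothetical NEMO-style assumptions, not anything fixed by this project):

```python
import xarray as xr

# Different choices here lead to very different Dask task graphs:
# the chunks argument controls the Dask partitioning, and decoding of
# CF conventions (times, scale factors) can be done up front or deferred.
ds = xr.open_mfdataset(
    "ORCA_*_grid_T.nc",          # hypothetical file pattern
    chunks={"time_counter": 1},  # one Dask chunk per time step (assumed dim name)
    decode_cf=True,              # decode CF conventions up front ...
    decode_times=True,           # ... including the time axis
)
```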
Xarray's notion of "coordinates" is broader than just variables that determine a position in time and space. It rather covers any ancillary variable (grid spacing, masks, etc.). Often, coordinates are everything that is constant in time (see, e.g., http://xgcm.readthedocs.io/en/latest/example_mitgcm.html).
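To make this concrete, a small sketch of promoting time-constant ancillary fields to coordinates (the variable names `e1t`, `e2t`, `tmask` are just typical NEMO mesh-mask examples, and `mesh_mask.nc` is a hypothetical file name):

```python
import xarray as xr

ds = xr.open_dataset("mesh_mask.nc")  # hypothetical mesh-mask file
# Promote time-constant ancillary fields (grid spacings, land-sea mask)
# from data variables to coordinates, following the xgcm way of thinking.
ds = ds.set_coords(["e1t", "e2t", "tmask"])
```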
So let's take a step back and find the best approach to create a data set containing

- all time-dependent model output for a given output frequency as data variables, and
- all other fields as coordinates.
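A toy example of the target layout (all dimension and variable names here are made up for illustration):

```python
import numpy as np
import xarray as xr

# Time-dependent model output becomes a data variable; everything that is
# constant in time (grid spacing, mask) lives in the coordinates.
ds = xr.Dataset(
    data_vars={
        "votemper": (("time_counter", "y", "x"), np.zeros((2, 3, 4))),
    },
    coords={
        "e1t": (("y", "x"), np.ones((3, 4))),    # zonal grid spacing
        "tmask": (("y", "x"), np.ones((3, 4))),  # land-sea mask
    },
)
```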
Rather than exposing preprocessing to the user, we could do this as follows:
```python
import xarray as xr


def load_data(data_files, ancillary_files, **kwargs):
    # load the ancillary data set (mesh mask etc.) from ancillary_files
    ds_aux = ...
    # open each data file and apply the ORCA-specific preprocessing
    # (preprocess_orca is defined elsewhere)
    ds_list = map(lambda ds: preprocess_orca(ds_aux, ds, **kwargs),
                  map(xr.open_dataset, data_files))
    # merge everything into a single data set
    ds = xr.merge(list(ds_list))
    return ds
```
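A hypothetical call could then look like this (the file names are assumptions, and it presupposes that `preprocess_orca` and the ancillary loading are filled in):

```python
ds = load_data(
    data_files=["ORCA_2000_grid_T.nc", "ORCA_2001_grid_T.nc"],  # hypothetical
    ancillary_files=["mesh_mask.nc"],
)
```

This way, all decisions about chunking, decoding, and the split between data variables and coordinates are made in one place instead of being pushed onto the user.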