Model Library

Introduction

AbstractModelLibrary is a container designed to allow efficient processing of collections of AbstractDataModel instances created from an association.

AbstractModelLibrary is an ordered collection (like a list) but provides:

  • access to association metadata: asn

  • grouping API: group_indices and group_names

  • compatibility with Step and Pipeline runs

  • a consistent indexing API that is the same for “in memory” and “on disk” libraries

Loading an association

Most commonly an instance will be created from an association file:

>>> library = ModelLibrary("my_asn.json")

Note

Creating a library does not read any models into memory, as long as the association contains a group_id for each member.

Borrowing and shelving models

Interacting with an AbstractModelLibrary involves “borrowing” and “shelving” models, both of which must occur during a with statement (while the library is “open”):

>>> with library:
...    model = library.borrow(0)
...    # do stuff with the model...
...    library.shelve(model)

Iteration is also supported (but don’t forget to return your models!).

>>> with library:
...    for model in library:  # implicitly calls borrow()
...        # do stuff with the model...
...        library.shelve(model)

On Disk Mode

For large associations (such as those too large to fit in memory) it is important that the library avoids reading all models at once. The borrow/shelve API above maps closely to the loading/saving of input (or temporary) files containing the models.

>>> library = ModelLibrary("my_big_asn.json", on_disk=True)
>>> with library:
...     model = library.borrow(0)  # the input file for model 0 is loaded
...     library.shelve(model)  # a temporary file for model 0 is written

Note

In the above example, a temporary file was created for model 0. At no point will the library overwrite the input file.

If the model is not modified while it is borrowed (for example, if the model.dq array was read but not modified), it is helpful to tell the library that the model was not modified.

>>> with library:
...     model = library.borrow(0)  # the input file for model 0 is loaded
...     # do some read-only stuff with the model
...     library.shelve(model, modify=False)  # No temporary file will be written

This tells the library not to overwrite the model’s temporary file while shelving, saving both disk space and the time required to write the file.

Warning

In the above example, model remains in scope after the call to shelve (and even after the exit of the with statement). This means model will not be garbage collected (and its memory will not be freed) until the enclosing scope exits. If more work occurs within that scope, please consider adding an explicit del model when your code is finished with the model.
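For example (a sketch, in the same doctest style as above), the reference can be dropped explicitly once the model is no longer needed:

>>> with library:
...     model = library.borrow(0)
...     library.shelve(model)
>>> del model  # drop the reference so the model's memory can be freed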

Map function

Let’s say you want to get the meta.filename attribute for all models in a library. The above “open”, “borrow”, “shelve” pattern can be quite verbose. Instead, the helper method map_function can be used to generate an iterator that returns the result of a function applied to each model in the library:

>>> def get_model_name(model, index):
...     return model.meta.filename
>>>
>>> filenames = list(library.map_function(get_model_name))

Note

map_function does not require an open library and will handle opening, borrowing, shelving and closing for you.

Grouping

Grouping also doesn’t require an open library (as all grouping is performed on the association metadata).

>>> print(f"All group names: {library.group_names}")
>>> group_index_map = library.group_indices
>>> for group_name in group_index_map:
...     print(f"\tModel indices for {group_name}: {group_index_map[group_name]}")

Warning

Although group_names and group_indices do not require an open library, any “borrows” using the indices do. Be sure to open the library before trying to borrow a model.
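For example, group_indices can be combined with borrow to process models one group at a time (a sketch; the actual processing is left as a comment, and modify=False assumes the models are only read):

>>> group_index_map = library.group_indices
>>> with library:
...     for group_name, indices in group_index_map.items():
...         models = [library.borrow(index) for index in indices]
...         # do read-only stuff with the group of models...
...         for model in models:
...             library.shelve(model, modify=False)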

Association Information

asn provides read-only access to the association data.

>>> library.asn["products"][0]["name"]
>>> library.asn["table_name"]

Although the specifics of what is returned by asn depend on how the subclass implements AbstractModelLibrary._load_asn, the association metadata dictionary is required to contain a “members” list. This can be inspected via library.asn["products"][0]["members"] and must contain a dictionary for each “member”, including key-value pairs for:

  • “expname” for the exposure name, with a string value corresponding to the name of the file for this member

  • “exptype” for the exposure type, with a string value describing the type of exposure (for example “science” or “background”)

Although not required, “group_id” (with a string value corresponding to the group name) should be added to each member dictionary (see Loading an association for more details).
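As a concrete illustration, a minimal association dictionary satisfying these requirements might look like the following (all file names, product names, and group ids here are hypothetical):

```python
import json

# A minimal association dictionary; every name here is illustrative.
asn = {
    "products": [
        {
            "name": "example_product",
            "members": [
                {"expname": "exposure_0001.fits", "exptype": "science", "group_id": "1"},
                {"expname": "exposure_0002.fits", "exptype": "background", "group_id": "2"},
            ],
        }
    ],
}

# each member carries the required "expname" and "exptype" keys
# (plus the recommended "group_id")
members = asn["products"][0]["members"]
assert all("expname" in m and "exptype" in m for m in members)
print(json.dumps(asn, indent=4))
```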

Usage Patterns

What follows is a section about using AbstractModelLibrary in Step and Pipeline code. It is short at the moment and can be extended with additional patterns as AbstractModelLibrary is used in more pipeline code.

Step input handling

It is recommended that any Step (or Pipeline) that accepts an AbstractModelLibrary consider performance when processing the input. It likely makes sense for any Step that accepts an AbstractModelLibrary to also accept an association filename as input. The basic input handling could look something like the following:

>>> def process(self, init):
...     if isinstance(init, ModelLibrary):
...         library = init  # do not copy the input ModelLibrary
...     else:
...         library = ModelLibrary(init, on_disk=self.on_disk)
...     # process the library without making a copy, as a copy
...     # would double the required file space for an "on disk"
...     # library and double the memory for an "in memory" library
...     return library

The above pattern supports both an AbstractModelLibrary instance and an association filename as input (init).

It is generally recommended to expose on_disk in the Step.spec, allowing the Step to generate an On Disk Mode AbstractModelLibrary:

>>> class MyStep(Step):
...     spec = """
...         on_disk = boolean(default=False)  # keep models "on disk" to reduce RAM usage
...     """

Note

As mentioned in On Disk Mode, at no point will the input files referenced in the association be modified. However, the above pattern does allow Step.process to “modify” init when init is an AbstractModelLibrary (the models in the library will not be copied).

Step.process can extend the above pattern to support additional inputs (for example, a single AbstractDataModel or a filename for a file containing an AbstractDataModel) to allow more flexible data processing, although some consideration should be given to how to handle input that does not contain association metadata. Does it make sense to construct an AbstractModelLibrary when the association metadata is made up? Alternatively, is it safer (less prone to misattribution of metadata) to have the step process these inputs separately (more on this below)?

Isolated Processing

Let’s say we have a Step, flux_calibration, that performs an operation concerned only with the data of a single AbstractDataModel at a time. This step applies a function, calibrate_model_flux, that accepts a single AbstractDataModel and an index as input. Its Step.process function can make good use of map_function to apply this function to each model in the library.
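calibrate_model_flux is not defined in this document; a minimal hypothetical version might look like the following (the attribute names model.data and model.meta.flux_scale are illustrative only, and the stand-in model object merely mimics an AbstractDataModel for demonstration):

```python
from types import SimpleNamespace

# A hypothetical calibrate_model_flux; attribute names are illustrative.
def calibrate_model_flux(model, index):
    # scale the science data in place by a per-model calibration factor
    model.data = [value * model.meta.flux_scale for value in model.data]
    return model.meta.filename

# stand-in model object for demonstration; in the step, each model
# would instead be borrowed from the library by map_function
model = SimpleNamespace(
    data=[1.0, 2.0],
    meta=SimpleNamespace(flux_scale=2.0, filename="model_0.fits"),
)
print(calibrate_model_flux(model, 0))  # prints model_0.fits; data is now [2.0, 4.0]
```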

>>> class FluxCalibration(Step):
...     spec = "..."  # use the spec defined above
...     def process(self, init):
...         # handle the input as in the pattern described above
...         library = init if isinstance(init, ModelLibrary) else ModelLibrary(init, on_disk=self.on_disk)
...         # list is used here to consume the generator produced by map_function
...         list(library.map_function(calibrate_model_flux))
...         return library

Troubleshooting

ClosedLibraryError

>>> model = library.borrow(0)

ClosedLibraryError: ModelLibrary is not open

The library must be “open” (used in a with statement) before a model can be borrowed. This is important for keeping track of which models were possibly modified.

This error can be avoided by “opening” the library before calling borrow (and being sure to call shelve):

>>> with library:
...     model = library.borrow(0)
...     library.shelve(model)

BorrowError

>>> with library:
...     model = library.borrow(0)
...     # do stuff with the model
...     # forget to shelve it

BorrowError: ModelLibrary has 1 un-returned models

Forgetting to shelve a borrowed model will result in an error. This is important for keeping track of model modifications and is critical when the library uses temporary files to keep models out of memory.

This error can be avoided by making sure to shelve all borrowed models:

>>> with library:
...     model = library.borrow(0)
...     library.shelve(model)

Attempting to “double borrow” a model will also result in a BorrowError.

>>> with library:
...     model_a = library.borrow(0)
...     model_b = library.borrow(0)

BorrowError: Attempt to double-borrow model

This check is also important for the library to track model modifications. The error can be avoided by only borrowing each model once (it’s ok to borrow more than one model if they are at different positions in the library).

BorrowError exceptions can also be triggered when trying to replace a model in the library.

>>> with library:
...     library.shelve(some_other_model)

BorrowError: Attempt to shelve an unknown model

Here the library does not know where to shelve some_other_model (since some_other_model wasn’t borrowed from the library). To replace a model in the library you will need to first borrow the model at the index you want to use and provide that index to the call to shelve.

>>> with library:
...     library.borrow(0)
...     library.shelve(some_other_model, 0)

Forgetting to first borrow the model at the index will also produce a BorrowError (even if you provide the index).

>>> with library:
...     library.shelve(some_other_model, 0)

BorrowError: Attempt to shelve model at a non-borrowed index

Developer Documentation

What follows are notes aimed primarily at developers and maintainers of AbstractModelLibrary. This section might provide useful context for users but shouldn’t be necessary for a user to effectively use AbstractModelLibrary.

Implementing a subclass

Several methods are abstract and will need implementations; consult the base class for the full list.

It’s likely that a few other methods might require overriding:

  • _model_to_filename

  • _model_to_exptype

  • _assign_member_to_model

Consult the docstrings (and base implementations) for more details.

It may also be required (depending on your usage) to update stpipe.step.Step._datamodels_open to allow stpipe to open and inspect an AbstractModelLibrary when provided as a Step input.

Motivation

The development of AbstractModelLibrary was largely motivated by the need for a container compatible with stpipe machinery that would allow passing “on disk” models between steps. Existing containers (when used in “memory saving” modes) were not compatible with stpipe. These containers also sometimes allowed input files to be overwritten. It was decided that a new container would be developed to address these and other issues. This would allow gradual migration for pipeline code where specific steps and pipelines could update to AbstractModelLibrary while leaving the existing container unchanged for other steps.

A survey of container usage was performed with a few key findings:

  • Many uses could be replaced by simpler containers (lists)

  • When loaded from an association, the container size never changed; that is, no use-cases required adding new models to associations within steps

  • The order of models was never changed

  • Containers must be compatible with stpipe infrastructure (implementing crds_observatory, get_crds_parameters, and other required methods)

  • Several steps implemented different memory optimizations

  • Step code carried additional complexity to deal with containers that sometimes returned filenames and sometimes returned models

Additionally, pipelines and steps may be expected to handle large volumes of input data. For one example, consider a pipeline responsible for generating a mosaic of a large number of input imaging observations. As the size of the input data approaches (and exceeds) the available memory it is critical that the pipeline, step, and container code never read and hold all input data in memory.

Design principles

The high level goals of AbstractModelLibrary are:

  • Replace many uses of existing containers, focusing on areas where large data is expected.

  • Implement a minimal API that can be later expanded as needs arise.

  • Provide a consistent API for “on disk” and “in memory” modes so step code does not need to be aware of the mode.

  • Support all methods required by stpipe to allow a “on disk” container to pass between steps.

Most of the core functionality is public and described in the above user documentation. What follows is a description of other parts of the API (mostly private) and internal details.

One core issue is how the container can know when to load and save models (to temporary files) if needed. With a typical list, __getitem__ can map to load, but what would map to save? Initial prototypes used __setitem__, which led to some confusion among reviewers. Treating the container like a list also leads to the expectation that the container support append, extend, and other API that is unnecessary (as determined in the above survey) and would be difficult to implement in a way that keeps the container’s association information and model information in sync.
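The explicit borrow/shelve pair resolves this: borrow is the unambiguous load point and shelve the unambiguous save point. A toy stand-in (not the real implementation) showing just this bookkeeping might look like:

```python
# Toy sketch (not the real AbstractModelLibrary) showing how an explicit
# borrow/shelve pair gives unambiguous load and save points while
# tracking which indices are currently checked out.
class ToyLibrary:
    def __init__(self, filenames):
        self._filenames = list(filenames)
        self._borrowed = set()

    def borrow(self, index):
        # in "on disk" mode, loading from a file would happen here
        self._borrowed.add(index)
        return {"filename": self._filenames[index]}

    def shelve(self, model, index, modify=True):
        if index not in self._borrowed:
            raise RuntimeError("Attempt to shelve model at a non-borrowed index")
        if modify:
            # in "on disk" mode, saving a temporary file would happen here
            pass
        self._borrowed.discard(index)

lib = ToyLibrary(["a.fits", "b.fits"])
model = lib.borrow(0)
lib.shelve(model, 0)  # the one unambiguous "save" point

try:
    lib.shelve({"filename": "c.fits"}, 1)  # index 1 was never borrowed
except RuntimeError as error:
    print(error)  # shelving at a non-borrowed index is rejected
```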

Integration with stpipe

An AbstractModelLibrary may interact with stpipe when used as an input or output for a Step.

Future directions

The initial implementation of AbstractModelLibrary was intentionally simple. Several features were discussed but deemed unnecessary for the current code. This section describes some of those features, in part to provide a record of the discussions.

Borrow limits

As AbstractModelLibrary handles the loading and saving of models (when “on disk”) it could be straightforward to impose a limit on the number and/or combined size of all “borrowed” models. This would help to avoid crashes due to out-of-memory issues (especially important for HPC environments where the memory limit may be defined at the job level). Being able to gracefully recover from exceeding such a limit could also allow pipeline code to load as many models as possible for more efficient batch processing.

Hollowing out models

Currently the AbstractModelLibrary does not close models when they are “shelved” (it relies on the garbage collector). This was done to allow easier integration with existing pipeline code, but it does mean that the memory used for a model will not be freed until the model itself is freed. By explicitly closing models, and possibly removing references between the model and its data arrays (“hollowing out”), memory could be freed sooner, allowing an overall decrease in memory usage.

Append

There is no way to append a model to an AbstractModelLibrary (nor is there a way to pop, extend, delete, or perform any other operation that changes the number of models in a library). This was an intentional choice, as any operation that changes the number of models would invalidate the asn data. It should be possible (albeit complex) to support some, if not all, of these operations. However, serious consideration of their use, and exhausting of alternatives, is recommended, as the added complexity would likely introduce bugs.

Updating asn on shelve

Related to the note about Append, updating the asn data on shelve would allow step code to modify asn-related attributes (like group_id) and have these changes reflected in the asn result. A similar note of caution applies here: some weighing of the required complexity against the benefits is recommended.

Get sections

AbstractModelLibrary has no replacement for the get_sections API provided by ModelContainer. If its use is generally required, it might make sense to model the API on the existing group_id methods (where the subclass provides two methods for efficiently accessing either an in-memory section or an on-disk section, for the “in memory” and “on disk” modes respectively).

Parallel map function

map_function is applied to each model in a library sequentially. If this method proves useful and is typically used with an independent and stateless function, extending it to apply the function in parallel seems straightforward (although a new API might be called for, since a parallel application would likely not behave as a generator).
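A parallel application over independent, stateless work items could be sketched with the standard library as follows (this ignores the borrow/shelve coordination a real parallel map_function would need; parallel_map and its parameters are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel application of a stateless function to
# independent work items; a real parallel map_function would
# additionally need to coordinate borrowing and shelving of models.
def parallel_map(function, items, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # note: unlike map_function, this gathers results eagerly
        # into a list rather than behaving as a generator
        return list(executor.map(function, items))

print(parallel_map(lambda x: x * 2, [1, 2, 3]))  # prints [2, 4, 6]
```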