.. _model_library:

=============
Model Library
=============

.. _model_library_introduction:

Introduction
============

`~stpipe.library.AbstractModelLibrary` is a container designed to allow efficient processing of
collections of `~stpipe.datamodel.AbstractDataModel` instances created from an association.

`~stpipe.library.AbstractModelLibrary` is an ordered collection (like a `list`) but provides:

- access to association metadata: `~stpipe.library.AbstractModelLibrary.asn`
- grouping API: `~stpipe.library.AbstractModelLibrary.group_indices` and `~stpipe.library.AbstractModelLibrary.group_names`
- compatibility with `~stpipe.step.Step` and `~stpipe.pipeline.Pipeline` runs
- a consistent indexing API that is the same for "in memory" and "on disk" libraries


.. _library_association:

Loading an association
----------------------

Most commonly an instance will be created from an association file:

.. code-block:: pycon

   >>> library = ModelLibrary("my_asn.json")

.. NOTE::
    Creating a library does not read any models into memory,
    as long as the association contains a ``group_id`` for each member.

.. _library_borrowing_and_shelving:

Borrowing and shelving models
-----------------------------

Interacting with an `~stpipe.library.AbstractModelLibrary` involves "borrowing" and "shelving"
models, both of which must occur during a ``with`` statement (while the library
is "open"):

.. code-block:: pycon

   >>> with library:
   ...    model = library.borrow(0)
   ...    # do stuff with the model...
   ...    library.shelve(model)

Iteration is also supported (but don't forget to return your models!).

.. code-block:: pycon

   >>> with library:
   ...    for model in library:  # implicitly calls borrow()
   ...        # do stuff with the model...
   ...        library.shelve(model)


.. _library_on_disk:

On Disk Mode
------------

For large associations (like those larger than memory) it is important
that the library avoid reading all models at once. The borrow/shelve API
above maps closely to the loading/saving of input (or temporary) files
containing the models.

.. code-block:: pycon

   >>> library = ModelLibrary("my_big_asn.json", on_disk=True)
   >>> with library:
   ...     model = library.borrow(0)  # the input file for model 0 is loaded
   ...     library.shelve(model)  # a temporary file for model 0 is written

.. NOTE::
   In the above example, a temporary file was created for model 0. At no
   point will the library overwrite the input file.
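
The rule that shelving writes to a temporary file and never to the input
can be illustrated with a small stand-alone sketch. ``ToyOnDiskLibrary`` is
hypothetical and is not the stpipe implementation: it assumes "models" are
plain dictionaries stored as JSON files, it requires an explicit index when
shelving, and it omits the ``with`` statement for brevity.

.. code-block:: python

   import json
   import tempfile
   from pathlib import Path


   class ToyOnDiskLibrary:
       """Hypothetical stand-in that keeps "models" (dicts) in JSON files."""

       def __init__(self, input_paths):
           self._inputs = [Path(path) for path in input_paths]
           self._tmpdir = tempfile.TemporaryDirectory()
           self._temp = {}  # index -> temporary file, once one is written

       def borrow(self, index):
           # prefer the temporary file (if any) over the original input
           path = self._temp.get(index, self._inputs[index])
           return json.loads(path.read_text())

       def shelve(self, model, index):
           # always write to the temporary directory, never to the input
           path = Path(self._tmpdir.name) / f"model_{index}.json"
           path.write_text(json.dumps(model))
           self._temp[index] = path


   # demonstration with a throw-away input file
   workdir = tempfile.TemporaryDirectory()
   input_path = Path(workdir.name) / "input_0.json"
   input_path.write_text(json.dumps({"dq": [0, 0]}))

   library = ToyOnDiskLibrary([input_path])
   model = library.borrow(0)  # loaded from the input file
   model["dq"][0] = 1
   library.shelve(model, 0)  # written to a temporary file

   reloaded = library.borrow(0)  # now loaded from the temporary file
   original = json.loads(input_path.read_text())  # input is unchanged

In this sketch ``reloaded`` carries the modification while ``original`` still
matches what was first written to the input file.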

If a model is not modified during the time it's borrowed (for example if the
``model.dq`` array was read, but not modified), it is helpful to tell the
library that the model was not modified.

.. code-block:: pycon

   >>> with library:
   ...     model = library.borrow(0)  # the input file for model 0 is loaded
   ...     # do some read-only stuff with the model
   ...     library.shelve(model, modify=False)  # No temporary file will be written

This tells the library not to overwrite the model's temporary file while shelving, saving
on both disk space and the time required to write.

.. WARNING::
   In the above example ``model`` remains in scope after the call to
   `~stpipe.library.AbstractModelLibrary.shelve` (and even after
   the exit of the ``with`` statement). This means ``model`` will not
   be garbage collected (and its memory will not be freed) until
   the scope containing the ``with library`` block exits. If
   more work occurs within the scope please consider adding an
   explicit ``del model`` when your code is finished with the model.
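
The effect of a lingering reference can be seen in a stand-alone sketch that
uses the standard library ``weakref`` module. ``ToyModel`` is a hypothetical
placeholder for a data model, and the immediate release after ``del`` relies
on CPython's reference counting; other interpreters may free the object later.

.. code-block:: python

   import weakref


   class ToyModel:
       """Hypothetical placeholder for a (potentially large) data model."""


   model = ToyModel()
   tracker = weakref.ref(model)  # a weak reference does not keep the model alive

   # ... borrowing, using, and shelving the model would happen here ...
   alive_before_del = tracker() is not None  # the name ``model`` still holds it

   del model  # drop the last strong reference
   alive_after_del = tracker() is not None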


.. _library_map_function:

Map function
------------

Let's say you want to get the ``meta.filename`` attribute for all models
in a library. The above "open", "borrow", "shelve" pattern can be quite
verbose. Instead, the helper method `~stpipe.library.AbstractModelLibrary.map_function`
can be used to generate an iterator that returns the result of a function
applied to each model in the library:

.. code-block:: pycon

   >>> def get_model_name(model, index):
   ...     return model.meta.filename
   >>>
   >>> filenames = list(library.map_function(get_model_name))

.. NOTE::
   `~stpipe.library.AbstractModelLibrary.map_function` does not require an open library
   and will handle opening, borrowing, shelving and closing for you.
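
Conceptually, such a helper wraps the open, borrow, apply, shelve cycle in a
generator. The sketch below is a hypothetical stand-alone illustration
(``ToyLibrary`` and the free function ``map_function`` are not the stpipe
implementation). It shows why the result has to be consumed, for example with
``list``, before any work happens.

.. code-block:: python

   class ToyLibrary:
       """Hypothetical in-memory stand-in for a model library."""

       def __init__(self, models):
           self._models = list(models)
           self.is_open = False

       def __len__(self):
           return len(self._models)

       def __enter__(self):
           self.is_open = True
           return self

       def __exit__(self, exc_type, exc_value, traceback):
           self.is_open = False

       def borrow(self, index):
           return self._models[index]

       def shelve(self, model, index):
           self._models[index] = model


   def map_function(library, function):
       """Yield ``function(model, index)`` for each model in ``library``."""
       with library:  # opened lazily, when iteration starts
           for index in range(len(library)):
               model = library.borrow(index)
               try:
                   yield function(model, index)
               finally:
                   # shelve even if ``function`` raises
                   library.shelve(model, index)


   library = ToyLibrary([{"filename": "a.fits"}, {"filename": "b.fits"}])
   generator = map_function(library, lambda model, index: model["filename"])
   opened_before_iteration = library.is_open  # nothing has run yet
   filenames = list(generator)  # consuming the generator does the work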


.. _library_grouping:

Grouping
--------

Grouping also doesn't require an open library (as all grouping is
performed on the association metadata).

.. code-block:: pycon

   >>> print(f"All group names: {library.group_names}")
   >>> group_index_map = library.group_indices
   >>> for group_name in group_index_map:
   ...     print(f"\tModel indices for {group_name}: {group_index_map[group_name]}")

.. WARNING::
   Although `~stpipe.library.AbstractModelLibrary.group_names` and
   `~stpipe.library.AbstractModelLibrary.group_indices` do not require an open library,
   any "borrows" using the indices do. Be sure to open the library before
   trying to borrow a model.
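
As a rough illustration of why no models need to be opened, the sketch below
derives group names and indices from member metadata alone. The member
dictionaries and ``group_id`` values are made up, and ``build_group_indices``
is a hypothetical helper that is not part of stpipe.

.. code-block:: python

   def build_group_indices(members):
       """Map each group_id to the list of member indices in that group."""
       group_indices = {}
       for index, member in enumerate(members):
           group_indices.setdefault(member["group_id"], []).append(index)
       return group_indices


   # made-up association members; only metadata is needed
   members = [
       {"expname": "exp_a.fits", "exptype": "science", "group_id": "group_1"},
       {"expname": "exp_b.fits", "exptype": "science", "group_id": "group_2"},
       {"expname": "exp_c.fits", "exptype": "science", "group_id": "group_1"},
   ]

   group_indices = build_group_indices(members)
   group_names = list(group_indices)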


.. _library_association_information:

Association Information
-----------------------

`~stpipe.library.AbstractModelLibrary.asn` provides read-only access to the association data.

.. code-block:: pycon

   >>> library.asn["products"][0]["name"]
   >>> library.asn["table_name"]

Although the specifics of what is returned by `~stpipe.library.AbstractModelLibrary.asn`
depend on how the subclass implements ``AbstractModelLibrary._load_asn``, it
is required that the association metadata dictionary contain a "members" list. This
can be inspected via ``library.asn["products"][0]["members"]`` and must contain a
dictionary for each "member" including key-value pairs for:

- "expname" for the exposure name, with a string value corresponding to the
  name of the file for this member
- "exptype" for the exposure type, with a string value describing the type
  of exposure (for example "science" or "background")

Although not required, "group_id" (with a string value corresponding to the
group name) should be added to each member dictionary (see
:ref:`library_association` for more details).
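
A minimal association dictionary satisfying these requirements might look like
the sketch below. All names and values are invented for illustration, and
``check_members`` is a hypothetical helper (stpipe does not provide it) that
reports which members lack the required keys.

.. code-block:: python

   REQUIRED_MEMBER_KEYS = ("expname", "exptype")

   # invented example association; real associations carry more metadata
   asn = {
       "table_name": "example_asn.json",
       "products": [
           {
               "name": "example_product",
               "members": [
                   {"expname": "exp_a.fits", "exptype": "science", "group_id": "group_1"},
                   {"expname": "exp_b.fits", "exptype": "background"},
                   {"expname": "exp_c.fits"},
               ],
           }
       ],
   }


   def check_members(asn):
       """Return {member index: missing required keys} for problem members."""
       members = asn["products"][0]["members"]
       problems = {}
       for index, member in enumerate(members):
           missing = [key for key in REQUIRED_MEMBER_KEYS if key not in member]
           if missing:
               problems[index] = missing
       return problems


   problems = check_members(asn)
   # indices of members without the optional "group_id"
   ungrouped = [
       index
       for index, member in enumerate(asn["products"][0]["members"])
       if "group_id" not in member
   ]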

.. _library_usage_patterns:

Usage Patterns
==============

What follows is a section about using `~stpipe.library.AbstractModelLibrary`
in `~stpipe.step.Step` and `~stpipe.pipeline.Pipeline` code. This section
is short at the moment and can be extended with additional patterns as
the `~stpipe.library.AbstractModelLibrary` is used in more pipeline code.

.. _library_step_input_handling:

Step input handling
-------------------

It is recommended that any `~stpipe.step.Step` (or `~stpipe.pipeline.Pipeline`)
that accepts an
`~stpipe.library.AbstractModelLibrary` consider performance when
processing the input. It likely makes sense for any `~stpipe.step.Step`
that accepts a `~stpipe.library.AbstractModelLibrary` to also accept
an association filename as an input. The basic input handling could look
something like the following:

.. code-block:: pycon

   >>> def process(self, init):
   ...     if isinstance(init, ModelLibrary):
   ...         library = init  # do not copy the input ModelLibrary
   ...     else:
   ...         library = ModelLibrary(init, on_disk=self.on_disk)
   ...     # process library without making a copy as
   ...     # that would lead to 2x required file space for
   ...     # an "on disk" model and 2x the memory for an "in memory"
   ...     # model
   ...     return library

The above pattern supports as input (``init``):

- an `~stpipe.library.AbstractModelLibrary`
- an association filename (via the `~stpipe.library.AbstractModelLibrary` constructor)
- all other inputs supported by the `~stpipe.library.AbstractModelLibrary` constructor

It is generally recommended to expose ``on_disk`` in the ``Step.spec``
allowing the `~stpipe.step.Step` to generate an :ref:`library_on_disk`
`~stpipe.library.AbstractModelLibrary`:

.. code-block:: pycon

   >>> class MyStep(Step):
   ...     spec = """
   ...         on_disk = boolean(default=False)  # keep models "on disk" to reduce RAM usage
   ...     """

.. NOTE::
   As mentioned in :ref:`library_on_disk` at no point will the input files
   referenced in the association be modified. However, the above pattern
   does allow ``Step.process`` to "modify" ``init`` when
   ``init`` is a `~stpipe.library.AbstractModelLibrary` (the models
   in the library will not be copied).

``Step.process`` can extend the above pattern to
support additional inputs (for example a single
`~stpipe.datamodel.AbstractDataModel` or filename containing
a `~stpipe.datamodel.AbstractDataModel`) to allow more
flexible data processing, although some consideration
should be given to how to handle input that does not
contain association metadata. Does it make sense
to construct a `~stpipe.library.AbstractModelLibrary` when the
association metadata is made up? Alternatively, is
it safer (less prone to misattribution of metadata)
to have the step process the inputs separately
(more on this below)?

.. _library_isolated_processing:

Isolated Processing
-------------------

Let's say we have a `~stpipe.step.Step`, ``flux_calibration``,
that performs an operation that is only concerned with the data
for a single `~stpipe.datamodel.AbstractDataModel` at a time.
This step applies a function ``calibrate_model_flux`` that
accepts a single `~stpipe.datamodel.AbstractDataModel` and index as input.
Its ``Step.process`` function can make good use of
`~stpipe.library.AbstractModelLibrary.map_function` to apply
this function to each model in the library.

.. code-block:: pycon

   >>> class FluxCalibration(Step):
   ...     spec = "..." # use spec defined above
   ...     def process(self, init):
   ...         # input handling, following the pattern described above
   ...         if isinstance(init, ModelLibrary):
   ...             library = init
   ...         else:
   ...             library = ModelLibrary(init, on_disk=self.on_disk)
   ...         # list is used here to consume the generator produced by map_function
   ...         list(library.map_function(calibrate_model_flux))
   ...         return library

.. _library_troubleshooting:

Troubleshooting
===============

.. _library_closed_library_error:

ClosedLibraryError
------------------

.. code-block:: pycon

   >>> model = library.borrow(0)

   ClosedLibraryError: ModelLibrary is not open

The library must be "open" (used in a ``with`` statement) before
a model can be borrowed. This is important for keeping track of
which models were possibly modified.

This error can be avoided by "opening" the library before calling
`~stpipe.library.AbstractModelLibrary.borrow` (and being sure to call
`~stpipe.library.AbstractModelLibrary.shelve`):

.. code-block:: pycon

   >>> with library:
   ...     model = library.borrow(0)
   ...     library.shelve(model)

.. _library_borrow_error:

BorrowError
-----------

.. code-block:: pycon

   >>> with library:
   ...     model = library.borrow(0)
   ...     # do stuff with the model
   ...     # forget to shelve it

   BorrowError: ModelLibrary has 1 un-returned models

Forgetting to `~stpipe.library.AbstractModelLibrary.shelve` a borrowed model will result in an
error. This is important for keeping track of model modifications and is
critical when the library uses temporary files to keep models out of memory.

This error can be avoided by making sure to `~stpipe.library.AbstractModelLibrary.shelve` all
borrowed models:

.. code-block:: pycon

   >>> with library:
   ...     model = library.borrow(0)
   ...     library.shelve(model)

Attempting to "double borrow" a model will also result in a `~stpipe.library.BorrowError`.

.. code-block:: pycon

   >>> with library:
   ...     model_a = library.borrow(0)
   ...     model_b = library.borrow(0)

   BorrowError: Attempt to double-borrow model

This check is also important for the library to track model modifications. The
error can be avoided by only borrowing each model once (it's ok to borrow
more than one model if they are at different positions in the library).

`~stpipe.library.BorrowError` exceptions can also be triggered when trying to replace
a model in the library.

.. code-block:: pycon

   >>> with library:
   ...     library.shelve(some_other_model)

   BorrowError: Attempt to shelve an unknown model

Here the library does not know where to shelve ``some_other_model`` (since
the ``some_other_model`` wasn't borrowed from the library). To replace
a model in the library you will need to first borrow the model at the index
you want to use and provide the index to the call to
`~stpipe.library.AbstractModelLibrary.shelve`.

.. code-block:: pycon

   >>> with library:
   ...     library.borrow(0)
   ...     library.shelve(some_other_model, 0)

Forgetting to first borrow the model at the index will also produce a
`~stpipe.library.BorrowError` (even if you provide the index).

.. code-block:: pycon

   >>> with library:
   ...     library.shelve(some_other_model, 0)

   BorrowError: Attempt to shelve model at a non-borrowed index
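
The checks behind these errors amount to a ledger of which indices are
currently borrowed. The following stand-alone sketch is hypothetical:
``ToyLibrary`` and the exception classes defined here only mimic the behaviour
described above and are not the stpipe classes.

.. code-block:: python

   class ClosedLibraryError(Exception):
       """Toy version: the library was used outside a ``with`` block."""


   class BorrowError(Exception):
       """Toy version: a borrow or shelve broke the ledger rules."""


   class ToyLibrary:
       def __init__(self, models):
           self._models = list(models)
           self._ledger = {}  # index -> model currently borrowed
           self._open = False

       def __enter__(self):
           self._open = True
           return self

       def __exit__(self, exc_type, exc_value, traceback):
           self._open = False
           # only complain if the block did not already fail
           if exc_type is None and self._ledger:
               n_missing = len(self._ledger)
               self._ledger.clear()
               raise BorrowError(f"{n_missing} un-returned models")

       def borrow(self, index):
           if not self._open:
               raise ClosedLibraryError("library is not open")
           if index in self._ledger:
               raise BorrowError("attempt to double-borrow model")
           self._ledger[index] = self._models[index]
           return self._models[index]

       def shelve(self, model, index=None):
           if not self._open:
               raise ClosedLibraryError("library is not open")
           if index is None:
               # look the model up by identity in the ledger
               for borrowed_index, borrowed in self._ledger.items():
                   if borrowed is model:
                       index = borrowed_index
                       break
               else:
                   raise BorrowError("attempt to shelve an unknown model")
           elif index not in self._ledger:
               raise BorrowError("attempt to shelve model at a non-borrowed index")
           self._models[index] = model  # may replace the borrowed model
           del self._ledger[index]


   def error_name(action):
       """Run ``action`` and return the name of the exception it raises."""
       try:
           action()
       except (ClosedLibraryError, BorrowError) as error:
           return type(error).__name__
       return None


   def forget_to_shelve():
       with library:
           library.borrow(1)


   library = ToyLibrary(["model_0", "model_1"])
   closed_result = error_name(lambda: library.borrow(0))
   unreturned_result = error_name(forget_to_shelve)

   with library:
       library.borrow(0)
       double_result = error_name(lambda: library.borrow(0))
       unknown_result = error_name(lambda: library.shelve("other"))
       not_borrowed_result = error_name(lambda: library.shelve("other", 1))
       library.shelve("replacement", 0)  # borrow first, then replace by index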

.. _library_developer_documentation:

Developer Documentation
-----------------------

What follows are notes primarily aimed towards developers and
maintainers of `~stpipe.library.AbstractModelLibrary`. This section might be useful
to provide context to users but shouldn't be necessary for a user
to effectively use `~stpipe.library.AbstractModelLibrary`.

.. _library_implementing_a_subclass:

Implementing a subclass
^^^^^^^^^^^^^^^^^^^^^^^

Several methods are abstract and will need implementations:

- Methods used by stpipe:

  - `~stpipe.library.AbstractModelLibrary.crds_observatory`

- Methods used by `~stpipe.library.AbstractModelLibrary`:

  - ``_datamodels_open``
  - ``_load_asn``
  - ``_filename_to_group_id``
  - ``_model_to_group_id``

It's likely that a few other methods might require overriding:

- ``_model_to_filename``
- ``_model_to_exptype``
- ``_assign_member_to_model``

Consult the docstrings (and base implementations) for more details.

It may also be required (depending on your usage) to update
``stpipe.step.Step._datamodels_open`` to allow stpipe to open and inspect an
`~stpipe.library.AbstractModelLibrary` when provided as a `~stpipe.step.Step` input.

.. _library_motivation:

Motivation
^^^^^^^^^^

The development of `~stpipe.library.AbstractModelLibrary` was largely motivated by
the need for a container compatible with stpipe machinery
that would allow passing "on disk" models between steps. Existing
containers (when used in "memory saving" modes) were not compatible
with stpipe. These containers also sometimes allowed input files
to be overwritten. It was decided that a new container would be
developed to address these and other issues. This would allow
gradual migration for pipeline code where specific steps and pipelines
could update to `~stpipe.library.AbstractModelLibrary` while leaving the existing
container unchanged for other steps.

A survey of container usage was performed with a few key findings:

- Many uses could be replaced by simpler containers (lists)
- When loaded from an association, the container size never changed;
  that is, no use-cases required adding new models to associations within steps
- The order of models was never changed
- The container must be compatible with stpipe infrastructure (it must implement
  the ``crds_observatory``, ``get_crds_parameters``, etc. methods)
- Several steps implemented different memory optimizations
- Step code has additional complexity to deal with containers
  that sometimes returned filenames and sometimes returned models

Additionally, pipelines and steps may be expected to handle large
volumes of input data. For one example, consider a pipeline
responsible for generating a mosaic of a large number of input imaging
observations. As the size of the input data approaches (and exceeds)
the available memory it is critical that the pipeline, step, and
container code never read and hold all input data in memory.

.. _library_design_priciples:

Design principles
^^^^^^^^^^^^^^^^^

The high level goals of `~stpipe.library.AbstractModelLibrary` are:

- Replace many uses of existing containers, focusing on areas
  where large data is expected.
- Implement a minimal API that can be later expanded as needs
  arise.
- Provide a consistent API for "on disk" and "in memory" modes
  so step code does not need to be aware of the mode.
- Support all methods required by stpipe to allow an "on disk"
  container to pass between steps.

Most of the core functionality is public and described in the above
user documentation. What follows is a description of other parts
of the API (mostly private) and internal details.

One core issue is how the container can know when to load models and
when to save them (to temporary files, if needed). With a typical list,
``__getitem__`` can map to load, but what will map to save?
Initial prototypes used ``__setitem__``, which led to some confusion
amongst reviewers. Treating the container like a list also
leads to expectations that the container will support
``append``, ``extend``, and other API that is unnecessary (as determined
in the above survey) and would be difficult to implement in a way that
keeps the container's association information and model information
in sync.

.. _library_integration_with_stpipe:

Integration with stpipe
^^^^^^^^^^^^^^^^^^^^^^^

An `~stpipe.library.AbstractModelLibrary` may interact with stpipe when used as an
input or output for a `~stpipe.step.Step`.

- as a `~stpipe.step.Step` input where `~stpipe.library.AbstractModelLibrary.get_crds_parameters` and
  `~stpipe.library.AbstractModelLibrary.crds_observatory` will be used (sometimes with
  a limited model set, including only the first member of the input
  association).
- as a `~stpipe.step.Step` output where `~stpipe.library.AbstractModelLibrary.finalize_result` will
  be used.


.. _library_future_directions:

Future directions
^^^^^^^^^^^^^^^^^

The initial implementation of `~stpipe.library.AbstractModelLibrary` was intentionally
simple. Several features were discussed but deemed unnecessary for the current code.
This section describes some of the discussed features, in part to provide a
record of these discussions.

.. _library_borrow_limits:

Borrow limits
^^^^^^^^^^^^^

As `~stpipe.library.AbstractModelLibrary` handles the loading and saving of models
(when "on disk") it could be straightforward to impose a limit to the number
and/or combined size of all "borrowed" models. This would help to avoid crashes
due to out-of-memory issues (especially important for HPC environments where
the memory limit may be defined at the job level). Being able to gracefully
recover from this error could also allow pipeline code to load as many
models as possible for more efficient batch processing.


.. _library_hollowing_out_models:

Hollowing out models
^^^^^^^^^^^^^^^^^^^^

Currently the `~stpipe.library.AbstractModelLibrary` does not close
models when they are "shelved" (it relies on the garbage collector).
This was done to allow easier integration with existing pipeline code
but does mean that the memory used for a model will not be freed until
the model is garbage collected. By explicitly closing models and possibly
removing references between the model and the data arrays ("hollowing
out") memory could be freed sooner, allowing for an overall decrease
in memory usage.

.. _library_append:

Append
^^^^^^

There is no way to append a model to a `~stpipe.library.AbstractModelLibrary`
(nor is there a way to pop, extend, delete, or perform any other operation that
changes the number of models in a library). This was an intentional choice, as any
operation that changes the number of models would invalidate the
`~stpipe.library.AbstractModelLibrary.asn` data. It should be possible
(albeit complex) to support some if not all of these operations. However,
serious consideration of their use and exhausting of alternatives is
recommended, as the added complexity would likely introduce bugs.

.. _library_updating_asn_on_shelve:

Updating asn on shelve
^^^^^^^^^^^^^^^^^^^^^^

Related to the note about :ref:`library_append`, updating the
`~stpipe.library.AbstractModelLibrary.asn` data on
`~stpipe.library.AbstractModelLibrary.shelve` would allow step code
to modify asn-related attributes (like ``group_id``) and have these changes
reflected in the `~stpipe.library.AbstractModelLibrary.asn` result.
A similar note of caution applies here: some consideration
of the complexity required versus the benefits is recommended.

.. _library_get_sections:

Get sections
^^^^^^^^^^^^

`~stpipe.library.AbstractModelLibrary` has no replacement for
the ``get_sections`` API provided with ``ModelContainer``. If its use
is generally required it might make sense to model the API on
the existing ``group_id`` methods (where the subclass provides two methods
for efficiently accessing either an in-memory section or an on-disk
section for the "in memory" and "on disk" modes).

.. _library_parallel_map_function:

Parallel map function
^^^^^^^^^^^^^^^^^^^^^

`~stpipe.library.AbstractModelLibrary.map_function` is applied to each model
in a library sequentially. If this method proves useful and is typically
used with an independent and stateless function, extending the method to
use parallel application seems straightforward (although a new API might
be called for, since a parallel application would likely not behave
as a generator).