Result Caching

Overview

It is often useful to cache intermediate results rather than discard them as soon as a computation finishes. Consider the following situations, which you may have encountered when writing a long-running image processing workflow:

1. The cell density for each region in the scan is computed, but the numbers do not match what is expected, so you want to debug by displaying a cell density heatmap in a graphical viewer. However, the intermediate result has been lost, so you need to recompute the density in order to draw the heatmap.

2. An error occurs and you need to find out which step in the computation causes it, but it is difficult to understand what went wrong without displaying some intermediate results.

3. Showing graphically how the algorithm works step by step would be very helpful in identifying the causes of issues, but this requires saving all the results to disk, chunked in a viewer-friendly format.

In all of the cases above, caching the intermediate results helps reduce headaches and the risk of unexplained errors that come from the difficulty of debugging in an image processing and distributed computing environment. The basic strategy is to cache all the results inside a directory tree. Each step saves its intermediate and final results to a node in the tree. The node’s children are the directories saved by its sub-steps.

Here, the outputs of a processing step (function) may include intermediate images (such as .ome.zarr), log files (.txt), and graphs generated by plotting libraries.
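The tree layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It does not use cvpl_tools; the helper names save_step and list_tree, the step names, and the file names are all hypothetical and chosen only to show how sub-steps map to child directories.

```python
import tempfile
from pathlib import Path


def save_step(node: Path, files: dict[str, str]) -> Path:
    """Create the directory for one step and write its outputs into it."""
    node.mkdir(parents=True, exist_ok=True)
    for name, text in files.items():
        (node / name).write_text(text)
    return node


def list_tree(root: Path) -> list[str]:
    """Return every path under root, relative to root, in sorted order."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob('*'))


# Hypothetical root location inside a temporary directory
root = Path(tempfile.mkdtemp()) / 'root'

# The top-level step writes its own outputs to the root node ...
save_step(root, {'log.txt': 'pipeline started\n'})
# ... and each sub-step writes to a child directory of its parent step.
save_step(root / 'density', {'log.txt': 'density computed\n'})
save_step(root / 'density' / 'heatmap', {'notes.txt': 'heatmap rendered\n'})

tree = list_tree(root)
print('\n'.join(tree))
```

In a real workflow the text files would be replaced by images, logs, and plots, but the nesting of directories would follow the nesting of steps in the same way.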

We describe the CacheDirectory interface in detail below.

Cache Directory

Every cache directory tree starts with a root directory. To create a cache directory tree, you need a URL to the root directory location:

import cvpl_tools.tools.fs as tlfs
loc = 'path/to/root'  # a remote url will work as well
query = tlfs.cdir_init(loc)
# Now a directory is created at the given path, so you can start writing cache files to it
# ...
tlfs.cdir_commit(loc)  # mark the directory as complete once all files are written

This creates the directory ‘path/to/root’ on the first run. On later runs, the call does not create a new folder but returns a query object with query.commit set to True. A cache directory can be a child of the cache root directory or of another cache directory.

import cvpl_tools.tools.fs as tlfs
loc = 'path/to/root'  # a remote url will work as well
query = tlfs.cdir_init(loc)
if not query.commit:  # on the first run, compute the result instead of reading the saved one
    sub_query = tlfs.cdir_init(f'{loc}/subdir')
    if sub_query.commit:  # check the subdirectory's own query, not the root's
        print('subdirectory is already created')
    else:
        print('subdirectory created successfully')
    # put your code here that writes to the subdir...
    tlfs.cdir_commit(f'{loc}/subdir')
tlfs.cdir_commit(loc)
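
To make the first-run versus later-run behavior concrete, here is a self-contained sketch of the compute-or-load pattern. It is a standard-library stand-in, not the cvpl_tools implementation: the Query class, init_dir, commit_dir, cached_sum, and the '.commit' marker file name are all assumptions made for illustration.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

COMMIT_MARKER = '.commit'  # hypothetical marker file name


@dataclass
class Query:
    path: Path
    commit: bool  # True if an earlier run finished writing this directory


def init_dir(path: Path) -> Query:
    """Create the directory if needed and report whether it was committed."""
    path.mkdir(parents=True, exist_ok=True)
    return Query(path=path, commit=(path / COMMIT_MARKER).exists())


def commit_dir(path: Path) -> None:
    """Mark the directory as complete so that later runs can reuse it."""
    (path / COMMIT_MARKER).touch()


def cached_sum(path: Path, values: list[int]) -> tuple[int, bool]:
    """Return (result, was_cached); compute only if nothing is committed yet."""
    query = init_dir(path)
    result_file = path / 'sum.txt'
    if query.commit:
        # A previous run committed this directory: read the saved result.
        return int(result_file.read_text()), True
    result = sum(values)
    result_file.write_text(str(result))
    commit_dir(path)  # commit only after all files are written
    return result, False


loc = Path(tempfile.mkdtemp()) / 'root'
first = cached_sum(loc, [1, 2, 3])   # computes and writes the result
second = cached_sum(loc, [1, 2, 3])  # reads the result back from disk
print(first, second)
```

Committing only after the last file is written means that a run interrupted midway leaves the directory uncommitted, so the next run recomputes instead of reading partial output.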

Tips

  • When writing a process function that caches to a single location, pass a cache URL via context_args["cache_url"], or pass None if you do not want to write to disk.

  • Cache images in a viewer-readable format. For OME-ZARR, a flat image chunking scheme is suitable for 2D viewers like Napari. Re-chunking when loading back into memory may be slower, but this is usually not a big issue.
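
The first tip can be sketched as follows. This is only an illustration under stated assumptions: threshold_step is a hypothetical process function, the cache location is treated as a plain local path, and the output is written as a text file rather than an image. The only detail taken from the tip itself is that the location arrives via context_args["cache_url"] and that None disables writing to disk.

```python
import tempfile
from pathlib import Path
from typing import Optional


def threshold_step(values: list[float], context_args: dict) -> list[int]:
    """Hypothetical process function: threshold values, optionally caching."""
    cache_url: Optional[str] = context_args.get('cache_url')
    mask = [1 if v > 0.5 else 0 for v in values]
    if cache_url is not None:
        # A cache location was given: save the intermediate result there.
        out = Path(cache_url)
        out.mkdir(parents=True, exist_ok=True)
        (out / 'mask.txt').write_text(' '.join(map(str, mask)))
    return mask


cache_dir = Path(tempfile.mkdtemp()) / 'threshold'

# With a cache location, the result is returned and also written to disk.
cached = threshold_step([0.2, 0.7, 0.9], {'cache_url': str(cache_dir)})
# With None, the same result is returned and nothing is written.
uncached = threshold_step([0.2, 0.7, 0.9], {'cache_url': None})
print(cached, uncached)
```

Keeping the cache location in a single argument lets the caller decide where, and whether, a step persists its output without changing the step's code.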