MultifileArrays
MultifileArrays implements "lazy concatenation" of file data. The primary function, load_series, loads data from disk on demand and stores "slices" in a temporary buffer. This allows you to treat a series of files as if they were one large contiguous array.
Further examples are described in the API section, but a simple demo using a directory dir with a bunch of PNG files might be
julia> using MultifileArrays, FileIO
julia> img = load_series(load, "myimage_*.png"; dir)
Performance tips
While MultifileArrays is convenient, there are some performance caveats to keep in mind:
- to reduce the number of times that a file needs to be (re)loaded from disk, iteration over the resulting array is best done in a manner consistent with the file-by-file slicing.
- operations that can be performed "slice at a time" (e.g., visualization with ImageView) are even more optimized than scalar (single-element) indexing, as the latter must check whether the supplied slice-index corresponds to the currently loaded file upon each access.
For uncompressed data, alternative approaches that exploit memory-mapping may yield better performance. The StackViews package allows you to "glue" such arrays together.
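The first tip can be illustrated without the package itself. The sketch below is only a simulation: `OneSliceCache`, `getelement`, and `count_loads` are hypothetical names (not part of MultifileArrays), and an in-memory array stands in for files on disk. It counts how often a "file" would need to be (re)loaded under two iteration orders.

```julia
# Hypothetical stand-in for a file-backed array: only one slice is buffered at a time.
mutable struct OneSliceCache
    data::Array{Float64,3}   # pretend each data[:, :, k] lives in its own file
    current::Int             # index of the slice currently in the buffer (0 = none)
    nloads::Int              # number of simulated file loads
end
OneSliceCache(data) = OneSliceCache(data, 0, 0)

function getelement(c::OneSliceCache, i, j, k)
    if k != c.current        # this check runs on every scalar access
        c.current = k
        c.nloads += 1        # simulated (re)load from disk
    end
    return c.data[i, j, k]
end

function count_loads(data, order)
    c = OneSliceCache(data)
    for (i, j, k) in order
        getelement(c, i, j, k)
    end
    return c.nloads
end

data = rand(4, 4, 3)
# Slice index k varies slowest: consistent with file-by-file slicing.
slicewise = [(i, j, k) for i in 1:4, j in 1:4, k in 1:3]
# Slice index k varies fastest: every access switches to a different file.
crosswise = [(i, j, k) for k in 1:3, i in 1:4, j in 1:4]

println(count_loads(data, slicewise))   # 3
println(count_loads(data, crosswise))   # 48
```

With the slice index in the outermost position, each simulated file is loaded once; with it in the innermost position, every access triggers a reload.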
API
MultifileArrays.MultifileArrays
MultifileArrays.load_chunked
MultifileArrays.load_series
MultifileArrays.load_series
MultifileArrays.select_series
MultifileArrays.MultifileArrays
— Module
MultifileArrays creates lazily-loaded multidimensional arrays from files. Here are the main functions:
- load_chunked: Load an array from chunks stored in files in filenames.
- load_series: Create a lazily-loaded array A from a set of files.
- select_series: Create a vector of filenames from filepattern.
MultifileArrays.load_chunked
— Function
A = load_chunked(lazyloader, filenames)
Load an array from chunks stored in files in filenames. filenames must be shaped so that it is "extended" along the dimension of concatenation.
When each chunk has the same size and is equivalent to a single slice of the final array, load_series may yield better performance.
Examples
Suppose you have 2 files, myimage_1.tiff and myimage_2.tiff, with the first storing 1000 two-dimensional images and the second storing 555 images of the same shape. Then you can load a contiguous 3d array with
julia> using MultifileArrays, FileIO, BlockArrays
julia> filenames = reshape(["myimage_1.tiff", "myimage_2.tiff"], (1, 1, 2))
1×1×2 Array{String, 3}:
[:, :, 1] =
"myimage_1.tiff"
[:, :, 2] =
"myimage_2.tiff"
julia> img = load_chunked(fn -> load(fn; mmap=true), filenames);
julia> size(img)
(512, 512, 1555)
In the TiffImages package, mmap=true allows you to "virtually" load the data by memory-mapping, supporting arrays much larger than computer memory.
load_chunked requires that you manually load the BlockArrays package.
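To make the idea of concatenating unequal chunks concrete, here is a minimal sketch of how a global index along the concatenation dimension could be mapped to a chunk and a position within it. The helper `locate` is hypothetical and written only for illustration; it is not how MultifileArrays or BlockArrays implement the lookup.

```julia
# Hypothetical helper: map a global index along the concatenation dimension
# to (chunk number, index within that chunk).
function locate(chunklengths, k)
    1 <= k || throw(BoundsError(chunklengths, k))
    offset = 0
    for (c, len) in enumerate(chunklengths)
        k <= offset + len && return (c, k - offset)
        offset += len
    end
    throw(BoundsError(chunklengths, k))
end

chunklengths = [1000, 555]     # frames in myimage_1.tiff and myimage_2.tiff
total = sum(chunklengths)      # 1555, the size of the third dimension

locate(chunklengths, 1000)     # (1, 1000): last frame of the first file
locate(chunklengths, 1001)     # (2, 1): first frame of the second file
```

Because the chunks differ in length, the lookup has to account for each chunk's size, which is one reason load_series may be faster when every file holds exactly one slice.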
MultifileArrays.load_series
— Method
A = load_series(f, filepattern; dir=pwd())
Create a lazily-loaded array A from a set of files. f(filename) should create an array from the filename, and filepattern is a pattern matching the names of the desired files. The file names should have a numeric portion that indicates ordering; ordering is numeric rather than alphabetical, so left-padding with zeros is optional. See select_series for details about the pattern-matching.
Examples
Suppose you are currently in a directory with files image01.tiff ... image12.tiff. Then either
julia> using FileIO, MultifileArrays
julia> img = load_series(load, "image*.tiff")
or the more precise regular-expression form
julia> img = load_series(load, r"image(\d+)\.tiff");
suffices to load the image files.
MultifileArrays.load_series
— Method
A = load_series(f, filenames::AbstractArray{<:AbstractString}, buffer::AbstractArray)
Create a lazily-loaded array A from a set of files. f is a function to load the data from a specific file into an array equivalent to buffer, meaning that
f(buffer, filename)
should fill buffer with the contents of filename.
filenames should be an array of file names with shape equivalent to the trailing dimensions of A, i.e., those that follow the dimensions of buffer.
The advantage of this syntax is that it provides greater control than load_series(f, filepattern) over the choice of files and the shape of the overall output.
StackViews provides an alternative approach that may yield better performance if you can either load all the files into memory at once or use lazy mmap-based loading.
Examples
Suppose you are currently in a directory with files image_z=1_t=1.tiff through image_z=5_t=30.tiff, where each file corresponds to a 2d (x, y) slice and the filename indicates the z and t coordinates. You could reshape filenames into matrix form
5×30 Matrix{String}:
"image_z=1_t=1.tiff" "image_z=1_t=2.tiff" "image_z=1_t=3.tiff" … "image_z=1_t=29.tiff" "image_z=1_t=30.tiff"
"image_z=2_t=1.tiff" "image_z=2_t=2.tiff" "image_z=2_t=3.tiff" "image_z=2_t=29.tiff" "image_z=2_t=30.tiff"
"image_z=3_t=1.tiff" "image_z=3_t=2.tiff" "image_z=3_t=3.tiff" "image_z=3_t=29.tiff" "image_z=3_t=30.tiff"
"image_z=4_t=1.tiff" "image_z=4_t=2.tiff" "image_z=4_t=3.tiff" "image_z=4_t=29.tiff" "image_z=4_t=30.tiff"
"image_z=5_t=1.tiff" "image_z=5_t=2.tiff" "image_z=5_t=3.tiff" "image_z=5_t=29.tiff" "image_z=5_t=30.tiff"
and then
julia> buf = load(first(filenames));
julia> img = load_series(load!, filenames, buf)
would create a 4-dimensional output. load! would ideally load directly into its first argument, but could be defined as
load!(dest, filename) = copyto!(dest, load(filename))
if needed.
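One way to build the matrix of file names for this example is a two-dimensional comprehension, sketched below in plain Julia. The 512×512 slice size is an assumption made purely for illustration; in practice it would come from `size(buf)`.

```julia
# Build the 5×30 matrix of names; z indexes rows and t indexes columns.
filenames = ["image_z=$(z)_t=$(t).tiff" for z in 1:5, t in 1:30]

size(filenames)      # (5, 30)
filenames[2, 29]     # "image_z=2_t=29.tiff"

# The output shape is the buffer shape followed by the shape of `filenames`.
bufsize = (512, 512)                               # assumed slice size
expected_size = (bufsize..., size(filenames)...)   # (512, 512, 5, 30)
```

The leading dimensions come from the buffer and the trailing ones from the array of file names, which is why the result is 4-dimensional.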
MultifileArrays.select_series
— Method
filenames = select_series(filepattern; dir=pwd())
Create a vector of filenames from filepattern. filepattern may be a string containing a * character or a regular expression capturing a digit-substring. The */capture extracts an integer that determines file order.
When dir contains no extraneous files, and the filenames are ordered alphabetically in the desired sequence, then readdir is a simpler alternative. select_series may be useful for cases that don't satisfy both of these conditions.
Examples
Suppose you have a directory with myimage_1.png, myimage_2.png, ..., myimage_12.png. Then
julia> select_series("myimage_*.png")
12-element Vector{String}:
"myimage_1.png"
"myimage_2.png"
⋮
"myimage_12.png"
The myimage_ part of the string is essential: the * must match only integer data. The "generic wildcard" meaning of * is implemented in Glob.
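The difference between numeric and alphabetical ordering can be sketched in plain Julia. The helper `numeric_order` below is hypothetical and only illustrates the idea of sorting by a captured integer; select_series itself may be implemented differently. The list of names, including the extraneous `notes.txt`, is made up for the example.

```julia
# Hypothetical helper: keep names matching `pattern` and sort them by the
# integer found in the first capture group.
function numeric_order(names, pattern::Regex)
    keyed = Tuple{Int,String}[]
    for name in names
        m = match(pattern, name)
        m === nothing && continue      # skip extraneous files
        push!(keyed, (parse(Int, m.captures[1]), name))
    end
    return last.(sort!(keyed; by=first))
end

names = ["myimage_10.png", "myimage_2.png", "notes.txt", "myimage_1.png"]
# Regex analogue of "myimage_*.png", with the wildcard restricted to digits.
pattern = r"^myimage_(\d+)\.png$"

sort(names)                     # alphabetical: "myimage_10.png" precedes "myimage_2.png"
numeric_order(names, pattern)   # ["myimage_1.png", "myimage_2.png", "myimage_10.png"]
```

Plain alphabetical sorting misplaces `myimage_10.png` unless the numbers are zero-padded, whereas sorting by the parsed integer gives the intended sequence and drops non-matching files.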