DiskArrays
DiskArrays.DiskArrays
— Module
DiskArrays.jl
This package provides a collection of utilities for working with n-dimensional array-like data structures that have considerable overhead for single read operations. The most important examples are arrays that represent data on hard disk and are accessed through a C library, or that are compressed in chunks. It can be inadvisable to make these arrays a direct subtype of AbstractArray, because many functions working with AbstractArrays assume fast random access to single values (including basic things like getindex, show, reduce, etc.).
Currently supported features are:
- getindex/setindex! with the same rules as Base (trailing or singleton dimensions etc.)
- views into DiskArrays
- a fallback Base.show method that does not call getindex repeatedly
- implementations of mapreduce and mapreducedim that respect the chunking of the underlying dataset. This greatly increases the performance of higher-level reductions like sum(a,dims=d)
- an iterator over the values of a DiskArray that caches a chunk of data and returns the values within. This allows efficient usage of e.g. using DataStructures; counter(a)
- customization of broadcast when there is a DiskArray on the LHS. This at least makes things like a.=5 possible and relatively fast
AbstractDiskArray Interface definition
Package authors who want to use this library to make their disk-based array an AbstractDiskArray should at least implement methods for the following functions:
Base.size(A::CustomDiskArray)
readblock!(A::CustomDiskArray{T,N},aout,r::Vararg{AbstractUnitRange,N})
writeblock!(A::CustomDiskArray{T,N},ain,r::Vararg{AbstractUnitRange,N})
Here readblock! will read a subset of array A in a hyper-rectangle defined by the unit ranges r. The results shall be written into aout. writeblock! should write the data given by ain into the (hyper-)rectangle of A defined by r. When defining the functions it can be safely assumed that length(r) == ndims(A) as well as size(ain) == length.(r). However, bounds checking is not performed by the DiskArray machinery and currently should be done by the implementation.
If the data on disk has rectangular chunks as underlying storage units, you should additionally implement the following methods to optimize operations such as broadcast, reductions and sparse indexing:
DiskArrays.haschunks(A::CustomDiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(A::CustomDiskArray) = DiskArrays.GridChunks(A, chunksize)
where chunksize is a tuple of Ints giving the chunk length along each dimension. If the array does not have an internal chunking structure, one should define
DiskArrays.haschunks(A::CustomDiskArray) = DiskArrays.Unchunked()
Implementing only these methods makes all kinds of unusual indexing patterns work (Colons, StepRanges, Integer vectors, Boolean masks, CartesianIndices, Arrays of CartesianIndex, and mixtures of all these). To keep the number of readblock! or writeblock! calls as small as possible, the package reads a rectangular bounding box of the required array values and rearranges the resulting values into the output array.
In addition, DiskArrays.jl provides a few optimizations for sparse indexing patterns to avoid reading and discarding too much unnecessary data from disk, for example for indices like A[:,:,[1,1500]].
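The bounding-box idea can be illustrated with plain Julia arrays. The following is a minimal sketch under stated assumptions, not the package's implementation: read_rect is a hypothetical stand-in for a backend that only accepts unit ranges, and bbox_getindex is a hypothetical helper that serves integer-vector indices through one rectangular read.

```julia
# Hypothetical stand-in for a backend that can only read unit ranges.
read_rect(data, rs::AbstractUnitRange...) = data[rs...]

# Hypothetical helper: serve integer-vector indices via one rectangular read.
function bbox_getindex(data, inds::AbstractVector{Int}...)
    # Smallest unit range covering the requested indices in each dimension
    boxes = map(i -> minimum(i):maximum(i), inds)
    block = read_rect(data, boxes...)   # a single "disk" read
    # Shift the requested indices so that they point into the block just read
    local_inds = map((i, b) -> i .- (first(b) - 1), inds, boxes)
    return block[local_inds...]
end

data = reshape(1:60, 6, 10)
result = bbox_getindex(data, [5, 2], [3, 9, 4])
```

Here one read of the rectangle 2:5 by 3:9 replaces six single-value reads. The price is that some values inside the rectangle are read and then discarded, which is why sparse index patterns benefit from additional handling.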
Example
Here we define a new array type that wraps a normal AbstractArray. The only access methods that we define are readblock! and writeblock! functions, where indices are strictly given as unit ranges along every dimension of the array. This is a very common API used in libraries like HDF5, NetCDF and Zarr. We also define a chunking, which controls the way iteration and reductions are computed. To show exactly how data is accessed, we added print statements to the readblock! and writeblock! functions.
using DiskArrays

# Wrapper around an in-memory array that behaves like a chunked disk array
struct PseudoDiskArray{T,N,A<:AbstractArray{T,N}} <: DiskArrays.AbstractDiskArray{T,N}
    parent::A
    chunksize::NTuple{N,Int}
end
PseudoDiskArray(a; chunksize=size(a)) = PseudoDiskArray(a, chunksize)

# Module-qualified so that methods are added to the DiskArrays interface functions
DiskArrays.haschunks(a::PseudoDiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(a::PseudoDiskArray) = DiskArrays.GridChunks(a, a.chunksize)
Base.size(a::PseudoDiskArray) = size(a.parent)

function DiskArrays.readblock!(a::PseudoDiskArray, aout, i::AbstractUnitRange...)
    ndims(a) == length(i) || error("Number of indices is not correct")
    all(r -> isa(r, AbstractUnitRange), i) || error("Not all indices are unit ranges")
    println("Reading at index ", join(string.(i), " "))
    aout .= a.parent[i...]
end

function DiskArrays.writeblock!(a::PseudoDiskArray, v, i::AbstractUnitRange...)
    ndims(a) == length(i) || error("Number of indices is not correct")
    all(r -> isa(r, AbstractUnitRange), i) || error("Not all indices are unit ranges")
    println("Writing to indices ", join(string.(i), " "))
    view(a.parent, i...) .= v
end

# A 10 x 9 x 1 array stored in chunks of size 5 x 3 x 1, matching the output shown below
a = PseudoDiskArray(rand(10,9,1), chunksize=(5,3,1))
Disk Array with size 10 x 9 x 1
Now all the Base indexing behaviors work for our array, while minimizing the number of reads that have to be done:
a[:,3]
Reading at index Base.OneTo(10) 3:3 1:1
10-element Array{Float64,1}:
0.8821177068878834
0.6220977650963209
0.22676949571723437
0.3177934541451004
0.08014908894614026
0.9989838001681182
0.5865160181790519
0.27931778627456216
0.449108677620097
0.22886146620923808
As can be seen from the read message, only a single call to readblock! is performed, which will map to a single call into the underlying C library.
mask = falses(size(a))
mask[3,2:4,1] .= true
a[mask]
3-element Array{Int64,1}:
6
7
8
One can check in a similar way that reductions respect the chunks defined by the data type:
sum(a,dims=(1,3))
Reading at index 1:5 1:3 1:1
Reading at index 6:10 1:3 1:1
Reading at index 1:5 4:6 1:1
Reading at index 6:10 4:6 1:1
Reading at index 1:5 7:9 1:1
Reading at index 6:10 7:9 1:1
1×9×1 Array{Float64,3}:
[:, :, 1] =
6.33221 4.91877 3.98709 4.18658 … 6.01844 5.03799 3.91565 6.06882
When a DiskArray is on the LHS of a broadcasting expression, the results will be written chunk by chunk:
va = view(a,5:10,5:8,1)
va .= 2.0
a[:,:,1]
Writing to indices 5:5 5:6 1:1
Writing to indices 6:10 5:6 1:1
Writing to indices 5:5 7:8 1:1
Writing to indices 6:10 7:8 1:1
Reading at index Base.OneTo(10) Base.OneTo(9) 1:1
10×9 Array{Float64,2}:
 0.929979   0.664717  0.617594  0.720272   …  0.564644  0.430036  0.791838
 0.392748   0.508902  0.941583  0.854843      0.682924  0.323496  0.389914
 0.761131   0.937071  0.805167  0.951293      0.630261  0.290144  0.534721
 0.332388   0.914568  0.497409  0.471007      0.470808  0.726594  0.97107
 0.251657   0.24236   0.866905  0.669599      2.0       2.0       0.427387
 0.388476   0.121011  0.738621  0.304039   …  2.0       2.0       0.687802
 0.991391   0.621701  0.210167  0.129159      2.0       2.0       0.733581
 0.371857   0.549601  0.289447  0.509249      2.0       2.0       0.920333
 0.76309    0.648815  0.632453  0.623295      2.0       2.0       0.387723
 0.0882056  0.842403  0.147516  0.0562536     2.0       2.0       0.107673
Accessing strided Arrays
There are situations where one wants to read every other value along a certain axis or provide arbitrary strides. Some DiskArray backends may want to provide optimized methods to read these strided arrays. In this case a backend can define readblock!(a,aout,r::OrdinalRange...) and the respective writeblock! method, which will override the fallback behavior that would read the whole block of data and only return the desired range.
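The fallback behavior for strided access can be pictured with plain Julia arrays. This is only a sketch of the idea, assuming positive steps: read_rect is a hypothetical unit-range-only reader and strided_read is a hypothetical helper, neither of which is part of DiskArrays.jl.

```julia
# Hypothetical unit-range-only reader standing in for a backend.
read_rect(data, rs::AbstractUnitRange...) = data[rs...]

# Fallback sketch: read the covering contiguous block, then subsample it.
function strided_read(data, rs::OrdinalRange...)
    # Assumes positive steps, so first(r):last(r) covers the whole range
    covers = map(r -> first(r):last(r), rs)
    block = read_rect(data, covers...)
    # Re-express each strided range relative to the start of the block
    locals = map(r -> r .- (first(r) - 1), rs)
    return block[locals...]
end

data = reshape(1:120, 12, 10)
vals = strided_read(data, 2:3:11, 1:2:9)
```

In this sketch a 10 by 9 block is read to return a 4 by 5 result. A backend whose storage library supports strides natively can avoid that overhead by providing its own OrdinalRange method.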
Arrays that do not implement eachchunk
There are arrays that live on disk but are not split into rectangular chunks, so that the haschunks trait returns Unchunked(). To still enable broadcasting and reductions for these arrays, a chunk size is estimated such that a certain memory limit per chunk is not exceeded. This memory limit defaults to 100MB and can be modified by changing DiskArrays.default_chunk_size[]. The chunk size is then computed based on the element size of the array. However, there are cases where the size of the element type is undefined, e.g. for Strings or variable-length vectors. In these cases one can overload the DiskArrays.element_size function for certain container types so that it returns an approximate element size (in bytes). Otherwise the size of an element is simply assumed to equal the value stored in DiskArrays.fallback_element_size, which defaults to 100 bytes.
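To make the arithmetic concrete, here is a minimal sketch of how a chunk shape could be derived from a memory limit and an element size. sketch_chunksize is a hypothetical helper and not the package's algorithm; it also assumes that 1 MB means 10^6 bytes and that leading dimensions should be kept whole first.

```julia
# Hypothetical helper, not the package's algorithm: choose a chunk shape whose
# memory footprint stays at or below `limit_mb` megabytes (1 MB = 10^6 bytes assumed).
function sketch_chunksize(dims::NTuple{N,Int}, elsize::Int; limit_mb=100) where {N}
    budget = max(1, floor(Int, limit_mb * 1e6 / elsize))  # elements allowed per chunk
    chunk = ones(Int, N)
    for d in 1:N
        chunk[d] = min(dims[d], budget)      # take as much of this dimension as fits
        budget = max(1, budget ÷ chunk[d])   # remaining budget for later dimensions
    end
    return Tuple(chunk)
end

# Float64 elements: 8 bytes each
cs = sketch_chunksize((10_000, 10_000, 50), sizeof(Float64))

# Elements of unknown size, using an assumed 100 bytes per element
cs_str = sketch_chunksize((1_000_000,), 100)
```

With 8-byte elements the budget is 12.5 million elements per chunk, so the first dimension is kept whole, the second is cut to 1250, and the third is reduced to 1.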
DiskArrays.default_chunk_size
— Constant
The target chunk size for processing unchunked arrays, in MB. Defaults to 100MB.
DiskArrays.fallback_element_size
— Constant
A fallback element size, used for arrays whose elements have unknown size, like strings. Defaults to 100 bytes.
DiskArrays.AbstractDiskArray
— Type
AbstractDiskArray <: AbstractArray
Abstract DiskArray type that can be inherited by Array-like data structures that have a significant random access overhead and whose access pattern follows n-dimensional (hyper)-rectangles.
DiskArrays.AbstractPermutedDiskArray
— Type
AbstractPermutedDiskArray <: AbstractDiskArray
Abstract supertype for disk arrays with permuted dimensions.
DiskArrays.AbstractReshapedDiskArray
— Type
AbstractReshapedDiskArray <: AbstractDiskArray
Abstract supertype for replacements of Base.ReshapedArray for AbstractDiskArrays.
DiskArrays.AbstractSubDiskArray
— Type
AbstractSubDiskArray <: AbstractDiskArray
Abstract supertype for a view of an AbstractDiskArray.
DiskArrays.AllowStepRange
— Type
AllowStepRange
Traits to specify if an array axis can utilise step ranges, as an argument to the BatchStrategy types NoBatch, SubRanges and ChunkRead.
CanStepRange() and NoStepRange() are the two options.
DiskArrays.BatchStrategy
— Type
BatchStrategy{S<:AllowStepRange}
Traits for array chunking strategy.
NoBatch, SubRanges and ChunkRead are the options.
All have keywords:
- alow_steprange: an AllowStepRange trait, NoStepRange() by default. This controls if step ranges are passed to the parent object.
- density_threshold: determines the density where step ranges are not read as whole chunks.
DiskArrays.BlockedIndices
— Type
BlockedIndices{C<:GridChunks}
A lazy iterator over the indices of GridChunks.
Uses two Iterators.Stateful iterators, at the chunk and index levels.
DiskArrays.CachedDiskArray
— Type
CachedDiskArray <: ChunkTiledDiskArray
CachedDiskArray(A::AbstractArray; maxsize=1000, mmap=false)
Wrap some disk array A with a caching mechanism that will keep chunks up to a total of maxsize megabytes, dropping the least used chunks when maxsize is exceeded. If mmap is set to true, cached chunks will not be kept in RAM but memory-mapped to temporary files.
Can also be called with cache, which can be extended for wrapper array types.
DiskArrays.ChunkIndex
— Type
ChunkIndex{N}
This can be used in indexing operations when one wants to extract a full data chunk from a DiskArray.
Useful for iterating over chunks of data.
d[ChunkIndex(1, 1)] will extract the first chunk of a 2D DiskArray.
DiskArrays.ChunkIndexType
— Type
DiskArrays.ChunkIndices
— Type
ChunkIndices{N}
Represents an iterator of ChunkIndex objects.
DiskArrays.ChunkRead
— Type
ChunkRead <: BatchStrategy
A chunking strategy that splits a dataset according to its chunks and reads chunk by chunk.
DiskArrays.ChunkTiledDiskArray
— Type
ChunkTiledDiskArray <: AbstractDiskArray
An abstract supertype for disk arrays that have fast indexing of tiled chunks already stored as separate arrays, such as CachedDiskArray.
DiskArrays.ChunkVector
— Type
ChunkVector <: AbstractVector{UnitRange}
Supertype for lazy vectors of UnitRange.
RegularChunks and IrregularChunks are the implementations.
DiskArrays.Chunked
— Type
Chunked{<:BatchStrategy}
A trait that specifies an Array has a chunked read pattern.
DiskArrays.ChunkedTrait
— Type
DiskArrays.ConcatDiskArray
— Type
ConcatDiskArray <: AbstractDiskArray
ConcatDiskArray(arrays)
Joins multiple AbstractArrays or AbstractDiskArrays into a single disk array, using lazy concatenation.
Returned from cat on disk arrays.
It is also useful on its own as it can easily concatenate an array of disk arrays.
DiskArrays.DiskGenerator
— Type
DiskGenerator{I,F}
Replaces Base.Generator for disk arrays.
Operates out-of-order over chunks, but collect will create an array in the correct order.
DiskArrays.DiskIndex
— Type
DiskIndex
DiskIndex(
    output_size::NTuple{N,<:Integer},
    temparray_size::NTuple{M,<:Integer},
    output_indices::Tuple,
    temparray_indices::Tuple,
    data_indices::Tuple
)
DiskIndex(a::AbstractArray, i)
An object encoding indexing into a chunked disk array, and to memory-backed input/output buffers.
Arguments and fields
- output_size: size of the output array
- temparray_size: size of the temp array passed to readblock!
- output_indices: indices for copying into the output array
- temparray_indices: indices for reading from the temp array
- data_indices: indices for reading from the data array
DiskArrays.DiskZip
— Type
DiskZip
Replaces Zip for disk arrays, for calling zip on disk arrays.
Reads out-of-order over chunks, but collects to the correct order. Less flexible than Base.Zip as it can only zip with other AbstractArrays.
Note: currently zip returns a DiskZip only if at least one of its first two arguments is a disk array.
DiskArrays.GridChunks
— Type
GridChunks
Multi-dimensional chunk specification that holds a chunk pattern for each axis of an array.
These are usually RegularChunks or IrregularChunks.
DiskArrays.IrregularChunks
— Type
IrregularChunks <: ChunkVector
Defines chunks along a dimension where chunk sizes are not constant but arbitrary.
DiskArrays.IrregularChunks
— Method
IrregularChunks(; chunksizes)
Returns an IrregularChunks object for the given list of chunk sizes.
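As an illustration of what a list of chunk sizes encodes, the following sketch converts sizes into the index ranges they cover along one dimension. ranges_from_sizes is a hypothetical helper written in plain Julia; it does not use or reproduce the IrregularChunks type.

```julia
# Hypothetical helper: turn a list of chunk sizes into consecutive index ranges.
function ranges_from_sizes(chunksizes::AbstractVector{Int})
    stops = cumsum(chunksizes)             # last index of every chunk
    starts = stops .- chunksizes .+ 1      # first index of every chunk
    return [a:b for (a, b) in zip(starts, stops)]
end

rs = ranges_from_sizes([3, 5, 2])
```

The ranges tile the dimension without gaps or overlaps, and their lengths sum to the length of the dimension.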
DiskArrays.MockChunkedDiskArray
— Type
MockChunkedDiskArray <: AbstractDiskArray
MockChunkedDiskArray(parent::AbstractArray, chunks::GridChunks)
A disk array that pretends to have a specific chunk pattern, regardless of the true chunk pattern of the parent array.
This is useful in zip and other operations that can iterate over multiple arrays with different patterns.
DiskArrays.MultiReadArray
— Type
MultiReadArray <: AbstractArray
An array that holds indices for multiple block reads.
DiskArrays.NoBatch
— Type
NoBatch <: BatchStrategy
A chunking strategy that avoids batching into multiple reads.
DiskArrays.PaddedDiskArray
— Type
PaddedDiskArray <: AbstractDiskArray
PaddedDiskArray(A, padding; fill=zero(eltype(A)))
An AbstractDiskArray that adds padding to the edges of the parent array. This can help with changing chunk offsets or padding a larger-than-memory array before windowing operations.
Arguments
- A: The parent disk array.
- padding: A tuple of Int lower and upper padding tuples, one for each dimension.
Keywords
- fill=zero(eltype(A)): The value to pad the array with.
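The shape of the padding argument can be shown with an eager, in-memory analogue. padded_copy is a hypothetical helper that allocates a new array, whereas the type documented here wraps its parent; the sketch only illustrates how per-dimension (lower, upper) tuples translate into output size and placement.

```julia
# Hypothetical eager analogue: one (lower, upper) padding tuple per dimension.
function padded_copy(A::AbstractArray{T,N}, padding::NTuple{N,Tuple{Int,Int}};
                     fill=zero(T)) where {T,N}
    # Every dimension grows by its lower plus upper padding
    newsize = ntuple(d -> size(A, d) + sum(padding[d]), N)
    out = Base.fill(fill, newsize)
    # The parent data starts right after the lower padding
    inner = ntuple(d -> (padding[d][1] + 1):(padding[d][1] + size(A, d)), N)
    out[inner...] .= A
    return out
end

A = [1 2; 3 4]
P = padded_copy(A, ((1, 0), (0, 2)))
```

In this example one row is added before the data and two columns after it, so a 2 by 2 input becomes a 3 by 4 output.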
DiskArrays.PermutedDiskArray
— Type
PermutedDiskArray <: AbstractPermutedDiskArray
A lazily permuted disk array returned by permutedims(diskarray, permutation).
DiskArrays.RegularChunks
— Type
RegularChunks <: ChunkVector
Defines chunking along a dimension where the chunks have constant size and a potential offset for the first chunk. The last chunk is truncated to fit the array size.
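A small sketch can show how a constant chunk size, an offset and an array length determine the chunk ranges. It assumes that the offset is the number of elements by which the first chunk is shortened. regular_chunk_range and nchunks are hypothetical helpers written for illustration, not functions of the package.

```julia
# Hypothetical helper: index range of the i-th chunk along a dimension of
# length `len`, with chunks of `chunksize` and the first chunk shortened by `offset`.
function regular_chunk_range(chunksize::Int, offset::Int, len::Int, i::Int)
    lo = max((i - 1) * chunksize + 1 - offset, 1)   # first chunk starts at 1
    hi = min(i * chunksize - offset, len)           # last chunk is truncated
    return lo:hi
end

# Hypothetical helper: number of chunks needed to cover the dimension
nchunks(chunksize, offset, len) = cld(len + offset, chunksize)

ranges = [regular_chunk_range(5, 2, 12, i) for i in 1:nchunks(5, 2, 12)]
```

With a chunk size of 5, an offset of 2 and a length of 12, the first chunk holds 3 elements, the second holds a full 5, and the last is truncated to 4.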
DiskArrays.ReshapedDiskArray
— Type
ReshapedDiskArray <: AbstractReshapedDiskArray
A replacement for Base.ReshapedArray for disk arrays, returned by reshape.
Reshaping is not trivial, because the access pattern would completely change for reshaped arrays: rectangles would not remain rectangles in the parent array.
However, we can support the case where only singleton dimensions are added. Later we could allow more special cases, like joining two dimensions into one.
DiskArrays.SubDiskArray
— Type
SubDiskArray <: AbstractDiskArray
A replacement for Base.SubArray for disk arrays, returned by view.
DiskArrays.SubRanges
— Type
SubRanges <: BatchStrategy
A chunking strategy that splits contiguous streaks into ranges to be read separately.
DiskArrays.Unchunked
— Type
Unchunked{<:BatchStrategy}
A trait that specifies an Array does not have a chunked read pattern, and random access indexing is relatively performant.
DiskArrays.allowscalar
— Method
allowscalar(x::Bool)
Specify if a disk array can do scalar indexing (with all Int arguments).
Setting allowscalar(false) can help identify the cause of poor performance.
DiskArrays.approx_chunksize
— Method
approx_chunksize(g::GridChunks)
Returns the approximate chunk size of the grid.
For dimensions with regular chunks this will be the exact chunk size, while for dimensions with irregular chunks it is the average chunk size.
Useful for downstream applications that want to distribute computations and want to know about chunk sizes.
DiskArrays.arraysize_from_chunksize
— Method
arraysize_from_chunksize(g::ChunkVector)
Returns the size of the dimension represented by a chunk object.
DiskArrays.cache
— Method
cache(A::AbstractArray; maxsize=1000, mmap=false)
Wrap internal disk arrays with CachedDiskArray.
This function is intended to be extended by packages that want to re-wrap the disk array afterwards, such as YAXArrays.jl or Rasters.jl.
DiskArrays.canscalar
— Method
DiskArrays.create_outputarray
— Method
create_outputarray(out, a, output_size)
Generate an Array to pass to readblock!
DiskArrays.eachchunk
— Function
eachchunk(a)
Returns an iterator with CartesianIndices elements that mark the index range of each chunk within an array.
DiskArrays.element_size
— Method
element_size(a::AbstractArray)
Returns the approximate size of an element of a in bytes. This falls back to calling sizeof on the element type or to the value stored in DiskArrays.fallback_element_size. Methods can be added for custom containers.
DiskArrays.estimate_chunksize
— Method
estimate_chunksize(a::AbstractArray)
Estimate a suitable chunk pattern for an AbstractArray without chunks.
DiskArrays.getindex_disk
— Method
getindex_disk(a::AbstractArray, i...)
Internal getindex for disk arrays.
Converts indices to ranges and calls DiskArrays.readblock!
DiskArrays.grid_offset
— Method
grid_offset(g::GridChunks)
Returns the offset of the grid for the first chunks.
Expect this value to be non-zero for views into regular-gridded arrays.
Useful for downstream applications that want to distribute computations and want to know about chunk sizes.
DiskArrays.haschunks
— Function
DiskArrays.isdisk
— Method
isdisk(a::AbstractArray)
Return true if a is an AbstractDiskArray or follows the DiskArrays.jl interface via macros. Otherwise false.
DiskArrays.max_chunksize
— Method
max_chunksize(g::GridChunks)
Returns the maximum chunk size of an array for each dimension.
Useful for pre-allocating arrays to make sure they can hold a chunk of data.
DiskArrays.maybeshrink
— Method
maybeshrink(temparray::AbstractArray, indices::Tuple)
Shrink an array with a view, if needed.
TODO: this could be type stable if we reshaped the array instead.
DiskArrays.merge_index
— Method
merge_index(a::DiskIndex, b::DiskIndex)
Merge two DiskIndex objects into a single index across more dimensions.
DiskArrays.mockchunks
— Method
mockchunks(data::AbstractArray,chunks)
Change the chunk pattern of the underlying DiskArray according to chunks.
Note that this will not change the chunking of the underlying data itself, it will just make the data "look" like it had a different chunking. If you need a persistent on-disk representation of this chunking, save the resulting array.
The chunks argument can take one of the following forms:
- a DiskArrays.GridChunks object
- a tuple specifying the chunk size along each dimension, like (10, 10, 1) for a 3-D array
DiskArrays.need_batch
— Method
need_batch(a::AbstractArray, i) => Bool
Check if disk array a needs batch indexing for indices i, returning a Bool.
DiskArrays.nooffset
— Method
Removes the offset from a ChunkIndex.
DiskArrays.output_aliasing
— Method
output_aliasing(di::DiskIndex, ndims_dest, ndims_source)
Determines whether the output and temp arrays:
a) can be identical, returning :identical
b) can share memory through reshape, returning :reshapeoutput
c) need to be allocated individually, returning :noalign
DiskArrays.pad
— Method
pad(A, padding; fill=zero(eltype(A)))
Pad any AbstractArray with fill values, updating chunk patterns.
Arguments
- A: The parent disk array.
- padding: A tuple of Int lower and upper padding tuples, one for each dimension.
Keywords
- fill=zero(eltype(A)): The value to pad the array with.
DiskArrays.process_index
— Method
process_index(i, chunks, batchstrategy)
Calculate indices for i using the first chunk/s in chunks.
Returns a DiskIndex, and the remaining chunks.
DiskArrays.readblock!
— Function
readblock!(A::AbstractDiskArray, A_ret, r::AbstractUnitRange...)
The only function that should be implemented by an AbstractDiskArray. This function
DiskArrays.readblock_checked!
— Method
Like readblock!, but only executed when the data size to read is not empty.
DiskArrays.setindex_disk!
— Method
setindex_disk!(A::AbstractArray, v, i...)
Internal setindex! for disk arrays.
Converts indices to ranges and calls DiskArrays.writeblock!
DiskArrays.splitchunks
— Method
splitchunks(i, chunks)
Split chunks into a 2-tuple based on i, so that the first group matches i and the second matches the remaining indices.
The dimensionality of i will determine the number of chunks returned in the first group.
DiskArrays.transfer_results_read!
— Method
transfer_results_read!(outputarray, temparray, outputindices, temparrayindices)
Copy results from temparray to outputarray for the respective indices.
DiskArrays.transfer_results_write!
— Method
transfer_results_write!(values, temparray, valuesindices, temparrayindices)
Copy results from values to temparray for the respective indices.
DiskArrays.writeblock!
— Function
writeblock!(A::AbstractDiskArray, A_in, r::AbstractUnitRange...)
Function that should be implemented by an AbstractDiskArray if write operations should be supported as well.
DiskArrays.writeblock_checked!
— Method
Like writeblock!, but only executed when the data size to write is not empty.
DiskArrays.TestTypes.AccessCountDiskArray
— Type
AccessCountDiskArray(A; chunksize)
An array that counts getindex and setindex calls, to debug and optimise chunk access.
getindex_count(A) and setindex_count(A) can be used to check the counters.
DiskArrays.TestTypes.ChunkedDiskArray
— Type
ChunkedDiskArray(A; chunksize)
A generic AbstractDiskArray that can wrap any other AbstractArray, with custom chunksize.
DiskArrays.TestTypes.UnchunkedDiskArray
— Type
UnchunkedDiskArray(A)
A disk array without chunking, that can wrap any other AbstractArray.