Tutorial
Zarr provides classes and functions for working with N-dimensional arrays that behave like Julia arrays but whose data is divided into chunks, each of which is compressed. If you are already familiar with HDF5, Zarr arrays provide similar functionality with some additional flexibility. This tutorial follows the Python Zarr tutorial as closely as possible, and some of the explanatory text is copied and adapted from that source.
Creating an in-memory array
Zarr has several functions for creating arrays. For example:
julia> using Zarr
julia> z = zzeros(Int32,10000,10000,chunks=(1000,1000))
ZArray{Int32} of size 10000 x 10000
The code above creates a 2-dimensional array of 32-bit integers with 10000 rows and 10000 columns, divided into chunks where each chunk has 1000 rows and 1000 columns (and so there will be 100 chunks in total).
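The chunk count follows from dividing each dimension by the chunk length along that dimension, rounding up, and multiplying the results. A quick check in plain Julia (no Zarr calls involved):
julia> chunks_per_dim = cld.((10000, 10000), (1000, 1000))
(10, 10)
julia> prod(chunks_per_dim)
100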
Other array creation routines are zcreate, zones and zfill.
Reading and Writing data
Zarr arrays support a similar interface to Julia arrays for reading and writing data, although they don't yet implement all the indexing methods of an AbstractArray. For example, the entire array can be filled with a scalar value:
julia> z .= 42
ZArray{Int32} of size 10000 x 10000
Regions of the array can also be written to, e.g.:
julia> z[1,:]=1:10000;
julia> z[:,1]=1:10000;
The contents of the array can be retrieved by slicing, which will load the requested region into memory as a Julia array, e.g.:
julia> z[1,1]
1
julia> z[end,end]
42
julia> z[1,:]
10000-element Vector{Int32}:
1
2
3
4
5
6
7
8
9
10
⋮
9992
9993
9994
9995
9996
9997
9998
9999
10000
julia> z[1:5,1:10]
5×10 Matrix{Int32}:
1 2 3 4 5 6 7 8 9 10
2 42 42 42 42 42 42 42 42 42
3 42 42 42 42 42 42 42 42 42
4 42 42 42 42 42 42 42 42 42
5 42 42 42 42 42 42 42 42 42
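Because slicing returns an ordinary Julia array, the result can be passed straight to any Base function. A small check against the values written above, assuming the array z is still in the state shown:
julia> sum(z[1,1:10])
55
julia> z[1:5,1:10] isa Matrix{Int32}
true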
Persistent arrays
In the examples above, compressed data for each chunk of the array was stored in main memory. Zarr arrays can also be stored on a file system, enabling persistence of data between sessions. For example:
julia> using Zarr
julia> p = "data/example.zarr"
"data/example.zarr"
julia> z1 = zcreate(Int, 10000,10000,path = p,chunks=(1000, 1000))
ZArray{Int64} of size 10000 x 10000
The array above will store its configuration metadata and all compressed chunk data in a directory called ‘data/example.zarr’ relative to the current working directory. The zcreate function, when given a path keyword argument, provides a way to create a new persistent array. Note that there is no need to close an array: data are automatically flushed to disk, and files are automatically closed whenever an array is modified.
Persistent arrays support the same interface for reading and writing data, e.g.:
julia> z1 .= 42
ZArray{Int64} of size 10000 x 10000
julia> z1[1,:]=1:10000;
julia> z1[:,1]=1:10000;
Check that the data have been written and can be read again:
julia> z2 = zopen(p)
ZArray{Int64} of size 10000 x 10000
julia> all(z1[:,:].==z2[:,:])
true
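As a further sanity check, the store should exist as a directory on disk, and a small corner of the reopened array should show the values written through z1. This sketch uses only Base's isdir and the indexing already shown above:
julia> isdir(p)
true
julia> z2[1:3,1:3]
3×3 Matrix{Int64}:
 1   2   3
 2  42  42
 3  42  42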
A Julia equivalent of Python's zarr.load and zarr.save is still missing.
Resizing and appending
A Zarr array can be resized, which means that any of its dimensions can be increased or decreased in length. For example:
julia> using Zarr
julia> z = zzeros(Int32,10000, 10000, chunks=(1000, 1000))
ZArray{Int32} of size 10000 x 10000
julia> z .= 42
ZArray{Int32} of size 10000 x 10000
julia> resize!(z,20000, 10000)
julia> size(z)
(20000, 10000)
Note that when an array is resized, the underlying data are not rearranged in any way. If one or more dimensions are shrunk, any chunks falling outside the new array shape will be deleted from the underlying store.
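Shrinking works the same way. The following short continuation of the example above uses only the resize! call already shown; the number of chunks that still fit inside the new shape is computed by hand rather than queried from Zarr:
julia> resize!(z,5000, 5000);
julia> size(z)
(5000, 5000)
julia> prod(cld.(size(z), (1000, 1000)))
25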
For convenience, ZArrays also provide an append! method, which can be used to append data to any axis. E.g.:
julia> a = reshape(1:Int32(10000000),1000, 10000);
julia> z = ZArray(a, chunks=(100, 1000))
ZArray{Int64} of size 1000 x 10000
julia> size(z)
(1000, 10000)
julia> append!(z,a)
julia> append!(z,hcat(a,a), dims=1)
julia> size(z)
(2000, 20000)
Compressors
A number of different compressors can be used with Zarr. In this Julia package we currently support only Blosc compression, but more compression methods will be supported in the future. Different compressors can be provided via the compressor keyword argument accepted by all array creation functions. For example:
julia> using Zarr
julia> compressor = Zarr.BloscCompressor(cname="zstd", clevel=3, shuffle=true)
Zarr.BloscCompressor(0, 3, "zstd", 1)
julia> data = Int32(1):Int32(100000000)
1:100000000
julia> z = Zarr.zcreate(Int32,10000, 10000, chunks = (1000,1000),compressor=compressor)
ZArray{Int32} of size 10000 x 10000
julia> z[:,:]=data
1:100000000
The array above will use Blosc as the primary compressor, using the Zstandard algorithm (compression level 3) internally within Blosc, and with the byte-shuffle filter applied.
When using a compressor, it can be useful to get some diagnostics on the compression ratio. ZArrays provide a zinfo function which can be used to print some diagnostics, e.g.:
julia> zinfo(z)
Type : ZArray
Data type : Int32
Shape : (10000, 10000)
Chunk Shape : (1000, 1000)
Order : C
Read-Only : false
Compressor : Zarr.BloscCompressor(0, 3, "zstd", 1)
Filters : nothing
Store type : Dictionary Storage
No. bytes : 400000000
No. bytes stored : 2412289
Storage ratio : 165.81761140559857
Chunks initialized : 100/100
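The storage ratio reported above is the uncompressed size in bytes divided by the number of bytes actually stored. Reproducing it by hand from the figures in the listing:
julia> nbytes = sizeof(Int32) * 10000 * 10000
400000000
julia> nbytes / 2412289
165.81761140559857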
If you don’t specify a compressor, by default Zarr uses the Blosc compressor. Blosc is generally very fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a “meta-compressor”, which means that it can use a number of different compression algorithms internally to compress the data. Blosc also provides highly optimized implementations of byte- and bit-shuffle filters, which can improve compression ratios for some data.
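To see why byte-shuffling can help, here is a rough sketch of the idea in plain Julia. It is not Blosc's implementation, and it assumes a little-endian machine. The bytes of each Int32 are regrouped so that all first bytes come first, then all second bytes, and so on. For small integers this gathers the zero bytes into one long run, which a compressor can encode very compactly:
julia> x = Int32[1, 2, 3, 4];
julia> bytes = collect(reinterpret(UInt8, x));  # 4 bytes per element, element by element
julia> shuffled = vec(permutedims(reshape(bytes, sizeof(Int32), :)));  # group by byte position
julia> println(Int.(shuffled))
[1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]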
To disable compression, set compressor=Zarr.NoCompressor() when creating an array, e.g.:
julia> z = zzeros(Int32,100000000, chunks=(1000000,), compressor=Zarr.NoCompressor());
julia> storageratio(z)
1.0
Ragged Arrays
If you need to store an array of arrays, where each member array can be of any length and stores the same data type (a.k.a. a ragged array), the VLenArray filter will be used, e.g.:
julia> z = zcreate(Vector{Int}, 4)
ZArray{Vector{Int64}} of size 4
julia> z.metadata.filters
(Zarr.VLenArrayFilter{Int64}(),)
julia> z[1:3] = [[1,3,5],[4],[7,9,14]];
julia> z[:]
4-element Vector{Vector{Int64}}:
[1, 3, 5]
[4]
[7, 9, 14]
[]
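Elements that were never written come back as empty vectors, as the fourth entry shows. Checking the member lengths, using only the indexing already shown above:
julia> length.(z[:])
4-element Vector{Int64}:
 3
 1
 3
 0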