Operations on Zarr Arrays

A Zarr Array consists of a collection of potentially compressed chunks, and there is a significant overhead in accessing a single item from such an array compared to Julia's Base Array type.

In order to make operations on ZArrays still efficient, we use the DiskArrays package which enables efficient broadcast and reductions on Zarrays respecting their chunk sizes. This includes some modified behavior compared to a normal AbstractArray, including lazy broadcasting and a non-default array access order for reductions.

Please refer to the DiskArrays documentation to see which operations are supported.

A short example

julia> using Zarr, Statistics

julia> g = zopen("gs://cmip6/CMIP/NCAR/CESM2/historical/r9i1p1f1/Amon/tas/gn/", consolidated=true)
ZarrGroup at Consolidated S3 Object Storage
Variables: lat time tas lat_bnds lon_bnds lon time_bnds

Accessing a single element from the array has significant overhead, because a whole chunk has to be transferred from GCS and unzipped:

julia> @time g["tas"][1,1,1]
18.734581 seconds (129.25 k allocations: 557.614 MiB, 0.56% gc time)

244.39726f0
julia> latweights = reshape(cosd.(g["lat"])[:],1,192,1);

julia> t_celsius = g["tas"].-273.15
Disk Array with size 288 x 192 x 1980

julia> t_w = t_celsius .* latweights
Disk Array with size 288 x 192 x 1980

Note that the broadcast operations are not directly computed but are collected in a fused lazy Broadcast object. When calling a reducing operation on the array, it will be read chunk by chunk and means will be merged instead of accessing the elements in a naive loop, so that the computation can be finished in reasonable time:

julia> mean(t_w, dims = (1,2))./mean(latweights)
1×1×1980 Array{Float64,3}:
[:, :, 1] =
 12.492234157689309

[:, :, 2] =
 12.425466417315654

[:, :, 3] =
 13.190267552582446

...

[:, :, 1978] =
 15.55063620093181

[:, :, 1979] =
 14.614388350826788

[:, :, 1980] =
 13.913361540597469