Examples of how to access public S3 datasets
With this package it is possible to access public datasets that are hosted remotely on an S3-compatible cloud store. Here we provide examples of how to read data from some commonly used datasets.
Accessing data on Amazon S3
First we show how to access the mur-sst bucket on AWS S3. We have to set up an AWS configuration first; for the available options, see the documentation of AWS.jl. If you don't have an account, you can access the dataset without credentials as follows:
using Zarr, AWS
Zarr.AWSS3.AWS.global_aws_config(Zarr.AWSS3.AWS.AWSConfig(creds=nothing, region="us-west-2"))
AWS.AWSConfig(nothing, "us-west-2", "json", 3)
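If you do have an AWS account, AWS.jl can instead discover your credentials from its usual sources. A minimal sketch, assuming credentials are already configured via environment variables or ~/.aws/credentials:

```julia
using Zarr, AWS

# Build the global AWS configuration from credentials discovered in the
# environment (AWS_ACCESS_KEY_ID etc.) or in ~/.aws/credentials
AWS.global_aws_config(region = "us-west-2")
```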
Then we can directly open a Zarr group stored on S3:
z = zopen("s3://mur-sst/zarr-v1")
ZarrGroup at S3 Object Storage and path zarr-v1
Variables: lat analysed_sst analysis_error mask time lon sea_ice_fraction
So we see that the store points to a Zarr group containing a few arrays:
v = z["analysed_sst"]
ZArray{Int16} of size 36000 x 17999 x 6443
And we can read the attributes of the array:
v.attrs
Dict{String, Any} with 9 entries:
"units" => "kelvin"
"add_offset" => 298.15
"_ARRAY_DIMENSIONS" => Any["time", "lat", "lon"]
"long_name" => "analysed sea surface temperature"
"scale_factor" => 0.001
"standard_name" => "sea_surface_foundation_temperature"
"valid_min" => -32767
"comment" => "\"Final\" version using Multi-Resolution Variational …
"valid_max" => 32767
Or read some of the data:
v[1:1000,1:1000,1]
1000×1000 Matrix{Int16}:
-32768 -32768 -32768 -32768 -32768 … -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 … -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
⋮ ⋱
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 … -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
-32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
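Note that the raw Int16 values are not yet physical temperatures: according to the attributes above they have to be multiplied by scale_factor and shifted by add_offset, and values outside valid_min/valid_max (such as -32768, used as fill value over land) mark missing data. A sketch of this conversion, following the CF conventions implied by the attributes (the helper name to_kelvin is ours, not part of the package):

```julia
# Convert raw Int16 readings to temperatures in Kelvin, treating
# out-of-range values (e.g. the fill value -32768) as missing
function to_kelvin(raw, attrs)
    scale  = attrs["scale_factor"]
    offset = attrs["add_offset"]
    lo, hi = attrs["valid_min"], attrs["valid_max"]
    [lo <= x <= hi ? x * scale + offset : missing for x in raw]
end

sst = to_kelvin(v[1:1000, 1:1000, 1], v.attrs)
```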
Accessing CMIP6 data on GCS
Google Cloud Storage (GCS) hosts a subset of the CMIP6 climate model ensemble runs. The data is stored in Zarr format and is accessible with this package. A catalog contains a table of all available model runs:
using DataFrames, CSV
overview = CSV.read(download("https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv"),DataFrame)
138786×10 DataFrame. Omitted printing of 6 columns
│ Row │ activity_id │ institution_id │ source_id │ experiment_id │
│ │ String │ String │ String │ String │
├────────┼─────────────┼────────────────┼────────────┼───────────────┤
│ 1 │ AerChemMIP │ BCC │ BCC-ESM1 │ piClim-CH4 │
│ 2 │ AerChemMIP │ BCC │ BCC-ESM1 │ piClim-CH4 │
│ 3 │ AerChemMIP │ BCC │ BCC-ESM1 │ piClim-CH4 │
│ 4 │ AerChemMIP │ BCC │ BCC-ESM1 │ piClim-CH4 │
│ 5 │ AerChemMIP │ BCC │ BCC-ESM1 │ piClim-CH4 │
│ 6 │ AerChemMIP │ BCC │ BCC-ESM1 │ piClim-CH4 │
│ 7 │ AerChemMIP │ BCC │ BCC-ESM1 │ piClim-CH4 │
⋮
│ 138779 │ ScenarioMIP │ UA │ MCM-UA-1-0 │ ssp585 │
│ 138780 │ ScenarioMIP │ UA │ MCM-UA-1-0 │ ssp585 │
│ 138781 │ ScenarioMIP │ UA │ MCM-UA-1-0 │ ssp585 │
│ 138782 │ ScenarioMIP │ UA │ MCM-UA-1-0 │ ssp585 │
│ 138783 │ ScenarioMIP │ UA │ MCM-UA-1-0 │ ssp585 │
│ 138784 │ ScenarioMIP │ UA │ MCM-UA-1-0 │ ssp585 │
│ 138785 │ ScenarioMIP │ UA │ MCM-UA-1-0 │ ssp585 │
│ 138786 │ ScenarioMIP │ UA │ MCM-UA-1-0 │ ssp585 │
The table also contains the path to each store, so after some filtering we can access the member run we are interested in:
store = filter(overview) do row
row.activity_id == "ScenarioMIP" && row.institution_id=="DKRZ" && row.variable_id=="tas" && row.experiment_id=="ssp585"
end
store.zstore[1]
"gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/"
So we can access the dataset and read some data from it. Note that we use consolidated=true to
reduce the overhead of repeatedly requesting many metadata files:
g = zopen(store.zstore[1], consolidated=true)
You can access the meta-information through g.attrs,
or, for example, read the first time slice through
g["tas"][:,:,1]
384×192 reshape(::Array{Union{Missing, Float32},3}, 384, 192) with eltype Union{Missing, Float32}:
244.27 245.276 245.186 245.419 … 252.782 252.852 252.672 252.667
244.284 245.223 245.122 245.497 252.833 252.88 252.686 252.682
244.309 245.139 245.003 245.422 252.85 252.895 252.704 252.663
244.297 245.104 244.954 245.272 252.84 252.872 252.727 252.69
244.352 245.055 244.835 245.182 252.858 252.895 252.739 252.69
244.358 245.001 244.825 245.079 … 252.79 252.926 252.77 252.7
244.34 244.924 244.79 245.104 252.778 252.907 252.768 252.672
244.348 244.87 244.737 245.112 252.756 252.928 252.755 252.712
244.339 244.803 244.684 245.223 252.741 252.911 252.78 252.706
244.383 244.723 244.649 245.005 252.729 252.842 252.78 252.719
⋮ ⋱ ⋮
244.184 245.68 245.997 246.456 … 252.421 252.528 252.452 252.637
244.186 245.649 245.907 246.313 252.518 252.546 252.469 252.643
244.163 245.542 245.731 246.085 252.561 252.553 252.495 252.637
244.227 245.491 245.68 246.178 252.643 252.596 252.534 252.678
244.227 245.483 245.626 245.987 252.692 252.633 252.573 252.672
244.253 245.442 245.497 245.975 … 252.756 252.682 252.577 252.631
244.227 245.409 245.352 245.897 252.719 252.758 252.6 252.655
244.296 245.356 245.231 245.774 252.735 252.809 252.612 252.659
244.301 245.303 245.192 245.524 252.733 252.862 252.655 252.678
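Since the element type is Union{Missing, Float32}, aggregations over the slice should skip missing values. As a sketch (not part of the original example), an area-unweighted mean near-surface temperature of the first time slice could be computed like this:

```julia
using Statistics

slice = g["tas"][:, :, 1]            # near-surface air temperature in Kelvin
mean_tas = mean(skipmissing(slice))  # ignore missing grid points
```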
Saving data to S3 using Minio.jl
In the examples above we only read data from several public sources. Here we show how to store data on our own Minio server, launched for testing purposes. First we start the Minio server:
using Minio
s = Minio.Server(tempname(), address="localhost:9005")
run(s, wait=false)
Minio.Server("http://localhost:9005", running)
In the next step we configure AWS.jl to connect to our Minio instance by default. Afterwards we create a new bucket where we can store our data:
using AWS
cfg = MinioConfig("http://localhost:9005")
AWS.global_aws_config(cfg)
@service S3
S3.create_bucket("zarrdata")
Next we create a new zarr group in the just created bucket:
using Zarr
g = zgroup(S3Store("zarrdata"),"group_1")
ZarrGroup at S3 Object Storage and path group_1
and a new array inside the group and fill it with some data:
a = zcreate(Float32, g, "bar", 2,3,4, chunks=(1,2,2), attrs = Dict("att1"=>"one", "att2"=>2.5))
a[:,:,:] = reshape(1.0:24.0, (2,3,4))
2×3×4 reshape(::StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, 2, 3, 4) with eltype Float64:
[:, :, 1] =
1.0 3.0 5.0
2.0 4.0 6.0
[:, :, 2] =
7.0 9.0 11.0
8.0 10.0 12.0
[:, :, 3] =
13.0 15.0 17.0
14.0 16.0 18.0
[:, :, 4] =
19.0 21.0 23.0
20.0 22.0 24.0
Now we test whether the data can be accessed:
a2 = zopen("s3://zarrdata/group_1/bar")
a2[2,2,1:4]
4-element Vector{Float32}:
4.0
10.0
16.0
22.0
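The attributes passed to zcreate can be read back from the reopened array as well. A quick check (assuming the Minio server from above is still running):

```julia
a2.attrs["att1"]  # "one"
a2.attrs["att2"]  # 2.5
```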