Zarr
Added in version 3.4.
Driver short name
Zarr
Build dependencies
Built-in by default, but liblz4, libxz (lzma), libzstd and libblosc are strongly recommended to get all compressors
Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. The driver supports read and write access through either the traditional 2D raster API or the newer multidimensional API.
The driver supports the Zarr V2 specification, and has experimental support for the in-progress Zarr V3 specification.
Warning
The Zarr V3 implementation in GDAL versions before 3.8 is incompatible with the latest evolutions of the Zarr V3 specification. GDAL 3.8 is compatible with the Zarr V3 specification as of 2023-May-7, and is not interoperable with Zarr V3 datasets produced by earlier GDAL versions.
Local and cloud storage (see GDAL Virtual File Systems (compressed, network hosted, etc...): /vsimem, /vsizip, /vsitar, /vsicurl, ...) are supported in read and write.
Driver capabilities
Supports Create()
This driver supports the GDALDriver::Create() operation
Supports Georeferencing
This driver supports georeferencing
Supports multidimensional API
This driver supports the Multidimensional Raster Data Model
Supports VirtualIO
This driver supports virtual I/O operations (/vsimem/, etc.)
Concepts
A Zarr dataset is made of a hierarchy of nodes, with intermediate nodes being groups (GDALGroup), and leaves being arrays (GDALMDArray).
Dataset name
For Zarr V2, the dataset name recognized by the Open() method of the driver is a directory that contains a .zgroup file, a .zarray file or a .zmetadata file (consolidated metadata). For faster exploration, the driver will use consolidated metadata by default when found.
For Zarr V3, the dataset name recognized by the Open() method of the driver is a directory that contains a zarr.json file (root of the dataset).
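The marker files described above can be checked with a few lines of Python. This is only a sketch of the rule as stated in the text, not the driver's actual detection code: `guess_zarr_version` is a hypothetical helper, and checking the V2 markers before the V3 one is an arbitrary assumption.

```python
import tempfile
from pathlib import Path

# Marker file names taken from the text above. The helper itself is
# hypothetical and is not part of GDAL.
V2_MARKERS = (".zmetadata", ".zgroup", ".zarray")
V3_MARKER = "zarr.json"


def guess_zarr_version(directory):
    """Return "V2", "V3" or None depending on which marker file is present."""
    root = Path(directory)
    # Assumption: V2 markers are tested first; the text does not give an order.
    if any((root / name).is_file() for name in V2_MARKERS):
        return "V2"
    if (root / V3_MARKER).is_file():
        return "V3"
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Minimal stand-ins for a V2 group and a V3 root, created for illustration.
    v2_dir = Path(tmp) / "v2.zarr"
    v2_dir.mkdir()
    (v2_dir / ".zgroup").write_text('{"zarr_format": 2}')

    v3_dir = Path(tmp) / "v3.zarr"
    v3_dir.mkdir()
    (v3_dir / "zarr.json").write_text('{"zarr_format": 3, "node_type": "group"}')

    v2_result = guess_zarr_version(v2_dir)
    v3_result = guess_zarr_version(v3_dir)
    # The temporary directory itself holds no marker file.
    empty_result = guess_zarr_version(tmp)
    print(v2_result, v3_result, empty_result)
```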
For datasets on file systems where file listing is not reliable, as is often the case with /vsicurl/, it is also possible to prefix the directory name with ZARR:, in which case the /vsicurl/-prefixed URL must be surrounded with double quotes, e.g. ZARR:"/vsicurl/https://example.org/foo.zarr". Note that when passing such a string in a command line shell, extra quoting might be necessary to preserve the double quotes.
For example with a Bash shell, the whole connection string needs to be surrounded with single-quote characters:
gdalmdiminfo 'ZARR:"/vsicurl/https://example.org/foo.zarr"'
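When the command line is assembled from a script, the two levels of quoting can be produced programmatically. The following is a minimal sketch using only the Python standard library; `zarr_connection_string` is a hypothetical helper name, and `shlex.quote()` targets POSIX shells such as Bash.

```python
import shlex


def zarr_connection_string(path):
    """Prefix a path with ZARR: and surround it with double quotes."""
    return f'ZARR:"{path}"'


conn = zarr_connection_string("/vsicurl/https://example.org/foo.zarr")

# shlex.quote() adds the outer single quotes that a POSIX shell needs
# in order to keep the inner double quotes intact.
command = "gdalmdiminfo " + shlex.quote(conn)
print(command)
```

If the tool is launched with an argument list instead of a shell string (for example through `subprocess.run([...])`), the outer single quotes are not needed and `conn` can be passed as is.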
Compression methods
The available compression methods depend on how GDAL is built, and on how libblosc itself is built.
A full-feature build will show:
$ gdalinfo --format Zarr
[...]
Other metadata items:
COMPRESSORS=blosc,zlib,gzip,lzma,zstd,lz4
BLOSC_COMPRESSORS=blosclz,lz4,lz4hc,snappy,zlib,zstd
For specific uses, it is also possible to register at run-time extra compressors and decompressors with CPLRegisterCompressor() and CPLRegisterDecompressor().
XArray _ARRAY_DIMENSIONS
The driver supports the _ARRAY_DIMENSIONS special attribute used by XArray to store the dimension names of an array.
NCZarr extensions
The driver supports the NCZarr v2 extensions for storing the dimension names of an array (read-only).
SRS encoding
The Zarr specification has no provision for spatial reference system encoding.
GDAL uses a _CRS attribute that is a dictionary that may contain one or several of the following keys: url (using an OGC CRS URL), wkt (WKT:2019 used by default on writing, WKT1 also supported on reading) and projjson.
On reading, it will use url by default, and if not found will fall back to wkt and then projjson.
{
"_CRS":{
"wkt":"PROJCRS[\"NAD27 \/ UTM zone 11N\",BASEGEOGCRS[\"NAD27\",DATUM[\"North American Datum 1927\",ELLIPSOID[\"Clarke 1866\",6378206.4,294.978698213898,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],ID[\"EPSG\",4267]],CONVERSION[\"UTM zone 11N\",METHOD[\"Transverse Mercator\",ID[\"EPSG\",9807]],PARAMETER[\"Latitude of natural origin\",0,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8801]],PARAMETER[\"Longitude of natural origin\",-117,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8802]],PARAMETER[\"Scale factor at natural origin\",0.9996,SCALEUNIT[\"unity\",1],ID[\"EPSG\",8805]],PARAMETER[\"False easting\",500000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8806]],PARAMETER[\"False northing\",0,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8807]]],CS[Cartesian,2],AXIS[\"easting\",east,ORDER[1],LENGTHUNIT[\"metre\",1]],AXIS[\"northing\",north,ORDER[2],LENGTHUNIT[\"metre\",1]],ID[\"EPSG\",26711]]",
"projjson":{
"$schema":"https:\/\/proj.org\/schemas\/v0.2\/projjson.schema.json",
"type":"ProjectedCRS",
"name":"NAD27 \/ UTM zone 11N",
"base_crs":{
"name":"NAD27",
"datum":{
"type":"GeodeticReferenceFrame",
"name":"North American Datum 1927",
"ellipsoid":{
"name":"Clarke 1866",
"semi_major_axis":6378206.4,
"inverse_flattening":294.978698213898
}
},
"coordinate_system":{
"subtype":"ellipsoidal",
"axis":[
{
"name":"Geodetic latitude",
"abbreviation":"Lat",
"direction":"north",
"unit":"degree"
},
{
"name":"Geodetic longitude",
"abbreviation":"Lon",
"direction":"east",
"unit":"degree"
}
]
},
"id":{
"authority":"EPSG",
"code":4267
}
},
"conversion":{
"name":"UTM zone 11N",
"method":{
"name":"Transverse Mercator",
"id":{
"authority":"EPSG",
"code":9807
}
},
"parameters":[
{
"name":"Latitude of natural origin",
"value":0,
"unit":"degree",
"id":{
"authority":"EPSG",
"code":8801
}
},
{
"name":"Longitude of natural origin",
"value":-117,
"unit":"degree",
"id":{
"authority":"EPSG",
"code":8802
}
},
{
"name":"Scale factor at natural origin",
"value":0.9996,
"unit":"unity",
"id":{
"authority":"EPSG",
"code":8805
}
},
{
"name":"False easting",
"value":500000,
"unit":"metre",
"id":{
"authority":"EPSG",
"code":8806
}
},
{
"name":"False northing",
"value":0,
"unit":"metre",
"id":{
"authority":"EPSG",
"code":8807
}
}
]
},
"coordinate_system":{
"subtype":"Cartesian",
"axis":[
{
"name":"Easting",
"abbreviation":"",
"direction":"east",
"unit":"metre"
},
{
"name":"Northing",
"abbreviation":"",
"direction":"north",
"unit":"metre"
}
]
},
"id":{
"authority":"EPSG",
"code":26711
}
},
"url":"http:\/\/www.opengis.net\/def\/crs\/EPSG\/0\/26711"
}
}
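The reading order described above (url, then wkt, then projjson) can be illustrated with a short Python sketch. It only mirrors the documented fallback order on a plain dictionary; `pick_crs_definition` is a hypothetical helper, the attribute values are shortened placeholders, and nothing here reflects how the driver is implemented internally.

```python
import json

# Order in which the keys are consulted on reading, as stated in the text.
READ_ORDER = ("url", "wkt", "projjson")


def pick_crs_definition(attributes):
    """Return (key, value) for the first _CRS key found, or None."""
    crs = attributes.get("_CRS")
    if not isinstance(crs, dict):
        return None
    for key in READ_ORDER:
        if key in crs:
            return key, crs[key]
    return None


# Placeholder attribute sets; the wkt value is a stand-in, not real WKT.
with_url = json.loads(
    '{"_CRS": {"wkt": "<WKT string>", '
    '"url": "http://www.opengis.net/def/crs/EPSG/0/26711"}}'
)
wkt_only = json.loads('{"_CRS": {"wkt": "<WKT string>"}}')
no_crs = json.loads('{"units": "m"}')

print(pick_crs_definition(with_url))
print(pick_crs_definition(wkt_only))
print(pick_crs_definition(no_crs))
```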
Particularities of the classic raster API
If the Zarr dataset contains one single array with 2 dimensions, it will be exposed as a regular GDALDataset when using the classic raster API. If the dataset contains more than one such single array, or arrays with 3 or more dimensions, the driver will list subdatasets to access each array and/or 2D slices within arrays with 3 or more dimensions.
Open options
The following dataset open options are available:
USE_ZMETADATA=[YES/NO]: Defaults to YES. Whether to use consolidated metadata from .zmetadata (Zarr V2 only).
CACHE_TILE_PRESENCE=[YES/NO]: Defaults to NO. Whether to establish an initial listing of present tiles. This cached listing will be stored in a .gmac file next to the .zarray / .array.json.gmac file if they can be written. Otherwise the GDAL_PAM_PROXY_DIR config option should be set to an existing directory where those cached files will be stored. Once the cached listing has been established, the open option no longer needs to be specified. Note: the runtime of this option can be in minutes or more for large datasets stored on remote file systems. For network file systems, this will rarely work with /vsicurl/ itself, but rather with cloud-based file systems (such as /vsis3/, /vsigs/, /vsiaz/, etc.) that have a dedicated directory listing operation.
MULTIBAND=[YES/NO]: (GDAL >= 3.8) Defaults to YES. Whether to expose > 3D arrays as GDAL multiband datasets (when using the classic 2D API).
DIM_X=<string> or <integer>: (GDAL >= 3.8) Name or index of the X dimension (only used when MULTIBAND=YES and with the classic 2D API). If not specified, deduced from dimension type (when equal to "HORIZONTAL_X"), or the last dimension (i.e. fastest varying one), if no dimension type found.
DIM_Y=<string> or <integer>: (GDAL >= 3.8) Name or index of the Y dimension (only used when MULTIBAND=YES and with the classic 2D API). If not specified, deduced from dimension type (when equal to "HORIZONTAL_Y"), or the before last dimension, if no dimension type found.
LOAD_EXTRA_DIM_METADATA_DELAY=<integer> or "unlimited": (GDAL >= 3.8) Defaults to 5. Maximum delay in seconds allowed to set the DIM_{dimname}_VALUE band metadata items from the indexing variable of the dimensions. unlimited can be used to mean unlimited delay. Can also be defined globally with the GDAL_LOAD_EXTRA_DIM_METADATA_DELAY configuration option. Only used through the classic 2D API.
Multi-threaded caching
The driver implements the GDALMDArray::AdviseRead() method. It performs multi-threaded decoding of the tiles that intersect the specified area of interest. A sufficient cache size must be specified. The call is blocking.
The options that can be passed to the method are:
CACHE_SIZE=value_in_byte: Maximum RAM to use, expressed in number of bytes. If not specified, half of the remaining GDAL block cache size will be used. Note: the caching mechanism of Zarr array will not update this remaining block cache size.
NUM_THREADS=integer or ALL_CPUS: Number of threads to use in parallel. If not specified, the GDAL_NUM_THREADS configuration option will be taken into account.
Creation options
The following options are creation options of the classic raster API, or array-level creation options for the multidimensional API (must be prefixed with ARRAY: when using gdalmdimtranslate):
COMPRESS=[NONE/BLOSC/ZLIB/GZIP/LZMA/ZSTD/LZ4]: Defaults to NONE. Compression method.
FILTER=[NONE/DELTA]: Defaults to NONE. Filter method. Only supported for FORMAT=ZARR_V2.
BLOCKSIZE=<string>: Comma separated list of chunk size along each dimension. If not specified, the fastest varying 2 dimensions (the last ones) use a block size of 256 samples, and the other ones of 1.
CHUNK_MEMORY_LAYOUT=[C/F]: Defaults to C. Whether to use C (row-major) order or F (column-major) order in encoded chunks. Only useful when using compression. Changing to F may improve compression depending on array content.
STRING_FORMAT=[ASCII/UNICODE]: Defaults to ASCII. Whether to use the numpy type for ASCII-only strings or Unicode strings. Unicode strings take 4 bytes per character.
DIM_SEPARATOR=<string>: Dimension separator in chunk filenames. Defaults to decimal point for ZarrV2 and slash for ZarrV3.
BLOSC_CNAME=[blosclz/lz4/lz4hc/snappy/zlib/zstd]: Defaults to lz4. Blosc compressor name. Only used when COMPRESS=BLOSC.
BLOSC_CLEVEL=1-9: Defaults to 5. Blosc compression level. Only used when COMPRESS=BLOSC.
BLOSC_SHUFFLE=[NONE/BYTE/BIT]: Defaults to BYTE. Type of shuffle algorithm. Only used when COMPRESS=BLOSC.
BLOSC_BLOCKSIZE=<integer>: Defaults to 0. Blosc block size. Only used when COMPRESS=BLOSC.
BLOSC_NUM_THREADS=[<integer>/ALL_CPUS]: Defaults to 1. Number of worker threads for compression. Only used when COMPRESS=BLOSC.
ZLIB_LEVEL=1-9: Defaults to 6. ZLib compression level. Only used when COMPRESS=ZLIB.
GZIP_LEVEL=1-9: Defaults to 6. GZip compression level. Only used when COMPRESS=GZIP.
LZMA_PRESET=0-9: Defaults to 6. LZMA compression level. Only used when COMPRESS=LZMA.
LZMA_DELTA=<integer>: Defaults to 1. Delta distance in byte. Only used when COMPRESS=LZMA.
ZSTD_LEVEL=1-22: Defaults to 13. ZSTD compression level. Only used when COMPRESS=ZSTD.
LZ4_ACCELERATION=<integer> [1-]: Defaults to 1 (the fastest). LZ4 acceleration factor. The higher, the less compressed. Only used when COMPRESS=LZ4.
DELTA_DTYPE=<string>: Data type following NumPy array protocol type string (typestr) format (https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface). Only u1, i1, u2, i2, u4, i4, u8, i8, f4, f8, potentially prefixed with the endianness flag (< for little endian, > for big endian) are supported. Only used when FILTER=DELTA. Defaults to the native data type.
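The default chunking rule stated for BLOCKSIZE (256 samples along the last two dimensions, 1 along the others) can be sketched as follows. This is an illustration of the rule as worded in the text, not driver code: the helper names are hypothetical, and any clamping of the block size to the array extent is not covered by the text and is therefore left out.

```python
import math


def default_blocksize(ndim):
    """Chunk sizes per dimension: 256 for the last two, 1 for the others."""
    return [256 if i >= ndim - 2 else 1 for i in range(ndim)]


def blocksize_option(chunks):
    """Format chunk sizes as a BLOCKSIZE creation option string."""
    return "BLOCKSIZE=" + ",".join(str(c) for c in chunks)


def chunk_count(shape, chunks):
    """Number of chunks needed to cover an array of the given shape."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


# Example shape (time, y, x) chosen for illustration only.
shape = (10, 1000, 2000)
chunks = default_blocksize(len(shape))
option = blocksize_option(chunks)
total = chunk_count(shape, chunks)
print(chunks, option, total)
```

With the multidimensional tools, such a string would be passed with the ARRAY: prefix, in the same way as ARRAY:COMPRESS in the examples section.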
The following options are creation options of the classic raster API, or dataset-level creation options for the multidimensional API:
FORMAT=[ZARR_V2/ZARR_V3]: Defaults to ZARR_V2.
CREATE_ZMETADATA=[YES/NO]: Defaults to YES. Whether to create consolidated metadata into .zmetadata (Zarr V2 only).
The following options are creation options of the classic raster API only:
ARRAY_NAME=<string>: Array name. If not specified, deduced from the filename.
APPEND_SUBDATASET=[YES/NO]: Defaults to NO. Whether to append the new dataset to an existing Zarr hierarchy.
SINGLE_ARRAY=[YES/NO]: (GDAL >= 3.8) Defaults to YES. Whether to write a multi-band dataset as a 3D Zarr array. If false, one 2D Zarr array per band will be written.
INTERLEAVE=[BAND/PIXEL]: (GDAL >= 3.8) Defaults to BAND. When writing a multi-band dataset as a 3D Zarr array, whether the band dimension should be the first one/slowest varying one (BAND), or the last one/fastest varying one (PIXEL). The default value is BAND in Create() mode. In CreateCopy() mode, the default value is the value of the INTERLEAVE metadata item of the IMAGE_STRUCTURE metadata domain of the source dataset, if set.
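The effect of INTERLEAVE on the dimension order of the 3D array can be pictured with a small sketch. The function below is hypothetical and only restates the rule from the text; placing Y before X among the two spatial dimensions is an assumption, chosen to match the DIM_X/DIM_Y defaults described in the open options.

```python
def zarr_array_shape(bands, height, width, interleave="BAND"):
    """Shape of the 3D array for a multi-band raster.

    Assumption: the spatial dimensions are ordered (Y, X).
    """
    if interleave == "BAND":
        # Band dimension first, i.e. slowest varying.
        return (bands, height, width)
    if interleave == "PIXEL":
        # Band dimension last, i.e. fastest varying.
        return (height, width, bands)
    raise ValueError(f"unsupported INTERLEAVE value: {interleave!r}")


band_shape = zarr_array_shape(3, 512, 1024, "BAND")
pixel_shape = zarr_array_shape(3, 512, 1024, "PIXEL")
print(band_shape, pixel_shape)
```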
Examples
Get information on the dataset using the multidimensional tools:
gdalmdiminfo my.zarr
Convert a netCDF file to Zarr using the multidimensional tools:
gdalmdimtranslate in.nc out.zarr -co ARRAY:COMPRESS=GZIP
Convert a 2D slice (the one at index 0 of the non-2D dimension) of a 3D array to GeoTIFF:
gdal_translate 'ZARR:"my.zarr":/group/myarray:0' out.tif
Note
The single quoting around the connection string is specific to the Bash shell to make sure that the double quoting is preserved.