Zarr
Added in version 3.4.
Driver short name
Zarr
Build dependencies
Built-in by default, but liblz4, libxz (lzma), libzstd and libblosc are strongly recommended to get all compressors.
Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. This format is supported for read and write access, using either the traditional 2D raster API or the multidimensional API.
The driver supports the Zarr V2 and V3 specifications. It also supports Kerchunk reference files since GDAL 3.11.
Local and cloud storage (see GDAL Virtual File Systems (compressed, network hosted, etc...): /vsimem, /vsizip, /vsitar, /vsicurl, ...) are supported in read and write.
Driver capabilities
Supports Create()
This driver supports the GDALDriver::Create() operation
Supports CreateCopy()
This driver supports the GDALDriver::CreateCopy() operation
Supports Georeferencing
This driver supports georeferencing
Supports multidimensional API
This driver supports the Multidimensional Raster Data Model
Supports VirtualIO
This driver supports virtual I/O operations (/vsimem/, etc.)
Concepts
A Zarr dataset is made of a hierarchy of nodes, with intermediate nodes being
groups (GDALGroup), and leaves being arrays (GDALMDArray).
Dataset name
For Zarr V2, the dataset name recognized by the Open() method of the driver is
a directory that contains a .zgroup file, a .zarray file or a
.zmetadata file (consolidated metadata). For faster exploration,
the driver will use consolidated metadata by default when found.
For Zarr V3, the dataset name recognized by the Open() method of the driver is
a directory that contains a zarr.json file (root of the dataset).
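The file-name rule described above can be sketched in a few lines of Python. This only illustrates the rule, it is not the driver's actual implementation: the function name `guess_zarr_version` is hypothetical, and checking for `zarr.json` before the V2 files is an arbitrary choice of this sketch.

```python
import os
import tempfile

# Marker files named in the text above.
V2_MARKERS = (".zgroup", ".zarray", ".zmetadata")
V3_MARKER = "zarr.json"


def guess_zarr_version(directory):
    """Return 3, 2 or None depending on which marker files are present."""
    names = set(os.listdir(directory))
    if V3_MARKER in names:  # assumption: V3 is tested first
        return 3
    if any(marker in names for marker in V2_MARKERS):
        return 2
    return None


# Usage: build two throw-away directories and classify them.
with tempfile.TemporaryDirectory() as tmp:
    v2_dir = os.path.join(tmp, "v2.zarr")
    v3_dir = os.path.join(tmp, "v3.zarr")
    os.makedirs(v2_dir)
    os.makedirs(v3_dir)
    open(os.path.join(v2_dir, ".zgroup"), "w").close()
    open(os.path.join(v3_dir, "zarr.json"), "w").close()
    print(guess_zarr_version(v2_dir), guess_zarr_version(v3_dir))
```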
For datasets on file systems where file listing is not reliable, as often with
/vsicurl/, it is also possible to prefix the directory name with ZARR:,
in which case the /vsicurl/-prefixed URL must be surrounded with double quotes,
e.g. ZARR:"/vsicurl/https://example.org/foo.zarr". Note that when passing such a
string in a command line shell, extra quoting might be necessary to preserve the
double-quoting.
For example with a Bash shell, the whole connection string needs to be surrounded with single-quote characters:
gdalmdiminfo 'ZARR:"/vsicurl/https://example.org/foo.zarr"'
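When the command line is assembled from a script, the quoting can be delegated to the Python standard library. The sketch below assumes a POSIX shell; the helper name `zarr_connection_string` is hypothetical.

```python
import shlex


def zarr_connection_string(path):
    """Wrap a path in the ZARR:"..." form described above."""
    return 'ZARR:"{}"'.format(path)


conn = zarr_connection_string("/vsicurl/https://example.org/foo.zarr")
# shlex.quote() adds the single quotes a POSIX shell needs
# so that the inner double quotes reach GDAL untouched.
command = "gdalmdiminfo " + shlex.quote(conn)
print(command)
```

The printed command is the same as the Bash example above.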
Zarr stores without directory listing
Added in version 3.12.
Sometimes remote Zarr stores don't have a reliable directory listing. In such cases, one can point to one of the following metadata files for GDAL to detect the correct driver to open the Zarr store:
zarr.json
.zmetadata
.zgroup
.zarray
Multiscales (overviews / pyramids)
Added in version 3.13.
The driver supports reading the Zarr multiscales convention for Zarr V3 datasets. This convention describes a pyramid of arrays at decreasing resolutions within a group hierarchy.
When a Zarr V3 array has a parent (or grandparent) group whose
attributes contain a zarr_conventions entry with the multiscales UUID and
a multiscales attribute with a layout array, the driver exposes
lower-resolution levels as overviews via GDALMDArray::GetOverview()
and the classic raster band overview API.
Overviews can be generated using GDALMDArray::BuildOverviews() or
equivalently via GDALDataset::BuildOverviews() on datasets obtained
through GDALMDArray::AsClassicDataset(). For arrays with more than
two dimensions, only the spatial dimensions are downsampled; non-spatial
dimensions (e.g., time) are preserved. Each overview level is resampled
sequentially from the previous level (e.g., 4x from 2x, not from base).
Codec settings are inherited from the source array. Calling
BuildOverviews replaces all existing overviews (unlike the default
GDALDataset::BuildOverviews behavior which adds new levels).
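The effect on array shapes can be illustrated with a small calculation. This is a sketch only: the function name is hypothetical, and the use of ceiling division for odd sizes is an assumption, since the text above does not state the rounding rule.

```python
def overview_shapes(shape, spatial_axes, factors):
    """Shapes of successive overview levels; only spatial axes shrink.

    Each level is derived from the previous one (4x from 2x, and so on).
    Assumes every factor is a multiple of the previous one and that
    sizes are rounded up (ceiling division).
    """
    shapes = []
    current = list(shape)
    previous_factor = 1
    for factor in sorted(factors):
        step = factor // previous_factor  # ratio relative to the previous level
        for axis in spatial_axes:
            current[axis] = -(-current[axis] // step)  # ceiling division
        shapes.append(tuple(current))
        previous_factor = factor
    return shapes


# A (time, y, x) array: the time axis (index 0) is preserved.
levels = overview_shapes((12, 1024, 2048), spatial_axes=(1, 2), factors=[2, 4, 8])
print(levels)
```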
Kerchunk reference stores
Added in version 3.11.
The driver supports reading "virtual" Zarr datasets where the Zarr logical hierarchy does not directly translate into a hierarchy of files on the filesystem, but through a file that contains a store of references to chunk locations (possibly remote). Such stores are generated by the Python Kerchunk library.
There are 2 types of Kerchunk reference stores:
JSON reference stores, where the Zarr dataset is entirely described in a single JSON file. This comes in two versions: Version 0 and Version 1.
Such stores can be opened with:
gdalmdiminfo "/vsikerchunk_json_ref//path/to/reference.json"
or sometimes, when the header of the file is sufficient to be recognized as a Kerchunk reference store, with just:
gdalmdiminfo "/path/to/reference.json"
Note that the "templates" and "gen" features of Version 1 are not supported.
Parquet reference stores, where the Zarr dataset is described in a
.zmetadata file that contains only the JSON definitions of the groups and arrays. The chunks themselves are cataloged in Parquet files in subdirectories: https://fsspec.github.io/kerchunk/spec.html#parquet-references
Such stores can be opened with:
gdalmdiminfo 'ZARR:"/path/to/directory/where/.zmetadata/is/located"'
That is exactly like a regular Zarr dataset.
To be able to read pixel values, the (Geo)Parquet driver must be available.
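For JSON reference stores, it can be useful to check beforehand which version a file uses and whether it relies on the unsupported "templates" or "gen" features. The sketch below follows the layout of the Kerchunk reference specification (Version 1 has a "version" key and a "refs" mapping, Version 0 is a flat mapping); the function name and the sample references are made up for illustration.

```python
import json


def describe_kerchunk_json(text):
    """Classify a Kerchunk JSON reference document.

    Returns (version, unsupported), where unsupported lists the
    Version 1 features ("templates", "gen") used by the document.
    """
    doc = json.loads(text)
    if isinstance(doc, dict) and doc.get("version") == 1 and "refs" in doc:
        unsupported = [key for key in ("templates", "gen") if doc.get(key)]
        return 1, unsupported
    # Version 0: a flat mapping of Zarr keys to inline data or [url, offset, size].
    return 0, []


# Hypothetical sample documents.
v0 = json.dumps({".zgroup": '{"zarr_format": 2}',
                 "a/0": ["s3://bucket/file.nc", 1000, 200]})
v1 = json.dumps({"version": 1,
                 "templates": {"u": "s3://bucket"},
                 "refs": {".zgroup": '{"zarr_format": 2}'}})
print(describe_kerchunk_json(v0))
print(describe_kerchunk_json(v1))
```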
As JSON reference stores can be very large and slow to parse, the CACHE_KERCHUNK_JSON
open option can be set to YES to ask to generate and use a local Parquet
reference store that is cached in $HOME/.gdal/zarr_kerchunk_cache.
This implies the (Geo)Parquet driver is available.
The driver does not rotate cached stores in the local cache. It is the responsibility of the user to manage its content and remove obsolete datasets.
It is also possible to convert a JSON reference store into a Parquet one using
the CONVERT_TO_KERCHUNK_PARQUET_REFERENCE creation option set to YES.
For example:
gdal_translate -of ZARR -co CONVERT_TO_KERCHUNK_PARQUET_REFERENCE=YES store.json store.parq
Compression methods
The compression methods available depend on how GDAL is built, and also on how libblosc itself is built.
A full-feature build will show:
$ gdalinfo --format Zarr
[...]
Other metadata items:
COMPRESSORS=blosc,zlib,gzip,lzma,zstd,lz4
BLOSC_COMPRESSORS=blosclz,lz4,lz4hc,snappy,zlib,zstd
For specific uses, it is also possible to register at run-time extra compressors
and decompressors with CPLRegisterCompressor() and CPLRegisterDecompressor().
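A script can check for a given compressor before choosing creation options by parsing the metadata items shown above. The sketch below works on the text output of the command and uses a hypothetical helper name; it does not call GDAL itself.

```python
def parse_compressors(gdalinfo_output):
    """Extract the COMPRESSORS and BLOSC_COMPRESSORS lists from the text output."""
    found = {}
    for line in gdalinfo_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in ("COMPRESSORS", "BLOSC_COMPRESSORS"):
            found[key] = value.split(",")
    return found


# Sample text copied from the full-feature build output above.
sample = """Other metadata items:
  COMPRESSORS=blosc,zlib,gzip,lzma,zstd,lz4
  BLOSC_COMPRESSORS=blosclz,lz4,lz4hc,snappy,zlib,zstd
"""
available = parse_compressors(sample)
print("zstd" in available["COMPRESSORS"])
```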
XArray _ARRAY_DIMENSIONS
The driver supports the _ARRAY_DIMENSIONS special attribute used by
XArray
to store the dimension names of an array.
NCZarr extensions
The driver supports the NCZarr v2 extensions for storing the dimension names of an array (read-only).
Georeferencing encoding (CRS and geotransformation matrix)
The Zarr specification has no provision for spatial reference system encoding. Several conventions are supported, as described below.
GDAL convention
Before GDAL 3.13, the only convention supported both in reading and writing was
the GDAL one, using a _CRS attribute. The geotransformation matrix, when
no rotation terms are present, is encoded as X and Y one-dimensional
coordinate arrays.
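As an illustration of how a geotransformation matrix without rotation maps to two one-dimensional arrays, the sketch below derives X and Y values from a GDAL-style geotransform. The function name is hypothetical, and the use of pixel-centre coordinates is an assumption of this sketch, not something stated above.

```python
def coordinate_arrays(geotransform, width, height):
    """Return (x, y) coordinate lists for a geotransform without rotation.

    Values are taken at pixel centres (assumption of this sketch).
    """
    origin_x, pixel_w, rot_x, origin_y, rot_y, pixel_h = geotransform
    if rot_x != 0 or rot_y != 0:
        raise ValueError("rotation terms present: no 1D coordinate arrays")
    x = [origin_x + (col + 0.5) * pixel_w for col in range(width)]
    y = [origin_y + (row + 0.5) * pixel_h for row in range(height)]
    return x, y


# Hypothetical 20x20 raster with 60 m pixels.
x, y = coordinate_arrays((440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0), 20, 20)
print(x[0], y[0])
```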
The _CRS attribute is a dictionary that may contain one or
several of the following keys: url (using an OGC CRS URL), wkt (WKT:2019
is used by default on writing; WKT1 is also supported on reading) and projjson.
On reading, the driver uses url by default; if it is not found, it falls back to wkt
and then projjson.
Example:
{
"_CRS":{
"wkt":"PROJCRS[\"NAD27 \/ UTM zone 11N\",BASEGEOGCRS[\"NAD27\",DATUM[\"North American Datum 1927\",ELLIPSOID[\"Clarke 1866\",6378206.4,294.978698213898,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],ID[\"EPSG\",4267]],CONVERSION[\"UTM zone 11N\",METHOD[\"Transverse Mercator\",ID[\"EPSG\",9807]],PARAMETER[\"Latitude of natural origin\",0,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8801]],PARAMETER[\"Longitude of natural origin\",-117,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8802]],PARAMETER[\"Scale factor at natural origin\",0.9996,SCALEUNIT[\"unity\",1],ID[\"EPSG\",8805]],PARAMETER[\"False easting\",500000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8806]],PARAMETER[\"False northing\",0,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8807]]],CS[Cartesian,2],AXIS[\"easting\",east,ORDER[1],LENGTHUNIT[\"metre\",1]],AXIS[\"northing\",north,ORDER[2],LENGTHUNIT[\"metre\",1]],ID[\"EPSG\",26711]]",
"projjson":{
"$schema":"https:\/\/proj.org\/schemas\/v0.2\/projjson.schema.json",
"type":"ProjectedCRS",
"name":"NAD27 \/ UTM zone 11N",
"base_crs":{
"name":"NAD27",
"datum":{
"type":"GeodeticReferenceFrame",
"name":"North American Datum 1927",
"ellipsoid":{
"name":"Clarke 1866",
"semi_major_axis":6378206.4,
"inverse_flattening":294.978698213898
}
},
"coordinate_system":{
"subtype":"ellipsoidal",
"axis":[
{
"name":"Geodetic latitude",
"abbreviation":"Lat",
"direction":"north",
"unit":"degree"
},
{
"name":"Geodetic longitude",
"abbreviation":"Lon",
"direction":"east",
"unit":"degree"
}
]
},
"id":{
"authority":"EPSG",
"code":4267
}
},
"conversion":{
"name":"UTM zone 11N",
"method":{
"name":"Transverse Mercator",
"id":{
"authority":"EPSG",
"code":9807
}
},
"parameters":[
{
"name":"Latitude of natural origin",
"value":0,
"unit":"degree",
"id":{
"authority":"EPSG",
"code":8801
}
},
{
"name":"Longitude of natural origin",
"value":-117,
"unit":"degree",
"id":{
"authority":"EPSG",
"code":8802
}
},
{
"name":"Scale factor at natural origin",
"value":0.9996,
"unit":"unity",
"id":{
"authority":"EPSG",
"code":8805
}
},
{
"name":"False easting",
"value":500000,
"unit":"metre",
"id":{
"authority":"EPSG",
"code":8806
}
},
{
"name":"False northing",
"value":0,
"unit":"metre",
"id":{
"authority":"EPSG",
"code":8807
}
}
]
},
"coordinate_system":{
"subtype":"Cartesian",
"axis":[
{
"name":"Easting",
"abbreviation":"",
"direction":"east",
"unit":"metre"
},
{
"name":"Northing",
"abbreviation":"",
"direction":"north",
"unit":"metre"
}
]
},
"id":{
"authority":"EPSG",
"code":26711
}
},
"url":"http:\/\/www.opengis.net\/def\/crs\/EPSG\/0\/26711"
}
}
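The reading order described above (url, then wkt, then projjson) can be expressed as a simple lookup. This is a minimal sketch with a hypothetical function name; the WKT value is a placeholder, and the sketch only looks at which keys are present.

```python
def pick_crs_definition(crs_attribute):
    """Return (key, value) for the first key found, in the reading order above."""
    for key in ("url", "wkt", "projjson"):
        if key in crs_attribute:
            return key, crs_attribute[key]
    return None


attrs = {
    "_CRS": {
        "wkt": "PROJCRS[...]",  # placeholder, not a complete WKT string
        "url": "http://www.opengis.net/def/crs/EPSG/0/26711",
    }
}
print(pick_crs_definition(attrs["_CRS"]))
```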
SPATIAL_PROJ convention
Added in version 3.13.
Since GDAL 3.13, the Zarr spatial
and geo-proj conventions
are supported in reading, and in writing when the GEOREFERENCING_CONVENTION
creation option is set to SPATIAL_PROJ. X and Y coordinate arrays are
written only if the geotransformation matrix has no rotation terms.
Example:
{
"attributes": {
"proj:code": "EPSG:26711",
"spatial:bbox": [
440720.0,
3750120.0,
441920.0,
3751320.0
],
"spatial:transform_type": "affine",
"spatial:transform": [
60.0,
0.0,
440720.0,
0.0,
-60.0,
3751320.0
],
"spatial:registration": "pixel",
"spatial:dimensions": ["Y", "X"],
"zarr_conventions": [
{
"schema_url": "https://raw.githubusercontent.com/zarr-experimental/geo-proj/refs/tags/v1/schema.json",
"spec_url": "https://github.com/zarr-experimental/geo-proj/blob/v1/README.md",
"uuid": "f17cb550-5864-4468-aeb7-f3180cfb622f",
"name": "proj:",
"description": "Coordinate reference system information for geospatial data"
},
{
"schema_url": "https://raw.githubusercontent.com/zarr-conventions/spatial/refs/tags/v1/schema.json",
"spec_url": "https://github.com/zarr-conventions/spatial/blob/v1/README.md",
"uuid": "689b58e2-cf7b-45e0-9fff-9cfc0883d6b4",
"name": "spatial:",
"description": "Spatial coordinate information"
}
]
}
}
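To relate the spatial:transform and spatial:bbox values of this example to the GDAL geotransform, the following sketch may help. It assumes that the six coefficients are ordered [a, b, c, d, e, f] with x = a*col + b*row + c and y = d*col + e*row + f, which is consistent with the example values, and that the array is 20 by 20 pixels (a size not given in the example, chosen here so that the bounding box matches). Function names are hypothetical.

```python
def to_gdal_geotransform(spatial_transform):
    """Reorder [a, b, c, d, e, f] into GDAL's (c, a, b, f, d, e) layout."""
    a, b, c, d, e, f = spatial_transform
    return (c, a, b, f, d, e)


def bbox_from_transform(spatial_transform, width, height):
    """Bounding box [xmin, ymin, xmax, ymax] of a width x height raster."""
    a, b, c, d, e, f = spatial_transform
    corners = [(0, 0), (width, 0), (0, height), (width, height)]
    xs = [a * col + b * row + c for col, row in corners]
    ys = [d * col + e * row + f for col, row in corners]
    return [min(xs), min(ys), max(xs), max(ys)]


transform = [60.0, 0.0, 440720.0, 0.0, -60.0, 3751320.0]
print(to_gdal_geotransform(transform))
print(bbox_from_transform(transform, 20, 20))  # assumed 20x20 raster
```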
netCDF CF conventions
Added in version 3.9.
The driver supports reading a CRS using the CF conventions.
Particularities of the classic raster API
If the Zarr dataset contains one single array with 2 dimensions, it will be exposed as a regular GDALDataset when using the classic raster API. If the dataset contains more than one such single array, or arrays with 3 or more dimensions, the driver will list subdatasets to access each array and/or 2D slices within arrays with 3 or more dimensions.
Open options
The following dataset open options are available:
LIST_ALL_ARRAYS=[YES/NO]: (GDAL >= 3.11) Defaults to NO. In classic 2D mode, whether the subdataset list should include all arrays, including those with 0 or 1 dimension.
USE_CONSOLIDATED_METADATA=[YES/NO]: Defaults to YES. Whether to use consolidated metadata (from .zmetadata for Zarr V2, or root zarr.json for Zarr V3).
CACHE_TILE_PRESENCE=[YES/NO]: Defaults to NO. Whether to establish an initial listing of present tiles. This cached listing will be stored in a .gmac file next to the .zarray / .array.json.gmac file if they can be written. Otherwise the GDAL_PAM_PROXY_DIR config option should be set to an existing directory where those cached files will be stored. Once the cached listing has been established, the open option no longer needs to be specified. Note: the runtime of this option can be in minutes or more for large datasets stored on remote file systems. For network file systems, this will rarely work with /vsicurl/ itself, but rather with cloud-based file systems (such as /vsis3/, /vsigs/, /vsiaz/, etc.) that have a dedicated directory listing operation.
CACHE_KERCHUNK_JSON=[YES/NO]: (GDAL >= 3.11) Defaults to NO. Whether to use (and generate if needed) a local cache where Kerchunk JSON reference files are transformed into Kerchunk Parquet reference files for better efficiency.
MULTIBAND=[YES/NO]: (GDAL >= 3.8) Defaults to YES. Whether to expose > 3D arrays as GDAL multiband datasets (when using the classic 2D API).
DIM_X=<string> or <integer>: (GDAL >= 3.8) Name or index of the X dimension (only used when MULTIBAND=YES and with the classic 2D API). If not specified, deduced from dimension type (when equal to "HORIZONTAL_X"), or the last dimension (i.e. fastest varying one), if no dimension type found.
DIM_Y=<string> or <integer>: (GDAL >= 3.8) Name or index of the Y dimension (only used when MULTIBAND=YES and with the classic 2D API). If not specified, deduced from dimension type (when equal to "HORIZONTAL_Y"), or the before last dimension, if no dimension type found.
LOAD_EXTRA_DIM_METADATA_DELAY=<integer> or "unlimited": (GDAL >= 3.8) Defaults to 5. Maximum delay in seconds allowed to set the DIM_{dimname}_VALUE band metadata items from the indexing variable of the dimensions. unlimited can be used to mean unlimited delay. Can also be defined globally with the GDAL_LOAD_EXTRA_DIM_METADATA_DELAY configuration option. Only used through the classic 2D API.
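The default deduction rule described for DIM_X and DIM_Y can be written out as follows. This is a sketch of the rule as stated in the option descriptions, not the driver's code; the function name and the representation of dimensions as (name, type) tuples are assumptions.

```python
def deduce_xy_dimensions(dims):
    """dims: list of (name, type) tuples, slowest varying first.

    Returns (x_index, y_index) following the documented defaults:
    dimension type first, otherwise the last and before-last dimensions.
    """
    types = [dim_type for _, dim_type in dims]
    if "HORIZONTAL_X" in types:
        x_index = types.index("HORIZONTAL_X")
    else:
        x_index = len(dims) - 1
    if "HORIZONTAL_Y" in types:
        y_index = types.index("HORIZONTAL_Y")
    else:
        y_index = len(dims) - 2
    return x_index, y_index


# No dimension types: the last dimension is X, the one before it is Y.
print(deduce_xy_dimensions([("time", ""), ("lat", ""), ("lon", "")]))
# Dimension types take precedence over position.
print(deduce_xy_dimensions([("x", "HORIZONTAL_X"), ("y", "HORIZONTAL_Y"), ("band", "")]))
```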
Multi-threaded caching
Starting with GDAL 3.13, when the GDAL_NUM_THREADS configuration
option is set and a read request spans multiple chunks, the driver automatically
decodes chunks in parallel, similar to the GeoTIFF driver behavior
(since GDAL 3.6). No application changes are needed.
The driver also implements the GDALMDArray::AdviseRead() method for
explicit multi-threaded pre-fetching of tiles that intersect the area of
interest specified. A sufficient cache size must be specified. The call is
blocking.
The options that can be passed to the method are:
CACHE_SIZE=value_in_byte: Maximum RAM to use, expressed in number of bytes. If not specified, half of the remaining GDAL block cache size will be used. Note: the caching mechanism of Zarr array will not update this remaining block cache size.
NUM_THREADS=integer or ALL_CPUS: Number of threads to use in parallel. If not specified, the GDAL_NUM_THREADS configuration option will be taken into account.
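One way to pick a CACHE_SIZE value is to estimate the decoded size of all chunks that intersect the area of interest. The sketch below does this arithmetic; it is a rough estimate under the assumption that the cache must hold every intersected chunk uncompressed, which the text above does not spell out. Function names are hypothetical.

```python
def chunks_in_window(chunk_shape, start, count):
    """Number of chunks intersected by a window, per dimension."""
    per_dim = []
    for chunk, first, size in zip(chunk_shape, start, count):
        first_chunk = first // chunk
        last_chunk = (first + size - 1) // chunk
        per_dim.append(last_chunk - first_chunk + 1)
    return per_dim


def estimate_cache_size(chunk_shape, start, count, bytes_per_sample):
    """Bytes needed to hold every intersected chunk, decoded."""
    total_chunks = 1
    for n in chunks_in_window(chunk_shape, start, count):
        total_chunks *= n
    samples_per_chunk = 1
    for chunk in chunk_shape:
        samples_per_chunk *= chunk
    return total_chunks * samples_per_chunk * bytes_per_sample


# Hypothetical float32 array chunked as 1 x 256 x 256.
size = estimate_cache_size((1, 256, 256), start=(0, 100, 100),
                           count=(4, 1000, 1000), bytes_per_sample=4)
print("CACHE_SIZE={}".format(size))
```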
Creation options
The following options are creation options of the classic raster API, or
array-level creation options for the multidimensional API (must be prefixed
with ARRAY: using gdalmdimtranslate):
GEOREFERENCING_CONVENTION=[GDAL/SPATIAL_PROJ]: Defaults to GDAL. Which convention is used to write georeferencing information: geotransformation and CRS. The GDAL convention uses a _CRS attribute described above. The SPATIAL_PROJ convention, supported for both reading and writing since GDAL 3.13, uses the Zarr spatial and geo-proj conventions.
COMPRESS=[NONE/BLOSC/ZLIB/GZIP/LZMA/ZSTD/LZ4]: Defaults to NONE. Compression method. For FORMAT=ZARR_V3, only NONE, BLOSC, GZIP and ZSTD are supported.
FILTER=[NONE/DELTA]: Defaults to NONE. Filter method. Only supported for FORMAT=ZARR_V2.
BLOCKSIZE=<string>: Comma-separated list of chunk sizes along each dimension. If not specified, the fastest varying 2 dimensions (the last ones) use a block size of 256 samples, and the other ones a block size of 1.
SHARD_CHUNK_SHAPE=<string>: (GDAL >= 3.13) Comma-separated inner chunk dimensions for Zarr V3 sharded storage. When set, BLOCKSIZE defines the shard dimensions and this option defines the inner chunk dimensions within each shard. Each value must evenly divide the corresponding BLOCKSIZE dimension. For example, BLOCKSIZE=256,256 with SHARD_CHUNK_SHAPE=64,64 creates shards of 4x4=16 inner chunks.
CHUNK_MEMORY_LAYOUT=[C/F]: Defaults to C. Whether to use C (row-major) order or F (column-major) order in encoded chunks. Only useful when using compression. Changing to F may improve compression depending on array content.
STRING_FORMAT=[ASCII/UNICODE]: Defaults to ASCII. Whether to use the numpy type for ASCII-only strings or Unicode strings. Unicode strings take 4 bytes per character.
DIM_SEPARATOR=<string>: Dimension separator in chunk filenames. Defaults to decimal point for Zarr V2 and slash for Zarr V3.
BLOSC_CNAME=[blosclz/lz4/lz4hc/snappy/zlib/zstd]: Defaults to lz4. Blosc compressor name. Only used when COMPRESS=BLOSC.
BLOSC_CLEVEL=1-9: Defaults to 5. Blosc compression level. Only used when COMPRESS=BLOSC.
BLOSC_SHUFFLE=[NONE/BYTE/BIT]: Defaults to BYTE. Type of shuffle algorithm. Only used when COMPRESS=BLOSC.
BLOSC_BLOCKSIZE=<integer>: Defaults to 0. Blosc block size. Only used when COMPRESS=BLOSC.
BLOSC_NUM_THREADS=[<integer>/ALL_CPUS]: Defaults to 1. Number of worker threads for compression. Only used when COMPRESS=BLOSC.
ZLIB_LEVEL=1-9: Defaults to 6. ZLib compression level. Only used when COMPRESS=ZLIB.
GZIP_LEVEL=1-9: Defaults to 6. GZip compression level. Only used when COMPRESS=GZIP.
LZMA_PRESET=0-9: Defaults to 6. LZMA compression level. Only used when COMPRESS=LZMA.
LZMA_DELTA=<integer>: Defaults to 1. Delta distance in bytes. Only used when COMPRESS=LZMA.
ZSTD_LEVEL=1-22: Defaults to 13. ZSTD compression level. Only used when COMPRESS=ZSTD.
LZ4_ACCELERATION=<integer> [1-]: Defaults to 1. LZ4 acceleration factor. The higher, the less compressed. Only used when COMPRESS=LZ4.
DELTA_DTYPE=<string>: Data type following NumPy array protocol type string (typestr) format (https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface). Only u1, i1, u2, i2, u4, i4, u8, i8, f4, f8, potentially prefixed with the endianness flag (< for little endian, > for big endian) are supported. Only used when FILTER=DELTA. Defaults to the native data type.
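The relationship between BLOCKSIZE and SHARD_CHUNK_SHAPE, and the default BLOCKSIZE rule, can be checked with a few lines of arithmetic before launching a conversion. This is an illustrative sketch with hypothetical function names; requiring both options to have the same number of values is an assumption of the sketch.

```python
def parse_shape(option_value):
    """Turn a comma-separated option value such as "256,256" into integers."""
    return [int(v) for v in option_value.split(",")]


def inner_chunks_per_shard(blocksize, shard_chunk_shape):
    """Number of inner chunks in one shard, validating divisibility."""
    shard = parse_shape(blocksize)
    inner = parse_shape(shard_chunk_shape)
    if len(shard) != len(inner):  # assumption: same dimensionality required
        raise ValueError("BLOCKSIZE and SHARD_CHUNK_SHAPE must have the same number of values")
    total = 1
    for shard_dim, inner_dim in zip(shard, inner):
        if shard_dim % inner_dim != 0:
            raise ValueError("{} does not evenly divide {}".format(inner_dim, shard_dim))
        total *= shard_dim // inner_dim
    return total


def default_blocksize(ndim):
    """Default rule from the text: 256 for the last two dimensions, 1 elsewhere."""
    return [1] * max(ndim - 2, 0) + [256] * min(ndim, 2)


print(inner_chunks_per_shard("256,256", "64,64"))
print(default_blocksize(3))
```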
The following options are creation options of the classic raster API, or dataset-level creation options for the multidimensional API:
FORMAT=[ZARR_V2/ZARR_V3]: Defaults to ZARR_V2.
CREATE_CONSOLIDATED_METADATA=[YES/NO]: Defaults to YES. Whether to create consolidated metadata (into .zmetadata for Zarr V2, or root zarr.json for Zarr V3).
CONVERT_TO_KERCHUNK_PARQUET_REFERENCE=[YES/NO]: (GDAL >= 3.11) Defaults to NO. Whether to convert a Kerchunk JSON reference store into a Kerchunk Parquet reference store.
The following options are creation options of the classic raster API only:
ARRAY_NAME=<string>: Array name. If not specified, deduced from the filename.
APPEND_SUBDATASET=[YES/NO]: Defaults to NO. Whether to append the new dataset to an existing Zarr hierarchy.
SINGLE_ARRAY=[YES/NO]: (GDAL >= 3.8) Defaults to YES. Whether to write a multi-band dataset as a 3D Zarr array. If false, one 2D Zarr array per band will be written.
INTERLEAVE=[BAND/PIXEL]: (GDAL >= 3.8) Defaults to BAND. When writing a multi-band dataset as a 3D Zarr array, whether the band dimension should be the first one/slowest varying one (BAND), or the last one/fastest varying one (PIXEL). The default value is BAND in Create() mode. In CreateCopy() mode, the default value is the value of the INTERLEAVE metadata item of the IMAGE_STRUCTURE metadata domain of the source dataset, if set. See Multiband pixel organization (INTERLEAVE metadata item) for more details.
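The combined effect of SINGLE_ARRAY and INTERLEAVE on the shapes of the written arrays can be summarised as below. This is a sketch with a hypothetical function name; placing Y before X in the spatial dimensions is an assumption of the sketch.

```python
def zarr_array_shapes(bands, height, width, single_array=True, interleave="BAND"):
    """Shapes of the Zarr arrays written for a multi-band raster."""
    if not single_array:
        return [(height, width)] * bands  # one 2D array per band
    if interleave == "BAND":
        return [(bands, height, width)]   # band is the slowest varying dimension
    if interleave == "PIXEL":
        return [(height, width, bands)]   # band is the fastest varying dimension
    raise ValueError("INTERLEAVE must be BAND or PIXEL")


print(zarr_array_shapes(3, 512, 1024))
print(zarr_array_shapes(3, 512, 1024, interleave="PIXEL"))
print(zarr_array_shapes(3, 512, 1024, single_array=False))
```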
Examples
Get information on the dataset using the multidimensional tools:
gdalmdiminfo my.zarr
Get information on the dataset using the multidimensional tools when there is no directory listing available or reliable:
gdalmdiminfo /vsicurl/https://example.com/my.zarr/.zmetadata
Convert a netCDF file to ZARR using the multidimensional tools:
gdalmdimtranslate in.nc out.zarr -co ARRAY:COMPRESS=GZIP
Convert a 2D slice (the one at index 0 of the non-2D dimension) of a 3D array to GeoTIFF:
gdal_translate 'ZARR:"my.zarr":/group/myarray:0' out.tif
Note
The single quoting around the connection string is specific to the Bash shell to make sure that the double quoting is preserved.
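When conversions are scripted, the ARRAY: prefix for array-level creation options is easy to forget. The sketch below assembles a gdalmdimtranslate command line from two dictionaries, passing dataset-level options without a prefix. The helper name and the option values are illustrative, and the command is only printed, not executed.

```python
import shlex


def creation_option_args(array_options=None, dataset_options=None):
    """Build -co arguments, prefixing array-level options with ARRAY:."""
    args = []
    for name, value in (dataset_options or {}).items():
        args += ["-co", "{}={}".format(name, value)]
    for name, value in (array_options or {}).items():
        args += ["-co", "ARRAY:{}={}".format(name, value)]
    return args


argv = ["gdalmdimtranslate", "in.nc", "out.zarr"] + creation_option_args(
    array_options={"COMPRESS": "ZSTD", "ZSTD_LEVEL": 9},
    dataset_options={"FORMAT": "ZARR_V3"},
)
# Quote each argument so the line can be pasted into a POSIX shell.
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)
```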