Zarr

Added in version 3.4.

Driver short name

Zarr

Build dependencies

Built-in by default, but liblz4, libxz (lzma), libzstd and libblosc are strongly recommended to get all compressors.

Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. This format is supported for read and write access, using either the traditional 2D raster API or the newer multidimensional API.

The driver supports the Zarr V2 specification, and has experimental support for the in-progress Zarr V3 specification.

Warning

The Zarr V3 implementation in GDAL versions before 3.8 is incompatible with the latest evolutions of the Zarr V3 specification. GDAL 3.8 is compatible with the Zarr V3 specification as of 2023-May-7, and is not interoperable with Zarr V3 datasets produced by earlier GDAL versions.

Local and cloud storage (see GDAL Virtual File Systems (compressed, network hosted, etc...): /vsimem, /vsizip, /vsitar, /vsicurl, ...) are supported in read and write.

Driver capabilities

Supports Create()

This driver supports the GDALDriver::Create() operation

Supports Georeferencing

This driver supports georeferencing

Supports multidimensional API

This driver supports the Multidimensional Raster Data Model

Supports VirtualIO

This driver supports virtual I/O operations (/vsimem/, etc.)

Concepts

A Zarr dataset is made of a hierarchy of nodes, with intermediate nodes being groups (GDALGroup), and leaves being arrays (GDALMDArray).

Dataset name

For Zarr V2, the dataset name recognized by the Open() method of the driver is a directory that contains a .zgroup file, a .zarray file or a .zmetadata file (consolidated metadata). For faster exploration, the driver will use consolidated metadata by default when found.

For Zarr V3, the dataset name recognized by the Open() method of the driver is a directory that contains a zarr.json file (root of the dataset).
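The naming rules above can be illustrated with a short Python sketch. It only mimics the marker-file rules described in this section: the helper name guess_zarr_version is hypothetical, the order of the checks is an arbitrary choice, and none of this is the driver's actual implementation.

```python
import tempfile
from pathlib import Path

# Marker files named in the text above.
V2_MARKERS = (".zgroup", ".zarray", ".zmetadata")
V3_MARKER = "zarr.json"


def guess_zarr_version(directory):
    """Return 2, 3 or None depending on which marker files are present.

    Hypothetical helper, not part of GDAL. Checking V3 first is an
    arbitrary choice made for this sketch.
    """
    root = Path(directory)
    if (root / V3_MARKER).is_file():
        return 3
    if any((root / name).is_file() for name in V2_MARKERS):
        return 2
    return None


# Build two throwaway directories to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    v2_dir = Path(tmp) / "v2.zarr"
    v2_dir.mkdir()
    (v2_dir / ".zgroup").write_text('{"zarr_format": 2}')

    v3_dir = Path(tmp) / "v3.zarr"
    v3_dir.mkdir()
    (v3_dir / "zarr.json").write_text("{}")

    v2_guess = guess_zarr_version(v2_dir)
    v3_guess = guess_zarr_version(v3_dir)
    empty_guess = guess_zarr_version(tmp)

print(v2_guess, v3_guess, empty_guess)
```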

For datasets on file systems where file listing is not reliable, as is often the case with /vsicurl/, it is also possible to prefix the directory name with ZARR:, in which case the /vsicurl/-prefixed URL must be surrounded with double quotes, e.g. ZARR:"/vsicurl/https://example.org/foo.zarr". Note that when passing such a string in a command line shell, extra quoting might be necessary to preserve the double quotes.

For example with a Bash shell, the whole connection string needs to be surrounded with single-quote characters:

gdalmdiminfo 'ZARR:"/vsicurl/https://example.org/foo.zarr"'
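When such a command line is assembled from a script, the quoting can be delegated to a library. The following Python sketch is one possible way to do it, assuming a POSIX shell such as Bash; zarr_connection_string is a hypothetical helper name, while shlex.quote is part of the Python standard library.

```python
import shlex


def zarr_connection_string(path):
    """Wrap a directory name or /vsicurl/ URL as ZARR:"<path>".

    Hypothetical helper, not part of GDAL.
    """
    return f'ZARR:"{path}"'


conn = zarr_connection_string("/vsicurl/https://example.org/foo.zarr")

# shlex.quote() adds the single quotes needed to keep the double quotes
# intact when the string goes through a POSIX shell.
command = "gdalmdiminfo " + shlex.quote(conn)
print(command)
```

When the string is passed directly to an API rather than through a shell, only the inner double quotes are needed.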

Compression methods

The compression methods available depend on how GDAL was built, and on how libblosc was built too.

A full-feature build will show:

$ gdalinfo --format Zarr

[...]

  Other metadata items:
    COMPRESSORS=blosc,zlib,gzip,lzma,zstd,lz4
    BLOSC_COMPRESSORS=blosclz,lz4,lz4hc,snappy,zlib,zstd

For specific uses, it is also possible to register at run-time extra compressors and decompressors with CPLRegisterCompressor() and CPLRegisterDecompressor().

XArray _ARRAY_DIMENSIONS

The driver supports the _ARRAY_DIMENSIONS special attribute used by XArray to store the dimension names of an array.
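As an illustration, the Python sketch below pairs the dimension names from _ARRAY_DIMENSIONS with the array shape. It assumes the Zarr V2 layout, where array metadata lives in a .zarray file and attributes in a .zattrs file; the sample metadata and the helper name dimension_sizes are made up for this example.

```python
import json

# Made-up contents of a Zarr V2 .zarray and .zattrs file pair.
zarray_text = (
    '{"zarr_format": 2, "shape": [12, 180, 360], '
    '"chunks": [1, 180, 360], "dtype": "<f4"}'
)
zattrs_text = '{"_ARRAY_DIMENSIONS": ["time", "lat", "lon"]}'


def dimension_sizes(zarray_json, zattrs_json):
    """Map each dimension name to its size, or return None when the
    attribute is missing or does not match the number of dimensions.

    Hypothetical helper, not part of GDAL or XArray.
    """
    shape = json.loads(zarray_json)["shape"]
    names = json.loads(zattrs_json).get("_ARRAY_DIMENSIONS")
    if names is None or len(names) != len(shape):
        return None
    return dict(zip(names, shape))


dims = dimension_sizes(zarray_text, zattrs_text)
missing = dimension_sizes(zarray_text, "{}")
print(dims)
```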

NCZarr extensions

The driver supports the NCZarr v2 extensions for storing the dimension names of an array (read-only).

SRS encoding

The Zarr specification has no provision for spatial reference system encoding. GDAL uses a _CRS attribute, a dictionary that may contain one or several of the following keys: url (an OGC CRS URL), wkt (WKT:2019 is used by default on writing; WKT1 is also supported on reading), and projjson. On reading, the driver uses url by default; if it is not found, it falls back to wkt and then to projjson.

{
  "_CRS":{
    "wkt":"PROJCRS[\"NAD27 \/ UTM zone 11N\",BASEGEOGCRS[\"NAD27\",DATUM[\"North American Datum 1927\",ELLIPSOID[\"Clarke 1866\",6378206.4,294.978698213898,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],ID[\"EPSG\",4267]],CONVERSION[\"UTM zone 11N\",METHOD[\"Transverse Mercator\",ID[\"EPSG\",9807]],PARAMETER[\"Latitude of natural origin\",0,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8801]],PARAMETER[\"Longitude of natural origin\",-117,ANGLEUNIT[\"degree\",0.0174532925199433],ID[\"EPSG\",8802]],PARAMETER[\"Scale factor at natural origin\",0.9996,SCALEUNIT[\"unity\",1],ID[\"EPSG\",8805]],PARAMETER[\"False easting\",500000,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8806]],PARAMETER[\"False northing\",0,LENGTHUNIT[\"metre\",1],ID[\"EPSG\",8807]]],CS[Cartesian,2],AXIS[\"easting\",east,ORDER[1],LENGTHUNIT[\"metre\",1]],AXIS[\"northing\",north,ORDER[2],LENGTHUNIT[\"metre\",1]],ID[\"EPSG\",26711]]",

    "projjson":{
      "$schema":"https:\/\/proj.org\/schemas\/v0.2\/projjson.schema.json",
      "type":"ProjectedCRS",
      "name":"NAD27 \/ UTM zone 11N",
      "base_crs":{
        "name":"NAD27",
        "datum":{
          "type":"GeodeticReferenceFrame",
          "name":"North American Datum 1927",
          "ellipsoid":{
            "name":"Clarke 1866",
            "semi_major_axis":6378206.4,
            "inverse_flattening":294.978698213898
          }
        },
        "coordinate_system":{
          "subtype":"ellipsoidal",
          "axis":[
            {
              "name":"Geodetic latitude",
              "abbreviation":"Lat",
              "direction":"north",
              "unit":"degree"
            },
            {
              "name":"Geodetic longitude",
              "abbreviation":"Lon",
              "direction":"east",
              "unit":"degree"
            }
          ]
        },
        "id":{
          "authority":"EPSG",
          "code":4267
        }
      },
      "conversion":{
        "name":"UTM zone 11N",
        "method":{
          "name":"Transverse Mercator",
          "id":{
            "authority":"EPSG",
            "code":9807
          }
        },
        "parameters":[
          {
            "name":"Latitude of natural origin",
            "value":0,
            "unit":"degree",
            "id":{
              "authority":"EPSG",
              "code":8801
            }
          },
          {
            "name":"Longitude of natural origin",
            "value":-117,
            "unit":"degree",
            "id":{
              "authority":"EPSG",
              "code":8802
            }
          },
          {
            "name":"Scale factor at natural origin",
            "value":0.9996,
            "unit":"unity",
            "id":{
              "authority":"EPSG",
              "code":8805
            }
          },
          {
            "name":"False easting",
            "value":500000,
            "unit":"metre",
            "id":{
              "authority":"EPSG",
              "code":8806
            }
          },
          {
            "name":"False northing",
            "value":0,
            "unit":"metre",
            "id":{
              "authority":"EPSG",
              "code":8807
            }
          }
        ]
      },
      "coordinate_system":{
        "subtype":"Cartesian",
        "axis":[
          {
            "name":"Easting",
            "abbreviation":"",
            "direction":"east",
            "unit":"metre"
          },
          {
            "name":"Northing",
            "abbreviation":"",
            "direction":"north",
            "unit":"metre"
          }
        ]
      },
      "id":{
        "authority":"EPSG",
        "code":26711
      }
    },

    "url":"http:\/\/www.opengis.net\/def\/crs\/EPSG\/0\/26711"
  }
}
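The reading priority described above (url, then wkt, then projjson) can be sketched in Python as follows. This is only an illustration of the lookup order: pick_crs_definition is a hypothetical helper, and the attribute values are shortened placeholders rather than complete CRS definitions.

```python
# Reading priority stated in the text: url first, then wkt, then projjson.
CRS_KEY_PRIORITY = ("url", "wkt", "projjson")


def pick_crs_definition(attributes):
    """Return (key, value) for the first CRS encoding found, or None.

    Hypothetical helper, not part of GDAL.
    """
    crs = attributes.get("_CRS", {})
    for key in CRS_KEY_PRIORITY:
        if key in crs:
            return key, crs[key]
    return None


# Shortened placeholder values, for illustration only.
with_url = {
    "_CRS": {
        "wkt": "PROJCRS[...]",
        "url": "http://www.opengis.net/def/crs/EPSG/0/26711",
    }
}
wkt_only = {"_CRS": {"projjson": {"type": "ProjectedCRS"}, "wkt": "PROJCRS[...]"}}

picked_url = pick_crs_definition(with_url)
picked_wkt = pick_crs_definition(wkt_only)
picked_none = pick_crs_definition({})
print(picked_url[0], picked_wkt[0], picked_none)
```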

Particularities of the classic raster API

If the Zarr dataset contains a single array with 2 dimensions, it will be exposed as a regular GDALDataset when using the classic raster API. If the dataset contains more than one such array, or arrays with 3 or more dimensions, the driver will list subdatasets to access each array and/or 2D slices within arrays with 3 or more dimensions.

Open options

The following dataset open options are available:

  • USE_ZMETADATA=[YES​/​NO]: Defaults to YES. Whether to use consolidated metadata from .zmetadata (Zarr V2 only).

  • CACHE_TILE_PRESENCE=[YES/NO]: Defaults to NO. Whether to establish an initial listing of present tiles. This cached listing will be stored in a .gmac file next to the .zarray / .array.json.gmac file if they can be written. Otherwise the GDAL_PAM_PROXY_DIR config option should be set to an existing directory where those cached files will be stored. Once the cached listing has been established, the open option no longer needs to be specified. Note: the runtime of this option can be in minutes or more for large datasets stored on remote file systems. On network file systems, this will rarely work with /vsicurl/ itself, but is more likely to work with cloud-based file systems (such as /vsis3/, /vsigs/, /vsiaz/, etc.) that have a dedicated directory listing operation.

  • MULTIBAND=[YES​/​NO]: (GDAL >= 3.8) Defaults to YES. Whether to expose > 3D arrays as GDAL multiband datasets (when using the classic 2D API)

  • DIM_X=<string> or <integer>: (GDAL >= 3.8) Name or index of the X dimension (only used when MULTIBAND=YES and with the classic 2D API). If not specified, it is deduced from the dimension type (when equal to "HORIZONTAL_X"), or taken as the last dimension (i.e. the fastest varying one) if no dimension type is found.

  • DIM_Y=<string> or <integer>: (GDAL >= 3.8) Name or index of the Y dimension (only used when MULTIBAND=YES and with the classic 2D API). If not specified, it is deduced from the dimension type (when equal to "HORIZONTAL_Y"), or taken as the second-to-last dimension if no dimension type is found.

  • LOAD_EXTRA_DIM_METADATA_DELAY=<integer> or "unlimited": (GDAL >= 3.8) Defaults to 5. Maximum delay in seconds allowed to set the DIM_{dimname}_VALUE band metadata items from the indexing variable of the dimensions. unlimited can be used to mean unlimited delay. Can also be defined globally with the GDAL_LOAD_EXTRA_DIM_METADATA_DELAY configuration option. Only used through the classic 2D API.

Multi-threaded caching

The driver implements the GDALMDArray::AdviseRead() method. It performs multi-threaded decoding of the tiles that intersect the specified area of interest. A sufficient cache size must be specified. The call is blocking.

The options that can be passed to the method are:

  • CACHE_SIZE=value_in_byte: Maximum RAM to use, expressed in number of bytes. If not specified, half of the remaining GDAL block cache size will be used. Note: the caching mechanism of Zarr array will not update this remaining block cache size.

  • NUM_THREADS=integer or ALL_CPUS: Number of threads to use in parallel. If not specified, the GDAL_NUM_THREADS configuration option will be taken into account.

Creation options

The following options are creation options of the classic raster API, or array-level creation options for the multidimensional API (they must be prefixed with ARRAY: when using gdalmdimtranslate):

  • COMPRESS=[NONE​/​BLOSC​/​ZLIB​/​GZIP​/​LZMA​/​ZSTD​/​LZ4]: Defaults to NONE. Compression method.

  • FILTER=[NONE/DELTA]: Defaults to NONE. Filter method. Only supported for FORMAT=ZARR_V2.

  • BLOCKSIZE=<string>: Comma-separated list of chunk sizes along each dimension. If not specified, the 2 fastest varying dimensions (the last ones) use a block size of 256 samples, and the other ones a block size of 1.

  • CHUNK_MEMORY_LAYOUT=[C/F]: Defaults to C. Whether to use C (row-major) order or F (column-major) order in encoded chunks. Only useful when using compression. Changing to F may improve compression, depending on array content.

  • STRING_FORMAT=[ASCII/UNICODE]: Defaults to ASCII. Whether to use the numpy type for ASCII-only strings or Unicode strings. Unicode strings take 4 bytes per character.

  • DIM_SEPARATOR=<string>: Dimension separator in chunk filenames. Defaults to a decimal point for Zarr V2 and a slash for Zarr V3.

  • BLOSC_CNAME=[blosclz/lz4/lz4hc/snappy/zlib/zstd]: Defaults to lz4. Blosc compressor name. Only used when COMPRESS=BLOSC.

  • BLOSC_CLEVEL=1-9: Defaults to 5. Blosc compression level. Only used when COMPRESS=BLOSC.

  • BLOSC_SHUFFLE=[NONE​/​BYTE​/​BIT]: Defaults to BYTE. Type of shuffle algorithm. Only used when COMPRESS=BLOSC.

  • BLOSC_BLOCKSIZE=<integer>: Defaults to 0. Blosc block size. Only used when COMPRESS=BLOSC.

  • BLOSC_NUM_THREADS=[<integer>​/​ALL_CPUS]: Defaults to 1. Number of worker threads for compression. Only used when COMPRESS=BLOSC.

  • ZLIB_LEVEL=1-9: Defaults to 6. ZLib compression level. Only used when COMPRESS=ZLIB.

  • GZIP_LEVEL=1-9: Defaults to 6. GZip compression level. Only used when COMPRESS=GZIP.

  • LZMA_PRESET=0-9: Defaults to 6. LZMA compression level. Only used when COMPRESS=LZMA.

  • LZMA_DELTA=<integer>: Defaults to 1. Delta distance in bytes. Only used when COMPRESS=LZMA.

  • ZSTD_LEVEL=1-22: Defaults to 13. ZSTD compression level. Only used when COMPRESS=ZSTD.

  • LZ4_ACCELERATION=<integer> [1-]: Defaults to 1. LZ4 acceleration factor. The higher the value, the lower the compression. Only used when COMPRESS=LZ4.

  • DELTA_DTYPE=<string>: Data type following NumPy array protocol type string (typestr) format (https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface). Only u1, i1, u2, i2, u4, i4, u8, i8, f4, f8, potentially prefixed with the endianness flag (< for little endian, > for big endian) are supported. Only used when FILTER=DELTA. Defaults to the native data type.
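As a rough illustration of the BLOCKSIZE default and of the ARRAY: prefix, the Python sketch below computes the default chunk sizes for a given number of dimensions and assembles gdalmdimtranslate arguments. Both helper names are hypothetical, the option values are arbitrary, and the sketch does not model how the driver handles dimensions smaller than 256 samples, which the text above does not specify.

```python
def default_blocksize(ndim):
    """Default chunk sizes as described for BLOCKSIZE: 256 for the last
    two dimensions and 1 for the others.

    Hypothetical helper, not part of GDAL.
    """
    if ndim < 1:
        raise ValueError("ndim must be at least 1")
    sizes = [1] * max(ndim - 2, 0) + [256] * min(ndim, 2)
    return ",".join(str(size) for size in sizes)


def array_creation_args(options):
    """Turn {"COMPRESS": "ZSTD"} into ["-co", "ARRAY:COMPRESS=ZSTD"]."""
    args = []
    for key, value in options.items():
        args += ["-co", f"ARRAY:{key}={value}"]
    return args


blocksize_3d = default_blocksize(3)
args = array_creation_args(
    {"COMPRESS": "ZSTD", "ZSTD_LEVEL": 13, "BLOCKSIZE": blocksize_3d}
)
command_line = " ".join(["gdalmdimtranslate", "in.nc", "out.zarr"] + args)
print(command_line)
```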

The following options are creation options of the classic raster API, or dataset-level creation options for the multidimensional API:

  • FORMAT=[ZARR_V2​/​ZARR_V3]: Defaults to ZARR_V2.

  • CREATE_ZMETADATA=[YES​/​NO]: Defaults to YES. Whether to create consolidated metadata into .zmetadata (Zarr V2 only).

The following options are creation options of the classic raster API only:

  • ARRAY_NAME=<string>: Array name. If not specified, deduced from the filename.

  • APPEND_SUBDATASET=[YES​/​NO]: Defaults to NO. Whether to append the new dataset to an existing Zarr hierarchy.

  • SINGLE_ARRAY=[YES​/​NO]: (GDAL >= 3.8) Defaults to YES. Whether to write a multi-band dataset as a 3D Zarr array. If false, one 2D Zarr array per band will be written.

  • INTERLEAVE=[BAND/PIXEL]: (GDAL >= 3.8) Defaults to BAND. When writing a multi-band dataset as a 3D Zarr array, whether the band dimension should be the first one/slowest varying one (BAND), or the last one/fastest varying one (PIXEL). The default value is BAND in Create() mode. In CreateCopy() mode, the default value is the value of the INTERLEAVE metadata item of the IMAGE_STRUCTURE metadata domain of the source dataset, if set.
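The effect of INTERLEAVE on the shape of the resulting 3D array can be pictured with a small Python sketch. It assumes that the two spatial dimensions are ordered (height, width), which the text above does not state; only the position of the band dimension comes from the option description. The helper name zarr_shape is hypothetical.

```python
def zarr_shape(bands, height, width, interleave="BAND"):
    """Shape of the 3D array for a multi-band dataset.

    BAND puts the band dimension first (slowest varying), PIXEL puts it
    last (fastest varying). The (height, width) order of the spatial
    dimensions is an assumption of this sketch.
    """
    if interleave == "BAND":
        return (bands, height, width)
    if interleave == "PIXEL":
        return (height, width, bands)
    raise ValueError(f"unsupported INTERLEAVE value: {interleave}")


band_shape = zarr_shape(3, 512, 1024, "BAND")
pixel_shape = zarr_shape(3, 512, 1024, "PIXEL")
print(band_shape, pixel_shape)
```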

Examples

Get information on the dataset using the multidimensional tools:

gdalmdiminfo my.zarr

Convert a netCDF file to Zarr using the multidimensional tools:

gdalmdimtranslate in.nc out.zarr -co ARRAY:COMPRESS=GZIP

Convert a 2D slice (the one at index 0 of the non-2D dimension) of a 3D array to GeoTIFF:

gdal_translate 'ZARR:"my.zarr":/group/myarray:0' out.tif

Note

The single quoting around the connection string is specific to the Bash shell to make sure that the double quoting is preserved.
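The subdataset name used in the last example can also be assembled programmatically. The Python sketch below follows the pattern shown above (ZARR:"<dataset>":<array path>:<index>); the helper name zarr_subdataset_name is hypothetical and only the single-index form shown in the example is covered.

```python
import shlex


def zarr_subdataset_name(dataset, array_path, index=None):
    """Build ZARR:"<dataset>":<array path>[:<index>].

    Hypothetical helper, not part of GDAL. Only the single-index form
    shown in the example above is handled.
    """
    name = f'ZARR:"{dataset}":{array_path}'
    if index is not None:
        name += f":{index}"
    return name


slice_name = zarr_subdataset_name("my.zarr", "/group/myarray", 0)

# Quote the name for a POSIX shell such as Bash.
translate_command = " ".join(
    ["gdal_translate", shlex.quote(slice_name), "out.tif"]
)
print(translate_command)
```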

See Also: