Reading and writing files#

Vector data cubes do not have an official specification for an interoperable file format that can be used to save any Xarray object with geometric dimensions to a file and read it back or in other software like {stars} in the R-Spatial ecosystem.

However, there are some ways to encode the data to be compliant with either array I/O. The CF conventions offer a way of encoding coordinate geometries to array data that can then be saved to an array format like netCDF or Zarr. Both coordinate geometry and variable geometry can be encoded as well-known binary representation and stored in Zarr, with additional attributes allowing for lossless rountripping through a file. Both of these are supported by Xvec.

First lets read some data.

import geopandas as gpd
import xarray as xr
from geodatasets import get_path

import xvec
counties = gpd.read_file(get_path("geoda.natregimes"))

cube = xr.Dataset(
    data_vars=dict(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    ),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
<xarray.Dataset> Size: 370kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB POLYGON ((-95.34258270263672 48.546703...
  * year          (year) int64 32B 1960 1970 1980 1990
Data variables:
    population    (county, year) int32 49kB 4304 3987 3764 ... 43766 55800 65077
    unemployment  (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489
    divorce       (county, year) float64 99kB 1.859 2.62 3.747 ... 4.782 7.415
    age           (county, year) float64 99kB 28.8 30.5 34.5 ... 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

CF conventions and netCDF, Zarr#

Given a vector data cube that contains only coordinate geometries (i.e. no variable geometries), you can use .xvec.encode_cf() to to encode geometry arrays with CF conventions to a form that is compatible with any array storage format (e.g. netCDF, Zarr). This function uses cf_xarray.geometry.encode_geometries and cf_xarray.geometry.decode_geometries under the hood and is not limited to a single array of geometries.

encoded = cube.xvec.encode_cf()
encoded
<xarray.Dataset> Size: 2MB
Dimensions:             (county: 3085, part: 3172, node: 80563, year: 4)
Coordinates:
    spatial_ref         int64 8B 0
  * year                (year) int64 32B 1960 1970 1980 1990
Dimensions without coordinates: county, part, node
Data variables:
    node_count          (county) int64 25kB 24 50 68 99 16 7 ... 86 32 24 32 74
    part_node_count     (part) int64 25kB 24 50 68 99 16 7 ... 28 86 32 24 32 74
    interior_ring       (part) bool 3kB False False False ... False False False
    geometry_container  float64 8B nan
    x                   (node) float64 645kB -95.34 -95.34 ... -111.3 -111.4
    y                   (node) float64 645kB 48.55 48.72 48.72 ... 44.73 44.75
    lon                 (county) float64 25kB -95.34 -118.9 ... -77.53 -111.4
    lat                 (county) float64 25kB 48.55 47.95 48.04 ... 38.57 44.75
    population          (county, year) int32 49kB 4304 3987 3764 ... 55800 65077
    unemployment        (county, year) float64 99kB 7.9 9.0 ... 7.018 5.489
    divorce             (county, year) float64 99kB 1.859 2.62 ... 4.782 7.415
    age                 (county, year) float64 99kB 28.8 30.5 ... 28.97 35.33

Write to a format as usual with Xarray

encoded.to_netcdf("geo-encoded.nc", mode="w")
encoded.to_zarr("geo-encoded.zarr", mode="w")
<xarray.backends.zarr.ZarrStore at 0x1356920e0>

On open, use .xvec.decode_cf to recover the array. This function uses cf_xarray.decode_geometries so again cf_xarray must be installed.

roundtripped = xr.open_zarr("geo-encoded.zarr").xvec.decode_cf()
roundtripped
<xarray.Dataset> Size: 370kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB POLYGON ((-95.34258270263672 48.546703...
  * year          (year) int64 32B 1960 1970 1980 1990
Data variables:
    age           (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray>
    divorce       (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray>
    population    (county, year) int32 49kB dask.array<chunksize=(3085, 4), meta=np.ndarray>
    unemployment  (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray>
Indexes:
    county   GeometryIndex (crs=EPSG:4326)
roundtripped.identical(cube)
True

Zarr and well-known binary#

The CF conventions do not define a way of storing variable geometries. For that reason, xvec implements encoding to well-known binary representation of geometries. This way, you can save any Dataset to Zarr and read it back.

Create an example Dataset with Svalbard glacier changing geometries with a summary geometry to have both coordinate and variable geometry in the same object.

glaciers_df = gpd.read_file("https://github.com/loreabad6/post/raw/refs/heads/main/inst/extdata/svalbard.gpkg")

glaciers = (
    glaciers_df.set_index(["name", "year"])
    .to_xarray()
    .proj.assign_crs(
        spatial_ref=glaciers_df.crs
    )  # use xproj to store the CRS information
    .xvec.summarize_geometry(
        dim="name", geom_array="geometry", aggfunc="concave_hull", ratio=0.2
    )  # generate coordinate geometry
)
glaciers
<xarray.Dataset> Size: 472B
Dimensions:           (name: 5, year: 3)
Coordinates:
  * name              (name) object 40B 'Austre Brøggerbreen' ... 'Steenbreen'
  * year              (year) float64 24B 1.936e+03 1.99e+03 2.007e+03
  * spatial_ref       int64 8B 0
  * summary_geometry  (name) object 40B POLYGON ((431324.8682000004 8762430.3...
Data variables:
    length            (name, year) float64 120B 5.808e+03 ... 1.819e+03
    fwidth            (name, year) float64 120B 1.254e+03 470.1 ... 279.4 202.6
    geometry          (name, year) object 120B POLYGON ((432375.11039999966 8...
Indexes:
    spatial_ref       CRSIndex (crs=EPSG:32633)
    summary_geometry  GeometryIndex (crs=EPSG:32633)

The .xvec.encode_wkb() method encodes all arrays with shapely geometry objects into WKB and stores the CRS information as PROJJSON in their attributes, together with the information that the resulting array contains WKB-encoded geometry. Additionally, it adds a wkb_encoded_geometry attribute equal to True to avoid ambiguity during the decoding step.

encoded_wkb = glaciers.xvec.encode_wkb()
encoded_wkb
<xarray.Dataset> Size: 472B
Dimensions:           (name: 5, year: 3)
Coordinates:
  * name              (name) object 40B 'Austre Brøggerbreen' ... 'Steenbreen'
  * year              (year) float64 24B 1.936e+03 1.99e+03 2.007e+03
  * spatial_ref       int64 8B 0
    summary_geometry  (name) object 40B b"\x01\x03\x00\x00\x00\x01\x00\x00\x0...
Data variables:
    length            (name, year) float64 120B 5.808e+03 ... 1.819e+03
    fwidth            (name, year) float64 120B 1.254e+03 470.1 ... 279.4 202.6
    geometry          (name, year) object 120B b'\x01\x03\x00\x00\x00\x06\x00...
Indexes:
    spatial_ref  CRSIndex (crs=EPSG:32633)

Write to a Zarr file.

encoded_wkb.to_zarr("glaciers.zarr", mode="w")
<xarray.backends.zarr.ZarrStore at 0x1502181f0>

As above, the object can be losslessly roundtripped via Zarr.

roundtripped_wkb = xr.open_zarr("glaciers.zarr").xvec.decode_wkb()
roundtripped_wkb
<xarray.Dataset> Size: 472B
Dimensions:           (name: 5, year: 3)
Coordinates:
  * name              (name) object 40B 'Austre Brøggerbreen' ... 'Steenbreen'
  * spatial_ref       int64 8B ...
  * summary_geometry  (name) object 40B POLYGON ((431324.8682000004 8762430.3...
  * year              (year) float64 24B 1.936e+03 1.99e+03 2.007e+03
Data variables:
    fwidth            (name, year) float64 120B dask.array<chunksize=(5, 3), meta=np.ndarray>
    geometry          (name, year) object 120B POLYGON ((432375.11039999966 8...
    length            (name, year) float64 120B dask.array<chunksize=(5, 3), meta=np.ndarray>
Indexes:
    spatial_ref       CRSIndex (crs=EPSG:32633)
    summary_geometry  GeometryIndex (crs=EPSG:32633)
roundtripped_wkb.identical(glaciers)
True

GIS file formats#

Another option is to convert the data to a GeoDataFrame and save it to any geospatial file format or a GeoParquet if there are multiple geometry columns.

Long table form#

Long table form needs to save each value in a row with all its dimension coordinates. While this option is possible, it can be inefficient, leading to a high degree of repetition and very long tables. However, it is readily supported by Xarray and Xvec.

long = cube.xvec.to_geodataframe()
long
county population unemployment divorce age
year
1960 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 4304 7.900000 1.858974 28.800000
1970 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 3987 9.000000 2.619808 30.500000
1980 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 3764 5.902579 3.746594 34.500000
1990 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 4076 3.894790 7.388535 35.500000
1960 POLYGON ((-118.8505 47.94969, -118.84732 48.47... 3889 8.200000 2.863278 25.900000
... ... ... ... ... ...
1990 POLYGON ((-77.53178 38.56506, -77.72094 38.840... 250377 3.353981 5.750177 28.466667
1960 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 39260 5.218039 2.777586 28.250000
1970 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 43766 5.444291 2.724604 26.400000
1980 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 55800 7.017943 4.782264 28.966667
1990 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 65077 5.488553 7.414559 35.333333

12340 rows × 5 columns

You can see that the result is longer than the original GeoDataFrame that had each year encoded as a column.

You can then save it using to_file or to_parquet.

long.to_file("cube.gpkg")

or

long.to_parquet("cube.parquet")

You can then re-create the Dataset using the from_dataframe method with the geometry column set as an index.

xr.Dataset.from_dataframe(long.set_index("county", append=True)).xvec.set_geom_indexes(
    "county", crs=long.crs
)
<xarray.Dataset> Size: 370kB
Dimensions:       (year: 4, county: 3085)
Coordinates:
  * year          (year) int64 32B 1960 1970 1980 1990
  * county        (county) object 25kB MULTIPOLYGON (((-97.25125122070312 26....
Data variables:
    population    (year, county) int32 49kB 151098 20084 180904 ... 30167 62496
    unemployment  (year, county) float64 99kB 8.4 4.0 6.3 ... 4.464 6.393 4.379
    divorce       (year, county) float64 99kB 1.584 1.219 1.148 ... 7.917 8.998
    age           (year, county) float64 99kB 20.5 18.9 19.8 ... 37.0 27.2 34.1
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

Wide table form#

The other option is to create a wide table form which is what we started with. Each combination of coordinate and data variables except geometry is encoded as a column (e.g. ["population_1960", "population_1970", "population_1980", "population_1990"] or using the MultiIndex like below). However, this works only when the Dataset or DataArray is indexed by one geometry dimension only.

To create a valid GeoDataFrame, you would need to do something like this:

wide = cube.xvec.to_geodataframe(long=False)
wide
county population unemployment divorce age
year 1960 1970 1980 1990 1960 1970 1980 1990 1960 1970 1980 1990 1960 1970 1980 1990
0 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 4304 3987 3764 4076 7.900000 9.000000 5.902579 3.894790 1.858974 2.619808 3.746594 7.388535 28.80 30.5 34.500000 35.500000
1 POLYGON ((-118.8505 47.94969, -118.84732 48.47... 3889 3655 5811 6295 8.200000 15.400000 15.422886 16.811594 2.863278 3.686007 6.625442 11.543135 25.90 27.1 27.200000 32.800000
2 POLYGON ((-117.43777 48.04422, -117.54113 48.0... 17884 17405 28979 30948 10.100000 9.000000 13.574064 10.700794 2.711447 2.976378 5.448223 9.123212 29.60 31.8 28.700000 34.500000
3 POLYGON ((-118.97096 47.93928, -118.97293 47.9... 25520 25867 30639 33350 7.500000 10.500000 12.700195 10.203544 3.372041 4.090386 7.122333 9.245627 29.40 31.1 31.200000 35.000000
4 POLYGON ((-117.4375 49, -117.03098 49, -117.02... 6914 6025 8580 8915 10.600000 13.400000 18.148999 14.991023 3.008988 4.010695 5.287860 10.158351 31.20 33.8 31.300000 36.100000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3080 POLYGON ((-114.51984 33.02767, -114.5583 33.03... 46235 60827 90554 120739 4.600000 4.800000 8.012906 10.910351 3.702838 4.160574 5.730308 7.798840 25.70 24.6 27.900000 33.900000
3081 POLYGON ((-107.19484 34.58352, -107.7178 34.58... 39085 40576 61115 69029 6.100000 3.800000 8.375127 11.002271 1.880621 2.274648 5.372952 9.078873 20.10 21.7 25.500000 30.450000
3082 POLYGON ((-76.39569 37.10771, -76.4027 37.0905... 21583 33203 44189 53427 2.500000 3.105065 5.037045 4.102888 1.113223 1.550323 3.257639 4.850746 24.40 24.4 29.800000 34.000000
3083 POLYGON ((-77.53178 38.56506, -77.72094 38.840... 50164 111102 166665 250377 3.300000 2.361559 3.708886 3.353981 1.672391 1.468343 3.742207 5.750177 22.30 21.8 25.666667 28.466667
3084 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 39260 43766 55800 65077 5.218039 5.444291 7.017943 5.488553 2.777586 2.724604 4.782264 7.414559 28.25 26.4 28.966667 35.333333

3085 rows × 17 columns

And a re-creation of the Dataset.

wide.set_index("county", append=True).stack().reset_index(
    level=0, drop=True
).to_xarray().xvec.set_geom_indexes("county", crs=wide.crs)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_28128/2455462364.py:1: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
  wide.set_index("county", append=True).stack().reset_index(
<xarray.Dataset> Size: 370kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB MULTIPOLYGON (((-97.25125122070312 26....
  * year          (year) object 32B 1960 1970 1980 1990
Data variables:
    population    (county, year) int32 49kB 151098 140368 209727 ... 54981 62496
    unemployment  (county, year) float64 99kB 8.4 6.6 7.757 ... 3.0 4.898 4.379
    divorce       (county, year) float64 99kB 1.584 1.91 3.21 ... 6.656 8.998
    age           (county, year) float64 99kB 20.5 21.8 25.0 ... 25.3 30.1 34.1
Indexes:
    county   GeometryIndex (crs=EPSG:4326)