Reading and writing files#

Vector data cubes do not have a specific interoperable file format that can be used to save any Xarray object with geometric dimensions to a file and read it back or in other software like {stars} in the R-Spatial ecosystem.

However the CF conventions offer a way of encoding geometries to array data that can then be saved to an array format like netCDF or Zarr.

First lets read some data

import geopandas as gpd
import xarray as xr
import xvec
from geodatasets import get_path
counties = gpd.read_file(get_path("geoda.natregimes"))

cube = xr.Dataset(
    data_vars=dict(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    ),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
<xarray.Dataset> Size: 420kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB POLYGON ((-95.34258270263672 48.546703...
  * year          (year) int64 32B 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 99kB 4304 3987 3764 ... 43766 55800 65077
    unemployment  (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489
    divorce       (county, year) float64 99kB 1.859 2.62 3.747 ... 4.782 7.415
    age           (county, year) float64 99kB 28.8 30.5 34.5 ... 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

CF conventions and netCDF, Zarr#

Use .xvec.encode_cf() to to encode geometry arrays with CF conventions to a form that is compatible with any array storage format (e.g. netCDF, Zarr). This function uses cf_xarray.geometry.encode_geometries and cf_xarray.geometry.decode_geometries under the hood.

encoded = cube.xvec.encode_cf()
encoded
<xarray.Dataset> Size: 2MB
Dimensions:             (county: 3085, node: 80563, part: 3172, year: 4)
Coordinates:
    spatial_ref         int64 8B 0
  * year                (year) int64 32B 1960 1970 1980 1990
Dimensions without coordinates: county, node, part
Data variables:
    node_count          (county) int64 25kB 24 50 68 99 16 7 ... 86 32 24 32 74
    geometry_container  float64 8B nan
    x                   (node) float64 645kB -95.34 -95.34 ... -111.3 -111.4
    y                   (node) float64 645kB 48.55 48.72 48.72 ... 44.73 44.75
    lon                 (county) float64 25kB -95.34 -118.9 ... -77.53 -111.4
    lat                 (county) float64 25kB 48.55 47.95 48.04 ... 38.57 44.75
    part_node_count     (part) int64 25kB 24 50 68 99 16 7 ... 28 86 32 24 32 74
    interior_ring       (part) int64 25kB 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0
    population          (county, year) int64 99kB 4304 3987 3764 ... 55800 65077
    unemployment        (county, year) float64 99kB 7.9 9.0 ... 7.018 5.489
    divorce             (county, year) float64 99kB 1.859 2.62 ... 4.782 7.415
    age                 (county, year) float64 99kB 28.8 30.5 ... 28.97 35.33

Write to a format as usual with Xarray

encoded.to_netcdf("geo-encoded.nc", mode="w")
encoded.to_zarr("geo-encoded.zarr", mode="w")
<xarray.backends.zarr.ZarrStore at 0x135e610c0>

On open, use .xvec.decode_cf to recover the array. This function uses cf_xarray.decode_geometries so again cf_xarray must be installed.

roundtripped = xr.open_zarr("geo-encoded.zarr").xvec.decode_cf()
roundtripped
<xarray.Dataset> Size: 420kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB POLYGON ((-95.34258270263672 48.546703...
  * year          (year) int64 32B 1960 1970 1980 1990
Data variables:
    age           (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray>
    divorce       (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray>
    population    (county, year) int64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray>
    unemployment  (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray>
Indexes:
    county   GeometryIndex (crs=EPSG:4326)
roundtripped.identical(cube)
True

Pickle, joblib#

As with any other file, you can serialize and deserialize Xarray objects with the pickle and joblib libraries. However, keep in mind that such a file needs to be read in an environment with the same or similar versions of all dependencies to avoid issues.

Create an example Dataset based on the US demographic data as in the introductory notebook.

import pickle

import joblib

Pickle#

Save the object from memory to a file using the built-in pickle package:

with open("cube.pickle", "wb") as f:
    pickle.dump(cube, f)

And read it back to memory:

with open("cube.pickle", "rb") as f:
    cube2 = pickle.load(f)

cube2
<xarray.Dataset> Size: 420kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB POLYGON ((-95.34258270263672 48.546703...
  * year          (year) int64 32B 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 99kB 4304 3987 3764 ... 43766 55800 65077
    unemployment  (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489
    divorce       (county, year) float64 99kB 1.859 2.62 3.747 ... 4.782 7.415
    age           (county, year) float64 99kB 28.8 30.5 34.5 ... 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

Joblib#

Larger objects may be better to save using the joblib package that allows compression.

with open("cube.joblib.compressed", "wb") as f:
    joblib.dump(cube, f, compress=True)

And decompress and read it back to memory:

with open("cube.joblib.compressed", "rb") as f:
    cube3 = joblib.load(f)

cube3
<xarray.Dataset> Size: 420kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB POLYGON ((-95.34258270263672 48.546703...
  * year          (year) int64 32B 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 99kB 4304 3987 3764 ... 43766 55800 65077
    unemployment  (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489
    divorce       (county, year) float64 99kB 1.859 2.62 3.747 ... 4.782 7.415
    age           (county, year) float64 99kB 28.8 30.5 34.5 ... 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

In this case, the cube.pickle file has 1.8 MB while the cube.joblib.compressed has 823 KB.

GIS file formats#

Another option is to convert the data to a GeoDataFrame and save it to any geospatial file format or a GeoParquet if there are multiple geometry columns.

Long table form#

Long table form needs to save each value in a row with all its dimension coordinates. While this option is possible, it can be inefficient, leading to a high degree of repetition and very long tables. However, it is readily supported by Xarray and Xvec.

long = cube.xvec.to_geodataframe()
long
county population unemployment divorce age
year
1960 POLYGON ((-95.34258 48.54670, -95.34081 48.715... 4304 7.900000 1.858974 28.800000
1970 POLYGON ((-95.34258 48.54670, -95.34081 48.715... 3987 9.000000 2.619808 30.500000
1980 POLYGON ((-95.34258 48.54670, -95.34081 48.715... 3764 5.902579 3.746594 34.500000
1990 POLYGON ((-95.34258 48.54670, -95.34081 48.715... 4076 3.894790 7.388535 35.500000
1960 POLYGON ((-118.85050 47.94969, -118.84732 48.4... 3889 8.200000 2.863278 25.900000
... ... ... ... ... ...
1990 POLYGON ((-77.53178 38.56506, -77.72094 38.840... 250377 3.353981 5.750177 28.466667
1960 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 39260 5.218039 2.777586 28.250000
1970 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 43766 5.444291 2.724604 26.400000
1980 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 55800 7.017943 4.782264 28.966667
1990 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 65077 5.488553 7.414559 35.333333

12340 rows × 5 columns

You can see that the result is longer than the original GeoDataFrame that had each year encoded as a column.

You can then save it using to_file or to_parquet.

long.to_file("cube.gpkg")

or

long.to_parquet("cube.parquet")

You can then re-create the Dataset using the from_dataframe method with the geometry column set as an index.

xr.Dataset.from_dataframe(long.set_index("county", append=True)).xvec.set_geom_indexes(
    "county", crs=long.crs
)
<xarray.Dataset> Size: 420kB
Dimensions:       (year: 4, county: 3085)
Coordinates:
  * year          (year) int64 32B 1960 1970 1980 1990
  * county        (county) object 25kB MULTIPOLYGON (((-97.25125122070312 26....
Data variables:
    population    (year, county) int64 99kB 151098 20084 180904 ... 30167 62496
    unemployment  (year, county) float64 99kB 8.4 4.0 6.3 ... 4.464 6.393 4.379
    divorce       (year, county) float64 99kB 1.584 1.219 1.148 ... 7.917 8.998
    age           (year, county) float64 99kB 20.5 18.9 19.8 ... 37.0 27.2 34.1
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

Wide table form#

The other option is to create a wide table form which is what we started with. Each combination of coordinate and data variables except geometry is encoded as a column (e.g. ["population_1960", "population_1970", "population_1980", "population_1990"] or using the MultiIndex like below). However, this works only when the Dataset or DataArray is indexed by one geometry dimension only.

To create a valid GeoDataFrame, you would need to do something like this:

wide = cube.xvec.to_geodataframe(long=False)
wide
county population unemployment divorce age
year 1960 1970 1980 1990 1960 1970 1980 1990 1960 1970 1980 1990 1960 1970 1980 1990
0 POLYGON ((-95.34258 48.54670, -95.34081 48.715... 4304 3987 3764 4076 7.900000 9.000000 5.902579 3.894790 1.858974 2.619808 3.746594 7.388535 28.80 30.5 34.500000 35.500000
1 POLYGON ((-118.85050 47.94969, -118.84732 48.4... 3889 3655 5811 6295 8.200000 15.400000 15.422886 16.811594 2.863278 3.686007 6.625442 11.543135 25.90 27.1 27.200000 32.800000
2 POLYGON ((-117.43777 48.04422, -117.54113 48.0... 17884 17405 28979 30948 10.100000 9.000000 13.574064 10.700794 2.711447 2.976378 5.448223 9.123212 29.60 31.8 28.700000 34.500000
3 POLYGON ((-118.97096 47.93928, -118.97293 47.9... 25520 25867 30639 33350 7.500000 10.500000 12.700195 10.203544 3.372041 4.090386 7.122333 9.245627 29.40 31.1 31.200000 35.000000
4 POLYGON ((-117.43750 49.00000, -117.03098 49.0... 6914 6025 8580 8915 10.600000 13.400000 18.148999 14.991023 3.008988 4.010695 5.287860 10.158351 31.20 33.8 31.300000 36.100000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3080 POLYGON ((-114.51984 33.02767, -114.55830 33.0... 46235 60827 90554 120739 4.600000 4.800000 8.012906 10.910351 3.702838 4.160574 5.730308 7.798840 25.70 24.6 27.900000 33.900000
3081 POLYGON ((-107.19484 34.58352, -107.71780 34.5... 39085 40576 61115 69029 6.100000 3.800000 8.375127 11.002271 1.880621 2.274648 5.372952 9.078873 20.10 21.7 25.500000 30.450000
3082 POLYGON ((-76.39569 37.10771, -76.40270 37.090... 21583 33203 44189 53427 2.500000 3.105065 5.037045 4.102888 1.113223 1.550323 3.257639 4.850746 24.40 24.4 29.800000 34.000000
3083 POLYGON ((-77.53178 38.56506, -77.72094 38.840... 50164 111102 166665 250377 3.300000 2.361559 3.708886 3.353981 1.672391 1.468343 3.742207 5.750177 22.30 21.8 25.666667 28.466667
3084 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 39260 43766 55800 65077 5.218039 5.444291 7.017943 5.488553 2.777586 2.724604 4.782264 7.414559 28.25 26.4 28.966667 35.333333

3085 rows × 17 columns

And a re-creation of the Dataset.

wide.set_index("county", append=True).stack().reset_index(
    level=0, drop=True
).to_xarray().xvec.set_geom_indexes("county", crs=wide.crs)
/var/folders/sc/hsdthq2x7mnbfvwnk7rp4xmr0000gn/T/ipykernel_52609/2455462364.py:1: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
  wide.set_index("county", append=True).stack().reset_index(
<xarray.Dataset> Size: 420kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB MULTIPOLYGON (((-97.25125122070312 26....
  * year          (year) object 32B 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 99kB 151098 140368 209727 ... 54981 62496
    unemployment  (county, year) float64 99kB 8.4 6.6 7.757 ... 3.0 4.898 4.379
    divorce       (county, year) float64 99kB 1.584 1.91 3.21 ... 6.656 8.998
    age           (county, year) float64 99kB 20.5 21.8 25.0 ... 25.3 30.1 34.1
Indexes:
    county   GeometryIndex (crs=EPSG:4326)