Reading and writing files#

Vector data cubes do not yet have an interoperable file format that would allow saving any Xarray object with geometry dimensions to a file and reading it back, either in Python or in other software such as {stars} from the R-Spatial ecosystem.

CF conventions and NetCDF offer a way of saving a Dataset with a single dimension indexed by geometry, but this still needs to be implemented in Xvec. While we are working on a solution (see this issue), some options can be used in the meantime.

Pickle, joblib#

As with any other Python object, you can serialize and deserialize Xarray objects with the pickle and joblib libraries. However, keep in mind that such a file needs to be read in an environment with the same or similar versions of all dependencies; otherwise, loading may fail or behave unexpectedly.

import pickle
import joblib

import geopandas as gpd
import xarray as xr
import xvec

from geodatasets import get_path

Create an example Dataset based on the US demographic data as in the introductory notebook.

counties = gpd.read_file(get_path("geoda.natregimes"))

cube = xr.Dataset(
    data_vars=dict(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    ),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

Pickle#

Save the object from memory to a file using the built-in pickle package:

with open("cube.pickle", "wb") as f:
    pickle.dump(cube, f)

And read it back into memory:

with open("cube.pickle", "rb") as f:
    cube2 = pickle.load(f)

cube2
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)
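To double-check that nothing was lost in the round trip, you can compare the loaded object with the original (a quick sanity check; it relies on element-wise equality of the shapely geometries):

cube.identical(cube2)  # expected to return True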

Joblib#

Larger objects may be better saved using the joblib package, which supports compression.

with open("cube.joblib.compressed", "wb") as f:
    joblib.dump(cube, f, compress=True)

And decompress and read it back into memory:

with open("cube.joblib.compressed", "rb") as f:
    cube3 = joblib.load(f)

cube3
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

In this case, the cube.pickle file takes 1.8 MB on disk, while cube.joblib.compressed takes only 823 KB.
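The exact numbers will differ with library versions and compression settings. You can check the sizes on your machine with a small snippet like this one (a sketch using only the standard library):

import os

for path in ["cube.pickle", "cube.joblib.compressed"]:
    print(path, f"{os.path.getsize(path) / 1024:.0f} KB")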

GIS file formats#

Another option is to convert the data to a GeoDataFrame and save it to any geospatial file format, or to GeoParquet if there are multiple geometry columns, since most other formats support only a single geometry column.

Long table form#

The long table form stores each value in its own row, together with all of its dimension coordinates. While this option is possible, it can be inefficient, leading to a high degree of repetition and very long tables. However, it is readily supported by Xarray and Xvec.

long = cube.xvec.to_geodataframe()
long
county population unemployment divorce age
year
1960 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 4304 7.900000 1.858974 28.800000
1970 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 3987 9.000000 2.619808 30.500000
1980 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 3764 5.902579 3.746594 34.500000
1990 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 4076 3.894790 7.388535 35.500000
1960 POLYGON ((-118.8505 47.94969, -118.84732 48.47... 3889 8.200000 2.863278 25.900000
... ... ... ... ... ...
1990 POLYGON ((-77.53178 38.56506, -77.72094 38.840... 250377 3.353981 5.750177 28.466667
1960 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 39260 5.218039 2.777586 28.250000
1970 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 43766 5.444291 2.724604 26.400000
1980 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 55800 7.017943 4.782264 28.966667
1990 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 65077 5.488553 7.414559 35.333333

12340 rows × 5 columns

You can see that the result is much longer than the original GeoDataFrame, which had each year encoded as a separate column (12,340 rows instead of 3,085).

You can then save it using to_file or to_parquet.

long.to_file("cube.gpkg")

or

long.to_parquet("cube.parquet")

You can then re-create the Dataset using the from_dataframe method with the geometry column set as an index.

xr.Dataset.from_dataframe(long.set_index("county", append=True)).xvec.set_geom_indexes(
    "county", crs=long.crs
)
<xarray.Dataset>
Dimensions:       (year: 4, county: 3085)
Coordinates:
  * year          (year) int64 1960 1970 1980 1990
  * county        (county) object MULTIPOLYGON (((-97.25125122070312 26.41930...
Data variables:
    population    (year, county) int64 151098 20084 180904 ... 83829 30167 62496
    unemployment  (year, county) float64 8.4 4.0 6.3 6.9 ... 4.464 6.393 4.379
    divorce       (year, county) float64 1.584 1.219 1.148 ... 9.628 7.917 8.998
    age           (year, county) float64 20.5 18.9 19.8 20.2 ... 37.0 27.2 34.1
Indexes:
    county   GeometryIndex (crs=EPSG:4326)
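The same reconstruction works when you first read the table back from disk. Below is a minimal sketch assuming the cube.parquet file written above; if the year index is not preserved (for example, when reading the GeoPackage instead), set it again before the conversion.

long_from_disk = gpd.read_parquet("cube.parquet")

xr.Dataset.from_dataframe(
    long_from_disk.set_index("county", append=True)
).xvec.set_geom_indexes("county", crs=long_from_disk.crs)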

Wide table form#

The other option is the wide table form, which is what we started with. Each combination of a data variable and the coordinates other than geometry is encoded as a column (e.g. ["population_1960", "population_1970", "population_1980", "population_1990"], or using a MultiIndex as below). However, this works only when the Dataset or DataArray is indexed by a single geometry dimension.

To create a valid GeoDataFrame, you would need to do something like this:

wide = cube.xvec.to_geodataframe(long=False)
wide
county population unemployment divorce age
year 1960 1970 1980 1990 1960 1970 1980 1990 1960 1970 1980 1990 1960 1970 1980 1990
0 POLYGON ((-95.34258 48.5467, -95.34081 48.7151... 4304 3987 3764 4076 7.900000 9.000000 5.902579 3.894790 1.858974 2.619808 3.746594 7.388535 28.80 30.5 34.500000 35.500000
1 POLYGON ((-118.8505 47.94969, -118.84732 48.47... 3889 3655 5811 6295 8.200000 15.400000 15.422886 16.811594 2.863278 3.686007 6.625442 11.543135 25.90 27.1 27.200000 32.800000
2 POLYGON ((-117.43777 48.04422, -117.54113 48.0... 17884 17405 28979 30948 10.100000 9.000000 13.574064 10.700794 2.711447 2.976378 5.448223 9.123212 29.60 31.8 28.700000 34.500000
3 POLYGON ((-118.97096 47.93928, -118.97293 47.9... 25520 25867 30639 33350 7.500000 10.500000 12.700195 10.203544 3.372041 4.090386 7.122333 9.245627 29.40 31.1 31.200000 35.000000
4 POLYGON ((-117.4375 49, -117.03098 49, -117.02... 6914 6025 8580 8915 10.600000 13.400000 18.148999 14.991023 3.008988 4.010695 5.287860 10.158351 31.20 33.8 31.300000 36.100000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3080 POLYGON ((-114.51984 33.02767, -114.5583 33.03... 46235 60827 90554 120739 4.600000 4.800000 8.012906 10.910351 3.702838 4.160574 5.730308 7.798840 25.70 24.6 27.900000 33.900000
3081 POLYGON ((-107.19484 34.58352, -107.7178 34.58... 39085 40576 61115 69029 6.100000 3.800000 8.375127 11.002271 1.880621 2.274648 5.372952 9.078873 20.10 21.7 25.500000 30.450000
3082 POLYGON ((-76.39569 37.10771, -76.4027 37.0905... 21583 33203 44189 53427 2.500000 3.105065 5.037045 4.102888 1.113223 1.550323 3.257639 4.850746 24.40 24.4 29.800000 34.000000
3083 POLYGON ((-77.53178 38.56506, -77.72094 38.840... 50164 111102 166665 250377 3.300000 2.361559 3.708886 3.353981 1.672391 1.468343 3.742207 5.750177 22.30 21.8 25.666667 28.466667
3084 POLYGON ((-111.37152 44.74516, -111.36839 45.3... 39260 43766 55800 65077 5.218039 5.444291 7.017943 5.488553 2.777586 2.724604 4.782264 7.414559 28.25 26.4 28.966667 35.333333

3085 rows × 17 columns
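If you want to persist this wide form, keep in mind that most writers require plain string column names, so the MultiIndex columns need to be flattened first (for example, into the "population_1960" style mentioned above). A minimal sketch follows; the file name and the naming scheme are arbitrary choices.

# flatten ("population", 1960) -> "population_1960" and ("county", "") -> "county"
flat = wide.copy()
flat.columns = [
    "_".join(str(level) for level in col if level != "") for col in flat.columns
]
flat = flat.set_geometry("county")
flat.to_parquet("cube_wide.parquet")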

And re-create the Dataset from the wide form:

wide.set_index("county", append=True).stack(future_stack=True).reset_index(
    level=0, drop=True
).to_xarray().xvec.set_geom_indexes("county", crs=wide.crs)
<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object MULTIPOLYGON (((-97.25125122070312 26.41930...
  * year          (year) object 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 151098 140368 209727 ... 54981 62496
    unemployment  (county, year) float64 8.4 6.6 7.757 13.33 ... 3.0 4.898 4.379
    divorce       (county, year) float64 1.584 1.91 3.21 ... 2.502 6.656 8.998
    age           (county, year) float64 20.5 21.8 25.0 27.4 ... 25.3 30.1 34.1
Indexes:
    county   GeometryIndex (crs=EPSG:4326)