Reading and writing files#
Vector data cubes do not have a specific interoperable file format that can be used to save any Xarray object with geometric dimensions to a file and read it back or in other software like {stars}
in the R-Spatial ecosystem.
However the CF conventions offer a way of encoding geometries to array data that can then be saved to an array format like netCDF or Zarr.
First lets read some data
import geopandas as gpd
import xarray as xr
import xvec
from geodatasets import get_path
counties = gpd.read_file(get_path("geoda.natregimes"))
cube = xr.Dataset(
data_vars=dict(
population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
),
coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
<xarray.Dataset> Size: 420kB Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object 25kB POLYGON ((-95.34258270263672 48.546703... * year (year) int64 32B 1960 1970 1980 1990 Data variables: population (county, year) int64 99kB 4304 3987 3764 ... 43766 55800 65077 unemployment (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489 divorce (county, year) float64 99kB 1.859 2.62 3.747 ... 4.782 7.415 age (county, year) float64 99kB 28.8 30.5 34.5 ... 28.97 35.33 Indexes: county GeometryIndex (crs=EPSG:4326)
CF conventions and netCDF, Zarr#
Use .xvec.encode_cf()
to to encode geometry arrays with CF conventions to a form that is compatible with any array storage format (e.g. netCDF, Zarr). This function uses cf_xarray.geometry.encode_geometries
and cf_xarray.geometry.decode_geometries
under the hood.
encoded = cube.xvec.encode_cf()
encoded
<xarray.Dataset> Size: 2MB Dimensions: (county: 3085, node: 80563, part: 3172, year: 4) Coordinates: spatial_ref int64 8B 0 * year (year) int64 32B 1960 1970 1980 1990 Dimensions without coordinates: county, node, part Data variables: node_count (county) int64 25kB 24 50 68 99 16 7 ... 86 32 24 32 74 geometry_container float64 8B nan x (node) float64 645kB -95.34 -95.34 ... -111.3 -111.4 y (node) float64 645kB 48.55 48.72 48.72 ... 44.73 44.75 lon (county) float64 25kB -95.34 -118.9 ... -77.53 -111.4 lat (county) float64 25kB 48.55 47.95 48.04 ... 38.57 44.75 part_node_count (part) int64 25kB 24 50 68 99 16 7 ... 28 86 32 24 32 74 interior_ring (part) int64 25kB 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 population (county, year) int64 99kB 4304 3987 3764 ... 55800 65077 unemployment (county, year) float64 99kB 7.9 9.0 ... 7.018 5.489 divorce (county, year) float64 99kB 1.859 2.62 ... 4.782 7.415 age (county, year) float64 99kB 28.8 30.5 ... 28.97 35.33
Write to a format as usual with Xarray
encoded.to_netcdf("geo-encoded.nc", mode="w")
encoded.to_zarr("geo-encoded.zarr", mode="w")
<xarray.backends.zarr.ZarrStore at 0x135e610c0>
On open, use .xvec.decode_cf
to recover the array. This function uses cf_xarray.decode_geometries
so again cf_xarray
must be installed.
roundtripped = xr.open_zarr("geo-encoded.zarr").xvec.decode_cf()
roundtripped
<xarray.Dataset> Size: 420kB Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object 25kB POLYGON ((-95.34258270263672 48.546703... * year (year) int64 32B 1960 1970 1980 1990 Data variables: age (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray> divorce (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray> population (county, year) int64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray> unemployment (county, year) float64 99kB dask.array<chunksize=(3085, 4), meta=np.ndarray> Indexes: county GeometryIndex (crs=EPSG:4326)
roundtripped.identical(cube)
True
Pickle, joblib#
As with any other file, you can serialize and deserialize Xarray objects with the pickle
and joblib
libraries. However, keep in mind that such a file needs to be read in an environment with the same or similar versions of all dependencies to avoid issues.
Create an example Dataset based on the US demographic data as in the introductory notebook.
import pickle
import joblib
Pickle#
Save the object from memory to a file using the built-in pickle
package:
with open("cube.pickle", "wb") as f:
pickle.dump(cube, f)
And read it back to memory:
with open("cube.pickle", "rb") as f:
cube2 = pickle.load(f)
cube2
<xarray.Dataset> Size: 420kB Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object 25kB POLYGON ((-95.34258270263672 48.546703... * year (year) int64 32B 1960 1970 1980 1990 Data variables: population (county, year) int64 99kB 4304 3987 3764 ... 43766 55800 65077 unemployment (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489 divorce (county, year) float64 99kB 1.859 2.62 3.747 ... 4.782 7.415 age (county, year) float64 99kB 28.8 30.5 34.5 ... 28.97 35.33 Indexes: county GeometryIndex (crs=EPSG:4326)
Joblib#
Larger objects may be better to save using the joblib package that allows compression.
with open("cube.joblib.compressed", "wb") as f:
joblib.dump(cube, f, compress=True)
And decompress and read it back to memory:
with open("cube.joblib.compressed", "rb") as f:
cube3 = joblib.load(f)
cube3
<xarray.Dataset> Size: 420kB Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object 25kB POLYGON ((-95.34258270263672 48.546703... * year (year) int64 32B 1960 1970 1980 1990 Data variables: population (county, year) int64 99kB 4304 3987 3764 ... 43766 55800 65077 unemployment (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489 divorce (county, year) float64 99kB 1.859 2.62 3.747 ... 4.782 7.415 age (county, year) float64 99kB 28.8 30.5 34.5 ... 28.97 35.33 Indexes: county GeometryIndex (crs=EPSG:4326)
In this case, the cube.pickle
file has 1.8 MB while the cube.joblib.compressed
has 823 KB.
GIS file formats#
Another option is to convert the data to a GeoDataFrame and save it to any geospatial file format or a GeoParquet if there are multiple geometry columns.
Long table form#
Long table form needs to save each value in a row with all its dimension coordinates. While this option is possible, it can be inefficient, leading to a high degree of repetition and very long tables. However, it is readily supported by Xarray and Xvec.
long = cube.xvec.to_geodataframe()
long
county | population | unemployment | divorce | age | |
---|---|---|---|---|---|
year | |||||
1960 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 4304 | 7.900000 | 1.858974 | 28.800000 |
1970 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 3987 | 9.000000 | 2.619808 | 30.500000 |
1980 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 3764 | 5.902579 | 3.746594 | 34.500000 |
1990 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 4076 | 3.894790 | 7.388535 | 35.500000 |
1960 | POLYGON ((-118.85050 47.94969, -118.84732 48.4... | 3889 | 8.200000 | 2.863278 | 25.900000 |
... | ... | ... | ... | ... | ... |
1990 | POLYGON ((-77.53178 38.56506, -77.72094 38.840... | 250377 | 3.353981 | 5.750177 | 28.466667 |
1960 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 39260 | 5.218039 | 2.777586 | 28.250000 |
1970 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 43766 | 5.444291 | 2.724604 | 26.400000 |
1980 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 55800 | 7.017943 | 4.782264 | 28.966667 |
1990 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 65077 | 5.488553 | 7.414559 | 35.333333 |
12340 rows × 5 columns
You can see that the result is longer than the original GeoDataFrame that had each year encoded as a column.
You can then save it using to_file
or to_parquet
.
long.to_file("cube.gpkg")
or
long.to_parquet("cube.parquet")
You can then re-create the Dataset using the from_dataframe
method with the geometry column set as an index.
xr.Dataset.from_dataframe(long.set_index("county", append=True)).xvec.set_geom_indexes(
"county", crs=long.crs
)
<xarray.Dataset> Size: 420kB Dimensions: (year: 4, county: 3085) Coordinates: * year (year) int64 32B 1960 1970 1980 1990 * county (county) object 25kB MULTIPOLYGON (((-97.25125122070312 26.... Data variables: population (year, county) int64 99kB 151098 20084 180904 ... 30167 62496 unemployment (year, county) float64 99kB 8.4 4.0 6.3 ... 4.464 6.393 4.379 divorce (year, county) float64 99kB 1.584 1.219 1.148 ... 7.917 8.998 age (year, county) float64 99kB 20.5 18.9 19.8 ... 37.0 27.2 34.1 Indexes: county GeometryIndex (crs=EPSG:4326)
Wide table form#
The other option is to create a wide table form which is what we started with. Each combination of coordinate and data variables except geometry is encoded as a column (e.g. ["population_1960", "population_1970", "population_1980", "population_1990"]
or using the MultiIndex
like below). However, this works only when the Dataset or DataArray is indexed by one geometry dimension only.
To create a valid GeoDataFrame, you would need to do something like this:
wide = cube.xvec.to_geodataframe(long=False)
wide
county | population | unemployment | divorce | age | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | 1960 | 1970 | 1980 | 1990 | 1960 | 1970 | 1980 | 1990 | 1960 | 1970 | 1980 | 1990 | 1960 | 1970 | 1980 | 1990 | |
0 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 4304 | 3987 | 3764 | 4076 | 7.900000 | 9.000000 | 5.902579 | 3.894790 | 1.858974 | 2.619808 | 3.746594 | 7.388535 | 28.80 | 30.5 | 34.500000 | 35.500000 |
1 | POLYGON ((-118.85050 47.94969, -118.84732 48.4... | 3889 | 3655 | 5811 | 6295 | 8.200000 | 15.400000 | 15.422886 | 16.811594 | 2.863278 | 3.686007 | 6.625442 | 11.543135 | 25.90 | 27.1 | 27.200000 | 32.800000 |
2 | POLYGON ((-117.43777 48.04422, -117.54113 48.0... | 17884 | 17405 | 28979 | 30948 | 10.100000 | 9.000000 | 13.574064 | 10.700794 | 2.711447 | 2.976378 | 5.448223 | 9.123212 | 29.60 | 31.8 | 28.700000 | 34.500000 |
3 | POLYGON ((-118.97096 47.93928, -118.97293 47.9... | 25520 | 25867 | 30639 | 33350 | 7.500000 | 10.500000 | 12.700195 | 10.203544 | 3.372041 | 4.090386 | 7.122333 | 9.245627 | 29.40 | 31.1 | 31.200000 | 35.000000 |
4 | POLYGON ((-117.43750 49.00000, -117.03098 49.0... | 6914 | 6025 | 8580 | 8915 | 10.600000 | 13.400000 | 18.148999 | 14.991023 | 3.008988 | 4.010695 | 5.287860 | 10.158351 | 31.20 | 33.8 | 31.300000 | 36.100000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3080 | POLYGON ((-114.51984 33.02767, -114.55830 33.0... | 46235 | 60827 | 90554 | 120739 | 4.600000 | 4.800000 | 8.012906 | 10.910351 | 3.702838 | 4.160574 | 5.730308 | 7.798840 | 25.70 | 24.6 | 27.900000 | 33.900000 |
3081 | POLYGON ((-107.19484 34.58352, -107.71780 34.5... | 39085 | 40576 | 61115 | 69029 | 6.100000 | 3.800000 | 8.375127 | 11.002271 | 1.880621 | 2.274648 | 5.372952 | 9.078873 | 20.10 | 21.7 | 25.500000 | 30.450000 |
3082 | POLYGON ((-76.39569 37.10771, -76.40270 37.090... | 21583 | 33203 | 44189 | 53427 | 2.500000 | 3.105065 | 5.037045 | 4.102888 | 1.113223 | 1.550323 | 3.257639 | 4.850746 | 24.40 | 24.4 | 29.800000 | 34.000000 |
3083 | POLYGON ((-77.53178 38.56506, -77.72094 38.840... | 50164 | 111102 | 166665 | 250377 | 3.300000 | 2.361559 | 3.708886 | 3.353981 | 1.672391 | 1.468343 | 3.742207 | 5.750177 | 22.30 | 21.8 | 25.666667 | 28.466667 |
3084 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 39260 | 43766 | 55800 | 65077 | 5.218039 | 5.444291 | 7.017943 | 5.488553 | 2.777586 | 2.724604 | 4.782264 | 7.414559 | 28.25 | 26.4 | 28.966667 | 35.333333 |
3085 rows × 17 columns
And a re-creation of the Dataset.
wide.set_index("county", append=True).stack().reset_index(
level=0, drop=True
).to_xarray().xvec.set_geom_indexes("county", crs=wide.crs)
/var/folders/sc/hsdthq2x7mnbfvwnk7rp4xmr0000gn/T/ipykernel_52609/2455462364.py:1: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
wide.set_index("county", append=True).stack().reset_index(
<xarray.Dataset> Size: 420kB Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object 25kB MULTIPOLYGON (((-97.25125122070312 26.... * year (year) object 32B 1960 1970 1980 1990 Data variables: population (county, year) int64 99kB 151098 140368 209727 ... 54981 62496 unemployment (county, year) float64 99kB 8.4 6.6 7.757 ... 3.0 4.898 4.379 divorce (county, year) float64 99kB 1.584 1.91 3.21 ... 6.656 8.998 age (county, year) float64 99kB 20.5 21.8 25.0 ... 25.3 30.1 34.1 Indexes: county GeometryIndex (crs=EPSG:4326)