Reading and writing files#
Vector data cubes do not have an interoperable file format that can be used to save any Xarray object with geometric dimensions to a file and read it back or in other software like {stars}
in the R-Spatial ecosystem.
CF conventions and NetCDF offer a way of saving a Dataset with a single dimension indexed by geometry, but this still needs to be implemented in Xvec. While we are working on a solution (see this issue), some options can be used in the meantime.
Pickle, joblib#
As with any other file, you can serialize and deserialize Xarray objects with the pickle
and joblib
libraries. However, keep in mind that such a file needs to be read in an environment with the same or similar versions of all dependencies to avoid issues.
import pickle
import joblib
import geopandas as gpd
import xarray as xr
import xvec
from geodatasets import get_path
Create an example Dataset based on the US demographic data as in the introductory notebook.
counties = gpd.read_file(get_path("geoda.natregimes"))
cube = xr.Dataset(
data_vars=dict(
population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
),
coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
<xarray.Dataset> Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object POLYGON ((-95.34258270263672 48.54670333862... * year (year) int64 1960 1970 1980 1990 Data variables: population (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077 unemployment (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489 divorce (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415 age (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33 Indexes: county GeometryIndex (crs=EPSG:4326)
Pickle#
Save the object from memory to a file using the built-in pickle
package:
with open("cube.pickle", "wb") as f:
pickle.dump(cube, f)
And read it back to memory:
with open("cube.pickle", "rb") as f:
cube2 = pickle.load(f)
cube2
<xarray.Dataset> Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object POLYGON ((-95.34258270263672 48.54670333862... * year (year) int64 1960 1970 1980 1990 Data variables: population (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077 unemployment (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489 divorce (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415 age (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33 Indexes: county GeometryIndex (crs=EPSG:4326)
Joblib#
Larger objects may be better to save using the joblib package that allows compression.
with open("cube.joblib.compressed", "wb") as f:
joblib.dump(cube, f, compress=True)
And decompress and read it back to memory:
with open("cube.joblib.compressed", "rb") as f:
cube3 = joblib.load(f)
cube3
<xarray.Dataset> Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object POLYGON ((-95.34258270263672 48.54670333862... * year (year) int64 1960 1970 1980 1990 Data variables: population (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077 unemployment (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489 divorce (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415 age (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33 Indexes: county GeometryIndex (crs=EPSG:4326)
In this case, the cube.pickle
file has 1.8 MB while the cube.joblib.compressed
has 823 KB.
GIS file formats#
Another option is to convert the data to a GeoDataFrame and save it to any geospatial file format or a GeoParquet if there are multiple geometry columns.
Long table form#
Long table form needs to save each value in a row with all its dimension coordinates. While this option is possible, it can be inefficient, leading to a high degree of repetition and very long tables. However, it is readily supported by Xarray and Xvec.
long = cube.xvec.to_geodataframe()
long
county | population | unemployment | divorce | age | |
---|---|---|---|---|---|
year | |||||
1960 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 4304 | 7.900000 | 1.858974 | 28.800000 |
1970 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 3987 | 9.000000 | 2.619808 | 30.500000 |
1980 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 3764 | 5.902579 | 3.746594 | 34.500000 |
1990 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 4076 | 3.894790 | 7.388535 | 35.500000 |
1960 | POLYGON ((-118.85050 47.94969, -118.84732 48.4... | 3889 | 8.200000 | 2.863278 | 25.900000 |
... | ... | ... | ... | ... | ... |
1990 | POLYGON ((-77.53178 38.56506, -77.72094 38.840... | 250377 | 3.353981 | 5.750177 | 28.466667 |
1960 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 39260 | 5.218039 | 2.777586 | 28.250000 |
1970 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 43766 | 5.444291 | 2.724604 | 26.400000 |
1980 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 55800 | 7.017943 | 4.782264 | 28.966667 |
1990 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 65077 | 5.488553 | 7.414559 | 35.333333 |
12340 rows × 5 columns
You can see that the result is longer than the original GeoDataFrame that had each year encoded as a column.
You can then save it using to_file
or to_parquet
.
long.to_file("cube.gpkg")
or
long.to_parquet("cube.parquet")
You can then re-create the Dataset using the from_dataframe
method with the geometry column set as an index.
xr.Dataset.from_dataframe(long.set_index("county", append=True)).xvec.set_geom_indexes(
"county", crs=long.crs
)
<xarray.Dataset> Dimensions: (year: 4, county: 3085) Coordinates: * year (year) int64 1960 1970 1980 1990 * county (county) object POLYGON ((-95.34258270263672 48.54670333862... Data variables: population (year, county) int64 4304 3889 17884 ... 53427 250377 65077 unemployment (year, county) float64 7.9 8.2 10.1 7.5 ... 4.103 3.354 5.489 divorce (year, county) float64 1.859 2.863 2.711 ... 4.851 5.75 7.415 age (year, county) float64 28.8 25.9 29.6 ... 34.0 28.47 35.33 Indexes: county GeometryIndex (crs=EPSG:4326)
Wide table form#
The other option is to create a wide table form which is what we started with. Each combination of coordinate and data variables except geometry is encoded as a column (e.g. ["population_1960", "population_1970", "population_1980", "population_1990"]
or using the MultiIndex
like below). However, this works only when the Dataset or DataArray is indexed by one geometry dimension only.
To create a valid GeoDataFrame, you would need to do something like this:
wide = cube.xvec.to_geodataframe(long=False)
wide
county | population | unemployment | divorce | age | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | 1960 | 1970 | 1980 | 1990 | 1960 | 1970 | 1980 | 1990 | 1960 | 1970 | 1980 | 1990 | 1960 | 1970 | 1980 | 1990 | |
0 | POLYGON ((-95.34258 48.54670, -95.34081 48.715... | 4304 | 3987 | 3764 | 4076 | 7.900000 | 9.000000 | 5.902579 | 3.894790 | 1.858974 | 2.619808 | 3.746594 | 7.388535 | 28.80 | 30.5 | 34.500000 | 35.500000 |
1 | POLYGON ((-118.85050 47.94969, -118.84732 48.4... | 3889 | 3655 | 5811 | 6295 | 8.200000 | 15.400000 | 15.422886 | 16.811594 | 2.863278 | 3.686007 | 6.625442 | 11.543135 | 25.90 | 27.1 | 27.200000 | 32.800000 |
2 | POLYGON ((-117.43777 48.04422, -117.54113 48.0... | 17884 | 17405 | 28979 | 30948 | 10.100000 | 9.000000 | 13.574064 | 10.700794 | 2.711447 | 2.976378 | 5.448223 | 9.123212 | 29.60 | 31.8 | 28.700000 | 34.500000 |
3 | POLYGON ((-118.97096 47.93928, -118.97293 47.9... | 25520 | 25867 | 30639 | 33350 | 7.500000 | 10.500000 | 12.700195 | 10.203544 | 3.372041 | 4.090386 | 7.122333 | 9.245627 | 29.40 | 31.1 | 31.200000 | 35.000000 |
4 | POLYGON ((-117.43750 49.00000, -117.03098 49.0... | 6914 | 6025 | 8580 | 8915 | 10.600000 | 13.400000 | 18.148999 | 14.991023 | 3.008988 | 4.010695 | 5.287860 | 10.158351 | 31.20 | 33.8 | 31.300000 | 36.100000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3080 | POLYGON ((-114.51984 33.02767, -114.55830 33.0... | 46235 | 60827 | 90554 | 120739 | 4.600000 | 4.800000 | 8.012906 | 10.910351 | 3.702838 | 4.160574 | 5.730308 | 7.798840 | 25.70 | 24.6 | 27.900000 | 33.900000 |
3081 | POLYGON ((-107.19484 34.58352, -107.71780 34.5... | 39085 | 40576 | 61115 | 69029 | 6.100000 | 3.800000 | 8.375127 | 11.002271 | 1.880621 | 2.274648 | 5.372952 | 9.078873 | 20.10 | 21.7 | 25.500000 | 30.450000 |
3082 | POLYGON ((-76.39569 37.10771, -76.40270 37.090... | 21583 | 33203 | 44189 | 53427 | 2.500000 | 3.105065 | 5.037045 | 4.102888 | 1.113223 | 1.550323 | 3.257639 | 4.850746 | 24.40 | 24.4 | 29.800000 | 34.000000 |
3083 | POLYGON ((-77.53178 38.56506, -77.72094 38.840... | 50164 | 111102 | 166665 | 250377 | 3.300000 | 2.361559 | 3.708886 | 3.353981 | 1.672391 | 1.468343 | 3.742207 | 5.750177 | 22.30 | 21.8 | 25.666667 | 28.466667 |
3084 | POLYGON ((-111.37152 44.74516, -111.36839 45.3... | 39260 | 43766 | 55800 | 65077 | 5.218039 | 5.444291 | 7.017943 | 5.488553 | 2.777586 | 2.724604 | 4.782264 | 7.414559 | 28.25 | 26.4 | 28.966667 | 35.333333 |
3085 rows × 17 columns
And a re-creation of the Dataset.
wide.set_index("county", append=True).stack().reset_index(
level=0, drop=True
).to_xarray().xvec.set_geom_indexes("county", crs=wide.crs)
<xarray.Dataset> Dimensions: (county: 3085, year: 4) Coordinates: * county (county) object POLYGON ((-95.34258270263672 48.54670333862... * year (year) object 1960 1970 1980 1990 Data variables: age (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33 divorce (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415 population (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077 unemployment (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489 Indexes: county GeometryIndex (crs=EPSG:4326)