Introduction to vector data cubes#

In geospatial analysis, data cubes can be of two sorts. The first is a raster data cube, typically represented by an Xarray DataArray indexed either by x or y dimensions or latitude and longitude. The second is a vector data cube, which is an n-D array that has either at least one dimension indexed by a 2-D array of vector geometries (Pebesma, 2022) or contains geometries as variables (e.g. moving features or time-evolving shapes), possibly both.

We can distinguish between two types of geometries in a DataArray or Dataset:

  • coordinate geometry - an array (typically one dimensional) is used as coordinates along one or more dimensions. A typical example would be an outcome of zonal statistics of a multi-dimensional raster, avoiding the need for flattenning of the array to a data frame.

  • variable geometry - an array (typicially multi-dimensional) is used as a variable within a DataArray. This may encode evolving shapes of lava flows in time, trajectories, or growth of city limits.

The Xvec package brings support for both of these to the Xarray ecosystem. It uses Shapely1Shapely 2 or newer is required. Shapely 1.8 or older will not work with Xvec, as the geometries are not hashable before the 2.0 release. package, allowing a seamless interface between Xvec and GeoPandas.

Let’s have a look at how that looks in practice. You need to import xarray and xvec to enable the .xvec accessor on Xarray objects. For this tutorial, you will also use’ geopandas’ to read the geometries and geodatasets to fetch the data.

import geopandas as gpd
import xarray as xr
from geodatasets import get_path

import xvec

The simplest DataArray with coordinate geometry#

The simplest vector data cube would be an array of values indexed by an array of geometries, conceptually similar to geopandas.GeoSeries.

Open a "geoda.natregimes" dataset that contains US counties with some demographic data and create a data cube of population counts out of the GeoDataFrame.

counties = gpd.read_file(get_path("geoda.natregimes"))

For the simplest case, use just a geometry column and a column PO60 representing the population in 1960.

pop1960 = xr.DataArray(counties.PO60, coords=[counties.geometry], dims=["county"])
pop1960
<xarray.DataArray 'PO60' (county: 3085)> Size: 12kB
array([ 4304,  3889, 17884, ..., 21583, 50164, 39260],
      shape=(3085,), dtype=int32)
Coordinates:
  * county   (county) object 25kB POLYGON ((-95.34258270263672 48.54670333862...

This is the basic data structure. To enable geometric functionality provided by Xvec, you need to turn the counties index, which is by default a PandasIndex, into an xvec.GeometryIndex. The easiest method is to use the .xvec accessor and the set_geom_indexes method.

pop1960 = pop1960.xvec.set_geom_indexes("county", crs=counties.crs)
pop1960.xindexes
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

However, it is often better to use GeoPandas instead at this dimensionality. The relationship between GeoDataFrames and vector data cubes is similar to that of a DataFrame and a DataArray. With lower dimensionality, DataFrame is often a better choice. With a higher one, DataArray is a more fitting data structure.

The meaningful DataArray#

The counties data frame has time (1960, 1970, 1980, 1990) encoded in column names, which is not ideal, so you can turn it into a DataArray indexed by geometry and by time.

population = xr.DataArray(
    counties[["PO60", "PO70", "PO80", "PO90"]],
    coords=(counties.geometry, [1960, 1970, 1980, 1990]),
    dims=("county", "year"),
).xvec.set_geom_indexes("county", crs=counties.crs)
population
<xarray.DataArray (county: 3085, year: 4)> Size: 49kB
array([[  4304,   3987,   3764,   4076],
       [  3889,   3655,   5811,   6295],
       [ 17884,  17405,  28979,  30948],
       ...,
       [ 21583,  33203,  44189,  53427],
       [ 50164, 111102, 166665, 250377],
       [ 39260,  43766,  55800,  65077]], shape=(3085, 4), dtype=int32)
Coordinates:
  * county   (county) object 25kB POLYGON ((-95.34258270263672 48.54670333862...
  * year     (year) int64 32B 1960 1970 1980 1990
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

This is the simplest form of a vector data cube that is useful.

The Dataset with coordinate geometry#

A proper data cube is represented as a Dataset in the Xarray world. You can index a dataset by geometry in precisely the same way as a DataArray. For example, you can use the counties GeoDataFrame and turn it into a Dataset with population, unemployment rate, divorce rate and median age.

cube = xr.Dataset(
    data_vars=dict(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    ),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
<xarray.Dataset> Size: 370kB
Dimensions:       (county: 3085, year: 4)
Coordinates:
  * county        (county) object 25kB POLYGON ((-95.34258270263672 48.546703...
  * year          (year) int64 32B 1960 1970 1980 1990
Data variables:
    population    (county, year) int32 49kB 4304 3987 3764 ... 43766 55800 65077
    unemployment  (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489
    divorce       (county, year) float64 99kB 1.859 2.62 3.747 ... 4.782 7.415
    age           (county, year) float64 99kB 28.8 30.5 34.5 ... 28.97 35.33
Indexes:
    county   GeometryIndex (crs=EPSG:4326)

Now you have a Dataset with 4 variables, indexed by the year and county geometry—a so-called vector data cube with coordinate geometry.

You can check that it has a dimension with geometry:

cube.xvec.geom_coords
Coordinates:
  * county   (county) object 25kB POLYGON ((-95.34258270263672 48.54670333862...

And that this dimension is properly indexed by the GeometryIndex.

cube.xvec.geom_coords_indexed
Coordinates:
  * county   (county) object 25kB POLYGON ((-95.34258270263672 48.54670333862...

The DataArray with variable geometry#

The second type of vector data cubes is the one where geometry objects represent DataArray’s variable. To illustrate it, a subset of glacier data from Svalbard is a nice small example.

glaciers_df = gpd.read_file("https://github.com/loreabad6/post/raw/refs/heads/main/inst/extdata/svalbard.gpkg")
glaciers_df.head(4)
name year length fwidth geometry
0 Austre Brøggerbreen 1936.0 5808.299579 1254.324348 POLYGON ((432375.11 8761657.493, 432374.075 87...
1 Austre Brøggerbreen 1990.0 5265.479773 470.127452 POLYGON ((432827.911 8760071.745, 432927.89 87...
2 Austre Brøggerbreen 2007.0 4886.823854 888.406938 POLYGON ((430859.331 8760810.068, 430859.331 8...
3 Austre Lovenbreen 1936.0 4655.052772 1215.718290 POLYGON ((439050.076 8760008.254, 439165.072 8...

The data frame contains a shape of three glaciers captures in three different years (1936, 1990, 2007), with the glacier tongue length captures for each year. It is clear that name and year are prime examples for data cube dimensions, while length and geometry are suited to be variables.

glaciers = glaciers_df.set_index(["name", "year"]).to_xarray()
glaciers
<xarray.Dataset> Size: 424B
Dimensions:   (name: 5, year: 3)
Coordinates:
  * name      (name) object 40B 'Austre Brøggerbreen' ... 'Steenbreen'
  * year      (year) float64 24B 1.936e+03 1.99e+03 2.007e+03
Data variables:
    length    (name, year) float64 120B 5.808e+03 5.265e+03 ... 1.819e+03
    fwidth    (name, year) float64 120B 1.254e+03 470.1 888.4 ... 279.4 202.6
    geometry  (name, year) object 120B POLYGON ((432375.11039999966 8761657.4...

In principle, nothing else is needed. However, you would typically want to retain the information about the coordinate reference system, so let’s assign one using the (experimental) xproj package.

glaciers = glaciers.proj.assign_crs(spatial_ref=glaciers_df.crs)
glaciers
<xarray.Dataset> Size: 432B
Dimensions:      (name: 5, year: 3)
Coordinates:
  * name         (name) object 40B 'Austre Brøggerbreen' ... 'Steenbreen'
  * year         (year) float64 24B 1.936e+03 1.99e+03 2.007e+03
  * spatial_ref  int64 8B 0
Data variables:
    length       (name, year) float64 120B 5.808e+03 5.265e+03 ... 1.819e+03
    fwidth       (name, year) float64 120B 1.254e+03 470.1 888.4 ... 279.4 202.6
    geometry     (name, year) object 120B POLYGON ((432375.11039999966 876165...
Indexes:
    spatial_ref  CRSIndex (crs=EPSG:32633)

Xvec offers a wide range of features on top of this data structure, from spatial queries to plotting and I/O.

What next?