Introduction to vector data cubes

Introduction to vector data cubes#

In geospatial analysis, data cubes can be of two sorts. The first is a raster data cube, typically represented by an Xarray DataArray indexed either by x or y dimensions or latitude and longitude. The second is a vector data cube, which is an n-D array with at least one dimension indexed by a 2-D array of vector geometries (Pebesma, 2022).

The Xvec package brings support for indexing by an array of vector geometries to the Xarray ecosystem. It uses Shapely1Shapely 2 or newer is required. Shapely 1.8 or older will not work with Xvec, as the geometries are not hashable before the 2.0 release. package, allowing a seamless interface between Xvec and GeoPandas.

Let’s have a look at how that looks in practice. You need to import xarray and xvec to enable the .xvec accessor on Xarray objects. For this tutorial, you will also use’ geopandas’ to read the geometries and geodatasets to fetch the data.

import geopandas as gpd
import xarray as xr
import xvec

from geodatasets import get_path

The simplest DataArray#

The simplest vector data cube would be an array of values indexed by an array of geometries, conceptually similar to geopandas.GeoSeries.

Open a "geoda.natregimes" dataset that contains US counties with some demographic data and create a data cube of population counts out of the GeoDataFrame.

counties = gpd.read_file(get_path("geoda.natregimes"))

For the simplest case, use just a geometry column and a column PO60 representing the population in 1960.

pop1960 = xr.DataArray(counties.PO60, coords=[counties.geometry], dims=["county"])
<xarray.DataArray 'PO60' (county: 3085)>
array([ 4304,  3889, 17884, ..., 21583, 50164, 39260])
  * county   (county) object POLYGON ((-95.34258270263672 48.54670333862305, ...

This is the basic data structure. To enable geometric functionality provided by Xvec, you need to turn the counties index, which is by default a PandasIndex, into an xvec.GeometryIndex. The easiest method is to use the .xvec accessor and the set_geom_indexes method.

pop1960 = pop1960.xvec.set_geom_indexes("county",
    county   GeometryIndex (crs=EPSG:4326)

However, it is often better to use GeoPandas instead at this dimensionality. The relationship between GeoDataFrames and vector data cubes is similar to that of a DataFrame and a DataArray. With lower dimensionality, DataFrame is often a better choice. With a higher one, DataArray is a more fitting data structure.

The meaningful DataArray#

The counties data frame has time (1960, 1970, 1980, 1990) encoded in column names, which is not ideal, so you can turn it into a DataArray indexed by geometry and by time.

population = xr.DataArray(
    counties[["PO60", "PO70", "PO80", "PO90"]],
    coords=(counties.geometry, [1960, 1970, 1980, 1990]),
    dims=("county", "year"),
<xarray.DataArray (county: 3085, year: 4)>
array([[  4304,   3987,   3764,   4076],
       [  3889,   3655,   5811,   6295],
       [ 17884,  17405,  28979,  30948],
       [ 21583,  33203,  44189,  53427],
       [ 50164, 111102, 166665, 250377],
       [ 39260,  43766,  55800,  65077]])
  * county   (county) object POLYGON ((-95.34258270263672 48.54670333862305, ...
  * year     (year) int64 1960 1970 1980 1990
    county   GeometryIndex (crs=EPSG:4326)

This is the simplest form of a vector data cube that is useful.

The Dataset#

A proper data cube is represented as a Dataset in the Xarray world. You can index a dataset by geometry in precisely the same way as a DataArray. For example, you can use the counties GeoDataFrame and turn it into a Dataset with population, unemployment rate, divorce rate and median age.

cube = xr.Dataset(
        population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
        unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
        divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
        age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
    coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
Dimensions:       (county: 3085, year: 4)
  * county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
  * year          (year) int64 1960 1970 1980 1990
Data variables:
    population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
    unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
    divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
    age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
    county   GeometryIndex (crs=EPSG:4326)

Now you have a Dataset with 4 variables, indexed by the year and county geometry—a so-called vector data cube.

You can check that it has a dimension with geometry:

  * county   (county) object POLYGON ((-95.34258270263672 48.54670333862305, ...

And that this dimension is properly indexed by the GeometryIndex.

  * county   (county) object POLYGON ((-95.34258270263672 48.54670333862305, ...

What next?