# Introduction to vector data cubes#

In geospatial analysis, data cubes can be of two sorts. The first is a raster data cube, typically represented by an Xarray DataArray indexed either by `x` or `y` dimensions or `latitude` and `longitude`. The second is a vector data cube, which is an n-D array with at least one dimension indexed by a 2-D array of vector geometries (Pebesma, 2022).

The Xvec package brings support for indexing by an array of vector geometries to the Xarray ecosystem. It uses Shapely1Shapely 2 or newer is required. Shapely 1.8 or older will not work with Xvec, as the geometries are not hashable before the 2.0 release. package, allowing a seamless interface between Xvec and GeoPandas.

Let’s have a look at how that looks in practice. You need to import `xarray` and `xvec` to enable the `.xvec` accessor on Xarray objects. For this tutorial, you will also use’ geopandas’ to read the geometries and `geodatasets` to fetch the data.

```import geopandas as gpd
import xarray as xr
import xvec

from geodatasets import get_path
```

## The simplest DataArray#

The simplest vector data cube would be an array of values indexed by an array of geometries, conceptually similar to `geopandas.GeoSeries`.

Open a `"geoda.natregimes"` dataset that contains US counties with some demographic data and create a data cube of population counts out of the GeoDataFrame.

```counties = gpd.read_file(get_path("geoda.natregimes"))
```

For the simplest case, use just a geometry column and a column `PO60` representing the population in 1960.

```pop1960 = xr.DataArray(counties.PO60, coords=[counties.geometry], dims=["county"])
pop1960
```
```<xarray.DataArray 'PO60' (county: 3085)>
array([ 4304,  3889, 17884, ..., 21583, 50164, 39260])
Coordinates:
* county   (county) object POLYGON ((-95.34258270263672 48.54670333862305, ...```

This is the basic data structure. To enable geometric functionality provided by Xvec, you need to turn the `counties` index, which is by default a `PandasIndex`, into an `xvec.GeometryIndex`. The easiest method is to use the `.xvec` accessor and the `set_geom_indexes` method.

```pop1960 = pop1960.xvec.set_geom_indexes("county", crs=counties.crs)
pop1960.xindexes
```
```Indexes:
county   GeometryIndex (crs=EPSG:4326)
```

However, it is often better to use GeoPandas instead at this dimensionality. The relationship between GeoDataFrames and vector data cubes is similar to that of a DataFrame and a DataArray. With lower dimensionality, DataFrame is often a better choice. With a higher one, DataArray is a more fitting data structure.

## The meaningful DataArray#

The `counties` data frame has time (1960, 1970, 1980, 1990) encoded in column names, which is not ideal, so you can turn it into a DataArray indexed by geometry and by time.

```population = xr.DataArray(
counties[["PO60", "PO70", "PO80", "PO90"]],
coords=(counties.geometry, [1960, 1970, 1980, 1990]),
dims=("county", "year"),
).xvec.set_geom_indexes("county", crs=counties.crs)
population
```
```<xarray.DataArray (county: 3085, year: 4)>
array([[  4304,   3987,   3764,   4076],
[  3889,   3655,   5811,   6295],
[ 17884,  17405,  28979,  30948],
...,
[ 21583,  33203,  44189,  53427],
[ 50164, 111102, 166665, 250377],
[ 39260,  43766,  55800,  65077]])
Coordinates:
* county   (county) object POLYGON ((-95.34258270263672 48.54670333862305, ...
* year     (year) int64 1960 1970 1980 1990
Indexes:
county   GeometryIndex (crs=EPSG:4326)```

This is the simplest form of a vector data cube that is useful.

## The Dataset#

A proper data cube is represented as a Dataset in the Xarray world. You can index a dataset by geometry in precisely the same way as a DataArray. For example, you can use the `counties` GeoDataFrame and turn it into a Dataset with population, unemployment rate, divorce rate and median age.

```cube = xr.Dataset(
data_vars=dict(
population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]),
unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]),
divorce=(["county", "year"], counties[["DV60", "DV70", "DV80", "DV90"]]),
age=(["county", "year"], counties[["MA60", "MA70", "MA80", "MA90"]]),
),
coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]),
).xvec.set_geom_indexes("county", crs=counties.crs)
cube
```
```<xarray.Dataset>
Dimensions:       (county: 3085, year: 4)
Coordinates:
* county        (county) object POLYGON ((-95.34258270263672 48.54670333862...
* year          (year) int64 1960 1970 1980 1990
Data variables:
population    (county, year) int64 4304 3987 3764 4076 ... 43766 55800 65077
unemployment  (county, year) float64 7.9 9.0 5.903 ... 5.444 7.018 5.489
divorce       (county, year) float64 1.859 2.62 3.747 ... 2.725 4.782 7.415
age           (county, year) float64 28.8 30.5 34.5 ... 26.4 28.97 35.33
Indexes:
county   GeometryIndex (crs=EPSG:4326)```

Now you have a Dataset with 4 variables, indexed by the year and county geometry—a so-called vector data cube.

You can check that it has a dimension with geometry:

```cube.xvec.geom_coords
```
```Coordinates:
* county   (county) object POLYGON ((-95.34258270263672 48.54670333862305, ...
```

And that this dimension is properly indexed by the `GeometryIndex`.

```cube.xvec.geom_coords_indexed
```
```Coordinates:
* county   (county) object POLYGON ((-95.34258270263672 48.54670333862305, ...
```

What next?