Indexing and selecting data

Indexing and selecting data#

DataArrays and Datasets with an xvec.GeometryIndex support standard indexing, slicing and selection from Xarray on non-geometric dimensions plus specific spatial indexing options based on geometric dimensions. To make the example more interesting, create a Dataset of trips between individual taxi zones in New York City in January 2022.

import datetime

import geopandas as gpd
import pandas as pd
import xarray as xr
import xvec

from shapely import Point, box

You can index the data by the payment type, day of the month, the hour of the day, origin zone and destination zone. For example, you can check trip count, mean trip distance, fare amount and tip amount. It may be better to create the data as sparse arrays, but those do not support all indexing methods, so it is better to use dense arrays in this example.

Hide code cell source
# Load the data
trips = pd.read_parquet(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"
)  # 33MB
zones = gpd.read_file(
    "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip"
)  # 1MB
lookup = pd.read_csv("https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv")

# create variables for day and hour
trips["date"] = trips.tpep_pickup_datetime.dt.date
trips["hour"] = trips.tpep_pickup_datetime.dt.hour

# use groupby over five columns to create a mutli-indexed DataFrame with aggregations
# and create a Dataset backed by sparse arrays
taxi_trips = xr.Dataset.from_dataframe(
    trips[  # filter only trips with known locations
        trips.PULocationID.isin(zones.LocationID)
        & trips.DOLocationID.isin(zones.LocationID)
    ]
    .groupby(["payment_type", "date", "hour", "PULocationID", "DOLocationID"])
    .agg(
        {
            "trip_distance": "mean",
            "VendorID": "count",
            "tip_amount": "mean",
            "fare_amount": "mean",
        }
    ),
)

# Replace int codes with labels
taxi_trips["payment_type"] = [
    "Credit card",
    "Cash",
    "No charge",
    "Dispute",
    "Unknown",
    "Voided trip",
]

# create linkable geometry variable
taxi_zones = (
    lookup.merge(
        zones.dissolve("LocationID")[["zone", "geometry"]],
        left_on="Zone",
        right_on="zone",
        how="left",
    )
    .set_index("LocationID")
    .geometry
)
# replace location IDs with actual geometries
taxi_trips["PULocationID"] = taxi_zones.loc[taxi_trips.PULocationID].values
taxi_trips["DOLocationID"] = taxi_zones.loc[taxi_trips.DOLocationID].values

# rename
taxi_trips = taxi_trips.rename(
    {"PULocationID": "origin", "DOLocationID": "destination", "VendorID": "trips_count"}
)

# assing GeometryIndex
taxi_trips = taxi_trips.xvec.set_geom_indexes(["origin", "destination"], crs=zones.crs)
taxi_trips
<xarray.Dataset>
Dimensions:        (payment_type: 6, date: 40, hour: 24, origin: 252,
                    destination: 257)
Coordinates:
  * payment_type   (payment_type) <U11 'Credit card' 'Cash' ... 'Voided trip'
  * date           (date) object 2008-12-31 2009-01-01 ... 2022-04-06 2022-05-18
  * hour           (hour) int32 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23
  * origin         (origin) object POLYGON ((933100.9183527103 192536.0856972...
  * destination    (destination) object POLYGON ((933100.9183527103 192536.08...
Data variables:
    trip_distance  (payment_type, date, hour, origin, destination) float64 na...
    trips_count    (payment_type, date, hour, origin, destination) float64 na...
    tip_amount     (payment_type, date, hour, origin, destination) float64 na...
    fare_amount    (payment_type, date, hour, origin, destination) float64 na...
Indexes:
    origin       GeometryIndex (crs=EPSG:2263)
    destination  GeometryIndex (crs=EPSG:2263)

The dataset is created with two dimensions with a GeometryIndex.

taxi_trips.xindexes
Indexes:
    payment_type  PandasIndex
    date          PandasIndex
    hour          PandasIndex
    origin        GeometryIndex (crs=EPSG:2263)
    destination   GeometryIndex (crs=EPSG:2263)

Selection by geometry#

Geometry as a label#

You can select data based on geometry as with any other index, treating it as a label.

taxi_trips.sel(destination=[zones.geometry[0], zones.geometry[3]])
<xarray.Dataset>
Dimensions:        (payment_type: 6, date: 40, hour: 24, origin: 252,
                    destination: 2)
Coordinates:
  * payment_type   (payment_type) <U11 'Credit card' 'Cash' ... 'Voided trip'
  * date           (date) object 2008-12-31 2009-01-01 ... 2022-04-06 2022-05-18
  * hour           (hour) int32 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23
  * origin         (origin) object POLYGON ((933100.9183527103 192536.0856972...
  * destination    (destination) object POLYGON ((933100.9183527103 192536.08...
Data variables:
    trip_distance  (payment_type, date, hour, origin, destination) float64 na...
    trips_count    (payment_type, date, hour, origin, destination) float64 na...
    tip_amount     (payment_type, date, hour, origin, destination) float64 na...
    fare_amount    (payment_type, date, hour, origin, destination) float64 na...
Indexes:
    origin       GeometryIndex (crs=EPSG:2263)
    destination  GeometryIndex (crs=EPSG:2263)

Nearest#

Alternatively, you can select based on the nearest neighbor Remember that all geometries, those in the index and those in the query, must use the same Coordinate Reference System..

taxi_trips.sel(
    date=datetime.datetime(2022, 1, 28),
    hour=12,
    origin=[Point(1064321, 211194), Point(988669, 207721)],
    destination=[Point(998142, 191215), Point(1010116, 42998)],
    method="nearest",
)
<xarray.Dataset>
Dimensions:        (payment_type: 6, origin: 2, destination: 2)
Coordinates:
  * payment_type   (payment_type) <U11 'Credit card' 'Cash' ... 'Voided trip'
    date           object 2022-01-28
    hour           int32 12
  * origin         (origin) object POLYGON ((1066997.4698108435 212947.336632...
  * destination    (destination) object POLYGON ((1000036.9036584795 194829.4...
Data variables:
    trip_distance  (payment_type, origin, destination) float64 nan nan ... nan
    trips_count    (payment_type, origin, destination) float64 nan nan ... nan
    tip_amount     (payment_type, origin, destination) float64 nan nan ... nan
    fare_amount    (payment_type, origin, destination) float64 nan nan ... nan
Indexes:
    origin       GeometryIndex (crs=EPSG:2263)
    destination  GeometryIndex (crs=EPSG:2263)

Spatial query#

Spatial-aware data selection using the “query” mode with a single geometry and a given predicate:

taxi_trips.sel(origin=box(998142, 191215, 1024321, 211194), method="intersects")
<xarray.Dataset>
Dimensions:        (payment_type: 6, date: 40, hour: 24, origin: 25,
                    destination: 257)
Coordinates:
  * payment_type   (payment_type) <U11 'Credit card' 'Cash' ... 'Voided trip'
  * date           (date) object 2008-12-31 2009-01-01 ... 2022-04-06 2022-05-18
  * hour           (hour) int32 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23
  * origin         (origin) object POLYGON ((1000036.9036584795 194829.433560...
  * destination    (destination) object POLYGON ((933100.9183527103 192536.08...
Data variables:
    trip_distance  (payment_type, date, hour, origin, destination) float64 na...
    trips_count    (payment_type, date, hour, origin, destination) float64 na...
    tip_amount     (payment_type, date, hour, origin, destination) float64 na...
    fare_amount    (payment_type, date, hour, origin, destination) float64 na...
Indexes:
    origin       GeometryIndex (crs=EPSG:2263)
    destination  GeometryIndex (crs=EPSG:2263)

Spatial query using the sel() method with predicates other than "nearest" supports only scalar geometries as an input. If you want to query using an array of geometries, you can use the .xvec.query() method instead.

taxi_trips.xvec.query(
    "origin", [Point(1064321, 211194), Point(1064321, 211194).buffer(500)]
)
<xarray.Dataset>
Dimensions:        (payment_type: 6, date: 40, hour: 24, origin: 4,
                    destination: 257)
Coordinates:
  * payment_type   (payment_type) <U11 'Credit card' 'Cash' ... 'Voided trip'
  * date           (date) object 2008-12-31 2009-01-01 ... 2022-04-06 2022-05-18
  * hour           (hour) int32 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23
  * origin         (origin) object POLYGON ((1060888.899369195 212784.6402245...
  * destination    (destination) object POLYGON ((933100.9183527103 192536.08...
Data variables:
    trip_distance  (payment_type, date, hour, origin, destination) float64 na...
    trips_count    (payment_type, date, hour, origin, destination) float64 na...
    tip_amount     (payment_type, date, hour, origin, destination) float64 na...
    fare_amount    (payment_type, date, hour, origin, destination) float64 na...
Indexes:
    origin       GeometryIndex (crs=EPSG:2263)
    destination  GeometryIndex (crs=EPSG:2263)

.xvec.query() is a wrapper around shapely.STRtree and returns the subset of the original object where the bounding box of each input geometry intersects the bounding box of geometry in a GeometryIndex. If a predicate is provided, the tree geometries are first queried based on the bounding box of the input geometry. Then they are further filtered to those that meet the predicate when comparing the input geometry to the tree geometry: predicate(geometry, index_geometry).

taxi_trips.xvec.query(
    "origin",
    [Point(1064321, 211194), Point(1064321, 211194).buffer(500)],
    predicate="within",
)
<xarray.Dataset>
Dimensions:        (payment_type: 6, date: 40, hour: 24, origin: 2,
                    destination: 257)
Coordinates:
  * payment_type   (payment_type) <U11 'Credit card' 'Cash' ... 'Voided trip'
  * date           (date) object 2008-12-31 2009-01-01 ... 2022-04-06 2022-05-18
  * hour           (hour) int32 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23
  * origin         (origin) object POLYGON ((1066997.4698108435 212947.336632...
  * destination    (destination) object POLYGON ((933100.9183527103 192536.08...
Data variables:
    trip_distance  (payment_type, date, hour, origin, destination) float64 na...
    trips_count    (payment_type, date, hour, origin, destination) float64 na...
    tip_amount     (payment_type, date, hour, origin, destination) float64 na...
    fare_amount    (payment_type, date, hour, origin, destination) float64 na...
Indexes:
    origin       GeometryIndex (crs=EPSG:2263)
    destination  GeometryIndex (crs=EPSG:2263)

Since multiple query geometries may return the same index geometry, the method, by default, returns duplicated observations. That can be filtered by passing unique=True.

taxi_trips.xvec.query(
    "origin",
    [Point(1064321, 211194), Point(1064321, 211194).buffer(500)],
    predicate="within",
    unique=True,
)
<xarray.Dataset>
Dimensions:        (payment_type: 6, date: 40, hour: 24, origin: 1,
                    destination: 257)
Coordinates:
  * payment_type   (payment_type) <U11 'Credit card' 'Cash' ... 'Voided trip'
  * date           (date) object 2008-12-31 2009-01-01 ... 2022-04-06 2022-05-18
  * hour           (hour) int32 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23
  * origin         (origin) object POLYGON ((1066997.4698108435 212947.336632...
  * destination    (destination) object POLYGON ((933100.9183527103 192536.08...
Data variables:
    trip_distance  (payment_type, date, hour, origin, destination) float64 na...
    trips_count    (payment_type, date, hour, origin, destination) float64 na...
    tip_amount     (payment_type, date, hour, origin, destination) float64 na...
    fare_amount    (payment_type, date, hour, origin, destination) float64 na...
Indexes:
    origin       GeometryIndex (crs=EPSG:2263)
    destination  GeometryIndex (crs=EPSG:2263)

When using a predicate "dwithin" (search for geometries within a set distance) you can also pass the distance argument.

taxi_trips.xvec.query(
    "origin",
    [Point(1064321, 211194), Point(1064321, 211194).buffer(500)],
    predicate="dwithin",
    unique=True,
    distance=5000,
)
<xarray.Dataset>
Dimensions:        (payment_type: 6, date: 40, hour: 24, origin: 3,
                    destination: 257)
Coordinates:
  * payment_type   (payment_type) <U11 'Credit card' 'Cash' ... 'Voided trip'
  * date           (date) object 2008-12-31 2009-01-01 ... 2022-04-06 2022-05-18
  * hour           (hour) int32 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23
  * origin         (origin) object POLYGON ((1060888.899369195 212784.6402245...
  * destination    (destination) object POLYGON ((933100.9183527103 192536.08...
Data variables:
    trip_distance  (payment_type, date, hour, origin, destination) float64 na...
    trips_count    (payment_type, date, hour, origin, destination) float64 na...
    tip_amount     (payment_type, date, hour, origin, destination) float64 na...
    fare_amount    (payment_type, date, hour, origin, destination) float64 na...
Indexes:
    origin       GeometryIndex (crs=EPSG:2263)
    destination  GeometryIndex (crs=EPSG:2263)