How to structure a geospatial database for multi-state retail chains

This page solves one specific task: laying out a PostGIS schema that keeps spatial queries sub-second when a single retail portfolio spans dozens of states, by combining jurisdiction-level partitioning, per-partition spatial indexes, and a deterministic ingest path that never lets an invalid geometry reach production.

Prerequisites

Before running anything below, the following must already be in place:

Requirement	Minimum	Notes
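PostgreSQL	13+	Declarative partitioning arrived in 10; 11 added `DEFAULT` partitions and primary keys on partitioned tables; 12 added `REINDEX CONCURRENTLY`; 13 moved `gen_random_uuid()` into core.
PostGIS extension	3.x	`CREATE EXTENSION postgis;`. `pgcrypto` is only needed for `gen_random_uuid()` on servers older than 13.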
Python	3.10+	For the validation and ingest pipeline.
`geopandas`	0.13+	Reads the source layer and runs the spatial-containment join.
`psycopg2-binary`	2.9+	Bulk insert via `execute_values`.
`shapely`	2.x	Topology repair with `make_valid`.
State boundary layer	TIGER/Line or equivalent	Used to derive `state_code` by containment rather than by a hard-coded bounding box.

The geometry column standardizes on EPSG:4326 (WGS 84) at ingest, mirroring the CRS discipline established in Data Validation Rules for Store Coordinates. For metric work (buffers, radius searches, drive-time joins), geometries are projected on the fly with ST_Transform(geom, 5070) to EPSG:5070 (NAD83 / Conus Albers, an equal-area projection for the contiguous US) or to a state-specific UTM zone.
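
Where a state-specific UTM zone is preferred over EPSG:5070, the zone can be derived from a representative longitude. The sketch below is a minimal illustration, assuming WGS 84 / UTM codes (EPSG:326xx north, 327xx south) and ignoring the Norway and Svalbard zone exceptions; the helper name utm_epsg_for is hypothetical, and states that straddle a zone boundary still need a manual choice.

```python
import math


def utm_epsg_for(lon: float, lat: float) -> int:
    """Return the WGS 84 / UTM EPSG code for a lon/lat pair in degrees."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError(f"coordinate out of range: ({lon}, {lat})")
    # UTM zones are 6 degrees wide, numbered 1..60 eastward from 180W.
    # Clamp so that lon == 180 maps to zone 60 rather than 61.
    zone = min(int(math.floor((lon + 180.0) / 6.0)) + 1, 60)
    # EPSG 326xx covers the northern hemisphere, 327xx the southern.
    base = 32600 if lat >= 0 else 32700
    return base + zone


if __name__ == "__main__":
    print(utm_epsg_for(-97.74, 30.27))   # Austin, TX -> 32614
    print(utm_epsg_for(-118.24, 34.05))  # Los Angeles, CA -> 32611
```

The returned code can be passed as the second argument of ST_Transform in place of 5070 when a single state is being analyzed.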

Configuration and execution parameters

The table below collects the decisions that shape this schema. Choose them once, up front; changing the partition key after data is loaded forces a full table rebuild.

Parameter	Value used here	Why
Storage CRS	EPSG:4326	Lossless lat/lon, the lingua franca for ingestion from GPS, POI feeds, and parcel layers.
Analytics CRS	EPSG:5070	Equal-area projection for distance/area math across the contiguous US.
Partition strategy	`LIST (state_code)`	Retail records are heavily state-skewed; LIST gives clean pruning per jurisdiction.
Partition key in PK	`(site_id, state_code)`	PostgreSQL requires the partition key to be part of any primary key on a partitioned table.
Index type	GiST per partition	Accelerates bounding-box filtering before exact geometry evaluation.
Index build mode	`CONCURRENTLY`	Avoids blocking writes during business hours.
Catch-all	`DEFAULT` partition	Absorbs unmapped territories without halting ingestion.
Bulk insert page size	5000	Balances round-trips against transaction memory in `execute_values`.

Foundational schema and SRID enforcement

Multi-state datasets aggregate coordinates from disparate sources, so distance calculations (ST_Distance) and spatial joins (ST_Intersects) silently return wrong results unless the reference system is pinned at the schema level. The non-partitioned form, shown here for reference only, makes the constraints explicit:

sql

CREATE TABLE retail_sites (
    site_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    state_code    CHAR(2) NOT NULL,
    store_name    VARCHAR(150),
    lease_status  VARCHAR(20) CHECK (lease_status IN ('active', 'pending', 'closed', 'under_review')),
    geom          GEOMETRY(Point, 4326) NOT NULL,
    geom_valid    BOOLEAN GENERATED ALWAYS AS (ST_IsValid(geom)) STORED,
    created_at    TIMESTAMPTZ DEFAULT NOW(),
    updated_at    TIMESTAMPTZ DEFAULT NOW(),
    CONSTRAINT enforce_srid       CHECK (ST_SRID(geom) = 4326),
    CONSTRAINT enforce_valid_geom CHECK (ST_IsValid(geom))
);

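The GEOMETRY(Point, 4326) type modifier already rejects other geometry types and SRIDs; enforce_srid restates that rule as a named constraint, and enforce_valid_geom rejects any geometry that fails ST_IsValid. Malformed rows therefore fail at insertion time instead of corrupting a trade-area model downstream. Because the CHECK guarantees validity, the generated geom_valid column is always true here and only becomes informative if the constraint is relaxed.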

Declarative partitioning by jurisdiction

Retail footprints are highly skewed: large states routinely generate 10–15× the records of smaller ones. LIST partitioning on state_code isolates I/O, enables partition pruning, and reduces VACUUM overhead during high-throughput loads.

sql

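-- Partitioned table (PostgreSQL 11+ for the primary key and DEFAULT partition).
-- This replaces the non-partitioned definition above; do not create both.
CREATE TABLE retail_sites (
    site_id      UUID          NOT NULL DEFAULT gen_random_uuid(),
    state_code   CHAR(2)       NOT NULL,
    store_name   VARCHAR(150),
    lease_status VARCHAR(20)   CHECK (lease_status IN ('active', 'pending', 'closed', 'under_review')),
    geom         GEOMETRY(Point, 4326) NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW(),
    updated_at   TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (site_id, state_code),   -- partition key must be part of PK
    -- CHECK constraints on the parent are inherited by every partition.
    CONSTRAINT enforce_srid       CHECK (ST_SRID(geom) = 4326),
    CONSTRAINT enforce_valid_geom CHECK (ST_IsValid(geom))
) PARTITION BY LIST (state_code);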

CREATE TABLE retail_sites_ca      PARTITION OF retail_sites FOR VALUES IN ('CA');
CREATE TABLE retail_sites_tx      PARTITION OF retail_sites FOR VALUES IN ('TX');
CREATE TABLE retail_sites_ny      PARTITION OF retail_sites FOR VALUES IN ('NY');
CREATE TABLE retail_sites_default PARTITION OF retail_sites DEFAULT;

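The composite primary key (site_id, state_code) is mandatory: a unique constraint on a partitioned table must include every partition key column, which also means uniqueness is enforced on the pair and not on site_id alone. Queries filtered by state_code trigger partition pruning, so the planner scans only the matching partition instead of every child table. The DEFAULT partition captures unmapped territories and pipeline failures without halting ingestion.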
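
The three partitions above are only a sample; a real portfolio needs one per state, or most rows end up in the DEFAULT partition. One way to avoid hand-writing the DDL is to generate it. The sketch below is an illustration under assumptions: the function name partition_ddl is hypothetical, and it follows the retail_sites_<code> and idx_<table>_geom naming used on this page.

```python
import re

# Two uppercase ASCII letters; fullmatch also rejects a trailing newline.
_STATE_RE = re.compile(r"[A-Z]{2}")


def partition_ddl(state_codes, parent="retail_sites"):
    """Build partition and GiST index DDL for each distinct state code."""
    statements = []
    for code in sorted(set(state_codes)):
        # Validate strictly: the code is interpolated into identifiers.
        if not _STATE_RE.fullmatch(code):
            raise ValueError(f"not a two-letter state code: {code!r}")
        child = f"{parent}_{code.lower()}"
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {child} "
            f"PARTITION OF {parent} FOR VALUES IN ('{code}');"
        )
        # A freshly created partition is empty, so a plain (non-concurrent)
        # index build finishes immediately and blocks nothing of note.
        statements.append(
            f"CREATE INDEX IF NOT EXISTS idx_{child}_geom "
            f"ON {child} USING GIST (geom);"
        )
    return statements


if __name__ == "__main__":
    for stmt in partition_ddl(["CA", "TX", "NY"]):
        print(stmt)
```

The output can be reviewed and run through psql or a migration tool. If the DEFAULT partition already holds rows for a state, PostgreSQL refuses to create that state's partition until those rows are moved out.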

Constraint-driven spatial indexing

Partitioning without spatial indexing yields negligible gains: each partition needs its own GiST index. Build them CONCURRENTLY so the build does not block writes during business hours. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so issue each statement in autocommit mode.

sql

-- Spatial index per partition
CREATE INDEX CONCURRENTLY idx_retail_sites_ca_geom
    ON retail_sites_ca USING GIST (geom);
CREATE INDEX CONCURRENTLY idx_retail_sites_tx_geom
    ON retail_sites_tx USING GIST (geom);
CREATE INDEX CONCURRENTLY idx_retail_sites_ny_geom
    ON retail_sites_ny USING GIST (geom);

-- Partial index for active-lease proximity queries
CREATE INDEX CONCURRENTLY idx_retail_sites_tx_active_geom
    ON retail_sites_tx USING GIST (geom)
    WHERE lease_status = 'active';

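GiST indexes accelerate bounding-box filtering before exact geometry evaluation, but only when the predicate references geom directly: wrapping the column in ST_Transform bypasses the index unless a matching expression index exists, so transform the search geometry instead. The partial index on lease_status = 'active' is smaller than a full GiST index and serves the dominant dashboard query pattern. Track index usage through pg_stat_user_indexes and index size through pg_relation_size, and rebuild bloated indexes during maintenance windows with REINDEX CONCURRENTLY (PostgreSQL 12+).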
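
A proximity query benefits from these indexes only if it filters on state_code, repeats the partial-index predicate, and leaves geom unwrapped. The sketch below shows one possible shape for such a query, built for psycopg2 named parameters. The function name active_sites_within is hypothetical, the radius is approximate (the buffer is a polygon drawn in EPSG:5070), and the state filter deliberately excludes sites across a state line.

```python
def active_sites_within(state_code: str, lon: float, lat: float, radius_m: float):
    """Return (sql, params) for an index-friendly radius search."""
    if radius_m <= 0:
        raise ValueError("radius_m must be positive")
    sql = """
        SELECT site_id, store_name
        FROM retail_sites
        WHERE state_code = %(state_code)s   -- lets the planner prune partitions
          AND lease_status = 'active'       -- matches the partial index predicate
          AND ST_Intersects(
                geom,                       -- bare column, so GiST can be used
                ST_Transform(
                    ST_Buffer(
                        ST_Transform(
                            ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326),
                            5070),
                        %(radius_m)s),
                    4326))
    """
    params = {
        "state_code": state_code,
        "lon": lon,
        "lat": lat,
        "radius_m": radius_m,
    }
    return sql, params


if __name__ == "__main__":
    query, args = active_sites_within("TX", -97.74, 30.27, 5000.0)
    print(query)
    print(args)
```

The pair can be passed straight to cursor.execute(query, args). Running the statement under EXPLAIN is the way to confirm that a given server actually picks the partial index.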

Annotated validation and ingest pipeline

Field-collected GPS points routinely contain duplicates, null geometries, or coordinates outside valid jurisdictional bounds. The pipeline below repairs topology, assigns state_code by spatial containment against a real boundary layer, and bulk-inserts only conforming rows. Note the explicit CRS assertion: nothing here operates on bare lat/lon without first declaring EPSG:4326.

python

import logging
import geopandas as gpd
import psycopg2
from psycopg2.extras import execute_values
from shapely.geometry import Point
from shapely.validation import make_valid

logging.basicConfig(level=logging.INFO)


def validate_and_ingest(
    gdf: gpd.GeoDataFrame,
    states: gpd.GeoDataFrame,   # boundary layer with a 'state_code' column
    conn_str: str,
) -> int:
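    # 1. Assert the CRS explicitly. A layer with no CRS is assumed to be
    #    WGS 84 here, so log it loudly instead of relabeling it silently.
    if gdf.crs is None:
        logging.warning("Inbound layer has no CRS; assuming EPSG:4326.")
        gdf = gdf.set_crs(epsg=4326)
    gdf = gdf.to_crs(epsg=4326)
    states = states.to_crs(epsg=4326)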

    # 2. Drop null geometries and repair any invalid topology.
    gdf = gdf.dropna(subset=["geometry"]).copy()
    gdf["geometry"] = gdf["geometry"].apply(make_valid)

    # 3. Keep only valid, non-empty points (make_valid can return other types).
    valid_mask = gdf["geometry"].apply(
        lambda g: isinstance(g, Point) and not g.is_empty and g.is_valid
    )
    gdf = gdf[valid_mask].copy()

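    # 4. Derive state_code by spatial containment, not a hard-coded bbox.
    #    Drop any inbound state_code so sjoin does not suffix the columns,
    #    and reset the index so duplicate matches can be detected by index.
    gdf = gdf.drop(columns=["state_code"], errors="ignore").reset_index(drop=True)
    gdf = gpd.sjoin(gdf, states[["state_code", "geometry"]],
                    how="left", predicate="within")
    # Overlapping boundary polygons can match a point twice; keep one row each.
    gdf = gdf[~gdf.index.duplicated(keep="first")]
    gdf["state_code"] = gdf["state_code"].fillna("ZZ")  # routes to DEFAULT partition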

    # 5. Bulk insert using EWKT geometry literals.
    insert_sql = """
        INSERT INTO retail_sites (site_id, state_code, store_name, lease_status, geom)
        VALUES %s
        ON CONFLICT (site_id, state_code) DO NOTHING
    """
    records = [
        (
            str(row["site_id"]),
            row["state_code"],
            row["store_name"],
            row.get("lease_status", "pending"),
            f"SRID=4326;POINT({row['geometry'].x} {row['geometry'].y})",
        )
        for _, row in gdf.iterrows()
    ]

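    # psycopg2's connection context manager commits or rolls back the
    # transaction but does not close the connection, so close it explicitly.
    conn = psycopg2.connect(conn_str)
    try:
        with conn:
            with conn.cursor() as cur:
                execute_values(cur, insert_sql, records, page_size=5000)
    finally:
        conn.close()
    # ON CONFLICT DO NOTHING may skip rows, so this counts submitted records.
    logging.info("Submitted %d validated records.", len(records))
    return len(records)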

Deriving state_code from a within predicate against the boundary layer means a point that falls outside every state polygon lands in the DEFAULT partition (here flagged ZZ) instead of being silently mislabeled. For the standalone containment logic and axis-swap heuristics this depends on, see Automating coordinate validation with Python and Shapely.

Failure modes and debugging

Symptom	Likely cause	Fix
`Geometry SRID (0) does not match column SRID (4326)`	Source layer loaded without a CRS, or EWKT written without `SRID=4326;`.	Call `set_crs`/`to_crs` first; always prefix literals with `SRID=4326;`.
`new row for relation ... violates check constraint "enforce_valid_geom"`	A geometry that fails `ST_IsValid` (for example a point with non-finite coordinates) slipped past repair.	Confirm `make_valid` ran and the point passed `is_valid`; quarantine the row. Null-island points (0, 0) are valid geometries and need a separate range check.
Rows pile up in `retail_sites_default`	Boundary layer in a different CRS than the points, so `sjoin` matches nothing, or the state has no dedicated partition yet.	Reproject both layers to EPSG:4326 before the join; check the `state_code` null rate; create a partition for every state in the portfolio.
`no partition of relation "retail_sites" found for row`	A `state_code` value has no matching partition and no `DEFAULT` exists.	Add the missing `FOR VALUES IN (...)` partition or a `DEFAULT` partition.
Slow proximity queries despite indexes	Query predicate doesn’t match a partial index, or planner skipped pruning.	Ensure the filter includes `state_code`; run `ANALYZE`; verify with `EXPLAIN`.
`ERROR: cannot create index on partitioned table ... concurrently`	Tried to index the parent concurrently.	Build the index on each child partition concurrently, as shown above.
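
Several of the fixes above amount to quarantining a row before it reaches the database. A minimal sketch of such a pre-insert check follows, assuming rows arrive as (site_id, lon, lat) tuples; the function name split_quarantine and the reason strings are hypothetical. It catches missing, non-finite, out-of-range, and null-island coordinates, none of which the SRID constraint would flag.

```python
import math


def split_quarantine(rows):
    """Split (site_id, lon, lat) tuples into (accepted, quarantined)."""
    accepted, quarantined = [], []
    for site_id, lon, lat in rows:
        if lon is None or lat is None:
            reason = "missing coordinate"
        elif not (math.isfinite(lon) and math.isfinite(lat)):
            reason = "non-finite coordinate"
        elif not (-180 <= lon <= 180 and -90 <= lat <= 90):
            # Often a sign of swapped lat/lon for US longitudes.
            reason = "out of WGS 84 range"
        elif lon == 0 and lat == 0:
            reason = "null island"
        else:
            accepted.append((site_id, lon, lat))
            continue
        # Keep the original values plus the reason for later review.
        quarantined.append((site_id, lon, lat, reason))
    return accepted, quarantined


if __name__ == "__main__":
    good, bad = split_quarantine([
        ("a", -97.74, 30.27),
        ("b", 0.0, 0.0),
        ("c", 30.27, -97.74),  # lat/lon swapped
    ])
    print(good)
    print(bad)
```

Quarantined rows carry a reason string so they can be written to a side table or log and reviewed, rather than being dropped silently.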

Verification

Confirm the load before exposing the table to analytics workers:

sql

-- 1. Row count and how many landed in DEFAULT (should be near zero).
SELECT tableoid::regclass AS partition, count(*)
FROM retail_sites
GROUP BY 1 ORDER BY 2 DESC;

-- 2. No invalid geometries survived the constraints.
SELECT count(*) FROM retail_sites WHERE NOT ST_IsValid(geom);  -- expect 0

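-- 3. Bounding-box sanity: every point inside the contiguous-US envelope.
--    Sites in Alaska, Hawaii, or the territories fall outside it by design.
SELECT count(*) FROM retail_sites
WHERE NOT ST_Within(geom, ST_MakeEnvelope(-125, 24, -66, 50, 4326));  -- expect 0 for a CONUS-only portfolio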

-- 4. Pruning works: this should scan only the CA partition.
EXPLAIN SELECT site_id FROM retail_sites WHERE state_code = 'CA';

-- 5. Refresh planner statistics after the bulk load.
ANALYZE retail_sites;

A clean run shows the bulk of rows distributed across the expected state partitions, a negligible DEFAULT count, zero invalid or out-of-envelope geometries, and an EXPLAIN plan touching a single partition. With those checks green, the table is ready to feed point-in-polygon catchment joins, drive-time intersections, and demographic overlays without query degradation.
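
These checks can be folded into a single pass/fail gate once the query results are in hand. The sketch below assumes the per-partition counts from query 1 have been collected into a dict keyed by partition name; the function names and the 1% ceiling on the DEFAULT share are hypothetical choices, not values prescribed by this page.

```python
def default_share(partition_counts, default_name="retail_sites_default"):
    """Fraction of all rows that landed in the DEFAULT partition."""
    total = sum(partition_counts.values())
    if total == 0:
        raise ValueError("no rows loaded")
    return partition_counts.get(default_name, 0) / total


def load_is_clean(partition_counts, invalid_count, out_of_envelope_count,
                  max_default_share=0.01):
    """True when all three verification counts are within tolerance."""
    return (
        invalid_count == 0
        and out_of_envelope_count == 0
        and default_share(partition_counts) <= max_default_share
    )


if __name__ == "__main__":
    counts = {
        "retail_sites_ca": 900,
        "retail_sites_tx": 95,
        "retail_sites_default": 5,
    }
    print(default_share(counts))
    print(load_is_clean(counts, invalid_count=0, out_of_envelope_count=0))
```

A portfolio that includes Alaska, Hawaii, or the territories would need a nonzero tolerance for the envelope count or a wider envelope.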

Related

Setting Up PostGIS for Retail Analytics: the extension setup, tuning, and operations this schema builds on.
Automating coordinate validation with Python and Shapely: the standalone validator behind the ingest pipeline.
Performing point-in-polygon joins for store catchments: the first downstream consumer of this partitioned table.
