Python script for normalizing demographic data across ZIP codes

This page solves one precise task in the demographic enrichment pipeline: turning raw US Census ACS variables, addressed by postal ZIP code, into a deterministic, comparable feature set scaled across disparate postal geographies — the normalization step that runs before any audience weighting or site scoring.

ZIP codes are administrative mail-routing constructs, not statistical boundaries, so merging them directly with ACS data produces persistent spatial misalignment: a single ZIP may straddle several Census ZIP Code Tabulation Areas (ZCTAs), and a ZCTA may absorb several ZIPs. A production-grade normalization script therefore has to resolve ZIP-to-ZCTA correspondence, survive API throttling, impute sparse rural cells, and rescale variables so a high-density urban core and a low-density rural route remain comparable. The output is the normalized feature matrix consumed by Weighting Demographic Variables for Target Audiences inside the broader demographic data integration workflow.

Prerequisites

Before running the script below, confirm the following inputs and packages are in place:

Python packages. pandas, numpy, geopandas, shapely, requests, scikit-learn (for SimpleImputer), and pyarrow (geopandas needs it to read and write the Parquet boundary cache). Pin them so monthly scheduled runs stay reproducible:

```bash
pip install pandas==2.2.3 numpy==2.1.3 geopandas==1.0.1 requests==2.32.3 shapely==2.0.6 scikit-learn==1.5.2 pyarrow==17.0.0
```

A Census API key, plus runtime configuration kept out of version control:

```env
CENSUS_API_KEY=your_census_api_key_here
CACHE_DIR=./data_cache
LOG_LEVEL=INFO
MAX_WORKERS=4
```

ZCTA boundary geometry. Download a ZCTA TIGER/Line shapefile from the U.S. Census Bureau and store it under ./data_cache/zcta_boundaries/. The run_pipeline method hard-codes the file name tl_2022_us_zcta520.shp, so update that path if you download a different vintage. The same boundary handling and equal-area CRS conventions used in How to Join ACS 5-Year Estimates to Custom Trade Area Polygons apply here; caching the parsed boundaries to Parquet cuts cold-start latency because later runs skip shapefile parsing.
A ZIP-to-ZCTA correspondence. The preferred production source is the HUD USPS ZIP-to-ZCTA crosswalk, a deterministic table join with no spatial computation. The fallback is a point-in-polygon match of ZIP centroids against ZCTA polygons, which is the spatial join discussed in Performing Point-in-Polygon Joins for Store Catchments.
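
As a rough illustration of the crosswalk route, the sketch below resolves ZIPs with a plain table join. The function name resolve_zips, the column names zip_code and zcta, and the sample rows are all assumptions made for this example; a real crosswalk file has its own headers, so map them accordingly.

```python
import pandas as pd


def resolve_zips(zip_codes, crosswalk):
    """Left-join input ZIPs onto a ZIP-to-ZCTA crosswalk table.

    `crosswalk` is assumed to hold string columns `zip_code` and `zcta`;
    both names are hypothetical and should be mapped to the real file's headers.
    """
    # Keep ZIPs as zero-padded strings so leading zeros survive the join.
    zips = pd.DataFrame({"zip_code": [str(z).zfill(5) for z in zip_codes]})
    # validate="many_to_one" raises if the crosswalk lists a ZIP more than once,
    # surfacing ambiguous mappings instead of silently duplicating rows.
    resolved = zips.merge(crosswalk, on="zip_code", how="left", validate="many_to_one")
    unresolved = resolved.loc[resolved["zcta"].isna(), "zip_code"].tolist()
    return resolved, unresolved


# Illustrative rows only, not real crosswalk data.
sample_crosswalk = pd.DataFrame(
    {"zip_code": ["00001", "00002", "00003"], "zcta": ["00001", "00001", "00003"]}
)
resolved, unresolved = resolve_zips(["00001", "00002", "99999"], sample_crosswalk)
print(resolved)
print("Unresolved ZIPs:", unresolved)
```

Returning the unresolved ZIPs alongside the mapping makes it harder for unmatched inputs to vanish later in the pipeline.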

Configuration and execution parameters

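The pipeline is driven by a small set of parameters. Tuning these is the difference between a clean run and silent data corruption, so keep them in one parameter table rather than scattered through the code. Note that the listing below still writes chunk_size, the retry settings, and the imputation strategy as literals; lift them into configuration before scheduling production runs.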

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| `acs_variables` | `list[str]` | (required) | ACS variable codes, e.g. `B01003_001E` (total population), `B19013_001E` (median household income), `B15001_001E` (total of the educational attainment table, i.e. population 18 and over). |
| `population_col` | `str` | `B01003_001E` | Variable used as the population weight; excluded from min-max scaling. |
| `chunk_size` | `int` | `50` | ZCTAs per Census API request; keep at or below ~50 to respect URL-length limits. |
| `retry.total` | `int` | `5` | HTTP retry attempts on `429/5xx` responses. |
| `retry.backoff_factor` | `float` | `2` | Exponential backoff multiplier between retries. |
| `imputer.strategy` | `str` | `median` | Imputation for sparse rural cells; median resists outlier skew better than mean. |
| `target_crs` | EPSG | `4326` for joins; `5070` for areal | Use `EPSG:5070` (Conus Albers) for any area-proportional interpolation so areas are computed in square metres, not degrees. |
| `audience_weights` | `dict[str, float]` | `None` | Optional per-variable multipliers applied after normalization for format-specific scaling. |
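
One way to keep these parameters out of the code body is a small frozen dataclass that mirrors the table. The sketch below is hypothetical: PipelineConfig and its validate method are not part of the script on this page, and the field names simply follow the table, with the dotted names flattened to underscores.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container mirroring the parameter table; not part of the script."""

    acs_variables: List[str]
    population_col: str = "B01003_001E"
    chunk_size: int = 50
    retry_total: int = 5
    retry_backoff_factor: float = 2.0
    imputer_strategy: str = "median"
    join_crs: int = 4326   # EPSG code used for point-in-polygon joins
    areal_crs: int = 5070  # EPSG code used for area-proportional interpolation
    audience_weights: Optional[Dict[str, float]] = None

    def validate(self) -> None:
        """Reject settings that would fail or misbehave later in the run."""
        if not self.acs_variables:
            raise ValueError("acs_variables must not be empty")
        if self.population_col not in self.acs_variables:
            raise ValueError("population_col must be one of acs_variables")
        if not 1 <= self.chunk_size <= 50:
            raise ValueError("chunk_size must be between 1 and 50")
        unknown = set(self.audience_weights or {}) - set(self.acs_variables)
        if unknown:
            raise ValueError(f"audience_weights names unknown variables: {sorted(unknown)}")


config = PipelineConfig(
    acs_variables=["B01003_001E", "B19013_001E"],
    audience_weights={"B19013_001E": 1.25},
)
config.validate()
print(config.chunk_size, config.imputer_strategy)
```

Validating up front turns a misspelled variable code or an oversized chunk into an immediate error rather than a failed API call partway through a scheduled run.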

Annotated implementation

The script is a single resilient class. It configures an HTTP session with retry and backoff, resolves ZIPs to ZCTAs, fetches ACS variables in chunks, imputes nulls, then applies population-weighted min-max normalization with optional audience multipliers. Inline comments mark the load-bearing decisions.

python

import os
import logging
import pandas as pd
import numpy as np
import geopandas as gpd
from pathlib import Path
from typing import Optional, Dict, List
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from sklearn.impute import SimpleImputer
from shapely.geometry import Point

logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger(__name__)


class CensusAPIError(Exception):
    pass


class SpatialAlignmentError(Exception):
    pass


class NormalizationError(Exception):
    pass


class DemographicNormalizer:
    def __init__(self, census_api_key: str, cache_dir: str = "./data_cache"):
        self.api_key = census_api_key
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.session = self._configure_session()
        # ACS 5-year 2023 release; update year annually
        self.base_url = "https://api.census.gov/data/2023/acs/acs5"
        self.zcta_gdf: Optional[gpd.GeoDataFrame] = None

    def _configure_session(self) -> requests.Session:
        # Retry on throttling (429) and transient server errors with exponential backoff.
        session = requests.Session()
        retry_strategy = Retry(
            total=5,
            backoff_factor=2,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET"]
        )
        session.mount("https://", HTTPAdapter(max_retries=retry_strategy))
        return session

    def fetch_acs_for_zctas(self, variables: List[str], zcta_list: List[str]) -> pd.DataFrame:
        """
        Fetch ACS variables for a list of ZCTAs.
        The Census API supports querying specific ZCTAs via 'for=zip code tabulation area:XXXXX,...'
        Keep batches under ~50 ZCTAs to stay within URL length limits.
        """
        if not zcta_list:
            raise ValueError("Empty ZCTA list provided.")

        chunk_size = 50
        all_data = []

        for i in range(0, len(zcta_list), chunk_size):
            chunk = zcta_list[i:i + chunk_size]
            params = {
                "get": ",".join(variables),
                "for": f"zip code tabulation area:{','.join(chunk)}",
                "key": self.api_key,
            }
            try:
                resp = self.session.get(self.base_url, params=params, timeout=30)
                resp.raise_for_status()
                data = resp.json()
                if len(data) <= 1:  # header row only -> no records
                    logger.warning("No data returned for ZCTA chunk starting at index %d.", i)
                    continue
                df = pd.DataFrame(data[1:], columns=data[0])
                all_data.append(df)
            except requests.exceptions.RequestException as e:
                logger.error("API request failed for chunk %d: %s", i, e)
                raise CensusAPIError(f"Failed to fetch ACS data: {e}")

        if not all_data:
            return pd.DataFrame()
        return pd.concat(all_data, ignore_index=True)

    def load_zcta_boundaries(self, shapefile_path: str) -> gpd.GeoDataFrame:
        # Cache the parsed shapefile to Parquet; reading the cache avoids re-parsing on every run.
        cache_path = self.cache_dir / "zcta_index.parquet"
        if cache_path.exists():
            logger.info("Loading cached ZCTA boundaries.")
            return gpd.read_parquet(cache_path)

        logger.info("Loading ZCTA shapefile from %s", shapefile_path)
        gdf = gpd.read_file(shapefile_path)
        gdf = gdf.to_crs(epsg=4326)  # reproject to EPSG:4326 so polygons match the ZIP points' CRS
        gdf.to_parquet(cache_path)
        return gdf

    def zip_to_zcta(self, zip_codes: List[str], zcta_gdf: gpd.GeoDataFrame) -> pd.DataFrame:
        """
        Map postal ZIPs to ZCTAs.
        The preferred production approach is the HUD USPS ZIP-to-ZCTA crosswalk, a
        deterministic table join without spatial computation. The spatial approach
        below is a fallback when centroids for input ZIPs are available.
        """
        # In production: load a ZIP centroid file and do a point-in-polygon join.
        # Here we demonstrate the structure; replace dummy Point(0,0) with real centroids.
        zip_points = gpd.GeoDataFrame(
            {"zip_code": zip_codes},
            geometry=[Point(0, 0) for _ in zip_codes],  # replace with real ZIP centroids
            crs="EPSG:4326"
        )
        logger.warning(
            "Using placeholder ZIP centroids. Replace with real centroid data or "
            "HUD USPS crosswalk for production use."
        )
        try:
            joined = gpd.sjoin(zip_points, zcta_gdf, how="left", predicate="intersects")
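            # TIGER/Line names the ZCTA ID column by vintage: ZCTA5CE20 for 2020
            # boundaries, ZCTA5CE10 for 2010. Fail loudly rather than guess a column.
            zcta_col = next(
                (c for c in ("ZCTA5CE20", "ZCTA5CE10") if c in joined.columns), None
            )
            if zcta_col is None:
                raise KeyError("No ZCTA identifier column found in the boundary file.")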
            return joined[["zip_code", zcta_col]].drop_duplicates().rename(
                columns={zcta_col: "zcta"}
            )
        except Exception as e:
            raise SpatialAlignmentError(f"Spatial join failed: {e}")

    def impute_nulls(self, df: pd.DataFrame, numeric_cols: List[str]) -> pd.DataFrame:
        """Applies median imputation for sparse rural geographies."""
        imputer = SimpleImputer(strategy="median")
        df = df.copy()
        df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
        return df

    def normalize_and_weight(
        self,
        df: pd.DataFrame,
        numeric_cols: List[str],
        population_col: str,
        audience_weights: Optional[Dict[str, float]] = None
    ) -> pd.DataFrame:
        """Applies population-weighted min-max normalization and custom audience scaling."""
        try:
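            # The API returns strings, so coerce the weight column before using it.
            weights = pd.to_numeric(df[population_col], errors="coerce").to_numpy(dtype=float)
            # Floor missing or non-positive populations at a tiny weight so the weighted
            # score stays finite (there is no division here, only multiplication).
            weights = np.where(np.isnan(weights) | (weights <= 0), 1e-6, weights)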

            norm_df = df.copy()
            for col in numeric_cols:
                col_min = norm_df[col].min()
                col_max = norm_df[col].max()
                denom = col_max - col_min
                if denom == 0:  # constant column -> no spread to scale
                    norm_df[col] = 0.0
                else:
                    norm_df[col] = (norm_df[col] - col_min) / denom

                # Population-weighted score
                norm_df[f"{col}_weighted"] = norm_df[col] * weights

            if audience_weights:
                for col, weight in audience_weights.items():
                    if col in norm_df.columns:
                        norm_df[col] *= weight
                        logger.info("Applied audience multiplier %.2f to %s", weight, col)

            return norm_df
        except Exception as e:
            raise NormalizationError(f"Normalization failed: {e}")

    def run_pipeline(
        self,
        target_zips: List[str],
        acs_variables: List[str],
        population_col: str = "B01003_001E",
        audience_weights: Optional[Dict[str, float]] = None
    ) -> pd.DataFrame:
        logger.info("Starting demographic normalization pipeline.")
        zcta_gdf = self.load_zcta_boundaries(
            str(self.cache_dir / "zcta_boundaries/tl_2022_us_zcta520.shp")
        )

        # 1. Map ZIPs to ZCTAs
        zip_zcta_map = self.zip_to_zcta(target_zips, zcta_gdf)
        target_zctas = zip_zcta_map["zcta"].dropna().unique().tolist()

        # 2. Fetch raw ACS data for the resolved ZCTAs
        raw_df = self.fetch_acs_for_zctas(acs_variables, target_zctas)
        if raw_df.empty:
            logger.critical("Pipeline halted: No ACS data retrieved.")
            return pd.DataFrame()

        # 3. Join ZIP -> ZCTA mapping back to ACS data
        zcta_col = "zip code tabulation area"
        merged = raw_df.merge(zip_zcta_map, left_on=zcta_col, right_on="zcta", how="inner")

        # 4. Imputation & normalization
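        numeric_cols = [c for c in acs_variables if c != population_col]
        # The API returns every value as a string, so coerce the population weight too.
        value_cols = numeric_cols + [population_col]
        merged[value_cols] = merged[value_cols].apply(pd.to_numeric, errors="coerce")
        # Annotation sentinels such as -666666666 parse as valid numbers; treat anything
        # at or below -222222222 as missing (assumed safe for counts and medians).
        merged[value_cols] = merged[value_cols].where(merged[value_cols] > -222222222)
        merged = self.impute_nulls(merged, numeric_cols)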
        final_df = self.normalize_and_weight(merged, numeric_cols, population_col, audience_weights)

        logger.info("Pipeline complete. Processed %d geographies.", len(final_df))
        return final_df


if __name__ == "__main__":
    normalizer = DemographicNormalizer(
        census_api_key=os.getenv("CENSUS_API_KEY", ""),
        cache_dir=os.getenv("CACHE_DIR", "./data_cache")
    )

    output = normalizer.run_pipeline(
        target_zips=["90210", "10001", "60601"],
        acs_variables=["B01003_001E", "B19013_001E", "B15001_001E"],
        population_col="B01003_001E",
        audience_weights={"B19013_001E": 1.25}
    )
    print(output.head())

Spatial alignment and boundary interpolation

ZIP codes do not align with ZCTAs, so the resolution step is where most silent errors originate. For enterprise deployments, the HUD USPS ZIP-to-ZCTA crosswalk is a deterministic table join that avoids spatial computation entirely. When crosswalk data is unavailable, fall back to areal interpolation: compute the overlapping polygon area between ZIP delivery routes and ZCTA boundaries, then prorate ACS counts by the intersection ratio. Both layers must share the same projected CRS (use EPSG:5070 for area-preserving calculations); in a geographic CRS such as EPSG:4326 the areas come out in square degrees, which vary with latitude, so the proration ratio is unreliable. Always validate join cardinality; unexpected one-to-many matches indicate topology errors in the source shapefiles, the same class of defect addressed in Fixing Sliver Polygons in Spatial Join Operations.
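
The proration step can be sketched with shapely alone. Everything here is illustrative: prorate_counts is a hypothetical helper, the rectangles stand in for real boundaries, and the code assumes the geometries are already in one equal-area projected CRS and that population is spread uniformly inside each ZCTA.

```python
from shapely.geometry import box


def prorate_counts(zip_polygon, zcta_polygons, zcta_counts):
    """Estimate a ZIP-level count by area-weighting the ZCTAs it overlaps.

    Assumes every geometry is already in one equal-area projected CRS
    (for example EPSG:5070), so `.area` is in square metres, and that
    population is uniformly distributed within each ZCTA.
    """
    estimate = 0.0
    for zcta_id, polygon in zcta_polygons.items():
        overlap = zip_polygon.intersection(polygon).area
        if overlap == 0:
            continue
        # Share of the ZCTA's area that falls inside the ZIP polygon.
        ratio = overlap / polygon.area
        estimate += zcta_counts[zcta_id] * ratio
    return estimate


# Toy rectangles in projected metres; not real boundaries.
zcta_polygons = {"A": box(0, 0, 1000, 1000), "B": box(1000, 0, 2000, 1000)}
zcta_counts = {"A": 4000, "B": 1000}
zip_polygon = box(500, 0, 1500, 1000)  # covers half of A and half of B

print(prorate_counts(zip_polygon, zcta_polygons, zcta_counts))  # 2500.0
```

Proration of this kind suits additive counts such as total population. A median such as B19013_001E cannot be split by area this way and needs a different aggregation rule.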

Statistical scaling and audience weighting

Raw ACS counts are incomparable across geographies because of population variance. The normalize_and_weight method applies population-weighted min-max scaling:

$$
\text{score}_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \times \text{population}_i
$$

This preserves relative demographic intensity while penalizing low-population noise. Format-specific scaling then layers on top: a premium grocery format might apply a 1.35× multiplier to median household income and educational attainment, while a discount retailer prioritizes population density and vehicle ownership. The multiplier schema itself — including the sum-to-one constraint and sensitivity validation — is defined in Weighting Demographic Variables for Target Audiences.
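
A small worked example may make the formula's behaviour easier to see. The helper name weighted_min_max and the three rows of numbers are invented for illustration and are not ACS data.

```python
def weighted_min_max(values, populations):
    """Min-max scale `values` to [0, 1], then multiply by each row's population."""
    lo, hi = min(values), max(values)
    spread = hi - lo
    # A constant column has no spread, so every scaled value is 0.0.
    scaled = [0.0 if spread == 0 else (v - lo) / spread for v in values]
    return [s * p for s, p in zip(scaled, populations)]


# Illustrative numbers only, not ACS data.
median_income = [40_000, 70_000, 100_000]
population = [1_000, 20_000, 5_000]

print(weighted_min_max(median_income, population))
# [0.0, 10000.0, 5000.0]
```

The middle geography outranks the highest-income one because its population is four times larger, and the lowest-income geography scores zero whatever its population. Both are properties of this formula worth keeping in mind when reading the _weighted columns.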

Failure modes and debugging

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `CensusAPIError`, immediately or after several retries | Malformed variable code (a `400`, which is not retried) or sustained `429` throttling | Confirm each ACS code exists in the 2023 ACS5 release; lower `chunk_size`; `backoff_factor=2` already spaces retries, so raise `retry.total` only if throttling is brief. |
| Empty result DataFrame | All ZIPs resolved to ZCTAs with no ACS coverage, or wrong vintage year | Look for the per-chunk "No data returned" warnings in the log; verify the `base_url` year matches a released ACS vintage. |
| `ValueError: Empty ZCTA list provided.`, or every ZIP maps to the wrong ZCTA | Placeholder `Point(0, 0)` centroids still in use; that point lies outside every ZCTA, so no ZIP resolves | Replace with real ZIP centroids or the HUD crosswalk; never ship the demonstration centroids. |
| `sjoin` returns zero matches | CRS mismatch between ZIP points and ZCTA polygons | Confirm both layers are `EPSG:4326` before the intersect; the loader calls `to_crs(epsg=4326)` for exactly this reason. |
| `imputation_rate` above 0.35 | Sparse rural ZCTAs with suppressed ACS cells | Drop to tract- or block-group-level fallback, covered in Imputing Missing Census Block Group Data. |
| All normalized values are `0.0` for a column | Constant column (`max == min`) | Expected behaviour: the guard sets the column to `0.0`; drop the variable if it carries no signal. |
| `NaN` or heavily skewed scores after normalization | Non-numeric ACS strings, or numeric annotation sentinels (e.g. `-666666666`) | `pd.to_numeric(errors="coerce")` converts non-numeric strings to `NaN`, but sentinels parse as valid numbers; replace them with `NaN` explicitly, and run imputation after coercion, as in `run_pipeline`. |
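
The annotation-sentinel case is easy to misread, so here is a minimal demonstration of what coercion does and does not catch. The sample values are invented, and the threshold of -222222222 is an assumption for this sketch: it treats any value at or below that level as a sentinel rather than a real estimate.

```python
import pandas as pd

raw = pd.Series(["52000", "-666666666", "N/A", "81000"])

coerced = pd.to_numeric(raw, errors="coerce")
# "N/A" becomes NaN, but the sentinel survives as a very negative number.
print(coerced.tolist())

# Assumption for this sketch: values at or below -222222222 are annotation
# sentinels, so mask them to NaN before imputing or scaling.
cleaned = coerced.where(coerced > -222222222)
print(cleaned.tolist())
```

Left unmasked, a single sentinel becomes the column minimum and squeezes every genuine value toward 1.0 after min-max scaling.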

Verification

Confirm correct output before handing the matrix to the weighting stage:

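Row-count parity. The output row count should equal the number of input ZIPs that resolved to a ZCTA with ACS coverage. The inner merge drops unmatched ZIPs without logging them, so compare the output's zip_code column against the input list and report the difference.
Range bounds. Every base normalized column sits in [0, 1] before audience multipliers are applied. DataFrame has no between method (it is defined on Series), so on a run with audience_weights=None check that final_df[numeric_cols].apply(lambda s: s.between(0, 1)).all().all() is True.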
Weighted-column presence. Each scaled variable has a matching *_weighted companion column.
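Imputation budget. Track imputation_rate (share of cells filled) and alert above 0.35; persistent breaches signal you should drop to a finer census geography. The script does not compute this metric, so measure the share of NaN cells in numeric_cols just before impute_nulls runs.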
Spot-check a known ZIP. Pick a familiar high-income ZIP (e.g. 90210) and confirm its income score ranks near the top of the batch — a fast sanity check that scaling polarity is correct.
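
The first four checks can be collected into a pair of helpers along these lines. This is a sketch under stated assumptions: imputation_rate and verify_output are hypothetical names that do not exist in the script, the toy frames stand in for real pipeline output, and the range check assumes the run used audience_weights=None.

```python
import pandas as pd


def imputation_rate(before: pd.DataFrame, cols) -> float:
    """Share of cells in `cols` that are NaN before imputation (and so get filled)."""
    return float(before[cols].isna().to_numpy().mean())


def verify_output(final_df: pd.DataFrame, numeric_cols, expected_rows: int) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(final_df) != expected_rows:
        problems.append(f"row count {len(final_df)} != expected {expected_rows}")
    for col in numeric_cols:
        # Series.between is False for NaN, so missing scores are flagged too.
        if not final_df[col].between(0, 1).all():
            problems.append(f"{col} has values outside [0, 1]")
        if f"{col}_weighted" not in final_df.columns:
            problems.append(f"{col}_weighted is missing")
    return problems


# Toy frames standing in for the pre-imputation and final pipeline outputs.
before = pd.DataFrame({"B19013_001E": [50_000.0, None, 90_000.0, 70_000.0]})
final = pd.DataFrame(
    {"B19013_001E": [0.0, 0.5, 1.0, 0.5], "B19013_001E_weighted": [0.0, 10.0, 30.0, 5.0]}
)

rate = imputation_rate(before, ["B19013_001E"])
print(f"imputation_rate={rate:.2f}", "ALERT" if rate > 0.35 else "ok")
print(verify_output(final, ["B19013_001E"], expected_rows=4))
```

Returning a list of problems rather than raising on the first failure lets a scheduled run log every issue in one pass.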

When these checks pass, the normalized feature matrix is safe to feed downstream into audience weighting and site scoring.

Related

Weighting Demographic Variables for Target Audiences — the weighting stage that consumes this normalized matrix.
How to Join ACS 5-Year Estimates to Custom Trade Area Polygons — the polygon-based alternative to ZIP-keyed enrichment.
Imputing Missing Census Block Group Data — finer-grained fallback when ZCTA cells are sparse.
