Best Practices for Securing PII in Customer Location Datasets

This page shows exactly how to tokenize personally identifiable information (PII) out of customer mobility traces and land the sanitized, full-precision geometry in an encrypted, retention-locked S3 prefix — so trade area modeling keeps its analytical value without ever persisting plaintext identifiers.

In retail site selection, mobility traces, loyalty check-ins, and demographic overlays are the raw material for catchment modeling. Those datasets routinely carry PII — device advertising IDs, hashed emails, customer keys — that must be cryptographically isolated before it reaches the storage layer described in Configuring AWS S3 for Geospatial Data Lakes. The task this page solves is narrow and concrete: enforce a hardened bucket boundary, then run a Python ingestion step that tokenizes identifiers and writes GeoParquet with mandatory SSE-KMS headers, all while keeping coordinate precision intact for downstream spatial joins.

Prerequisites

Before running the ingestion step below, you need the following in place:

Python packages: boto3, pandas, geopandas, shapely, and pyarrow (for the GeoParquet writer). Install with pip install boto3 pandas geopandas shapely pyarrow.
An AWS KMS customer-managed key in the same region as the target bucket, with kms:GenerateDataKey and kms:Decrypt granted to the pipeline execution role.
Environment variables: AWS_KMS_KEY_ID (the key ARN) and PII_TOKENIZATION_SECRET (a high-entropy secret used as the HMAC key — store it in AWS Secrets Manager, never in source).
Raw input as a CSV (or Parquet) of mobility traces containing at minimum latitude, longitude, and one or more identifier columns (customer_id, email, device_advertising_id).
A target CRS decision. This page keeps source data in EPSG:4326 (WGS 84 lon/lat) so that precision is preserved end to end; reprojection happens later in the analytical layer, not at ingestion. The same CRS discipline applies to all coordinate handling; see automating coordinate validation with Python and Shapely.
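
A preflight check along the following lines can catch a missing or malformed setting before any data is read. It is a minimal sketch: the function name check_prerequisites, the 32-byte floor for the secret, and the ARN shape test are illustrative assumptions, not requirements stated by AWS or by this pipeline.

```python
import os


def check_prerequisites(env=None, min_secret_bytes=32):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper: the 32-byte floor mirrors the 256-bit secret
    suggested on this page and is an assumption, not an AWS rule.
    """
    env = os.environ if env is None else env
    problems = []

    key_id = env.get("AWS_KMS_KEY_ID", "")
    if not key_id:
        problems.append("AWS_KMS_KEY_ID is not set")
    elif not (key_id.startswith("arn:") and ":kms:" in key_id):
        # The bucket policy on this page compares against a full key ARN,
        # so a bare key ID or alias could fail to match its condition.
        problems.append("AWS_KMS_KEY_ID should be a full key ARN")

    secret = env.get("PII_TOKENIZATION_SECRET", "")
    if not secret:
        problems.append("PII_TOKENIZATION_SECRET is not set")
    elif len(secret.encode("utf-8")) < min_secret_bytes:
        problems.append("PII_TOKENIZATION_SECRET is shorter than expected")

    return problems
```

Calling check_prerequisites() with no arguments inspects the real environment; passing a plain dict makes the check easy to unit test.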

Configuration and execution parameters

The table below lists the settings that govern this task. Defaults are tuned for retail mobility data subject to GDPR and CCPA retention obligations.

Parameter	Where set	Value / type	Notes
`s3:x-amz-server-side-encryption`	Bucket policy condition	`aws:kms` (string)	Denies any upload that is not SSE-KMS encrypted.
KMS key ARN	Bucket policy + `AWS_KMS_KEY_ID`	`arn:aws:kms:...:key/...`	A single customer-managed key per dataset boundary.
Object Lock mode	`put-object-lock-configuration`	`GOVERNANCE`	Allows override by principals holding `s3:BypassGovernanceRetention`; use `COMPLIANCE` when no principal, including root, may shorten or remove retention.
Retention `Days`	Object Lock rule	`365` (int)	Match your shortest lawful retention window.
`PII_TOKENIZATION_SECRET`	Env / Secrets Manager	256-bit secret	HMAC key; rotating it re-pseudonymizes the namespace.
`TARGET_EPSG`	Pipeline constant	`4326` (int)	Source precision preserved; reproject downstream only.
`pii_columns`	Pipeline constant	list of column names	Tokenized before any spatial operation runs.
Latitude bounds	Validation predicate	`[-90.0, 90.0]`	Out-of-range rows dropped.
Longitude bounds	Validation predicate	`[-180.0, 180.0]`	Out-of-range rows dropped.
Null Island guard	Validation predicate	reject `(0.0, 0.0)`	Removes default-coordinate noise that distorts catchments.

Step 1 — Enforce encryption and retention on the bucket

The bucket is the cryptographic boundary. Configure it to require AWS KMS-managed keys (SSE-KMS) and enable Object Lock so retention windows survive accidental or malicious deletion. The following policy denies any PutObject that is not SSE-KMS encrypted with the designated key:

json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceSSEKMS",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::retail-geospatial-pii-lake/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "RestrictKMSKeyUsage",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::retail-geospatial-pii-lake/*",
      "Condition": {
        "StringNotLike": {
          "s3:x-amz-server-side-encryption-aws-kms-key-id": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"
        }
      }
    }
  ]
}
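
If the bucket name and key ARN vary per environment, the same policy can be generated rather than hand-edited. The sketch below is one way to do that; build_encryption_policy and apply_encryption_policy are hypothetical helper names, and the boto3 call assumes the caller is allowed s3:PutBucketPolicy.

```python
import json


def build_encryption_policy(bucket: str, kms_key_arn: str) -> dict:
    """Build the two-statement deny policy shown above for one bucket and key."""
    resource = f"arn:aws:s3:::{bucket}/*"

    def deny(sid: str, operator: str, condition_key: str, value: str) -> dict:
        # Every statement denies PutObject when the condition is NOT satisfied.
        return {
            "Sid": sid,
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": resource,
            "Condition": {operator: {condition_key: value}},
        }

    return {
        "Version": "2012-10-17",
        "Statement": [
            deny("EnforceSSEKMS", "StringNotEquals",
                 "s3:x-amz-server-side-encryption", "aws:kms"),
            deny("RestrictKMSKeyUsage", "StringNotLike",
                 "s3:x-amz-server-side-encryption-aws-kms-key-id", kms_key_arn),
        ],
    }


def apply_encryption_policy(bucket: str, kms_key_arn: str) -> None:
    """Attach the generated policy to the bucket with boto3."""
    import boto3  # imported lazily so the builder can be tested without AWS access

    policy = build_encryption_policy(bucket, kms_key_arn)
    boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```

Keeping the builder separate from the API call means the generated document can be diffed against the reviewed JSON before anything is applied.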

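Object Lock is simplest to enable at bucket creation, which also turns on the versioning it depends on. S3 additionally allows enabling it on an existing bucket, provided versioning is enabled first. Create the bucket and set a GOVERNANCE-mode default retention window: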

bash

aws s3api create-bucket \
  --bucket retail-geospatial-pii-lake \
  --region us-east-1 \
  --object-lock-enabled-for-bucket

aws s3api put-object-lock-configuration \
  --bucket retail-geospatial-pii-lake \
  --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"GOVERNANCE","Days":365}}}'

For the partitioning scheme and lifecycle policies that complement this baseline, the parent Configuring AWS S3 for Geospatial Data Lakes covers ingestion layout for high-frequency spatial telemetry.

Step 2 — Tokenize and ingest

Raw identifiers must be replaced with deterministic tokens before any object is written. The pipeline below reads CSV traces, tokenizes PII with HMAC-SHA256, validates coordinates while asserting the CRS, and uploads GeoParquet with explicit SSE-KMS headers. Deterministic tokenization matters: the same customer_id always maps to the same token, so visit-frequency and repeat-customer analysis still works on the pseudonymized data. The HMAC output cannot be inverted, and without the secret an attacker cannot recompute tokens from guessed identifiers. Anyone holding the secret can, which is why it belongs in Secrets Manager.

python

import io
import os
import logging
import hashlib
import hmac
import pandas as pd
import geopandas as gpd
import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger(__name__)

KMS_KEY_ID = os.environ["AWS_KMS_KEY_ID"]
_HMAC_SECRET = os.environ["PII_TOKENIZATION_SECRET"].encode("utf-8")
BUCKET_NAME = "retail-geospatial-pii-lake"
TARGET_EPSG = 4326  # WGS 84 lon/lat — full source precision preserved


def tokenize_pii(value: str) -> str:
    """Deterministic, non-reversible HMAC-SHA256 token for a PII field."""
    return hmac.new(_HMAC_SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()


def validate_and_transform(df: pd.DataFrame) -> gpd.GeoDataFrame:
    """Drop out-of-range / Null Island rows and build CRS-tagged geometry."""
    valid_mask = (
        df["latitude"].between(-90.0, 90.0) &
        df["longitude"].between(-180.0, 180.0) &
        ~((df["latitude"] == 0.0) & (df["longitude"] == 0.0))  # reject Null Island
    )
    df_clean = df[valid_mask].copy()
    df_clean["geometry"] = gpd.points_from_xy(df_clean["longitude"], df_clean["latitude"])
    # Explicit CRS assertion — never operate on bare lat/lon without one.
    gdf = gpd.GeoDataFrame(df_clean, geometry="geometry", crs=f"EPSG:{TARGET_EPSG}")
    assert gdf.crs is not None and gdf.crs.to_epsg() == TARGET_EPSG
    logger.info("Retained %d/%d valid spatial records.", len(gdf), len(df))
    return gdf


def upload_to_encrypted_s3(gdf: gpd.GeoDataFrame, object_key: str) -> bool:
    """Write GeoParquet to S3 with mandatory KMS encryption headers."""
    s3_client = boto3.client("s3")
    try:
        buffer = io.BytesIO()
        gdf.to_parquet(buffer, index=False)
        s3_client.put_object(
            Bucket=BUCKET_NAME,
            Key=object_key,
            Body=buffer.getvalue(),
            ContentType="application/octet-stream",
            ServerSideEncryption="aws:kms",  # satisfies the deny-by-default policy
            SSEKMSKeyId=KMS_KEY_ID,
        )
        logger.info("Uploaded %s with SSE-KMS.", object_key)
        return True
    except ClientError as e:
        logger.error("S3 upload failed: %s", e.response["Error"]["Message"])
        return False


def process_mobility_batch(input_path: str, output_key: str) -> None:
    """End-to-end ingestion for a batch of retail mobility traces."""
    logger.info("Processing batch: %s", input_path)
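    # Read identifier columns as strings: a numeric ID parsed as 123 in one batch
    # and 123.0 in another would otherwise hash to different tokens.
    pii_columns = ["customer_id", "email", "device_advertising_id"]
    df = pd.read_csv(input_path, dtype={col: str for col in pii_columns})

    # Tokenize PII BEFORE any spatial operation so plaintext never persists.
    for col in pii_columns:
        if col in df.columns:
            df[col] = df[col].apply(lambda x: tokenize_pii(str(x)) if pd.notna(x) else x)

    gdf = validate_and_transform(df)
    # Surface upload failures instead of silently discarding the return value.
    if not upload_to_encrypted_s3(gdf, output_key):
        raise RuntimeError(f"Upload failed for {output_key}; batch was not persisted.")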


if __name__ == "__main__":
    process_mobility_batch(
        input_path="data/raw/mobility_trace_202410.csv",
        output_key="processed/2024/10/mobility_trace_sanitized.parquet",
    )

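The ordering is the security property: tokenization runs before geometry is built or any buffer is serialized, so nothing bound for S3 can contain plaintext identifiers, even if the batch fails partway. The raw input file itself still holds plaintext and needs its own access controls and deletion schedule.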
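
The determinism property this step relies on can be checked in isolation. The sketch below uses a standalone tokenize function with the same HMAC-SHA256 scheme as tokenize_pii, and two throwaway demo secrets that are assumptions for illustration only; they stand in for PII_TOKENIZATION_SECRET.

```python
import hashlib
import hmac


def tokenize(value: str, secret: bytes) -> str:
    """HMAC-SHA256 hex digest of value under secret (same scheme as tokenize_pii)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


demo_secret = b"demo-secret-not-for-production"  # illustrative only
rotated_secret = b"a-different-demo-secret"      # simulates a rotation

first = tokenize("customer-1001", demo_secret)
again = tokenize("customer-1001", demo_secret)
other = tokenize("customer-1002", demo_secret)
rotated = tokenize("customer-1001", rotated_secret)

assert first == again    # same input and secret give the same token
assert first != other    # different customers stay distinguishable
assert first != rotated  # a new secret yields a new token namespace
assert len(first) == 64  # SHA-256 hex digest length
```

Tokens depend on the exact input string, so values that differ only in case or surrounding whitespace produce unrelated tokens. Whether to normalize identifiers before tokenizing is a design decision this page leaves open.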

Step 3 — Preserve spatial utility

Securing PII must not blunt the analytical signal. Site planners need sub-meter precision for drive-time isochrones, competitor proximity, and micro-catchment delineation. The pipeline keeps exact EPSG:4326 coordinates while decoupling them from identity — so a spatial join against demographic grids loses nothing. When downstream teams query the sanitized data, run joins and aggregation inside a governed environment; the schema and indexing patterns for that live in how to structure a geospatial database for multi-state retail chains.

Operational controls that sit alongside the pipeline:

Coordinate precision masking for sharing: for external vendors, apply spatial jitter or hexbin aggregation so precise traces are anonymized while catchment-level signal survives.
Access boundary enforcement: use AWS Lake Formation or IAM condition keys to restrict access. aws:PrincipalOrgID limits callers to your AWS Organization, and aws:SourceVpce limits requests to authorized VPC endpoints.
Audit trail: enable S3 Server Access Logging and CloudTrail data events to capture every GetObject and PutObject for immutable provenance.
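
As an illustration of the spatial jitter option, the sketch below displaces a point by a random offset bounded in metres. It is a simplified example under stated assumptions: jitter_point is a hypothetical helper, the 250 m default is arbitrary rather than a privacy guarantee, and the metres-per-degree conversion is a flat-earth approximation that degrades near the poles.

```python
import math
import random


def jitter_point(lat, lon, max_offset_m=250.0, rng=None):
    """Return (lat, lon) displaced by a random offset of at most max_offset_m.

    Uses a local flat-earth approximation, adequate for offsets of a few
    hundred metres at non-polar latitudes.
    """
    rng = rng or random.Random()
    # sqrt() spreads points uniformly over the disc instead of
    # clustering them near the centre.
    distance = max_offset_m * math.sqrt(rng.random())
    bearing = rng.uniform(0.0, 2.0 * math.pi)

    metres_per_deg_lat = 111_320.0  # approximate
    metres_per_deg_lon = metres_per_deg_lat * math.cos(math.radians(lat))

    d_lat = (distance * math.cos(bearing)) / metres_per_deg_lat
    d_lon = (distance * math.sin(bearing)) / metres_per_deg_lon
    return lat + d_lat, lon + d_lon


rng = random.Random(7)  # seeded only to make the demo repeatable
shared_lat, shared_lon = jitter_point(40.7128, -74.0060, max_offset_m=250.0, rng=rng)
```

Independent jitter per point averages out across repeated visits to the same location, so for dense traces aggregation is usually the stronger of the two options.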

Failure modes and debugging

AccessDenied on a correct-looking upload. The deny-by-default policy fires when SSE headers are missing or the key ARN does not match the StringNotLike condition. Confirm both ServerSideEncryption="aws:kms" and SSEKMSKeyId are passed, and that AWS_KMS_KEY_ID matches the policy’s key ARN exactly (including region and account).
InvalidRequest: Object Lock at upload. Object Lock is not enabled on the bucket. The simplest fix is to create the bucket with --object-lock-enabled-for-bucket; S3 also supports enabling Object Lock on an existing bucket, provided versioning is enabled first.
CRS is None / AssertionError in validate_and_transform. points_from_xy produces geometry without a CRS; the explicit crs= argument and the follow-up assertion catch this. Never let a GeoDataFrame proceed downstream without a CRS — unlabeled coordinates silently break later spatial joins.
Visit-frequency analysis breaks after a deploy. Rotating PII_TOKENIZATION_SECRET changes every token, so historical tokens no longer match new ones. Treat secret rotation as a deliberate re-pseudonymization event and reprocess affected partitions, or version the secret.
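Empty output / most rows dropped. Inputs defaulting to (0, 0) are rejected by the Null Island guard. Swapped lat/lon columns fail the bounds check only where the longitude magnitude exceeds 90 (for example, US locations west of 90°W); elsewhere swapped rows pass silently and surface later in the bounding-box check. Log the retained-vs-total count (the pipeline already does) and inspect a sample of dropped rows.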
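
When most rows are dropped, it helps to know which predicate rejected them. The sketch below mirrors the checks in validate_and_transform and labels each rejected coordinate pair; drop_reason and its label strings are hypothetical names introduced for illustration.

```python
from collections import Counter


def drop_reason(lat, lon):
    """Return why validate_and_transform would drop a row, or None if it is kept."""
    if lat is None or lon is None or lat != lat or lon != lon:  # NaN != NaN
        return "missing"
    if lat == 0.0 and lon == 0.0:
        return "null_island"
    if not -90.0 <= lat <= 90.0:
        # A "latitude" beyond +/-90 that would be a legal longitude,
        # paired with a "longitude" that would be a legal latitude,
        # suggests the two columns are swapped.
        if -180.0 <= lat <= 180.0 and -90.0 <= lon <= 90.0:
            return "lat_out_of_range_possible_swap"
        return "lat_out_of_range"
    if not -180.0 <= lon <= 180.0:
        return "lon_out_of_range"
    return None


sample = [
    (40.7128, -74.0060),    # valid
    (0.0, 0.0),             # default coordinates
    (-122.4194, 37.7749),   # columns swapped
    (95.0, 200.0),          # both out of range
    (float("nan"), 10.0),   # missing latitude
]
counts = Counter(drop_reason(lat, lon) for lat, lon in sample)
```

On a DataFrame the same function can be applied row-wise, for instance df.apply(lambda r: drop_reason(r["latitude"], r["longitude"]), axis=1), and the resulting labels counted with value_counts().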

Verification

After a batch runs, confirm correctness before trusting the output:

Row counts: compare the logged “Retained N/M” line against expectations; a large gap signals bad coordinates or a column swap.
Tokenization applied: read the written object back and assert the pii_columns contain 64-character hex digests, not plaintext: df["customer_id"].dropna().str.fullmatch(r"[0-9a-f]{64}").all().
Geometry validity and CRS: gdf = gpd.read_parquet(...); check gdf.crs.to_epsg() == 4326 and gdf.geometry.is_valid.all().
Bounding-box sanity: gdf.total_bounds should fall inside your operating region; values outside it usually mean lat/lon are reversed or the source CRS is not EPSG:4326.
Encryption + retention on the object: aws s3api head-object should report ServerSideEncryption: aws:kms and the correct SSEKMSKeyId; get-object-retention should return the GOVERNANCE window.
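IAM least privilege: the execution role should hold only s3:PutObject, kms:GenerateDataKey, and kms:Decrypt, scoped to this bucket and key ARN. Under GOVERNANCE mode, deleting a locked object version (delete-object with --version-id) should fail unless the caller holds s3:BypassGovernanceRetention and passes --bypass-governance-retention; a delete without a version ID only adds a delete marker.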
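
Two of these checks lend themselves to small reusable helpers. The sketch below is one possible shape, not part of the pipeline above: looks_tokenized and check_object_encryption are hypothetical names, and the second function expects the dict returned by the boto3 S3 client's head_object call.

```python
import re

TOKEN_PATTERN = re.compile(r"[0-9a-f]{64}")


def looks_tokenized(values):
    """True if every non-null value is a 64-character lowercase hex digest."""
    return all(
        isinstance(v, str) and TOKEN_PATTERN.fullmatch(v) is not None
        for v in values
        if v is not None and v == v  # skip None and NaN (NaN != NaN)
    )


def check_object_encryption(head_response, expected_key_arn):
    """Return a list of problems found in an S3 head_object response dict."""
    problems = []
    if head_response.get("ServerSideEncryption") != "aws:kms":
        problems.append("object is not SSE-KMS encrypted")
    if head_response.get("SSEKMSKeyId") != expected_key_arn:
        problems.append("object is encrypted with an unexpected KMS key")
    return problems
```

Both helpers work on plain Python values, so they can run in a unit test with a hand-built response dict as well as against a live bucket.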

Related

Configuring AWS S3 for Geospatial Data Lakes: partitioning, IAM, and lifecycle for the bucket this pipeline writes into.
Automating coordinate validation with Python and Shapely: the validation discipline behind the Null Island and bounds checks.
How to structure a geospatial database for multi-state retail chains: where the sanitized data is queried for catchment analysis.
