How to store GeoJSON in Apache Iceberg tables
Storing raw GeoJSON payloads in a lakehouse table degrades query performance, breaks vectorized execution, and introduces uncontrolled schema drift. The production objective is to normalize incoming GeoJSON into deterministic, columnar primitives that enable manifest-level spatial pruning while preserving downstream GIS compatibility. This guide details the exact configuration, encoding, and optimization steps required to operationalize GeoJSON ingestion in Apache Iceberg.
1. Deterministic Schema & WKB Encoding
Iceberg’s type system currently lacks a native GEOMETRY primitive. Persisting JSON strings forces runtime deserialization on every scan, starving the optimizer of column statistics. The production standard is to decompose GeoJSON into Well-Known Binary (WKB) and explicit bounding box columns. WKB is endian-safe, compact, and directly consumable by spatial UDFs without JSON parsing overhead.
CREATE TABLE gis.assets (
asset_id STRING NOT NULL,
geom_wkb BINARY NOT NULL,
bbox_min_x DOUBLE,
bbox_max_x DOUBLE,
bbox_min_y DOUBLE,
bbox_max_y DOUBLE,
srid INT DEFAULT 4326,
properties MAP<STRING, STRING>,
ingestion_ts TIMESTAMP NOT NULL,
partition_date DATE
)
USING iceberg
PARTITIONED BY (partition_date, bucket(16, asset_id))
LOCATION 's3://lakehouse-prod/gis/assets'
TBLPROPERTIES (
'write.parquet.compression-codec' = 'zstd',
'write.metadata.delete-after-commit.enabled' = 'true',
'write.metadata.previous-versions-max' = '10'
);
Key Configuration Notes:
BINARYstores raw WKB bytes. AvoidSTRINGorJSONtypes.- Bounding box columns (
bbox_min_x, etc.) must be typed asDOUBLEto enable range-based manifest pruning. propertiesusesMAP<STRING, STRING>for residual attributes. Critical keys should be promoted to explicit typed columns to prevent downstream type coercion failures.- Table properties enforce ZSTD compression and aggressive metadata cleanup to prevent manifest bloat on high-frequency spatial upserts.
2. Partitioning & Write Configuration
Spatial data requires partitioning strategies that align with query access patterns. Date-based partitioning combined with hash distribution prevents small-file proliferation and balances write parallelism. Avoid partitioning directly on coordinate values; it creates sparse partitions and degrades compaction efficiency.
Configure Spark/Iceberg write parameters to guarantee deterministic file layout:
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "true")
spark.conf.set("spark.sql.iceberg.planning.preserve-data-grouping", "true")
spark.conf.set("spark.sql.adaptive.enabled", "true")
For high-throughput ingestion, set write.distribution-mode=hash to co-locate related geometries and minimize cross-node shuffles during spatial joins. Enable write.parquet.page-size-bytes=1048576 to optimize column chunk alignment for spatial UDF scans.
3. Idempotent Ingestion Pipeline (PySpark)
The ingestion layer must parse GeoJSON, compute WKB, extract bounding boxes, and write to Iceberg in a single transactional pass. Use ST_GeomFromGeoJSON and ST_AsBinary equivalents via spatial libraries (GeoPandas, Sedona, or GeoTrellis) before DataFrame construction.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, current_timestamp, lit
import json
import struct
# Pseudocode for WKB + BBOX extraction (replace with Sedona/GeoPandas UDFs in prod)
def normalize_geojson(row):
geom = json.loads(row["geojson"])
coords = geom["geometry"]["coordinates"]
# Flatten to compute bbox (implementation depends on geometry type)
xs = [c[0] for c in flatten_coords(coords)]
ys = [c[1] for c in flatten_coords(coords)]
bbox = (min(xs), max(xs), min(ys), max(ys))
wkb = geom_to_wkb(geom["geometry"]) # Use shapely.wkb.dumps()
return (row["id"], wkb, bbox[0], bbox[1], bbox[2], bbox[3], row.get("props", {}))
df = spark.read.json("s3://ingest-bucket/daily_geojson/*.json")
normalized = df.rdd.map(normalize_geojson).toDF([
"asset_id", "geom_wkb", "bbox_min_x", "bbox_max_x",
"bbox_min_y", "bbox_max_y", "properties"
])
(normalized
.withColumn("ingestion_ts", current_timestamp())
.withColumn("partition_date", to_date(current_timestamp()))
.writeTo("gis.assets")
.option("mergeSchema", "false")
.append())
Critical Constraint: Disable mergeSchema on spatial tables. Uncontrolled property evolution breaks manifest statistics and invalidates downstream spatial indexes. Schema changes must be applied via explicit ALTER TABLE DDL.
4. Query Optimization & Manifest Pruning
Spatial predicate pushdown in Iceberg relies on manifest-level min/max statistics. When querying, always filter on the extracted bounding box columns before invoking spatial UDFs. This allows the query planner to skip Parquet files entirely.
SELECT asset_id, geom_wkb
FROM gis.assets
WHERE partition_date = '2024-11-01'
AND bbox_min_x <= -74.0060
AND bbox_max_x >= -73.9800
AND bbox_min_y <= 40.7128
AND bbox_max_y >= 40.7500
AND ST_Intersects(ST_GeomFromWKB(geom_wkb), ST_PolygonFromText('POLYGON(...)'));
The bounding box filter executes during file planning, while ST_Intersects runs only on candidate rows. This two-stage evaluation reduces I/O by 80–95% on tables exceeding 100M rows. For sustained performance, run CALL spark_catalog.system.rewrite_data_files('gis.assets', strategy => 'sort', sort_order => 'bbox_min_x ASC, bbox_min_y ASC') weekly to cluster spatially adjacent records and improve Z-order-like locality.
5. Failure Modes & Resolution Matrix
| Symptom | Root Cause | Resolution |
|---|---|---|
Full table scan on ST_Contains |
Missing or untyped bbox columns; query planner cannot push predicates | Add bbox_min/max_x/y as DOUBLE; rewrite queries to filter bbox before UDF execution |
Invalid WKB or endian mismatch errors |
Mixed-endian WKB generation across ingestion nodes | Enforce little-endian WKB output (shapely.wkb.dumps(geom, hex=False)); validate with ST_IsValid(ST_GeomFromWKB(...)) |
| Manifest corruption / slow metadata reads | Excessive previous-versions-max or disabled metadata cleanup |
Set write.metadata.previous-versions-max=10 and write.metadata.delete-after-commit.enabled=true |
Schema drift on properties map |
Ingesting unvalidated JSON with dynamic keys | Promote high-cardinality keys to explicit columns; enforce NOT NULL constraints; disable mergeSchema |
| Vectorized execution disabled | Parquet column alignment mismatch or unsupported UDFs | Verify spark.sql.iceberg.vectorization.enabled=true; ensure spatial UDFs are registered as deterministic and non-blocking |
Tracking Iceberg Spatial Type Support is required for future migration to native geometry primitives, but current production deployments must rely on WKB + bbox normalization to guarantee deterministic query planning. This architecture aligns with core Spatial Lakehouse Fundamentals & Architecture principles to prevent compute lock-in and maintain manifest-level statistics integrity.
For specification compliance, reference the OGC Simple Features Specification for WKB encoding rules and the Apache Iceberg Specification for manifest statistics generation. Implement strict schema validation at the ingestion boundary, enforce deterministic partitioning, and validate spatial UDF determinism before promoting to production.