Modernizing Spatial Workflows

Benchmarking DuckDB & Parquet vs. The Legacy Stack

Pukar Bhandari

2026-02-07

The Challenge: Data Validation

  • The Task: We have 1.2 million address points in Utah.
  • The Goal: Validate that the CountyID attribute matches the actual spatial location.
  • The Method: Perform a Point-in-Polygon join against County Boundaries.
  • The Constraint: It must be fast, reproducible, and (ideally) light on memory.

“Is the address labeled ‘Salt Lake County’ actually inside the Salt Lake County polygon?”
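
As a concrete illustration of the check itself, here is a minimal GeoPandas sketch; the file names, the CountyID / FIPS_STR columns, and the CRS are assumptions carried over from the examples later in this deck.

import geopandas as gpd
import pandas as pd

# Hypothetical sketch of the validation logic; paths and column names are assumed.
pts = pd.read_csv("addresses.csv", dtype={"CountyID": str})
pts = gpd.GeoDataFrame(pts, geometry=gpd.points_from_xy(pts.x, pts.y), crs="EPSG:4326")
counties = gpd.read_file("counties.shp").to_crs("EPSG:4326")

# Attach the county polygon each point actually falls in, then compare to its label.
joined = gpd.sjoin(pts, counties, predicate="within")
mislabeled = joined[joined["CountyID"] != joined["FIPS_STR"]]
print(f"{len(mislabeled)} of {len(pts)} points carry a CountyID that disagrees with geometry")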

The “Modern” Stack: Parquet

What is it?

  • Columnar Storage: Reads only the columns you need.
  • Typed Schema: No more guessing if “049” is a string or number.
  • Compression: Snappy or ZSTD reduces file size by ~60%.
  • GeoParquet: Standardizes how geometry is stored (WKB) so any tool can read it.
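
A one-time conversion makes those benefits concrete. The sketch below uses GeoPandas; the paths, the FIPS_STR column, and the ZSTD choice are assumptions.

import geopandas as gpd

# Hypothetical one-time conversion from shapefile to GeoParquet.
counties = gpd.read_file("counties.shp")
counties.to_parquet("counties.parquet", compression="zstd")  # geometry is stored as WKB

# Columnar payoff: read back only the two columns the later join actually needs.
slim = gpd.read_parquet("counties.parquet", columns=["FIPS_STR", "geometry"])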

Parquet Structure
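
That on-disk layout can be inspected directly with pyarrow (a sketch; the file name is assumed):

import pyarrow.parquet as pq

# Peek at the physical structure of a (Geo)Parquet file.
pf = pq.ParquetFile("counties.parquet")
print(pf.schema_arrow)                                 # typed columns; geometry is a binary (WKB) column
print(pf.metadata.num_row_groups)                      # rows are chunked into row groups
print(pf.metadata.row_group(0).column(0).compression)  # e.g. SNAPPY or ZSTD per column chunk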

The “Modern” Stack: DuckDB

What is it?

  • “SQLite for Analytics”: An embedded SQL OLAP database.
  • Vectorized Engine: Processes data in batches, not row-by-row.
  • Spatial Extension: Adds geometry support (ST_Within, ST_Point).
  • Zero-Copy: Queries Parquet files directly without importing data into its own storage.
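
A minimal taste of the engine, via the Python client (the file name is an assumption; the spatial extension only matters once geometry functions come in):

import duckdb

# Embedded: no server to run, no import step; the data stays in the Parquet file.
con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

# Query the file in place; only the columns the query touches are read.
print(con.execute("SELECT COUNT(*) FROM 'addresses.parquet'").fetchone()[0])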

Benchmark 01: Legacy R (sf)

The “Standard” Way: Load everything into RAM, process, output.

# 0. LIBRARIES
library(readr)
library(sf)
library(dplyr)

# 1. READ (Slow)
# Reads all 1.2M rows into memory immediately
pts <- read_csv("addresses.csv") |>
  st_as_sf(coords = c("x", "y"), crs = 4326)

# 2. TRANSFORM
# Must align CRS manually
cnt <- st_read("counties.shp") |> st_transform(4326)

# 3. SPATIAL JOIN (The Bottleneck)
# Uses GEOS (C++ library), evaluated geometry by geometry
joined <- st_join(pts, cnt, join = st_within)

# 4. AGGREGATE
result <- joined |> count(FIPS_ST)

Memory Hog

If your dataset is larger than your RAM, this script crashes.

Benchmark 02: Modern R (DuckDB)

The “Streaming” Way: Define what you want, let the engine optimize.

# 0. LIBRARIES
library(DBI)
library(duckdb)

# 1. CONNECT
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")

# 2. REGISTER VIEWS (Zero-Copy)
# "Point to the files, don't read them yet."
dbExecute(con, "CREATE VIEW p AS SELECT * FROM 'addresses.parquet'")
dbExecute(con, "CREATE VIEW c AS SELECT * FROM 'counties.parquet'")

# 3. SPATIAL SQL
# The engine optimizes the join strategy (Index vs Scan)
sql <- "
  SELECT c.FIPS_STR, COUNT(*)
  FROM p, c
  WHERE ST_Within(ST_Point(p.x, p.y), ST_GeomFromWKB(c.geometry))
  GROUP BY c.FIPS_STR
"
res <- dbGetQuery(con, sql)

Zero-Copy

DuckDB creates the “views” in milliseconds. It only reads the specific byte-ranges for x, y, and geometry during the actual query.
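
One way to verify this (a sketch using the Python client for brevity; the same SQL can be sent from R) is to ask the engine for its plan with EXPLAIN and look at what the Parquet scan operators project:

import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")
con.execute("CREATE VIEW p AS SELECT * FROM 'addresses.parquet'")
con.execute("CREATE VIEW c AS SELECT * FROM 'counties.parquet'")

# EXPLAIN returns the physical plan as text; the scan nodes list the columns actually read.
plan = con.execute("""
    EXPLAIN
    SELECT c.FIPS_STR, COUNT(*)
    FROM p, c
    WHERE ST_Within(ST_Point(p.x, p.y), ST_GeomFromWKB(c.geometry))
    GROUP BY c.FIPS_STR
""").fetchall()
print("\n".join(row[1] for row in plan))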

Benchmark 03: Legacy Python (GeoPandas)

The Python Standard: Wraps pandas and shapely.

# 0. IMPORTS
import pandas as pd
import geopandas as gpd

# 1. READ
# Pandas has high memory overhead for strings
df = pd.read_csv("addresses.csv", dtype={"CountyID": str})
pts = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.x, df.y),
    crs="EPSG:4326",  # must match the counties layer below
)

# 2. READ SHAPEFILE
# "FIPS_STRING" is truncated to "FIPS_ST" (10 char limit)
cnt = gpd.read_file("counties.shp").to_crs("EPSG:4326")

# 3. SPATIAL JOIN
# Uses Shapely 2.0 (formerly PyGEOS): vectorized but RAM heavy
joined = gpd.sjoin(pts, cnt, predicate="within")

# 4. AGGREGATE
res = joined.groupby("FIPS_ST").size()

Benchmark 04: Modern Python (DuckDB)

Same Engine, Different Syntax: The SQL logic is portable.

# 1. SETUP
import duckdb
con = duckdb.connect()
con.install_extension("spatial"); con.load_extension("spatial")

# 2. REGISTER
# Identical to R version
con.execute("CREATE VIEW p AS SELECT * FROM 'addresses.parquet'")
con.execute("CREATE VIEW c AS SELECT * FROM 'counties.parquet'")

# 3. EXECUTE
# We reuse the EXACT same SQL query string
query = """
  SELECT c.FIPS_STR, COUNT(*)
  FROM p, c
  WHERE ST_Within(ST_Point(p.x, p.y), ST_GeomFromWKB(c.geometry))
  GROUP BY c.FIPS_STR
"""
df = con.execute(query).df()

Results: Execution Time

How long does the user wait?

Results: Peak Memory Usage

How much RAM does it cost to run each workflow?
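
A rough sketch of how such numbers can be collected in Python (not necessarily the harness behind these charts; the workload line is a placeholder, and the resource module is Unix-only):

import resource
import time

import duckdb

start = time.perf_counter()

# ...run one of the benchmark workloads here, e.g. the query from Benchmark 04...
con = duckdb.connect()
con.execute("SELECT 42").fetchall()  # placeholder workload

elapsed = time.perf_counter() - start
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux, bytes on macOS
print(f"wall-clock: {elapsed:.2f}s, peak RSS: {peak / 1024:.0f} MB (Linux)")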

Why does this matter?

1. Scalability

  • Legacy: Fails when Data > RAM.
  • Modern: Processes 100GB of data on a laptop with 8GB RAM (Streaming).
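
A sketch of what that streaming looks like in practice (the paths, the CountyID column, and the 8 GB cap are illustrative):

import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '8GB'")                  # stay within a small laptop's RAM
con.execute("SET temp_directory = '/tmp/duckdb_spill'")  # out-of-core work spills to disk here

# 'big/*.parquet' stands in for a folder of Parquet files much larger than RAM.
n = con.execute("SELECT COUNT(DISTINCT CountyID) FROM 'big/*.parquet'").fetchone()[0]
print(n)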

2. File Formats

  • Shapefile: at least 3 files (.shp, .shx, .dbf, plus sidecars like .prj). Truncates column names to 10 characters. 2 GB per-file limit.
  • Parquet: 1 file. Preserves full column names and types. No practical size limit.

3. Portability

  • SQL: The SQL query in R was identical to the SQL query in Python.
  • Logic is moved to the Data layer, not the Script layer.

Conclusion

Feature      Legacy Stack            Modern Stack
Format       CSV / Shapefile         Parquet / GeoParquet
Engine       In-Memory (sf/pandas)   Out-of-Core (DuckDB)
Join Speed   ~15s                    ~1s
RAM Usage    ~2 GB                   ~200 MB
Limit        RAM Size                Disk Space

The “Modern Data Stack” isn’t just hype.

It’s simply better engineering.