Modernizing Spatial Workflows

Benchmarking DuckDB & Parquet vs. The Legacy Stack

Pukar Bhandari

2026-02-07

The Challenge: Data Validation

  • The Task: We have 1.2 million address points in Utah.
  • The Goal: Validate that the CountyID attribute matches the actual spatial location.
  • The Method: Perform a Point-in-Polygon join against County Boundaries.
  • The Constraint: It must be fast, reproducible, and (ideally) light on memory.

“Is the address labeled ‘Salt Lake County’ actually inside the Salt Lake County polygon?”
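
As a concrete illustration of the check itself, here is a minimal GeoPandas sketch; the file names, the CountyID / FIPS_STR columns, and the CRS are assumptions carried over from the examples later in this deck.

import geopandas as gpd
import pandas as pd

# Hypothetical sketch of the validation logic; paths and column names are assumed.
pts = pd.read_csv("addresses.csv", dtype={"CountyID": str})
pts = gpd.GeoDataFrame(pts, geometry=gpd.points_from_xy(pts.x, pts.y), crs="EPSG:4326")
counties = gpd.read_file("counties.shp").to_crs("EPSG:4326")

# Attach the county polygon each point actually falls in, then compare to its label.
joined = gpd.sjoin(pts, counties, predicate="within")
mislabeled = joined[joined["CountyID"] != joined["FIPS_STR"]]
print(f"{len(mislabeled)} of {len(pts)} points carry a CountyID that disagrees with geometry")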

The “Modern” Stack: Parquet

What is it?

  • Columnar Storage: Reads only the columns you need.
  • Typed Schema: No more guessing if “049” is a string or number.
  • Compression: Snappy or ZSTD reduces file size by ~60%.
  • GeoParquet: Standardizes how geometry is stored (WKB) so any tool can read it.
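
A one-time conversion makes those benefits concrete. The sketch below uses GeoPandas; the paths, the FIPS_STR column, and the ZSTD choice are assumptions.

import geopandas as gpd

# Hypothetical one-time conversion from shapefile to GeoParquet.
counties = gpd.read_file("counties.shp")
counties.to_parquet("counties.parquet", compression="zstd")  # geometry is stored as WKB

# Columnar payoff: read back only the two columns the later join actually needs.
slim = gpd.read_parquet("counties.parquet", columns=["FIPS_STR", "geometry"])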

Parquet Structure
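
That on-disk layout can be inspected directly with pyarrow (a sketch; the file name is assumed):

import pyarrow.parquet as pq

# Peek at the physical structure of a (Geo)Parquet file.
pf = pq.ParquetFile("counties.parquet")
print(pf.schema_arrow)                                 # typed columns; geometry is a binary (WKB) column
print(pf.metadata.num_row_groups)                      # rows are chunked into row groups
print(pf.metadata.row_group(0).column(0).compression)  # e.g. SNAPPY or ZSTD per column chunk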

The “Modern” Stack: DuckDB

What is it?

  • “SQLite for Analytics”: An embedded SQL OLAP database.
  • Vectorized Engine: Processes data in batches, not row-by-row.
  • Spatial Extension: Adds geometry support (ST_Within, ST_Point).
  • Zero-Copy: Queries Parquet files directly without importing data into its own storage.
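
A minimal taste of the engine, via the Python client (the file name is an assumption; the spatial extension only matters once geometry functions come in):

import duckdb

# Embedded: no server to run, no import step; the data stays in the Parquet file.
con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

# Query the file in place; only the columns the query touches are read.
print(con.execute("SELECT COUNT(*) FROM 'addresses.parquet'").fetchone()[0])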

Benchmark 01: Legacy R (sf)

The “Standard” Way: Load everything into RAM, process, output.

# 0. LIBRARIES
library(readr)
library(sf)
library(dplyr)

# 1. READ (Slow)
# Reads all 1.2M rows into memory immediately
pts <- read_csv("addresses.csv") |>
  st_as_sf(coords = c("x", "y"), crs = 4326)

# 2. TRANSFORM
# Must align CRS manually
cnt <- st_read("counties.shp") |> st_transform(4326)

# 3. SPATIAL JOIN (The Bottleneck)
# Uses GEOS (C++ library), evaluated geometry by geometry
joined <- st_join(pts, cnt, join = st_within)

# 4. AGGREGATE
result <- joined |> count(FIPS_ST)

Memory Hog

If your dataset is larger than your RAM, this script crashes.

Benchmark 02: Modern R (DuckDB)

The “Streaming” Way: Define what you want, let the engine optimize.

# 0. LIBRARIES
library(DBI)
library(duckdb)

# 1. CONNECT
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")

# 2. REGISTER VIEWS (Zero-Copy)
# "Point to the files, don't read them yet."
dbExecute(con, "CREATE VIEW p AS SELECT * FROM 'addresses.parquet'")
dbExecute(con, "CREATE VIEW c AS SELECT * FROM 'counties.parquet'")

# 3. SPATIAL SQL
# The engine optimizes the join strategy (Index vs Scan)
sql <- "
  SELECT c.FIPS_STR, COUNT(*)
  FROM p, c
  WHERE ST_Within(ST_Point(p.x, p.y), ST_GeomFromWKB(c.geometry))
  GROUP BY c.FIPS_STR
"
res <- dbGetQuery(con, sql)

Zero-Copy

DuckDB creates the “views” in milliseconds. It only reads the specific byte-ranges for x, y, and geometry during the actual query.
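
One way to verify this (a sketch using the Python client for brevity; the same SQL can be sent from R) is to ask the engine for its plan with EXPLAIN and look at what the Parquet scan operators project:

import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")
con.execute("CREATE VIEW p AS SELECT * FROM 'addresses.parquet'")
con.execute("CREATE VIEW c AS SELECT * FROM 'counties.parquet'")

# EXPLAIN returns the physical plan as text; the scan nodes list the columns actually read.
plan = con.execute("""
    EXPLAIN
    SELECT c.FIPS_STR, COUNT(*)
    FROM p, c
    WHERE ST_Within(ST_Point(p.x, p.y), ST_GeomFromWKB(c.geometry))
    GROUP BY c.FIPS_STR
""").fetchall()
print("\n".join(row[1] for row in plan))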

Benchmark 03: Legacy Python (GeoPandas)

The Python Standard: Wraps pandas and shapely.

# 0. IMPORTS
import pandas as pd
import geopandas as gpd

# 1. READ
# Pandas has high memory overhead for strings
df = pd.read_csv("addresses.csv", dtype={"CountyID": str})
pts = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.x, df.y),
    crs="EPSG:4326",  # must match the counties layer below
)

# 2. READ SHAPEFILE
# "FIPS_STRING" is truncated to "FIPS_ST" (10 char limit)
cnt = gpd.read_file("counties.shp").to_crs("EPSG:4326")

# 3. SPATIAL JOIN
# Uses Shapely 2.0 (formerly PyGEOS): vectorized but RAM heavy
joined = gpd.sjoin(pts, cnt, predicate="within")

# 4. AGGREGATE
res = joined.groupby("FIPS_ST").size()

Benchmark 04: Modern Python (DuckDB)

Same Engine, Different Syntax: The SQL logic is portable.

# 1. SETUP
import duckdb
con = duckdb.connect()
con.install_extension("spatial"); con.load_extension("spatial")

# 2. REGISTER
# Identical to R version
con.execute("CREATE VIEW p AS SELECT * FROM 'addresses.parquet'")
con.execute("CREATE VIEW c AS SELECT * FROM 'counties.parquet'")

# 3. EXECUTE
# We reuse the EXACT same SQL query string
query = """
  SELECT c.FIPS_STR, COUNT(*)
  FROM p, c
  WHERE ST_Within(ST_Point(p.x, p.y), ST_GeomFromWKB(c.geometry))
  GROUP BY c.FIPS_STR
"""
df = con.execute(query).df()

Results: Execution Time

How long does the user wait?

Results: Peak Memory Usage

How much RAM does it cost to run each workflow?
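
A rough sketch of how such numbers can be collected in Python (not necessarily the harness behind these charts; the workload line is a placeholder, and the resource module is Unix-only):

import resource
import time

import duckdb

start = time.perf_counter()

# ...run one of the benchmark workloads here, e.g. the query from Benchmark 04...
con = duckdb.connect()
con.execute("SELECT 42").fetchall()  # placeholder workload

elapsed = time.perf_counter() - start
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux, bytes on macOS
print(f"wall-clock: {elapsed:.2f}s, peak RSS: {peak / 1024:.0f} MB (Linux)")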

Why does this matter?

1. Scalability

  • Legacy: Fails when Data > RAM.
  • Modern: Processes 100GB of data on a laptop with 8GB RAM (Streaming).
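
A sketch of what that streaming looks like in practice (the paths, the CountyID column, and the 8 GB cap are illustrative):

import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '8GB'")                  # stay within a small laptop's RAM
con.execute("SET temp_directory = '/tmp/duckdb_spill'")  # out-of-core work spills to disk here

# 'big/*.parquet' stands in for a folder of Parquet files much larger than RAM.
n = con.execute("SELECT COUNT(DISTINCT CountyID) FROM 'big/*.parquet'").fetchone()[0]
print(n)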

2. File Formats

  • Shapefile: at least 3 files (.shp, .shx, .dbf, plus sidecars like .prj). Truncates column names to 10 characters. 2 GB per-file limit.
  • Parquet: 1 file. Preserves full column names and types. No practical size limit.

3. Portability

  • SQL: The SQL query in R was identical to the SQL query in Python.
  • Logic is moved to the Data layer, not the Script layer.

Conclusion

Feature      Legacy Stack            Modern Stack
Format       CSV / Shapefile         Parquet / GeoParquet
Engine       In-Memory (sf/pandas)   Out-of-Core (DuckDB)
Join Speed   ~15s                    ~1s
RAM Usage    ~2 GB                   ~200 MB
Limit        RAM Size                Disk Space

The “Modern Data Stack” isn’t just hype.

It’s simply better engineering.