Benchmarking DuckDB & Parquet vs. The Legacy Stack
2026-02-07
CountyID attribute matches the actual spatial location.“Is the address labeled ‘Salt Lake County’ actually inside the Salt Lake County polygon?”
ST_Within, ST_Point).
sf)The “Standard” Way: Load everything into RAM, process, output.
# 1. READ (Slow)
# Reads all 1.2M rows into memory immediately
pts <- read_csv("addresses.csv") |>
st_as_sf(coords = c("x", "y"), crs = 4326)
# 2. TRANSFORM
# Must align CRS manually
cnt <- st_read("counties.shp") |> st_transform(4326)
# 3. SPATIAL JOIN (The Bottleneck)
# Uses GEOS (row-by-row C++ library)
joined <- st_join(pts, cnt, join = st_within)
# 4. AGGREGATE
result <- joined |> count(FIPS_ST)Memory Hog
If your dataset is larger than your RAM, this script crashes.
DuckDB)The “Streaming” Way: Define what you want, let the engine optimize.
# 1. CONNECT
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")
# 2. REGISTER VIEWS (Zero-Copy)
# "Point to the files, don't read them yet."
dbExecute(con, "CREATE VIEW p AS SELECT * FROM 'addresses.parquet'")
dbExecute(con, "CREATE VIEW c AS SELECT * FROM 'counties.parquet'")
# 3. SPATIAL SQL
# The engine optimizes the join strategy (Index vs Scan)
sql <- "
SELECT c.FIPS_STR, COUNT(*)
FROM p, c
WHERE ST_Within(ST_Point(p.x, p.y), ST_GeomFromWKB(c.geometry))
GROUP BY c.FIPS_STR
"
res <- dbGetQuery(con, sql)Zero-Copy
DuckDB creates the “views” in milliseconds. It only reads the specific byte-ranges for x, y, and geometry during the actual query.
GeoPandas)The Python Standard: Wraps pandas and shapely.
# 1. READ
# Pandas has high memory overhead for strings
df = pd.read_csv("addresses.csv", dtype={"CountyID": str})
pts = gpd.GeoDataFrame(df,
geometry=gpd.points_from_xy(df.x, df.y))
# 2. READ SHAPEFILE
# "FIPS_STRING" is truncated to "FIPS_ST" (10 char limit)
cnt = gpd.read_file("counties.shp").to_crs("EPSG:4326")
# 3. SPATIAL JOIN
# Uses PyGEOS/Shapely 2.0 (Vectorized but RAM heavy)
joined = gpd.sjoin(pts, cnt, predicate="within")
# 4. AGGREGATE
res = joined.groupby("FIPS_ST").size()DuckDB)Same Engine, Different Syntax: The SQL logic is portable.
# 1. SETUP
import duckdb
con = duckdb.connect()
con.install_extension("spatial"); con.load_extension("spatial")
# 2. REGISTER
# Identical to R version
con.execute("CREATE VIEW p AS SELECT * FROM 'addresses.parquet'")
con.execute("CREATE VIEW c AS SELECT * FROM 'counties.parquet'")
# 3. EXECUTE
# We reuse the EXACT same SQL query string
query = """
SELECT c.FIPS_STR, COUNT(*)
FROM p, c
WHERE ST_Within(ST_Point(p.x, p.y), ST_GeomFromWKB(c.geometry))
GROUP BY c.FIPS_STR
"""
df = con.execute(query).df()How long does the user wait?
What is the cost of running this infrastructure?
| Feature | Legacy Stack | Modern Stack |
|---|---|---|
| Format | CSV / Shapefile | Parquet / GeoParquet |
| Engine | In-Memory (sf/pandas) | Out-of-Core (DuckDB) |
| Join Speed | ~15s | ~1s |
| RAM Usage | ~2 GB | ~200 MB |
| Limit | RAM Size | Disk Space |
Transportation Data Scientist Candidate | WFRC Interview