Forget the Data Lake: In‑Process Analytics with DuckDB, Arrow Flight, and WebAssembly in 2025
Most teams don’t need a lake to answer a question. In 2025, analytics is moving back into the application process: DuckDB scans Parquet in milliseconds, Apache Arrow Flight streams columnar batches directly to clients, and DuckDB compiled to WebAssembly lets you ship Business Intelligence to the browser. You can keep Spark, Trino, and the lakehouse where they shine (multi‑petabyte ETL, cross‑team governance), but for interactive analytics, embedded transformations, and product telemetry, in‑process analytics is faster to build, cheaper to run, and often faster end‑to‑end.
This article is a practical deep dive for engineers: the building blocks (DuckDB, Arrow, and WASM), reference patterns, limits you will hit, a reproducible benchmark recipe, and a migration playbook to turn a lake‑centric BI stack into a lean, local‑first analytics layer.
TL;DR
- In‑process analytics means you run the query engine inside your service or client, not in a remote cluster.
- DuckDB executes vectorized SQL over local files and object storage (Parquet/CSV/S3/HTTP) with strong pushdown and parallelism.
- Apache Arrow and Arrow Flight give you a zero‑copy columnar memory model and a high‑throughput RPC to move data as Arrow RecordBatches, not as JSON/CSV text.
- WebAssembly (duckdb‑wasm) runs the same SQL engine in browsers and serverless contexts, enabling offline and edge BI.
- Expect single‑node performance that rivals or beats distributed stacks for many interactive workloads, with a fraction of the operational overhead.
- Limits: memory, concurrency (per‑process), streaming ingestion, governance, and browser constraints. Use the warehouse/lake for multi‑team, multi‑writer, and large‑scale ETL.
- Migration: identify candidate dashboards, export to Parquet, wire Arrow Flight endpoints, embed DuckDB in services and the browser, measure, and iterate.
Why in‑process, and why now?
The modern data stack optimized for scale and shared compute. Along the way, it also accumulated operational complexity and latency. A typical path for a product analytics query today: browser → REST → API → warehouse connector → cluster scheduler → scan JSON/Parquet → serialize rows to JSON → API → browser. Most of the time is spent orchestrating machinery and converting formats.
Three industry shifts made in‑process analytics compelling:
- Columnar everywhere: Apache Parquet and Apache Arrow normalize in‑memory and on‑disk columnar formats. You can often scan exactly the columns you need with predicate pushdown.
- Vectorized, single‑node engines: DuckDB runs analytical SQL at memory bandwidth on a laptop or in a container, and it scales down gracefully. It also scales out logically inside your app topology: one engine per service instance.
- Portable compute: WebAssembly and WASI run native code safely in browsers and serverless runtimes. duckdb‑wasm brings a production‑quality SQL engine to clients.
The result is a simpler path: application → local Parquet/S3 via HTTPFS → DuckDB query → Arrow Flight → client. Fewer hops, binary columnar end‑to‑end, and strong locality. For many BI interactions, this beats a lakehouse round‑trip by seconds and dollars.
Building blocks
DuckDB in short
- Embedded OLAP database designed for in‑process analytics.
- Vectorized execution, parallel operators, and out‑of‑core algorithms when data exceeds memory.
- Reads/writes Parquet, CSV, JSON; can query S3/HTTP directly via the httpfs extension; supports Iceberg and Delta readers via extensions.
- Rich SQL including window functions, struct/list types, JSON functions, and joins that exploit statistics for pushdown.
- Integrates natively with Apache Arrow: it can both consume and produce Arrow RecordBatches.
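A minimal sketch of that interop in Python (table and column names are made up for illustration): an Arrow table is queried in place, and results come back either as a whole Arrow table or as a stream of record batches.

```python
import duckdb
import pyarrow as pa

# An Arrow table built in application code (it could equally come from Flight or Pandas)
events = pa.table({
    "device_type": ["ios", "android", "ios"],
    "duration_ms": [1200, 800, 450],
})

con = duckdb.connect()
con.register("events", events)  # DuckDB scans the Arrow buffers in place, no copy into the database

# Whole result as an Arrow table
summary = con.execute(
    "SELECT device_type, AVG(duration_ms) AS avg_ms FROM events GROUP BY 1"
).fetch_arrow_table()

# Or as a pyarrow.RecordBatchReader, producing batches incrementally
reader = con.execute("SELECT * FROM events").fetch_record_batch(1024)
for batch in reader:
    print(batch.num_rows)
```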
Useful docs:
- duckdb.org/docs
- duckdb‑wasm repo: github.com/duckdb/duckdb-wasm
- Extensions: duckdb.org/docs/extensions
Apache Arrow and Arrow Flight
- Arrow is an in‑memory columnar format designed for zero‑copy interchange across languages and processes.
- Arrow Flight is a high‑performance RPC (over gRPC) for sending Arrow RecordBatches; Flight SQL layers a SQL interface on top.
- Benefits vs JSON/CSV over REST: fewer bytes on the wire (no text encoding), vectorized decoding, schemaful, and stream‑friendly backpressure.
Docs:
- Arrow: arrow.apache.org
- Flight: arrow.apache.org/docs/format/Flight.html
- Flight SQL: arrow.apache.org/docs/format/FlightSql.html
WebAssembly (WASM)
- Safe, portable binary format for running compiled code in browsers and serverless runtimes.
- duckdb-wasm ships several bundles: a baseline single-threaded build that runs in any modern browser, and a threaded variant that uses Web Workers plus SharedArrayBuffer for parallelism in cross-origin-isolated contexts.
- Ideal for offline analytics, privacy‑sensitive computations (data never leaves the client), and responsive BI.
Architectural patterns
Embedded BI in the service tier
- Each API instance embeds DuckDB. Data lives in object storage (S3/GCS/Azure Blob) as Parquet. When a user opens a dashboard, the service issues a SQL query directly against Parquet files, then streams Arrow to the frontend.
- Pros: Minimal infrastructure, low latency, predictable cost. Cons: concurrency and per‑instance caching strategies matter.
Local-first desktop BI
- Electron/Tauri app bundles duckdb and queries local Parquet or synced S3 objects. Analysts can load files, join with local CSVs, and export results without a network connection.
Browser BI with duckdb-wasm
- The web app fetches Parquet via HTTP(S) or from the File System Access API, registers those as tables in duckdb‑wasm, and runs SQL directly on the client. For large datasets, use sampling or pre‑aggregated tiles.
Flight gateway for columnar microservices
- Replace JSON REST endpoints with an Arrow Flight service that pushes columnar batches. Downstream services or frontends consume with pyarrow, Node, or Rust/Go clients.
Hybrid warehouse + in-process
- Keep ETL and governance in BigQuery/Snowflake/Spark. Periodically export curated Parquet to S3. Serve product analytics and internal explorations via DuckDB + Flight. This removes the hot path from the warehouse without duplicating compute for every request.
Concrete data access patterns
Query Parquet/CSV/S3 locally with DuckDB
SQL examples (DuckDB CLI or any binding):
```sql
-- Enable HTTP/S3 support
INSTALL httpfs;
LOAD httpfs;

-- Configure S3 credentials (environment variables also work)
SET s3_region='us-east-1';
SET s3_access_key_id='AKIA...';
SET s3_secret_access_key='...';
SET s3_use_ssl=true;
SET s3_url_style='path';

-- Query a partitioned dataset on S3 with hive-style partitions
SELECT device_type, COUNT(*) AS n, AVG(duration_ms) AS avg_ms
FROM read_parquet('s3://my-bucket/events/date=2025-09-01/*.parquet')
WHERE duration_ms < 10000
  AND country = 'US'
GROUP BY device_type
ORDER BY n DESC;

-- Register a dataset as a view for reuse
CREATE VIEW events AS
SELECT * FROM read_parquet('s3://my-bucket/events/date=*/hour=*/*.parquet');

-- Pushdown filters and projections are automatic when reading Parquet
SELECT date, device_type, COUNT(*)
FROM events
WHERE date BETWEEN '2025-09-01' AND '2025-09-07'
  AND device_type IN ('ios','android')
GROUP BY 1,2;
```
Notes:
- Use hive-style partition paths and targeted globs so partition pruning limits how many objects are listed and scanned.
- Keep Parquet row groups reasonably large (128–512 MB uncompressed) to maximize scan throughput and pushdown efficiency.
- Prefer ZSTD for compression; it often reduces IO without hurting CPU too much for analytics.
Stream Arrow batches over Flight
A minimal Python Flight server that serves a dataset as Arrow batches:
```python
import duckdb
import pyarrow.flight as fl


class EventsFlightServer(fl.FlightServerBase):
    def __init__(self, location):
        super().__init__(location)
        self._location = location
        self._con = duckdb.connect()
        self._con.execute("INSTALL httpfs; LOAD httpfs;")
        self._con.execute("SET s3_region='us-east-1';")
        # ... load credentials via env/IMDS or AWS SDK

    def do_get(self, context, ticket):
        # The ticket carries the SQL to execute; results stream back as Arrow batches
        sql = ticket.ticket.decode('utf-8')
        table = self._con.execute(sql).fetch_arrow_table()
        return fl.RecordBatchStream(table)

    def get_flight_info(self, context, descriptor):
        # Provide schema for discovery without running the full query
        sql = descriptor.command.decode('utf-8')
        schema = self._con.execute(sql + ' LIMIT 0').fetch_arrow_table().schema
        endpoints = [fl.FlightEndpoint(b'', [self._location])]
        return fl.FlightInfo(schema, descriptor, endpoints, -1, -1)


if __name__ == '__main__':
    server = EventsFlightServer('grpc://0.0.0.0:8815')
    server.serve()
```
Client consuming Arrow and converting to Pandas or DuckDB directly:
```python
import duckdb
import pyarrow.flight as fl

client = fl.FlightClient('grpc://localhost:8815')

sql = """
SELECT device_type, COUNT(*) n
FROM read_parquet('s3://my-bucket/events/date=2025-09-01/*.parquet')
GROUP BY 1
ORDER BY n DESC
"""

reader = client.do_get(fl.Ticket(sql.encode('utf-8')))

# As a PyArrow Table
table = reader.read_all()
print(table.to_pandas())

# Or register Arrow directly in DuckDB for downstream joins
con = duckdb.connect()
con.register('t', table)
print(con.execute('SELECT * FROM t ORDER BY n DESC').fetch_df())
```
Compared to JSON, Flight avoids serialization overhead and preserves column types. Downstream consumers can stream batches as they arrive.
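As a sketch of that streaming consumption with pyarrow (reusing the server above), batches can be processed as they arrive rather than after a full read_all():

```python
import pyarrow.flight as fl

client = fl.FlightClient('grpc://localhost:8815')
sql = b"SELECT * FROM read_parquet('s3://my-bucket/events/date=2025-09-01/*.parquet')"

reader = client.do_get(fl.Ticket(sql))
total_rows = 0
for chunk in reader:            # each chunk is delivered as soon as the server sends it
    batch = chunk.data          # a pyarrow.RecordBatch
    total_rows += batch.num_rows
print(total_rows)
```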
Run DuckDB in the browser with WebAssembly
A minimal setup using duckdb‑wasm to run SQL against a Parquet file fetched over HTTP:
```html
<script type='module'>
  import * as duckdb from 'https://cdn.skypack.dev/@duckdb/duckdb-wasm';

  const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
  const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);

  const worker = new Worker(bundle.mainWorker);
  const logger = new duckdb.ConsoleLogger();
  const db = new duckdb.AsyncDuckDB(logger, worker);
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

  const conn = await db.connect();
  await conn.query("INSTALL httpfs; LOAD httpfs;");
  await conn.query("CREATE TABLE t AS SELECT 1 AS id, 'hello' AS msg;");

  // Query remote parquet directly (CORS must allow it)
  const sql = `
    SELECT country, COUNT(*) n
    FROM read_parquet('https://cdn.example.com/datasets/events/2025-09-01/*.parquet')
    WHERE duration_ms < 10000
    GROUP BY 1
    ORDER BY n DESC
    LIMIT 10
  `;
  const result = await conn.query(sql);
  console.table(result.toArray());
</script>
```
Notes:
- For parallel query execution in the browser, you need cross‑origin isolation to enable SharedArrayBuffer.
- Large datasets should be pre‑aggregated server‑side; browsers still have memory limits (commonly in the 1–2 GB range per tab) and network constraints.
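One way to do that pre-aggregation is to publish small "tile" Parquet files from a server-side DuckDB job and point the browser at those instead of the raw events. A sketch (bucket paths and columns are illustrative):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # credentials assumed to be configured via environment

# Collapse a day of raw events into a small aggregate the browser can load cheaply
con.execute("""
    COPY (
        SELECT country, device_type, COUNT(*) AS n, AVG(duration_ms) AS avg_ms
        FROM read_parquet('s3://my-bucket/events/date=2025-09-01/*.parquet')
        GROUP BY 1, 2
    ) TO 'tiles/events_2025-09-01.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
```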
Performance: what to expect (and how to measure)
Rather than claim absolute numbers, here’s how to benchmark in your environment and what patterns typically dominate.
Methodology: TPC‑H on Parquet with DuckDB
- Generate TPC‑H data and export to Parquet using DuckDB itself:
```sql
-- In DuckDB CLI
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf=10);  -- scale factor 10 (~10 GB raw)

-- Export each table to partitioned Parquet
COPY lineitem TO 'parquet/tpch/lineitem' (FORMAT PARQUET, ROW_GROUP_SIZE 12800000, PARTITION_BY (l_shipdate));
COPY orders   TO 'parquet/tpch/orders'   (FORMAT PARQUET, ROW_GROUP_SIZE 8000000, PARTITION_BY (o_orderdate));
COPY customer TO 'parquet/tpch/customer' (FORMAT PARQUET);
COPY part     TO 'parquet/tpch/part'     (FORMAT PARQUET);
COPY partsupp TO 'parquet/tpch/partsupp' (FORMAT PARQUET);
COPY supplier TO 'parquet/tpch/supplier' (FORMAT PARQUET);
COPY nation   TO 'parquet/tpch/nation'   (FORMAT PARQUET);
COPY region   TO 'parquet/tpch/region'   (FORMAT PARQUET);
```
- Run standard TPC‑H queries with timings enabled:
```sql
.timer on
PRAGMA threads=8; -- or number of cores

-- Example Query 1
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty,
    AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM read_parquet('parquet/tpch/lineitem/*/*.parquet')
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```
- Repeat with data on S3 by pointing read_parquet to s3:// paths and enabling httpfs.
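A small timing harness along these lines keeps the comparison honest across runs and storage backends; the query below is a simplified stand-in for the full TPC-H set, and paths assume the export above:

```python
import statistics
import time

import duckdb

QUERIES = {
    "q1_local": (
        "SELECT l_returnflag, l_linestatus, COUNT(*) AS n "
        "FROM read_parquet('parquet/tpch/lineitem/*/*.parquet') "
        "GROUP BY 1, 2"
    ),
    # Add a "q1_s3" variant pointing at s3:// paths to compare storage backends
}

con = duckdb.connect()
con.execute("PRAGMA threads=8;")

for name, sql in QUERIES.items():
    timings = []
    for _ in range(5):  # the first run pays cold-cache costs; the median smooths that out
        start = time.perf_counter()
        con.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    print(name, f"median={statistics.median(timings):.3f}s", f"min={min(timings):.3f}s")
```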
Interpretation:
- On NVMe SSD, DuckDB often scans data near storage throughput with good pushdown. For aggregations and joins with selective predicates, expect sub‑second to a few seconds at SF1–SF10 on modern CPUs.
- S3 brings network bandwidth and per‑request latency into play. Use larger row groups and fewer files to reduce HTTP requests. CDN or S3 Transfer Acceleration can help for geo‑distributed users.
Arrow Flight vs JSON/CSV
To compare transport overhead:
- Create an API endpoint serving the same dataset via JSON and via Flight.
- Measure end‑to‑end time to first row and total time to process N rows in a client that materializes data into columns (e.g., Pandas DataFrame).
- You should observe significantly better CPU utilization and lower wall‑clock with Flight, especially for wide tables (many columns) and numeric data, because you skip parsing and type inference.
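A sketch of that measurement, assuming a hypothetical JSON endpoint serving the same result and the Flight server from earlier; both paths are timed until a Pandas DataFrame exists:

```python
import time

import pandas as pd
import pyarrow.flight as fl
import requests

SQL = ("SELECT device_type, COUNT(*) AS n "
       "FROM read_parquet('s3://my-bucket/events/date=2025-09-01/*.parquet') GROUP BY 1")

# JSON over REST: parse text, infer types, build columns row by row
start = time.perf_counter()
rows = requests.get("http://localhost:8000/events.json", timeout=60).json()  # hypothetical endpoint
df_json = pd.DataFrame(rows)
t_json = time.perf_counter() - start

# Arrow Flight: columnar batches land directly in Arrow memory, then convert to Pandas
start = time.perf_counter()
client = fl.FlightClient("grpc://localhost:8815")
df_flight = client.do_get(fl.Ticket(SQL.encode())).read_all().to_pandas()
t_flight = time.perf_counter() - start

print(f"JSON: {t_json:.2f}s  Flight: {t_flight:.2f}s")
```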
Browser with duckdb‑wasm
- Measure with developer tools. Capture memory usage and CPU while querying local Parquet.
- Expect slower than native DuckDB due to WASM overhead, but still interactive for tens of millions of values, especially when queries are selective. Use pre‑aggregations for very large scans.
Practical performance levers
- Predicate pushdown and column pruning: Filter on partition keys and columns with statistics. Write Parquet with min/max stats enabled (default) and avoid types that limit stats effectiveness (e.g., high‑cardinality strings without dictionary encoding).
- Row group size: Larger row groups reduce IO overhead and increase scan efficiency, but too large can hurt parallelism. Start at 128–512 MB uncompressed.
- File counts: Thousands of tiny files kill performance over S3 due to request overhead. Compact into larger files.
- Compression: ZSTD at moderate levels often yields a good CPU/IO tradeoff. Snappy is faster to decompress but larger on disk.
- Caching: enable DuckDB's object cache (SET enable_object_cache=true) so repeated scans of the same Parquet files reuse metadata, and put a local disk cache or CDN in front of S3 for repeated global access. Warm caches at service startup if hot dashboards are known.
- Threads: Set PRAGMA threads to match CPU cores. In multi‑tenant services, cap it to avoid noisy neighbors.
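In a multi-tenant service these levers usually end up in a small connection factory; a sketch using DuckDB's threads and memory_limit settings (values are placeholders to tune per workload):

```python
import duckdb


def analytics_connection(threads: int = 4, memory_limit: str = "4GB") -> duckdb.DuckDBPyConnection:
    """Create a DuckDB connection with capped resources for shared app servers."""
    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute(f"SET threads={threads};")              # cap CPU used by this connection
    con.execute(f"SET memory_limit='{memory_limit}';")  # bound memory before spilling or failing
    return con


con = analytics_connection(threads=2, memory_limit="2GB")
```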
Where in‑process analytics shines
- Product analytics and embedded dashboards where the working set is in the tens of gigabytes per tenant or per day, not petabytes.
- Operational reporting with strict P95 latency targets (< 1–2 s) where cluster warm‑up and orchestration latencies dominate.
- Privacy‑sensitive or offline scenarios where raw data must remain on device (browser/desktop) and only aggregates leave.
- Data science notebooks and feature engineering loops that benefit from local Parquet scans rather than round‑trips to a remote warehouse.
Limits and trade‑offs (read this before you rewrite everything)
Concurrency and isolation:
- DuckDB is embedded; it supports multiple connections but is not a multi‑writer network database. Use one process per tenant or workload class, or a pool of processes.
- For highly concurrent workloads, consider sharding by tenant or precomputing heavy joins offline.
Transactions and catalogs:
- You don’t get ACID over object storage by default. Treat Parquet datasets as immutable snapshots. If you need table evolution, consider reading Iceberg/Delta Lake metadata (DuckDB provides read‑only Iceberg/Delta extensions). Writes with full ACID semantics are better kept in a warehouse or in an ACID table format service.
Streaming ingestion:
- DuckDB is not a streaming engine. It can consume micro‑batches quickly, but you’ll need a separate ingestion path (e.g., Kafka → Flink/Spark → Parquet) or periodic dumps.
Governance and security:
- Row‑level security, masking, auditing, and lineage are application concerns now. Build them into your SQL layer and transport (Flight) with strong authN/Z, not as an afterthought.
Browser constraints:
- Memory ceilings, CORS, SharedArrayBuffer requirements, and lack of direct S3 auth in the browser force design choices. Use signed URLs or proxy via a Flight endpoint when needed.
Object storage semantics:
- Major clouds (e.g., Amazon S3) provide strong read‑after‑write and listing consistency, but S3‑compatible stores may differ. Design for immutability and manifest files to avoid surprises.
Tooling maturity:
- Arrow Flight is stable and widely implemented, but it’s still less common than REST/JSON in some stacks. Expect to invest in SDK distribution and observability (tracing, metrics) for Flight services.
Engineering guidance: data layout and APIs
Parquet layout
- Partition by low‑cardinality, high‑selectivity columns used for filtering (e.g., date, tenant, region).
- Sort within partitions by a secondary predicate key to improve row group pruning (see the sketch after this list).
- Choose column encodings consciously: dictionary encoding helps strings, but verify that dictionaries do not overflow and silently fall back to plain pages for high-cardinality columns.
- Record schemas in a central place; validate on write to avoid schema drift.
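The layout guidance above maps to a single DuckDB COPY; a sketch assuming hypothetical raw and curated buckets with date and tenant_id columns:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3 credentials assumed to be configured

# Partition by the filter columns, sort within partitions by a secondary predicate key,
# and compress with ZSTD so row group statistics prune effectively at read time.
con.execute("""
    COPY (
        SELECT *
        FROM read_parquet('s3://my-bucket/raw/events/*.parquet')
        ORDER BY device_type, duration_ms
    ) TO 's3://my-bucket/curated/events'
      (FORMAT PARQUET, PARTITION_BY (date, tenant_id), COMPRESSION ZSTD)
""")
```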
Manifest files
- Maintain a small manifest (CSV/JSON/Parquet) that lists object paths for a logical dataset version. Query manifests instead of globbing the bucket to control exactly which files belong to a snapshot.
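A sketch of manifest-driven reads, assuming a simple CSV manifest with a path column per dataset version; the file list, not a bucket glob, defines the snapshot:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# manifest_v42.csv lists every object in snapshot v42 (columns: path, row_count, ...)
files = [row[0] for row in con.execute(
    "SELECT path FROM read_csv_auto('s3://my-bucket/manifests/events/manifest_v42.csv')"
).fetchall()]

# read_parquet accepts an explicit list of files; the paths come from our own manifest,
# so building the literal list here is acceptable.
file_list = ", ".join(f"'{p}'" for p in files)
result = con.execute(
    f"SELECT device_type, COUNT(*) AS n FROM read_parquet([{file_list}]) GROUP BY 1"
).fetch_arrow_table()
```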
Arrow Flight API design
- Define stable schemas; include metadata for units and null semantics.
- Use TLS and token‑based auth (e.g., mTLS or JWT). Keep credentials out of browser clients; proxy if necessary.
- Paginate naturally via batch sizes; backpressure is built‑in. Avoid row limits that force the server to materialize all results before streaming.
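To keep the server from materializing results, do_get can wrap a DuckDB record-batch reader instead of a fully fetched table; a sketch of the change relative to the earlier EventsFlightServer:

```python
import duckdb
import pyarrow.flight as fl


class StreamingGateway(fl.FlightServerBase):
    def __init__(self, location):
        super().__init__(location)
        self.con = duckdb.connect()

    def do_get(self, context, ticket):
        sql = ticket.ticket.decode('utf-8')
        # fetch_record_batch returns a pyarrow.RecordBatchReader: batches are produced
        # as the query executes, so nothing is collected up front before streaming.
        reader = self.con.execute(sql).fetch_record_batch(64_000)
        return fl.RecordBatchStream(reader)
```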
Caching
- Layer: CDN (global), service local disk (HTTPFS), in‑memory per process (LRU of recent Arrow tables or row groups).
- Invalidate on manifest version change rather than on timestamps.
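A sketch of the per-process layer: an LRU of Arrow results keyed by manifest version plus SQL, so publishing a new snapshot version invalidates old entries without TTL bookkeeping:

```python
from functools import lru_cache

import duckdb
import pyarrow as pa

con = duckdb.connect()


@lru_cache(maxsize=128)
def cached_query(manifest_version: str, sql: str) -> pa.Table:
    # The manifest version is part of the cache key; bumping it on publish
    # makes stale entries unreachable and the LRU evicts them over time.
    return con.execute(sql).fetch_arrow_table()


table = cached_query("v42", "SELECT 42 AS answer")
```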
Version pinning and reproducibility
- Pin DuckDB engine versions in services and WASM bundles. Automate backfill testing when bumping versions.
Migration playbook
1. Define success metrics
- Target P95 latency for the top N dashboards. Cost ceiling per month for analytics compute. Developer productivity metrics (lines of glue code removed, services simplified).
2. Inventory candidates
- Classify dashboards/queries by data size, selectivity, and refresh cadence. Flag those that read curated Parquet or could be exported from the warehouse daily/hourly. Exclude cross‑team ad‑hoc or compliance queries initially.
3. Export curated datasets to Parquet
- From the warehouse, create Airflow/dbt jobs that write partitioned Parquet snapshots to S3. Consider Iceberg/Delta if you need schema evolution and time travel, but you can start with plain Parquet + manifest.
- Normalize data types (timestamps, decimals) to avoid surprises across engines.
4. Prototype with DuckDB locally
- Replicate 3–5 key dashboards as SQL scripts over Parquet. Measure with .timer. Validate results against warehouse queries. Iterate on partitioning to meet latency targets.
5. Introduce Arrow Flight
- Build a small Flight service that accepts a restricted SQL dialect or a parameterized query list. Replace JSON REST endpoints in the frontend with Arrow Flight clients where feasible. Measure network and CPU savings.
6. Bring analytics into the app servers
- Embed DuckDB into the relevant microservices. Cache hot datasets and results. Establish a per‑tenant concurrency model. Add resource guards (thread caps, memory limits) and circuit breakers.
7. Browser/offline pilots with duckdb-wasm
- Enable a subset of dashboards to run client‑side when data is small or pre‑aggregated. Use signed URLs or a Flight proxy for access. Gate behind a feature flag and collect performance/UX feedback.
8. Observability and correctness
- Add golden query tests: same SQL (or semantics) run in warehouse and DuckDB, diff results on CI for every build. Instrument Flight servers with metrics (throughput, batches/s, backpressure). Trace queries end‑to‑end.
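A golden-query test can be as small as this sketch; it assumes the warehouse result for the same logical query is exported to a golden/ Parquet file by CI:

```python
import duckdb


def test_events_by_device_matches_warehouse():
    con = duckdb.connect()
    actual = con.execute(
        "SELECT device_type, COUNT(*) AS n "
        "FROM read_parquet('exports/events_2025-09-01.parquet') "
        "GROUP BY 1 ORDER BY 1"
    ).fetchall()
    # golden/*.parquet is produced by running the same query in the warehouse
    expected = con.execute(
        "SELECT device_type, n "
        "FROM read_parquet('golden/events_by_device_2025-09-01.parquet') ORDER BY 1"
    ).fetchall()
    assert actual == expected
```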
9. Security and governance
- Implement row‑level filtering in SQL or via manifest scoping (one manifest per tenant). Enforce authZ at the Flight gateway. Log access events. Redact PII at the data layer.
10. Rollout and cost review
- Migrate dashboards in batches. Track warehouse compute savings and object storage egress. Tune cache policies. Decide where Spark/Trino remains essential (heavy ETL, multi‑hour batch jobs, cross‑team datasets).
End‑to‑end example: a thin analytics service
A minimal Python service that exposes a parameterized query over Flight, backed by DuckDB scanning S3 Parquet.
```python
# server.py
import os

import duckdb
import pyarrow.flight as fl

SQL_TEMPLATES = {
    'events_by_device': (
        """
        SELECT device_type, COUNT(*) AS n, AVG(duration_ms) AS avg_ms
        FROM read_parquet($path)
        WHERE date BETWEEN $start AND $end
          AND tenant_id = $tenant
        GROUP BY 1
        ORDER BY n DESC
        """,
        ['path', 'start', 'end', 'tenant'],
    ),
}


class Gateway(fl.FlightServerBase):
    def __init__(self, location):
        super().__init__(location)
        self.con = duckdb.connect()
        self.con.execute('INSTALL httpfs; LOAD httpfs;')
        self.con.execute(f"SET s3_region='{os.getenv('AWS_REGION', 'us-east-1')}';")

    def do_get(self, context, ticket):
        # Ticket format: "<template name>:<key=value pairs joined by '&'>"
        cmd = ticket.ticket.decode('utf-8')
        name, args_str = cmd.split(':', 1)
        sql_tmpl, params = SQL_TEMPLATES[name]
        kv = dict(pair.split('=', 1) for pair in args_str.split('&'))
        # Bind parameters safely using DuckDB named-parameter binding
        q = self.con.execute(sql_tmpl, {p: kv[p] for p in params})
        return fl.RecordBatchStream(q.fetch_arrow_table())


if __name__ == '__main__':
    server = Gateway('grpc://0.0.0.0:8815')
    server.serve()
```
A client consuming via Flight with pyarrow; the resulting Arrow table can be forwarded to the browser as Arrow IPC bytes (which arrow-js reads with tableFromIPC) or as JSON for visualization:
```python
# client.py
import pyarrow as pa
import pyarrow.flight as fl

client = fl.FlightClient('grpc://localhost:8815')

params = {
    'path': 's3://my-bucket/events/date=*/hour=*/*.parquet',
    'start': '2025-09-01',
    'end': '2025-09-07',
    'tenant': 'acme',
}
# The gateway splits on '&' and on the first '=', so values must not contain '&'
args = '&'.join(f'{k}={v}' for k, v in params.items())
ticket = fl.Ticket(f'events_by_device:{args}'.encode('utf-8'))

table = client.do_get(ticket).read_all()
print(table.schema)

# Serialize to Arrow IPC for a browser client
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
ipc_bytes = sink.getvalue().to_pybytes()
```
This pattern keeps the server stateless, lets you scale horizontally by running more instances, and avoids materializing massive JSON payloads.
Cost model and ops
- Compute: Reuse your existing app compute. DuckDB is a library; you pay in CPU time per query. Set thread caps and quotas to avoid noisy neighbor effects.
- Storage: Parquet in object storage is cheap. The cost driver is egress and request count for S3. Avoid tiny files; enable transfer acceleration/CDN if global.
- Networking: Arrow Flight reduces bytes transferred vs JSON. For remote clients, compress batches at the transport layer.
- Caching: Local disk caches reduce egress. Track hit ratios and align cache TTLs with data refresh cadence.
- Observability: Treat queries as first‑class operations. Emit metrics: scans bytes, row groups read, batches produced, predicate selectivity.
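A sketch of treating queries as first-class operations: wrap execution, time it, and emit row and batch counts; emit_metric is a stand-in for whatever metrics client you already run:

```python
import time

import duckdb


def emit_metric(name: str, value: float, tags: dict) -> None:
    # Stand-in for a StatsD/Prometheus/OpenTelemetry client
    print(name, value, tags)


def run_instrumented(con: duckdb.DuckDBPyConnection, sql: str, dashboard: str):
    start = time.perf_counter()
    table = con.execute(sql).fetch_arrow_table()
    elapsed = time.perf_counter() - start
    emit_metric("analytics.query.seconds", elapsed, {"dashboard": dashboard})
    emit_metric("analytics.query.rows", table.num_rows, {"dashboard": dashboard})
    emit_metric("analytics.query.batches", len(table.to_batches()), {"dashboard": dashboard})
    return table
```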
Common questions
- Do I still need Spark/Trino? Yes—for heavy ETL, large joins across teams, and long‑running transformations. But you can remove them from the hot path of interactive BI if your working sets fit a single node’s bandwidth and memory.
- What about ACID tables like Iceberg/Delta? DuckDB can read them via extensions (read‑only). Use them for governance and evolution while serving read workloads in‑process.
- Will this scale? Horizontally, yes—by sharding workloads across app instances or tenants, with object storage as the shared data plane. Not as a single coordinated node for thousands of concurrent ad‑hoc users.
- Is browser analytics secure? Data loaded into the browser is exposed to that client. Only push what the user is authorized to see; prefer aggregates for sensitive data.
The bottom line
If your users wait seconds for charts that scan a few gigabytes, you’re paying the orchestration tax. In‑process analytics—with DuckDB for execution, Arrow/Flight for transport, and WebAssembly for client‑side compute—lets you ship fast, constrained, private analytics with minimal infrastructure. Keep the lakehouse for what it’s great at. For everything else, bring the query to the data and the data into your app process.
Invest a week to prototype: export Parquet, wire a Flight endpoint, embed DuckDB, and try duckdb‑wasm in your browser. Measure, compare, and decide with data. In 2025, a lot of teams are discovering they can forget the lake for most interactive workflows—and not miss it.