Your Queue Is Not a Workflow Engine: Durable Execution with Temporal and Step Functions in 2025
There's a common anti-pattern in backend systems that refuses to die: using a queue and a couple of cron jobs as a stand-in for a workflow engine. It works, until it doesn't. As systems accrue business rules, SLAs, and failure modes, the glue code around queues metastasizes: ad-hoc retries, dedup windows, scattered timers, compensations crammed into catch blocks, and war rooms around backfills. By the time you've built a homegrown orchestrator, you're still missing correctness guarantees and observability.
Durable execution platforms (Temporal and AWS Step Functions are the most commonly deployed in 2025) solve the same class of problems but with very different ergonomics. This article is a pragmatic teardown for engineers who own production job systems today: why a queue is not a workflow engine, what durable execution buys you, when to adopt Temporal or Step Functions, how to migrate incrementally, how to estimate costs, and what gotchas to expect.
TL;DR
- Use queues/cron for simple, short-lived, fire-and-forget tasks where at-most one retry and coarse SLAs are acceptable.
- Use a workflow engine for multi-step business processes, long-running timers, correctness-critical retries, human-in-the-loop tasks, and cross-service sagas.
- Temporal offers language-native workflows and deterministic replay with strong developer ergonomics and multi-region options. Step Functions offers a managed, IAM-native orchestrator with rich AWS integrations and predictable ops on AWS.
- Migration is often incremental: keep your existing workers as activities and move orchestration logic first.
- Costs scale with steps and duration. Be conservative in estimating state transitions (Step Functions) or workflow actions/history (Temporal). Avoid leaking unbounded history.
The problem with "just use a queue"
Queues are excellent at decoupling producers and consumers under load, but they are not a durable orchestrator. Symptoms show up as systems grow:
- At-least-once delivery means consumers must be idempotent. Teams often bolt on shaky idempotency with dedupe windows or Redis locks.
- Failure handling is accidental and non-uniform. Some consumers retry forever; others drop to a DLQ. Some implement backoff; others hammer downstream systems.
- Timeouts and timers are brittle. Visibility timeouts aren't business timeouts. Scheduling "retry in 7 days" via cron or delayed queues fractures reliability and ownership.
- Cross-service sagas are ad hoc. If step 5 fails, how do you safely compensate steps 1-4 with nontrivial semantics?
- Backfills are scary. Replaying past jobs risks double-charging, duplicate shipments, or cascading replays because idempotency is unclear or untestable.
- Versioning is guesswork. What happens to a job mid-flight when you deploy new code or change message schema?
- Multi-region is hand-rolled. You mix global dedupe, leader election, and failover logic with business code.
Each of these problems has an answer with queues, but the integration burden is on you. Durable execution flips that: the platform supplies those guarantees and you focus on business logic.
Durable execution in one paragraph
Durable execution records a workflow's progress as an append-only event history and replays that history against deterministic workflow code to recover state after failures, deployments, or restarts. Timers, retries, and signals become events, not in-memory promises. Execution can last from seconds to months without leaking goroutines, Lambdas, or Kubernetes pods. This is the core idea behind Temporal (and its predecessor Cadence). Step Functions Standard workflows rest on the same foundation of a persisted execution history, although the service interprets a declarative state machine rather than replaying your code.
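To make the replay idea concrete, here is a minimal toy sketch. It is not how Temporal or Step Functions are implemented; `ToyRuntime`, `orderFlow`, and `demoReplay` are hypothetical names, and steps are synchronous to keep the example short. Each step's result is appended to a history, and a restarted run reads recorded results instead of repeating the side effect.

```ts
// Toy model of durable execution. The history array stands in for durable storage.
type HistoryEvent = { step: string; result: string };

class ToyRuntime {
  private history: HistoryEvent[];
  private cursor = 0;
  public executed: string[] = []; // steps whose side effects ran in THIS process

  constructor(history: HistoryEvent[]) {
    this.history = history;
  }

  // Run a step once. On replay, return the recorded result instead of re-running it.
  step(name: string, fn: () => string): string {
    const recorded = this.history[this.cursor];
    if (recorded !== undefined) {
      if (recorded.step !== name) {
        throw new Error(`non-deterministic replay: expected ${recorded.step}, got ${name}`);
      }
      this.cursor++;
      return recorded.result;
    }
    const result = fn();
    this.history.push({ step: name, result }); // append-only
    this.cursor++;
    this.executed.push(name);
    return result;
  }
}

function orderFlow(rt: ToyRuntime, crashAfterCharge: boolean): string {
  const paymentId = rt.step('charge', () => 'pay-1');
  if (crashAfterCharge) throw new Error('simulated worker crash');
  return rt.step('ship', () => `ship-for-${paymentId}`);
}

function demoReplay(): { result: string; rerun: string[]; historyLength: number } {
  const history: HistoryEvent[] = [];
  try {
    orderFlow(new ToyRuntime(history), true); // first worker dies after charging
  } catch {
    // crash is expected in this demo
  }
  const recovered = new ToyRuntime(history); // new worker, same history
  const result = orderFlow(recovered, false);
  return { result, rerun: recovered.executed, historyLength: history.length };
}
```

In the recovered run only `ship` executes; `charge` is answered from history. The name check in `step` is also why workflow code has to be deterministic: a replay that asks for a different step than the one recorded cannot be reconciled.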
Temporal vs. Step Functions in 2025: a snapshot
Both are production-proven, but they represent different bets.
- Temporal
  - Language-native workflows (TypeScript, Go, Java, Python, PHP). Workflows are code with determinism constraints. Activities are normal code.
  - Strong local dev experience: unit test workflows, replay event histories, run a dev server.
  - Multi-cloud/self-hosted and managed (Temporal Cloud). Global namespaces for multi-region failover in certain tiers.
  - Very flexible dynamic control flow, durable timers, signals/queries, and worker versioning.
- AWS Step Functions
  - Managed orchestration as a state machine defined in Amazon States Language (ASL). Integrates deeply with AWS (Lambda, ECS/Fargate, EventBridge, DynamoDB, SQS, SageMaker, Bedrock, etc.).
  - Standard vs. Express workflows: Standard emphasizes long-running, durable history; Express emphasizes high throughput, short-duration orchestrations with different pricing and limits.
  - No custom runtime inside the workflow; business logic typically lives in Lambda/ECS. Versions and aliases enable controlled rollouts of state machine definitions.
  - IAM-native security and CloudWatch/CloudTrail/CloudWatch Logs for observability.
If you're fully on AWS, Step Functions minimizes ops for orchestration. If you want language-native modeling, multi-cloud portability, or very complex/dynamic control flow, Temporal is a strong fit.
Pragmatic teardown: the hard parts of background jobs
This section walks through the usual trouble spots, how teams patch them with queues, and what durable execution gives you. Code examples included.
1) Retries and backoff
- With queues
  - You build a retry policy per consumer or per message. Visibility timeouts vs. your backoff logic often conflict.
  - DLQs capture failure but lose context. Knowing why a job failed across attempts is cumbersome.
- With Temporal
  - Retries are a first-class attribute of activities. Policies apply per call with exponential backoff and maximum attempts.
  - Failure metadata survives and is queryable from the workflow. You can branch logic by failure type.
Temporal (TypeScript) activity with retry policy:
```ts
// workflow/src/workflows/order.ts
import * as wf from '@temporalio/workflow';
import type * as activities from '../activities';

const { chargeCard } = wf.proxyActivities<typeof activities>({
  startToCloseTimeout: '2 minutes',
  retry: {
    initialInterval: '1s',
    backoffCoefficient: 2,
    maximumAttempts: 6,
    nonRetryableErrorTypes: ['CardDeclinedError'],
  },
});

export async function paymentOnly(orderId: string, totalCents: number): Promise<string> {
  // Durable retry semantics; retries do not lose workflow state
  return await chargeCard(orderId, totalCents);
}
```
- With Step Functions
  - `Retry` and `Catch` are first-class in ASL, including backoff, jitter, and error matching.
Step Functions ASL snippet with retry and catch. Retriers are evaluated in order and `States.ALL` must be the last entry, so the non-retryable error is listed first:
```json
{
  "StartAt": "ChargeCard",
  "States": {
    "ChargeCard": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {"FunctionName": "charge-card"},
      "Retry": [
        {"ErrorEquals": ["CardDeclinedError"], "MaxAttempts": 0},
        {"ErrorEquals": ["States.ALL"], "IntervalSeconds": 1, "BackoffRate": 2.0, "MaxAttempts": 6}
      ],
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
      "Next": "Success"
    },
    "NotifyFailure": {"Type": "Task", "Resource": "arn:aws:states:::sns:publish", "End": true},
    "Success": {"Type": "Succeed"}
  }
}
```
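The two retry policies above look identical but count attempts differently: Temporal's `maximumAttempts` includes the first attempt, while the ASL `MaxAttempts` field counts retries only. A small sketch of the resulting delay schedules, ignoring jitter; `retryDelaysMs` and `BackoffSettings` are hypothetical helper names, not part of either platform:

```ts
interface BackoffSettings {
  initialIntervalMs: number;
  backoffCoefficient: number;
  retries: number; // number of retries AFTER the first attempt
  maximumIntervalMs?: number; // optional cap on a single delay
}

// Delay before each retry, without jitter.
function retryDelaysMs(s: BackoffSettings): number[] {
  const cap = s.maximumIntervalMs ?? Number.POSITIVE_INFINITY;
  const delays: number[] = [];
  for (let i = 0; i < s.retries; i++) {
    delays.push(Math.min(s.initialIntervalMs * Math.pow(s.backoffCoefficient, i), cap));
  }
  return delays;
}

const totalMs = (delays: number[]): number => delays.reduce((sum, d) => sum + d, 0);

// Temporal: maximumAttempts = 6 counts the first attempt, so 5 retries.
const temporalDelays = retryDelaysMs({ initialIntervalMs: 1000, backoffCoefficient: 2, retries: 6 - 1 });
// Step Functions: MaxAttempts = 6 counts retries only, so 7 attempts in total.
const stepFunctionsDelays = retryDelaysMs({ initialIntervalMs: 1000, backoffCoefficient: 2, retries: 6 });

console.log(temporalDelays, totalMs(temporalDelays));
console.log(stepFunctionsDelays, totalMs(stepFunctionsDelays));
```

Summing the schedule gives the worst-case time spent waiting before the failure surfaces, which is worth comparing against the business SLA for the step.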
2) Idempotency and exactly-once illusions
- With queues
  - Delivery is at-least-once; duplicates happen. Dedupe windows (e.g., the five-minute deduplication interval on SQS FIFO queues) help but do not guarantee semantic idempotency.
  - You must carry business idempotency keys through every step and ensure downstream effectors use conditional writes.
- With Temporal
  - Each activity invocation has a stable identity inside the workflow event history; if a retry occurs, it's a retry of the same logical attempt, not a new business event.
  - You still need idempotency at side-effect boundaries, but durable workflow semantics dramatically reduce accidental duplication.
Temporal activity example using idempotency keys:
```ts
// activities.ts (runs in a normal Node/ECS/VM worker process)
// `paymentsApi` stands in for your payment provider's client; the module path is hypothetical.
import { paymentsApi } from './payments-client';

export async function chargeCard(orderId: string, totalCents: number): Promise<string> {
  // Use business idempotency at the edge (e.g., a Stripe idempotency key)
  const idemKey = `order:${orderId}:charge`;
  const res = await paymentsApi.charge({ amount: totalCents, idempotencyKey: idemKey });
  return res.paymentId;
}
```
- With Step Functions
  - Each Task is driven by an idempotent Lambda/ECS handler. Use deterministic idempotency keys (e.g., input hash) per real-world effect.
  - Duplicate risk differs by workflow type: Standard workflows have exactly-once execution semantics, while asynchronous Express workflows are at-least-once, so an Express execution can run more than once. Prefer Step Functions' built-in `Retry` over external retry loops to keep all attempt metadata in execution history.
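What "idempotent at the side-effect boundary" means in practice can be sketched with an in-memory stand-in for the downstream system. `ToyPaymentProvider` and `chargeOrder` are hypothetical names; a real effector would enforce the same rule with an atomic conditional write or a provider-side idempotency key rather than a `Map`.

```ts
type ChargeRecord = { paymentId: string; amountCents: number };

class ToyPaymentProvider {
  private byKey = new Map<string, ChargeRecord>(); // stands in for a table with a unique key
  public chargesCreated = 0;

  charge(idempotencyKey: string, amountCents: number): ChargeRecord {
    const existing = this.byKey.get(idempotencyKey);
    if (existing !== undefined) {
      // Same key, different request: refuse rather than silently return a stale result.
      if (existing.amountCents !== amountCents) {
        throw new Error(`idempotency key ${idempotencyKey} reused with a different amount`);
      }
      return existing; // duplicate delivery: return the first outcome, create nothing new
    }
    this.chargesCreated++;
    const record = { paymentId: `pay-${this.chargesCreated}`, amountCents };
    this.byKey.set(idempotencyKey, record);
    return record;
  }
}

// The key is derived from business identity, not from the delivery attempt.
function chargeOrder(provider: ToyPaymentProvider, orderId: string, totalCents: number): string {
  return provider.charge(`order:${orderId}:charge`, totalCents).paymentId;
}

const provider = new ToyPaymentProvider();
const firstAttempt = chargeOrder(provider, 'o-42', 1999);
const redelivery = chargeOrder(provider, 'o-42', 1999); // retry or duplicate message
```

Both calls return the same payment ID and only one charge is created, which is the property a retrying orchestrator relies on.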
3) Timeouts and durable timers
- With queues
  - Visibility timeouts protect the queue, not your business SLA. Scheduling delayed retries via cron or delayed queues breaks locality of logic and complicates observability.
- With Temporal
  - `sleep` is durable. You can wait minutes or months. The system tracks pending timers; no pods remain alive.
Temporal durable timer:
```ts
import * as wf from '@temporalio/workflow';

export async function kycFollowup(userId: string) {
  // Send initial email (activity omitted)
  await wf.sleep('7 days'); // durable; survives worker restarts and deployments
  // Send reminder if still incomplete (query or activity)
}
```
- With Step Functions
  - The `Wait` state handles sleep. It's recorded in execution history and survives failures. For human-in-the-loop, use callback tasks with tokens.
Wait state snippet:
```json
{
  "Type": "Wait",
  "Seconds": 604800,
  "Next": "SendReminder"
}
```
Callback with task token (human approval pattern):
```json
{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "Parameters": {
    "FunctionName": "request-approval",
    "Payload": {"taskToken.$": "$$.Task.Token", "request.$": "$"}
  },
  "TimeoutSeconds": 259200,
  "Next": "Proceed"
}
```
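The callback pattern boils down to three rules: a token identifies one paused step, it can be completed once, and it stops being valid after the timeout. The toy model below illustrates those rules only; `ToyTokenRegistry` is a hypothetical name, the service's real tokens are opaque, and time is passed in explicitly so the behavior is easy to follow.

```ts
type CallbackOutcome = 'resumed' | 'timed_out' | 'unknown_token';

class ToyTokenRegistry {
  private deadlines = new Map<string, number>(); // token -> deadline in epoch ms
  private counter = 0;

  // Called when the workflow pauses; mirrors TimeoutSeconds on the task.
  issue(nowMs: number, timeoutSeconds: number): string {
    this.counter++;
    const token = `token-${this.counter}`; // real tokens must be unguessable
    this.deadlines.set(token, nowMs + timeoutSeconds * 1000);
    return token;
  }

  // Called by the approver's callback endpoint.
  complete(token: string, nowMs: number): CallbackOutcome {
    const deadline = this.deadlines.get(token);
    if (deadline === undefined) return 'unknown_token';
    this.deadlines.delete(token); // single use: a replayed callback is rejected
    return nowMs <= deadline ? 'resumed' : 'timed_out';
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const registry = new ToyTokenRegistry();
const approvalToken = registry.issue(0, 259200); // three days, as in the snippet above
const onTime = registry.complete(approvalToken, 1 * DAY_MS);
const replayed = registry.complete(approvalToken, 1 * DAY_MS + 1);
const lateToken = registry.issue(0, 259200);
const tooLate = registry.complete(lateToken, 4 * DAY_MS);
```

The callback endpoint that receives the token is the part you own, so it needs authentication and should treat a rejected token as a normal outcome rather than an error to retry.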
4) Sagas and compensations
- With queues
  - You can implement compensations, but failure ordering is tricky. Partial failures during compensation are easy to mishandle, and the logic sprawls across services.
- With Temporal
  - Workflows coordinate activities; activities are allowed to be side-effecting. The saga pattern is natural: register compensations as you succeed, then run them in reverse on failure.
Temporal TypeScript saga:
```ts
import * as wf from '@temporalio/workflow';
import type * as activities from '../activities';

const acts = wf.proxyActivities<typeof activities>({ startToCloseTimeout: '5m' });

export async function orderWorkflow(input: {
  orderId: string;
  items: Array<{ sku: string; qty: number }>;
  totalCents: number;
}) {
  // The TypeScript SDK has no built-in Saga helper (the Java SDK does), so keep
  // compensations in a plain array; unshift puts the newest first (reverse order).
  const compensations: Array<() => Promise<void>> = [];
  try {
    const paymentId = await acts.chargeCard(input.orderId, input.totalCents);
    compensations.unshift(async () => { await acts.refund(paymentId); });

    const reservationId = await acts.reserveInventory(input.orderId, input.items);
    compensations.unshift(async () => { await acts.releaseInventory(reservationId); });

    await acts.createShipment(input.orderId, reservationId);
    return { status: 'shipped', paymentId, reservationId } as const;
  } catch (err) {
    // Each compensation is an activity call, so it is recorded and retried durably.
    for (const compensate of compensations) {
      await compensate();
    }
    throw err;
  }
}
```
- With Step Functions
  - Model compensations with `Catch` transitions to cleanup branches. Keep compensation logic explicit.
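ASL compensation pattern (each failure routes to the compensations for the steps that already succeeded):
```json
{
  "StartAt": "Charge",
  "States": {
    "Charge": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {"FunctionName": "charge"},
      "Next": "Reserve",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Fail"}]
    },
    "Reserve": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {"FunctionName": "reserveInventory"},
      "Next": "Ship",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Refund"}]
    },
    "Ship": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {"FunctionName": "createShipment"},
      "End": true,
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory"}]
    },
    "ReleaseInventory": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {"FunctionName": "releaseInventory"},
      "Next": "Refund"
    },
    "Refund": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {"FunctionName": "refund"},
      "Next": "Fail"
    },
    "Fail": {"Type": "Fail"}
  }
}
```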
5) Backfills, replays, and partial reprocessing
- With queues
  - Backfill often means re-enqueuing millions of messages and hoping dedupe logic holds. You lack visibility and fine-grained control.
- With Temporal
  - Use Schedules to run or backfill workflows on historical ranges. You can query and cancel in bulk. Use `ContinueAsNew` to segment long histories.
Temporal backfill sketch:
```ts
// Create a Schedule that runs daily, then backfill the last 30 days
import { Client } from '@temporalio/client';

async function main() {
  const client = new Client(); // connects to localhost:7233 by default
  const handle = await client.schedule.create({
    scheduleId: 'daily-statement',
    spec: { cronExpressions: ['0 3 * * *'] },
    action: { type: 'startWorkflow', workflowType: 'generateStatements', taskQueue: 'statements' },
  });

  // Backfill is a method on the schedule handle and takes a start/end window
  await handle.backfill({
    start: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000),
    end: new Date(),
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```
- With Step Functions
  - Use Distributed Map to fan out over S3/DynamoDB inputs and replay logic safely. Partition input files per day to control scope.
Distributed Map (partial) ASL:
```json
{
  "StartAt": "ListObjects",
  "States": {
    "ListObjects": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:listObjectsV2",
      "Parameters": {"Bucket": "backfill-input", "Prefix": "2025/05/"},
      "Next": "ForEachObject"
    },
    "ForEachObject": {
      "Type": "Map",
      "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
        "StartAt": "ProcessFile",
        "States": {
          "ProcessFile": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "process-file", "Payload.$": "$"},
            "End": true
          }
        }
      },
      "ItemsPath": "$.Contents",
      "End": true
    }
  }
}
```
6) Workflow versioning and safe deploys
- With queues
  - Consumers evolve, but in-flight messages might use an old schema. You end up supporting multiple decoders and conditionals for months.
- With Temporal
  - You can version workflow code with built-ins like `patched` (TypeScript, Python) or `getVersion` (Go, Java), and with worker versioning. Old and new code paths can run concurrently: in-flight workflows keep replaying the branch they started on while new executions take the new one.
Temporal TypeScript versioning:
```ts
import * as wf from '@temporalio/workflow';

export async function flowV2() {
  if (wf.patched('use-new-shipping-logic')) {
    // new path
  } else {
    // old path
  }
}
```
- With Step Functions
  - State Machine Versions and Aliases decouple publish from traffic shifting. New executions start on an alias pointing to a version; existing executions continue on their original definition. You can shift traffic gradually, similar to Lambda aliases.
Deployment sketch (CDK-ish pseudocode):
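```ts
// CDK-ish pseudocode: `addVersion()` and `sfn.Alias` are illustrative names, not confirmed CDK APIs.
// The underlying CloudFormation resources are AWS::StepFunctions::StateMachineVersion
// and AWS::StepFunctions::StateMachineAlias.
const sm = new sfn.StateMachine(this, 'OrderSM', {
  definitionBody: sfn.DefinitionBody.fromChainable(def),
});
const version = sm.addVersion();
const alias = new sfn.Alias(this, 'OrderSMAlias', { base: sm, version });
// Shift callers to use the alias ARN; update the alias to point to new versions when ready.
```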
7) Multi-region and failure domains
- With queues
  - Cross-region orchestration requires duplicating queues, cross-region event buses, global locks/dedupe, and failover playbooks.
- With Temporal
  - Temporal supports global namespaces with multi-cluster replication (Temporal Cloud tier-dependent or self-managed). Workers can run in multiple regions; sticky task queues reduce latency. Failover switches the active cluster while preserving workflow histories.
  - Activities can execute close to data; signals and timers replicate across regions.
- With Step Functions
  - Step Functions are regional. Multi-region means deploying the same state machine in multiple regions, routing traffic via Route 53/CloudFront, and using global idempotency to prevent duplicate side effects on failover.
  - EventBridge can replicate events cross-region. Keep state at the edges idempotent and store processed tokens in a globally replicated store (e.g., DynamoDB global tables).
When should you replace cron/queues with a workflow engine?
Use these rules of thumb:
- Keep queues/cron if:
  - Each job is a single idempotent step with bounded runtime (< a few minutes) and you can accept occasional manual cleanup.
  - Failure handling is uniform and simple (retry up to N, then alert), and the business cost of transient duplication is negligible.
  - No human-in-the-loop, no multi-day timers, and no cross-service transactions.
- Adopt a workflow engine if any of the following hold:
  - A process has 2+ steps across services with compensation paths.
  - You need durable timers > 15 minutes or human checkpoints (approvals, uploads, KYC).
  - Backfills or partial replays must be safe and observable.
  - Correctness matters under redeployments, regional failover, and node restarts.
  - You want centralized visibility of progress, retries, and inputs/outputs per execution.
Pick Temporal when:
- Developers want to express orchestration in code with full control flow, and you need portability or hybrid-cloud.
- You need long-lived, dynamic workflows with high fan-out/fan-in, signals, queries, and custom metrics.
Pick Step Functions when:
- You're mostly AWS and value a fully managed service, deep AWS service integrations, IAM-based authz, and CloudWatch-native ops.
- Your steps primarily run in Lambda/ECS and you prefer declarative state machines.
Migration patterns: from queue glue to durable execution
Avoid a big bang. Migrate orchestration first, then activities.
- Inventory and carve out orchestration
  - Identify a business process currently split across topics/queues and cron.
  - Document the happy path and compensations.
  - Declare the workflow boundaries and external side effects (payments, email, shipping).
- Keep workers, wrap as activities
  - Temporal: treat existing services as activities; add idempotency tokens at the boundary.
  - Step Functions: keep Lambdas/ECS tasks; build a State Machine that orchestrates them.
- Centralize retries/timeouts
  - Remove per-worker custom retry loops. Set retry policies in the workflow engine.
  - Standardize timeouts and backoff semantics per activity.
- Add durability for timers
  - Replace cron/delayed queues with `sleep` (Temporal) or `Wait` (Step Functions).
- Introduce compensations explicitly
  - Build saga-like compensation steps that are tested in isolation.
- Backfill safely
  - Temporal: use Schedules or batch starts, and prefer `ContinueAsNew` for long-running replays.
  - Step Functions: use Distributed Map over partitioned historical input.
- Version gradually
  - Temporal: use `patched`/worker versioning; run both code paths until old workflows drain.
  - Step Functions: publish Versions and shift Aliases.
- Expand blast radius carefully
  - Start with a single process and keep the old system as a fallback. Build dashboards and alerts around the new engine's metrics.
Cost modeling: queues vs. durable execution
Costs vary widely by traffic pattern. The goal is not exact numbers, but to get the orders of magnitude right and avoid surprises.
- Queues + Lambda/ECS
  - You pay for requests, data transfer, and compute time. Queues are cheap per request. You often pay with developer time for correctness.
- Step Functions
  - Standard Workflows are priced per state transition. Express Workflows are priced by number of requests and by duration and memory used, not per transition. Both also incur the costs of the tasks they call (Lambda, ECS, SDK integrations).
  - Estimation approach: count the average number of states per execution (include retries) and multiply by executions per day. Factor in `Map`/`Parallel` fan-out. Check the official pricing page for your region and the precise model (Standard vs. Express) because rates vary and AWS evolves offerings.
- Temporal
  - Temporal Cloud typically charges based on workflow actions/events (workflow tasks, activity tasks, timers, signals) and storage/retention of history, plus egress. Self-hosting adds compute, storage, and ops cost.
  - Estimation approach: count activities per workflow, expected retries, timers, and signals. Model peak concurrent workflows and history retention. Avoid unbounded histories by using `ContinueAsNew`.
Rules of thumb:
- If your workflow runs dozens of steps with heavy fan-out per execution, Express (Step Functions) or Temporal is often cheaper than wiring many Lambdas with glue, especially when you factor op-ex and developer productivity.
- For long-running workflows (days to months) with sparse activity, Standard (Step Functions) or Temporal shines; you don't pay for idle compute.
- For very simple single-step async tasks, queues + Lambda remain cost effective.
Always prototype a single representative workflow and run a week-long shadow traffic test. Record actual state transitions/actions, duration distributions, and retries. Feed this into the vendors' pricing calculators.
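The estimation approach for per-transition pricing can be written down as a small calculation. This is a sketch under stated assumptions: `WorkflowProfile` and its fields are hypothetical, retries are modeled as a flat average rate, and the price passed in below is a placeholder, not a quoted rate. Take the real number from the pricing page for your region.

```ts
interface WorkflowProfile {
  executionsPerDay: number;
  statesPerExecution: number; // states on the main path
  fanOutItems: number; // average Map/Parallel items per execution
  statesPerItem: number; // states executed for each fanned-out item
  retryRate: number; // average extra attempts per state, e.g. 0.25
}

function transitionsPerExecution(p: WorkflowProfile): number {
  const baseStates = p.statesPerExecution + p.fanOutItems * p.statesPerItem;
  return baseStates * (1 + p.retryRate); // retries are billed as transitions too
}

function monthlyTransitions(p: WorkflowProfile, daysPerMonth: number = 30): number {
  return transitionsPerExecution(p) * p.executionsPerDay * daysPerMonth;
}

function monthlyCost(p: WorkflowProfile, pricePerTransition: number): number {
  return monthlyTransitions(p) * pricePerTransition;
}

// PLACEHOLDER rate for illustration only; look up the current price before relying on it.
const PLACEHOLDER_PRICE_PER_TRANSITION = 0.000025;

const orderPipeline: WorkflowProfile = {
  executionsPerDay: 100000,
  statesPerExecution: 8,
  fanOutItems: 10,
  statesPerItem: 2,
  retryRate: 0.25,
};

console.log(transitionsPerExecution(orderPipeline));
console.log(monthlyTransitions(orderPipeline));
console.log(monthlyCost(orderPipeline, PLACEHOLDER_PRICE_PER_TRANSITION));
```

The same shape works for an action-based model: replace states with activities, timers, and signals, and feed in measured values from the shadow test rather than guesses.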
Operational gotchas and how to avoid them
Temporal gotchas
- Non-determinism
  - Workflow code must be deterministic. Do not read system time or random values directly; use the SDK's deterministic equivalents (for example `workflow.Now` in Go, or `workflow.now()` and `workflow.random()` in Python; the TypeScript SDK's workflow sandbox makes `Date` and `Math.random()` deterministic for you) or pass values in via activities/signals. External calls must happen in activities, not workflows.
- History growth
  - Long-lived workflows that loop many times can accumulate large histories, increasing latency and cost. Use `ContinueAsNew` to truncate history at logical checkpoints.
- Activity heartbeats and timeouts
  - Long-running activities should heartbeat to detect worker death and allow server-side cancellation. Set realistic `scheduleToClose` vs. `startToClose` timeouts.
- Worker versioning
  - Don't roll out workflow-breaking changes without `patched`/`getVersion`. Plan code changes in phases and drain old executions.
- Observability
  - Use search attributes for business keys (orderId, customerId) to index workflows. Emit application metrics from activities. Store external call IDs for traceability.
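The history-growth point is easier to reason about with numbers. The toy model below does not call any Temporal API; `processWithCheckpoints` is a hypothetical function that counts how many runs a looping workflow needs if it starts a fresh history (the effect of `ContinueAsNew`) whenever the next item would push it past an event budget.

```ts
interface CheckpointStats {
  runs: number; // number of separate histories used
  largestHistory: number; // most events held by any single run
}

function processWithCheckpoints(
  totalItems: number,
  eventsPerItem: number,
  maxEventsPerRun: number,
): CheckpointStats {
  let cursor = 0; // carried across runs, like the arguments to a continued workflow
  let runs = 0;
  let largestHistory = 0;
  while (cursor < totalItems) {
    runs++;
    let historyEvents = 0; // each run starts with an empty history
    while (cursor < totalItems && historyEvents + eventsPerItem <= maxEventsPerRun) {
      historyEvents += eventsPerItem;
      cursor++;
    }
    if (historyEvents === 0) {
      throw new Error('maxEventsPerRun is smaller than the cost of one item');
    }
    largestHistory = Math.max(largestHistory, historyEvents);
  }
  return { runs, largestHistory };
}

// 2,500 items at 2 events each with a budget of 1,000 events per run.
const stats = processWithCheckpoints(2500, 2, 1000);
console.log(stats);
```

Without the checkpoint the same loop would hold 5,000 events in one history; with it, no run exceeds the budget. The only state that has to survive is the cursor, which is why designing small carry-over state early makes this cheap to adopt.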
Step Functions gotchas
- Limits and quotas
  - Watch concurrency, state transition rate limits, payload size limits, and Map/Distributed Map quotas. These vary by account/region and change over time. Request increases ahead of large launches.
- Express vs. Standard semantics
  - Express executions are capped at five minutes and send their history to CloudWatch Logs rather than keeping a queryable execution history; Standard executions can run for up to a year with persisted history. Choose based on workflow duration/throughput and the need for persisted detailed history.
- State input/output size and logging
  - Passing large payloads between states can explode CloudWatch Logs and costs, and state payloads are capped at 256 KB. Trim inputs using `ResultSelector` and `ResultPath`. Store large artifacts in S3 and pass references.
- External side effects
  - Keep Lambda handlers idempotent; rely on deterministic idempotency keys (e.g., execution ID + logical step). Step Functions will retry; ensure your effectors can tolerate it.
- Human callbacks
  - When using task tokens, secure the callback endpoints and set reasonable timeouts. Or use Standard with Wait states gated by EventBridge events to avoid token leakage.
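"Store large artifacts and pass references" is the claim-check pattern. A minimal sketch, assuming an in-memory `ToyBlobStore` in place of S3 and hypothetical helper names throughout; the 256 KB threshold matches the Step Functions payload limit, but the helper takes it as a parameter so you can leave headroom.

```ts
const MAX_INLINE_BYTES = 256 * 1024;

type StatePayload =
  | { kind: 'inline'; body: string }
  | { kind: 'ref'; key: string; sizeBytes: number };

class ToyBlobStore {
  private blobs = new Map<string, string>(); // stands in for an object store
  put(key: string, body: string): void {
    this.blobs.set(key, body);
  }
  get(key: string): string | undefined {
    return this.blobs.get(key);
  }
}

function utf8Length(text: string): number {
  // Each %XX escape produced by encodeURIComponent is one UTF-8 byte.
  return encodeURIComponent(text).replace(/%[0-9A-F]{2}/g, 'x').length;
}

// Producer side: keep small payloads inline, park large ones and pass a reference.
function toStatePayload(
  store: ToyBlobStore,
  key: string,
  body: string,
  maxInlineBytes: number = MAX_INLINE_BYTES,
): StatePayload {
  const sizeBytes = utf8Length(body);
  if (sizeBytes <= maxInlineBytes) return { kind: 'inline', body };
  store.put(key, body);
  return { kind: 'ref', key, sizeBytes };
}

// Consumer side: resolve the reference only in the step that needs the data.
function resolvePayload(store: ToyBlobStore, payload: StatePayload): string {
  if (payload.kind === 'inline') return payload.body;
  const body = store.get(payload.key);
  if (body === undefined) throw new Error(`missing blob for key ${payload.key}`);
  return body;
}
```

Only the reference travels through execution history and logs, which also helps keep sensitive data out of them.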
End-to-end example: migrating an order pipeline
Consider an order pipeline today:
- Producer writes `OrderCreated` to Kafka/SNS.
- An SQS consumer charges the card and posts to an inventory topic, another consumer reserves inventory, a cron kicks a shipping batch every 10 minutes, and compensations are sprinkled in catch blocks. Backfills require running SQL scripts and re-enqueuing messages.
Target with Temporal:
- Keep `chargeCard`, `reserveInventory`, and `createShipment` as existing services, expose them as activities.
- Build `orderWorkflow` that coordinates the saga, including durable timers for delayed shipment windows and email reminders.
- Replace cron with workflow timers.
- Use Schedules to generate nightly statements and backfills with safety.
- Put orderId and paymentId as search attributes; add alerts for retries > threshold and compensation occurrences.
Target with Step Functions:
- Define a state machine with `Charge -> Reserve -> Ship` and compensations on failure.
- Use `Wait` states for business timeouts, `Map` for multi-item parallelization, and callback tasks (`.waitForTaskToken`) for manual approvals.
- Connect upstream via EventBridge rules that start executions with the event payload; filter by event pattern.
- Move backfills into Distributed Map over S3 partitions or DynamoDB queries.
Operationally:
- Point only new orders to the workflow engine initially; keep old path for weeks.
- Build dashboards summarizing execution counts, retries, failure types, compensation rate, average end-to-end latency.
- Conduct a controlled backfill on a small partition and compare downstream side effects with shadow mode.
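Comparing side effects between the old path and the new one is mostly a set difference plus a duplicate check. A minimal sketch, assuming both pipelines can emit a log of intended effects in shadow mode; the `SideEffect` shape and `diffSideEffects` are hypothetical.

```ts
type SideEffect = {
  orderId: string;
  kind: 'charge' | 'reserve' | 'ship';
  amountCents?: number;
};

const effectKey = (e: SideEffect): string => `${e.orderId}:${e.kind}:${e.amountCents ?? '-'}`;

function countByKey(effects: SideEffect[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of effects) {
    const key = effectKey(e);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

interface EffectDiff {
  missing: string[]; // legacy produced it, the candidate did not
  unexpected: string[]; // the candidate produced it, legacy did not
  duplicated: string[]; // the candidate produced it more than once
}

function diffSideEffects(legacy: SideEffect[], candidate: SideEffect[]): EffectDiff {
  const legacyCounts = countByKey(legacy);
  const candidateCounts = countByKey(candidate);
  const missing = Array.from(legacyCounts.keys()).filter((k) => !candidateCounts.has(k)).sort();
  const unexpected = Array.from(candidateCounts.keys()).filter((k) => !legacyCounts.has(k)).sort();
  const duplicated = Array.from(candidateCounts.entries())
    .filter((entry) => entry[1] > 1)
    .map((entry) => entry[0])
    .sort();
  return { missing, unexpected, duplicated };
}
```

An empty diff on a small partition is the signal to widen the backfill; a non-empty `duplicated` list usually points at a missing idempotency key rather than at the workflow engine.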
Security, compliance, and data hygiene
- PII in history
  - Both systems record inputs/outputs in execution history. Minimize sensitive data in history. Store PII in a secure store and pass opaque IDs. Use data retention policies and encryption (KMS/IAM for Step Functions; encryption and retention settings for Temporal/Temporal Cloud).
- Authz
  - Step Functions benefits from IAM integration at state-level service integrations. Temporal requires you to configure authentication/authorization for the service and workers (mTLS, per-namespace ACLs in managed offerings).
- Auditing
  - Durable histories are an audit log. Design queries and export pipelines to satisfy compliance (e.g., export completed workflow metadata to a data lake).
Opinionated guidance
- Do not build your own orchestrator if your process spans more than two steps and you care about correctness. Durable timers and compensations are fundamentally hard to get right.
- Keep business side effects idempotent even with a workflow engine. Durable execution reduces duplication but does not grant exactly-once at the external boundary.
- Tame history growth early with explicit checkpointing. Make `ContinueAsNew` part of your design vocabulary.
- Prefer explicit compensations over "try again until it works." Retries are necessary but not sufficient for correctness.
- For multi-region, lean on platform primitives (Temporal global namespaces, AWS multi-region deployment patterns) rather than rolling your own.
References and further reading
- Temporal documentation: https://docs.temporal.io/
- Temporal TypeScript SDK: https://docs.temporal.io/typescript
- AWS Step Functions Developer Guide: https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
- Step Functions ASL spec: https://states-language.net/spec.html
- AWS Step Functions pricing: https://aws.amazon.com/step-functions/pricing/
- Temporal Cloud pricing and concepts: https://temporal.io/cloud
- The Saga pattern: H. Garcia-Molina and K. Salem, "Sagas," 1987
- Uber Engineering: Introducing Cadence, a distributed, scalable, durable, and highly available orchestration engine
- Martin Kleppmann: Designing Data-Intensive Applications (retries, idempotency, exactly-once illusions)
Closing
Your queue is not a workflow engine, and that's okay. Queues are great at what they do. But as your business processes evolve into multi-step, long-running, correctness-critical flows, durable execution pays for itself in fewer outages, safer backfills, and faster iteration. In 2025, both Temporal and Step Functions are mature and battle-tested. Choose the one that fits your organization's stack and skills, migrate incrementally, and lean into the platform's guarantees instead of rebuilding them in application code.