What This Post Covers
In The Anatomy of Agentic Code Assist, we looked at how agents like OpenHands work: event streams, sandboxed execution, tool use, the CodeAct framework. That post covered the agent itself, what it does and how it’s built. This post covers a different layer: the infrastructure that keeps agents running reliably in production.
When an agent runs for hours, makes hundreds of tool calls, and interacts with flaky LLM APIs, a whole class of infrastructure problems emerges that application-level code cannot solve:
- State loss on process crashes: a worker dies mid-workflow and hours of accumulated context disappear. The agent restarts from scratch, re-executing every LLM call and tool invocation.
- LLM API rate limits and timeouts: 429s, 500s, socket timeouts, multi-minute latencies. A reflexion loop running 10 cycles can consume 50x the tokens of a linear pass if any step fails and forces a restart.
- Debugging non-deterministic behavior: the same prompt produces different outputs, different tool call sequences, different results. Without a complete execution trace, reproducing production bugs is close to impossible.
- Tasks exceeding server timeouts: agent sessions lasting minutes to hours die on deployments, fail during scaling events, and exceed web server timeout limits.
- Ambiguous recovery after parallel fan-out crashes: the agent launches ten parallel tool calls. The process crashes after seven complete. Which results were already obtained? Which need re-execution?
- Losing context during human-in-the-loop waits: the agent pauses for human approval, potentially for hours or days. The server holding that state needs to remain available, or all accumulated context is lost.
- Error cascades across multi-agent systems: a single failure in one agent propagates downstream without corrective mechanisms. Simple retry logic at the tail end is inadequate because the agent may have already deviated significantly from the intended path.
Temporal is an orchestration platform built around durable execution. We’ll walk through its architecture, understand why each design decision exists, and look at how OpenAI’s Codex team uses it in production.
The core idea can be expressed as a state transition: $S_{t+1} = f(S_t, M(S_t, T_t))$. Agent state evolves through deterministic orchestration ($f$) of non-deterministic operations ($M$ = LLM response, $T$ = tool results). Temporal separates these two concerns at the infrastructure level. The deterministic part goes in workflows. The non-deterministic part goes in activities.
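Read as code, the transition looks like this (a toy sketch with stubbed stand-ins for $M$ and $T$; none of these names are Temporal APIs):

```python
def M(S, T):
    # Model call: non-deterministic in reality, so it belongs in an activity
    return {"action": "tool_call", "tool": "search", "given_results": len(T)}

def f(S, model_output):
    # Orchestration: a pure function of state and recorded results -- the workflow
    return S + [model_output]

S, T = [], []          # agent state and tool results
for _ in range(3):
    S = f(S, M(S, T))  # S_{t+1} = f(S_t, M(S_t, T_t))

assert len(S) == 3 and S[0]["tool"] == "search"
```

Because `f` is pure, re-running it with the same recorded outputs of `M` always reproduces the same state, which is exactly the property Temporal exploits.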
Workflows and Activities
The fundamental design decision in Temporal: split all code into two categories based on determinism.
Workflows
A Workflow is the agent’s control flow, the logic that decides which tools to call, in what order, what to do with results, and when to wait for human input. Workflows run as ordinary code in Python, TypeScript, Go, or Java, with one hard constraint: they must be deterministic. Given the same inputs and the same activity results, a workflow must produce the same sequence of commands every time.
A Workflow Execution can run for seconds, hours, or years. It persists through infrastructure failures. The workflow doesn’t know or care about crashes; from its perspective, execution is continuous.
Activities
Activities are where all side effects live: LLM API calls, tool executions, database writes, HTTP requests. Anything that can fail, timeout, or produce different results on re-execution. Temporal records every activity result in a persistent Event History, an append-only log that serves as the authoritative record for the entire workflow’s state.
Why This Split Matters
The determinism requirement is what enables replay-based recovery (which we’ll cover in the next section). Here’s the reasoning: if we know the workflow logic is deterministic, and we have a recorded log of all activity results, we can reconstruct the exact workflow state after a crash. We don’t need developer-written checkpoint code. We don’t need serialization logic. We just replay the deterministic code with the previously recorded results, and we arrive at the same state.
This raises an obvious question: LLMs are non-deterministic, so how does this work? The answer maps directly to how agents already operate. The LLM call goes in an activity – it’s non-deterministic, its result gets recorded. The logic deciding what to call and when goes in the workflow – it’s deterministic. The agent loop says “if the LLM returned a tool call, execute that tool; if it returned a final answer, return it.” That orchestration logic doesn’t change between runs.
A Complete Agent Loop
Here’s what a complete agent workflow looks like in Python:
```python
import asyncio
from dataclasses import dataclass
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@dataclass
class LLMRequest:
    goal: str
    history: list
    available_tools: list


@activity.defn
async def call_llm(request: LLMRequest) -> dict:
    # Non-deterministic: the LLM API call lives here
    response = await llm_client.chat(  # llm_client: your LLM SDK client
        messages=request.history,
        tools=request.available_tools,
    )
    return {"action": response.action, "params": response.params}


@activity.defn
async def execute_tool(tool_name: str, params: dict) -> str:
    # Non-deterministic: tool execution lives here
    return await tool_registry.execute(tool_name, params)


@workflow.defn
class AIAgentWorkflow:
    @workflow.run
    async def run(self, user_goal: str) -> str:
        conversation_history = []
        llm_retry = RetryPolicy(
            initial_interval=timedelta(seconds=1),
            backoff_coefficient=2.0,
            maximum_interval=timedelta(seconds=60),
            maximum_attempts=10,
        )
        # is_goal_achieved, get_available_tools, format_final_result elided
        while not self.is_goal_achieved(conversation_history):
            # Deterministic: this decision logic is the workflow
            next_action = await workflow.execute_activity(
                call_llm,
                LLMRequest(
                    goal=user_goal,
                    history=conversation_history,
                    available_tools=self.get_available_tools(),
                ),
                start_to_close_timeout=timedelta(seconds=120),
                retry_policy=llm_retry,
            )
            if next_action["action"] == "tool_call":
                # Parallel tool execution when multiple tools are requested
                results = await asyncio.gather(*[
                    workflow.execute_activity(
                        execute_tool,
                        args=[tool["name"], tool["params"]],  # multiple args go via args=
                        start_to_close_timeout=timedelta(seconds=30),
                    )
                    for tool in next_action.get("tool_calls", [])
                ])
                conversation_history.extend(results)
            else:
                conversation_history.append(next_action)
        return self.format_final_result(conversation_history)
```
*Figure: Workflow / Activity Split – deterministic orchestration on the left, non-deterministic side effects on the right, Event History in the center.*
Deterministic Replay
Replay is the mechanism that makes Temporal’s fault tolerance work. Let’s walk through it in detail, because understanding replay is the key to understanding why the rest of the architecture looks the way it does.
The Event History
Every workflow execution has an Event History: an append-only log stored in Temporal’s persistence layer. When an activity completes, Temporal records both the request and the result.
What Happens on a Crash
Here’s a concrete scenario. An agent workflow is at step 4 of 7. It has completed three LLM calls and tool executions, and is partway through the fourth:
- The worker process crashes (OOM, deployment, hardware failure)
- The Temporal server detects the failure (heartbeat timeout or task timeout)
- Another worker picks up the workflow from the task queue
- Temporal re-executes the workflow code from the beginning
- When the code reaches activity calls that already completed (steps 1–3), Temporal returns the previously recorded results from the event history instead of re-executing them
- The workflow code deterministically reaches the exact same state it was in before the crash: same local variables, same loop counter, same conversation history
- Forward execution resumes from step 4. Only now does an actual activity get dispatched
Because the workflow code is deterministic, replaying it with the same activity results always produces the same sequence of commands. The entire call stack and state are reconstructed with no developer-written checkpoint code. This is different from simple checkpointing because the developer never has to decide what to checkpoint or when – the replay mechanism reconstructs everything automatically from the event history.
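The mechanism can be sketched in a few lines of plain Python (a toy replayer, not the SDK):

```python
def agent_workflow(execute):
    # Deterministic orchestration: always asks for the same steps in order
    history = []
    for step in ["plan", "search", "summarize"]:
        history.append(execute(step))
    return history

class Replayer:
    def __init__(self, recorded):
        self.recorded = list(recorded)   # event history from before the crash
        self.dispatched = []             # activities actually (re)executed

    def execute(self, activity):
        if self.recorded:
            return self.recorded.pop(0)  # replay: serve the recorded result
        self.dispatched.append(activity) # resume: real execution from here on
        return f"result:{activity}"

# The crash happened after "plan" and "search" completed:
replayer = Replayer(["result:plan", "result:search"])
state = agent_workflow(replayer.execute)

assert state == ["result:plan", "result:search", "result:summarize"]
assert replayer.dispatched == ["summarize"]  # only step 3 re-executes
```

The workflow function runs from the top on every recovery, but only the activities past the recorded history actually execute.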
The Determinism Contract
The determinism requirement imposes hard constraints on workflow code. You cannot use:
- `random()` – use `workflow.random()` instead
- `datetime.now()` – use `workflow.now()` instead
- `time.sleep()` – use `workflow.sleep()` or timers instead
- Direct I/O (network calls, file reads) – these must go in activities
- Threading or subprocess creation – use activities or child workflows
For AI engineers, this constraint is less restrictive than it sounds. LLM calls and tool executions are inherently side effects, so they already belong in activities. The orchestration logic that decides what to call and when – “call the LLM, check if it returned a tool call, execute the tool, loop” – doesn’t use random numbers or system clocks.
Here’s what non-determinism violations look like in practice:
```python
import asyncio
import random
from datetime import datetime

from temporalio import workflow


# WRONG: non-deterministic workflow code
@workflow.defn
class BadAgentWorkflow:
    @workflow.run
    async def run(self, goal: str) -> str:
        if random.random() > 0.5:      # different result on replay
            strategy = "aggressive"
        else:
            strategy = "conservative"
        timestamp = datetime.now()     # different on replay
        await asyncio.sleep(5)         # blocks the event loop
        ...


# CORRECT: deterministic workflow code
@workflow.defn
class GoodAgentWorkflow:
    @workflow.run
    async def run(self, goal: str) -> str:
        if workflow.random().random() > 0.5:  # deterministic across replays
            strategy = "aggressive"
        else:
            strategy = "conservative"
        timestamp = workflow.now()            # deterministic across replays
        await workflow.sleep(5)               # durable timer, survives crashes
        ...
```
Contrast with OpenHands
Both Temporal and OpenHands use event sourcing, but for different purposes. OpenHands records events (CmdRunAction, FileWriteAction, observations) for debuggability and observability. You can replay the event sequence to understand what the agent did. Temporal records events so the workflow can be reconstructed after a crash as if nothing happened. Same architectural pattern, different goals.
Formalization
If History = $[(a_1, r_1), (a_2, r_2), \ldots, (a_k, r_k)]$ records completed activities, then replay returns $r_1 \ldots r_k$ from history and only executes $a_{k+1}$ forward. The workflow’s determinism guarantees that replaying with recorded results produces the same sequence of activity commands, so the state at step $k$ is identical to the state before the crash.
*Figure: Deterministic Replay – how Temporal recovers from a crash by replaying the event history.*
Server Architecture
Temporal runs as four server-side services plus a persistence layer, with user-managed workers running externally.
The Four Services
Frontend Service: a stateless gRPC gateway. All client and worker communication flows through it. Handles rate limiting, routing, and authorization. Horizontally scalable because it holds no state.
History Service: owns workflow state and persists event histories. This is the most important component. Manages state transitions across configurable History Shards, which are the unit of concurrent throughput scaling. Each shard handles a subset of workflows. More shards = more concurrent workflows.
Matching Service: hosts Task Queues and dispatches work to workers. When a workflow needs an activity executed, the Matching Service places it on the appropriate task queue. When a worker polls for work, the Matching Service assigns a task.
Workers: stateless processes that you deploy and manage outside the Temporal server. (The server's own fourth service is an internal worker that runs Temporal's system workflows; the workers discussed here are yours.) Workers long-poll task queues via gRPC, execute workflow or activity code, and report results back. Because workers hold no state, they can be killed, restarted, or scaled horizontally without any coordination. The Temporal server is always the authoritative record.
Task Queues
Task Queues provide a routing layer that becomes important for agent workloads. Workflow tasks and activity tasks flow through separate queues. You can route activities to specialized worker pools (GPU workers for inference, lightweight workers for API calls) by assigning them to different task queues. This lets teams scale heterogeneous agent workloads independently.
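A toy sketch of that routing in plain Python (not Temporal's API): one queue per worker pool, with activities enqueued by pool name.

```python
from queue import Queue

# One task queue per worker pool (illustrative pool names)
task_queues = {"gpu-inference": Queue(), "api-calls": Queue()}

def dispatch(activity_name: str, task_queue: str) -> None:
    # The matching layer places the task on the named queue;
    # only workers polling that queue will pick it up.
    task_queues[task_queue].put(activity_name)

dispatch("run_local_model", "gpu-inference")  # routed to the GPU pool
dispatch("call_llm_api", "api-calls")         # routed to the lightweight pool

assert task_queues["gpu-inference"].get() == "run_local_model"
assert task_queues["api-calls"].get() == "call_llm_api"
```

In the real SDKs this routing is just a parameter on the activity invocation (for example, `task_queue=` on `workflow.execute_activity` in the Python SDK).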
| Component | Responsibility | Failure Impact |
|---|---|---|
| Frontend Service | gRPC gateway, rate limiting, routing | Clients can’t connect (stateless, restart recovers) |
| History Service | Workflow state, event persistence, shard management | Workflow progress pauses until recovery |
| Matching Service | Task queue hosting, work dispatch | Tasks queue but don’t dispatch (no work lost) |
| Workers | Execute workflow/activity code, report results | Pending tasks reassigned to other workers |
| Persistence (DB) | Durable storage for event histories | All services degraded until DB recovers |
*Figure: Temporal Server Architecture – four services, a persistence layer, and stateless workers; failure impacts are summarized in the table above.*
Primitives for Agent Patterns
Beyond workflows and activities, Temporal provides several primitives that map to common agent coordination problems.
Signals
Signals are asynchronous messages sent to a running workflow. The workflow can react at any point in its execution. This is the mechanism for human-in-the-loop: the agent reaches a decision point, calls workflow.wait_condition(), and a signal carrying the human’s approval resumes it.
The workflow can wait hours or days. It consumes no compute while waiting because its state lives in the event history, not in a running process. No worker is tied up, no server is keeping a connection open. The state is persisted in the database and can be reconstructed on demand when the signal arrives.
Queries
Queries let external systems read workflow state without modifying it. This powers dashboards and monitoring: “What step is the agent on? What was the last LLM response? How many tokens has it consumed?” The query handler runs against the in-memory workflow state and returns immediately.
Updates
Updates combine a signal and a query: send a command to the workflow and get a response. This is useful for interactive agent control (“redo step 2 with different parameters”) where you need to both modify the workflow’s behavior and confirm the modification was accepted.
Replit, for example, uses Workflow Updates for human-in-the-loop consent. When their agent wants to perform a destructive action, it pauses and waits for the user to accept or reject via an Update.
ContinueAsNew
Each workflow execution is limited to 51,200 events or 50MB of event history. For agents making hundreds of tool calls, history grows fast; each activity generates roughly 3 events. If activities return large LLM payloads (500KB+), the 50MB limit becomes binding well before the event count limit.
ContinueAsNew addresses this by atomically starting a fresh execution with the same Workflow ID, carrying forward essential state while resetting the history. The old history is archived. For long-running agents, this is how you keep the workflow alive indefinitely.
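Back-of-envelope, using the rough figures above, you can see which limit binds first:

```python
EVENT_LIMIT = 51_200
SIZE_LIMIT = 50 * 1024 * 1024       # 50MB history cap
EVENTS_PER_ACTIVITY = 3             # scheduled / started / completed, roughly

activities_before_event_cap = EVENT_LIMIT // EVENTS_PER_ACTIVITY  # ~17,000 activities
large_payload = 500 * 1024          # a 500KB LLM response per activity
activities_before_size_cap = SIZE_LIMIT // large_payload          # ~100 activities

# With large payloads, the size limit binds two orders of magnitude earlier
assert activities_before_size_cap < activities_before_event_cap
```

An agent returning full LLM responses through activities hits the 50MB cap after roughly a hundred calls, long before the event-count ceiling.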
Human-in-the-Loop Pattern
```python
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class AgentWithHumanApproval:
    def __init__(self):
        self.approved = False
        self.current_step = "initializing"
        self.pending_action = None

    @workflow.signal
    async def approve(self, decision: str):
        self.approved = decision == "yes"

    @workflow.query
    def get_status(self) -> dict:
        return {
            "step": self.current_step,
            "pending_action": self.pending_action,
            "approved": self.approved,
        }

    @workflow.run
    async def run(self, goal: str) -> str:
        while not self.is_complete():
            action = await workflow.execute_activity(
                call_llm, goal,
                start_to_close_timeout=timedelta(seconds=120),
            )
            if action.requires_approval:
                self.pending_action = action.description
                self.current_step = "awaiting_approval"
                # Workflow state persists in the DB -- no compute cost while waiting
                await workflow.wait_condition(lambda: self.approved)
                self.approved = False  # reset for the next approval
            self.current_step = "executing"
            result = await workflow.execute_activity(
                execute_tool, args=[action.tool, action.params],
                start_to_close_timeout=timedelta(seconds=60),
            )
        return self.format_result()
```
The workflow.wait_condition(lambda: self.approved) line is where the agent pauses. It can sit there for minutes, hours, or days. If the server restarts, if workers are redeployed, the workflow’s state survives. When the signal arrives, any available worker picks it up and resumes execution.
*Figure: Agent Primitives Timeline – Signals, Queries, Updates, and wait points across an agent's lifecycle.*
Retry Policies and Error Handling
LLM APIs fail routinely. Rate limits (429), server errors (500), socket timeouts, multi-minute latencies. These are the norm for agents making hundreds of calls, and different activities need different retry strategies.
Declarative Retry Policies
Retry policies are configured per activity with several parameters: initial interval, backoff coefficient, maximum interval, maximum attempts, and non-retryable error types. The important part is that retries happen at the infrastructure level. If a worker crashes during a retry cycle, another worker picks up with the retry state intact. The developer writes no retry logic.
Why Different Activities Need Different Strategies
LLM calls need aggressive retry with exponential backoff. Rate limits are transient, and the cost of not retrying (losing all accumulated context and starting the agent run from scratch) far outweighs the cost of waiting 30 seconds for capacity. Configure high maximum attempts (10+) with a long maximum interval.
Tool executions need limited retries. Tools may not be idempotent – running git commit twice produces different results. Blindly retrying could cause duplicate side effects. Configure low maximum attempts (2–3) and mark certain error types as non-retryable.
Human notifications often need no retry at all. Fire-and-forget: if the Slack message fails, don’t block the workflow.
```python
llm_retry = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=60),
    maximum_attempts=10,
    non_retryable_error_types=["InvalidPromptError"],
)

tool_retry = RetryPolicy(
    initial_interval=timedelta(seconds=2),
    maximum_attempts=3,
    non_retryable_error_types=["PermissionDenied", "NotIdempotent"],
)


# Heartbeating for long-running activities
@activity.defn
async def execute_long_tool(task: dict) -> str:
    # On a retry, pick up from the last reported checkpoint
    details = activity.info().heartbeat_details
    start = details[0]["progress"] + 1 if details else 0
    result = ""
    for i, chunk in enumerate(process_chunks(task)):  # process_chunks: your chunking helper
        if i < start:
            continue  # already processed before the previous worker died
        activity.heartbeat({"progress": i, "last_chunk": chunk.id})
        result = await process(chunk)
    return result
```
Heartbeats
For long-running activities, the worker periodically reports progress via heartbeats. If the heartbeat stops (worker crashed), Temporal reschedules the activity on another worker. The new worker can read the last heartbeat details to resume from the last checkpoint rather than starting over. This matters for activities processing large datasets or running multi-step tool executions.
Saga Patterns for Multi-Agent Systems
When multiple agents coordinate, failure handling gets complex. Temporal supports saga patterns where compensation logic runs when a step fails. If a planning agent fails, downstream execution agents’ pending activities can be cancelled rather than left hanging. If the response agent produces an unsatisfactory draft, compensation logic can route back to the research agent for additional context.
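A minimal compensation sketch in plain Python (illustrative; in Temporal the same shape is typically written as try/except around activity calls, with compensations executed as activities):

```python
def run_saga(steps):
    # steps: list of (name, action, compensate) triples
    compensations, log = [], []
    try:
        for name, action, compensate in steps:
            action()                              # forward step
            compensations.append((name, compensate))
    except Exception:
        # Unwind completed steps in reverse order
        for name, compensate in reversed(compensations):
            compensate()
            log.append(f"compensated:{name}")
    return log

def fail():
    raise RuntimeError("execution agent failed")

log = run_saga([
    ("plan", lambda: None, lambda: None),
    ("execute", fail, lambda: None),
])
# log == ["compensated:plan"]: the planning step is rolled back
```

The key property is that compensation only covers steps that actually completed, which is exactly the ambiguity a crash otherwise leaves behind.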
| Activity Type | Retry Strategy | Rationale |
|---|---|---|
| LLM API call | Aggressive backoff, 10+ attempts | Rate limits are transient; restart cost is enormous |
| Idempotent tools (search, read) | Moderate backoff, 3–5 attempts | Safe to re-execute; failures are usually transient |
| Non-idempotent tools (write, deploy) | Limited, 1–2 attempts | Re-execution may cause side effects |
| Human notification | No retry | Fire-and-forget; don’t block the workflow |
| Long-running computation | Heartbeat + resume from checkpoint | Avoid restarting expensive work from scratch |
Production Case Study: OpenAI Codex
OpenAI’s Codex, their cloud-based coding agent that writes, tests, and iterates on code, uses Temporal as its core orchestration backbone. Will Wang, a software engineer on the Codex team, confirmed publicly that “Temporal is a critical part of the infrastructure powering Codex, responsible for executing our core control flows.” He described it as enabling the team to “easily reason about concurrency, correctness, and fault tolerance” while scaling a complicated distributed system.
Codex sessions run for 6+ hours on complex tasks. The entire agent loop (prompt construction, model inference, tool calls, result observation, loop back) runs as a Temporal Workflow. Each LLM call and tool execution is an Activity with its own retry policy and timeout. A single “turn” can involve hundreds of tool calls.
The Codex harness manages three conversation primitives: Items (atomic I/O units like messages or diffs), Turns (one unit of agent work from user input), and Threads (the durable container for an ongoing session, with persisted event history supporting resume, fork, and archive operations). Thread persistence – OpenAI describes threads as “durable containers” with “persisted event history” supporting reconnection – aligns directly with Temporal’s Event History.
Codex has a self-review pattern internally called the “Ralph Wiggum Loop”: the agent reviews its own changes, requests additional agent reviews, and iterates until all reviewers are satisfied. In Temporal terms, the review results arrive as signals, and the workflow decides whether to iterate or complete.
The relationship extends beyond Codex. In July 2025, OpenAI and Temporal launched a formal integration adding durable execution to the OpenAI Agents SDK: every agent invocation runs as a Temporal Activity, and orchestration runs as a Temporal Workflow. Temporal also processes millions of ChatGPT image-generation workflows. Venkat Venkataramani (OpenAI’s VP of App Infrastructure) reinforced this at Temporal’s Series D announcement: “Durable execution is a core requirement for modern AI systems.”
Framework Integrations
Temporal integrates with existing agent frameworks so teams don’t have to rewrite their agent logic from scratch. The pattern is the same across integrations: Temporal provides the durability layer, the framework provides the agent logic.
PydanticAI + Temporal
PydanticAI has first-class Temporal support via a TemporalAgent wrapper that preserves PydanticAI’s type-safety while offloading non-deterministic model requests and tool calls to Temporal activities. The orchestration logic lives in a deterministic workflow, and all I/O-bound tasks are automatically wrapped as activities.
One significant design decision: thread-based workflows. Each conversation thread gets its own Temporal workflow that persists for the lifetime of the conversation. This is more efficient than stateless approaches because the system only processes new messages, maintaining context within workflow state rather than re-sending the entire history for every inference.
```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from temporalio.client import Client

# Define the agent with PydanticAI's type-safe interface
support_agent = Agent(
    model=OpenAIModel("gpt-4o"),
    system_prompt="You are a customer support agent.",
    result_type=SupportResponse,  # Pydantic model for type-safe output
)

@support_agent.tool
async def lookup_order(ctx, order_id: str) -> OrderDetails:
    return await db.get_order(order_id)

# Wrap with Temporal for durability
from pydantic_ai_temporal import TemporalAgent

temporal_agent = TemporalAgent(
    agent=support_agent,
    client=await Client.connect("localhost:7233"),
    task_queue="support-agents",
)

# Each conversation gets a durable workflow
result = await temporal_agent.run(
    "What's the status of order #12345?",
    thread_id="customer-session-abc",
)
```
OpenAI Agents SDK + Temporal
The OpenAI Agents SDK integration centers on the activity_as_tool helper. This function automatically generates OpenAI-compatible tool schemas directly from Temporal activity signatures. The agent reasons about and invokes activities as tools, with every tool call backed by durable execution.
```typescript
import { activityAsTool, OpenAIAgentsPlugin } from "@temporalio/openai-agents";

// Temporal activities become tools the agent can call
const searchTool = activityAsTool(searchDocuments, {
  startToCloseTimeout: "30s",
  retryPolicy: { maximumAttempts: 3 },
});

const writeTool = activityAsTool(writeDocument, {
  startToCloseTimeout: "60s",
  retryPolicy: { maximumAttempts: 1 },
});

// Agent orchestration runs as a Temporal Workflow;
// each tool call is a durable Activity
const plugin = new OpenAIAgentsPlugin({
  client: temporalClient,
  taskQueue: "agent-workers",
  tools: [searchTool, writeTool],
});
```
Developers use the OpenAIAgentsPlugin to configure the Temporal client and worker, enabling integrated tracing that provides visibility through both the Temporal UI and OpenAI dashboards.
When Temporal Adds Unnecessary Complexity
Temporal is not always the right choice. Here’s where it adds more complexity than value:
- Simple agents: a single LLM call followed by one tool call doesn’t benefit from durable execution infrastructure. One comparison found that adding Temporal to a simple document indexing pipeline required “rearchitecting the app, splitting it into two services, adding a runtime dependency on a third service, and adding over 100 lines of code” where a lighter-weight approach achieved the same with 7 lines.
- Prototyping and experimentation: when you’re iterating on agent architecture, the determinism constraints and operational overhead slow you down.
- Sub-30-second agents: if the agent completes before infrastructure failures become likely, the cost of durable execution exceeds the benefit.
- Teams without infrastructure engineering capacity: self-hosted Temporal requires operating four services plus a database. If you don’t have the team to manage this, the operational burden may outweigh the reliability gains.
Trade-offs
Temporal’s guarantees come with trade-offs that shape day-to-day development experience.
Operational Complexity
Self-hosted Temporal requires deploying four independent services plus a persistence database (PostgreSQL, MySQL, or Cassandra) and optionally Elasticsearch for advanced visibility. This is not a single process with a single run command.
Learning Curve
Engineers must internalize: workflows vs activities, determinism rules, event history mechanics, signals, queries, updates, ContinueAsNew, versioning strategies, worker configuration.
The determinism constraint confuses newcomers, especially because LLMs are inherently non-deterministic. The resolution (LLM calls go in activities, not workflows) is simple once understood, but the documentation framing perpetuates the misconception.
Event History Limits
Each workflow execution is limited to 51,200 events or 50MB. An activity generates roughly 3 events. If activities return large LLM payloads (500KB+), the 50MB limit becomes binding well before the event count limit. The mitigation – ContinueAsNew, which atomically starts a fresh execution with carried-over state – works but adds architectural complexity. Teams building agents with many LLM calls must implement payload offloading (store large payloads in S3, pass references) and proactively manage history growth.
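The offloading pattern is small enough to sketch in full (hypothetical, with a dict standing in for S3):

```python
# Stand-in for an S3 bucket
blob_store: dict[str, bytes] = {}

def offload(payload: bytes, key: str) -> str:
    blob_store[key] = payload   # the activity writes the payload out-of-band
    return key                  # only this small reference enters event history

def resolve(ref: str) -> bytes:
    return blob_store[ref]      # downstream activities fetch by reference

# A 500KB LLM response costs the history only a short key (path is illustrative)
ref = offload(b"x" * 500_000, "runs/42/llm-response/turn-7")
assert len(ref) < 100
assert resolve(ref) == b"x" * 500_000
```

The activity result recorded by Temporal is the key, so history growth becomes a function of call count rather than payload size.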
Latency
Temporal Cloud’s minimum end-to-end latency is roughly 100ms per workflow step, with a single activity round-trip taking approximately 220ms. Local Activities save ~50ms per call but sacrifice heartbeating and independent retry capabilities. For agents where sub-second interactivity matters (chatbot-like interactions), this overhead accumulates across many steps. Agents with 50+ steps per interaction may see 5–10 seconds of pure infrastructure overhead.
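Back-of-envelope with the figures above, a 50-step interaction pays roughly:

```python
STEP_FLOOR_MS = 100      # approximate per-step floor on Temporal Cloud
ACTIVITY_RTT_MS = 220    # approximate single-activity round-trip
steps = 50

low = steps * STEP_FLOOR_MS / 1000     # 5.0 seconds
high = steps * ACTIVITY_RTT_MS / 1000  # 11.0 seconds of pure overhead
```

That range is infrastructure cost alone, before any model inference or tool runtime.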
Versioning
Code changes to workflow logic can cause non-determinism errors during replay of running workflows. If a running workflow was started with version 1 of the code and a worker running version 2 picks it up, the replay may produce different activity commands, causing a non-determinism exception. Temporal provides patching APIs and worker versioning, but patches accumulate in code and “need to be removed with extreme care.” Airbyte documented struggles with non-determinism exceptions, ultimately deciding to fail affected workflows rather than attempting recovery. Safe deployment requires replay testing against production event histories in CI.
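A toy illustration of why this fails (plain Python mirroring the replayer idea, not the SDK): version 2 of the workflow emits a different first command, so replay against a version-1 history diverges.

```python
def workflow_v1(call):
    return [call("call_llm"), call("execute_tool")]

def workflow_v2(call):
    # New code inserts a step that v1 histories know nothing about
    return [call("fetch_context"), call("call_llm"), call("execute_tool")]

def replay(workflow, history):
    cursor = iter(history)
    def call(activity):
        recorded_activity, result = next(cursor)
        if recorded_activity != activity:  # command mismatch during replay
            raise RuntimeError(
                f"non-determinism: history has {recorded_activity!r}, "
                f"code emitted {activity!r}")
        return result
    return workflow(call)

v1_history = [("call_llm", "answer"), ("execute_tool", "ok")]
assert replay(workflow_v1, v1_history) == ["answer", "ok"]  # clean replay

try:
    replay(workflow_v2, v1_history)  # new code against an old history
except RuntimeError as e:
    error = str(e)  # "non-determinism: history has 'call_llm', ..."
```

Temporal's patching APIs exist precisely to let the new code branch on which "version" a given history was recorded under.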
| Trade-off | Impact | Mitigation |
|---|---|---|
| Operational complexity | 4+ services to manage, or cloud costs | Temporal Cloud; start with dev server locally |
| Learning curve | 2–3 weeks for team onboarding | Start with simple workflows, add primitives incrementally |
| Event history limits | 51,200 events / 50MB cap per execution | ContinueAsNew + payload offloading to S3 |
| Latency overhead | ~100ms/step, ~220ms/activity round-trip | Local Activities for latency-sensitive paths |
| Versioning complexity | Non-determinism errors on code changes | Replay testing in CI, worker versioning |
Closing Thoughts
We covered a lot of ground here: the workflow/activity split, deterministic replay, server architecture, coordination primitives, retry strategies, and how OpenAI’s Codex team puts it all together.
The core design insight is the separation of deterministic orchestration from non-deterministic execution. Once you accept that split, replay-based recovery falls out as a consequence – and with it, most of the infrastructure problems we listed at the top of this post.
OpenAI, Replit, Block, NVIDIA, and others have independently converged on durable execution for their agent workloads. Temporal’s recent $300M Series D at a $5B valuation, with 380%+ year-over-year revenue growth driven substantially by AI workloads, suggests this is a real pattern. The company joined the Agentic AI Foundation (under the Linux Foundation) alongside Anthropic, OpenAI, and Block.
For most teams, the practical path is: prototype with something lighter (LangGraph, CrewAI), validate the agent architecture, and migrate when the agents run long enough and matter enough that you can’t afford to lose state on a crash. The operational investment is real, but so is the cost of rebuilding reliability from scratch.
References
Temporal Documentation. Core Concepts – Workflows, Activities, Workers. Temporal Technologies.
Temporal. Temporal for AI. Overview of Temporal’s AI-specific capabilities and customer stories.
Wang, W. (2025). Codex and Temporal Integration. Will Wang’s public statements on Codex’s use of Temporal for core control flows.
OpenAI. Harness Engineering: Leveraging Codex in an Agent-First World. OpenAI engineering blog on the Codex harness architecture.
Temporal. Build Durable AI Agents with Pydantic AI and Temporal. PydanticAI integration guide.
Temporal. Of Course You Can Build Dynamic AI Agents with Temporal. Temporal’s architecture for dynamic AI agent loops.
Quo (formerly OpenPhone). How We Built a Real-Time AI Voice Agent with Temporal. Production case study on Temporal primitives for voice agents.
Temporal. Production-Ready Agents with the OpenAI Agents SDK + Temporal. OpenAI Agents SDK integration announcement.
Temporal. AI Cookbook – OpenAI Agents SDK. Code examples and patterns for the OpenAI integration.
PydanticAI Documentation. Temporal Durable Execution. Official PydanticAI guide for Temporal integration.
Vanlightly, J. Explanations of deterministic replay mechanics and the determinism contract in Temporal workflows. Referenced via Temporal community resources.
Wang, X., et al. (2025). The OpenHands Software Agent SDK. arXiv preprint arXiv:2511.03690. The predecessor post’s primary reference for event sourcing comparison.