The Lab
Interactive drill · 24 scenarios

Agentic AI in production

TypeScript-native · Serverless · Edge-native · Vercel + Cloudflare

The questions I get in real architecture conversations, each with the answer I would give and a diagram. Reveal one at a time, or shuffle and drill.

Opinionated, on purpose

These are the calls I make for early-stage teams optimizing for speed, reliability, and a fast path to tens or hundreds of thousands of users, not for infinite customization or cost-tuning at millions-of-users scale. At that point the tradeoffs flip toward running your own infra. This is the build-fast lens.

01 · Reference architecture

Walk me through how you would design a chat agent for a SaaS product. Assume Next.js + Supabase.

An edge route handles auth and per-tenant rate limiting, persists the user message, kicks off a durable function, and returns fast. I do not run the agent loop in the route, because serverless functions time out and a mid-loop crash would lose everything. The durable function runs the loop with each tool call as a retryable step, streaming chunks through Redis so the browser can reconnect. Postgres plus pgvector is the source of truth, Redis is just the stream buffer, and tools sit behind an MCP gateway. I defend each choice by the failure mode it prevents.

  • Edge route: auth, rate limit, persist message, start durable job, return fast
  • Durable function runs the loop; each tool call is a retryable step
  • Postgres+pgvector = source of truth, Redis = stream buffer, MCP gateway for tools
  • useChat({resume:true}) for mobile; defend each choice by the failure it prevents
Next.js (App Router), Vercel AI SDK · useChat, Inngest, Upstash Redis, Supabase · pgvector, MCP gateway

02 · HITL + durable execution

Your agent needs to wait for human approval that could take days. How do you implement this?

You cannot hold a process or function open for days, so this is a durable execution problem. The workflow pauses on a durable wait for an approval event with a long timeout, consuming nothing while suspended, and resumes exactly where it paused when the event arrives. The human gets a Slack or email link; clicking it emits the event. I support the four decision types (approve, edit, reject, respond) and handle the timeout path explicitly so silence means escalate, cancel, or a safe default.

  • You cannot hold a process alive for days; use durable execution
  • Workflow pauses on a durable wait for the approval event, resumes on arrival
  • Surface via Slack/email link; clicking emits the event
  • Four decisions: approve, edit, reject, respond; handle the timeout path explicitly
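
The four decisions and the timeout path can be modeled as a discriminated union, so the workflow cannot silently skip a case. This is a minimal sketch: the type and function names are hypothetical and not taken from Inngest, Temporal, or LangGraph, and "escalate on timeout" is just one of the three timeout policies named above.

```typescript
// Hypothetical types: illustrative names, not from any SDK.
type Decision =
  | { kind: "approve" }
  | { kind: "edit"; args: Record<string, unknown> }
  | { kind: "reject"; reason: string }
  | { kind: "respond"; message: string };

type NextStep =
  | { action: "run_tool"; args: Record<string, unknown> }
  | { action: "cancel"; reason: string }
  | { action: "replan"; feedback: string }
  | { action: "escalate" };

// `decision` is null when the durable wait timed out with no human response.
function resolveApproval(
  proposedArgs: Record<string, unknown>,
  decision: Decision | null,
): NextStep {
  if (decision === null) return { action: "escalate" }; // silence is never approval
  switch (decision.kind) {
    case "approve":
      return { action: "run_tool", args: proposedArgs };
    case "edit":
      // The human's edits override the agent's proposed arguments.
      return { action: "run_tool", args: { ...proposedArgs, ...decision.args } };
    case "reject":
      return { action: "cancel", reason: decision.reason };
    case "respond":
      // Free-text feedback goes back to the agent instead of running the tool.
      return { action: "replan", feedback: decision.message };
  }
}
```

In a durable engine, the value passed as `decision` would be whatever the wait primitive returns, with the timeout mapped to `null`.
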
Inngest · step.waitForEvent, Temporal · signals, LangGraph · HITL Middleware, AsyncPostgresSaver

03 · RAG when-not

When would you NOT use RAG?

RAG is over-prescribed. If the corpus is small and static, I put it in the prompt with caching. If the data is structured, I query it with SQL through a tool, not embeddings. If it changes often, live API calls beat stale vectors. Naive retrieval misses the right chunk often enough that it is not a safe default, so when I do use it the default is hybrid search plus a reranker, and agentic RAG for multi-hop questions. RAG is a mechanism, not the default architecture.

  • Small static corpus: prompt + caching
  • Structured data: SQL via tool call, do not embed it
  • Frequently changing data: live API calls beat stale embeddings
  • If you do RAG: hybrid + rerank (retrieval is the main failure point); agentic RAG for multi-hop
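
One common way to combine keyword and vector results in hybrid search is reciprocal rank fusion (RRF). The answer above does not name a fusion method, so RRF is an assumption here, as are the document IDs and the conventional constant `k = 60`. The fused list is what would then be handed to the reranker.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.
// k = 60 is a conventional default; treat it as a tunable assumption.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

// Hypothetical result lists from a keyword index and a vector index.
const keywordHits = ["doc-a", "doc-b", "doc-c"];
const vectorHits = ["doc-b", "doc-d", "doc-a"];
const fusedHits = reciprocalRankFusion([keywordHits, vectorHits]);
// fusedHits is ["doc-b", "doc-a", "doc-d", "doc-c"]
```

Documents that appear in both lists rise to the top even when neither retriever ranked them first, which is the behavior that makes hybrid retrieval more forgiving than either method alone.
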
pgvector, hybrid + Voyage rerank-2.5, LlamaIndex, agentic RAG

04 · Messaging skepticism

A team says they want to add Kafka to their agent system. What do you ask them?

Three questions before agreeing. What is the sustained throughput? Below roughly 500K messages a second Kafka is overkill. Do you need event replay from a point in time? Is long retention an actual requirement? If all three are no, which is usual for an agent fanning out tool calls, Redis Streams or NATS JetStream fits with far less operational weight. Kafka 4.0 dropped ZooKeeper so it is simpler than it was, but the tradeoff still favors the lighter options here.

  • Sustained throughput? Below ~500K msg/s Kafka is overkill
  • Need replay from a point in time? Need long retention?
  • All no → Redis Streams (sub-ms) or NATS JetStream (low single-digit ms)
  • Kafka 4.0 KRaft is simpler now, but the tradeoff still favors lighter options
Redis Streams, NATS JetStream, Kafka (KRaft)

05 · Security defense in depth

How do you prevent prompt injection in an agent that processes user-submitted documents?

Prompt injection cannot be solved at the prompt layer, because the model cannot tell document instructions from mine. So I design assuming compromise. Least privilege first: the document reader carries no write or financial credentials. The strongest defense is splitting reader from actor, where the reader processes untrusted content and emits a sanitized summary, and the actor only ever sees that summary. Add approval gates on irreversible actions and an MCP gateway for audit. The goal is to bound the blast radius, not to write a cleverer system prompt.

  • Cannot be solved at the prompt layer; design assuming compromise
  • Least privilege: the reader carries no write/financial credentials
  • Reader/actor split: actor only sees the sanitized summary
  • Approval gates on irreversible actions + MCP gateway; bound the blast radius
reader/actor split, MCP gateway · TrueFoundry/MintMCP, OAuth 2.1 · RFC 8707, E2B / Modal sandboxes

06 · Durable execution

What is durable execution and when do you need it?

It runs a function as a sequence of checkpointed steps, journaling each result before moving on. If the worker dies, a new one replays the journal: finished steps return cached results, and execution resumes from the first incomplete step, so it is as if the function never crashed. You need it for work past serverless timeouts, multi-day pauses, or expensive partial failures. The one gotcha: any code outside a step re-runs on every replay, so non-deterministic work must live inside steps.

  • Journal-based replay: each step is checkpointed before the next runs
  • On crash, replay returns cached results and resumes from the first incomplete step
  • Need it for: serverless timeouts, multi-day pauses, expensive partial failures
  • Gotcha: code outside a step re-runs on replay, so non-determinism goes inside steps
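
The journal-and-replay mechanism described above can be sketched in a few lines. This is a toy under stated assumptions: the journal is an in-memory `Map` (a real engine persists it), steps are synchronous, and every name (`makeStep`, `runWorkflow`, the step IDs) is hypothetical.

```typescript
// In-memory stand-in for a durable journal; a real engine persists this.
type Journal = Map<string, unknown>;

function makeStep(journal: Journal) {
  return function step<T>(id: string, fn: () => T): T {
    if (journal.has(id)) return journal.get(id) as T; // replay: cached result
    const value = fn(); // first execution of this step
    journal.set(id, value); // checkpoint before moving on
    return value;
  };
}

let chargeCount = 0; // side effect that must happen exactly once
let bodyRuns = 0; // counts executions of code outside any step
let crashOnce = true; // simulates a worker dying mid-run

function runWorkflow(journal: Journal): string {
  bodyRuns += 1; // outside a step, so it re-runs on every replay
  const step = makeStep(journal);
  const orderId = step("charge", () => {
    chargeCount += 1;
    return "order-1";
  });
  return step("email", () => {
    if (crashOnce) {
      crashOnce = false;
      throw new Error("worker died");
    }
    return `receipt for ${orderId}`;
  });
}

const journalStore: Journal = new Map();
let finalReceipt = "";
try {
  finalReceipt = runWorkflow(journalStore); // first attempt crashes in "email"
} catch {
  finalReceipt = runWorkflow(journalStore); // replay: "charge" comes from the journal
}
// chargeCount is 1, bodyRuns is 2, finalReceipt is "receipt for order-1"
```

The `bodyRuns` counter illustrates the gotcha: anything outside a step executes again on each replay, so timestamps, random IDs, and network calls belong inside steps.
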
Temporal, Inngest, Trigger.dev v4, Hatchet, Restate, Cloudflare Workflows

07 · Framework debugging

You inherit an agent that "works on my machine" but fails in production. What is the most likely cause?

I would give a diagnostic order, not one guess. First, context window saturation, since a prompt that worked on a short local chat degrades on a long real one. Second, no durable execution, so a transient error mid-loop kills the run. Third, missing approval gates on destructive tools. Fourth, a bloated system prompt where rules get lost. Fifth, a memory layer using an in-memory saver locally but never wired to Postgres. So: check context size, then the agent-loop error logs, then whether retries survive a restart.

  • Context window saturation is the first suspect
  • No durable execution: a transient error mid-loop kills the run
  • No HITL gates on destructive tools; bloated prompt losing rules
  • Memory not wired for production; diagnostic order: context, logs, retries
LangGraph · AsyncPostgresSaver, LangSmith (tracing), context budget checks

08 · Context engineering

Why is sub-agent isolation considered more important than context compaction in 2026?

Compaction treats the symptom, isolation treats the cause. Summarizing a full context is lossy and degrades as the window fills. Isolation keeps the noise out entirely: a sub-agent does the messy work in its own window and returns a short, clean summary, so the parent stays focused. Most major frameworks default to sub-agent isolation as the primary multi-agent pattern, AutoGen's compressing group manager being the notable exception. The design implication is to ask what to keep out of the context, not how to fit more in. Isolation beats compression.

  • Compaction is lossy and degrades as the window fills
  • Isolation keeps the noise out: sub-agent works in its own window, returns a summary
  • Most major frameworks use sub-agent isolation as the primary pattern
  • Design around what to keep OUT, not what to fit in; isolation beats compression
Claude Agent SDK · subagents, OpenAI Agents SDK, Mastra, LangGraph

09 · Evaluation & monitoring

How would you evaluate whether your agent is actually getting better over time?

Offline and online answer different questions. Offline, I keep a fixed eval set of representative inputs with expected outputs or rubrics, run on every change so a regression is caught before it ships, graded with assertions where deterministic and an LLM judge where subjective. Online, I track success rate, latency, and cost per request, sliced by scenario so regressions do not hide in the average, and I log every prompt and tool call so failures are reconstructable. Most teams ship on vibes because they never built the harness. I build it first, and I feed every production failure back into the eval set as a new test.

  • Offline: fixed eval set run on every change, assertions + LLM judge
  • Online: success rate, latency, cost per request, sliced by scenario
  • Log every prompt and tool call so failures are reconstructable
  • Feed every production failure back as a new test case
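
A minimal offline harness shows why slicing by scenario matters. Everything here is a hypothetical sketch: the `EvalCase` shape, the scenario names, and the synchronous `stubAgent` that stands in for a real agent call. Only the deterministic-assertion path is shown; an LLM judge would slot in as another kind of `check`.

```typescript
interface EvalCase {
  id: string;
  scenario: string;
  input: string;
  check: (output: string) => boolean; // deterministic assertion
}

interface EvalReport {
  passRate: number;
  failedIds: string[];
  passRateByScenario: Record<string, number>;
}

function runEvals(agent: (input: string) => string, cases: EvalCase[]): EvalReport {
  const totals = new Map<string, { pass: number; total: number }>();
  const failedIds: string[] = [];
  for (const c of cases) {
    const ok = c.check(agent(c.input));
    const tally = totals.get(c.scenario) ?? { pass: 0, total: 0 };
    tally.total += 1;
    if (ok) tally.pass += 1;
    else failedIds.push(c.id);
    totals.set(c.scenario, tally);
  }
  const passRateByScenario: Record<string, number> = {};
  totals.forEach((tally, scenario) => {
    passRateByScenario[scenario] = tally.pass / tally.total;
  });
  return {
    passRate: (cases.length - failedIds.length) / cases.length,
    failedIds,
    passRateByScenario,
  };
}

// Stub agent that has regressed on refunds but still handles order status.
const stubAgent = (input: string): string =>
  input.includes("refund") ? "Sorry, I can't help with that." : "Your order has shipped.";

const evalReport = runEvals(stubAgent, [
  { id: "order-1", scenario: "order_status", input: "where is my order?", check: (o) => o.includes("shipped") },
  { id: "order-2", scenario: "order_status", input: "track order 42", check: (o) => o.includes("shipped") },
  { id: "order-3", scenario: "order_status", input: "has it shipped yet?", check: (o) => o.includes("shipped") },
  { id: "refund-1", scenario: "refunds", input: "I want a refund", check: (o) => o.includes("refund") },
]);
// Overall pass rate is 0.75, but the "refunds" slice is 0.
```

The overall number looks tolerable while one scenario is completely broken, which is exactly what the per-scenario slice exposes.
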
LangSmith, Braintrust, OpenTelemetry, LLM-as-judge

10 · Cost optimization

Your agent costs $0.40 per conversation and the business needs it under $0.10. What do you do?

Measure first to see where the tokens go, then attack in order. The biggest lever is model tiering: a cheap fast model routes and handles simple turns, the expensive model only does the hard reasoning. Then prompt caching for stable prefixes, context discipline by summarizing and trimming history, and response caching for repeats with a semantic cache for paraphrases. I would also question whether every conversation needs the full loop. Cost is an architecture problem, not a model-price problem.

  • Measure first: where do the tokens and dollars actually go
  • Model tiering: cheap router, frontier model only for hard reasoning
  • Prompt caching for stable prefixes; trim and summarize history
  • Response/semantic caching for repeats; cost is architecture, not price
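
The arithmetic behind model tiering is worth making explicit. In the sketch below, only the $0.40 starting cost and the $0.10 target come from the question; the 80/20 traffic split and the $0.02 cheap-model cost are hypothetical numbers chosen to show the calculation, not measurements.

```typescript
// Hypothetical per-tier shares and costs; only $0.40 and the $0.10 target are from the question.
interface Tier {
  share: number; // fraction of conversations handled by this tier
  costPerConversation: number; // USD
}

function blendedCost(tiers: Tier[]): number {
  const totalShare = tiers.reduce((sum, t) => sum + t.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) throw new Error("shares must sum to 1");
  return tiers.reduce((sum, t) => sum + t.share * t.costPerConversation, 0);
}

// Assume a cheap model fully handles 80% of conversations at $0.02 each,
// while 20% still need the expensive model at $0.40 each.
const tieredCost = blendedCost([
  { share: 0.8, costPerConversation: 0.02 },
  { share: 0.2, costPerConversation: 0.4 },
]);
// 0.8 * 0.02 + 0.2 * 0.40 = 0.016 + 0.08 = 0.096, just under the $0.10 target
```

Under these assumptions tiering alone barely clears the target, which is why measuring the real split comes first and why caching and context trimming are still needed for headroom.
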
Vercel AI Gateway, model tiering · Haiku→Opus, prompt caching, Redis semantic cache

11 · Multi-agent design

When would you use a multi-agent system instead of a single agent with many tools?

My default is a single agent with good tools, because multi-agent adds coordination cost. I split only for a concrete reason: context isolation when a subtask is noisy, genuine parallelism via a planner and workers, distinct specializations needing different tools or permissions, or a deliberate critique loop with a separate evaluator. Every handoff is a place context gets lost, so I keep the number of agents as small as the problem allows and never go multi-agent for show.

  • Default to one agent with good tools; multi-agent adds coordination cost
  • Split for: context isolation, parallelism, specialization, or critique
  • Planner decomposes, workers run in parallel, orchestrator synthesizes
  • Every handoff loses context; keep agent count minimal
orchestrator-workers, Claude Agent SDK · subagents, OpenAI Agents SDK · handoffs, Mastra

12 · Resumable streaming

A user backgrounds your chat app mid-response on their phone and comes back. What should happen?

They should come back to the response intact, never a broken half-message, because mobile sessions get backgrounded constantly. I decouple generation from the connection: the server writes every chunk into a Redis stream keyed by an ID, and the browser reads from there, so if the tab is backgrounded the server keeps generating into Redis. On return the client resumes and replays from the buffer. The subtlety is that stop is not disconnect: a separate stop endpoint actually cancels the producer, and it clears the stream ID only if it still matches, so a stale stop cannot cancel a newer response.

  • Goal: returning user sees the response intact, never a broken half-message
  • Decouple generation from the connection: server writes chunks to Redis
  • Browser reads from Redis; reconnect replays from the buffer
  • Stop != disconnect: a separate endpoint cancels the producer, race-safe
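
The decoupling and the race-safe stop can be sketched with an in-memory buffer standing in for the Redis stream. The class and method names are hypothetical, and a real implementation would use Redis commands and an atomic compare-and-clear rather than a local field.

```typescript
// In-memory stand-in for a Redis stream; all names are hypothetical.
class StreamBuffer {
  private chunks = new Map<string, string[]>();
  private activeStreamId: string | null = null;

  start(streamId: string): void {
    this.chunks.set(streamId, []);
    this.activeStreamId = streamId;
  }

  // The producer appends whether or not any client is connected.
  append(streamId: string, chunk: string): void {
    this.chunks.get(streamId)?.push(chunk);
  }

  // A reconnecting client passes how many chunks it already received.
  readFrom(streamId: string, cursor: number): string[] {
    return (this.chunks.get(streamId) ?? []).slice(cursor);
  }

  // Clears the active ID only if it still matches, so a late stop for an
  // old response cannot cancel a newer one.
  stopStream(streamId: string): boolean {
    if (this.activeStreamId !== streamId) return false;
    this.activeStreamId = null;
    return true;
  }

  isActive(streamId: string): boolean {
    return this.activeStreamId === streamId;
  }
}

const streamBuffer = new StreamBuffer();
streamBuffer.start("s1");
streamBuffer.append("s1", "Hel"); // client receives this chunk, then backgrounds
streamBuffer.append("s1", "lo ");
streamBuffer.append("s1", "world");
const missedChunks = streamBuffer.readFrom("s1", 1); // ["lo ", "world"]

streamBuffer.start("s2"); // the user has already sent a new message
const staleStopApplied = streamBuffer.stopStream("s1"); // false: s2 stays active
```

The client tracks only a cursor; everything else lives server-side, which is what lets a backgrounded tab return to a complete response.
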
Vercel AI SDK v6, resumable-stream, Upstash Redis, Next.js after()

13 · Memory vendor choice

Mem0, Letta, or Zep for agent memory? How do you choose?

Choose a philosophy, not a vendor. Mem0 hides the schema behind add and search and is compact, the default for user personalization. Letta exposes memory blocks the agent edits itself, for autonomous long-horizon agents where managing memory is part of the behavior. Zep with Graphiti is a temporal knowledge graph with timestamped facts for as-of queries, heavier but right for compliance and time reasoning. And for some domains, especially coding, the best memory is the filesystem or codebase itself, queried rather than embedded.

  • Choose a philosophy: how should the agent relate to its own memory
  • Mem0: hides schema, compact, default for personalization
  • Letta: agent-edited memory blocks, for autonomous long-horizon agents
  • Zep: temporal graph for as-of queries; coding? query the filesystem
Mem0, Letta (MemGPT), Zep + Graphiti

14 · Database choice

Postgres or MongoDB for a new AI product that needs vector search?

My default is Postgres with pgvector: one database for relational data, JSON, and vectors, with ACID and row-level security for multi-tenancy, and it handles tens of millions of vectors on HNSW. That avoids running a separate vector store. MongoDB 8.x has a real pitch with in-database embeddings (auto-embedding via Voyage AI), so I pick it if the data is document-shaped or the team is already on Mongo, reading vendor benchmarks skeptically. If I ever outgrow pgvector, I add a dedicated vector DB as a sidecar rather than rewriting the data layer.

  • Default Postgres + pgvector: one DB, ACID, RLS, tens of millions of vectors on HNSW
  • Avoids a separate vector store
  • MongoDB 8.2+ autoEmbed if data is document-shaped or team is on Mongo
  • Outgrow pgvector? Add a sidecar, do not rewrite the data layer
Postgres + pgvector, Convex, Supabase / Neon, MongoDB Atlas 8.2+, Qdrant (sidecar)

15 · Tool reliability

How do you handle an agent that hallucinates a tool call to a function that does not exist?

I treat every tool call from the model as untrusted input, like a browser form. A validation layer sits between the model and execution: it checks the tool name is registered and the arguments match the schema before anything runs, so an invented function never executes. On failure I feed a structured error back into context so the model self-corrects on the next turn, bounded by a retry cap that falls back to a graceful error or human handoff. The model proposes, the validation layer disposes.

  • Treat every tool call as untrusted input
  • Validation layer checks name + schema before executing
  • On failure, feed a structured error back for self-correction
  • Cap retries with a graceful fallback; model proposes, validation disposes
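
A validation layer of this shape can be sketched without any library. The hand-rolled `validate` function stands in for a Zod or JSON Schema check, `propose` stands in for the model, and the tool name, error codes, and retry cap of 3 are all hypothetical.

```typescript
type ToolResult =
  | { ok: true; output: unknown }
  | { ok: false; error: { code: string; message: string } };

interface ToolDef {
  // Returns null when args are valid, otherwise a message the model can act on.
  validate: (args: unknown) => string | null;
  run: (args: unknown) => unknown;
}

// Only tools registered here can ever execute.
const toolRegistry = new Map<string, ToolDef>([
  [
    "get_weather",
    {
      validate: (args) => {
        const city = (args as { city?: unknown } | null)?.city;
        return typeof city === "string" && city.length > 0 ? null : "expected { city: string }";
      },
      run: (args) => `Sunny in ${(args as { city: string }).city}`,
    },
  ],
]);

function executeToolCall(toolName: string, args: unknown): ToolResult {
  const tool = toolRegistry.get(toolName);
  if (!tool) {
    const available = Array.from(toolRegistry.keys()).join(", ");
    return { ok: false, error: { code: "unknown_tool", message: `No tool "${toolName}". Available: ${available}` } };
  }
  const problem = tool.validate(args);
  if (problem !== null) return { ok: false, error: { code: "invalid_arguments", message: problem } };
  return { ok: true, output: tool.run(args) };
}

type ToolCall = { toolName: string; args: unknown };

// `propose` stands in for the model: it sees the last structured error and tries again.
function callWithRetryCap(
  propose: (lastError: ToolResult | null) => ToolCall,
  maxAttempts = 3,
): ToolResult {
  let last: ToolResult | null = null;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const call = propose(last);
    last = executeToolCall(call.toolName, call.args);
    if (last.ok) return last;
  }
  return { ok: false, error: { code: "retries_exhausted", message: "Handing off to a human." } };
}
```

For example, a proposer that first invents `lookup_forecast` and then, after seeing the `unknown_tool` error, calls `get_weather` with `{ city: "Oslo" }` succeeds on the second attempt, and the invented function never runs.
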
Zod / JSON schema, structured outputs, Instructor

16 · End-to-end design

Design a customer support agent that can answer from internal docs and issue refunds.

I treat the two capabilities very differently. Answering from docs is read-only and low risk: index into pgvector, retrieve with hybrid search plus a reranker, run it freely in the loop. Refunds are irreversible and financial, so they get the strict treatment: a typed tool with argument validation and a human approval gate above a threshold, plus least-privilege credentials so the doc path cannot touch the refund system. An orchestrator decides answer versus lookup versus act versus escalate, everything is logged, and I verify the refund against the payment system, not the model.

  • Treat read-only and money-touching paths differently
  • Docs: pgvector + hybrid + rerank, runs freely
  • Refunds: typed tool, validation, approval gate, least privilege
  • Orchestrator routes; verify refund against the payment system, not the model
pgvector + rerank, Stripe (refund tool), Inngest, approval gate

17 · Streaming vs durability

You open an SSE stream so your long-running agent does not time out. Does that solve the timeout problem?

No. A connection is a thin pipe that must stay alive; a durable job is state held safely. People confuse the pipe being open with the work being safe. SSE does not raise the duration ceiling, it just avoids idle kills. The fix is to split them: the work lives in a durable engine that survives days, the pipe is disposable, and the server is a doorman that starts the job and returns an ID without waiting. The browser reconnects by ID and resumes from saved state. The moment the server waits, you tie the work to the life of a pipe.

  • A connection is a disposable pipe; a durable job is safely-held state
  • SSE does not raise the duration ceiling, it only avoids idle kills
  • Work lives in a durable engine; the pipe is disposable; server is a doorman
  • Browser reconnects by job ID and resumes from saved state
Inngest / Temporal, Vercel Workflows, Upstash Redis, Inngest Realtime / Ably

18 · Replay & observability

What does "replay" mean in an agent system, and why would you build it?

Replay is reconstructing what an agent did by reprocessing a stored event log, instead of trusting live state. It matters because agents are non-deterministic: without it you only see the final output and cannot debug, reproduce, or safely evaluate a change. You store an append-only log of messages, plans, tool calls, results, and outputs. Then either re-execute the steps with tools and model mocked for determinism, or fold the events through a reducer to rebuild state. Durable execution gives you part of this for free, and every production failure becomes a new test.

  • Replay = reconstruct a run from a stored event log, not live state
  • Necessary because agents are non-deterministic: enables debug, repro, eval
  • Two flavors: full re-execution (mocked) and state reconstruction (reducer)
  • Durable execution gives it partly for free; failures become test cases
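
The state-reconstruction flavor is a plain fold over the event log. The event shapes and state fields below are hypothetical, picked to mirror the kinds of entries listed above; the point is that the same log always rebuilds the same state.

```typescript
// Hypothetical event and state shapes for an append-only agent log.
type AgentEvent =
  | { type: "user_message"; text: string }
  | { type: "tool_called"; tool: string; callId: string }
  | { type: "tool_result"; callId: string; output: string }
  | { type: "final_output"; text: string };

interface RunState {
  messages: string[];
  pendingCalls: string[]; // tool calls with no result yet
  toolOutputs: Record<string, string>;
  output: string | null;
}

const initialState: RunState = { messages: [], pendingCalls: [], toolOutputs: {}, output: null };

// Pure reducer: no I/O, no clock, so replay is deterministic.
function applyEvent(state: RunState, ev: AgentEvent): RunState {
  switch (ev.type) {
    case "user_message":
      return { ...state, messages: [...state.messages, ev.text] };
    case "tool_called":
      return { ...state, pendingCalls: [...state.pendingCalls, ev.callId] };
    case "tool_result": {
      const { callId, output } = ev;
      return {
        ...state,
        pendingCalls: state.pendingCalls.filter((id) => id !== callId),
        toolOutputs: { ...state.toolOutputs, [callId]: output },
      };
    }
    case "final_output":
      return { ...state, output: ev.text };
  }
}

function replayRun(log: AgentEvent[]): RunState {
  return log.reduce(applyEvent, initialState);
}

// A run that failed before the second tool returned.
const exampleLog: AgentEvent[] = [
  { type: "user_message", text: "What is 6 x 7, and what is the weather?" },
  { type: "tool_called", tool: "calculator", callId: "c1" },
  { type: "tool_result", callId: "c1", output: "42" },
  { type: "tool_called", tool: "weather", callId: "c2" },
];
const rebuiltState = replayRun(exampleLog);
// rebuiltState.pendingCalls is ["c2"]: the run stalled waiting on the weather tool
```

Because the reducer is pure, the stored log from a production failure can be dropped straight into a test and will reproduce the same stalled state every time.
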
event sourcing, Inngest (step checkpoints), Postgres event log, LangSmith

19 · Caching layers

Where do you cache in an LLM agent system, and is caching useful when every prompt looks different?

Prompts look unique but are built from reusable blocks: system instructions, tool descriptions, and context are stable, only the user input really varies. So you cache the blocks. Provider caching handles stable prefixes for free but is opaque. Application caching keys a prompt hash to its response. Component caching, the best return, caches tool outputs, retrieval results, and planner decisions. For paraphrases I use semantic caching with a similarity threshold, careful not to over-merge distinct intents. You are caching intent and reusable blocks, not whole strings.

  • Prompts are reusable blocks; only user input really varies
  • Provider: stable prefix, free but opaque; application: hash → response
  • Component (best ROI): tool outputs, retrieval, planner decisions
  • Semantic caching for paraphrases; do not over-merge distinct intents
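
Semantic caching with a similarity threshold can be sketched as a nearest-match lookup over stored embeddings. The three-dimensional vectors below are toy stand-ins for real embedding-model output, the linear scan stands in for a vector index, and the 0.95 threshold is a hypothetical value that would need tuning against real traffic.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

class SemanticCache {
  private entries: { embedding: number[]; response: string }[] = [];
  private threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  set(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }

  // Returns the closest cached response only if it clears the threshold.
  get(queryEmbedding: number[]): string | null {
    let best: { score: number; response: string } | null = null;
    for (const entry of this.entries) {
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score >= this.threshold && (best === null || score > best.score)) {
        best = { score, response: entry.response };
      }
    }
    return best === null ? null : best.response;
  }
}

// Toy embeddings: nearby directions represent paraphrases of the same intent.
const semanticCache = new SemanticCache(0.95);
semanticCache.set([1, 0, 0], "Refunds take 5 business days.");
const paraphraseHit = semanticCache.get([0.9, 0.1, 0]); // similarity about 0.994: hit
const differentIntent = semanticCache.get([0, 1, 0]); // similarity 0: miss
```

Lowering the threshold raises the hit rate but also the risk of merging distinct intents, so it is a knob to evaluate, not a constant to copy.
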
prompt caching (Anthropic/OpenAI), Redis, embeddings · semantic cache

20 · Runtime choice

Edge runtime or a Node server for your agent backend? What actually changes?

Hybrid, with reasoning. Edge is a lightweight isolate: native fetch, ultra-low latency, near-zero cold start, but short-lived with strict limits, no raw socket, and less predictable long connections. Node is a full runtime with socket control and reliable long-lived streaming and long tasks, at the cost of latency and cold starts. The agent loop, tools, and long conversations want Node or a durable engine; edge is the fast global front for auth, routing, and light token-pumping. The duration ceiling applies regardless of runtime, so do not put the loop on edge for no benefit. One currency note: Vercel now deprecates standalone Edge Functions in favor of Node on Fluid Compute, so true edge-native agents increasingly live on Cloudflare Workers and Durable Objects.

  • Edge: native fetch, ultra-low latency, but short-lived, strict, no socket
  • Node: socket control, reliable long SSE, long tasks, higher latency
  • Agent loop + tools want Node or a durable engine; edge is the fast front
  • Duration ceiling applies regardless; hybrid is the common answer
Cloudflare Workers (edge), Node + Fluid Compute, Vercel Workflows

21 · System thinking

Someone argues an agent is really just a workflow with ambiguity baked in. Do you agree?

Largely yes, sharpened: an agent is a deterministic workflow engine where certain nodes delegate the decision to a stochastic policy, the LLM, while all side effects stay in deterministic layers. The LLM is a policy function inside the engine, not the controller. It matters because the LLM breaks same-input-same-output, so you wrap each fuzzy node with validation, constrained outputs, and idempotency. Unlike a plain API, the LLM node can influence control flow, but only inside bounded limits. So harden the shell and treat each LLM node as untrusted and retryable.

  • Agree, sharpened: deterministic engine with stochastic decision nodes
  • The LLM is a policy function inside the engine, not the controller
  • It breaks same-input-same-output, so wrap each node (validate, constrain)
  • Unlike an API it can steer control flow, but bounded; harden the shell
Temporal (deterministic core), LangGraph (graph), Claude Agent SDK

22 · Gateway, routing & cost

How do you handle model routing, provider failover, and cost control across an agent's LLM calls?

I put an AI gateway in front of every model call instead of hitting providers directly. One endpoint, many models: automatic failover when a provider degrades, retries, response caching, per-key rate limits, and hard spend budgets with real-time cost tracking. On Vercel that is the AI Gateway paired with the AI SDK; for provider neutrality, OpenRouter; on Cloudflare, their AI Gateway. Routing is policy, not app code: a cheap model handles most turns, the frontier model only the hard ones, with A/B and geo routing decided at the gateway.

  • One gateway endpoint in front of all providers: failover, retries, caching
  • Hard spend budgets and real-time cost tracking, not billing surprises
  • Routing is policy at the gateway: cheap model default, frontier for hard turns
  • Provider-neutral (OpenRouter) or platform-native (Vercel / Cloudflare gateway)
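
"Routing is policy, not app code" can be made concrete by expressing the routes as data and keeping the selection logic generic. This sketch does not use any real gateway's API: the provider names, model names, budget figure, and the `pickRoute` function are all hypothetical, and provider health is passed in rather than probed.

```typescript
// Hypothetical routing policy expressed as data; names are placeholders.
interface Route {
  provider: string;
  model: string;
}

interface RoutingPolicy {
  simple: Route[]; // ordered by preference; later entries are failover targets
  hard: Route[];
  dailyBudgetUsd: number;
}

const routingPolicy: RoutingPolicy = {
  simple: [
    { provider: "provider-a", model: "small-fast" },
    { provider: "provider-b", model: "small-fast-alt" },
  ],
  hard: [
    { provider: "provider-a", model: "frontier" },
    { provider: "provider-b", model: "frontier-alt" },
  ],
  dailyBudgetUsd: 50,
};

type RouteDecision =
  | { kind: "route"; route: Route }
  | { kind: "blocked"; reason: "budget_exhausted" | "no_healthy_provider" };

function pickRoute(
  policy: RoutingPolicy,
  turn: { hard: boolean },
  healthyProviders: Set<string>,
  spentTodayUsd: number,
): RouteDecision {
  // Hard budget: refuse before spending, instead of discovering it on the bill.
  if (spentTodayUsd >= policy.dailyBudgetUsd) return { kind: "blocked", reason: "budget_exhausted" };
  const candidates = turn.hard ? policy.hard : policy.simple;
  // Failover: the first candidate whose provider is currently healthy wins.
  const route = candidates.find((r) => healthyProviders.has(r.provider));
  return route ? { kind: "route", route } : { kind: "blocked", reason: "no_healthy_provider" };
}
```

Changing which model handles which turns then means editing `routingPolicy`, not the agent code, which is the property a hosted gateway gives you through its configuration.
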
Vercel AI Gateway, OpenRouter, Cloudflare AI Gateway, AI SDK 6

23 · Edge-native agent platform

Can you run the whole agent on one serverless platform with no servers to manage?

Yes. On Cloudflare the agent is a Durable Object: single-threaded stateful compute with its own SQLite, so each session gets durable state, WebSockets, and scheduling for free, and the Agents SDK sits on top. Inference runs at the edge on Workers AI, retrieval on Vectorize, durable background work on Workflows v2, and untrusted code in Sandboxes. It scales to zero and bills on CPU time, so an idle agent costs nothing. The whole lifecycle lives on one platform with no cluster to run.

  • The agent IS a Durable Object: per-session SQLite state, sockets, scheduling
  • Workers AI for edge inference, Vectorize for RAG, co-located with compute
  • Workflows v2 for durable background work; Sandboxes for untrusted code
  • Scale-to-zero, CPU-time billing: an idle agent costs nothing
Cloudflare Agents SDK, Durable Objects (SQLite), Workers AI, Vectorize, Workflows v2, Sandboxes

24 · Dependency supply chain

A package in your agent's dependency tree gets compromised overnight. How do you not get burned?

Treat dependencies as an attack surface, because they are: in June 2026 the @mastra npm scope was hit by a typosquat remote-access trojan across 140+ packages. The defenses are boring and they work. Pin exact versions with a committed lockfile and never blind-update in CI. Verify provenance (signed publishes) before bumping. Run agents with least privilege so a compromised dependency cannot reach secrets or write paths, isolate any untrusted code in a sandbox, and track a software bill of materials so you know what actually shipped. The blast radius is bounded by what the process can touch, not by how clever the malware is.

  • Dependencies are an attack surface: pin exact versions, commit the lockfile
  • Verify provenance before bumping; never blind-update in CI
  • Least privilege: a compromised dep cannot reach secrets or write paths
  • Isolate untrusted code; track an SBOM so you know what shipped
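
"Pin exact versions" is easy to enforce mechanically. The check below is a small sketch that flags any dependency spec that is not an exact semver version; the package names are hypothetical, and it complements rather than replaces a committed lockfile, since it only inspects the direct specs a `package.json` would declare.

```typescript
// Matches exact versions such as 1.3.0 or 3.0.0-beta.1; rejects ranges, tags, and wildcards.
const EXACT_VERSION = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$/;

function findUnpinned(deps: Record<string, string>): string[] {
  return Object.entries(deps)
    .filter(([, spec]) => !EXACT_VERSION.test(spec))
    .map(([pkg, spec]) => `${pkg}@${spec}`);
}

// Hypothetical dependency block, shaped like the "dependencies" field of package.json.
const exampleDeps: Record<string, string> = {
  "pkg-pinned": "1.3.0",
  "pkg-caret": "^2.1.0",
  "pkg-tag": "latest",
  "pkg-prerelease": "3.0.0-beta.1",
  "pkg-wildcard": "4.x",
};

const unpinnedDeps = findUnpinned(exampleDeps);
// unpinnedDeps is ["pkg-caret@^2.1.0", "pkg-tag@latest", "pkg-wildcard@4.x"]
```

Run as a CI gate, a check like this fails the build before a floating range can pull in a freshly compromised release.
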
lockfile pinning, npm provenance / sigstore, SBOM · Socket.dev, least privilege + sandbox