The questions I get in real architecture conversations, each with the answer I would give and a diagram.
Opinionated, on purpose
These are the calls I make for early-stage teams optimizing for speed, reliability, and a fast path to tens or hundreds of thousands of users, not for infinite customization or cost-tuning at millions-of-users scale. At that point the tradeoffs flip toward running your own infra. This is the build-fast lens.
01 · Reference architecture
Walk me through how you would design a chat agent for a SaaS product. Assume Next.js + Supabase.
An edge route handles auth and per-tenant rate limiting, persists the user message, kicks off a durable function, and returns fast. I do not run the agent loop in the route, because serverless functions time out and a mid-loop crash would lose everything. The durable function runs the loop with each tool call as a retryable step, streaming chunks through Redis so the browser can reconnect. Postgres plus pgvector is the source of truth, Redis is just the stream buffer, and tools sit behind an MCP gateway. I defend each choice by the failure mode it prevents.
Your agent needs to wait for human approval that could take days. How do you implement this?
You cannot hold a process or function open for days, so this is a durable execution problem. The workflow pauses on a durable wait for an approval event with a long timeout, consuming nothing while suspended, and resumes exactly where it paused when the event arrives. The human gets a Slack or email link; clicking it emits the event. I support the four decision types (approve, edit, reject, respond) and handle the timeout path explicitly so silence means escalate, cancel, or a safe default.
You cannot hold a process alive for days; use durable execution
Workflow pauses on a durable wait for the approval event, resumes on arrival
Surface via Slack/email link; clicking emits the event
Four decisions: approve, edit, reject, respond; handle the timeout path explicitly
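A minimal sketch of the four decisions plus the explicit timeout path. It assumes the durable engine hands back either an approval event or nothing when the wait times out; `ApprovalEvent`, `resolve_approval`, and the `on_timeout` default are hypothetical names for illustration, not any engine's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"
    RESPOND = "respond"


@dataclass
class ApprovalEvent:
    decision: Decision
    payload: Optional[dict] = None  # edited arguments, or the human's reply


def resolve_approval(event: Optional[ApprovalEvent], proposed: dict,
                     on_timeout: str = "escalate") -> dict:
    """Map the outcome of a durable wait to the workflow's next action.

    `event` is None when the wait timed out, so silence is handled
    explicitly instead of leaving the workflow hanging.
    """
    if event is None:
        return {"action": on_timeout, "args": None}
    if event.decision is Decision.APPROVE:
        return {"action": "execute", "args": proposed}
    if event.decision is Decision.EDIT:
        # The human changed some arguments; merge them over the proposal.
        return {"action": "execute", "args": {**proposed, **(event.payload or {})}}
    if event.decision is Decision.REJECT:
        return {"action": "cancel", "args": None}
    # RESPOND: the human replied with feedback; hand it back to the agent.
    return {"action": "replan", "args": event.payload}
```

The point of the design is that the timeout is a first-class branch: choosing escalate, cancel, or a safe default is one parameter, not an afterthought.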
RAG is over-prescribed. If the corpus is small and static, I put it in the prompt with caching. If the data is structured, I query it with SQL through a tool, not embeddings. If it changes often, live API calls beat stale vectors. Naive retrieval misses the right chunk often enough that it is not a safe default, so when I do use RAG my baseline is hybrid search plus a reranker, with agentic RAG for multi-hop questions. RAG is a mechanism, not the default architecture.
Small static corpus: prompt + caching
Structured data: SQL via tool call, do not embed it
Frequently changing data: live API calls beat stale embeddings
If you do RAG: hybrid + rerank (retrieval is the main failure point); agentic RAG for multi-hop
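The decision order above can be written down as a small function. This is a sketch, not a rule engine: `choose_retrieval`, the returned labels, and the `prompt_budget_tokens` cutoff for "small" are all assumptions made for illustration.

```python
def choose_retrieval(corpus_tokens: int, is_structured: bool,
                     changes_often: bool, multi_hop: bool,
                     prompt_budget_tokens: int = 50_000) -> str:
    """Pick a retrieval approach in the order the answer argues for."""
    if is_structured:
        return "sql_tool"                 # query it, do not embed it
    if changes_often:
        return "live_api"                 # fresh data beats stale vectors
    if corpus_tokens <= prompt_budget_tokens:
        return "prompt_with_caching"      # small and static fits the prompt
    if multi_hop:
        return "agentic_rag"
    return "hybrid_search_plus_reranker"  # the baseline when RAG is warranted
```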
A team says they want to add Kafka to their agent system. What do you ask them?
Three questions before agreeing. What is the sustained throughput? Below roughly 500K messages a second Kafka is overkill. Do you need event replay from a point in time? Is long retention an actual requirement? If all three are no, which is usual for an agent fanning out tool calls, Redis Streams or NATS JetStream fits with far less operational weight. Kafka 4.0 dropped ZooKeeper so it is simpler than it was, but the tradeoff still favors the lighter options here.
Sustained throughput? Below ~500K msg/s Kafka is overkill
Need replay from a point in time? Need long retention?
All no → Redis Streams (sub-ms) or NATS JetStream (low single-digit ms)
Kafka 4.0 KRaft is simpler now, but the tradeoff still favors lighter options
Redis Streams · NATS JetStream · Kafka (KRaft)
05 · Security defense in depth
How do you prevent prompt injection in an agent that processes user-submitted documents?
Prompt injection cannot be solved at the prompt layer, because the model cannot tell document instructions from mine. So I design assuming compromise. Least privilege first: the document reader carries no write or financial credentials. The strongest defense is splitting reader from actor, where the reader processes untrusted content and emits a sanitized summary, and the actor only ever sees that summary. Add approval gates on irreversible actions and an MCP gateway for audit. The goal is to bound the blast radius, not to write a cleverer system prompt.
Cannot be solved at the prompt layer; design assuming compromise
Least privilege: the reader carries no write/financial credentials
Reader/actor split: actor only sees the sanitized summary
Approval gates on irreversible actions + MCP gateway; bound the blast radius
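A minimal sketch of the reader/actor split, assuming the reader's model call is passed in as `extract` and that the summary is a small typed record. `SanitizedSummary`, `ALLOWED_CATEGORIES`, and the action strings are hypothetical; the idea being illustrated is that only typed, allowlisted fields cross the boundary.

```python
from dataclasses import dataclass
from typing import Callable

ALLOWED_CATEGORIES = {"invoice", "contract", "resume", "other"}


@dataclass(frozen=True)
class SanitizedSummary:
    category: str
    total_cents: int


def reader(document: str, extract: Callable[[str], dict]) -> SanitizedSummary:
    """Runs with no write or financial credentials.

    `extract` stands in for the model call that sees the untrusted text,
    so its output is treated as untrusted too.
    """
    raw = extract(document)
    category = raw.get("category")
    if category not in ALLOWED_CATEGORIES:
        category = "other"
    total = raw.get("total_cents")
    if not isinstance(total, int) or isinstance(total, bool) or total < 0:
        total = 0
    # Free text from the document never reaches the actor.
    return SanitizedSummary(category=category, total_cents=total)


def actor(summary: SanitizedSummary) -> str:
    """Holds the credentials, but only ever sees the sanitized summary."""
    if summary.category == "invoice" and summary.total_cents > 0:
        return f"queue_payment_review:{summary.total_cents}"
    return "no_action"
```

Even if the document fully hijacks the reader, the worst it can emit is a value the schema already allows, which is what bounding the blast radius means in practice.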
What is durable execution and when do you need it?
It runs a function as a sequence of checkpointed steps, journaling each result before moving on. If the worker dies, a new one replays the journal: finished steps return cached results, and execution resumes from the first incomplete step, so it is as if the function never crashed. You need it for work past serverless timeouts, multi-day pauses, or expensive partial failures. The one gotcha: any code outside a step re-runs on every replay, so non-deterministic work must live inside steps.
Journal-based replay: each step is checkpointed before the next runs
On crash, replay returns cached results and resumes from the first incomplete step
Need it for: serverless timeouts, multi-day pauses, expensive partial failures
Gotcha: code outside a step re-runs on replay, so non-determinism goes inside steps
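Journal-based replay is easier to see in a toy than in prose. The sketch below keeps the journal in a plain dict, where a real engine would persist it; `Workflow`, `step`, and the simulated crash are illustrative assumptions.

```python
from typing import Any, Callable


class Workflow:
    def __init__(self, journal: dict[str, Any]):
        self.journal = journal  # survives the worker; a dict stands in for storage

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.journal:
            return self.journal[name]  # replay: return the cached result
        result = fn()
        self.journal[name] = result    # checkpoint before moving on
        return result


calls = {"fetch": 0, "charge": 0}
crash_once = {"armed": True}


def fetch() -> dict:
    calls["fetch"] += 1
    return {"order": 42}


def charge() -> str:
    calls["charge"] += 1
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("worker died")
    return "charged"


def run(journal: dict[str, Any]):
    wf = Workflow(journal)
    order = wf.step("fetch", fetch)
    receipt = wf.step("charge", charge)
    return order, receipt


journal: dict[str, Any] = {}
try:
    run(journal)          # first worker crashes inside the second step
except RuntimeError:
    pass
result = run(journal)     # a new worker replays the same journal
```

After the second run, `fetch` has executed once and `charge` twice: the finished step came from the journal, and only the incomplete one re-ran. Anything placed outside `step` would run on both passes, which is the gotcha.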
You inherit an agent that "works on my machine" but fails in production. What is the most likely cause?
I would give a diagnostic order, not one guess. First, context window saturation, since a prompt that worked on a short local chat degrades on a long real one. Second, no durable execution, so a transient error mid-loop kills the run. Third, missing approval gates on destructive tools. Fourth, a bloated system prompt where rules get lost. Fifth, a memory layer using an in-memory saver locally but never wired to Postgres. So: check context size, then the agent-loop error logs, then whether retries survive a restart.
Context window saturation is the first suspect
No durable execution: a transient error mid-loop kills the run
No HITL gates on destructive tools; bloated prompt losing rules
Memory not wired for production; diagnostic order: context, logs, retries
Why is sub-agent isolation considered more important than context compaction in 2026?
Compaction treats the symptom, isolation treats the cause. Summarizing a full context is lossy and degrades as the window fills. Isolation keeps the noise out entirely: a sub-agent does the messy work in its own window and returns a short, clean summary, so the parent stays focused. Most major frameworks default to sub-agent isolation as the primary multi-agent pattern, AutoGen's compressing group manager being the notable exception. The design implication is to ask what to keep out of the context, not how to fit more in. Isolation beats compression.
Compaction is lossy and degrades as the window fills
Isolation keeps the noise out: sub-agent works in its own window, returns a summary
Most major frameworks use sub-agent isolation as the primary pattern
Design around what to keep OUT, not what to fit in; isolation beats compression
Claude Agent SDK (subagents) · OpenAI Agents SDK · Mastra · LangGraph
09 · Evaluation & monitoring
How would you evaluate whether your agent is actually getting better over time?
Offline and online answer different questions. Offline, I keep a fixed eval set of representative inputs with expected outputs or rubrics, run on every change so a regression is caught before it ships, graded with assertions where deterministic and an LLM judge where subjective. Online, I track success rate, latency, and cost per request, sliced by scenario so regressions do not hide in the average, and I log every prompt and tool call so failures are reconstructable. Most teams ship on vibes because they never built the harness. I build it, and I feed every production failure back into it as a new test.
Offline: fixed eval set run on every change, assertions + LLM judge
Online: success rate, latency, cost per request, sliced by scenario
Log every prompt and tool call so failures are reconstructable
Feed every production failure back as a new test case
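A minimal offline harness, as a sketch: each case carries its own `check` callable, which can be a deterministic assertion or a wrapper around an LLM judge. `run_evals`, the case shape, and the toy agent are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable


def run_evals(agent: Callable[[str], str], cases: list[dict]) -> dict[str, float]:
    """Return the pass rate per scenario so regressions cannot hide in the average."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        output = agent(case["input"])
        total[case["scenario"]] += 1
        if case["check"](output):
            passed[case["scenario"]] += 1
    return {scenario: passed[scenario] / total[scenario] for scenario in total}


cases = [
    {"scenario": "refund", "input": "refund order 1", "check": lambda out: "refund" in out},
    {"scenario": "refund", "input": "cancel my plan", "check": lambda out: "refund" in out},
    {"scenario": "faq", "input": "hours?", "check": lambda out: out.startswith("We")},
]


def toy_agent(text: str) -> str:
    return "refund issued" if "refund" in text else "We are open 9-5"


report = run_evals(toy_agent, cases)
```

Here the overall pass rate is two out of three, but the sliced report shows the refund scenario at 50% while FAQ is at 100%. A production failure becomes a new entry in `cases`.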
LangSmith · Braintrust · OpenTelemetry · LLM-as-judge
10 · Cost optimization
Your agent costs $0.40 per conversation and the business needs it under $0.10. What do you do?
Measure first to see where the tokens go, then attack in order. The biggest lever is model tiering: a cheap fast model routes and handles simple turns, the expensive model only does the hard reasoning. Then prompt caching for stable prefixes, context discipline by summarizing and trimming history, and response caching for repeats with a semantic cache for paraphrases. I would also question whether every conversation needs the full loop. Cost is an architecture problem, not a model-price problem.
Measure first: where do the tokens and dollars actually go
Model tiering: cheap router, frontier model only for hard reasoning
Prompt caching for stable prefixes; trim and summarize history
Response/semantic caching for repeats; cost is architecture, not price
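A worked example of the arithmetic, with every number invented for illustration: the prices, the token counts, the 8-turn conversation, the 2-of-8 hard-turn split, and the assumption that cached prefix tokens bill at 10% of the input price are hypothetical, not any provider's price list.

```python
def turn_cost(input_tokens, output_tokens, in_price, out_price,
              cached_fraction=0.0, cached_price_ratio=0.1):
    """Dollar cost of one turn; prices are dollars per million tokens."""
    billable_input = input_tokens * (
        (1 - cached_fraction) + cached_fraction * cached_price_ratio
    )
    return (billable_input * in_price + output_tokens * out_price) / 1_000_000


FRONTIER = (5.0, 25.0)  # hypothetical (input, output) price per million tokens
CHEAP = (0.5, 2.5)      # hypothetical small model, 10x cheaper
TURNS, IN_TOK, OUT_TOK = 8, 8_000, 400

# Baseline: every turn goes to the frontier model.
baseline = TURNS * turn_cost(IN_TOK, OUT_TOK, *FRONTIER)

# Tiering: the cheap model handles 6 turns, the frontier model the 2 hard ones.
tiered = (2 * turn_cost(IN_TOK, OUT_TOK, *FRONTIER)
          + 6 * turn_cost(IN_TOK, OUT_TOK, *CHEAP))

# Tiering plus prompt caching: 75% of each prompt is a stable prefix.
tiered_cached = (2 * turn_cost(IN_TOK, OUT_TOK, *FRONTIER, cached_fraction=0.75)
                 + 6 * turn_cost(IN_TOK, OUT_TOK, *CHEAP, cached_fraction=0.75))

print(round(baseline, 4), round(tiered, 4), round(tiered_cached, 4))
# prints 0.4 0.13 0.0598
```

Under these assumptions tiering alone takes $0.40 to $0.13, and caching the stable prefix takes it under the $0.10 target. The real split depends on the measurement step, which is why it comes first.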
Vercel AI Gateway · model tiering (Haiku→Opus) · prompt caching · Redis semantic cache
11 · Multi-agent design
When would you use a multi-agent system instead of a single agent with many tools?
My default is a single agent with good tools, because multi-agent adds coordination cost. I split only for a concrete reason: context isolation when a subtask is noisy, genuine parallelism via a planner and workers, distinct specializations needing different tools or permissions, or a deliberate critique loop with a separate evaluator. Every handoff is a place context gets lost, so I keep the number of agents as small as the problem allows and never go multi-agent for show.
Default to one agent with good tools; multi-agent adds coordination cost
Split for: context isolation, parallelism, specialization, or critique
Planner decomposes, workers run in parallel, orchestrator synthesizes
Every handoff loses context; keep agent count minimal
A user backgrounds your chat app mid-response on their phone and comes back. What should happen?
They should come back to the response intact, never a broken half-message, because mobile users background the app constantly. I decouple generation from the connection: the server writes every chunk into a Redis stream keyed by an ID, and the browser reads from there, so if the tab backgrounds the server keeps generating into Redis. On return the client resumes and replays from the buffer. The subtlety is that stop is not disconnect; a separate stop endpoint actually cancels the producer, and it clears the active stream ID only if that ID still matches the one being stopped, so a late stop cannot wipe a newer stream.
Goal: returning user sees the response intact, never a broken half-message
Decouple generation from the connection: server writes chunks to Redis
Browser reads from Redis; reconnect replays from the buffer
Stop != disconnect: a separate endpoint cancels the producer, race-safe
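A sketch of the buffer semantics with an in-memory stand-in for the Redis stream. `StreamBuffer` and its method names are hypothetical; a real version would use Redis commands and a cancellation signal to the producer, but the resume and stop logic has the same shape.

```python
class StreamBuffer:
    """In-memory stand-in for Redis streams keyed by stream ID."""

    def __init__(self) -> None:
        self.chunks: dict[str, list[str]] = {}
        self.active: dict[str, str] = {}  # chat_id -> current stream_id
        self.cancelled: set[str] = set()

    def start(self, chat_id: str, stream_id: str) -> None:
        self.chunks[stream_id] = []
        self.active[chat_id] = stream_id

    def append(self, stream_id: str, chunk: str) -> bool:
        """The producer writes whether or not any client is connected."""
        if stream_id in self.cancelled:
            return False  # tells the producer to stop generating
        self.chunks[stream_id].append(chunk)
        return True

    def read_from(self, stream_id: str, offset: int) -> list[str]:
        """Reconnect: replay everything the client has not seen yet."""
        return self.chunks.get(stream_id, [])[offset:]

    def stop(self, chat_id: str, stream_id: str) -> bool:
        """Explicit stop: cancel the producer, then clear the active pointer
        only if it still points at this stream."""
        self.cancelled.add(stream_id)
        if self.active.get(chat_id) == stream_id:
            del self.active[chat_id]
            return True
        return False
```

A disconnect calls nothing here, which is the point: only `stop` cancels work. The compare-before-clear in `stop` is what keeps a late stop for an old stream from erasing the pointer to a newer one.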
Vercel AI SDK v6 · resumable-stream · Upstash Redis · Next.js after()
13 · Memory vendor choice
Mem0, Letta, or Zep for agent memory? How do you choose?
Choose a philosophy, not a vendor. Mem0 hides the schema behind add and search and is compact, the default for user personalization. Letta exposes memory blocks the agent edits itself, for autonomous long-horizon agents where managing memory is part of the behavior. Zep with Graphiti is a temporal knowledge graph with timestamped facts for as-of queries, heavier but right for compliance and time reasoning. And for some domains, especially coding, the best memory is the filesystem or codebase itself, queried rather than embedded.
Choose a philosophy: how should the agent relate to its own memory
Mem0: hides schema, compact, default for personalization
Letta: agent-edited memory blocks, for autonomous long-horizon agents
Zep: temporal graph for as-of queries; coding? query the filesystem
Mem0 · Letta (MemGPT) · Zep + Graphiti
14 · Database choice
Postgres or MongoDB for a new AI product that needs vector search?
My default is Postgres with pgvector: one database for relational data, JSON, and vectors, with ACID and row-level security for multi-tenancy, and it handles tens of millions of vectors on HNSW. That avoids running a separate vector store. MongoDB 8.x has a real pitch with in-database embeddings (auto-embedding via Voyage AI), so I pick it if the data is document-shaped or the team is already on Mongo, reading vendor benchmarks skeptically. If I ever outgrow pgvector, I add a dedicated vector DB as a sidecar rather than rewriting the data layer.
Default Postgres + pgvector: one DB, ACID, RLS, tens of millions of vectors on HNSW
Avoids a separate vector store
MongoDB 8.2+ autoEmbed if data is document-shaped or team is on Mongo
Outgrow pgvector? Add a sidecar, do not rewrite the data layer
How do you handle an agent that hallucinates a tool call to a function that does not exist?
I treat every tool call from the model as untrusted input, like a browser form. A validation layer sits between the model and execution: it checks the tool name is registered and the arguments match the schema before anything runs, so an invented function never executes. On failure I feed a structured error back into context so the model self-corrects on the next turn, bounded by a retry cap that falls back to a graceful error or human handoff. The model proposes, the validation layer disposes.
Treat every tool call as untrusted input
Validation layer checks name + schema before executing
On failure, feed a structured error back for self-correction
Cap retries with a graceful fallback; model proposes, validation disposes
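A minimal validation layer, sketched with the standard library only; in a real stack a schema library would do the argument checks. `REGISTRY`, `validate_call`, `run_with_retries`, and the `propose` callback (standing in for the model) are hypothetical names.

```python
from typing import Callable, Optional

REGISTRY = {
    "get_weather": {
        "fn": lambda city: f"sunny in {city}",
        "params": {"city": str},
    },
}


def validate_call(name: str, args: dict) -> Optional[dict]:
    """Return None when the call is valid, else a structured error for the model."""
    tool = REGISTRY.get(name)
    if tool is None:
        return {"error": "unknown_tool", "tool": name, "available": sorted(REGISTRY)}
    params = tool["params"]
    missing = [p for p in params if p not in args]
    unexpected = [a for a in args if a not in params]
    wrong_type = [p for p, t in params.items()
                  if p in args and not isinstance(args[p], t)]
    if missing or unexpected or wrong_type:
        return {"error": "invalid_arguments", "tool": name, "missing": missing,
                "unexpected": unexpected, "wrong_type": wrong_type}
    return None


def run_with_retries(propose: Callable[[Optional[dict]], dict],
                     max_attempts: int = 3) -> dict:
    """`propose` receives the previous error (or None) and returns a tool call."""
    feedback = None
    for _ in range(max_attempts):
        call = propose(feedback)
        feedback = validate_call(call["name"], call["args"])
        if feedback is None:
            # Only a registered tool with valid arguments ever executes.
            result = REGISTRY[call["name"]]["fn"](**call["args"])
            return {"ok": True, "result": result}
    return {"ok": False, "result": "handoff_to_human"}
```

The error lists the tools that do exist, which gives the model something concrete to correct against on the next turn.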
Zod / JSON schema · structured outputs · Instructor
16 · End-to-end design
Design a customer support agent that can answer from internal docs and issue refunds.
I treat the two capabilities very differently. Answering from docs is read-only and low risk: index into pgvector, retrieve with hybrid search plus a reranker, run it freely in the loop. Refunds are irreversible and financial, so they get the strict treatment: a typed tool with argument validation and a human approval gate above a threshold, plus least-privilege credentials so the doc path cannot touch the refund system. An orchestrator decides answer versus lookup versus act versus escalate, everything is logged, and I verify the refund against the payment system, not the model.
Treat read-only and money-touching paths differently
Docs: pgvector + hybrid + rerank, runs freely
Refunds: typed tool, validation, approval gate, least privilege
Orchestrator routes; verify refund against the payment system, not the model
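The refund path can be sketched as a pure decision function. The threshold value and the names are assumptions; `charged_cents` is meant to come from a payment-system lookup, never from the model's own claim.

```python
APPROVAL_THRESHOLD_CENTS = 5_000  # hypothetical: refunds above $50 need a human


def decide_refund(amount_cents: int, charged_cents: int,
                  human_approved: bool = False) -> str:
    """Gate a model-proposed refund before any money moves."""
    valid_int = isinstance(amount_cents, int) and not isinstance(amount_cents, bool)
    if not valid_int or amount_cents <= 0:
        return "reject:invalid_amount"
    if amount_cents > charged_cents:
        # Verified against the payment system's record of the charge.
        return "reject:exceeds_charge"
    if amount_cents > APPROVAL_THRESHOLD_CENTS and not human_approved:
        return "needs_approval"
    return "issue_refund"
```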
You open an SSE stream so your long-running agent does not time out. Does that solve the timeout problem?
No. A connection is a thin pipe that must stay alive; a durable job is state held safely. People confuse the pipe being open with the work being safe. SSE does not raise the duration ceiling, it just avoids idle kills. The fix is to split them: the work lives in a durable engine that survives days, the pipe is disposable, and the server is a doorman that starts the job and returns an ID without waiting. The browser reconnects by ID and resumes from saved state. The moment the server waits, you tie the work to the life of a pipe.
A connection is a disposable pipe; a durable job is safely-held state
SSE does not raise the duration ceiling, it only avoids idle kills
Work lives in a durable engine; the pipe is disposable; server is a doorman
Browser reconnects by job ID and resumes from saved state
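A sketch of the doorman pattern with an in-memory job store. `JobStore` and `work_one` are hypothetical stand-ins for a durable engine and its worker; the shape to notice is that `start` returns an ID without waiting for the work.

```python
import uuid
from collections import deque
from typing import Callable


class JobStore:
    def __init__(self) -> None:
        self.jobs: dict[str, dict] = {}
        self.queue: deque[str] = deque()

    def start(self, prompt: str) -> str:
        """Doorman: record the job, enqueue it, return an ID immediately."""
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"state": "queued", "prompt": prompt, "result": None}
        self.queue.append(job_id)
        return job_id

    def work_one(self, run: Callable[[str], str]) -> None:
        """Stand-in for the durable engine picking up the next job."""
        job_id = self.queue.popleft()
        job = self.jobs[job_id]
        job["state"] = "running"
        job["result"] = run(job["prompt"])
        job["state"] = "done"

    def status(self, job_id: str) -> dict:
        """Any connection, new or old, reads saved state by ID."""
        job = self.jobs[job_id]
        return {"state": job["state"], "result": job["result"]}
```

The request handler only ever calls `start` and `status`, so no connection's lifetime is tied to the work.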
What does "replay" mean in an agent system, and why would you build it?
Replay is reconstructing what an agent did by reprocessing a stored event log, instead of trusting live state. It matters because agents are non-deterministic: without it you only see the final output and cannot debug, reproduce, or safely evaluate a change. You store an append-only log of messages, plans, tool calls, results, and outputs. Then either re-execute the steps with tools and model mocked for determinism, or fold the events through a reducer to rebuild state. Durable execution gives you part of this for free, and every production failure becomes a new test.
Replay = reconstruct a run from a stored event log, not live state
Necessary because agents are non-deterministic: enables debug, repro, eval
Two flavors: full re-execution (mocked) and state reconstruction (reducer)
Durable execution gives it partly for free; failures become test cases
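The state-reconstruction flavor is a fold over the log. In this sketch the event types and field names are assumptions; the reducer itself is the standard pattern.

```python
from functools import reduce


def apply_event(state: dict, event: dict) -> dict:
    """Pure reducer: (state, event) -> new state, never mutating the input."""
    kind = event["type"]
    if kind == "user_message":
        return {**state, "messages": state["messages"] + [event["text"]]}
    if kind == "tool_call":
        return {**state, "pending": {**state["pending"], event["id"]: event["tool"]}}
    if kind == "tool_result":
        pending = {k: v for k, v in state["pending"].items() if k != event["id"]}
        return {**state, "pending": pending,
                "results": state["results"] + [event["output"]]}
    if kind == "final_output":
        return {**state, "output": event["text"]}
    return state  # unknown event types are ignored


def rebuild(events: list[dict]) -> dict:
    initial = {"messages": [], "pending": {}, "results": [], "output": None}
    return reduce(apply_event, events, initial)


log = [
    {"type": "user_message", "text": "refund order 7"},
    {"type": "tool_call", "id": "t1", "tool": "lookup_order"},
    {"type": "tool_result", "id": "t1", "output": {"status": "paid"}},
    {"type": "tool_call", "id": "t2", "tool": "issue_refund"},
]
state = rebuild(log)
```

Rebuilding from this log leaves `issue_refund` in `pending` with no output, so the run visibly died mid-refund. Slicing the log (`rebuild(log[:2])`) shows the state at any earlier point.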
Where do you cache in an LLM agent system, and is caching useful when every prompt looks different?
Prompts look unique but are built from reusable blocks: system instructions, tool descriptions, and context are stable, only the user input really varies. So you cache the blocks. Provider caching handles stable prefixes for free but is opaque. Application caching keys a prompt hash to its response. Component caching, the best return, caches tool outputs, retrieval results, and planner decisions. For paraphrases I use semantic caching with a similarity threshold, careful not to over-merge distinct intents. You are caching intent and reusable blocks, not whole strings.
Prompts are reusable blocks; only user input really varies
Provider: stable prefix, free but opaque; application: hash → response
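A sketch of component caching and a semantic lookup, using only the standard library. The key format, the `ComponentCache` class, and the 0.95 threshold are assumptions; embeddings are passed in as plain lists, since how they are produced is outside this example.

```python
import hashlib
import json
import math


def cache_key(component: str, payload: dict) -> str:
    """Hash the component name plus canonical JSON, so key order does not matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{component}:{canonical}".encode()).hexdigest()


class ComponentCache:
    def __init__(self) -> None:
        self.store: dict[str, object] = {}
        self.hits = 0

    def get_or_compute(self, component: str, payload: dict, compute):
        key = cache_key(component, payload)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        value = compute(payload)
        self.store[key] = value
        return value


def cosine(a: list[float], b: list[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def semantic_lookup(query_vec, entries, threshold: float = 0.95):
    """entries: list of (embedding, cached_response). Returns a response or None."""
    best = max(entries, key=lambda e: cosine(query_vec, e[0]), default=None)
    if best is not None and cosine(query_vec, best[0]) >= threshold:
        return best[1]
    return None  # below threshold: treat as a distinct intent
```

The threshold is the knob for over-merging: too low and distinct intents share an answer, too high and paraphrases miss.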
Edge runtime or a Node server for your agent backend? What actually changes?
Hybrid, with reasoning. Edge is a lightweight isolate: native fetch, ultra-low latency, near-zero cold start, but short-lived with strict limits, no raw socket, and less predictable long connections. Node is a full runtime with socket control and reliable long-lived streaming and long tasks, at the cost of latency and cold starts. The agent loop, tools, and long conversations want Node or a durable engine; edge is the fast global front for auth, routing, and light token-pumping. The duration ceiling applies regardless of runtime, so do not put the loop on edge for no benefit. One currency note: Vercel now deprecates standalone Edge Functions in favor of Node on Fluid Compute, so true edge-native agents increasingly live on Cloudflare Workers and Durable Objects.
Edge: native fetch, ultra-low latency, but short-lived, strict, no socket
Node: socket control, reliable long SSE, long tasks, higher latency
Agent loop + tools want Node or a durable engine; edge is the fast front
Duration ceiling applies regardless; hybrid is the common answer
Someone argues an agent is really just a workflow with ambiguity baked in. Do you agree?
Largely yes, sharpened: an agent is a deterministic workflow engine where certain nodes delegate the decision to a stochastic policy, the LLM, while all side effects stay in deterministic layers. The LLM is a policy function inside the engine, not the controller. It matters because the LLM breaks same-input-same-output, so you wrap each fuzzy node with validation, constrained outputs, and idempotency. Unlike a plain API, the LLM node can influence control flow, but only inside bounded limits. So harden the shell and treat each LLM node as untrusted and retryable.
Agree, sharpened: deterministic engine with stochastic decision nodes
The LLM is a policy function inside the engine, not the controller
It breaks same-input-same-output, so wrap each node (validate, constrain)
Unlike an API it can steer control flow, but bounded; harden the shell
How do you handle model routing, provider failover, and cost control across an agent's LLM calls?
I put an AI gateway in front of every model call instead of hitting providers directly. One endpoint, many models: automatic failover when a provider degrades, retries, response caching, per-key rate limits, and hard spend budgets with real-time cost tracking. On Vercel that is the AI Gateway paired with the AI SDK; for provider neutrality, OpenRouter; on Cloudflare, their AI Gateway. Routing is policy, not app code: a cheap model handles most turns, the frontier model only the hard ones, with A/B and geo routing decided at the gateway.
One gateway endpoint in front of all providers: failover, retries, caching
Hard spend budgets and real-time cost tracking, not billing surprises
Routing is policy at the gateway: cheap model default, frontier for hard turns
Provider-neutral (OpenRouter) or platform-native (Vercel / Cloudflare gateway)
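What a gateway does on each call can be sketched in a few lines. This toy is not how any of the named gateways is implemented: `Gateway`, the flat per-call cost in cents, and the provider tuples are assumptions for illustration.

```python
class BudgetExceeded(Exception):
    pass


class Gateway:
    def __init__(self, providers, budget_cents: int) -> None:
        # providers: ordered list of (name, call, cost_cents); first is preferred
        self.providers = providers
        self.budget_cents = budget_cents
        self.spent_cents = 0

    def complete(self, prompt: str) -> dict:
        if self.spent_cents >= self.budget_cents:
            raise BudgetExceeded(f"spent {self.spent_cents} of {self.budget_cents} cents")
        errors = []
        for name, call, cost_cents in self.providers:
            try:
                text = call(prompt)
            except Exception as exc:
                errors.append(f"{name}: {exc}")  # fail over to the next provider
                continue
            self.spent_cents += cost_cents       # charge only for successful calls
            return {"provider": name, "text": text}
        raise RuntimeError("all providers failed: " + "; ".join(errors))


def degraded(prompt: str) -> str:
    raise TimeoutError("provider degraded")


gateway = Gateway(
    [("primary", degraded, 2), ("fallback", lambda p: "ok:" + p, 1)],
    budget_cents=2,
)
first = gateway.complete("hi")
```

The first call fails over to the fallback and records one cent of spend. Once spend reaches the budget, further calls raise instead of producing a billing surprise.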
Vercel AI Gateway · OpenRouter · Cloudflare AI Gateway · AI SDK 6
23 · Edge-native agent platform
Can you run the whole agent on one serverless platform with no servers to manage?
Yes. On Cloudflare the agent is a Durable Object: single-threaded stateful compute with its own SQLite, so each session gets durable state, WebSockets, and scheduling for free, and the Agents SDK sits on top. Inference runs at the edge on Workers AI, retrieval on Vectorize, durable background work on Workflows v2, and untrusted code in Sandboxes. It scales to zero and bills on CPU time, so an idle agent costs nothing. The whole lifecycle lives on one platform with no cluster to run.
The agent IS a Durable Object: per-session SQLite state, sockets, scheduling
Workers AI for edge inference, Vectorize for RAG, co-located with compute
Workflows v2 for durable background work; Sandboxes for untrusted code
Scale-to-zero, CPU-time billing: an idle agent costs nothing
A package in your agent's dependency tree gets compromised overnight. How do you not get burned?
Treat dependencies as an attack surface, because they are: in June 2026 the @mastra npm scope was hit by a typosquat remote-access trojan across 140+ packages. The defenses are boring and they work. Pin exact versions with a committed lockfile and never blind-update in CI. Verify provenance (signed publishes) before bumping. Run agents with least privilege so a compromised dependency cannot reach secrets or write paths, isolate any untrusted code in a sandbox, and track a software bill of materials so you know what actually shipped. The blast radius is bounded by what the process can touch, not by how clever the malware is.
Dependencies are an attack surface: pin exact versions, commit the lockfile
Verify provenance before bumping; never blind-update in CI
Least privilege: a compromised dep cannot reach secrets or write paths
Isolate untrusted code; track an SBOM so you know what shipped
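Pinning is easy to enforce mechanically. The sketch below is a hypothetical CI check that flags any `package.json` dependency that is not an exact version; the function name and the regex are assumptions, and it complements a committed lockfile and provenance checks without replacing them.

```python
import json
import re

# Exact semver only: 1.2.3, optionally with a pre-release or build suffix.
EXACT = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$")


def unpinned(package_json_text: str) -> list[str]:
    """Return 'name@spec' for every dependency that is a range, tag, or URL."""
    manifest = json.loads(package_json_text)
    loose = []
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT.match(spec):
                loose.append(f"{name}@{spec}")
    return sorted(loose)


manifest_text = json.dumps({
    "dependencies": {"zod": "3.23.8", "next": "^15.0.0"},
    "devDependencies": {"typescript": "~5.6.0", "vitest": "latest"},
})
problems = unpinned(manifest_text)
```

Here `problems` lists `next`, `typescript`, and `vitest`, while `zod` passes. Failing the build on a non-empty list means a bump is always a reviewed change.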