Question 1

Walk me through how you would design a chat agent for a SaaS product. Assume Next.js + Supabase.

Accepted Answer

An edge route handles auth and per-tenant rate limiting, persists the user message, kicks off a durable function, and returns fast. I do not run the agent loop in the route, because serverless functions time out and a mid-loop crash would lose everything. The durable function runs the loop with each tool call as a retryable step, streaming chunks through Redis so the browser can reconnect. Postgres plus pgvector is the source of truth, Redis is just the stream buffer, and tools sit behind an MCP gateway. I defend each choice by the failure mode it prevents.

Question 2

Your agent needs to wait for human approval that could take days. How do you implement this?

Accepted Answer

You cannot hold a process or function open for days, so this is a durable execution problem. The workflow pauses on a durable wait for an approval event with a long timeout, consuming nothing while suspended, and resumes exactly where it paused when the event arrives. The human gets a Slack or email link; clicking it emits the event. I support the four decision types (approve, edit, reject, respond) and handle the timeout path explicitly so silence means escalate, cancel, or a safe default.
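
A minimal sketch of the resume path described above, assuming the durable engine hands back either an approval event or nothing on timeout. The names `ApprovalEvent` and `resolve_approval` are hypothetical, and escalation on silence is just one of the timeout options the answer lists.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names throughout; a real durable engine supplies its own wait primitive.
DECISIONS = {"approve", "edit", "reject", "respond"}


@dataclass
class ApprovalEvent:
    decision: str
    payload: Optional[dict] = None  # edited arguments, or a free-text reply


def resolve_approval(event: Optional[ApprovalEvent], proposed_args: dict) -> dict:
    """Map the outcome of the durable wait to the workflow's next action.

    `event` is None when the wait timed out with no human response.
    """
    if event is None:
        # Silence is handled explicitly; this sketch escalates.
        return {"action": "escalate", "args": proposed_args}
    if event.decision not in DECISIONS:
        raise ValueError(f"unknown decision: {event.decision}")
    if event.decision == "approve":
        return {"action": "execute", "args": proposed_args}
    if event.decision == "edit":
        # The human's edits override the agent's proposed arguments.
        return {"action": "execute", "args": {**proposed_args, **(event.payload or {})}}
    if event.decision == "reject":
        return {"action": "cancel", "args": proposed_args}
    # "respond": the human replied with text instead of a verdict, so the agent re-plans.
    return {"action": "replan", "feedback": (event.payload or {}).get("message", "")}
```

The point of the shape is that the timeout branch is a first-class return value, not an exception someone forgot to catch.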

Question 3

When would you NOT use RAG?

Accepted Answer

RAG is over-prescribed. If the corpus is small and static, I put it in the prompt with caching. If the data is structured, I query it with SQL through a tool, not embeddings. If it changes often, live API calls beat stale vectors. Naive retrieval misses the right chunk often enough that it is not a safe default, so when I do use it the default is hybrid search plus a reranker, and agentic RAG for multi-hop questions. RAG is a mechanism, not the default architecture.
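
To make "hybrid search" concrete, here is a sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion. This is an assumption on my part, since the answer does not name a fusion method. The document IDs are made up, and the reranker stage is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of doc IDs; each hit scores 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on doc ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from full-text search
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from an embedding index
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that both retrievers agree on float to the top, which is the failure naive single-retriever search tends to miss. A reranker would then rescore only the top of `fused`.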

Question 4

A team says they want to add Kafka to their agent system. What do you ask them?

Accepted Answer

Three questions before agreeing. What is the sustained throughput? Below roughly 500K messages a second Kafka is overkill. Do you need event replay from a point in time? Is long retention an actual requirement? If all three are no, which is usual for an agent fanning out tool calls, Redis Streams or NATS JetStream fits with far less operational weight. Kafka 4.0 dropped ZooKeeper so it is simpler than it was, but the tradeoff still favors the lighter options here.

Question 5

How do you prevent prompt injection in an agent that processes user-submitted documents?

Accepted Answer

Prompt injection cannot be solved at the prompt layer, because the model cannot tell document instructions from mine. So I design assuming compromise. Least privilege first: the document reader carries no write or financial credentials. The strongest defense is splitting reader from actor, where the reader processes untrusted content and emits a sanitized summary, and the actor only ever sees that summary. Add approval gates on irreversible actions and an MCP gateway for audit. The goal is to bound the blast radius, not to write a cleverer system prompt.

Question 6

What is durable execution and when do you need it?

Accepted Answer

It runs a function as a sequence of checkpointed steps, journaling each result before moving on. If the worker dies, a new one replays the journal: finished steps return cached results, and execution resumes from the first incomplete step, so it is as if the function never crashed. You need it for work past serverless timeouts, multi-day pauses, or expensive partial failures. The one gotcha: any code outside a step re-runs on every replay, so non-deterministic work must live inside steps.
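
The journaling idea fits in a few lines. This is a toy sketch, not any real engine's API: `DurableRun`, `record`, and `workflow` are hypothetical, and the journal is a dict where a real system would use a database.

```python
from typing import Optional


class DurableRun:
    """Toy journal: each named step's result is recorded before execution moves on."""

    def __init__(self, journal: Optional[dict] = None):
        self.journal = journal if journal is not None else {}

    def step(self, name, fn):
        if name in self.journal:
            return self.journal[name]  # replay: cached result, the step does not re-run
        result = fn()
        self.journal[name] = result  # checkpoint before continuing
        return result


calls = []  # records which side effects actually executed


def record(name, value):
    calls.append(name)
    return value


def workflow(run: DurableRun, crash_after_charge: bool):
    order = run.step("create_order", lambda: record("create_order", "order-1"))
    charge = run.step("charge", lambda: record("charge", "charge-1"))
    if crash_after_charge:
        raise RuntimeError("worker died")  # simulated crash between steps
    return run.step("email", lambda: record("email", f"sent for {order}/{charge}"))
```

Run it once with a crash, then again with the same journal: the second run skips `create_order` and `charge` and executes only `email`. The crash line itself sits outside any step, which is exactly the code that re-runs on every replay.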

Question 7

You inherit an agent that "works on my machine" but fails in production. What is the most likely cause?

Accepted Answer

I would give a diagnostic order, not one guess. First, context window saturation, since a prompt that worked on a short local chat degrades on a long real one. Second, no durable execution, so a transient error mid-loop kills the run. Third, missing approval gates on destructive tools. Fourth, a bloated system prompt where rules get lost. Fifth, a memory layer using an in-memory saver locally but never wired to Postgres. So: check context size, then the agent-loop error logs, then whether retries survive a restart.

Question 8

Why is sub-agent isolation considered more important than context compaction in 2026?

Accepted Answer

Compaction treats the symptom, isolation treats the cause. Summarizing a full context is lossy and degrades as the window fills. Isolation keeps the noise out entirely: a sub-agent does the messy work in its own window and returns a short, clean summary, so the parent stays focused. Most major frameworks default to sub-agent isolation as the primary multi-agent pattern, AutoGen's compressing group manager being the notable exception. The design implication is to ask what to keep out of the context, not how to fit more in. Isolation beats compression.

Question 9

How would you evaluate whether your agent is actually getting better over time?

Accepted Answer

Offline and online answer different questions. Offline, I keep a fixed eval set of representative inputs with expected outputs or rubrics, run on every change so a regression is caught before it reaches production, graded with assertions where deterministic and an LLM judge where subjective. Online, I track success rate, latency, and cost per request, sliced by scenario so regressions do not hide in the average, and I log every prompt and tool call so failures are reconstructable. Most teams go live on vibes because they never built the harness. I feed every production failure back into the eval set as a new test.
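
A sketch of the offline half, assuming deterministic checks only (no LLM judge). The eval cases and `stub_agent` are invented for illustration; the part worth copying is that results are reported per scenario.

```python
from typing import Callable

# Hypothetical eval cases; a real set would come from representative production inputs.
EVAL_SET = [
    {"scenario": "refund", "input": "refund order 12", "check": lambda out: "refund" in out.lower()},
    {"scenario": "refund", "input": "money back for 13", "check": lambda out: "refund" in out.lower()},
    {"scenario": "faq", "input": "what are your hours", "check": lambda out: "9" in out},
]


def run_evals(agent: Callable[[str], str], cases: list) -> dict:
    """Return the pass rate per scenario so a regression cannot hide in the average."""
    passed, total = {}, {}
    for case in cases:
        scenario = case["scenario"]
        total[scenario] = total.get(scenario, 0) + 1
        if case["check"](agent(case["input"])):
            passed[scenario] = passed.get(scenario, 0) + 1
    return {s: passed.get(s, 0) / total[s] for s in total}


def stub_agent(text: str) -> str:
    # Stand-in for the real agent: it only recognises the literal word "refund".
    return "Refund issued." if "refund" in text else "We are open 9 to 5."


report = run_evals(stub_agent, EVAL_SET)
```

Here the stub misses the paraphrased refund request, so the refund slice reads 50% while the overall average would look healthier. A production failure becomes one more dict appended to `EVAL_SET`.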

Question 10

Your agent costs $0.40 per conversation and the business needs it under $0.10. What do you do?

Accepted Answer

Measure first to see where the tokens go, then attack in order. The biggest lever is model tiering: a cheap fast model routes and handles simple turns, the expensive model only does the hard reasoning. Then prompt caching for stable prefixes, context discipline by summarizing and trimming history, and response caching for repeats with a semantic cache for paraphrases. I would also question whether every conversation needs the full loop. Cost is an architecture problem, not a model-price problem.
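
The tiering arithmetic is worth doing before touching anything. The numbers below are illustrative assumptions, not measurements: I assume a conversation handled entirely by the cheap model costs $0.02, against the $0.40 baseline from the question.

```python
def blended_cost(cost_expensive: float, cost_cheap: float, share_cheap: float) -> float:
    """Average cost per conversation when a share of traffic goes to the cheap model."""
    return share_cheap * cost_cheap + (1 - share_cheap) * cost_expensive


def min_cheap_share(cost_expensive: float, cost_cheap: float, target: float) -> float:
    """Smallest share of conversations the cheap model must absorb to hit the target."""
    return (cost_expensive - target) / (cost_expensive - cost_cheap)


# Assumed figures: $0.40 on the expensive path, $0.02 on the cheap path, $0.10 target.
needed = min_cheap_share(0.40, 0.02, 0.10)
at_80_percent = blended_cost(0.40, 0.02, 0.80)
```

Under these assumptions the router has to send roughly 79% of conversations to the cheap model, and at 80% the blended cost lands at about $0.096. If measurement shows the cheap model can only handle half the traffic, tiering alone will not get there and the caching and context-trimming levers have to carry the rest.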

Question 11

When would you use a multi-agent system instead of a single agent with many tools?

Accepted Answer

My default is a single agent with good tools, because multi-agent adds coordination cost. I split only for a concrete reason: context isolation when a subtask is noisy, genuine parallelism via a planner and workers, distinct specializations needing different tools or permissions, or a deliberate critique loop with a separate evaluator. Every handoff is a place context gets lost, so I keep the number of agents as small as the problem allows and never go multi-agent for show.

Question 12

A user backgrounds your chat app mid-response on their phone and comes back. What should happen?

Accepted Answer

They should come back to the response intact, never a broken half-message, because mobile operating systems background apps all the time. I decouple generation from the connection: the server writes every chunk into a Redis stream keyed by an ID, and the browser reads from there, so if the tab is backgrounded the server keeps generating into Redis. On return the client resumes and replays from the buffer. The subtlety is that stop is not disconnect: a separate stop endpoint actually cancels the producer, and it clears the active stream ID only if that ID still matches the stream being stopped.
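
A sketch of the buffer semantics, with an in-memory class standing in for Redis so it runs anywhere. `StreamBuffer` and its method names are hypothetical; a real version would map these onto Redis stream reads and writes.

```python
from typing import Optional


class StreamBuffer:
    """In-memory stand-in for a Redis stream: chunks are appended per stream ID."""

    def __init__(self):
        self.chunks = {}        # stream_id -> list of chunks
        self.active = {}        # chat_id -> current stream_id, or None
        self.cancelled = set()  # stream IDs whose producer must stop

    def start(self, chat_id: str, stream_id: str) -> None:
        self.chunks[stream_id] = []
        self.active[chat_id] = stream_id

    def append(self, stream_id: str, chunk: str) -> bool:
        """The producer writes here whether or not any client is connected."""
        if stream_id in self.cancelled:
            return False  # signal the producer to stop generating
        self.chunks[stream_id].append(chunk)
        return True

    def read_from(self, stream_id: str, offset: int) -> list:
        """A reconnecting client replays everything after the last chunk it saw."""
        return self.chunks[stream_id][offset:]

    def stop(self, chat_id: str, stream_id: str) -> Optional[str]:
        """Explicit stop: cancel the producer, clear the pointer only if it still matches."""
        self.cancelled.add(stream_id)
        if self.active.get(chat_id) == stream_id:
            self.active[chat_id] = None
        return self.active.get(chat_id)
```

The match check in `stop` is what prevents a late stop request for an old response from wiping the pointer to a newer one. A dropped connection never calls `stop` at all, so generation continues.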

Question 13

Mem0, Letta, or Zep for agent memory? How do you choose?

Accepted Answer

Choose a philosophy, not a vendor. Mem0 hides the schema behind add and search calls and stays compact, which makes it my default for user personalization. Letta exposes memory blocks the agent edits itself, for autonomous long-horizon agents where managing memory is part of the behavior. Zep with Graphiti is a temporal knowledge graph with timestamped facts for as-of queries, heavier but right for compliance and time reasoning. And for some domains, especially coding, the best memory is the filesystem or codebase itself, queried rather than embedded.

Question 14

Postgres or MongoDB for a new AI product that needs vector search?

Accepted Answer

My default is Postgres with pgvector: one database for relational data, JSON, and vectors, with ACID and row-level security for multi-tenancy, and it handles tens of millions of vectors on HNSW. That avoids running a separate vector store. MongoDB 8.x has a real pitch with in-database embeddings (auto-embedding via Voyage AI), so I pick it if the data is document-shaped or the team is already on Mongo, though I read vendor benchmarks skeptically. If I ever outgrow pgvector, I add a dedicated vector DB as a sidecar rather than rewriting the data layer.

Question 15

How do you handle an agent that hallucinates a tool call to a function that does not exist?

Accepted Answer

I treat every tool call from the model as untrusted input, like a browser form. A validation layer sits between the model and execution: it checks the tool name is registered and the arguments match the schema before anything runs, so an invented function never executes. On failure I feed a structured error back into context so the model self-corrects on the next turn, bounded by a retry cap that falls back to a graceful error or human handoff. The model proposes, the validation layer disposes.
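
A sketch of that validation layer with a bounded retry loop. The tool registry, the argument schema format, and the `propose`/`execute` callables are all hypothetical; a real system would likely validate against JSON Schema.

```python
# Hypothetical registry: tool name -> required argument names and types.
TOOLS = {
    "get_order": {"required": {"order_id": str}},
    "issue_refund": {"required": {"order_id": str, "amount": float}},
}
MAX_RETRIES = 2


def validate_tool_call(call: dict) -> dict:
    """Check the tool exists and the arguments fit before anything executes."""
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool '{name}'", "available": sorted(TOOLS)}
    for field, expected in TOOLS[name]["required"].items():
        if field not in args:
            return {"ok": False, "error": f"missing argument '{field}'"}
        if not isinstance(args[field], expected):
            return {"ok": False, "error": f"argument '{field}' must be {expected.__name__}"}
    return {"ok": True}


def run_with_validation(propose, execute):
    """propose(feedback) returns the model's next tool call; execute runs a valid one."""
    feedback = None
    for _ in range(MAX_RETRIES + 1):
        call = propose(feedback)
        verdict = validate_tool_call(call)
        if verdict["ok"]:
            return execute(call)
        feedback = verdict  # structured error goes back into context for self-correction
    return {"status": "handoff", "reason": feedback["error"]}
```

Returning the list of available tools in the error is a small touch that tends to help the model correct itself on the next turn rather than invent a second name.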

Question 16

Design a customer support agent that can answer from internal docs and issue refunds.

Accepted Answer

I treat the two capabilities very differently. Answering from docs is read-only and low risk: index into pgvector, retrieve with hybrid search plus a reranker, run it freely in the loop. Refunds are irreversible and financial, so they get the strict treatment: a typed tool with argument validation and a human approval gate above a threshold, plus least-privilege credentials so the doc path cannot touch the refund system. An orchestrator decides answer versus lookup versus act versus escalate, everything is logged, and I verify the refund against the payment system, not the model.

Question 17

You open an SSE stream so your long-running agent does not time out. Does that solve the timeout problem?

Accepted Answer

No. A connection is a thin pipe that must stay alive; a durable job is state held safely. People confuse the pipe being open with the work being safe. SSE does not raise the duration ceiling, it just avoids idle kills. The fix is to split them: the work lives in a durable engine that survives days, the pipe is disposable, and the server is a doorman that starts the job and returns an ID without waiting. The browser reconnects by ID and resumes from saved state. The moment the server waits, you tie the work to the life of a pipe.

Question 18

What does "replay" mean in an agent system, and why would you build it?

Accepted Answer

Replay is reconstructing what an agent did by reprocessing a stored event log, instead of trusting live state. It matters because agents are non-deterministic: without it you only see the final output and cannot debug, reproduce, or safely evaluate a change. You store an append-only log of messages, plans, tool calls, results, and outputs. Then either re-execute the steps with tools and model mocked for determinism, or fold the events through a reducer to rebuild state. Durable execution gives you part of this for free, and every production failure becomes a new test.
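
The reducer variant is the simpler of the two to show. This sketch assumes a made-up event schema (`type`, `call_id`, and so on); the shape of the log is yours to define.

```python
from functools import reduce
from typing import Optional


def apply_event(state: dict, event: dict) -> dict:
    """Pure function: previous state plus one event gives the next state."""
    kind = event["type"]
    if kind == "user_message":
        return {**state, "messages": state["messages"] + [event["text"]]}
    if kind == "tool_call":
        return {**state, "pending": state["pending"] + [event["call_id"]]}
    if kind == "tool_result":
        return {
            **state,
            "pending": [c for c in state["pending"] if c != event["call_id"]],
            "results": {**state["results"], event["call_id"]: event["output"]},
        }
    if kind == "final_output":
        return {**state, "output": event["text"]}
    return state  # unknown event types are ignored


def replay(log: list, upto: Optional[int] = None) -> dict:
    """Rebuild state from the first `upto` events (all of them when upto is None)."""
    initial = {"messages": [], "pending": [], "results": {}, "output": None}
    return reduce(apply_event, log[:upto], initial)
```

Because `apply_event` never mutates its input, you can replay to any prefix of the log and inspect the state the agent had at that moment, for example `replay(log, upto=2)` to see a tool call still pending.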

Question 19

Where do you cache in an LLM agent system, and is caching useful when every prompt looks different?

Accepted Answer

Prompts look unique but are built from reusable blocks: system instructions, tool descriptions, and context are stable, only the user input really varies. So you cache the blocks. Provider caching handles stable prefixes for free but is opaque. Application caching keys a prompt hash to its response. Component caching, the best return, caches tool outputs, retrieval results, and planner decisions. For paraphrases I use semantic caching with a similarity threshold, careful not to over-merge distinct intents. You are caching intent and reusable blocks, not whole strings.
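
A sketch of the semantic cache, assuming embeddings are computed elsewhere and passed in as plain lists of floats. The class name, the linear scan, and the 0.95 threshold are illustrative choices; a real deployment would use a vector index and tune the threshold against its own traffic.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached response when a new query is close enough to an old one."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # higher means fewer, safer hits
        self.entries = []           # list of (embedding, response)

    def get(self, embedding):
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold counts as a miss, to avoid merging distinct intents.
        return best_response if best_score >= self.threshold else None

    def put(self, embedding, response) -> None:
        self.entries.append((embedding, response))
```

Setting the threshold too low is the over-merging risk mentioned above: two requests that merely share vocabulary would get the same cached answer.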

Question 20

Edge runtime or a Node server for your agent backend? What actually changes?

Accepted Answer

Hybrid, with reasoning. Edge is a lightweight isolate: native fetch, ultra-low latency, near-zero cold start, but short-lived with strict limits, no raw sockets, and less predictable long connections. Node is a full runtime with socket control and reliable long-lived streaming and long tasks, at the cost of latency and cold starts. The agent loop, tools, and long conversations want Node or a durable engine; edge is the fast global front for auth, routing, and light token-pumping. The duration ceiling applies regardless of runtime, so do not put the loop on edge for no benefit. One note that may date quickly: Vercel now deprecates standalone Edge Functions in favor of Node on Fluid Compute, so true edge-native agents increasingly live on Cloudflare Workers and Durable Objects.

Question 21

Someone argues an agent is really just a workflow with ambiguity baked in. Do you agree?

Accepted Answer

Largely yes, sharpened: an agent is a deterministic workflow engine where certain nodes delegate the decision to a stochastic policy, the LLM, while all side effects stay in deterministic layers. The LLM is a policy function inside the engine, not the controller. It matters because the LLM breaks same-input-same-output, so you wrap each fuzzy node with validation, constrained outputs, and idempotency. Unlike a plain API, the LLM node can influence control flow, but only inside bounded limits. So harden the shell and treat each LLM node as untrusted and retryable.

Question 22

How do you handle model routing, provider failover, and cost control across an agent's LLM calls?

Accepted Answer

I put an AI gateway in front of every model call instead of hitting providers directly. One endpoint, many models: automatic failover when a provider degrades, retries, response caching, per-key rate limits, and hard spend budgets with real-time cost tracking. On Vercel that is the AI Gateway paired with the AI SDK; for provider neutrality, OpenRouter; on Cloudflare, their AI Gateway. Routing is policy, not app code: a cheap model handles most turns, the frontier model only the hard ones, with A/B and geo routing decided at the gateway.
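
What the gateway does on each call can be sketched as an ordered fallback with a budget check. This is not any vendor's API: `call_with_failover`, `ProviderError`, and the provider tuples are hypothetical stand-ins.

```python
class ProviderError(Exception):
    """Raised when a provider fails or when no provider could serve the request."""


def call_with_failover(providers, prompt: str, budget_remaining: float) -> dict:
    """Try providers in policy order; skip any whose cost exceeds the remaining budget.

    `providers` is an ordered list of (name, call_fn, cost_per_call) tuples.
    """
    errors = []
    for name, call_fn, cost in providers:
        if cost > budget_remaining:
            errors.append(f"{name}: over budget")
            continue
        try:
            return {"provider": name, "text": call_fn(prompt), "cost": cost}
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # degraded provider, fall through to the next
    raise ProviderError("; ".join(errors))
```

The ordering of the list is the routing policy, which is why it belongs in gateway configuration rather than scattered through application code.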

Question 23

Can you run the whole agent on one serverless platform with no servers to manage?

Accepted Answer

Yes. On Cloudflare the agent is a Durable Object: single-threaded stateful compute with its own SQLite, so each session gets durable state, WebSockets, and scheduling for free, and the Agents SDK sits on top. Inference runs at the edge on Workers AI, retrieval on Vectorize, durable background work on Workflows v2, and untrusted code in Sandboxes. It scales to zero and bills on CPU time, so an idle agent costs nothing. The whole lifecycle lives on one platform with no cluster to run.

Question 24

A package in your agent's dependency tree gets compromised overnight. How do you not get burned?

Accepted Answer

Treat dependencies as an attack surface, because they are: in June 2026 the @mastra npm scope was hit by a typosquat remote-access trojan across 140+ packages. The defenses are boring and they work. Pin exact versions with a committed lockfile and never blind-update in CI. Verify provenance (signed publishes) before bumping. Run agents with least privilege so a compromised dependency cannot reach secrets or write paths, isolate any untrusted code in a sandbox, and track a software bill of materials so you know what actually reached production. The blast radius is bounded by what the process can touch, not by how clever the malware is.
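
The pinning rule is easy to enforce mechanically. Below is a sketch of a CI check that flags any dependency in a `package.json` that is not an exact version. The function name and the regex are my own, and the regex is deliberately strict, so it also flags tags, git URLs, and workspace references.

```python
import json
import re

# Matches exact versions such as 1.2.3 or 1.2.3-beta.1; ranges like ^1.2.3 do not match.
EXACT = re.compile(r"^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$")


def unpinned_dependencies(package_json_text: str) -> list:
    """Return 'name@spec' for every dependency that is not pinned to an exact version."""
    manifest = json.loads(package_json_text)
    loose = []
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT.match(spec):
                loose.append(f"{name}@{spec}")
    return sorted(loose)
```

Failing the build when this list is non-empty complements installing strictly from the committed lockfile (`npm ci` on npm), so a freshly published malicious version cannot slip in through a range.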

Agentic AI in production
