You've hired me to review your AI agent stack.
You have 3–5 agents running in production. They mostly work. Sometimes they step on each other. Occasionally a task gets lost. Deployments feel more expensive than they should.
Now you want to scale to 20 agents — and you're worried the system will fall apart.
This is the architecture review I'd give a client in that situation. These patterns come from building and operating a production multi-agent system across hundreds of agent sessions. They're the patterns we converged on after the first versions of that system failed in exactly the ways described below.
The problems aren't AI problems — they're distributed systems problems. CAP theorem still applies. Agent coordination systems are distributed systems whether we acknowledge it or not.
Here's the review I'd write.
Most teams building their first multi-agent system run into the same failure modes. The first version usually works with two or three agents. Then a fourth agent gets added. Tasks start getting duplicated. Messages arrive out of order. Agents overwrite each other's work. Costs spike from retries. Eventually someone realizes the problem isn't the prompts — it's the coordination layer.
1. You're Confusing Orchestration with Coordination
What I'm seeing: Your agents are calling each other directly. Agent A invokes Agent B via API. Agent B returns a result. Agent A uses it. This works for two agents. It doesn't work for twenty.
Direct agent-to-agent calls create a mesh topology. Every agent needs to know about every other agent. Adding a new agent means updating N other agents' configs. Removing an agent breaks whoever was calling it. The system is brittle.
What I'd recommend: Stop thinking about this as orchestration. Start thinking about it as coordination through shared state.
Agents shouldn't call each other. They should communicate through a durable message store. Agent A writes a message. Agent B reads it. No direct coupling. No service discovery. No cascading failures when an agent restarts.
The pattern:
// Agent A: Send directive
await send_message({
  source: "agent-a",
  target: "agent-b",
  message_type: "DIRECTIVE",
  message: "Run tests on PR #142"
});

// Agent B: Poll for directives
const messages = await get_messages({
  sessionId: "agent-b-session",
  message_type: "DIRECTIVE"
});

// Agent B: Process and ACK
for (const msg of messages) {
  await handleDirective(msg);
  await send_message({
    source: "agent-b",
    target: msg.source,
    message_type: "ACK",
    reply_to: msg.id
  });
}
This is asynchronous message passing. It's boring. It's reliable. It scales. Every distributed job system ends up here eventually.
Why this matters: When you hit 20 agents, the mesh topology has 190 possible connections. The message-based topology has 20 producers and 20 consumers, all going through one relay. Complexity is O(N) instead of O(N²).
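The arithmetic is worth spelling out. A minimal sketch (function names are illustrative):

```typescript
// Undirected mesh: every agent can talk to every other agent,
// so the number of links is n choose 2.
function meshConnections(n: number): number {
  return (n * (n - 1)) / 2;
}

// Relay topology: each agent holds exactly one link, to the relay.
function relayConnections(n: number): number {
  return n;
}
```

At 20 agents that's 190 mesh links to configure versus 20 relay links; at 50 agents it's 1,225 versus 50.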
2. You're Building Push When You Need Pull
What I'm seeing: Your orchestrator is pushing tasks to agents. "Agent B, do this task." The agent might be busy. It might be offline. It might be mid-context-rotation. Your orchestrator doesn't know. It just pushes and hopes.
What I'd recommend: Invert the model. Agents pull work. The orchestrator queues tasks. Agents claim them when ready.
// Orchestrator: Queue work
await create_task({
  title: "Fix type errors in auth.ts",
  priority: "high",
  type: "task"
});

// Agent: Claim when ready
const tasks = await get_tasks({ status: "created", limit: 1 });
if (tasks.length > 0) {
  await claim_task({ taskId: tasks[0].id });
  await doWork(tasks[0]);
  await complete_task({ taskId: tasks[0].id });
}
Push commands, pull work. The orchestrator pushes directives (via messages). Agents pull tasks (via queue). Commands are targeted. Work is durable and claimable.
Why this matters: Push-based systems have backpressure problems. If agents can't keep up, tasks pile up in memory and the orchestrator crashes. Pull-based systems are self-regulating. Agents claim work at their own pace. Slow agents fall behind but don't break the system.
3. You Don't Have Idempotency and It's Costing You
What I'm seeing: Agent crashes mid-task. Orchestrator doesn't know if the task completed. It dispatches the task again. Now two agents are working on the same task. One finishes. The other finishes 10 seconds later and overwrites the result.
This is the distributed systems equivalent of a double-charge on a credit card.
What I'd recommend: Transactional task claiming with idempotency keys.
async function claimTask(taskId: string, agentId: string) {
  const taskRef = db.collection('tasks').doc(taskId);
  return db.runTransaction(async (tx) => {
    const task = await tx.get(taskRef);
    // Guard against missing documents as well as already-claimed tasks
    if (!task.exists || task.data().status !== 'created') {
      return null; // Already claimed (or gone)
    }
    tx.update(taskRef, {
      status: 'active',
      claimedBy: agentId,
      claimedAt: now()
    });
    return task.data();
  });
}
If two agents race to claim the same task, one wins, one gets null. The loser moves on. No double-execution.
For API calls that mutate state, add idempotency keys:
await send_message({
  source: "agent-a",
  target: "agent-b",
  message: "Deploy to staging",
  idempotency_key: "deploy-staging-pr-142-attempt-1"
});
If the agent retries, the server sees the same key and returns the cached result instead of creating a duplicate message. Retries become safe.
Why this matters: Without idempotency, retries are dangerous. Agents can't safely retry on failure because they might duplicate work. With idempotency, agents retry aggressively and the system deduplicates automatically.
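On the server side, deduplication can be as simple as a lookup keyed by the idempotency key. A minimal in-memory sketch; the store, message shape, and function names here are illustrative assumptions, not the relay's real API:

```typescript
interface Message {
  id: string;
  source: string;
  target: string;
  message: string;
}

// Maps idempotency key -> the message created on the first attempt.
const seen = new Map<string, Message>();
let nextId = 0;

// On retry with the same key, return the cached message
// instead of creating a duplicate.
function sendMessage(req: {
  source: string;
  target: string;
  message: string;
  idempotencyKey: string;
}): Message {
  const cached = seen.get(req.idempotencyKey);
  if (cached) return cached;
  const msg: Message = {
    id: `msg-${nextId++}`,
    source: req.source,
    target: req.target,
    message: req.message
  };
  seen.set(req.idempotencyKey, msg);
  return msg;
}
```

Calling this twice with the same key yields the same message id, which is exactly what makes aggressive retries safe.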
4. Your Agents Have No Identity
What I'm seeing: Every time an agent starts, it's a blank slate. No memory of previous sessions. No learned patterns. No knowledge of its own failure modes.
This is like hiring a contractor who wakes up with amnesia every morning.
What I'd recommend: Persistent agent identity with program state.
// Agent boot: Load state
const state = await get_program_state({ programId: "builder-01" });
Agents should remember past failures, learned patterns, handoff notes, and unfinished work. Sessions become continuations, not restarts.
Agents that remember get better over time. Agents that forget repeat mistakes indefinitely.
Why this matters: At scale, you can't manually tune every agent for every edge case. You need agents that self-improve from operational experience. Persistent identity is the foundation.
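A sketch of what the underlying state record and boot step might look like. The shape and helper names are assumptions, not a fixed schema:

```typescript
interface ProgramState {
  programId: string;
  learnedPatterns: string[];  // e.g. "retry flaky suite once before reporting failure"
  handoffNotes: string;       // what the previous session left for this one
  unfinishedWork: string[];   // task ids to resume
}

// In-memory stand-in for a durable store.
const stateStore = new Map<string, ProgramState>();

// Boot: resume prior identity if it exists, otherwise start fresh.
function loadProgramState(programId: string): ProgramState {
  return (
    stateStore.get(programId) ?? {
      programId,
      learnedPatterns: [],
      handoffNotes: "",
      unfinishedWork: []
    }
  );
}

// Checkpoint: persist identity before shutdown or rotation.
function saveProgramState(state: ProgramState): void {
  stateStore.set(state.programId, state);
}
```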
5. You're Not Managing Context Windows
What I'm seeing: Your agents work great for the first 20 minutes. Then they start forgetting earlier instructions. Or they hallucinate. Or they stop reasoning clearly and you have to restart them.
You're hitting context window limits. Every tool call, every file read, every response consumes context. At some point the model starts compressing or dropping earlier messages. Your agent doesn't know this is happening.
What I'd recommend: Treat context as a finite resource. Monitor it. Rotate before it runs out.
Agents should checkpoint their state before context fills up: what they were working on, what they learned, what's left to do. Then a fresh session boots, loads that checkpoint, and continues. No lost work. No degraded reasoning.
// When approaching limits, checkpoint and rotate
if (contextUtilization > 0.7) {
  await update_program_state({
    programId: "builder-01",
    contextSummary: {
      lastTask: { taskId, title, outcome: "in_progress", notes: "..." },
      handoffNotes: "PR #142 ready for review. Tests passing.",
      activeWorkItems: ["review-pr-142", "deploy-staging"]
    }
  });
  // Fresh session boots, reads this state, continues
}
Why this matters: Context degradation is silent. The agent doesn't error — it just gets worse. By the time you notice, it's already made bad decisions with degraded context. Proactive rotation keeps every session operating at peak reasoning quality.
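That utilization check needs a measurement behind it. If your provider reports token usage per response, use that directly; failing that, a characters-per-token heuristic gives a rough estimate. A sketch, with the caveat that the 4-characters-per-token ratio is an approximation for English text, not a guarantee:

```typescript
// Rough heuristic: ~4 characters per token for English text.
// Prefer the model's tokenizer or the token counts returned
// in API responses when available.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Fraction of the context window consumed by the transcript so far.
function contextUtilization(transcript: string[], contextLimit: number): number {
  const used = transcript.reduce((sum, turn) => sum + estimateTokens(turn), 0);
  return used / contextLimit;
}
```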
6. You're Treating This Like an AI Problem
What I'm seeing: You're debugging agent failures by reading prompts. You're tuning temperature and top_p. You're switching models hoping the next one is better.
Your actual problems look like this:
- Tasks getting lost (queue management problem)
- Agents fighting over the same work (locking problem)
- Messages not getting delivered (reliability problem)
- State getting corrupted during crashes (durability problem)
- Costs spiraling out of control (resource allocation problem)
These are distributed systems problems. The solutions are decades old.
What I'd recommend: Stop tweaking prompts. Start applying distributed systems patterns.
CAP theorem means choosing deliberately. Under a network partition you can keep consistency or availability, not both. Partitions will happen, so make that choice explicitly rather than discovering it during an outage.
Graceful degradation matters. When an agent dies mid-task, work shouldn't disappear. It should timeout and return to the queue. Another agent claims it. Work continues.
const CLAIM_TIMEOUT = 15 * 60 * 1000; // 15 minutes

async function reclaimStaleTasks() {
  const staleTasks = await db.collection('tasks')
    .where('status', '==', 'active')
    .where('claimedAt', '<', now() - CLAIM_TIMEOUT)
    .get();

  for (const task of staleTasks.docs) {
    await db.runTransaction(async (tx) => {
      // Re-check inside the transaction: another worker may have
      // already reclaimed or completed this task.
      const fresh = await tx.get(task.ref);
      if (fresh.data().status === 'active') {
        tx.update(task.ref, {
          status: 'created',
          claimedBy: null,
          claimedAt: null
        });
      }
    });
  }
}
Run this on a cron. Dead agents' tasks get reclaimed automatically.
Observability matters. If you can't measure it, you can't debug it. Instrument every state transition. When a task goes missing, you can trace it: created → claimed → (what happened here?) → timeout → reclaimed.
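As a sketch, a state transition log can be a few lines; the event shape and function names here are assumptions, not a prescribed schema:

```typescript
interface TaskEvent {
  taskId: string;
  transition: string;  // e.g. "created -> active"
  actor: string;       // which agent or component caused the transition
  at: number;          // epoch millis
}

// In-memory stand-in for a durable event store.
const eventLog: TaskEvent[] = [];

function recordTransition(taskId: string, from: string, to: string, actor: string): void {
  eventLog.push({ taskId, transition: `${from} -> ${to}`, actor, at: Date.now() });
}

// Tracing a lost task: replay its transitions in order.
function traceTask(taskId: string): string[] {
  return eventLog.filter((e) => e.taskId === taskId).map((e) => e.transition);
}
```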
The Architecture I'd Recommend
If I were architecting this from scratch today, here's what I'd build:
Message relay for asynchronous communication. Agents send directives, queries, results, and ACKs. Messages have TTLs. Dead letter queue for undeliverable messages.
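A TTL sweep that feeds the dead letter queue might look like this; the message shape and names are illustrative:

```typescript
interface RelayMessage {
  id: string;
  target: string;
  body: string;
  expiresAt: number; // epoch millis
}

const pending: RelayMessage[] = [];
const deadLetter: RelayMessage[] = [];

// Periodic sweep: expired messages move to the dead letter queue
// for inspection instead of silently disappearing.
function sweepExpired(now: number): void {
  for (let i = pending.length - 1; i >= 0; i--) {
    if (pending[i].expiresAt <= now) {
      deadLetter.push(pending[i]);
      pending.splice(i, 1);
    }
  }
}
```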
Task queue with claim-based ownership. Orchestrator queues work. Agents pull and claim. Transactions prevent double-claiming.
Program state store for persistent agent identity. Each agent writes context summaries, learned patterns, and handoff notes. Next session reads and resumes.
Telemetry pipeline with structured events. Every state transition logs to a durable store. Aggregate into metrics: cost per task, success rate, latency p50/p95.
Rate limiting at the API layer. Prevents runaway agents from burning through rate limits or budgets.
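A common implementation is a token bucket per agent. A sketch, with class and parameter names as assumptions:

```typescript
// Token bucket: each agent gets `capacity` calls, refilled at `refillPerSec`.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(private capacity: number, private refillPerSec: number, now: number) {
    this.tokens = capacity;
    this.last = now;
  }

  // Returns true if the call is allowed; false means back off or queue.
  tryAcquire(now: number): boolean {
    const elapsed = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```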
Graceful degradation via timeouts and circuit breakers. Stale tasks get reclaimed. Failed operations escalate instead of retrying infinitely.
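A minimal circuit breaker sketch; thresholds and names are illustrative:

```typescript
// Trips open after `threshold` consecutive failures; while open,
// calls are rejected so a failing dependency isn't hammered with retries.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(private threshold: number, private cooldownMs: number) {}

  canCall(now: number): boolean {
    return now >= this.openUntil;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(now: number): void {
    this.failures++;
    if (this.failures >= this.threshold) {
      // Open the breaker: escalate instead of retrying.
      this.openUntil = now + this.cooldownMs;
      this.failures = 0;
    }
  }
}
```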
This architecture reduces coordination bugs, prevents duplicate work, and lowers token spend from retries and failed agent runs.
None of this is novel. It's message queues + distributed locks + persistent state + telemetry. The same patterns you'd use for a job processing system. The difference is the consumers are LLM agents instead of workers.
What This Costs to Build
If you already have agents working, retrofitting this architecture is about a month of focused eng time:
- Week 1: Message relay + task queue with transactional claiming
- Week 2: Program state, context handoff, graceful degradation
- Week 3: Context management, rotation mechanics, checkpoint/resume
- Week 4: Telemetry, observability, rate limiting
The payoff: you can scale from 5 agents to 50 without rewriting the coordination layer. The patterns hold.
If You're Running Agents Today
- Audit your current system. How many direct agent-to-agent calls do you have? How many places can two agents claim the same work? How much state is lost when an agent crashes?
- Start with the task queue. Replace push-based task dispatch with pull-based claiming. Transactional. Idempotent. This alone will eliminate most coordination bugs.
- Add message relay second. Replace direct agent calls with asynchronous messaging. Agents decouple. System becomes more resilient.
- Add persistent identity third. Agents remember across sessions. They learn from failures. They write handoff notes. New sessions pick up where old ones left off.
- Instrument everything last. Once the architecture is solid, add telemetry. You can't optimize what you can't measure.
The patterns are known. The libraries exist. It's execution, not invention.
If you're building a multi-agent system and this sounds familiar, I run architecture reviews for teams moving from prototype to production. The patterns in this article come from a system I built and operate across hundreds of agent sessions.
The hard part isn't knowing the patterns. It's getting them right in your stack, with your constraints, at your scale.
Christian Bourlier builds multi-agent systems that argue with each other productively. More at rezzed.ai.