I Broke My Blog Review System Trying to Beat Session Limits
Context: This post assumes you've read how I built the multi-agent blog review system. If you haven't, that post explains the architecture before this one breaks it.
I broke my entire blog review system this week trying to beat session limits.
Not because tokens cost money. Because Claude Code has session limits - I kept hitting the message cap before finishing my work.
Here's what that means in practice: Claude Code limits how many messages you can send in a session. Each time you ask it something, that's a message. Each time an agent spawns, that's a message. Each file read, each grep search - all messages. When you hit the limit (usually around 100-150 messages depending on context), your workflow just stops mid-task.
Every agent spawn, every file read, every grep search consumes tokens. When your prompts eat 15,750 tokens before any actual work happens, you burn through your session budget fast.
So I thought: "If I can trim these prompts by 1,100 tokens, I'll get more messages per session. More reviews. More work done."
The logic was sound. The execution broke everything.
The Setup
Context for what I broke:
- System: Multi-agent blog review system (Orchestrator + 4 specialist critics)
- Agent files: .claude/agents/blog-authenticity-guardian.md, .claude/agents/blog-structure-editor.md, .claude/agents/blog-skeptical-reader.md, .claude/agents/blog-technical-educator.md, .claude/agents/blog-orchestrator.md
- Total prompt size: ~15,750 tokens across all 5 agents
- Context window: 200,000 tokens (Claude Sonnet 4.6)
- Session limits: Hit message cap regularly during multi-agent reviews (typically 100-150 messages per session)
- My brain: "If I reduce prompt overhead, I'll get more messages per session"
The constraint felt real. Each blog review spawned 4-5 agents, each with 2,000-3,000 token prompts. Add file reads, grep searches, and context - I'd burn 50+ messages per review. When you hit the session limit, your workflow just... stops.
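For anyone wondering where the ~15,750 figure comes from: I didn't run a real tokenizer, just a word-count heuristic (roughly 0.75 words per token for English prose). A minimal sketch - the glob pattern matches my agent file naming, and the 0.75 ratio is a rule-of-thumb assumption, not an exact count:

```python
from pathlib import Path

# Rough rule of thumb for English prose: ~0.75 words per token,
# so tokens ≈ words / 0.75. An approximation, not a tokenizer.
WORDS_PER_TOKEN = 0.75

def estimate_tokens(text: str) -> int:
    """Estimate token count from whitespace-delimited word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def total_prompt_overhead(agent_dir: str = ".claude/agents") -> int:
    """Sum estimated tokens across every agent prompt file."""
    return sum(
        estimate_tokens(p.read_text())
        for p in Path(agent_dir).glob("blog-*.md")
    )

if __name__ == "__main__":
    print(f"~{total_prompt_overhead():,} tokens of prompt overhead")
```

Crude, but close enough to see that the agent prompts were a real chunk of every spawn.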
The Temptation
The optimization seemed obvious. Every agent review consumed messages:
- Orchestrator reads the blog post (1 message)
- Spawns 3 critics in parallel (3 messages)
- Each critic reads files and returns feedback (6-9 messages)
- Spawns Technical Educator for revisions (1 message)
- Educator reads files and proposes changes (3-5 messages)
- Round 2 reviews (another 3-6 messages)
That's 17-25 messages per blog post review. With ~15,750 tokens of prompt overhead spread across the five agents, I was front-loading massive context before any actual analysis happened.
Here's my thinking: Each message I send to Claude can carry some context. If my agent prompts are smaller, each message has more room for actual work - the blog post content, the critic feedback, the revision drafts. Smaller prompts = more work per message = more blog posts reviewed per session.
The math seemed clear.
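Here's the back-of-envelope version of that math, using the numbers above (the percentages are just straight division - the per-message accounting is my simplification of how Claude Code actually packs context):

```python
CONTEXT_WINDOW = 200_000   # Claude Sonnet 4.6 context window (tokens)
PROMPT_OVERHEAD = 15_750   # total across all 5 agent prompts
SAVINGS = 1_100            # what my condensing pass would trim

# Prompt overhead as a share of the context window, before and after
before = PROMPT_OVERHEAD / CONTEXT_WINDOW             # ≈ 7.9%
after = (PROMPT_OVERHEAD - SAVINGS) / CONTEXT_WINDOW  # ≈ 7.3%

# The part I conveniently ignored: what the savings actually reclaim
saved_share = SAVINGS / CONTEXT_WINDOW                # ≈ 0.55%

print(f"overhead before: {before:.1%}")
print(f"overhead after:  {after:.1%}")
print(f"window reclaimed: {saved_share:.2%}")
```

Half a percent of the window. That's what I broke the system for.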
So I started condensing. "Just remove verbose examples here. Tighten this guidance there. These agents don't need ALL this instruction, right?"
What I Did
I opened up the agent prompt files and started cutting:
From .claude/agents/blog-skeptical-reader.md, I removed:
- The detailed 6-point evaluation framework (Searchability, Specificity, Mental Model Transfer, Cognitive Load, Curse of Knowledge, Journey Documentation)
- The explanations of why each dimension matters
- The example rubrics showing good vs. bad posts
From .claude/agents/blog-technical-educator.md, I removed:
- The "Debugging Detective Stories" framework
- The "Aha! Moment" structure guidance
- The Julia Evans and Josh Comeau references (concrete examples to emulate)
- The mental model transfer explanations
I condensed verbose output format examples into concise headers. I tightened sentences. I deleted "redundant" guidance.
Commit efb9345: Reduced total prompt tokens from ~15,750 to ~14,650.
I'd saved 1,100 tokens (about 825 words). My optimization was complete.
The numbers looked great. The system was broken.
How I Discovered It
Two days after the commit, I ran a blog post through the review system. I'd been working on a debugging story about hydration errors in Next.js 16 - the kind of post my system had been catching gaps in reliably.
The review process started normally. Orchestrator spawned the three critics in parallel. Authenticity Guardian flagged a preachy opening. Structure Editor suggested reorganizing the mental model section. Skeptical Reader asked for more specific error messages.
All normal. I spawned the Technical Educator for revisions.
Then I saw the draft it generated.
Instead of writing a debugging post that started with the error and traced my journey to the fix, it produced an entirely different post: an invented structure with a heavy tutorial focus. And that was the good outcome. At worst, it hallucinated an entirely different conversation.
What "Hallucinating Structure" Means
When I say the agent was "hallucinating structure," I mean it was inventing a post format that doesn't match my blog's voice at all. Here's the difference:
What my Technical Educator is supposed to generate (from .claude/agents/blog-technical-educator.md before my optimization):
Your posts are debugging detective stories. Start with the error message.
Show what you tried that didn't work. Document the "aha!" moment. Then
explain the mental model that makes it obvious in hindsight. Think: Julia
Evans blog posts, not MDN documentation.
Structure:
1. Start with the problem (relatable, specific)
2. Show the debugging journey (wrong turns and all)
3. Explain the mental model (why it works, not just how)
4. End with the solution (what finally worked)
Use conversational tone. Be honest about confusion. No "in this post,
I'll show you how" or other tutorial language.
What it generated after my optimization (what I actually saw):
Dramatic section headers like "Breaking Point #1" and "Breaking Point #2". Content marketing structure where each section is a "problem" followed by an "explanation." Formal, distant voice ("The primary issue manifests when..."). No confusion shown, no wrong turns documented, no debugging story - just clean instruction.
That's hallucinating structure: the agent invented a format that doesn't exist in my blog's voice because I'd removed the guidance that defined what my voice actually is.
What Went Wrong
I looked at the actual changes I made to figure out what I'd removed.
The Missing Evaluation Dimensions
Here's what my Skeptical Reader prompt looked like before my optimization:
## Evaluation Dimensions
You evaluate posts against 6 dimensions:
1. **Searchability**: Does the title include specific phrases people Google?
- ✅ "Error: Cannot read property 'map' of undefined" (searchable)
- ❌ "Understanding Async/Await in JavaScript" (generic)
2. **Specificity**: Are version numbers, actual error messages, and real code included?
- ✅ "Next.js 16.1.1", "Cannot read property 'map' of undefined"
- ❌ "Next.js 16", "an error occurred"
3. **Mental Model Transfer**: Is the WHY explained before the HOW?
- ✅ "I finally understood this was a timing issue..." (explains why)
- ❌ "Add async/await to handle promises" (just says how)
4. **Cognitive Load**: Does complexity progress gradually?
- ✅ Simple version first, then nuance, then edge cases
- ❌ Jump straight to complex implementation
5. **Curse of Knowledge**: Would past-Luke understand this?
- ✅ "I thought X, but actually Y because..." (bridges gap)
- ❌ "This is obvious..." (assumes knowledge)
6. **Journey Documentation**: Is the debugging process shown?
- ✅ "I tried X, which didn't work. Then I tried Y..."
- ❌ Just the solution, no process
After my optimization, I'd condensed this to:
## Evaluation Dimensions
Check for:
- Specificity (versions, error messages, real code)
- Examples for every concept
- Clear explanations
Cognitive Load and Curse of Knowledge weren't just combined - they were gone. Searchability was missing. Journey Documentation had vanished. The agent couldn't catch those failure modes anymore because I'd deleted the concepts from its prompt entirely.
The Removed Framework Guidance
Here's the actual Technical Educator framework I removed:
Before (commit 56ad54a - worked):
## Your Mission
You create blog content that documents Luke's journey. You write for "past Luke" - the version of him from 3-6 months ago who was struggling with the same problems. Your posts are specific and story-driven. Maximum helpfulness comes from sharing Luke's actual experience in detail, not from prescriptive advice.
## What Journey Posts Actually Look Like
Your posts are debugging detective stories. Start with the error message. Show what you tried that didn't work. Document the "aha!" moment. Then explain the mental model that makes it obvious in hindsight.
Think: Julia Evans blog posts, not MDN documentation.
### Structure Template
I kept hitting [specific error]. Here's the message:
[actual error message]
I tried:
1. [first thing I tried] - didn't work because [reason]
2. [second thing I tried] - didn't work because [reason]
What finally worked: [solution].
Here's the mental model: [explanation of why it works]
(Past-me from 6 months ago would have never caught this.)
## Mental Model Transfer
Always explain WHY before HOW. Build conceptual understanding first, then show implementation.
Examples:
- ❌ "Add async/await to your function" (just how)
- ✅ "This is a timing issue. The data arrives asynchronously, so we need to wait for it. Here's how: add async/await..." (why first, then how)
After (commit efb9345 - broken):
## Your Mission
Create blog posts that document Luke's journey. Write for past-Luke who was struggling with similar problems.
## Post Structure
Start with the error message. Show what you tried. Explain what worked.
## Mental Model Transfer
Explain why before how.
I removed:
- "Debugging detective stories" (the framing)
- "Aha! moment" (the emotional arc)
- "Mental model that makes it obvious" (the purpose)
- Julia Evans reference (the concrete example to emulate)
- The entire structure template with examples
- The specific WHY-before-HOW examples
- The conversational tone guidance
I kept the surface instruction ("show debugging journey") but deleted all the guidance about how and why.
The Defensive Repetition
The funniest part? I'd actually already tried this optimization.
Commit 1b076ef from weeks earlier: "Edits to multi agent blog post generation to try and reduce hallucination."
I'd noticed the agents drifting. I'd added back framework guidance and evaluation dimensions to fix it. Then I deleted them again trying to save tokens.
I'd literally undone my own fix because I'd forgotten why I made it.
What Actually Worked
After rolling back, I did a more surgical optimization:
I condensed only the output format templates (verbose example posts → concise headers like "example title / example slug"). I kept every single evaluation dimension. I kept all the framework guidance about debugging stories and mental models.
Savings: ~400 tokens instead of 1,100. Hallucinations: Zero. Time wasted: 5 hours I'll never get back.
The system works again. My agents are catching gaps in posts, preserving my voice, and helping me improve. I just didn't save as many tokens as I wanted.
So, Three Things I'm Carrying Forward
Prompt tokens are not like code bloat. In regular code, every unused import or redundant function adds maintenance burden. But in AI prompts, "redundant" guidance is often the difference between reliable behavior and genre drift. My agents needed that repeated emphasis on "debugging detective stories" and "learn in public." The repetition creates a stronger pattern in the context.
Session limits aren't always solved by prompt optimization. I was trying to squeeze more work into limited sessions by trimming prompts. But the real constraint wasn't prompt tokens - it was the number of agent spawns and file operations per review (plus the frankly stingy session time Anthropic gives you). The better approach would have been fewer review rounds, caching responses, or batching multiple posts in one session.
Trust your systems. I had a working multi-agent review system. It caught gaps in my posts. It preserved my voice. It helped me improve. Then I broke it trying to make it "more efficient." The real efficiency would have been running more reviews with the working system, not optimizing away the safeguards that made it work.