Three months ago I started building a microservices scaffold generator with Claude Code. The first week was electric. I described what I wanted, the agent generated Onion Architecture boilerplate, wired Spring Boot modules, wrote tests. Felt like a 10x multiplier.
By week two, the multiplier was gone.
The agent kept violating the Outbox pattern I had already explained. It placed JPA annotations inside the domain layer. It generated use cases as inner classes inside controllers. Every session started with me re-explaining the same architectural constraints — the same constraints I had explained the day before.
The problem was not the model. The problem was that my project had no persistent structure for an AI to understand. I was treating the agent like a junior developer who could read my mind. It cannot. No model can.
Here is what happens when you give Claude Code (or Cursor, or any coding agent) a project without explicit architectural context:
```text
Session 1: "Add a cancel order use case"
  Agent: adds cancelOrder() method directly in OrderController,
         uses JPA annotations in the domain layer,
         writes no test
  → You spend 20 minutes fixing it

Session 2: "Add a cancel order use case" (new session, same project)
  Agent: makes the same mistakes again
  → You spend another 20 minutes
```
Each session is a blank slate. The agent has no memory of your architectural decisions. It does not know that your project uses Onion Architecture. It does not know that your domain layer must be framework-free. It does not know that you follow TDD and write the test first.
The Stack Overflow 2024 survey found that 65% of developers report AI tools losing critical context during refactoring. Qodo's research shows that when context is manually selected, AI loses relevance in 54% of cases — but when context is architecturally structured, that number drops to 16%.
The difference is not better prompting. It is better project design. I explored the foundations of this shift in Vibe Coding is Engineering — Not Magic.
Most tutorials treat CLAUDE.md as a configuration file. "Put your preferred coding style here. List your tech stack. Add a few rules."
That misses the point entirely.
A CLAUDE.md is not documentation for the AI. It is an architectural constraint — the same way an ArchitectureTest.java enforced by ArchUnit is a constraint. The difference is that ArchUnit runs during the build and fails it after the violating code has already been written, while CLAUDE.md runs at generation time and prevents the agent from writing code that violates your architecture in the first place.
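To make the parallel concrete, here is a dependency-free sketch of the idea behind such a check. A real build would use ArchUnit's fluent API; the `@Entity` below is a local stand-in for `jakarta.persistence.Entity` so the sketch is self-contained, and the class names are illustrative, not the Kit's actual code.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class LayerCheck {
    @Retention(RetentionPolicy.RUNTIME)
    @interface Entity {}                    // stand-in for a JPA annotation

    record Order(String id) {}              // clean: no framework metadata

    @Entity
    static class AnnotatedOrder {}          // violation: JPA leaked into the domain

    // The invariant the CLAUDE.md states in prose, expressed as a check:
    // domain types must carry no framework annotations at all.
    static boolean isFrameworkFree(Class<?> domainType) {
        return domainType.getAnnotations().length == 0;
    }

    public static void main(String[] args) {
        System.out.println(isFrameworkFree(Order.class));          // true
        System.out.println(isFrameworkFree(AnnotatedOrder.class)); // false
    }
}
```

The difference in timing is the point: a check like this tells you the rule was broken; the CLAUDE.md keeps the agent from breaking it.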
Here is a section from the CLAUDE.md of the Mars Enterprise Kit Lite — an open-source Order microservice with Onion Architecture, Kafka, and PostgreSQL:
```markdown
## Architecture

### Onion Architecture (Single Module, Package-Based Layers)

src/main/java/io/mars/lite/
├── domain/                            # Domain Layer — NO JPA, NO Kafka, NO Web
│   ├── Order.java                     # Aggregate Root (record)
│   ├── OrderRepository.java           # Port (interface)
│   └── usecase/
│       └── CreateOrderUseCase.java    # @Service class
│
├── infrastructure/                    # Infrastructure Layer — implements domain ports
│   ├── persistence/
│   │   └── OrderRepositoryImpl.java   # Adapter
│   └── messaging/
│       └── OrderCreatedPublisher.java
│
└── api/                               # API Layer — HTTP entry points
    └── OrderController.java
```
Notice what this does. It is not telling the agent "we use Onion Architecture." It is showing the agent the exact package structure, the naming conventions, which annotations are allowed in which layer, and the relationship between ports and adapters. The agent reads this before it writes a single line of code.
But the constraint section is where the real value lives:
```markdown
## Code Generation Guidelines

**DO:**
- Use Java records for immutable domain objects
- Follow TDD strictly: test FIRST, then implement
- Use ports (interfaces) in domain/, adapters in infrastructure/

**DON'T:**
- Add JPA, Kafka, or Spring Web annotations in domain/ package
- Let domain/ depend on infrastructure/ or api/
- Write production code before a failing test
- Implement Outbox pattern (out of scope for Lite)
```
This is not style guidance. This is a set of invariants. When the agent reads "DON'T add JPA annotations in domain/", it treats that as a hard boundary. I have measured the difference: without these constraints, Claude Code violates the Onion Architecture in roughly 3 out of 5 sessions. With them, violations drop to near zero.
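What code that satisfies these invariants looks like can be sketched in a few lines. This is an illustrative, self-contained sketch, not the Kit's actual `Order.java`: an immutable record aggregate and a port interface, with no framework imports anywhere.

```java
public class DomainSketch {
    enum OrderStatus { NEW, CANCELLED }

    // Aggregate root as a record: immutable, framework-free.
    // State changes return a new instance instead of mutating.
    record Order(String id, OrderStatus status) {
        Order cancel() { return new Order(id, OrderStatus.CANCELLED); }
    }

    // Port: the domain owns the interface; infrastructure/ supplies the adapter.
    interface OrderRepository {
        void save(Order order);
    }

    public static void main(String[] args) {
        Order cancelled = new Order("o-1", OrderStatus.NEW).cancel();
        System.out.println(cancelled.status()); // CANCELLED
    }
}
```

Note what is absent: no `@Entity`, no `@Table`, no Spring Web imports. The persistence adapter in `infrastructure/` maps this record to whatever JPA needs.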
```mermaid
graph TD
    subgraph L1["Layer 1 — Constraints"]
        C["CLAUDE.md<br/>Architecture rules<br/>Layer invariants<br/>Naming conventions<br/>Test strategy"]
    end
    subgraph L2["Layer 2 — Knowledge Base"]
        D[".mars/docs/<br/>Structured architectural decisions<br/>Pattern rationale<br/>Trade-off documentation"]
    end
    subgraph L3["Layer 3 — Capabilities"]
        S[".claude/skills/<br/>chaos-phantom-event/SKILL.md<br/>chaos-testing/SKILL.md<br/>exploratory-testing/SKILL.md"]
    end
    subgraph L4["Layer 4 — Workflows"]
        W[".claude/commands/<br/>generate-prp.md<br/>execute-prp.md<br/>Git Worktree isolation"]
    end
    A["AI Agent<br/>(Claude Code)"] -->|"reads at session start"| L1
    A -->|"loads on demand"| L2
    A -->|"invokes when needed"| L3
    A -->|"executes to deliver features"| L4
    L1 -->|"defines boundaries for"| L2
    L2 -->|"informs"| L3
    L3 -->|"operates within"| L4
    L4 -->|"produces"| OUT["Merge-ready code<br/>First attempt"]
    style L1 fill:#0f3460,stroke:#e94560,color:#eaeaea
    style L2 fill:#0f3460,stroke:#4ecca3,color:#eaeaea
    style L3 fill:#0f3460,stroke:#533483,color:#eaeaea
    style L4 fill:#0f3460,stroke:#4ecca3,color:#eaeaea
    style A fill:#16213e,stroke:#e94560,color:#eaeaea
    style OUT fill:#16213e,stroke:#4ecca3,color:#eaeaea
    style C color:#eaeaea
    style D color:#eaeaea
    style S color:#eaeaea
    style W color:#eaeaea
```
Here is where most developers stop. They write a good CLAUDE.md, see immediate improvement, and think they are done.
They are not. CLAUDE.md is necessary but insufficient — the same way a README is necessary but insufficient for onboarding a new developer. A CLAUDE.md tells the agent what to do. A full context architecture tells the agent why, how, and what has been decided before.
The Mars Enterprise Kit uses a layered context architecture:
```text
mars-enterprise-kit-lite/
├── CLAUDE.md                                      # Layer 1: Constraints
├── .mars/
│   └── docs/
│       └── mars-enterprise-kit-context-lite.md    # Layer 2: Knowledge Base
└── .claude/
    ├── skills/
    │   ├── chaos-phantom-event/SKILL.md           # Layer 3: Capabilities
    │   ├── chaos-testing/SKILL.md
    │   └── exploratory-testing/SKILL.md
    └── commands/
        ├── generate-prp.md                        # Layer 4: Workflows
        └── execute-prp.md
```
Each layer serves a distinct purpose. Let me walk through them.
### Layer 1: Constraints (CLAUDE.md)

I covered this above. The CLAUDE.md defines constraints: architecture rules, naming conventions, testing strategy, dependency directions. It is loaded into context at the start of every session, automatically. Think of it as the agent's working memory for "how this project works."
### Layer 2: Knowledge Base (.mars/docs/)

The .mars/docs/ directory holds the project's persistent knowledge base — structured documentation of architectural decisions. In the Kit Lite, this is a single comprehensive context document that covers the core architecture, patterns, and trade-offs. The idea is borrowed from Architecture Decision Records (ADRs): each significant decision gets documented with its context, the decision itself, consequences, and alternatives considered.
Here is the critical difference from regular documentation: these documents are written for machines, not just humans. They follow a consistent structure so the agent can parse them programmatically. When the agent needs to understand why the project uses the Transactional Outbox pattern instead of direct Kafka publishing, the rationale is documented. When it needs to understand why Redpanda is used locally instead of a full Kafka cluster, the trade-off analysis is there.
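The fixed shape is what makes the documents machine-parseable. As a sketch of that structure — the numbering, wording, and exact template here are illustrative, not copied from the Kit — an Outbox decision might look like this:

```markdown
# ADR-NNN: Transactional Outbox for Event Publishing

## Status
Accepted

## Context
Kafka and PostgreSQL are two separate systems. Publishing directly inside the
transaction creates the Dual Write problem: the event can be sent while the
database transaction rolls back.

## Decision
Domain code raises events; an outbox table persists them in the same
transaction as the aggregate. A relay publishes them to Kafka afterwards.

## Consequences
- At-least-once delivery; consumers must be idempotent.
- No phantom events: an event exists only if the transaction committed.

## Alternatives Considered
- Direct Kafka publish inside the transaction: rejected (Dual Write).
```

Because every decision follows the same headings, the agent can locate the rationale for a pattern without scanning the whole codebase.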
The impact is measurable. Without structured decision documentation, AI tools struggle with project-specific patterns:
**Developer:** "Add a new event when order is cancelled"

**AI (without context):**

```java
@PostMapping("/orders/{id}/cancel")
public void cancel(@PathVariable String id) {
    orderRepository.updateStatus(id, "CANCELLED");
    // WRONG: Forgot Outbox pattern
    // WRONG: Direct repository access (violates Onion Architecture)
    // WRONG: No domain event raised
}
```

**AI (with context, reads structured decision docs):**

```java
public void cancel() {
    this.status = OrderStatus.CANCELLED;
    DomainEventPublisher.raise(new OrderCancelledEvent(this.id));
}
// Outbox automatically handles publishing
```
If you want to understand why the Outbox pattern is necessary before reading the ADR, the Event-Driven Architecture article covers the event vs command distinction that motivates it.
In my experience, the difference is roughly 90% pattern-compliant code with structured context vs. 40% without. The 50-percentage-point gap is the value of persistent architectural context.
### Layer 3: Capabilities (.claude/skills/)

A Skill, in the context of AI coding agents, is a persistent operational instruction that the agent loads on demand when it needs to perform a specific complex task. It is not documentation. Documentation describes what a system does. A Skill tells the agent how to execute an operation step by step — with exact commands, expected outputs, error handling, and the order in which things must happen.
The distinction matters. If you write a README that says "this project uses chaos testing to validate the Dual Write problem," an agent reads that and knows the concept exists. But it has no idea how to actually run the test. A Skill gives the agent the full execution sequence: which services to start, which profile to activate, which endpoint to call, what the expected failure looks like, and how to verify the result. The agent goes from "I know this exists" to "I can execute this autonomously."
In practice, platforms like Claude Code implement Skills as Markdown files stored in .claude/skills/. Each Skill is a folder containing a SKILL.md with YAML frontmatter (name, description, invocation hints) and a body with step-by-step instructions. The agent discovers available Skills at session start and loads the relevant one into its context when the task requires it. Cursor follows a parallel pattern with .cursor/rules/ — different file format, same underlying principle: structured, persistent instructions that the agent reads and executes rather than general knowledge it tries to reason about.
Anthropic has pushed this further by making Agent Skills an open standard. The idea is that a well-written Skill should be portable — usable by any agent that implements the spec. Whether you are running Claude Code, Cursor, or another coding agent, the Skill format is the same: Markdown instructions that the agent loads, interprets, and follows. This is not theoretical. It is how the Mars Enterprise Kit Lite's chaos testing works in practice.
But here is what most discussions about Skills miss: you cannot write an effective Skill without deep domain knowledge. The chaos-phantom-event Skill only exists because I knew the product intimately — the correct test flow, the Spring profile that activates the chaos endpoint, the port conflict between Redpanda's schema registry and the application, the exact order of operations to reproduce the Dual Write failure. Writing a Skill is not writing documentation. It is codifying operational knowledge that already exists in the developer's or architect's head. An agent without this Skill would attempt to guess the test flow and fail on critical details — wrong port, forgotten Spring profile, incorrect order of operations. The developer needs to understand the business logic and the architecture before they can translate that knowledge into a Skill the agent can execute autonomously.
With that foundation, here is a concrete example. In the Mars Enterprise Kit Lite, Skills are self-contained instruction sets for complex operations. When the agent needs to execute a chaos test to prove the Dual Write problem, it does not improvise. It loads the chaos-phantom-event skill and follows a 10-step validated sequence.
The chaos-phantom-event skill exists precisely to prove the Dual Write failure mode — the inconsistency that happens when a Kafka event is published but the database transaction rolls back.
Here is the header of the chaos-phantom-event skill from the Kit Lite:
```yaml
---
name: chaos-phantom-event
description: >
  Run the Phantom Event chaos test against Mars Enterprise Kit Lite.
  This skill starts infrastructure, builds and runs the app with
  the "chaos" Spring profile, calls POST /chaos/phantom-event, and
  validates the Dual Write failure: the Kafka event EXISTS but the
  order does NOT exist in PostgreSQL (DB rolled back by AOP).
argument-hint: "[optional: number of phantom events to generate]"
---
```
The skill then contains the full execution sequence:

1. Verify infrastructure
2. Build the application
3. Start the app with the chaos profile
4. Record the baseline state
5. Execute the phantom event
6. Verify the inconsistency in PostgreSQL
7. Verify the inconsistency in Kafka
8. Print the proof
9. Verify the normal flow still works
10. Clean up
Each step has exact commands, expected outputs, and troubleshooting guidance. The agent does not need to figure out how to run a chaos test. It follows the skill.
This matters because complex operations have failure modes that are not obvious. The chaos-phantom-event skill, for example, knows that the chaos endpoint returns 404 if you forget to activate the chaos Spring profile. It knows that Redpanda's schema registry already occupies port 8081, so the app runs on 8082 instead. These are the details that an agent without a skill gets wrong on the first try.
If you want to see these skills in action, the Mars Enterprise Kit Lite includes the full chaos-phantom-event skill with the 10-step execution sequence, the CLAUDE.md constraints, and the Onion Architecture scaffold — all production-ready and open-source. Check the implementation →
### Layer 4: Workflows (.claude/commands/)

The PRP (Product Requirements Prompt) system is how the agent receives and executes feature requests. A PRP is a structured document that defines scope, acceptance criteria, and implementation steps — and the agent has commands to generate them and execute them in isolated Git worktrees.
```yaml
---
name: git-worktree-prp
description: >
  Manages Git Worktrees as an isolated development environment
  whenever the user invokes /generate-prp or /execute-prp commands.
  Every feature runs in its own worktree — completely isolated from
  the main working directory.
---
```
The agent never starts implementation work directly on the current branch. It creates a worktree, executes the PRP inside it, validates the result (compile, test, architecture checks), and only then merges. This is the same workflow a disciplined human developer would follow — except it is codified and repeatable.
**Worktrees and agent parallelism.** The advantage of Git Worktrees goes beyond isolating a single feature. With worktrees, you can run multiple agents in parallel — each one in an isolated branch, with no file conflicts, no shared state interference. While one agent implements feature A in worktree 1, another agent can be executing chaos tests in worktree 2. When each finishes, it delivers a clean, merge-ready result on its own branch. Without worktrees, parallel agents collide: they write to the same files simultaneously, corrupt local state, and the output is unusable. Worktrees turn agent parallelism from a theoretical concept into a practical workflow.
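The per-feature flow can be sketched with plain git commands. The paths, branch name, and `./mvnw` invocation below are illustrative, not the Kit's actual scripts:

```shell
# Create an isolated checkout for the feature — the main working directory is untouched.
git worktree add ../kit-cancel-order -b feature/cancel-order

# The agent implements and validates inside the worktree
# (compile, tests, architecture checks).
(cd ../kit-cancel-order && ./mvnw test)

# Only after validation: merge the branch, then remove the worktree.
git merge feature/cancel-order
git worktree remove ../kit-cancel-order
```

Each worktree is a full checkout on its own branch, so two agents never touch the same files; git itself prevents the same branch from being checked out twice.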
I should be honest about the limitations, because no article on this topic is complete without them.
**It does not eliminate review.** Martin Fowler's advice applies here: treat every AI output as a pull request from a productive but untrustworthy collaborator. My CLAUDE.md prevents structural violations, but it does not catch subtle domain logic errors. I still review every change.

**It does not scale to infinite context.** LLMs suffer from context rot — the more tokens you load, the less attention each token gets. As your project grows and documentation expands, it will exceed what fits in a single context window. The solution is layered loading: CLAUDE.md is always loaded, decision documents are loaded on demand, skills are loaded only when invoked. This is why the architecture is layered instead of monolithic.

**It does not work without fundamentals.** If you do not understand Domain-Driven Design, you cannot write constraints that protect domain boundaries. If you do not understand the Outbox pattern, you cannot write an ADR that prevents the agent from implementing Dual Write. AI-First design amplifies engineering expertise. It does not replace it.

**The maintenance cost is real.** When I change an architectural decision, I must update the CLAUDE.md, the relevant ADRs, and any affected skills. If the documentation drifts from the code, the agent generates code that matches the outdated docs. This is the same problem all documentation has — but it is worse here because the agent trusts the docs implicitly.
Here is something I did not expect when I started building this system.
The same context architecture that makes a project operable by AI makes it dramatically better for human developers. When your architectural decisions are documented in a structured, navigable format, onboarding becomes a matter of reading decisions in sequence rather than reverse-engineering intent from code. A new developer can start with the foundational decisions (architecture style, layer boundaries) and progress to the more specific ones (messaging patterns, testing strategy). By the end of the first week, they understand not just what the project does but why every decision was made.
The CLAUDE.md's constraints section doubles as a code review checklist. "DON'T add JPA annotations in domain/" is equally useful for a human reviewer scanning a pull request.
The skills are runnable by humans too. The chaos-phantom-event skill's 10-step sequence is a legitimate QA procedure for validating the Dual Write problem — any engineer can follow it manually.
This is the real argument for AI-First design: the investment is not for the AI. It is for the system. The AI is just the forcing function that makes you explicit about decisions that were previously implicit — living in the architect's head, gradually eroding as the team evolves.
```mermaid
graph LR
    subgraph BEFORE["WITHOUT Context Architecture"]
        direction TB
        B1["Session start<br/>Zero context loaded"]
        B2["Agent re-reads codebase<br/>~20 min reconstruction"]
        B3["Generates code<br/>JPA in domain layer<br/>No test written<br/>Wrong pattern"]
        B4["Developer corrects<br/>~25 min fix cycle"]
        B5["Next session<br/>Same mistakes repeat"]
        B1 --> B2 --> B3 --> B4 --> B5
        B5 -->|"loop"| B2
    end
    subgraph AFTER["WITH Context Architecture"]
        direction TB
        A1["Session start<br/>CLAUDE.md auto-loaded"]
        A2["Constraints active<br/>Architecture known<br/>Patterns understood"]
        A3["Generates code<br/>Domain layer clean<br/>Test written first<br/>Outbox respected"]
        A4["Developer reviews<br/>~5 min verification"]
        A5["Merge-ready<br/>First attempt"]
        A1 --> A2 --> A3 --> A4 --> A5
    end
    subgraph DELTA["Measured Difference"]
        direction TB
        M1["Fix time<br/>25 min -> 5 min"]
        M2["Violations/session<br/>2-3 -> 0-1"]
        M3["Attempts to merge<br/>3-4 -> 1"]
        M4["Context re-explain<br/>~40% -> ~0%"]
    end
    BEFORE -.->|"add context architecture"| AFTER
    AFTER -.->|"measured result"| DELTA
    style BEFORE fill:#16213e,stroke:#e94560,color:#eaeaea
    style AFTER fill:#16213e,stroke:#4ecca3,color:#eaeaea
    style DELTA fill:#16213e,stroke:#533483,color:#eaeaea
    style B1 fill:#0f3460,stroke:#e94560,color:#eaeaea
    style B2 fill:#0f3460,stroke:#e94560,color:#eaeaea
    style B3 fill:#0f3460,stroke:#e94560,color:#eaeaea
    style B4 fill:#0f3460,stroke:#e94560,color:#eaeaea
    style B5 fill:#0f3460,stroke:#e94560,color:#eaeaea
    style A1 fill:#0f3460,stroke:#4ecca3,color:#eaeaea
    style A2 fill:#0f3460,stroke:#4ecca3,color:#eaeaea
    style A3 fill:#0f3460,stroke:#4ecca3,color:#eaeaea
    style A4 fill:#0f3460,stroke:#4ecca3,color:#eaeaea
    style A5 fill:#0f3460,stroke:#4ecca3,color:#eaeaea
    style M1 fill:#0f3460,stroke:#533483,color:#eaeaea
    style M2 fill:#0f3460,stroke:#533483,color:#eaeaea
    style M3 fill:#0f3460,stroke:#533483,color:#eaeaea
    style M4 fill:#0f3460,stroke:#533483,color:#eaeaea
```
Let me quantify the difference based on my experience building the Mars Enterprise Kit over the past three months.
Without context architecture (first two weeks):

- ~25 minutes of correction per session
- 2–3 architecture violations per session
- 3–4 attempts before a change was merge-ready
- Roughly 40% of each session spent re-explaining context

With CLAUDE.md + .mars/docs/ + skills (last three months):

- ~5 minutes of review per session
- 0–1 violations per session
- Merge-ready on the first attempt for most features
- Near-zero context re-explanation
The numbers are approximate — I did not run a controlled experiment. But the magnitude of the difference is real. The context architecture turned the agent from a tool I fought with into a tool I collaborate with.
AI-First is not a feature you add to a project. It is a design discipline. The CLAUDE.md is the entry point, not the system. The system is the layered context architecture — constraints, knowledge base, skills, workflows — that makes your project operable by any agent (or human) that reads it.
If your agent keeps violating your architecture, the problem is not the model. The problem is that your architecture is not explicit enough for the model to follow.
Start with the CLAUDE.md. Make your constraints real. Then build the layers around it.
The Lite version gives you the Onion Architecture scaffold, the CLAUDE.md constraints, and three chaos testing skills — enough to start applying these ideas in your own projects. I am building a PRO version that expands the knowledge base with a full set of architectural decision records, AI-native documentation, and agent-coordinated workflows. If that sounds useful, you can follow the progress here →
