Programming on Mars

TDD in AI-Assisted Development: The Step Nobody Documents

TDD in AI-assisted development is not optional — it is the protocol. How the test validation step turns LLM-generated code from a liability into production-ready software.

André Lucas

March 19, 2026


TDD in the AI-assisted cycle

Last month I asked Claude Code to implement a cancel order use case. It generated the controller, the service, the repository call. Clean code, reasonable naming, even a DTO. Zero tests. Zero edge cases. No validation for cancelling an already cancelled order. No check for null customer IDs. The code compiled. It would have passed code review on a Monday morning. And it would have broken in production by Tuesday.

The problem was not the model. The problem was me. In AI-assisted development, the developer must define what should fail — not hope the agent infers it. I described the happy path and expected the AI to infer the unhappy ones. It cannot. No LLM can. They are non-deterministic by design — the same prompt produces different code on different runs. Without explicit constraints, the output is a coin flip dressed in clean syntax.

The Context: Why "Just Use AI" Fails at Scale

I work with Java and Spring Boot in systems that process over a million transactions a day. At that scale, "eventually consistent" is not a comfort — it is a risk you manage with explicit contracts. When I started using AI coding assistants full-time last year, the first thing that broke was not my architecture. It was my feedback loop.

In classical TDD, the feedback loop is tight: Red, Green, Refactor. You write a failing test, you write the minimum code to pass it, you clean up. The developer holds both the domain knowledge and the implementation. With AI in the loop, that cycle splits. The developer holds domain knowledge. The AI holds implementation speed. But nobody defined the handoff protocol between the two. Most teams just type a prompt and hope for the best. That is not engineering. That is improvisation.

%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#21242D", "primaryTextColor": "#ffffff", "primaryBorderColor": "#FF7F50", "lineColor": "#BCC3D7", "background": "#0E0C15", "mainBkg": "#21242D", "fontFamily": "Sora, monospace"}}}%%

flowchart LR
    subgraph CLASSIC["  CLASSICAL TDD  "]
        direction TB
        D1(["Developer"]) -->|"writes failing test"| R["RED\ntest fails"]
        R -->|"min. code to pass"| G["GREEN\ntest passes"]
        G -->|"clean up"| RF["REFACTOR"]
        RF -->|"next behaviour"| R
    end

    subgraph BROKEN["  TDD + AI — BROKEN LOOP  "]
        direction TB
        D2(["Developer"]) -->|"describes feature"| AIGEN["AI generates\ncode + tests"]
        AIGEN -->|"no review"| SKIPNODE{{"VALIDATION\nSKIPPED"}}
        SKIPNODE -->|"ships directly"| PUSH["code pushed"]
        PUSH -.->|"missing edge cases\nnull checks\nboundary conditions"| BUST["PROD BUG"]
    end

    CLASSIC ~~~ BROKEN

    style R fill:#CC2222,color:#fff,stroke:#FF4444
    style G fill:#2A6B3E,color:#fff,stroke:#3EB75E
    style RF fill:#0F4F6E,color:#fff,stroke:#1BA2DB
    style SKIPNODE fill:#7A3A00,color:#FFC876,stroke:#FF7F50,stroke-width:2px
    style BUST fill:#3D0000,color:#FF7F50,stroke:#CC5200,stroke-width:2px
    style CLASSIC fill:#16181E,stroke:#2E313D,color:#BCC3D7
    style BROKEN fill:#16181E,stroke:#FF7F50,color:#BCC3D7

The New Cycle: AI-Assisted TDD as a Communication Protocol for LLMs

After months of daily pairing with Claude Code, I converged on a cycle that works. It is not revolutionary — it is TDD with an explicit step that most teams skip. Here is the full loop:

Step 1 — Developer defines domain scenarios. Not code. Not pseudocode. Scenarios. Happy flow, unhappy flow, edge cases, boundary conditions. Written as natural language or structured prompts. This is where domain knowledge lives, and this is what most developers skip when they prompt an AI.
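A scenario set for the cancel-order case from the opening might look like the sketch below. The exact rules (such as what happens mid-shipment) are illustrative — your domain defines them:

```
Feature: Cancel order
  Happy path:
    - cancelling a CREATED order sets its status to CANCELLED
  Unhappy paths:
    - cancelling an already CANCELLED order throws a BusinessException
    - a cancel request with a null customerId is rejected with HTTP 400
  Boundary conditions:
    - cancelling an order that has already shipped is rejected with a domain error
```

Each line is a constraint the AI will later have to express as a test. If a rule is not written here, assume it will not be tested.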

Step 2 — AI generates the test pyramid. Unit tests for domain logic. Integration tests for persistence and transactions. E2E tests for API contracts. The AI does not decide what to test — you already told it. The AI decides how to express it in code.

Step 3 — Developer validates the generated tests. This is the step nobody documents. You read every test the AI generated. You check: did it capture the edge case where an order with zero items should throw a BusinessException? Did it test what happens when you cancel an already cancelled order? Did it cover the contract — the HTTP 400 when customerId is null? If the answer is no, you correct the scenarios and go back to Step 2.
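To make the already-cancelled rule concrete, here is what that invariant looks like stripped down to framework-free Java. Class names and the message are illustrative stand-ins, not the kit's actual API:

```java
// Minimal sketch of the "cancel an already cancelled order" rule.
// BusinessException and OrderStatus are hypothetical stand-ins.
public class CancelRuleSketch {

    enum OrderStatus { CREATED, CANCELLED }

    static class BusinessException extends RuntimeException {
        BusinessException(String message) { super(message); }
    }

    static class Order {
        private OrderStatus status = OrderStatus.CREATED;

        void cancel() {
            // The invariant the generated tests must capture:
            // a cancelled order cannot be cancelled again.
            if (status == OrderStatus.CANCELLED) {
                throw new BusinessException("order is already cancelled");
            }
            status = OrderStatus.CANCELLED;
        }
    }

    public static void main(String[] args) {
        var order = new Order();
        order.cancel();           // first cancel succeeds
        try {
            order.cancel();       // second cancel must fail
        } catch (BusinessException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If no generated test exercises that second `cancel()` call, Step 3 is where you catch it.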

Step 4 — AI implements the production code. Only after the tests are validated. The AI now has a concrete, executable specification of what "correct" means. It is not guessing. It is solving a constraint satisfaction problem.

Step 5 — AI runs the tests as a guardrail on every change. Each implementation attempt is validated against the test suite automatically. Tests fail, AI adjusts. Tests pass, iteration is done. The developer is not reviewing line-by-line generated code — the tests are doing the reviewing.

%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#21242D", "primaryTextColor": "#ffffff", "primaryBorderColor": "#FF7F50", "lineColor": "#BCC3D7", "background": "#0E0C15", "mainBkg": "#21242D", "clusterBkg": "#16181E", "fontFamily": "Sora, monospace"}}}%%

flowchart TD
    DEV(["DEVELOPER\ndomain expert"])
    AI(["AI\nimplementation engine"])

    S1["STEP 1\nDefine domain scenarios\nhappy path · edge cases\nboundary conditions · failures"]
    S2["STEP 2\nGenerate test pyramid\nUnit · Integration · E2E"]
    S3{{"STEP 3\nValidate generated tests\nagainst domain reality"}}
    S4["STEP 4\nImplement production code\nwith tests as spec"]
    S5["STEP 5\nRun tests as guardrail\non every iteration"]

    FAIL_LOOP["revise scenarios\n+ regenerate"]
    DONE(["tests GREEN\niteration complete"])

    DEV -->|"structured prompt\nwith constraints"| S1
    S1 --> AI
    AI -->|"unit + integration + e2e"| S2
    S2 --> DEV
    DEV -->|"reads every test"| S3
    S3 -->|"tests capture\nedge cases correctly"| S4
    S3 -->|"tests miss domain\nrules or edge cases"| FAIL_LOOP
    FAIL_LOOP -->|"back to step 1"| S1
    S4 --> AI
    AI -->|"implements against spec"| S5
    S5 -->|"tests FAIL\nAI adjusts"| S4
    S5 -->|"tests PASS"| DONE

    style DEV fill:#16181E,stroke:#FF7F50,color:#FF7F50,stroke-width:2px
    style AI fill:#16181E,stroke:#1BA2DB,color:#1BA2DB,stroke-width:2px
    style S1 fill:#21242D,stroke:#2E313D,color:#BCC3D7
    style S2 fill:#21242D,stroke:#2E313D,color:#BCC3D7
    style S3 fill:#7A3A00,stroke:#FF7F50,color:#FFC876,stroke-width:2px
    style S4 fill:#21242D,stroke:#2E313D,color:#BCC3D7
    style S5 fill:#21242D,stroke:#3EB75E,color:#3EB75E
    style FAIL_LOOP fill:#3D0000,stroke:#CC5200,color:#FFA07A
    style DONE fill:#0F3D1F,stroke:#3EB75E,color:#3EB75E

The Test Pyramid as a Context Map for TDD with AI Coding Tools

Here is the insight that changed how I work with AI: each layer of the test pyramid maps to a different layer of context that the LLM needs.

Unit tests encode domain rules. When I tell the AI "an order with empty items should throw a BusinessException", I am giving it a domain constraint. The unit test is the executable form of that constraint:

@Test
void shouldThrowWhenItemsAreEmpty() {
    assertThatThrownBy(() -> Order.create(UUID.randomUUID(), Set.of()))
        .isInstanceOf(BusinessException.class)
        .hasMessageContaining("items cannot be empty");
}

@Test
void shouldThrowWhenCustomerIdIsNull() {
    var items = Set.of(new OrderItem(UUID.randomUUID(), 1, new BigDecimal("10.00")));
    assertThatThrownBy(() -> Order.create(null, items))
        .isInstanceOf(BusinessException.class)
        .hasMessageContaining("customerId cannot be null");
}

These tests exist in the mars-enterprise-kit-lite. They test Order.create() — a static factory on a Java record. No Spring context, no database, no Kafka. Pure domain logic. The AI can run hundreds of these in milliseconds. Each one is a constraint the implementation must satisfy.
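For readers who have not seen the record-plus-validating-factory shape, here is a simplified, self-contained sketch of what such an `Order.create()` might look like. Names and messages are illustrative, not copied from the kit's source:

```java
import java.math.BigDecimal;
import java.util.Set;
import java.util.UUID;

// Sketch of a record-based aggregate with a validating static factory,
// in the spirit of the Order.create() described above. Hypothetical names.
public class OrderFactorySketch {

    static class BusinessException extends RuntimeException {
        BusinessException(String message) { super(message); }
    }

    record OrderItem(UUID productId, int quantity, BigDecimal unitPrice) {}

    record Order(UUID id, UUID customerId, Set<OrderItem> items) {
        static Order create(UUID customerId, Set<OrderItem> items) {
            // Each check below is the implementation side of one unit test.
            if (customerId == null) {
                throw new BusinessException("customerId cannot be null");
            }
            if (items == null || items.isEmpty()) {
                throw new BusinessException("items cannot be empty");
            }
            return new Order(UUID.randomUUID(), customerId, Set.copyOf(items));
        }
    }

    public static void main(String[] args) {
        try {
            Order.create(UUID.randomUUID(), Set.of());
        } catch (BusinessException e) {
            System.out.println("empty items: " + e.getMessage());
        }
        var items = Set.of(new OrderItem(UUID.randomUUID(), 1, new BigDecimal("10.00")));
        try {
            Order.create(null, items);
        } catch (BusinessException e) {
            System.out.println("null customer: " + e.getMessage());
        }
    }
}
```

The symmetry is the point: one guard clause per unit test, one unit test per domain constraint.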

Integration tests encode infrastructure contracts. This is where things like transactions, persistence mappings, and database constraints live:

class CreateOrderUseCaseIntegrationTest extends AbstractIntegrationTest {

    @Autowired
    private CreateOrderUseCase createOrderUseCase;

    @Autowired
    private OrderJpaRepository orderJpaRepository;

    @Test
    @DisplayName("should create order and persist in database")
    void shouldCreateOrderAndPersistInDatabase() {
        var items = Set.of(new OrderItem(UUID.randomUUID(), 2, new BigDecimal("10.00")));
        var customerId = UUID.randomUUID();

        var orderId = createOrderUseCase.execute(
            new CreateOrderUseCase.Input(items, customerId));

        assertThat(orderId).isNotNull();
        var entity = orderJpaRepository.findById(orderId).orElseThrow();
        assertThat(entity.getCustomerId()).isEqualTo(customerId);
        assertThat(entity.getStatus()).isEqualTo(OrderStatus.CREATED);
        assertThat(entity.getTotal()).isEqualByComparingTo(new BigDecimal("20.00"));
    }
}

This test extends AbstractIntegrationTest, which spins up a real PostgreSQL container via Testcontainers. The AI knows: there is a real database, Flyway migrations must run, the JPA mapping must be correct, the total must be calculated as quantity times unit price. That is infrastructure context the AI cannot infer from a prompt alone.

E2E tests encode API contracts. What status code for an empty items array? What response shape for a successful creation?

@Test
@DisplayName("POST /orders - should return 400 when items are empty")
void shouldReturn400WhenItemsAreEmpty() {
    given()
        .contentType(ContentType.JSON)
        .body("""
            {
                "customerId": "550e8400-e29b-41d4-a716-446655440000",
                "items": []
            }
            """)
    .when()
        .post()
    .then()
        .statusCode(400);
}

REST Assured with BDD-style assertions. The AI reads this and knows: empty items is a 400, not a 500. That is a contract. The developer defined it. The AI implements it.
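In a Spring application this contract usually lives in an exception handler (for example a `@RestControllerAdvice`). Reduced to framework-free Java, the mapping such a handler encodes is tiny — the following is an illustrative sketch, not the kit's actual handler:

```java
// Framework-free sketch of the exception-to-status mapping an HTTP
// exception handler would encode. BusinessException is a hypothetical
// stand-in for the project's domain exception.
public class StatusMappingSketch {

    static class BusinessException extends RuntimeException {
        BusinessException(String message) { super(message); }
    }

    // Domain rule violations are client errors (400);
    // anything unexpected stays a server error (500).
    static int statusFor(Throwable t) {
        return (t instanceof BusinessException) ? 400 : 500;
    }

    public static void main(String[] args) {
        System.out.println(statusFor(new BusinessException("items cannot be empty")));
        System.out.println(statusFor(new NullPointerException()));
    }
}
```

The E2E test pins this mapping down from the outside, so the AI cannot silently let a domain violation leak out as a 500.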

If you want to see this pattern in action, the Mars Enterprise Kit Lite implements a production-ready test infrastructure with unit tests, Testcontainers-based integration tests, and REST Assured E2E tests built-in — the exact setup described in this cycle. Check the implementation →

What Went Wrong: The Validation Step I Almost Skipped

The first time I tried this cycle, I skipped Step 3. I let the AI generate tests and went straight to implementation. The AI produced 12 unit tests for the Order aggregate — all passing, all green. I felt productive. Then I looked at what was actually being tested: 8 of the 12 tests were variations of the happy path. No test for cancelling an already cancelled order. No test for null customerId. No test for negative quantities.

The AI did exactly what I asked — it generated tests. But it optimized for coverage percentage, not for domain correctness. A human who understands the business domain would never write 8 happy path variations and zero edge case tests. That is the gap. The AI does not know what matters unless you tell it.

After that, the validation step became non-negotiable. I now spend more time reviewing generated tests than reviewing generated implementation code. If the tests are right, the implementation follows. If the tests are wrong, no amount of implementation quality saves you.

Trade-offs: When This Cycle Does Not Work

This workflow assumes the developer has deep domain knowledge. If you are building a CRUD with no business rules, the overhead of defining scenarios and validating tests is not worth it — just prompt and ship.

It also assumes your test infrastructure is solid. Testcontainers, proper integration test setup, fast feedback loops. If your test suite takes 20 minutes to run, the AI loop becomes a bottleneck instead of an accelerator. In the mars-enterprise-kit-lite, the full test suite — unit, integration, and E2E — runs in under 30 seconds because of shared Testcontainers and focused test scope.

And there is a seniority requirement that nobody talks about. A developer who has never practiced TDD cannot define meaningful scenarios for the AI. They will describe happy paths, the AI will generate happy path tests, and the production bugs will come from the edge cases neither of them considered. The new cycle does not lower the bar for engineering skill — it raises it.

The Takeaway

TDD is not dead in the AI era. It is the protocol. The developer who can express domain boundaries, edge cases, and failure modes in structured scenarios will get production-quality code from any LLM. The developer who cannot will get code that compiles and fails in production.

The step that makes the difference is the one nobody documents: validate the generated tests before letting the AI implement anything. That is where domain expertise earns its keep.

If you want to see how this connects to the broader question of structuring projects for AI agents, I wrote about it in AI-First Software Design: Beyond CLAUDE.md — the context engineering that makes this TDD cycle possible.

The Lite version gives you the test infrastructure: domain unit tests, Testcontainers integration tests, and REST Assured E2E contracts — everything you need to run this cycle on your own codebase. The PRO version adds the full reference implementation with Event Sourcing, SAGA orchestration, and the AI-assisted development workflow already applied end-to-end across three production-grade services. Join the early access →

Tags

TDD, AI-Assisted Development, LLM, Test Pyramid, Java, Spring Boot, Claude Code