It started with an alpha build and an idea.
On January 7th, Reuven, creator of Claude Flow, gave us early access to version 3. He walked us through something that immediately caught my attention: he had restructured Claude Flow around Domain-Driven Design, complete with Architecture Decision Records documenting every significant choice. Clean bounded contexts. Clear separation of concerns. The kind of architecture that makes you think, "I should have been doing this all along."
I saw two experiments waiting to happen.
First, I wanted to test whether Claude Flow v3's new architecture actually improved agent output quality. Would the DDD approach and ADRs make a measurable difference, or was this just another layer of abstraction?
Second, I wanted to rewrite the entire Agentic QE Fleet using the same principles. The v2 codebase had grown organically over months—60+ directories, features scattered across the system, tight coupling that made every change feel risky. If DDD worked for Claude Flow, would it work for a quality engineering platform?
Fourteen days later, I had my answers.
What Actually Changed
Let me be direct about what happened, because I'm not interested in hype.
The v2 Agentic QE Fleet was functional. It had 32 agents and 35 QE skills, and it did the job. But it also had 5,334 files sprawled across the codebase, no clear feature boundaries, and the kind of architecture debt that accumulates when you're moving fast and shipping often.
V3 ships with 546 files organized into 12 bounded contexts. Each domain—test generation, coverage analysis, security compliance, defect intelligence—lives in its own world with explicit interfaces. The agent count went from 32 to 50 (43 main agents plus 7 TDD subagents). Skills expanded from 35 to 60.
The Numbers: From 5,334 files to 546 files. From 32 agents to 50 agents. From 35 skills to 60 skills. From chaos to 12 clean domains.
But the numbers don't tell the real story.
The Agent Quality Shift
Here's what I noticed within the first few days: agents were implementing things better.
Not perfect. I want to be clear about that. Integration between components remained a challenge—this is still the hardest problem in agentic development. Components would work in isolation, then struggle when they needed to talk to each other.
But something had shifted.
Previously, when I ran my brutal honesty review skill against agent output, I'd need three to six iterations to get a feature properly implemented, integrated, and verified with tests. The complexity of the task determined the iteration count, but it was rarely less than three rounds of "this isn't actually working, let me show you why."
With v3, that dropped to two rounds.
Two rounds to get systems fully implemented, integrated, and verified. For someone who's been running these workflows for months, that's not a marginal improvement. That's a fundamental change in how the development cycle feels.
The Token Reality
A week into development, Anthropic quietly reduced the token allocation for subscriptions. I think the cut was around 200 million tokens for my Max20 subscription, a significant reduction that would normally have forced me to switch to GLM or other backup models to continue working.
Here's what actually happened: I didn't hit the limit.
Now, I can't attribute this entirely to Claude Flow v3 or the DDD architecture. There are too many variables. But the evidence suggests that agents with clear domain boundaries waste fewer tokens exploring irrelevant context. When an agent knows exactly what it's responsible for and what it isn't, it doesn't need to load the entire codebase into its context window to make a decision.
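To make that reasoning concrete, here's a minimal sketch of the idea. This is not Claude Flow's actual context loader; the `src/domains/<name>` layout, function names, and the rough 4-characters-per-token estimate are all assumptions, but it shows why scoping an agent to one domain means fewer tokens spent before any reasoning starts.

```typescript
// Illustrative sketch only, not Claude Flow's API. The directory convention
// (src/domains/<name>) and the context-building shape are assumptions.
import * as fs from "node:fs";
import * as path from "node:path";

// Recursively collect source files under a single bounded context.
function collectDomainFiles(domainRoot: string): string[] {
  const files: string[] = [];
  for (const entry of fs.readdirSync(domainRoot, { withFileTypes: true })) {
    const full = path.join(domainRoot, entry.name);
    if (entry.isDirectory()) files.push(...collectDomainFiles(full));
    else if (full.endsWith(".ts")) files.push(full);
  }
  return files;
}

// Build a context window from one domain instead of the whole repository.
function buildDomainContext(repoRoot: string, domain: string): string {
  const domainRoot = path.join(repoRoot, "src", "domains", domain);
  return collectDomainFiles(domainRoot)
    .map((file) => `// ${file}\n${fs.readFileSync(file, "utf8")}`)
    .join("\n\n");
}

// Example: scope an agent's working context to test-generation only.
const context = buildDomainContext(process.cwd(), "test-generation");
console.log(`Domain context size: roughly ${Math.round(context.length / 4)} tokens`);
```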
I'm not making performance claims I can't verify. What I can tell you is that I completed a significant rewrite under token constraints that would have been problematic with my previous workflow.
Migration as a Learning Laboratory
Testing the v2 to v3 migration turned into its own experiment.
For the first time, I published an npm alpha version of v3 while v2 was still the stable release. This let me test in parallel—running the same analysis tasks against both versions on real codebases. The setup was simple: execute v2 agents, execute v3 agents, and compare the reports.
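The harness itself was nothing fancy. A generic version looks something like the sketch below; the `my-qe-cli` commands and report paths are placeholders, not the actual v2/v3 invocations, but the shape is the same: run both versions against the same codebase and write their reports out for side-by-side review.

```typescript
// Generic side-by-side runner. The command names and output paths are
// placeholders, not the real v2/v3 CLI invocations.
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { writeFile } from "node:fs/promises";

const run = promisify(execFile);

// Run one analysis command and persist its report for later comparison.
async function captureReport(label: string, cmd: string, args: string[]) {
  const { stdout } = await run(cmd, args, { maxBuffer: 10 * 1024 * 1024 });
  await writeFile(`report-${label}.md`, stdout, "utf8");
  return stdout;
}

async function main() {
  // Replace these with whatever invokes each version against the same codebase.
  const [v2Report, v3Report] = await Promise.all([
    captureReport("v2", "npx", ["my-qe-cli@2", "analyze", "."]),
    captureReport("v3", "npx", ["my-qe-cli@3.0.0-alpha", "analyze", "."]),
  ]);
  console.log(`v2 report: ${v2Report.length} chars, v3 report: ${v3Report.length} chars`);
  console.log("Wrote report-v2.md and report-v3.md for manual comparison.");
}

main().catch((err) => { console.error(err); process.exit(1); });
```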
The difference in output quality was clear. V3 agents produced more structured findings, better architectural insights, and more actionable recommendations. The DDD awareness meant they could identify issues that v2 agents simply couldn't see—like domain boundary violations or coupling problems that only matter when you understand the intended separation of concerns.
But the migration testing also surfaced problems I wouldn't have found otherwise:
- Features from v2 that the initial migration plan missed
- Edge cases in the agent definitions
- Configuration patterns that worked in v2 but needed adjustment for v3's architecture
Publishing the alpha forced me to treat migration as a first-class concern rather than an afterthought. Every v2 user upgrading to v3 deserves a path that doesn't break their workflows.
What Didn't Work
I promised no false claims, so here's what still needs work.
The self-learning and improvement systems—ReasoningBank, Dream cycles, the reinforcement learning algorithms—are implemented but not thoroughly tested in production scenarios. I need time with real usage patterns before I can tell you whether they deliver on their promise.
Integration between components remains the hardest problem. DDD helps by making the boundaries explicit, but it doesn't eliminate the challenge of coordinating multiple agents across domains. The Queen Coordinator exists precisely because this coordination problem doesn't solve itself.
And there are features in v2 that didn't make the initial v3 cut. Some were deliberately left behind as technical debt. Others were oversights that users will discover and report. The migration path handles the critical functionality, but v2 was a comprehensive system, and fourteen days is fourteen days.
What I Learned
Domain-Driven Design isn't just an organizational pattern. It's a communication protocol between humans and agents.
When you tell an agent, "You are responsible for the test-generation domain and nothing else," you're not just limiting scope. You're giving it permission to go deep instead of wide. The agent doesn't need to hedge its understanding across the entire system. It can build genuine expertise in one area.
This changes how agents write code, how they structure their output, and how they reason about problems. The v3 agents produce better work because they know what they're supposed to be good at.
Architecture Decision Records serve a similar function. When an agent can read ADR-026 explaining why TinyDancer uses three-tier model routing, it doesn't need to re-derive that reasoning from first principles. It can build on decisions already made.
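I won't reproduce ADR-026 here, and the sketch below is not TinyDancer's implementation. It only illustrates the general shape of three-tier routing that such a decision record captures: send cheap, well-bounded tasks to a small model and escalate only when complexity demands it. The tier names, task fields, and thresholds are purely illustrative.

```typescript
// Purely illustrative three-tier routing sketch; not TinyDancer's code.
// Tier names, task shape, and the complexity heuristic are all assumptions.
type Tier = "small" | "medium" | "large";

interface Task {
  description: string;
  filesTouched: number;
  crossesDomainBoundary: boolean;
}

// A crude complexity heuristic; a real router would use richer signals.
function routeTask(task: Task): Tier {
  if (task.crossesDomainBoundary) return "large"; // coordination-heavy work
  if (task.filesTouched > 5) return "medium";     // multi-file, single domain
  return "small";                                 // scoped, routine change
}

const tier = routeTask({
  description: "Add a retry policy to the parallel executor",
  filesTouched: 2,
  crossesDomainBoundary: false,
});
console.log(`Routing to ${tier} model tier`);
```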
These aren't revolutionary insights. DDD and ADRs have been standard practice in software engineering for years. What's new is recognizing that they work just as well—maybe better—when your team includes AI agents.
The 12 Domains
For those interested in the specifics, here are the 12 bounded contexts in v3:
| Domain | Purpose | Key Agents |
|---|---|---|
| test-generation | AI-powered test creation | test-architect, tdd-specialist |
| test-execution | Parallel execution, retry | parallel-executor, retry-handler |
| coverage-analysis | Gap detection | coverage-specialist, gap-detector |
| quality-assessment | Quality gates | quality-gate, deployment-advisor |
| defect-intelligence | Prediction, root cause | defect-predictor, root-cause-analyzer |
| requirements-validation | BDD, testability | requirements-validator, bdd-generator |
| code-intelligence | Knowledge graph | knowledge-manager, code-analyzer |
| security-compliance | SAST/DAST | security-scanner, security-auditor |
| contract-testing | API contracts | contract-validator, graphql-tester |
| visual-accessibility | Visual regression | visual-tester, a11y-validator |
| chaos-resilience | Chaos engineering | chaos-engineer, load-tester |
| learning-optimization | Cross-domain learning | learning-coordinator, pattern-learner |
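To give a feel for what "explicit interfaces" means in practice, here's a hedged sketch of how one of these domains might expose itself to the rest of the system. The type and function names are illustrative, not the actual v3 exports; the point is that other domains depend on a narrow public surface, never on a context's internals.

```typescript
// Illustrative sketch of one bounded context's public surface; the names
// are assumptions, not the actual v3 exports.

// Everything other domains are allowed to know about test-generation.
export interface TestGenerationRequest {
  targetFile: string;
  framework: "jest" | "vitest" | "mocha";
  coverageGoal?: number; // optional target line coverage, 0..1
}

export interface GeneratedTestSuite {
  testFile: string;
  caseCount: number;
  notes: string[];
}

// The domain's single entry point. Internals (agents, prompt templates,
// heuristics) stay private to the bounded context.
export interface TestGenerationDomain {
  generate(request: TestGenerationRequest): Promise<GeneratedTestSuite>;
}

// Another domain can request work without knowing how it gets done. Here,
// coverage-analysis asks for tests covering an uncovered file.
export async function fillCoverageGap(
  testGen: TestGenerationDomain,
  uncoveredFile: string
): Promise<GeneratedTestSuite> {
  return testGen.generate({ targetFile: uncoveredFile, framework: "vitest" });
}
```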
What's Next
I have a working v3 that performs better than v2 by every metric I've measured. But "better" and "complete" aren't the same thing.
The learning systems need real-world validation. The browser automation with Vibium and agent-browser needs production testing. The 60 QE skills need to be used in contexts I haven't imagined.
Most importantly, the architecture needs stress testing by people who aren't me. I designed these 12 domains based on my understanding of quality engineering. Other practitioners will have different mental models, different priorities, and different ways of organizing work.
V3 is released as an alpha for a reason. It works, but it will evolve based on how people actually use it.
The Point
I rebuilt a significant software system in fourteen days.
Not by working around the clock. Not by cutting corners on architecture. By applying principles that make agents' work more focused, more reliable, and more efficient.
Domain-Driven Design gave structure. Architecture Decision Records gave context. Clean bounded contexts gave agents permission to specialize. The combination produced something that feels qualitatively different from v2—not just better organized, but actually smarter about the work it does.
Is this the future of software development? I don't know. I try not to make sweeping claims about the future.
What I can tell you is that it worked for this project, on this timeline, with these constraints. And if you're building with AI agents, the architecture of your system matters as much as the capabilities of your models.
Maybe more.
The Agentic QE Fleet v3 is available now as an alpha release. If you're a v2 user, the migration path is documented. If you're new, the quick start will get you running in minutes.
Agentic QE Fleet on GitHub | Migration Guide | V2 to V3 Comparison