
Junyi's Lab

Agent Mesh: My Thoughts on Multi-AI Collaborative Workflows


AI coding assistants are everywhere now, from chat windows to CLI Agents. After using a bunch of them, my biggest takeaway is: no single Agent is the best at everything.

I mainly use Claude Code, Google Gemini CLI, and Moonshot Kimi Code day-to-day. At first I treated them as standalone tools, but soon realized that was a waste — each has distinct strengths, and if you let them collaborate and delegate tasks based on those strengths, the results are much better. I call this approach Agent Mesh.

Here are my observations and thoughts.


# Core Roles: My Observations and Real Experience

## 1. Claude Code: Top-Tier Architect and Execution Engine

  • Hands-on impression: In daily use, Claude Code feels like a “senior engineer.” Whether spinning up a new project from scratch or tackling a complex refactor, its architectural vision is the best of the bunch. And it doesn’t just plan — it executes: traversing files, running bash, fixing its own compile errors, driving the whole loop autonomously. I’ve installed Claude Code on all my servers. It has massively boosted my productivity and freed up a lot of my energy.
  • Weakness: Hallucinations. Especially when generating large blocks of documentation or referencing large systems it hasn’t fully digested, it will make things up.

## 2. Gemini CLI: Large-Context Consumer and “Hallucination Checker”

  • Hands-on impression: Gemini’s biggest advantage is that 2-million-token context window. In practice, it really can swallow an entire repository or very long error logs (I’ve tested with logs of several hundred KB).
  • Collaborative value: Because it can hold the entire macro context in its head, it’s a natural auditor. When Claude Code produces a brilliant but potentially hallucinated code plan, you can call on Gemini for large-scale cross-file verification.

## 3. Kimi Code (Moonshot K2.5): Deep Reasoner and Data Miner (❓ Pending Evaluation)

  • Theoretical strengths: Kimi Code takes a completely different path. Under the hood it runs K2.5’s “long thinking” model (k0), which uses explicit chain-of-thought (similar to OpenAI o1’s approach), plus an “Agentic Swarm” pattern that can, in theory, spin up a swarm of sub-Agents for multi-hop document retrieval or for grinding through deeply nested algorithmic problems.
  • But I have a question mark: I use Claude and Gemini every day, so I have a solid feel for their capability boundaries. But how much better Kimi actually is at “data mining” and “deep reasoning” compared to the other two — honestly, I’m mostly going by its architectural claims and haven’t done rigorous head-to-head evaluations yet. So for now, its niche in logical persistence is only a theoretical one.

# What My Ideal Agent Mesh Looks Like

*Figure: Agent Mesh workflow diagram*

Once you understand each Agent’s strengths and weaknesses, the natural next question is: can you have them collaborate, chaining tasks via prompts? That’s what I mean by Agent Mesh.

Most of the time, Claude Code is the workhorse — it writes the vast majority of the code. But when it hits a context ceiling, say I need to refactor a module but don’t know which parts of a 500K-line legacy codebase will be affected, I hand it off to Gemini — let it read the entire repo and report back the blast radius.
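In CLI terms, that handoff is just composing a prompt and shelling out to the other agent. Here’s a minimal sketch — the `gemini -p` one-shot prompt flag and the wording of the query are my assumptions, so verify the flags against your installed Gemini CLI version:

```python
import subprocess

def blast_radius_handoff(module: str, run: bool = False) -> list[str]:
    """Build (and optionally run) a one-shot Gemini CLI query asking for
    the blast radius of refactoring `module` in the current repo."""
    prompt = (
        f"Read this repository and list every file, test, and public API "
        f"that would be affected by refactoring the '{module}' module. "
        f"Report only the blast radius, do not propose a fix."
    )
    cmd = ["gemini", "-p", prompt]  # assumed one-shot prompt flag
    if run:
        # Gemini CLI scans the surrounding workspace itself, so we invoke
        # it from the repo root and capture its report for Claude Code.
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout)
    return cmd

# Build the command without invoking the CLI.
cmd = blast_radius_handoff("auth")
```

The report that comes back becomes part of the next prompt to Claude Code, which is all the “mesh” really is at the transport level: text out, text in.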

Conversely, after Claude executes a refactoring plan, I’ll have Gemini cross-reference it against the actual project files to catch hallucinations or broken dependencies. If Gemini encounters a particularly tough algorithmic problem during auditing, it can pass it to Kimi Code for extended reasoning.

After Kimi finishes thinking, it doesn’t touch files directly — it passes conclusions back to Claude Code for implementation. The whole flow is a loop: each does what it’s best at, and results flow between them.
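The loop above can be sketched as plain orchestration code. Everything here is a stub: the three agent functions stand in for real CLI calls, and the escalation rule is my own simplification of the flow, not any tool’s actual API:

```python
# Stub agents: in a real mesh each of these would shell out to its CLI.
def claude_code(task: str) -> str:
    return f"patch for: {task}"           # workhorse: writes the code

def gemini_cli(patch: str) -> str:
    # Auditor: cross-checks a patch against the whole repo.
    if "hard algorithm" in patch:
        return "ESCALATE: needs deep reasoning"  # punt to Kimi
    return f"audit passed: {patch}"

def kimi_code(task: str) -> str:
    return f"reasoned plan for: {task}"   # deep reasoner: plans, never edits files

def mesh_round(task: str) -> list[str]:
    """One round of the mesh: Claude writes, Gemini audits, Kimi reasons
    if needed, and conclusions flow back to Claude for implementation."""
    log = []
    patch = claude_code(task)
    log.append(patch)
    audit = gemini_cli(patch)
    log.append(audit)
    if audit.startswith("ESCALATE"):
        plan = kimi_code(task)
        log.append(plan)
        log.append(claude_code(plan))     # Kimi's conclusions go back to Claude
    return log

log = mesh_round("refactor auth module")
```

The point of the sketch is the shape, not the stubs: execution stays with one agent, auditing with another, and escalation is an explicit edge in the graph rather than something any single agent decides internally.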


# The Hidden Factor: Agent Architecture vs. Base Model

When trying to combine these Agents, I discovered a point that’s easy to overlook: the underlying base model is only half the equation — the “Agent architecture built around it” determines its actual capabilities.

The best example is Anthropic’s Claude 3.5 Sonnet. You can use this same base model in multiple environments:

  1. As a standard chat assistant in VS Code.
  2. As the driving engine inside Cursor IDE.
  3. Through open-source Agent frameworks like OpenHands / Open Code.
  4. Natively via the Claude Code CLI.

Despite sharing the exact same LLM DNA (Claude 3.5 Sonnet), their real-world capabilities differ dramatically. In demanding engineering workflows, native Claude Code consistently outperforms the others in Agent execution. Why? Because the Agent architecture has been uniquely optimized by the model’s creators (including but not limited to: hidden loop mechanisms, error recovery strategies, and specific context engineering).
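To make “Agent architecture” concrete: even with an identical model call, a harness that feeds errors back and retries behaves very differently from a single-shot wrapper. A toy sketch — the `model` function is a stand-in and the recovery policy is invented for illustration, not Claude Code’s actual mechanism:

```python
def run_with_recovery(model, task, check, max_rounds=3):
    """Minimal agent loop: call the model, check the result, and on
    failure feed the error back into the prompt for another attempt."""
    feedback = ""
    for _ in range(max_rounds):
        output = model(task + feedback)
        err = check(output)
        if err is None:
            return output             # success: exit the loop
        feedback = f"\nPrevious attempt failed: {err}. Fix it."
    raise RuntimeError("gave up after max_rounds")

# Stand-in model: only succeeds once it has seen error feedback.
def flaky_model(prompt):
    return "fixed code" if "failed" in prompt else "buggy code"

result = run_with_recovery(
    flaky_model,
    "implement parser",
    check=lambda out: None if out == "fixed code" else "syntax error",
)
```

A plain chat wrapper is just the first iteration of this loop; the retry-with-feedback policy (plus context engineering around it) is exactly the scaffolding that makes otherwise identical base models perform so differently.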

Even when using excellent open-source alternatives like Open Code with the Claude model, the execution fluidity and architectural vision often fall short of Claude Code. This tells us that when choosing Agents for a Mesh, we’re not just picking a base model (like Sonnet or GPT-4o) — we’re picking the engineering scaffolding built around it.


# Final Thoughts

I don’t think a single “do-it-all” model will emerge to handle everything. What’s truly interesting is orchestration — having Agents with fundamentally different underlying architectures each play to their strengths.

One view I’ve consistently held: when we say “different” agents, we do not mean taking the same underlying model and wrapping it with different system prompts (e.g., role-playing “you are an architect” vs. “you are a tester”). True collaborative sparks only happen when agents have fundamental differences in how they process information at the architectural level.

But how do you know whether two models are truly different, or just fine-tuned versions of the same base? This is where tools like LLM-DNA (a research framework for analyzing evolutionary relationships and functional differences between language models, published at ICLR'26) become invaluable. By analyzing the “genetic” lineage and functional distance between models, we can deliberately select Agents from entirely different evolutionary branches, ensuring they don’t share the same blind spots.

My hands-on engineering experience tells me: the Claude Code + Gemini CLI synergy is currently the most powerful and intuitive combination. Gemini CLI leverages its long context to swallow entire repos and stacks of logs, specializing in catching hallucinations and macro-level auditing. Claude Code focuses on what it does best — understanding code architecture and executing precisely in the local environment.

When you deliberately combine Agents with fundamentally different architectures, they stop overlapping and start complementing each other. That’s the Agent Mesh idea I wanted to share — hope it inspires some of you.