Junyi's Lab

Debunking introl and ainewshub: "TPU is 4x Cheaper than GPU" is an AI Hallucination

junyi.h@comp.nus.edu.sg (Junyi Hou) — Wed, 10 Jun 2026 15:26:00 +0800

Disclaimer: this article is not arguing that TPU is worse than GPU. TPU and GPU each have workloads they suit, and which one wins depends on the specific workload. My point is that the batch of comparison data currently circulating online is itself inaccurate, much of it AI-fabricated and impossible to trace. What I’m taking apart below is exactly that fake data.

Artificial Analysis recently released a set of hardware benchmarks¹. They run Llama 3.3 70B on vLLM and compute the cost per million input/output tokens at a reference speed of 30 output tokens/s per query. On that metric, NVIDIA has roughly a 5x per-dollar token advantage over TPU v6e (Trillium), and roughly a 2x advantage over AMD MI300X².

The hardware benchmark conclusion Artificial Analysis posted on X. NVIDIA has about a 5x per-dollar token advantage over TPU v6e and about 2x over MI300X. H100 is $1.06, MI300X is $2.24, TPU v6e is $5.13.

Alongside this reproducible data, a very different kind of content is also circulating online.

# A High-Profile Comparison Article

introl has an article titled Google TPU v6e vs GPU: 4x Better AI Performance Per Dollar³. Its core argument is that TPU has 4x better performance per dollar than H100, and that TPU completely beats NVIDIA on inference economics. Its key data comes from another article, ainewshub.org’s Nvidia vs Google TPU 2025 Cost Comparison⁴. Follow this citation chain down and you find that both are AI-generated, and the data is made up.

# The First Problem: The Core Citation Points to Data That Doesn’t Exist

The most central sentence in the ainewshub article reads like this.

The core claim in the ainewshub article, 4.7x better performance per dollar, but the source it cites is a non-existent MLPerf v4.1 LLM inference result.

“4.7× better performance-per-dollar on LLM inference than Nvidia H100/H200”, with the source listed as Google Cloud MLPerf Inference v4.1 results + customer case studies, October 2025.

But in fact, in MLPerf Inference v4.1, Google’s TPU submissions include only one model, stable-diffusion-xl. I went to MLCommons’ official results⁵ and filtered by Google plus TPU. Under v4.1 Closed Datacenter there are only two records, tpu-v5e-4 and tpu-v6-4, and both ran stable-diffusion-xl.

Filtering MLPerf v4.1 Closed Datacenter for Google TPU shows only two entries, tpu-v5e-4 and tpu-v6-4, both running stable-diffusion-xl. There is no LLM inference entry at all.

No LLM inference entry at all! The so-called MLPerf v4.1 LLM inference 4.7x performance per dollar cited by this article simply does not exist in the source it claims. The number was generated out of thin air and then attached to an authoritative-looking source.

On top of that, MLPerf does not report performance per dollar at all. It reports throughput (samples/s and queries/s), with no pricing attached. So that 4.7x could not possibly have been computed from MLPerf.

# The Second Problem: The Numbers Drift Between Reposts

Put introl³ and ainewshub⁴ side by side and the numbers don’t match, and some of the statements are simply false.

- Performance per dollar multiple: introl writes 4x, ainewshub writes 4.7x. (This one is pure deception, see above.)
- MLPerf version: introl cites v3.1, ainewshub cites v4.1. (No idea where introl got its data.)
- Midjourney case: introl writes that monthly spend dropped from $2 million to $700k, ainewshub writes from $2.1 million to $700k. (I didn’t carefully verify this one, but it’s most likely hallucinated too.)

The claim gets reposted again and again, and it changes a little with each retelling. Damn it, every time a model regenerates it, it makes up a slightly different number.

# The Third Problem: A Frighteningly Precise TCO Table, With No Source

ainewshub⁴ gives a three-year total cost of ownership table for a 1000-chip cluster: NVIDIA H100 total cost $177 million, Google TPU v6 total cost $78.5 million, a saving of $98.5 million. It even breaks the saving down by category: hardware down 48%, power down 66%, cooling down 67%, support down 63%, networking down 67%, real estate down 63%. (All made up.)

A breakdown precise to a single percentage point looks very professional. The problem is that not one item can be traced to a source. The accompanying customer cases are the same: Midjourney, and a so-called Series C computer vision startup whose monthly spend dropped from $340k to $89k, are specific numbers that nobody can verify.

# The Most Infuriating Part is “data verified”

Following the ainewshub cost comparison article further upstream, its source points to another article on the same site, AI Inference Costs: TPU vs GPU 2025⁶. The same numbers are repeated here again, 4x value for money, Midjourney saving 65%, TPU v5e winning 8 out of 9 items. I have to stress again, this data is fake, it does not exist.

The statement at the end reads like this.

Data verified as of November 26, 2025. Sources include Google Cloud documentation, MLPerf benchmarks, company earnings reports, and verified industry migrations.

Verified, my ass.

It claims verification, yet it can’t provide a single clickable link, specific report, or methodology. As checked above, the MLPerf TPU LLM inference results it relies on simply do not exist. The so-called verification is completely fake.

Different articles on the same site can’t even keep these fake numbers consistent. The person who wrote them doesn’t know where the numbers came from either, because they never existed in the first place.

A site that mass-produces hallucinated numbers and then stamps them all with data verified … this garbage site does real damage.

# The Data I’ve Found So Far (At Least a Bit More Credible Than Their Site)

Artificial Analysis’ System Load Test¹, running Llama 3.3 70B.

Artificial Analysis System Load Test, peak system throughput, per-query output speed, and on-demand rental price for Llama 3.3 70B.

Peak system throughput: B200 is 15.4k tokens/s, H200 is 8.47k, H100 is 7.28k, TPU v6e is 6.73k, MI300X is 3.67k. TPU v6e trails every NVIDIA chip in the test. On per-query output speed, TPU v6e reaches 61.3 tokens/s, the slowest in this group.

Cost has to be considered in two scenarios, and one of them does favor TPU.

Cost per million tokens for Llama 3.3 70B at peak throughput. TPU v6e is $0.62, close to H100’s $0.67 to $0.69.

At peak throughput, the cost per million tokens for TPU v6e is $0.62, close to H100’s $0.67 to $0.69, and cheaper than MI300X’s $0.90 and B200 running vLLM at $1.63. If the workload is offline, large-batch, and can keep the chips fed, TPU’s economics work out.
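For reference, this is roughly how a per-million-token figure falls out of a rental price and a sustained throughput. It is a minimal sketch that assumes cost is simply the hourly price divided by the tokens produced in an hour; Artificial Analysis’ exact methodology may differ. The `cost_per_million` name is mine, and the $10/hour price and both throughputs are placeholders, not measured values.

```shell
# cost_per_million PRICE_PER_HOUR TOKENS_PER_SECOND
# Assumed formula: price / (tokens_per_second * 3600) * 1e6.
cost_per_million() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.2f\n", p / (t * 3600) * 1000000 }'
}

# Placeholder inputs: a system rented at $10/hour (hypothetical).
echo "at 5000 tokens/s: \$$(cost_per_million 10 5000) per million tokens"
echo "at 1000 tokens/s: \$$(cost_per_million 10 1000) per million tokens"
```

With the price fixed, cost scales inversely with sustained throughput, so anything that lowers the throughput a system can hold raises the cost per token by the same factor.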

But online serving rarely runs at peak throughput. Once you require a usable interactive speed, for example the reference speed of 30 tokens/s per query, TPU v6e’s unit cost jumps to $5.13, while H100 is $1.06. This is where the roughly 5x gap Artificial Analysis mentions comes from.
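The multiples are easy to check by hand from the per-million-token prices quoted above. A small shell sketch (the `ratio` helper is a hypothetical name of mine, not part of any Artificial Analysis tooling):

```shell
# ratio A B: print A / B rounded to two decimals (hypothetical helper).
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Cost per million tokens at 30 output tokens/s per query, as quoted in the text.
h100=1.06
mi300x=2.24
tpu_v6e=5.13

echo "TPU v6e vs H100: $(ratio "$tpu_v6e" "$h100")x"
echo "MI300X vs H100:  $(ratio "$mi300x" "$h100")x"
```

This comes out at about 4.84x and 2.11x, consistent with the "roughly 5x" and "roughly 2x" in the Artificial Analysis summary.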

Later we’ll run the benchmark ourselves to get first-hand data, and compare then.

# How to Spot AI-Generated Junk Articles

Just follow one principle: any data, any number, must have a clickable source and a reproducible method.

Claude Code Asks to Re-login in tmux? It's Probably macOS Keychain

junyi.h@comp.nus.edu.sg (Junyi Hou) — Fri, 03 Apr 2026 00:00:00 +0800

claude works fine in my terminal — Team Account, auth, everything. Open tmux, run claude, and it wants me to log in again. Took a bit of digging to figure out it’s a macOS Security Session thing.

# What happens

claude in Ghostty: works. claude in tmux: asks to re-login.

How I tracked it down

## Environment variables?

First thought: tmux has different env vars.

env | grep -iE 'claude|anthropic'

Nothing on either side. Claude Code doesn’t store auth in env vars.

## Config files?

cat ~/.claude/.credentials.json 2>/dev/null || echo "no credentials file"

Nope. Not on disk either.

## Keychain

security dump-keychain 2>&1 | grep -iE -A3 'claude|anthropic'

Found it — stored in macOS Keychain as Claude Code-credentials:

0x00000007 <blob>="Claude Code-credentials"
"acct"<blob>="junyi"
"svce"<blob>="Claude Code-credentials"
keychain: "/Users/junyi/Library/Keychains/login.keychain-db"

Then:

security find-generic-password -s "Claude Code-credentials" -a "junyi" -w

Works in the terminal, errors in tmux. There it is.
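To repeat this check in both places without printing the secret, the same `security` call can be wrapped in a small function. This is a sketch that assumes the service name found above; `keychain_ok` is a made-up helper name, and the account defaults to the current user.

```shell
# keychain_ok SERVICE [ACCOUNT]: succeed if the generic-password item is
# readable from this shell. Output is discarded so the secret is never shown.
keychain_ok() {
  security find-generic-password -s "$1" -a "${2:-$(id -un)}" -w >/dev/null 2>&1
}

if keychain_ok "Claude Code-credentials"; then
  echo "Keychain reachable from this shell"
else
  echo "Keychain NOT reachable from this shell"
fi
```

Run it once in a plain terminal window and once inside tmux; if the two answers differ, the shell’s session is the likely culprit.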

# Why this happens

macOS Keychain access is tied to a Security Session (Bootstrap Namespace). When you open a terminal window, the process gets attached to your Aqua session, which holds the Keychain unlock state.

tmux server is a long-running daemon — it starts on your first tmux new and stays alive. When you tmux attach later, shells forked inside tmux server inherit the server’s original session, not your current Aqua session.

Normal terminal:
Ghostty → fork shell → inherits Aqua session → ✅ Keychain works
tmux:
Ghostty → attach → tmux server (old process)
→ fork shell → inherits old session
→ ❌ Keychain blocked

Same reason pbcopy/pbpaste break inside tmux.
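One quick hint that you are talking to an old server is its start time. The sketch below prints when the tmux server process was launched, using tmux’s `#{pid}` format variable (the server PID) and `ps -o lstart=`. The function name is mine, and a stale start time only suggests the problem; it does not prove which session the server belongs to.

```shell
# tmux_server_started: print when the tmux server process was launched.
# Must run inside tmux; #{pid} is the tmux format variable for the server PID.
tmux_server_started() {
  if [ -z "${TMUX:-}" ]; then
    echo "not inside tmux" >&2
    return 1
  fi
  ps -o lstart= -p "$(tmux display-message -p '#{pid}')"
}

tmux_server_started || true
```

If the date shown is from days ago, every shell in that tmux session was forked from a process that predates your current terminal window.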

# Fix

Install reattach-to-user-namespace to reconnect tmux processes to your Aqua session:

brew install reattach-to-user-namespace

Add to ~/.tmux.conf:

set-option -g default-command "reattach-to-user-namespace -l ${SHELL}"
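If you manage dotfiles with a script, the line can be added idempotently so that re-running the setup does not duplicate it. A minimal sketch; `ensure_line` is a hypothetical helper, not something tmux or Homebrew provides.

```shell
# ensure_line FILE LINE: append LINE to FILE unless an identical line exists.
ensure_line() {
  touch "$1"
  # -x matches whole lines only, -F treats LINE as a fixed string, not a regex.
  grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}
```

For the line above, the call would be `ensure_line "$HOME/.tmux.conf" 'set-option -g default-command "reattach-to-user-namespace -l ${SHELL}"'`. The single quotes keep `${SHELL}` literal so that tmux expands it rather than the calling shell, and running the call a second time leaves the file unchanged.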

Then kill the entire tmux server (kill-session won’t help — new windows still fork from the old server):

tmux kill-server
tmux new -s main

Verify:

security find-generic-password -s "Claude Code-credentials" -a "$USER" -w 2>&1 | head -c 50

If it prints the first characters of the token instead of an error, you’re good.

# While you’re at it

These tmux issues are probably the same root cause:

- pbcopy/pbpaste broken
- ssh-agent inaccessible
- osascript failing
- gh auth, op, and other Keychain-dependent CLI tools not working

If they are, reattach-to-user-namespace should fix them as well.

Agent Mesh: My Thoughts on Multi-AI Collaborative Workflows

junyi.h@comp.nus.edu.sg (Junyi Hou) — Wed, 11 Mar 2026 04:29:00 +0800

AI coding assistants are everywhere now, from chat windows to CLI Agents. After using a bunch of them, my biggest takeaway is: no single Agent is the best at everything.

I mainly use Claude Code, Google Gemini CLI, and Moonshot Kimi Code day-to-day. At first I treated them as standalone tools, but soon realized that was a waste — each has distinct strengths, and if you let them collaborate and delegate tasks based on those strengths, the results are much better. I call this approach Agent Mesh.

Here are my observations and thoughts.

# Core Roles: My Observations and Real Experience

## Claude Code: Top-Tier Architect and Execution Engine

In daily use, Claude Code feels like a “senior engineer.” Whether spinning up a new project from scratch or tackling a complex refactor, its architectural vision is the best of the bunch. And it doesn’t just plan — it executes: traversing files, running bash, fixing its own compile errors, driving the whole loop autonomously. I’ve installed Claude Code on all my servers. It has massively boosted my productivity and freed up a lot of my energy. The weakness is hallucinations — especially when generating large blocks of documentation or referencing large systems it hasn’t fully digested, it will make things up.

## Gemini CLI: Large-Context Consumer and “Hallucination Checker”

Gemini’s biggest advantage is that 2-million-token context window. In practice, it really can swallow an entire repository or very long error logs (I’ve tested with logs of several hundred KB). Because it can hold the entire macro context in its head, it’s a natural auditor. When Claude Code produces a brilliant but potentially hallucinated code plan, you can call on Gemini for large-scale cross-file verification.

## Kimi Code: Deep Reasoner and Data Miner (❓ Pending Evaluation)

Kimi Code takes a completely different path. Under the hood it runs K2.5’s “long thinking” model (k0), using explicit chain-of-thought (similar to OpenAI o1’s approach), plus an “Agentic Swarm” pattern that can theoretically spin up a swarm of sub-Agents for multi-hop document retrieval or grinding through deeply nested algorithmic problems. I do have a question mark here though: I use Claude and Gemini every day, so I have a solid feel for their capability boundaries, but how much better Kimi actually is at “data mining” and “deep reasoning” compared to the other two — honestly, I’m mostly going by its architectural claims and haven’t done rigorous head-to-head evaluations yet. So for now, its niche here is only a theoretical one.

# What My Ideal Agent Mesh Looks Like

Once you understand each Agent’s strengths and weaknesses, the natural next question is: can you have them collaborate, chaining tasks via prompts? That’s what I mean by Agent Mesh.

Most of the time, Claude Code is the workhorse — it writes the vast majority of the code. But when it hits a context ceiling, say I need to refactor a module but don’t know which parts of a 500K-line legacy codebase will be affected, I hand it off to Gemini — let it read the entire repo and report back the blast radius.

Conversely, after Claude executes a refactoring plan, I’ll have Gemini cross-reference it against the actual project files to catch hallucinations or broken dependencies. If Gemini encounters a particularly tough algorithmic problem during auditing, it can pass it to Kimi Code for extended reasoning.

After Kimi finishes thinking, it doesn’t touch files directly — it passes conclusions back to Claude Code for implementation. The whole flow is a loop: each does what it’s best at, and results flow between them.
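The delegation rules described above can be written down as a tiny routing table. This is only an illustrative sketch: the task labels and the `pick_agent` function are invented for this example, and it prints an agent name rather than invoking any real CLI.

```shell
# pick_agent TASK: print which agent this workflow would hand TASK to.
# Task labels and agent names are illustrative, not real CLI arguments.
pick_agent() {
  case "$1" in
    implement|refactor) echo "claude" ;;  # writes and edits code
    blast-radius|audit) echo "gemini" ;;  # whole-repo reading and verification
    deep-reasoning)     echo "kimi"   ;;  # extended reasoning, no file edits
    *)                  echo "claude" ;;  # default workhorse
  esac
}

for task in refactor audit deep-reasoning; do
  printf '%-15s -> %s\n' "$task" "$(pick_agent "$task")"
done
```

In practice the hand-off is done through prompts rather than a lookup, but spelling the mapping out makes it clear that Kimi’s conclusions always route back to Claude Code for the actual file changes.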

# The Hidden Factor: Agent Architecture vs. Base Model

When trying to combine these Agents, I discovered a point that’s easy to overlook: the underlying base model is only half the equation — the “Agent architecture built around it” determines its actual capabilities.

The best example is Anthropic’s Claude 3.5 Sonnet. You can use this same base model in multiple environments:

- As a standard chat assistant in VS Code.
- As the driving engine inside Cursor IDE.
- Through open-source Agent frameworks like OpenHands / Open Code.
- Natively via the Claude Code CLI.

Despite sharing the exact same LLM DNA (Claude 3.5 Sonnet), their real-world capabilities differ dramatically. In demanding engineering workflows, native Claude Code consistently outperforms the others in Agent execution. Why? Because the Agent architecture has been uniquely optimized by the model’s creators (including but not limited to: hidden loop mechanisms, error recovery strategies, and specific context engineering).

Even with excellent open-source alternatives like Open Code running the Claude model, the execution fluidity and architectural vision often fall short of Claude Code. This tells us that when choosing Agents for a Mesh, we’re not just picking a base model (like Sonnet or GPT-4o); we’re picking the engineering scaffolding built around it.

# Similar Explorations in the Community

After writing this up, I found that others in the community are working on similar ideas. Humanize is a Claude Code plugin that implements a workflow called RLCR (Ralph-Loop with Codex Review): Claude writes the code, Codex independently reviews it, and if issues are found, it gets kicked back for rework, looping until everything passes. The core idea is the same as Agent Mesh: let architecturally different AIs each do what they’re best at, iterating through a work-feedback loop instead of expecting one model to nail everything in a single shot.

# Final Thoughts

I don’t think a single “do-it-all” model will emerge to handle everything. What’s truly interesting is orchestration — having Agents with fundamentally different underlying architectures each play to their strengths.

One view I’ve consistently held: when we say “different” agents, we do not mean taking the same underlying model and wrapping it with different system prompts (e.g., role-playing “you are an architect” vs. “you are a tester”). True collaborative sparks only happen when agents have fundamental differences in how they process information at the architectural level.

But how do you know whether two models are truly different, or just fine-tuned versions of the same base? This is where tools like LLM-DNA (a research framework for analyzing evolutionary relationships and functional differences between language models, published in ICLR'26) become invaluable. By analyzing the “genetic” lineage and functional distance between models, we can deliberately select Agents from entirely different evolutionary branches, ensuring they don’t share the same blind spots.

My hands-on engineering experience tells me: the Claude Code + Gemini CLI synergy is currently the most powerful and intuitive combination. Gemini CLI leverages its long context to swallow entire repos and stacks of logs, specializing in catching hallucinations and macro-level auditing. Claude Code focuses on what it does best — understanding code architecture and executing precisely in the local environment.

When you deliberately combine Agents with fundamentally different architectures, they stop overlapping and start complementing each other. That’s the Agent Mesh idea I wanted to share — hope it inspires some of you.