#
Abstract
We benchmark LLM inference on one Google TPU v6e-4 host (four chips, one VM) with four chips in one VM. We use vLLM 0.20.0 with the tpu-inference backend and an fp8 KV cache.
We test three Qwen3 models:
| Model | Type | Params | Parallelism | Chips |
|---|---|---|---|---|
| Qwen3.5-4B | dense | 4B active | tp1 | 1 |
| Qwen3-30B-A3B | MoE | 30B total / 3B active | tp4 | 4 |
| Qwen3-32B | dense | 32B active | tp4 | 4 |
We measure three parts of inference: prefill, decode, and end-to-end online serving.

