Benchmarking Local Claude Code-Compatible Agents on M5 Max with MLX and Qwen3.6
A practical local-agent benchmark using a Claude Code-compatible Anthropic API bridge, Qwen3.6 MLX artifacts, prompt caching, and held-out TypeScript validation tests.
TL;DR
I built an Anthropic-compatible local proxy for Claude Code and backed it with MLX. The proxy used in this benchmark is anthropic-api-to-mlx at commit ed74457. The controlled TypeScript benchmark is local-agent-bench.
The main result is practical rather than philosophical: with prompt caching enabled, local Claude Code-style agent loops are usable on an M5 Max. In this controlled TypeScript benchmark, Qwen3.6 35B A3B 4-bit class artifacts were much faster than dense 27B artifacts, with median decode throughput around 108-113 tokens/s versus roughly 27-29 tokens/s for 27B.
I would not read this as a model-quality leaderboard. The benchmark has only four tasks with three repeats each, and only one of those tasks discriminates between models. The robust signal is speed and local-agent feasibility. The pass counts are included as context, not as a quality ranking.
What This Shows
This benchmark shows local Claude Code-compatible agent throughput on an M5 Max using MLX and Qwen3.6.
It does not establish a general model-quality ranking. The clearest result is that Qwen3.6 35B A3B was much faster than dense 27B in this setup.
Motivation
Claude Code is a useful interface for coding-agent work, but it expects an Anthropic-style HTTP API. MLX gives me a fast local runtime on Apple Silicon, but it does not expose a Claude Code-compatible endpoint. The missing piece is a bridge.
The goal of this experiment was not to prove that one model is generally smarter than another. I wanted to know whether a local M5 Max setup can run the kind of repeated read-edit-test loop that Claude Code performs, and what breaks first: throughput, context handling, or tool formatting.
Proxy Architecture
The path is intentionally simple:
Claude Code -> Anthropic-compatible /v1/messages -> anthropic-api-to-mlx -> mlx_lm -> local MLX model
The proxy implements the small subset needed for these experiments: POST /v1/messages, streaming responses, POST /v1/messages/count_tokens, model listing, and health checks. Tool parsing is best-effort because Qwen does not emit Anthropic-native tool calls. The proxy handles common XML and JSON-ish tool-call shapes that appear under Claude Code prompts.
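As a rough illustration of the best-effort parsing described above, here is a minimal sketch that converts one JSON-in-tags shape into Anthropic-style tool_use content blocks. The `<tool_call>` tag, the `name` and `arguments` keys, the `parseToolCalls` helper, and the `toolu_local_` ID prefix are assumptions made for this example; the proxy's actual parser may handle different or additional shapes.

```typescript
// Anthropic-style tool_use content block.
interface ToolUseBlock {
  type: "tool_use";
  id: string;
  name: string;
  input: Record<string, unknown>;
}

// Hypothetical helper: extract <tool_call>{...}</tool_call> spans from model text.
// The tag name and JSON keys are assumptions about one common output shape.
function parseToolCalls(text: string): ToolUseBlock[] {
  const blocks: ToolUseBlock[] = [];
  const pattern = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const body = match[1] ?? "";
    try {
      const parsed: unknown = JSON.parse(body);
      if (typeof parsed !== "object" || parsed === null) continue;
      const obj = parsed as Record<string, unknown>;
      const toolName = obj["name"];
      if (typeof toolName !== "string") continue;
      const args = obj["arguments"];
      const input =
        typeof args === "object" && args !== null && !Array.isArray(args)
          ? (args as Record<string, unknown>)
          : {};
      blocks.push({
        type: "tool_use",
        id: `toolu_local_${blocks.length}`, // made-up ID scheme for the sketch
        name: toolName,
        input,
      });
    } catch {
      // Best-effort: skip spans that are not valid JSON instead of failing the turn.
      continue;
    }
  }
  return blocks;
}
```

Skipping malformed spans rather than throwing keeps the agent loop alive when the model emits a half-formed call, at the cost of silently dropping that call.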
Qwen reasoning was enabled with MLX_ENABLE_THINKING=1. The proxy strips <think>...</think> blocks before returning content to Claude Code, because provider-specific reasoning text inside normal content can interfere with the agent loop.
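The stripping step can be sketched as a small text filter. This is a minimal version for fully buffered text; the `stripThinking` name is hypothetical, and dropping an unterminated trailing block is an assumption of this sketch rather than confirmed proxy behavior. Streaming responses would need a stateful variant.

```typescript
// Hypothetical helper: remove <think>...</think> spans from a complete response.
function stripThinking(text: string): string {
  return text
    // Remove every closed reasoning block (non-greedy, spans newlines).
    .replace(/<think>[\s\S]*?<\/think>/g, "")
    // Assumption: an unterminated block (e.g. output cut off by a token limit)
    // is dropped up to the end of the text.
    .replace(/<think>[\s\S]*$/, "")
    .trim();
}
```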
Prompt Caching
Prompt caching is not a quality feature here. It is an execution feature. Claude Code repeatedly sends long, highly overlapping conversations. Without cache reuse, the agent loop becomes painful.
The proxy keeps bounded in-memory MLX prompt-cache slots. For each request, it renders the chat template with the tokenizer and only accepts a cache hit when the current token IDs start with the cached token IDs exactly. The request body remains the source of truth. A cache hit changes speed, not semantics.
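The exact-prefix rule can be sketched as follows. The `CacheSlot` and `CacheHit` types, the function names, and the choice to prefer the longest matching slot are assumptions for illustration; only the "current token IDs must start with the cached token IDs" check comes from the description above.

```typescript
interface CacheSlot {
  tokenIds: number[]; // token IDs whose KV state this slot holds
}

interface CacheHit {
  slot: CacheSlot;
  cachedTokens: number; // tokens reused from the slot
  restTokens: number;   // tokens that still need prefill
}

// Returns the number of reusable tokens, or 0 unless `current` starts with
// `cached` exactly. A partial overlap is treated as a miss.
function reusablePrefixLength(cached: number[], current: number[]): number {
  if (cached.length === 0 || cached.length > current.length) return 0;
  for (let i = 0; i < cached.length; i++) {
    if (cached[i] !== current[i]) return 0;
  }
  return cached.length;
}

// Assumption: when several slots match, reuse the longest one.
function pickSlot(slots: CacheSlot[], current: number[]): CacheHit | null {
  let best: CacheHit | null = null;
  for (const slot of slots) {
    const n = reusablePrefixLength(slot.tokenIds, current);
    if (n > 0 && (best === null || n > best.cachedTokens)) {
      best = { slot, cachedTokens: n, restTokens: current.length - n };
    }
  }
  return best;
}
```

Because the comparison runs on token IDs rendered from the full request, a single changed token anywhere in the cached span forces a miss, which is what keeps a hit from changing semantics.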
In the run logs, typical completed runs show two initial misses followed by hits for later Claude Code turns. The recorded metrics include cache hits and misses, cached/rest token counts, prefill throughput, decode throughput, and peak memory.
Hardware and Setup
The benchmark was run locally on an M5 Max system. The agent was Claude Code pointed at the local Anthropic-compatible endpoint. The backend was MLX through mlx_lm. The main settings were:
- Apple M5 Max with 128 GB unified memory
- macOS 26.3.1 (build 25D771280a)
- Python 3.12.13, MLX 0.31.2, mlx-lm 0.31.3
- Claude Code 2.1.126
- benchmark proxy commit ed74457
- MLX_ENABLE_THINKING=1
- PROMPT_CACHE=1
- MLX_MAX_KV_SIZE=196608
- MLX_MAX_TOKENS=2048 for the first repeat set, then 8192 for completion repeats
- KV cache quantization was not used in the main matrix
Model Artifacts and Revisions
Artifact identity matters. The main comparison below is MLX-only, and the revisions are recorded so the benchmark can be rerun against the same model artifacts.
| Artifact | Repository | Revision |
|---|---|---|
| Qwen3.6 27B 4-bit | mlx-community/Qwen3.6-27B-4bit | c000ac2c2057d94be3fa931000c31723aac53282 |
| Qwen3.6 27B NVFP4 | mlx-community/Qwen3.6-27B-nvfp4 | 4d2ab1eda95532602a12db325c8a3ea9f362a6f5 |
| Qwen3.6 27B MXFP4 | mlx-community/Qwen3.6-27B-mxfp4 | b4ac73652af626fd653d0e4fe48a4e9c468a82f6 |
| Qwen3.6 35B A3B 4-bit | mlx-community/Qwen3.6-35B-A3B-4bit | 38740b847e4cb78f352aba30aa41c76e08e6eb46 |
| Qwen3.6 35B A3B NVFP4 | mlx-community/Qwen3.6-35B-A3B-nvfp4 | 9c1a3a223ddd8a3425212cc421056614f149cf0f |
| Qwen3.6 35B A3B MXFP4 | mlx-community/Qwen3.6-35B-A3B-mxfp4 | 833013b27a1f7c6dbb008b55d37c387ea22ea89d |
Benchmark Protocol
The benchmark project, local-agent-bench, is a controlled TypeScript task suite. It is useful for local-agent smoke testing, tool behavior, prompt-cache behavior, and throughput. It is not a full coding benchmark.
Each Claude Code run works in a worktree under /tmp/bench-work that excludes held-out validation tests. Those tests are public in the benchmark repository for reproducibility, but hidden from the agent workdir during each run. Evaluation is applied externally from /tmp/bench-eval. Public tests are development feedback for the agent; held-out validation tests are the scoring signal.
The recorded fields include:
- public and held-out validation pass/fail
- wall time, turns, and tool calls
- cache hits and misses
- average prefill and decode throughput
- peak memory
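One way to picture a per-run record and its roll-up into a summary row is sketched below. All field names, the `RunRecord` type, and the `summarize` helper are hypothetical; the benchmark's real schema and aggregation may differ, including how its medians are computed.

```typescript
// Hypothetical per-run record mirroring the fields listed above.
interface RunRecord {
  model: string;
  task: string;
  publicPass: boolean;
  heldOutPass: boolean;
  wallTimeS: number;
  turns: number;
  toolCalls: number;
  cacheHits: number;
  cacheMisses: number;
  avgPrefillTps: number;
  avgDecodeTps: number;
  peakMemoryGb: number;
}

function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const hi = sorted[mid] ?? NaN;
  const lo = sorted[mid - 1] ?? NaN;
  // Odd count: middle element. Even count: mean of the two middle elements.
  return sorted.length % 2 === 1 ? hi : (lo + hi) / 2;
}

// Assumption: scoring uses the held-out result only; public tests are feedback.
function summarize(runs: RunRecord[]): {
  pass: string;
  medianWallTimeS: number;
  medianDecodeTps: number;
} {
  const passed = runs.filter((r) => r.heldOutPass).length;
  return {
    pass: `${passed}/${runs.length}`,
    medianWallTimeS: median(runs.map((r) => r.wallTimeS)),
    medianDecodeTps: median(runs.map((r) => r.avgDecodeTps)),
  };
}
```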
Main MLX 4-bit Results
The main matrix covers six MLX model/quant combinations, four TypeScript tasks, and three repeats per task, for 72 runs.
| Model slug | Pass | Avg wall time (s) | Median wall time (s) | Avg prefill (t/s) | Median decode (t/s) | Avg peak memory (GB) |
|---|---|---|---|---|---|---|
| 27b-4bit | 11/12 | 145.9 | 115.5 | 370.2 | 28.0 | 34.1 |
| 27b-mxfp4 | 11/12 | 129.4 | 110.5 | 380.4 | 29.1 | 27.8 |
| 27b-nvfp4 | 10/12 | 119.7 | 115.0 | 392.3 | 27.2 | 28.4 |
| 35b-a3b-4bit | 9/12 | 31.4 | 28.5 | 1256.2 | 113.2 | 26.8 |
| 35b-a3b-mxfp4 | 9/12 | 30.3 | 28.0 | 1349.1 | 113.4 | 24.4 |
| 35b-a3b-nvfp4 | 9/12 | 36.8 | 36.5 | 1245.3 | 107.9 | 25.6 |
The speed result is the cleanest signal. The 35B A3B artifacts finished in about 30-37 seconds on average, while the 27B artifacts were around 120-146 seconds. Decode throughput was also much higher for 35B A3B in this MLX setup.
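To put a rough number on the gap, the sketch below averages the per-model figures from the results table within each family and takes the ratio. Averaging already-aggregated values is a coarse summary chosen for this example, not a statistic the benchmark itself reports.

```typescript
// Values copied from the results table: one entry per quantization.
const wall27b = [145.9, 129.4, 119.7];   // avg wall time (s), dense 27B
const wall35b = [31.4, 30.3, 36.8];      // avg wall time (s), 35B A3B
const decode27b = [28.0, 29.1, 27.2];    // median decode (t/s), dense 27B
const decode35b = [113.2, 113.4, 107.9]; // median decode (t/s), 35B A3B

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const wallTimeRatio = mean(wall27b) / mean(wall35b);
const decodeRatio = mean(decode35b) / mean(decode27b);

// logs "4.0 4.0"
console.log(wallTimeRatio.toFixed(1), decodeRatio.toFixed(1));
```

Both ratios come out at roughly 4x in favor of 35B A3B in this setup.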
The quality signal is weaker. All models passed tasks 01-03 in all repeats. Task04 dominated the pass-rate difference, so I would not use this table to claim a general model-quality ranking.
| Model slug | 01 Bugfix | 02 Feature | 03 Refactor | 04 From scratch |
|---|---|---|---|---|
| 27b-4bit | 3/3 | 3/3 | 3/3 | 2/3 |
| 27b-mxfp4 | 3/3 | 3/3 | 3/3 | 2/3 |
| 27b-nvfp4 | 3/3 | 3/3 | 3/3 | 1/3 |
| 35b-a3b-4bit | 3/3 | 3/3 | 3/3 | 0/3 |
| 35b-a3b-mxfp4 | 3/3 | 3/3 | 3/3 | 0/3 |
| 35b-a3b-nvfp4 | 3/3 | 3/3 | 3/3 | 0/3 |
Task04 Failure Analysis
Task04 asks the agent to implement a fixed-window rate limiter. Public tests allowed some rolling-window implementations to pass. Held-out validation tests caught the boundary bug.
The wrong implementation pattern was:
resetAt = nowMs + windowMs
The fixed-window boundary should be computed from the window bucket:
resetAt = Math.floor(nowMs / windowMs) * windowMs + windowMs
This is exactly the kind of bug held-out validation tests are supposed to expose. Public tests checked enough behavior for development feedback, but not enough to distinguish rolling-window semantics from fixed-window semantics.
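A short worked example makes the difference concrete. The two function names are hypothetical and the timestamps are chosen for illustration; the formulas are the two shown above.

```typescript
// Rolling-style reset: drifts with the time of the request.
function rollingResetAt(nowMs: number, windowMs: number): number {
  return nowMs + windowMs;
}

// Fixed-window reset: the end of the bucket that contains nowMs.
function fixedResetAt(nowMs: number, windowMs: number): number {
  return Math.floor(nowMs / windowMs) * windowMs + windowMs;
}

const windowMs = 1000;

// Mid-window request: the two formulas disagree.
console.log(rollingResetAt(1500, windowMs)); // 2500
console.log(fixedResetAt(1500, windowMs));   // 2000

// Request exactly on a window boundary: the two formulas agree.
console.log(rollingResetAt(1000, windowMs)); // 2000
console.log(fixedResetAt(1000, windowMs));   // 2000
```

The two formulas coincide whenever the request lands exactly on a window boundary, so a test that only uses window-aligned timestamps cannot tell them apart. That is one way a public suite could let the rolling version through; whether the public tests here did so for that reason is not something this sketch establishes.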
In this benchmark, 27B variants solved task04 one to two times out of three. The 35B A3B variants solved it zero times out of three. That is a useful observation, but it is still only one task. I include it to explain the pass counts, not to rank the models.
Limitations
- The task count is small: four TypeScript tasks, with one discriminator task.
- Three repeats are better than one, but these results are still preliminary.
- Some run diffs include package-lock.json noise from npm install.
- Tool parsing is best-effort and model-dependent.
- The benchmark uses held-out validation, but it is not designed to support broad quality claims.
- Artifacts must be referenced by repository URL and revision; local directory names are not enough.
- These results should not be used to claim that MXFP4 is generally better than NVFP4.
- KV-cache quantization was intentionally excluded from the main comparison.
Reproducibility
The benchmark code, tasks, protocol, and generated summaries are public in local-agent-bench. The most relevant files are:
- results/PRIMARY_MATRIX_20260504.md
- results/SUMMARY.md
- BENCH_PROTOCOL.md
- eval/run-mlx-4bit-complete-matrix.sh
The proxy source is anthropic-api-to-mlx, and the benchmark proxy commit was ed74457.
Related source repos
    git clone https://github.com/audexdev/local-agent-bench
    cd local-agent-bench
    cat BENCH_PROTOCOL.md
    cat results/PRIMARY_MATRIX_20260504.md
FAQ
Can Claude Code run locally with MLX?
Claude Code can use a local MLX-backed model through an Anthropic-compatible proxy. This benchmark used anthropic-api-to-mlx to map Claude Code requests to mlx_lm and Qwen3.6.
Was Qwen3.6 35B-A3B faster than dense 27B?
Yes, in this controlled TypeScript benchmark on an M5 Max. The 35B-A3B variants averaged about 30-37 seconds per run, while the dense 27B variants averaged about 120-146 seconds.
Is this a model-quality benchmark?
No. The benchmark is a local-agent throughput and usability note. It includes pass counts for context, but it is not a definitive model-quality leaderboard.
Did the main benchmark use KV-cache quantization?
No. KV-cache quantization was intentionally excluded from the main matrix. The main runs used MLX model quantizations only.
What made prompt caching important?
Claude Code sends long, overlapping conversations across turns. Exact token-prefix prompt caching avoids repeated full-prefill work and makes the local agent loop much more usable.
Conclusion
This benchmark answers a narrow practical question: can a local Claude Code-compatible loop run comfortably through MLX on an M5 Max? In this setup, yes, as long as prompt caching is working.
The main thing I would take from the numbers is speed. Qwen3.6 35B A3B was much faster than dense 27B in this local MLX agent workflow, and the difference was large enough to change how usable the loop felt. That is the result I am comfortable publishing from this run.