Preliminary | M5 Max

Benchmarking Local Claude Code-Compatible Agents on M5 Max with MLX and Qwen3.6

A practical local-agent benchmark using a Claude Code-compatible Anthropic API bridge, Qwen3.6 MLX artifacts, prompt caching, and held-out TypeScript validation tests.

TL;DR

I built an Anthropic-compatible local proxy for Claude Code and backed it with MLX. The proxy is anthropic-api-to-mlx, pinned at commit ed74457 for this benchmark. The controlled TypeScript benchmark is local-agent-bench.

The main result is practical rather than philosophical: with prompt caching enabled, local Claude Code-style agent loops are usable on an M5 Max. In this controlled TypeScript benchmark, the Qwen3.6 35B A3B artifacts (4-bit, NVFP4, and MXFP4) were much faster than the dense 27B artifacts, with median decode throughput around 108-113 tokens/s versus roughly 27-29 tokens/s for 27B.

I would not read this as a model-quality leaderboard. The benchmark has only four tasks, three repeats, and one discriminator task. The robust signal is speed and local-agent feasibility. The pass counts are included only as context.

What This Shows

This benchmark shows local Claude Code-compatible agent throughput on an M5 Max using MLX and Qwen3.6.

It does not establish a general model-quality ranking. The clearest result is that Qwen3.6 35B A3B was much faster than dense 27B in this setup.

Motivation

Claude Code is a useful interface for coding-agent work, but it expects an Anthropic-style HTTP API. MLX gives me a fast local runtime on Apple Silicon, but it does not expose a Claude Code-compatible endpoint. The missing piece is a bridge.

The goal of this experiment was not to prove that one model is generally smarter than another. I wanted to know whether a local M5 Max setup can run the kind of repeated read-edit-test loop that Claude Code performs, and what breaks first: throughput, context handling, or tool formatting.

Proxy Architecture

The path is intentionally simple:

Claude Code
  -> Anthropic-compatible /v1/messages
  -> anthropic-api-to-mlx
  -> mlx_lm
  -> local MLX model

The proxy implements the small subset needed for these experiments: POST /v1/messages, streaming responses, POST /v1/messages/count_tokens, model listing, and health checks. Tool parsing is best-effort because Qwen does not emit Anthropic-native tool calls. The proxy handles common XML and JSON-ish tool-call shapes that appear under Claude Code prompts.
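As an illustration of what best-effort tool parsing can look like, here is a minimal sketch. It assumes one common shape, a JSON object wrapped in <tool_call> tags, and converts it into an Anthropic-style tool_use content block. The function name parseToolCalls, the generated id format, and the choice to skip malformed calls silently are hypothetical and are not taken from anthropic-api-to-mlx.

```typescript
// Anthropic-style tool_use content block, as expected by Claude Code.
interface ToolUseBlock {
  type: "tool_use";
  id: string;
  name: string;
  input: Record<string, unknown>;
}

// Assumed model output shape (hypothetical for this sketch):
//   <tool_call>{"name": "Read", "arguments": {"file_path": "a.ts"}}</tool_call>
function parseToolCalls(text: string): ToolUseBlock[] {
  const blocks: ToolUseBlock[] = [];
  const pattern = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
  let match: RegExpExecArray | null;
  let index = 0;
  while ((match = pattern.exec(text)) !== null) {
    try {
      const parsed = JSON.parse(match[1]);
      if (parsed === null || typeof parsed.name !== "string") continue;
      // Accept either "arguments" or "input" as the argument key.
      const args = parsed.arguments ?? parsed.input ?? {};
      const isPlainObject =
        typeof args === "object" && args !== null && !Array.isArray(args);
      blocks.push({
        type: "tool_use",
        id: `toolu_local_${index++}`, // hypothetical id scheme
        name: parsed.name,
        input: isPlainObject ? args : {},
      });
    } catch {
      // Best-effort: skip malformed JSON instead of failing the whole turn.
    }
  }
  return blocks;
}
```

A real proxy would need additional branches for the other XML and JSON-ish shapes mentioned above; this sketch covers only the single assumed format.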

Qwen reasoning was enabled with MLX_ENABLE_THINKING=1. The proxy strips <think>...</think> blocks before returning content to Claude Code, because provider-specific reasoning text inside normal content can interfere with the agent loop.
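The stripping step can be sketched as a small text filter. This is a minimal illustration for a complete, non-streamed response; the helper name stripThinking and the handling of an unterminated block are assumptions, not the proxy's actual code.

```typescript
// Hypothetical helper: remove <think>...</think> reasoning blocks from model text.
function stripThinking(text: string): string {
  // Remove complete blocks, including multi-line ones (non-greedy match).
  const withoutClosed = text.replace(/<think>[\s\S]*?<\/think>/g, "");
  // Assumption: an unterminated <think> (for example, generation stopped at
  // the token limit) is dropped through to the end of the text.
  const withoutOpen = withoutClosed.replace(/<think>[\s\S]*$/, "");
  return withoutOpen.trim();
}
```

For streaming responses the same idea would need a small state machine, because a tag can be split across chunks.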

Prompt Caching

Prompt caching is not a quality feature here. It is an execution feature. Claude Code repeatedly sends long, highly overlapping conversations. Without cache reuse, every turn repeats the full prefill over the whole conversation, and the agent loop becomes painfully slow.

The proxy keeps bounded in-memory MLX prompt-cache slots. For each request, it renders the chat template with the tokenizer and only accepts a cache hit when the current token IDs start with the cached token IDs exactly. The request body remains the source of truth. A cache hit changes speed, not semantics.
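The exact-prefix rule can be sketched as follows. The CacheSlot shape and the function names are hypothetical; the sketch only shows the matching logic on token IDs, not how MLX stores or restores the cached state.

```typescript
// Hypothetical cache slot: token IDs of a previously processed prompt.
interface CacheSlot {
  id: number;
  tokenIds: number[];
}

interface CacheHit {
  slot: CacheSlot;
  cachedTokens: number; // tokens reused from the cache
  restTokens: number;   // tokens that still need prefill
}

// True only when `current` starts with every token of `cached`, in order.
function isExactPrefix(cached: number[], current: number[]): boolean {
  if (cached.length === 0 || cached.length > current.length) return false;
  for (let i = 0; i < cached.length; i++) {
    if (cached[i] !== current[i]) return false;
  }
  return true;
}

// Pick the slot with the longest exact prefix; null means a cache miss.
function findCacheHit(slots: CacheSlot[], current: number[]): CacheHit | null {
  let best: CacheSlot | null = null;
  for (const slot of slots) {
    if (!isExactPrefix(slot.tokenIds, current)) continue;
    if (best === null || slot.tokenIds.length > best.tokenIds.length) {
      best = slot;
    }
  }
  if (best === null) return null;
  return {
    slot: best,
    cachedTokens: best.tokenIds.length,
    restTokens: current.length - best.tokenIds.length,
  };
}
```

Comparing token IDs rather than raw strings is what keeps a hit semantically safe: a single differing token anywhere in the cached prefix turns the request into a miss.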

In the run logs, typical completed runs show two initial misses followed by hits for later Claude Code turns. The recorded metrics include cache hits and misses, cached/rest token counts, prefill throughput, decode throughput, and peak memory.

Hardware and Setup

The benchmark was run locally on an M5 Max system. The agent was Claude Code pointed at the local Anthropic-compatible endpoint. The backend was MLX through mlx_lm. The main settings were:

Apple M5 Max with 128 GB unified memory
macOS 26.3.1 (build 25D771280a)
Python 3.12.13, MLX 0.31.2, mlx-lm 0.31.3
Claude Code 2.1.126
benchmark proxy commit ed74457
MLX_ENABLE_THINKING=1
PROMPT_CACHE=1
MLX_MAX_KV_SIZE=196608
MLX_MAX_TOKENS=2048 for the first repeat set, then 8192 for completion repeats
KV cache quantization was not used in the main matrix

Model Artifacts and Revisions

Artifact identity matters. The main comparison below is MLX-only, and the revisions are recorded so the benchmark can be rerun against the same model artifacts.

Artifact	Repository	Revision
`Qwen3.6 27B 4-bit`	mlx-community/Qwen3.6-27B-4bit	`c000ac2c2057d94be3fa931000c31723aac53282`
`Qwen3.6 27B NVFP4`	mlx-community/Qwen3.6-27B-nvfp4	`4d2ab1eda95532602a12db325c8a3ea9f362a6f5`
`Qwen3.6 27B MXFP4`	mlx-community/Qwen3.6-27B-mxfp4	`b4ac73652af626fd653d0e4fe48a4e9c468a82f6`
`Qwen3.6 35B A3B 4-bit`	mlx-community/Qwen3.6-35B-A3B-4bit	`38740b847e4cb78f352aba30aa41c76e08e6eb46`
`Qwen3.6 35B A3B NVFP4`	mlx-community/Qwen3.6-35B-A3B-nvfp4	`9c1a3a223ddd8a3425212cc421056614f149cf0f`
`Qwen3.6 35B A3B MXFP4`	mlx-community/Qwen3.6-35B-A3B-mxfp4	`833013b27a1f7c6dbb008b55d37c387ea22ea89d`

Benchmark Protocol

The benchmark project, local-agent-bench, is a controlled TypeScript task suite. It is useful for local-agent smoke testing, tool behavior, prompt-cache behavior, and throughput. It is not a full coding benchmark.

Each Claude Code run works in a worktree under /tmp/bench-work that excludes held-out validation tests. Those tests are public in the benchmark repository for reproducibility, but hidden from the agent workdir during each run. Evaluation is applied externally from /tmp/bench-eval. Public tests are development feedback for the agent; held-out validation tests are the scoring signal.

The recorded fields include:

public and held-out validation pass/fail
wall time, turns, and tool calls
cache hits and misses
average prefill and decode throughput
peak memory
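To make the recorded fields concrete, here is a sketch of a per-run record and a small aggregation over it. All field and function names are hypothetical and are not the schema used by local-agent-bench; the summary scores on the held-out validation result, as described above.

```typescript
// Hypothetical per-run record mirroring the fields listed above.
interface RunRecord {
  model: string;
  task: string;
  publicPass: boolean;
  heldOutPass: boolean;
  wallTimeSec: number;
  turns: number;
  toolCalls: number;
  cacheHits: number;
  cacheMisses: number;
  avgPrefillTps: number;
  avgDecodeTps: number;
  peakMemoryGb: number;
}

function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Summarize all runs of one model; the pass count uses held-out validation only.
function summarize(runs: RunRecord[]) {
  const passed = runs.filter((r) => r.heldOutPass).length;
  const wallTimes = runs.map((r) => r.wallTimeSec);
  const totalWall = wallTimes.reduce((sum, v) => sum + v, 0);
  return {
    pass: `${passed}/${runs.length}`,
    avgWallTimeSec: runs.length > 0 ? totalWall / runs.length : NaN,
    medianWallTimeSec: median(wallTimes),
    medianDecodeTps: median(runs.map((r) => r.avgDecodeTps)),
  };
}
```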

Main MLX 4-bit Results

The main matrix covers six MLX model/quant combinations, four TypeScript tasks, and three repeats per task, for 72 runs.

Model slug	Pass	Avg wall time (s)	Median wall time (s)	Avg prefill (tokens/s)	Median decode (tokens/s)	Avg peak memory (GB)
`27b-4bit`	11/12	145.9	115.5	370.2	28.0	34.1
`27b-mxfp4`	11/12	129.4	110.5	380.4	29.1	27.8
`27b-nvfp4`	10/12	119.7	115.0	392.3	27.2	28.4
`35b-a3b-4bit`	9/12	31.4	28.5	1256.2	113.2	26.8
`35b-a3b-mxfp4`	9/12	30.3	28.0	1349.1	113.4	24.4
`35b-a3b-nvfp4`	9/12	36.8	36.5	1245.3	107.9	25.6

The speed result is the cleanest signal. The 35B A3B artifacts finished in about 30-37 seconds on average, while the 27B artifacts were around 120-146 seconds. Decode throughput was also much higher for 35B A3B in this MLX setup.
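A quick way to size the gap is to average each family's rows from the table above and take the ratio. This is rough arithmetic on the published table values, not a new measurement, and averaging per-artifact averages is a crude aggregate; the variable names are illustrative.

```typescript
// Values copied from the results table above (27b-4bit, 27b-mxfp4, 27b-nvfp4
// and 35b-a3b-4bit, 35b-a3b-mxfp4, 35b-a3b-nvfp4, in that order).
const avgWallTimeSec = {
  dense27b: [145.9, 129.4, 119.7],
  a3b35b: [31.4, 30.3, 36.8],
};
const medianDecodeTps = {
  dense27b: [28.0, 29.1, 27.2],
  a3b35b: [113.2, 113.4, 107.9],
};

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// How many times longer the 27B runs took, on average.
const wallTimeRatio = mean(avgWallTimeSec.dense27b) / mean(avgWallTimeSec.a3b35b);
// How many times higher the 35B A3B decode throughput was.
const decodeRatio = mean(medianDecodeTps.a3b35b) / mean(medianDecodeTps.dense27b);

console.log(`wall time ratio: ${wallTimeRatio.toFixed(1)}x`);
console.log(`decode ratio: ${decodeRatio.toFixed(1)}x`);
```

Both ratios come out at roughly 4x, which matches the per-row comparison in the table.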

The quality signal is weaker. All models passed tasks 01-03 in all repeats, so Task04 accounts for the entire pass-rate difference. I would not use this table to claim a general model-quality ranking.

Model slug	01 Bugfix	02 Feature	03 Refactor	04 From scratch
`27b-4bit`	3/3	3/3	3/3	2/3
`27b-mxfp4`	3/3	3/3	3/3	2/3
`27b-nvfp4`	3/3	3/3	3/3	1/3
`35b-a3b-4bit`	3/3	3/3	3/3	0/3
`35b-a3b-mxfp4`	3/3	3/3	3/3	0/3
`35b-a3b-nvfp4`	3/3	3/3	3/3	0/3

Task04 Failure Analysis

Task04 asks the agent to implement a fixed-window rate limiter. Public tests allowed some rolling-window implementations to pass. Held-out validation tests caught the boundary bug.

The wrong implementation pattern was:

resetAt = nowMs + windowMs

The fixed-window boundary should be computed from the window bucket:

resetAt = Math.floor(nowMs / windowMs) * windowMs + windowMs

This is exactly the kind of bug held-out validation tests are supposed to expose. Public tests checked enough behavior for development feedback, but not enough to distinguish rolling-window semantics from fixed-window semantics.
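The difference between the two formulas is easiest to see in a small limiter. The sketch below is a minimal single-key fixed-window limiter under assumed names (FixedWindowLimiter, check, Decision); the actual Task04 interface is not reproduced here.

```typescript
interface Decision {
  allowed: boolean;
  resetAt: number; // timestamp (ms) at which the current window ends
}

// Rolling-style boundary: the window is anchored at the request time (the bug).
function rollingResetAt(nowMs: number, windowMs: number): number {
  return nowMs + windowMs;
}

// Fixed-window boundary: windows are aligned to multiples of windowMs.
function fixedResetAt(nowMs: number, windowMs: number): number {
  return Math.floor(nowMs / windowMs) * windowMs + windowMs;
}

// Hypothetical single-key limiter using the fixed-window boundary.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowEnd = 0;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  check(nowMs: number): Decision {
    const resetAt = fixedResetAt(nowMs, this.windowMs);
    if (resetAt !== this.windowEnd) {
      // A new aligned window has started: reset the counter.
      this.windowEnd = resetAt;
      this.count = 0;
    }
    if (this.count >= this.limit) return { allowed: false, resetAt };
    this.count += 1;
    return { allowed: true, resetAt };
  }
}
```

With a 1000 ms window, a request at t = 900 has a fixed-window reset at 1000, while the rolling formula yields 1900. A test that fills the quota at t = 900 and sends another request at t = 1000 therefore separates the two behaviors.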

In this benchmark, the 27B variants solved Task04 one to two times out of three. The 35B A3B variants solved it zero times out of three. That is a useful observation, but it is still only one task. I include it to explain the pass counts, not to rank the models.

Limitations

The task count is small: four TypeScript tasks, with one discriminator task.
Three repeats are better than one, but these results are still preliminary.
Some run diffs include package-lock.json noise from npm install.
Tool parsing is best-effort and model-dependent.
The benchmark uses held-out validation, but it is not designed to support broad quality claims.
Artifacts must be referenced by repository URL and revision; local directory names are not enough.
These results should not be used to claim that MXFP4 is generally better than NVFP4.
KV-cache quantization was intentionally excluded from the main comparison.

Reproducibility

The benchmark code, tasks, protocol, and generated summaries are public in local-agent-bench. The proxy source is anthropic-api-to-mlx, and the benchmark proxy commit was ed74457.

The most relevant files are BENCH_PROTOCOL.md and results/PRIMARY_MATRIX_20260504.md:

git clone https://github.com/audexdev/local-agent-bench
cd local-agent-bench
cat BENCH_PROTOCOL.md
cat results/PRIMARY_MATRIX_20260504.md

FAQ

Can Claude Code run locally with MLX?

Claude Code can use a local MLX-backed model through an Anthropic-compatible proxy. This benchmark used anthropic-api-to-mlx to map Claude Code requests to mlx_lm and Qwen3.6.

Was Qwen3.6 35B-A3B faster than dense 27B?

Yes, in this controlled TypeScript benchmark on an M5 Max. The 35B-A3B variants averaged about 30-37 seconds per run, while the dense 27B variants averaged about 120-146 seconds.

Is this a model-quality benchmark?

No. The benchmark is a local-agent throughput and usability note. It includes pass counts for context, but it is not a definitive model-quality leaderboard.

Did the main benchmark use KV-cache quantization?

No. KV-cache quantization was intentionally excluded from the main matrix. The main runs used MLX model quantizations only.

What made prompt caching important?

Claude Code sends long, overlapping conversations across turns. Exact token-prefix prompt caching avoids repeated full-prefill work and makes the local agent loop much more usable.

Conclusion

This benchmark answers a narrow practical question: can a local Claude Code-compatible loop run comfortably through MLX on an M5 Max? In this setup, yes, as long as prompt caching is working.

The main thing I would take from the numbers is speed. Qwen3.6 35B A3B was much faster than dense 27B in this local MLX agent workflow, and the difference was large enough to change how usable the loop felt. That is the result I am comfortable publishing from this run.