Local AI Benchmarks: M4 Mac Mini vs Cloud APIs

The headline numbers

GPT-4o is faster. But it's not free.

Generation speed

22 tok/s

Llama 3.1 8B, local

Prompt ingestion

172 tok/s

Context loads instantly

Cost per token

£0

Unlimited usage included

GPT-4o streams at 137–148 tok/s — roughly 7× faster than local models. But local models generate at 19–22 tok/s, which is fast enough for real-time code assistance, autocomplete, and chat. The difference matters less than you'd think when you're reading and editing the output anyway.

Qwen 3 includes a built-in reasoning mode that thinks through problems before answering. The M4's unified memory means models load once and stay resident — no cold starts, no throttling, no API bills.

Test 1 — General reasoning

Binary search: explain and implement.

A classic prompt that tests code generation and explanation quality. Each model was asked to write a Python binary search function and explain each step.

Model	Prompt	Generation	Tokens	Time
Qwen 3 8Blocal	137.6 tok/s	19.5 tok/s	1,509	1m 19s
Qwen 2.5 7Blocal	101.7 tok/s	22.3 tok/s	788	35s
Llama 3.1 8Blocal	58.1 tok/s	21.0 tok/s	365	38s
GPT-4oapi	n/a	136.6 tok/s	498	5.2s

Test 2 — Real-world coding

Build a full REST API from scratch.

A production-grade coding task: write a complete Express.js REST API with five CRUD endpoints, Zod input validation, error handling middleware, and TypeScript types. This is the kind of task developers use AI for daily.

Model	Prompt	Generation	Tokens	Time
Qwen 3 8Blocal	127.8 tok/s	19.4 tok/s	2,048	1m 46s
Qwen 2.5 7Blocal	172.1 tok/s	22.3 tok/s	1,399	1m 3s
Llama 3.1 8Blocal	173.4 tok/s	20.8 tok/s	1,335	1m 5s
GPT-4oapi	n/a	148.4 tok/s	1,392	10.6s

GPT-4o is clearly faster — 148 tok/s vs ~20 tok/s locally. It also produces higher quality output for complex tasks. But the local models held their own: Qwen 3 8B reasoned through the problem and generated a full TypeScript API with Zod validation, and Llama 3.1 8B delivered a complete multi-file REST API with service layer, controllers, and error handling. Both produced valid, runnable code.

The trade-off is cost and privacy. At $10 per million output tokens, running AI agents 24/7 means $100–300+/mo in API costs — and every line of code hits OpenAI's servers. Locally? Unlimited, private, and already paid for.

The cost comparison

API bills add up. Your Mac doesn't charge per token.

Cloud API route

GPT-4o: $2.50/M input + $10/M output
24/7 agent usage: $100–300+/mo in tokens
Code sent to external servers
Rate limits under load
7× faster, but per-token billing

Halfpenny Mac Pro

£89/mo — everything included
Unlimited requests, no token billing
19–22 tok/s on Qwen 3 / Llama 3.1 / Qwen 2.5
100% private — nothing leaves the machine
No rate limits, no throttling
Dedicated M4 with 16GB unified memory

Cloud APIs are faster and smarter. That's the honest truth. But for developers who use AI throughout the day — autocomplete, inline chat, boilerplate, docs — local models handle the bulk of the work at zero marginal cost. And your proprietary code never touches a third-party API.

For developers

One command to your own AI coding assistant.

Every Halfpenny Mac comes with Ollama pre-installable in one command. Pull any open-source model — Qwen, Llama, DeepSeek Coder, Mistral, Code Llama — and connect it to your IDE.

Continue.dev — open-source AI coding assistant that connects to any Ollama model
Aider — AI pair programming in your terminal, works with local models
Open WebUI — ChatGPT-style interface for your local models
Any OpenAI-compatible tool — Ollama exposes a standard API on localhost

Setup takes five minutes. Install Ollama, pull a model, point your IDE at localhost:11434. No API keys, no account creation, no billing dashboards.

Test environment

How we ran these benchmarks.

All local benchmarks were run on a single Halfpenny Mac Pro-tier machine (mini-02-524) with the following specs:

Chip: Apple M4 (10-core CPU, 10-core GPU)
Memory: 16GB unified (shared between CPU and GPU)
Storage: 512GB SSD
Runtime: Ollama on macOS Sequoia
Models: Qwen 3 8B (4.9GB), Llama 3.1 8B (4.9GB), Qwen 2.5 7B (4.7GB)

API benchmarks used the OpenAI streaming API for GPT-4o. All tests used the same prompt and were run on 28 May 2026.

22 tokens per second.
Locally. Privately. Unlimited.