GPT-4o is faster. But it's not free.
GPT-4o streams at 137–148 tok/s — roughly 7× faster than local models. But local models generate at 19–22 tok/s, which is fast enough for real-time code assistance, autocomplete, and chat. The difference matters less than you'd think when you're reading and editing the output anyway.
Qwen 3 includes a built-in reasoning mode that thinks through problems before answering. The M4's unified memory means models load once and stay resident — no cold starts, no throttling, no API bills.
Binary search: explain and implement.
A classic prompt that tests code generation and explanation quality. Each model was asked to write a Python binary search function and explain each step.
| Model | Prompt | Generation | Tokens | Time |
|---|---|---|---|---|
| Qwen 3 8Blocal | 137.6 tok/s | 19.5 tok/s | 1,509 | 1m 19s |
| Qwen 2.5 7Blocal | 101.7 tok/s | 22.3 tok/s | 788 | 35s |
| Llama 3.1 8Blocal | 58.1 tok/s | 21.0 tok/s | 365 | 38s |
| GPT-4oapi | n/a | 136.6 tok/s | 498 | 5.2s |
Build a full REST API from scratch.
A production-grade coding task: write a complete Express.js REST API with five CRUD endpoints, Zod input validation, error handling middleware, and TypeScript types. This is the kind of task developers use AI for daily.
| Model | Prompt | Generation | Tokens | Time |
|---|---|---|---|---|
| Qwen 3 8Blocal | 127.8 tok/s | 19.4 tok/s | 2,048 | 1m 46s |
| Qwen 2.5 7Blocal | 172.1 tok/s | 22.3 tok/s | 1,399 | 1m 3s |
| Llama 3.1 8Blocal | 173.4 tok/s | 20.8 tok/s | 1,335 | 1m 5s |
| GPT-4oapi | n/a | 148.4 tok/s | 1,392 | 10.6s |
GPT-4o is clearly faster — 148 tok/s vs ~20 tok/s locally. It also produces higher quality output for complex tasks. But the local models held their own: Qwen 3 8B reasoned through the problem and generated a full TypeScript API with Zod validation, and Llama 3.1 8B delivered a complete multi-file REST API with service layer, controllers, and error handling. Both produced valid, runnable code.
The trade-off is cost and privacy. At $10 per million output tokens, running AI agents 24/7 means $100–300+/mo in API costs — and every line of code hits OpenAI's servers. Locally? Unlimited, private, and already paid for.
API bills add up. Your Mac doesn't charge per token.
Cloud API route
- GPT-4o: $2.50/M input + $10/M output
- 24/7 agent usage: $100–300+/mo in tokens
- Code sent to external servers
- Rate limits under load
- 7× faster, but per-token billing
Halfpenny Mac Pro
- £89/mo — everything included
- Unlimited requests, no token billing
- 19–22 tok/s on Qwen 3 / Llama 3.1 / Qwen 2.5
- 100% private — nothing leaves the machine
- No rate limits, no throttling
- Dedicated M4 with 16GB unified memory
Cloud APIs are faster and smarter. That's the honest truth. But for developers who use AI throughout the day — autocomplete, inline chat, boilerplate, docs — local models handle the bulk of the work at zero marginal cost. And your proprietary code never touches a third-party API.
One command to your own AI coding assistant.
Every Halfpenny Mac comes with Ollama pre-installable in one command. Pull any open-source model — Qwen, Llama, DeepSeek Coder, Mistral, Code Llama — and connect it to your IDE.
- Continue.dev — open-source AI coding assistant that connects to any Ollama model
- Aider — AI pair programming in your terminal, works with local models
- Open WebUI — ChatGPT-style interface for your local models
- Any OpenAI-compatible tool — Ollama exposes a standard API on localhost
Setup takes five minutes. Install Ollama, pull a model, point your IDE at localhost:11434. No API keys, no account creation, no billing dashboards.
How we ran these benchmarks.
All local benchmarks were run on a single Halfpenny Mac Pro-tier machine (mini-02-524) with the following specs:
- Chip: Apple M4 (10-core CPU, 10-core GPU)
- Memory: 16GB unified (shared between CPU and GPU)
- Storage: 512GB SSD
- Runtime: Ollama on macOS Sequoia
- Models: Qwen 3 8B (4.9GB), Llama 3.1 8B (4.9GB), Qwen 2.5 7B (4.7GB)
API benchmarks used the OpenAI streaming API for GPT-4o. All tests used the same prompt and were run on 28 May 2026.
Run AI locally for £89/mo.
No API bills. No rate limits. No data leaving your machine. Just a dedicated M4 Mac mini running your models, 24/7.