GPT-4o is faster. But it's not free.

Generation speed
22 tok/s
Llama 3.1 8B, local
Prompt ingestion
172 tok/s
Context loads instantly
Cost per token
£0
Unlimited usage included

GPT-4o streams at 137–148 tok/s — roughly 7× faster than local models. But local models generate at 19–22 tok/s, which is fast enough for real-time code assistance, autocomplete, and chat. The difference matters less than you'd think when you're reading and editing the output anyway.

Qwen 3 includes a built-in reasoning mode that thinks through problems before answering. The M4's unified memory means models load once and stay resident — no cold starts, no throttling, no API bills.

Binary search: explain and implement.

A classic prompt that tests code generation and explanation quality. Each model was asked to write a Python binary search function and explain each step.

Model Prompt Generation Tokens Time
Qwen 3 8Blocal 137.6 tok/s 19.5 tok/s 1,509 1m 19s
Qwen 2.5 7Blocal 101.7 tok/s 22.3 tok/s 788 35s
Llama 3.1 8Blocal 58.1 tok/s 21.0 tok/s 365 38s
GPT-4oapi n/a 136.6 tok/s 498 5.2s

Build a full REST API from scratch.

A production-grade coding task: write a complete Express.js REST API with five CRUD endpoints, Zod input validation, error handling middleware, and TypeScript types. This is the kind of task developers use AI for daily.

Model Prompt Generation Tokens Time
Qwen 3 8Blocal 127.8 tok/s 19.4 tok/s 2,048 1m 46s
Qwen 2.5 7Blocal 172.1 tok/s 22.3 tok/s 1,399 1m 3s
Llama 3.1 8Blocal 173.4 tok/s 20.8 tok/s 1,335 1m 5s
GPT-4oapi n/a 148.4 tok/s 1,392 10.6s

GPT-4o is clearly faster — 148 tok/s vs ~20 tok/s locally. It also produces higher quality output for complex tasks. But the local models held their own: Qwen 3 8B reasoned through the problem and generated a full TypeScript API with Zod validation, and Llama 3.1 8B delivered a complete multi-file REST API with service layer, controllers, and error handling. Both produced valid, runnable code.

The trade-off is cost and privacy. At $10 per million output tokens, running AI agents 24/7 means $100–300+/mo in API costs — and every line of code hits OpenAI's servers. Locally? Unlimited, private, and already paid for.

API bills add up. Your Mac doesn't charge per token.

Cloud API route

  • GPT-4o: $2.50/M input + $10/M output
  • 24/7 agent usage: $100–300+/mo in tokens
  • Code sent to external servers
  • Rate limits under load
  • 7× faster, but per-token billing

Cloud APIs are faster and smarter. That's the honest truth. But for developers who use AI throughout the day — autocomplete, inline chat, boilerplate, docs — local models handle the bulk of the work at zero marginal cost. And your proprietary code never touches a third-party API.

One command to your own AI coding assistant.

Every Halfpenny Mac comes with Ollama pre-installable in one command. Pull any open-source model — Qwen, Llama, DeepSeek Coder, Mistral, Code Llama — and connect it to your IDE.

  • Continue.dev — open-source AI coding assistant that connects to any Ollama model
  • Aider — AI pair programming in your terminal, works with local models
  • Open WebUI — ChatGPT-style interface for your local models
  • Any OpenAI-compatible tool — Ollama exposes a standard API on localhost

Setup takes five minutes. Install Ollama, pull a model, point your IDE at localhost:11434. No API keys, no account creation, no billing dashboards.

How we ran these benchmarks.

All local benchmarks were run on a single Halfpenny Mac Pro-tier machine (mini-02-524) with the following specs:

  • Chip: Apple M4 (10-core CPU, 10-core GPU)
  • Memory: 16GB unified (shared between CPU and GPU)
  • Storage: 512GB SSD
  • Runtime: Ollama on macOS Sequoia
  • Models: Qwen 3 8B (4.9GB), Llama 3.1 8B (4.9GB), Qwen 2.5 7B (4.7GB)

API benchmarks used the OpenAI streaming API for GPT-4o. All tests used the same prompt and were run on 28 May 2026.

Run AI locally for £89/mo.

No API bills. No rate limits. No data leaving your machine. Just a dedicated M4 Mac mini running your models, 24/7.

Sign up See all tiers