Mac LLM Optimization — The Practical Playbook
What this is: The hands-on optimization guide for running LLMs locally on Apple Silicon Macs. Specific flags, specific numbers, specific anti-patterns. Focused on the failure mode that kills user experience: blowing past unified memory and forcing the kernel into swap.
Why it matters: A 16 GB MacBook Air is the median Locara target. On that machine, every choice — quantization, KV cache type, context length, thread count, model unloading policy — is the difference between “feels native” and “beachball.” The Mac Studio Ultra user has headroom for sloppy choices; the Air user does not, and the Air user is most of the market.
Most relevant to Locara: The practical end of the stack. Pairs with llm-memory-math.md (why the formulas drive these choices), macos-memory-management.md (the OS-level memory primitives this note depends on), mac-hardware-lineup.md (the per-SKU numbers to tune against), mlx.md (the framework), and ollama.md / lm-studio.md (alternative runtimes).
Numbers in this note are ballpark estimates synthesized from community reporting (r/LocalLLaMA, AlexZiskind YouTube benchmarks, MLX team posts). Treat them as order-of-magnitude, not gospel. The math in llm-memory-math.md gives you predictions for your own configurations.
1. Loading models efficiently
mmap is the foundation
llama.cpp loads GGUF files via mmap(2) by default. This is the single most important memory optimization for LLM apps on Mac. When a file is mmap’d:
- Weight pages are demand-paged from disk; only pages actually touched become resident.
- Those pages are clean, file-backed — the kernel can evict them under pressure without writing to swap (just re-reads from the GGUF file).
- They don’t count against the process’s dirty footprint the way malloc’d memory does.
top will under-report; footprint <pid> and Activity Monitor’s pressure gauge are the honest views.
Georgi Gerganov’s design notes in the llama.cpp repo make this explicit. The flag --no-mmap exists for systems where mmap is slower (some NFS-mounted models, or when you want pages locked). On Mac with an SSD, you almost never want --no-mmap.
The related flag --mlock calls mlock(2) to wire the pages so they cannot be evicted. Avoid --mlock on memory-constrained Macs — it forces weights to be non-evictable, which is exactly the over-commitment you want to prevent. --mlock only makes sense on a Mac Studio with 128+ GB where you have headroom and want to guarantee no swap stutter.
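To make the demand-paging behaviour concrete, here is a minimal Python sketch using the standard-library mmap module. It is not how llama.cpp loads GGUF (that is native code calling mmap(2) directly); the helper name read_slice_via_mmap, the temporary file, and the sizes are all hypothetical stand-ins for a real weights file.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map a file read-only and copy out one slice.

    Only the pages backing [offset, offset + length) need to be faulted in;
    the rest of the file stays on disk until something touches it.
    """
    with open(path, "rb") as f:
        # length=0 maps the whole file; ACCESS_READ keeps the mapping
        # read-only, so its pages stay clean and file-backed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Hypothetical stand-in for a weights file: 8 MiB of zeros with a marker.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (8 * 1024 * 1024))
    tmp.seek(4 * 1024 * 1024)
    tmp.write(b"TENSOR")
    demo_path = tmp.name

chunk = read_slice_via_mmap(demo_path, 4 * 1024 * 1024, 6)
os.remove(demo_path)
```

Because the mapping is read-only and file-backed, the kernel can drop those pages under pressure and re-read them later, which is the property the bullets above rely on.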
MLX uses mmap too
MLX’s weight loader (mlx.core.load, mlx_lm.utils.load) memory-maps .safetensors and .npz weight files. Awni Hannun has discussed this in the context of MLX’s lazy execution model — weights aren’t materialized into GPU memory until a computation references them, and on UMA hardware the same physical page is visible to CPU and GPU.
Source in mlx/io/safetensors.cpp and mlx/io/load.cpp in the ml-explore/mlx repo. The safetensors format is specifically designed for mmap (header + contiguous tensor blob, so each tensor is a file slice).
Combined with mmap, MLX’s lazy execution means actual residency grows incrementally during the first forward pass — the top/Activity Monitor footprint will climb for several seconds after “model loaded” before stabilizing.
GPU layer offload (-ngl) — different meaning on UMA
In llama.cpp, -ngl N (or --n-gpu-layers N) controls how many transformer layers run on the GPU vs CPU. On discrete-GPU systems this is a tradeoff between PCIe transfer cost and VRAM capacity.
On Apple Silicon, it’s fundamentally different — UMA means no host-to-device copy, and the Metal backend gets a direct view of the same memory:
- -ngl 999 (or -ngl -1 in newer builds), which offloads everything to GPU, is almost always correct on Mac. No transfer cost penalty.
- Partial offload (-ngl 20 etc.) is mainly useful when you want headroom for other Metal workloads (UI compositor, separate Metal app).
- The GPU memory budget on Apple Silicon is governed by iogpu.wired_limit_mb, which by default caps GPU-wired allocations at ~60–75% of total RAM. On a 64 GB Mac, ~48 GB to Metal. Override (unsupported): sudo sysctl iogpu.wired_limit_mb=N. Pre-Sonoma: debug.iogpu.wired_limit in bytes.
Lazy / streaming layer loading
The hypothetical “load only embedding + final layer, stream other layers from disk per token” is largely subsumed by mmap. With mmap, the OS already streams layers on demand and eviction is automatic.
True per-token streaming is punishingly slow: token-by-token decode needs every layer for every token. Streaming at SSD speeds (~5 GB/s) instead of memory speeds (~400 GB/s on M3 Max, ~800 GB/s on M3 Ultra) would be ~100× slower. The only viable use case is prefill of extremely long context where you can amortize disk read across many tokens of compute. Not common in production.
2. KV cache management
The KV cache is the runtime memory cost — scales with context length and grows during generation. For long contexts it can exceed the model weights themselves (see llm-memory-math.md for the formulas).
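As a rough illustration of how context length and cache type drive this cost, the sketch below uses the usual accounting (K and V, per layer, per KV head, per position). The model shape is an assumption for a 7B-class model with grouped-query attention; read the real values from the model's config before trusting any output. llm-memory-math.md remains the reference for the formulas.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_element):
    """Bytes for K and V across all layers at a full context of n_ctx tokens."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * bits_per_element / 8


# Assumed shape for a 7B-class GQA model (hypothetical; check config.json).
SHAPE = dict(n_layers=28, n_kv_heads=4, head_dim=128)

# f16 stores 16 bits per element; q8_0 stores 32 int8 values plus one f16
# scale per block, i.e. 8.5 bits per element.
for name, bits in (("f16", 16.0), ("q8_0", 8.5)):
    for n_ctx in (8192, 32768):
        gib = kv_cache_bytes(n_ctx=n_ctx, bits_per_element=bits, **SHAPE) / 2**30
        print(f"{name:5s} n_ctx={n_ctx:6d}: {gib:.2f} GiB")
```

A model without grouped-query attention (n_kv_heads equal to the full head count) multiplies these figures several times over, which is how the cache can overtake the weights at long context.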
llama.cpp KV controls
- -c <N> (or --ctx-size): cap context. The single most effective lever for the user’s RAM budget. Use 4096 or 8192 unless you truly need more.
- --cache-type-k <type> and --cache-type-v <type>: quantize the KV cache. Options: f32, f16 (default), bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1. q8_0 halves cache size with negligible quality loss. q4_0 quarters it but can degrade long-context quality.
- -fa (FlashAttention): required as of recent builds for many KV quantization types. Improves prefill speed substantially.
- --keep <N>: on context shift, retain the first N tokens (typically the system prompt) rather than dropping them.
- --prompt-cache <path> and --prompt-cache-all: save the KV state to disk so reloading a long shared prefix is fast. Crucial for “I have a 50K-token system prompt” workflows.
Sensible default for Mac (8 GB chat budget, 7B model):
-ngl 999 -t <num_perf_cores> -c 8192 -fa --cache-type-k q8_0 --cache-type-v q8_0
MLX KV controls
In mlx-lm (the ml-explore/mlx-lm repo):
- mlx_lm.generate(model, tokenizer, prompt, max_tokens=N, prompt_cache=cache): pass a prompt_cache to reuse across turns.
- mlx_lm.models.cache.KVCache and RotatingKVCache: the latter implements a sliding window so cache size is bounded.
- mlx_lm.models.cache.QuantizedKVCache: quantized KV. In recent releases this is typically requested by passing kv_bits=4, kv_group_size=64 to the generation call, while make_prompt_cache(model, max_kv_size=N) builds the cache objects themselves; keyword names are version-dependent, so check your installed version.
- Server: mlx_lm.server --cache-limit-gb N to cap process-wide cache memory.
The MLX team is actively iterating on this; exact API is version-dependent (check mlx-lm release notes). As of mlx-lm 0.20+, the standard pattern:
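from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Load a 4-bit community conversion (downloaded on first use).
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
# One cache object per conversation; it accumulates KV across calls.
cache = make_prompt_cache(model)
prompt = "Summarize unified memory in one sentence."  # placeholder prompt
out = generate(model, tokenizer, prompt, max_tokens=512, prompt_cache=cache)
# cache now holds the KV; pass it again for the next turn.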
Sliding window attention
Mistral 7B v0.1 introduced native SWA with a 4096-token window even when n_ctx was larger. Bounds KV cache regardless of how long context grows.
Note: Mistral v0.2 and Mixtral removed SWA in favor of larger native windows. Llama and Qwen do not use SWA. Gemma 2 uses interleaved SWA (alternating local and global attention). Read each model’s card — attention shape varies.
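The bounded-cache property of a sliding window can be shown with a toy structure. This is a sketch of the idea only, not the implementation in llama.cpp or in mlx-lm's RotatingKVCache; the class name and the tuple placeholders standing in for tensors are hypothetical.

```python
from collections import deque


class SlidingWindowCache:
    """Toy KV store that keeps only the most recent `window` positions."""

    def __init__(self, window: int):
        self.window = window
        self._entries = deque(maxlen=window)  # oldest entry drops automatically
        self.offset = 0  # total tokens seen, including evicted ones

    def append(self, kv_entry) -> None:
        self._entries.append(kv_entry)
        self.offset += 1

    def visible_positions(self) -> range:
        """Absolute positions that attention can still see."""
        return range(self.offset - len(self._entries), self.offset)


cache = SlidingWindowCache(window=4096)
for position in range(10_000):
    cache.append(("k", "v", position))  # placeholder for real tensors
```

After 10,000 tokens the structure still holds 4,096 entries, which is why memory stays flat however long the conversation runs.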
Context shifting
When the user’s conversation exceeds n_ctx:
- Truncate: drop oldest tokens. Simple; loses early context (combine with --keep to preserve system prompt).
- Shift: re-encode after dropping a chunk. llama.cpp supports this natively when n_past >= n_ctx.
- Summarize: have the model summarize old turns into a compact form, then re-inject. Higher quality, costs an extra forward pass.
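The truncate-with-keep strategy can be sketched on a plain token list. The "discard half of what is not protected" rule below is an assumption modelled on the heuristic llama.cpp's example programs have used; a real engine also has to fix up positions in the KV cache, which this sketch ignores.

```python
def shift_context(tokens: list[int], n_ctx: int, n_keep: int) -> list[int]:
    """Drop the older half of the non-kept tokens once the window is full."""
    if len(tokens) < n_ctx:
        return tokens  # still room, nothing to do
    n_left = len(tokens) - n_keep
    n_discard = n_left // 2
    # Keep the protected prefix, skip n_discard tokens, keep the remainder.
    return tokens[:n_keep] + tokens[n_keep + n_discard:]


history = list(range(16))  # pretend token ids 0..15 with n_ctx = 16
shifted = shift_context(history, n_ctx=16, n_keep=4)
```

Here the first four tokens (the stand-in system prompt) survive, the six oldest conversation tokens are dropped, and six slots are free for new generation.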
3. Quantization choice for Mac
The community consensus
- Q4_K_M (~4.5 bits/weight average): best capacity-to-quality ratio for memory-constrained machines. 7B → ~4.4 GB; 70B → ~42 GB.
- Q5_K_M (~5.5 bits): quality bump, ~25% more memory.
- Q6_K (~6.5 bits): near-FP16 quality; 7B → ~5.5 GB, 70B → ~56 GB.
- Q8_0 (~8.5 bits): indistinguishable from FP16, doubles memory vs Q4_K_M.
Naming convention: Q<bits>_K_<S/M/L>. K = k-quants (better than legacy Q4_0/Q4_1). S/M/L control how much higher-precision storage is used for the most sensitive weights (output/attention layers).
For a 16 GB Mac running a 7B chat model: Q4_K_M leaves enough headroom for 8K context + OS + other apps.
For a 36 GB MacBook Pro: Q6_K of a 7B model + Q4_K_M of a 14B is a reasonable two-model setup.
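A quick way to sanity-check a quantization choice against a budget is parameters times bits per weight. The sketch below uses the approximate bits-per-weight averages quoted above; the 2 GB overhead allowance for KV cache and runtime buffers is a hypothetical placeholder, and real GGUF files tend to run somewhat larger than this estimate because nominal "7B" models often have more than 7 billion parameters.

```python
def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def fits(n_params_billion, bits_per_weight, budget_gb, overhead_gb=2.0):
    """True if weights plus a fixed allowance for KV cache and runtime fit."""
    return weights_gb(n_params_billion, bits_per_weight) + overhead_gb <= budget_gb


# Approximate average bits per weight, as quoted in the list above.
BPW = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q6_K": 6.5, "Q8_0": 8.5}

for quant, bpw in BPW.items():
    print(f"{quant}: 7B ~ {weights_gb(7, bpw):.1f} GB, "
          f"fits an 8 GB budget: {fits(7, bpw, budget_gb=8)}")
```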
MLX quantization
The mlx-community org on Hugging Face publishes MLX-quantized models at 4-bit, 6-bit, 8-bit. The typical conversion command:
mlx_lm.convert --hf-path <model> --q-bits 4 --q-group-size 64
MLX quants tend to be slightly smaller than equivalent GGUF (no per-tensor metadata overhead) and load slightly faster on Apple Silicon because the kernels are tuned for Metal. Awni Hannun’s benchmarks show MLX 4-bit roughly matches GGUF Q4_K_M on MMLU/HellaSwag, within noise.
Speculative decoding
Small “draft” model proposes K tokens; large “verifier” checks them in a single forward pass. On Mac this works particularly well:
- Draft model (Qwen 2.5 0.5B or 1.5B) runs fast.
- Memory bandwidth — not compute — bottlenecks the large model, so amortizing across K tokens is a big win.
Cost: both models resident. 7B + 0.5B at Q4 = ~4.4 + 0.4 = 4.8 GB weights. Worth it for 1.5–2× throughput on long generations.
llama.cpp: llama-speculative binary or -md <draft-model> --draft N flags in newer main builds. MLX: mlx_lm.generate(draft_model=...). Simon Willison’s “Run LLMs on macOS using llm-mlx” post (2025-02-15 on simonwillison.net) walks through practical setup.
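To see where a 1.5–2× figure can come from, the sketch below applies the standard expected-acceptance formula for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which real workloads only approximate; the acceptance rate and the relative draft cost used in the usage note are hypothetical inputs, not measurements.

```python
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per verifier pass with draft length k.

    Assumes each drafted token is accepted independently with probability
    accept_rate; the verifier always contributes one token of its own.
    """
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding.

    draft_cost is the time of one draft step divided by the time of one
    verifier step (roughly the weight-size ratio when bandwidth-bound).
    """
    return expected_tokens_per_pass(accept_rate, k) / (k * draft_cost + 1)
```

With an assumed 70% acceptance rate, K = 4, and a draft step costing a tenth of a verifier step, estimated_speedup(0.7, 4, 0.1) comes out just under 2×. Lower acceptance rates shrink the gain quickly.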
4. Metal / GPU memory residency
Storage modes
MTLResourceStorageMode on Apple Silicon:
- .shared: CPU and GPU access the same physical memory. The right default on UMA. No copy, no synchronization beyond fence/event ordering.
- .private: GPU-only. Driver may put it in a different physical region with different cache modes; on Apple Silicon there’s no separate VRAM but caching behavior changes.
- .managed: macOS-only legacy for discrete GPUs (Intel Macs). N/A on Apple Silicon.
- .memoryless: tile memory for render passes. Not relevant for LLM inference.
For LLM tensors, .shared lets you read from CPU (to extract logits for sampling) without a blit. Sampling is usually CPU-side because its branching is awkward in Metal shaders.
MTLHeap for pooling
MTLHeap sub-allocates from a pre-reserved chunk. Useful for KV cache pages where you allocate/free frequently. llama.cpp’s Metal backend (ggml-metal.m) uses a buffer pool internally; MLX has its own pool (mlx/backend/metal/allocator.cpp).
The win: avoiding newBuffer calls (slow) and reducing fragmentation. Not something most app developers touch directly — relies on the inference engine’s allocator.
Purgeable state
MTLResource.setPurgeableState(.volatile) tells the kernel “OK to discard under pressure; I can rebuild.” Right state for:
- KV cache pages from old conversation turns already streamed to disk.
- Embedding caches.
After marking volatile, always check the return value of setPurgeableState(.nonVolatile) before reuse — if the kernel reclaimed it, you get .empty and must regenerate.
recommendedMaxWorkingSetSize
Apple’s exposed ceiling per MTLDevice.recommendedMaxWorkingSetSize. Stay under this for consistent GPU performance. Above it, the OS may swap GPU resources, killing throughput. Override (unsupported): sudo sysctl iogpu.wired_limit_mb=N on Sonoma+ (resets on reboot).
5. Threading, GCD, and concurrency
What’s actually parallel
Per-token decode is dominated by GEMV (matrix-vector) over model weights, executed on the GPU. Not CPU-parallel — a single big Metal command stream. CPU parallelism matters for:
- Prompt tokenization (BPE merging — fastest single-core but parallelizable for very long prompts).
- Sampling (top-k/top-p with logit biasing).
- Post-processing (streaming to UI, JSON parsing for tool use).
- Prefetching the next message’s embeddings.
- Encoding new turns into the KV cache (multi-token prefill — mostly GPU on Mac, with CPU coordination).
QoS classes
GCD QoS on Apple Silicon maps to P-core vs E-core scheduling:
- .userInteractive / .userInitiated: P-cores preferentially.
- .utility: mixed.
- .background: E-cores; can be throttled significantly.
For LLM inference:
- Inference thread (calls into llama.cpp or MLX): .userInitiated.
- Background prefetch / cache warming: .utility (not .background, which can stall for tens of seconds).
- UI updates: .userInteractive on main queue.
Swift Concurrency: Task.detached(priority: .userInitiated). ObjC / C: pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0).
Thread count (llama.cpp -t)
llama.cpp’s -t sets CPU thread count. On Apple Silicon with -ngl 999 (everything on GPU), CPU threads do little. Rule of thumb: -t <num_perf_cores>:
- M2 / M3 / M4 (base): 4 P-cores → -t 4
- M2 Pro: 6 or 8 P-cores → -t 6 or -t 8
- M3 Pro: 5 or 6 P-cores → -t 5 or -t 6
- M2 Max: 8 P-cores → -t 8
- M3 Max / M4 Max: 10 or 12 P-cores → -t 10 or -t 12
- M2 Ultra: 16 P-cores → -t 16
- M3 Ultra: 20 or 24 P-cores → -t 20 or -t 24
Check exact P-core count: sysctl hw.perflevel0.physicalcpu.
Going higher than P-core count hurts — E-cores are slower per-core and synchronization overhead in the threadpool dominates. The default -t -1 (use all cores) is wrong on Mac. Always explicitly set -t.
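A launcher can read the P-core count at startup instead of hard-coding it per SKU. The sketch below shells out to the sysctl key mentioned above; the fallback of half the logical cores on machines where that key is missing is an arbitrary assumption, not a tuned value, and the function name is hypothetical.

```python
import os
import subprocess


def perf_core_count() -> int:
    """Return the P-core count on Apple Silicon, else a conservative fallback."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True, timeout=2,
        ).stdout.strip()
        return max(1, int(out))
    except (OSError, subprocess.SubprocessError, ValueError):
        # Not macOS, not Apple Silicon, or sysctl unavailable: fall back to
        # half the logical cores (an assumption, not a measurement).
        return max(1, (os.cpu_count() or 2) // 2)


# Arguments to append to a llama.cpp command line.
llama_thread_args = ["-t", str(perf_core_count())]
```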
References: WWDC 2021 #10254 “Tune CPU job scheduling for Apple silicon Macs”; WWDC 2022 “Eliminate data races using Swift Concurrency.”
6. Power and thermals
Throttling profile
- MacBook Air (M1/M2/M3/M4): passively cooled. Sustained inference throttles, typically after 30–90 s at full GPU load. Gentle throttle (not a cliff) but tok/s can drop 20–40% from peak.
- MacBook Pro 14”/16”: active cooling. Sustains near-peak indefinitely on most workloads. 16” has better headroom.
- Mac mini / Mac Studio: active cooling, no battery constraints. Best sustained performance.
AlexZiskind on YouTube has the most rigorous thermal-throttle tests for various Macs running LLMs.
Low Power Mode
NSProcessInfo.processInfo.isLowPowerModeEnabled is true when the user enables Low Power Mode (System Settings → Battery). On Macs it’s a deliberate user choice — caps CPU/GPU to extend battery.
Should LLM apps respect it? Yes.
- Drop to a smaller model or shorter context.
- Disable speculative decoding (the draft adds load).
- Increase response time budget; warn the user.
Subscribe to NSProcessInfoPowerStateDidChange notification and adjust on the fly.
Thermal pressure
NSProcessInfo.processInfo.thermalState returns .nominal, .fair, .serious, or .critical. Subscribe to ProcessInfo.thermalStateDidChangeNotification. At .serious / .critical the system has already started throttling. Defensive moves:
- Pause non-essential background inference.
- Reduce streaming token frequency (less UI work).
- Shorter max_tokens.
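One way to keep the Low Power Mode and thermal responses in a single place is a pure function from system state to inference settings. The sketch below is a hypothetical policy: the field names, the token caps (512 and 256), and the string values standing in for the thermal states are illustrative assumptions. In a real app the inputs would come from ProcessInfo on the Swift side.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InferenceSettings:
    max_tokens: int = 1024
    speculative: bool = True
    background_inference: bool = True


def downshift(base: InferenceSettings, thermal_state: str,
              low_power: bool) -> InferenceSettings:
    """Apply the defensive moves from this section to a settings object."""
    settings = base
    if low_power:
        # The draft model adds load, so drop it and shorten responses.
        settings = replace(settings, speculative=False,
                           max_tokens=min(settings.max_tokens, 512))
    if thermal_state in ("serious", "critical"):
        # The system is already throttling: pause background work too.
        settings = replace(settings, background_inference=False,
                           max_tokens=min(settings.max_tokens, 256))
    return settings
```

Keeping the policy pure makes it easy to re-run whenever either notification fires and to unit-test without a Mac.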
ProMotion and refresh
ProMotion (variable 24–120 Hz) doesn’t directly affect inference, but streaming-token UI can be smoother by tying token-arrival to display refresh via CVDisplayLink (macOS — being deprecated in favor of NSView-level updates in macOS 15+).
7. Inference engine choice and configuration
MLX
The arXiv paper “Production-Grade Local LLM Inference on Apple Silicon” (arXiv:2511.05502) reports on identical hardware: MLX ~230 tok/s vs llama.cpp ~150 tok/s for sub-14B models. Treat as ballpark — methodology, model, and context length matter.
MLX advantages on Apple Silicon:
- Kernels written specifically for Metal (vs llama.cpp’s generic Metal backend).
- Tighter integration with Apple’s BNNS / Accelerate.
- Better fused attention kernels.
Server: mlx_lm.server --model <path> --port 8080 exposes an OpenAI-compatible API.
Caveats:
- Weaker quantization variety vs GGUF.
- Less mature for >70B models.
- No non-Apple hardware support (by design).
llama.cpp
The Swiss army knife. Best for:
- Wide model coverage.
- Many quantization formats (K-quants, IQ-quants).
- Tunable KV cache (quantized, FlashAttention, prompt cache files).
- Cross-platform (same model file Mac/Linux/Windows).
Recommended Mac flags:
-ngl 999 -t <num_perf_cores> -c 8192 -fa --cache-type-k q8_0 --cache-type-v q8_0
Build: cmake -B build -DGGML_METAL=ON (current) or make GGML_METAL=1 (older). brew install llama.cpp for pre-built binaries.
MLC LLM
Uses Apache TVM to compile model graphs. Strong on cross-platform deployment (iOS, Android, web via WebGPU). On Mac competitive but less commonly used because the toolchain is heavier. mlc-ai/mlc-llm.
Ollama
Wraps llama.cpp + a model registry + a daemon. Convenient for users; for developers adds:
- Process boundary (HTTP API).
- Model lifecycle management.
- Some overhead vs direct llama.cpp embedding.
Environment variables of note:
- OLLAMA_KEEP_ALIVE (default 5m): how long to keep a model loaded after last request. Set to 0 to unload immediately, or 24h to pin.
- OLLAMA_MAX_LOADED_MODELS (default 1 or 3 depending on version): cap concurrent models.
- OLLAMA_NUM_PARALLEL (default 1): parallel requests per model.
- Per-model: num_ctx, num_gpu, num_thread in the Modelfile or API request.
Best for “just works” model server. Worst for memory-tight scenarios — daemon has overhead and the default keep-alive is generous.
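For memory-tight use, the keep-alive and context cap can also be set per request rather than through environment variables. The sketch below only builds the JSON body for Ollama's generate endpoint and does not send it; the model tag and the 2-minute keep-alive are illustrative assumptions, and the helper name is hypothetical.

```python
import json


def ollama_generate_payload(model: str, prompt: str, *, num_ctx: int = 8192,
                            keep_alive: str = "2m") -> str:
    """Build a JSON body for POST /api/generate with memory-conscious settings."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": True,
        "keep_alive": keep_alive,         # per-request override of the keep-alive
        "options": {"num_ctx": num_ctx},  # per-request context cap
    }
    return json.dumps(body)


payload = ollama_generate_payload("qwen2.5:7b", "Hello")
# Send with any HTTP client to http://localhost:11434/api/generate
```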
vLLM / TGI / SGLang
Server-grade engines, primarily CUDA-tuned. Mac support exists but is uncommon on consumer Mac.
8. App-level optimizations
Unload models when idle
A 40 GB model loaded “just in case” forces other apps into a smaller working set. Strategy:
- Model manager (singleton) tracks load time and last-use time.
- After N seconds of idle (60–300 s reasonable), unload.
- On next request, reload — first-token latency includes load time; set user expectations.
- If OS reports memory pressure, unload immediately regardless of idle timer.
Implementation note: in MLX, “unload” means dropping all references so arrays deallocate. In llama.cpp, llama_free_model + llama_free. Mac memory accounting is lazy — actual decrease in RSS may take a few seconds.
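The idle-timer strategy above can be sketched engine-agnostically. The class below is a hypothetical illustration: load_fn and unload_fn stand in for whatever the engine needs (dropping MLX references, or calling the llama.cpp free functions through a binding), and the 180-second default is just one point in the 60–300 s range.

```python
import time


class IdleUnloader:
    """Tracks last use of one model and unloads it after idle_seconds."""

    def __init__(self, load_fn, unload_fn, idle_seconds=180.0, clock=time.monotonic):
        self._load_fn, self._unload_fn = load_fn, unload_fn
        self._idle_seconds, self._clock = idle_seconds, clock
        self._model = None
        self._last_use = 0.0

    def acquire(self):
        """Return the model, loading it on demand, and refresh the idle timer."""
        if self._model is None:
            self._model = self._load_fn()
        self._last_use = self._clock()
        return self._model

    def tick(self, memory_pressure: bool = False) -> bool:
        """Call periodically; unloads when idle or under pressure. True if unloaded."""
        if self._model is None:
            return False
        idle = self._clock() - self._last_use >= self._idle_seconds
        if idle or memory_pressure:
            self._unload_fn(self._model)
            self._model = None
            return True
        return False
```

Injecting the clock keeps the policy testable without sleeping; passing memory_pressure=True covers the "unload immediately regardless of idle timer" rule.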
“Models are dirty pages even when mmap’d”: actually no
Slight clarification: mmap’d weights are clean (file-backed), not dirty. The kernel evicts them without writing to swap; it re-reads from the GGUF on next touch. The real issue is activation memory and KV cache — those are anonymous private pages and become swap pressure.
Under MEMORYSTATUS_PRESSURE_CRITICAL on iOS, the kernel kills the foreground app. On macOS it does not — instead, it aggressively compresses memory and swaps to disk. User sees beachballs. The right move: subscribe to pressure events (see §9) and proactively shed load.
Streaming UI
Stream tokens to the UI as they arrive — don’t buffer the whole response. Concretely:
- Inference engine emits a per-token callback.
- Callback dispatches to main queue (or actor in Swift Concurrency).
- UI appends incrementally. Use a Text or TextEditor that appends cheaply; avoid full re-layouts per token.
For Markdown rendering: parsing per token is expensive. Common pattern: maintain a raw-text buffer for the in-progress response and only render Markdown on completion or every N tokens.
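The "render every N tokens" pattern can be sketched as a small accumulator. The class name, the render callback, and the default of 16 tokens are hypothetical; in a real app the callback would hand the text to whatever Markdown view the UI uses.

```python
class MarkdownThrottle:
    """Accumulates raw tokens and asks for a Markdown re-render every N tokens."""

    def __init__(self, render_fn, every_n_tokens: int = 16):
        self._render_fn = render_fn
        self._every = every_n_tokens
        self._parts: list[str] = []
        self._since_render = 0

    def on_token(self, token: str) -> None:
        self._parts.append(token)  # cheap append, no parsing
        self._since_render += 1
        if self._since_render >= self._every:
            self._flush()

    def finish(self) -> None:
        """Final render so the tail of the response is never left unparsed."""
        self._flush()

    def _flush(self) -> None:
        self._render_fn("".join(self._parts))
        self._since_render = 0
```

The raw text can still be shown token by token; only the expensive parse is rate-limited.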
Shared-prefix caching
If your app sends a 4K-token system prompt + few-shot examples on every turn, cache the KV state after the shared prefix:
- llama.cpp: --prompt-cache <file> --prompt-cache-all. Or in the API, save/restore llama_state.
- MLX: pass the same prompt_cache object across turns.
Can cut first-token latency from seconds to milliseconds for long system prompts.
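The saving comes from how many leading tokens of the new prompt match what is already in the cache. The sketch below computes that overlap on plain token lists; the token values are made up, and a real engine does this comparison internally against its KV state.

```python
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Number of leading tokens whose KV entries can be reused as-is."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        n += 1
    return n


system_prompt = list(range(4000))            # pretend 4K-token shared prefix
turn_1 = system_prompt + [9001, 9002]
turn_2 = system_prompt + [9101, 9102, 9103]

reuse = reusable_prefix_len(turn_1, turn_2)  # tokens that skip prefill
to_prefill = len(turn_2) - reuse             # tokens that still need compute
```

In this example only 3 of 4,003 tokens need prefill on the second turn. Any change early in the prompt (a timestamp in the system message, for instance) resets the overlap to near zero.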
Multiple-model coordination
If your app uses embedder + chat model + STT/TTS, they collectively must fit. Strategies:
- LRU model manager: load on demand, evict least-recently-used when pressure rises.
- Sequential pipelines: if your workflow is “embed → chat → TTS” with no overlap, load one model at a time. Cost: load/unload time between stages.
- Smaller models for secondary tasks: a 0.5B embedder is plenty for retrieval. Don’t load a 7B where a 100M sentence transformer suffices.
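The LRU strategy can be sketched as bookkeeping over model names and sizes. This is a hypothetical illustration that tracks only the accounting; actual loading and unloading would be delegated to the engine, and the sizes in the usage note are made up.

```python
from collections import OrderedDict


class LRUModelManager:
    """Keeps loaded models under a memory budget, evicting least recently used."""

    def __init__(self, budget_gb: float):
        self.budget_gb = budget_gb
        self._loaded = OrderedDict()  # name -> size_gb, oldest first

    def use(self, name: str, size_gb: float) -> list[str]:
        """Mark a model as used, loading it if needed. Returns evicted names."""
        if size_gb > self.budget_gb:
            raise ValueError(f"{name} ({size_gb} GB) exceeds the budget")
        evicted = []
        if name in self._loaded:
            self._loaded.move_to_end(name)  # refresh recency, nothing to load
            return evicted
        while sum(self._loaded.values()) + size_gb > self.budget_gb:
            victim, _ = self._loaded.popitem(last=False)  # least recently used
            evicted.append(victim)
        self._loaded[name] = size_gb
        return evicted

    def loaded(self) -> list[str]:
        return list(self._loaded)
```

For example, with a 6 GB budget holding an embedder (0.5), a chat model (4.4), and TTS (1.0), requesting a 1.5 GB STT model evicts whichever of the three was used least recently. Pure LRU may evict the large chat model first, so a production policy would likely weight eviction by reload cost as well.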
OS-provided alternatives save real memory:
- Embeddings: Apple’s NaturalLanguage framework.
- STT: Speech framework or Whisper-CoreML.
- TTS: AVSpeechSynthesizer.
9. Detecting and responding to memory pressure
The API
import Dispatch

// Keep a strong reference to `source`; events stop once it is deallocated.
let source = DispatchSource.makeMemoryPressureSource(
    eventMask: [.warning, .critical],
    queue: .global(qos: .utility)
)
source.setEventHandler {
    let event = source.data
    if event.contains(.critical) {
        // Drop KV cache, unload models, clear embeddings.
    } else if event.contains(.warning) {
        // Free non-essential caches.
    }
}
source.resume()
States:
- .normal: no pressure.
- .warning: system starting to feel pressure; free non-essential caches.
- .critical: imminent; drop models, free everything you can.
Practical ladder
- Normal: keep things as-is.
- Warning: drop KV cache for non-active conversations. Reduce max_tokens for queued requests. Stop prefetching.
- Critical: unload secondary models (embedder, TTS). Drop chat model if not in-flight. If in-flight, finish the current response and then unload.
- Post-critical: when pressure clears, lazily reload on next request.
The system-wide signal is shared across processes. On a 16 GB Mac with Chrome + Slack + your LLM app, you may see .warning even when your app is “well-behaved” — react anyway, because shedding load means the user’s other apps stay responsive.
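The ladder can be expressed as a pure mapping from pressure level to an ordered list of actions, which keeps the policy separate from the platform event plumbing. The action names below are hypothetical labels, and treating critical as a superset of the warning steps is an assumption of this sketch.

```python
from enum import Enum


class Pressure(Enum):
    NORMAL = "normal"
    WARNING = "warning"
    CRITICAL = "critical"


def actions_for(level: Pressure, chat_in_flight: bool) -> list[str]:
    """Translate a pressure level into ordered load-shedding steps."""
    if level is Pressure.NORMAL:
        return []
    actions = ["drop_inactive_kv_caches", "reduce_queued_max_tokens", "stop_prefetch"]
    if level is Pressure.CRITICAL:
        actions.append("unload_secondary_models")
        # Never cut off a response mid-stream; defer the unload instead.
        actions.append("unload_chat_model_after_response" if chat_in_flight
                       else "unload_chat_model")
    return actions
```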
What macOS won’t do
macOS will not jetsam-kill a foreground app for memory pressure the way iOS does. It will:
- Compress your dirty pages (saves memory at CPU cost).
- Swap compressed pages to disk.
- Slow disk I/O and GPU command submission as the system thrashes.
The user sees beachballs and blames your app. Even though you won’t be killed, the UX cost of bad memory citizenship is real.
10. Practical numbers and tok/s expectations
Memory bandwidth per chip (approximate, see mac-hardware-lineup.md for exact)
| Chip | Bandwidth | Notes |
|---|---|---|
| M1 base | 68 GB/s | |
| M2 / M3 / M4 base | 100 / 100 / 120 GB/s | |
| M1 / M2 Pro | 200 GB/s | |
| M3 Pro | 150 GB/s | regression vs M2 Pro |
| M4 Pro | 273 GB/s | recovers and beats |
| M1 / M2 Max | 400 GB/s | |
| M3 Max (14C) | 300 GB/s | binned variant |
| M3 Max (16C) | 400 GB/s | full die |
| M4 Max (14C) | 410 GB/s | binned variant |
| M4 Max (16C) | 546 GB/s | full die |
| M1 / M2 / M3 Ultra | ~800 GB/s | essentially flat across three Ultra generations (M3 Ultra is listed at 819 GB/s) |
| M5 base | ~150 GB/s | LPDDR5X-9600 |
Tok/s expectations (decode, single user, short context)
These are synthesized from r/LocalLLaMA, AlexZiskind, MLX team benchmarks. Order-of-magnitude estimates.
Qwen 2.5 7B Q4_K_M (~4.4 GB weights):
- MacBook Air M2 8 GB: not viable (system has ~3 GB free at idle). M2 16 GB: ~25 tok/s. M2 24 GB: ~25 tok/s.
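- MacBook Air M3 16/24 GB: ~22 tok/s (M3 base has the same ~100 GB/s as M2 in the table above, so treat the gap versus M2 as reporting noise rather than a bandwidth effect).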
- MacBook Air M4 16/24/32 GB: ~28 tok/s.
- MacBook Pro M3 Pro 18/36 GB: ~35 tok/s.
- MacBook Pro M4 Pro 24/48/64 GB: ~50–60 tok/s.
- MacBook Pro M3 Max 36/64/128 GB: ~60–80 tok/s depending on bandwidth binning.
- MacBook Pro M4 Max 36/48/64/128 GB: ~80–110 tok/s.
- Mac Studio M2 Ultra 64/128/192 GB: ~80–110 tok/s.
- Mac Studio M3 Ultra 96/256/512 GB: ~90–120 tok/s.
Llama 3 8B Q4_K_M (~4.9 GB):
- ~10–15% slower than Qwen 2.5 7B due to slightly larger weights. Same ordering applies.
Llama 3 70B Q4_K_M (~42 GB):
- Won’t fit on anything under 48 GB.
- MacBook Pro M3 Max 64 GB: ~6–10 tok/s. Painful but usable.
- MacBook Pro M3 Max 128 GB: ~6–10 tok/s (bandwidth-bound, not capacity).
- MacBook Pro M4 Max 64/128 GB: ~10–14 tok/s.
- Mac Studio M2 Ultra 128/192 GB: ~10–14 tok/s.
- Mac Studio M3 Ultra 256/512 GB: ~12–16 tok/s.
Caveats:
- Prompt processing (prefill) tok/s is typically 5–20× higher than decode tok/s. Long prompts process fast; generation is slow.
- These are steady-state with small context. At 32K context the KV cache reads add overhead and decode tok/s drops 30–50%.
- Speculative decoding can multiply effective tok/s by 1.5–2× for typical chat workloads.
Where bandwidth ceases to be the constraint
At batch=1 and short generations, you’re firmly bandwidth-bound. At larger batches (multi-user serving) or with very large MoE models, compute becomes the constraint. For consumer Mac (single user, chat), bandwidth × weight-size dominates. This is why a Mac Studio Ultra is dramatically faster than a MacBook Air even at the same quantization.
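The bandwidth-bound regime gives a simple upper limit: at batch 1 every generated token has to read all the weights once, so decode speed cannot exceed bandwidth divided by weight size. The sketch below applies that to a few rows of the table above. It is a ceiling under that single assumption and ignores KV cache reads, kernel overhead, and thermals.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on batch-1 decode speed: every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


# Bandwidth figures from the table above; weight sizes from the tok/s section.
CHIPS = {"M2 base": 100, "M4 Pro": 273, "M3 Max (16C)": 400, "M2 Ultra": 800}

for chip, bw in CHIPS.items():
    print(f"{chip:13s} 7B Q4_K_M <= {decode_ceiling_tok_s(bw, 4.4):5.1f} tok/s, "
          f"70B Q4_K_M <= {decode_ceiling_tok_s(bw, 42):4.1f} tok/s")
```

Measured numbers should land at or below these ceilings; a community figure above the ceiling usually means a smaller file or a different quantization than assumed. The same function with a ~5 GB/s SSD in place of memory bandwidth gives roughly 1 tok/s for a 7B model, which is the per-token streaming penalty described in §1.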
Benchmarking sources
- AlexZiskind YouTube — most rigorous public Mac LLM benchmarking; thermals, sustained vs burst, prefill vs decode.
- r/LocalLLaMA — weekly tok/s threads. Wide methodology variance.
- Awni Hannun’s MLX benchmark gists — cleanest MLX numbers.
- Hugging Face mlx-community model cards — often include tok/s on reference hardware.
- Justine Tunney’s llamafile blog (https://justine.lol), especially for the AMX-accelerated path on M1/M2.
Summary playbook for an app targeting “works on 16 GB Mac”
- Default to Qwen 2.5 7B Q4_K_M or equivalent (~4.4 GB).
- Load via mmap (default in both llama.cpp and MLX). Never use --mlock on 16 GB systems.
- Cap n_ctx at 8192 unless the user opts in to more. Use q8_0 KV quantization and FlashAttention (-fa).
- -ngl 999 (everything on GPU) and -t <num_P_cores>.
- Subscribe to memory pressure. Drop secondary models, KV cache, embedding caches on warning. Drop chat model on critical (let in-flight requests finish first).
- Unload after idle — 2–5 minutes is reasonable.
- Stream tokens to UI, don’t buffer.
- Cache shared prefixes (system prompt + few-shot) as KV cache files.
- Respect Low Power Mode and thermal state — gracefully downshift.
- For multi-model apps, use an LRU model manager with a memory budget that respects the user’s hardware tier.
For a Mac Studio Ultra user, most of this matters less — they have headroom. For the median 16 GB MacBook Air user, every one matters; getting any of them wrong shows up as beachballs.
Specific learnings for Locara
- The runtime should own mmap loading. Apps shouldn’t be able to opt out: the only way they get weights into memory is through a Locara primitive that mmaps under the hood and exposes Metal-shared buffers. No --mlock, no Data(contentsOf:).
- The runtime should own memory-pressure response. Apps subscribe to a normalized “memory budget tightening” capability event from the runtime, not the raw dispatch source. This lets the runtime coordinate multiple apps’ eviction behavior and avoid the case where two apps both keep their models loaded while pressure rises.
- q8_0 K/V as the default KV cache type. No user-visible knob to turn it off; turning it off is a manifest opt-in reserved for quality-sensitive apps (e.g., coding-grade reasoning). Almost all chat workloads are indistinguishable from FP16 KV at q8.
- Thread count is a runtime concern, not an app concern. Locara reads sysctl hw.perflevel0.physicalcpu once and plumbs that into every inference engine instance. Apps never set -t.
- The model unload policy is a platform default, not an app choice. Apps can request “keep loaded” but the default is LRU-evict after 3 minutes of idle, with critical-pressure override. Prevents the Ollama-style “model stays loaded eating 40 GB while you went to lunch” problem.
- Reject --no-mmap and --mlock from manifest. Locara’s manifest should explicitly forbid these footguns. App authors don’t need them; their presence almost always indicates a bug.
- MLX-default with llama.cpp fallback is the right inference path on Mac, per mlx.md’s analysis. This note doesn’t change that conclusion; it just makes the implementation concrete. Locara runtime handles the dispatch.
- Speculative decoding requires both models to be in the manifest as a pair. Runtime computes combined memory cost. App authors don’t manage draft models manually.
- Streaming token UI as a runtime primitive. Locara provides an event-stream API; apps don’t poll the inference engine. Standardizes the smoothness story and lets the runtime do back-pressure under thermal/power constraints.
- Per-app memory budget enforcement. Locara’s manifest declares requires.memoryGB: N; runtime monitors footprint and warns the app (and the user) if it exceeds. Prevents an app from quietly starving the rest of the user’s system.
References
Engines and frameworks
- llama.cpp: https://github.com/ggerganov/llama.cpp (Georgi Gerganov)
- MLX: https://github.com/ml-explore/mlx (Awni Hannun et al., Apple)
- mlx-lm: https://github.com/ml-explore/mlx-lm
- MLC LLM: https://github.com/mlc-ai/mlc-llm
- Ollama: https://github.com/ollama/ollama
- llamafile: https://github.com/Mozilla-Ocho/llamafile (Justine Tunney)
- vLLM: https://github.com/vllm-project/vllm
Key writeups and benchmarks
- arXiv:2511.05502, Production-Grade Local LLM Inference on Apple Silicon: Engineering Tradeoffs, Benchmarks, and Deployment Patterns (2025)
- Justine Tunney, Edge AI Just Got Faster: https://justine.lol/mmap/
- Simon Willison, “Run LLMs on macOS using llm-mlx and Apple’s MLX framework” (2025-02-15): https://simonwillison.net
- Sam McLeod, Bringing K/V Context Quantisation to Ollama: https://smcleod.net/2024/12/bringing-k-v-context-quantisation-to-ollama/
- llama.cpp discussion #638 (mmap design): https://github.com/ggml-org/llama.cpp/discussions/638
- llama.cpp discussion #5932 (KV cache quantization): https://github.com/ggml-org/llama.cpp/discussions/5932
- llama.cpp discussion #9999 (mmap RSS reporting): https://github.com/ggml-org/llama.cpp/discussions/9999
Apple primary
- Choosing a resource storage mode for Apple GPUs: https://developer.apple.com/documentation/metal/choosing-a-resource-storage-mode-for-apple-gpus
- Reducing the memory footprint of Metal apps: https://developer.apple.com/documentation/metal/reducing-the-memory-footprint-of-metal-apps
- recommendedMaxWorkingSetSize: https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize
- DISPATCH_SOURCE_TYPE_MEMORYPRESSURE: https://developer.apple.com/documentation/dispatch/dispatch_source_type_memorypressure
- NSProcessInfo (Low Power Mode, thermal state, activity assertions): https://developer.apple.com/documentation/foundation/processinfo
WWDC sessions
- WWDC 2020 “Explore the new system architecture of Apple silicon Macs”
- WWDC 2021 #10254 “Tune CPU job scheduling for Apple silicon Macs”
- WWDC 2022 #10106 “Profile and optimize your game’s memory”
- WWDC 2024 ML / Metal sessions on MLX and Neural Engine
- WWDC 2025 sessions on M5 GPU neural accelerators
Practitioners worth following
- Awni Hannun (MLX lead): @awnihannun, awnihannun.com
- Georgi Gerganov (llama.cpp): @ggerganov
- Justine Tunney (llamafile, AMX path): justine.lol
- Simon Willison (practical Mac LLM): simonwillison.net
- Tim Dettmers (quantization, hardware): timdettmers.com
- Maxime Labonne (fine-tuning, quantization): mlabonne.github.io
- Daniel Han / Unsloth: fine-tuning memory
- AlexZiskind (YouTube): Mac LLM benchmarking
- r/LocalLLaMA: community measurements
Contested / version-dependent
- arXiv:2511.05502 230 vs 150 tok/s: not independently verified; methodology matters.
- iogpu.wired_limit_mb sysctl name and default percentage: varies across macOS releases.
- MLX KV cache API (make_prompt_cache, QuantizedKVCache): actively evolving; check mlx-lm release notes.
- MTLResidencySet vs implicit residency: macOS 15+ only, best practices still emerging.
- Per-chip memory bandwidth: Apple’s published figures; real-world utilization varies.
- Tok/s numbers above: order-of-magnitude only.