LLM Memory Math — Parameters, KV Cache, Bandwidth, and What Actually Fits
What this is: A first-principles reference for translating “this model has X billion parameters at context length C” into hard numbers: GB of memory to load, GB to run, tokens/sec expected on a given device. Covers weights, quantization, KV cache, activations, the bandwidth-vs-FLOPS regime, MoE, speculative decoding, and worked examples for the common open-weights models.
Why it matters: Every Locara manifest claim (“this app needs 12 GB”) and every Locara device card (“M4 Max 128 GB runs Llama 3 70B Q4 at ~14 tok/s”) has to be defensible from formulas, not vibes. App authors and users need to know why a model fits or doesn’t, not just that it does. This is the math that lets the runtime make honest predictions.
Most relevant to Locara: Pairs with chip-fundamentals.md (the bandwidth-vs-FLOPS story at the hardware level), mac-hardware-lineup.md (the per-device bandwidth and RAM numbers to plug in), and mac-llm-optimization.md (the practical playbook for keeping these numbers small in production).
1. Parameters → bytes (weight memory)
The base formula
weight_bytes = num_params × bytes_per_param
This is the lower bound on RAM to load the model — before any inference runs, before any context is processed.
Bytes per parameter, by dtype
| dtype | bits | bytes/param | typical use |
|---|---|---|---|
| FP32 | 32 | 4 | original training checkpoints |
| FP16 / BF16 | 16 | 2 | “full precision” inference and training |
| FP8 (E4M3 / E5M2) | 8 | 1 | H100+ training/inference |
| INT8 | 8 | 1 | bitsandbytes 8-bit, GPTQ-8 |
| INT4 (Q4) | 4 | ≈0.5 + overhead | bitsandbytes NF4, AWQ, GPTQ-4, GGUF Q4_* |
| INT3 / INT2 | 3 / 2 | ≈0.375 / 0.25 + overhead | aggressive compression |
Worked examples, full precision:
- Llama 3 8B at FP16 → 8 × 10⁹ × 2 = 16 GB
- Llama 3 70B at FP16 → 70 × 10⁹ × 2 = 140 GB
- Llama 3.1 405B at FP16 → 405 × 10⁹ × 2 = 810 GB
(File-on-disk size is slightly larger — usually <1% — for tokenizer, config, tensor metadata.)
Why “4 bits” is never exactly 4 bits
Every practical low-bit format stores weights in blocks, sharing a scale (and sometimes zero-point / min) across the block. That block metadata is the overhead. The honest unit is bits-per-weight (bpw).
Generic per-block affine quantization:
weight_real ≈ scale × weight_int + zero_point (asymmetric)
weight_real ≈ scale × weight_int (symmetric)
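The two per-block formulas above can be sketched in a few lines. This is a minimal illustration, not how llama.cpp or MLX pack bits: the function names `quantize_block` and `dequantize_block` are hypothetical, and `zero_point` is stored as a float offset (the block minimum) to match the formula as written.

```python
def quantize_block(weights, bits=4, symmetric=True):
    """Quantize one block of floats; return (ints, scale, zero_point)."""
    if symmetric:
        qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
        amax = max(abs(w) for w in weights)
        scale = amax / qmax if amax else 1.0
        ints = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
        return ints, scale, 0.0
    levels = 2 ** bits - 1                         # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    ints = [max(0, min(levels, round((w - lo) / scale))) for w in weights]
    return ints, scale, lo                         # offset = block minimum


def dequantize_block(ints, scale, zero_point):
    """weight_real ≈ scale × weight_int + zero_point."""
    return [scale * q + zero_point for q in ints]


block = [0.1, -0.4, 0.25, 0.7]
ints, scale, zp = quantize_block(block, bits=4, symmetric=False)
print(ints, dequantize_block(ints, scale, zp))
```

The round-trip error per weight is bounded by half a quantization step (`scale / 2`), which is why smaller blocks, with a tighter min/max range, give better quality at the cost of more stored scales.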
GGUF (llama.cpp) bits-per-weight
llama.cpp’s K-quants pack weights into “superblocks” of 256 weights, divided into sub-blocks of 16 or 32. Each sub-block has its own 6- or 8-bit scale plus a super-block-level FP16 scale. Per llama.cpp/tools/quantize/README.md and the Tensor-Encoding-Schemes wiki:
| Quant | Theoretical bpw | Measured bpw on Llama 3.1 8B | Notes |
|---|---|---|---|
| Q2_K | 2.56 | ~2.97 | very aggressive |
| Q3_K_S | 3.44 | ~3.16 | small |
| Q3_K_M | ~3.70 | ~3.64 | medium |
| Q3_K_L | ~3.90 | ~4.00 | large |
| Q4_0 | 4.50 (legacy) | — | one fp16 scale per 32 weights |
| Q4_1 | 5.00 (legacy) | — | scale + min per 32 weights |
| Q4_K_S | ~4.50 | ~4.67 | k-quant 4-bit small |
| Q4_K_M | ~4.50 | ~4.89 | community sweet spot |
| Q5_K_S | ~5.50 | ~5.57 | |
| Q5_K_M | ~5.50 | ~5.70 | near-lossless |
| Q6_K | ~6.56 | ~6.56 | indistinguishable from FP16 on benchmarks |
| Q8_0 | ~8.50 | ~8.50 | per-block fp16 scale + 8-bit ints |
| FP16 | 16 | 16 | baseline |
Measured bpw usually exceeds theoretical because token embeddings and the lm_head output projection are kept at higher precision (FP16 or Q8_0) for quality, dragging the average up. Q4_K_M is itself mixed under the hood: it uses Q6_K for some tensors.
Memory math: weight_bytes ≈ num_params × bpw / 8.
- Llama 3 8B at Q4_K_M: 8 × 10⁹ × 4.89 / 8 ≈ 4.89 GB
- Llama 3 70B at Q4_K_M: 70 × 10⁹ × 4.89 / 8 ≈ 42.8 GB
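As a quick check of the memory math above, here is a small sketch that turns a parameter count and a bpw figure into decimal GB. The helper name `weight_gb` is an assumption for illustration; the bpw values are the ones from the table above.

```python
def weight_gb(num_params: float, bpw: float) -> float:
    """Weight memory in decimal GB: num_params × bpw / 8 bytes."""
    return num_params * bpw / 8 / 1e9


for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    for quant, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.89)]:
        print(f"{name:12s} {quant:7s} {weight_gb(params, bpw):6.1f} GB")
```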
MLX bpw
MLX packs weights into uint32 with separate scales/biases side-cars. Default is 4-bit with group size 64 → ~4.5 bpw effective (4 + 16/64 scale + 16/64 bias).
| MLX preset | bits | group | effective bpw |
|---|---|---|---|
| q2 | 2 | 64 | ~2.5 |
| q3 | 3 | 64 | ~3.5 |
| q4 (default) | 4 | 64 | ~4.5 |
| q6 | 6 | 64 | ~6.5 |
| q8 | 8 | 64 | ~8.5 |
MLX also supports mixed-precision presets where attention stays higher-bit while MLP drops, giving averages of ~2.2–6.2 bpw.
“Smart” quants: AWQ, GPTQ, EXL2
These use calibration data (a few hundred sequences from C4 / WikiText) to identify salient weights and protect them during quantization:
- GPTQ (Frantar et al., 2022, arXiv:2210.17323) uses approximate second-order Hessian information to choose quantization rounding that minimizes per-layer reconstruction error. Achieves 3–4-bit weights with minimal accuracy loss on OPT-175B / BLOOM-176B in hours.
- AWQ (Lin et al., 2023, arXiv:2306.00978) observes that ~1% of weight channels, those with the largest activation magnitudes, dominate error. Scaling those channels up before quantization (and folding the inverse scale into the activations, so the layer's output is mathematically unchanged) protects them much as mixed precision would, while keeping a uniform 4-bit format.
- SmoothQuant (Xiao et al., 2022) shifts quantization difficulty from activations to weights via channel-wise scaling, enabling INT8 weight + activation quantization.
- EXL2 (ExLlamaV2) extends GPTQ with mixed-bitwidth allocation per layer based on a measurement pass — non-integer average bpw (e.g. 3.5, 4.65, 5.2).
Memory math is the same — bpw is what matters.
Per-tensor / per-channel / group-wise scales
- Per-tensor: one scale per matrix. Smallest overhead, worst quality.
- Per-channel: one scale per output channel (row of W). Standard for INT8 weights.
- Group-wise (32 / 64 / 128): one scale per group of consecutive weights. The de facto standard for ≤4-bit. Smaller group → more overhead but higher quality.
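For 4-bit + group-128 + FP16 scales: bpw = 4 + 16/128 = 4.125. With group-32: bpw = 4 + 16/32 = 4.5. Q4_K_M lands near 4.9 because its 4.5 bpw base (32-weight sub-blocks with 6-bit scales under an FP16 super-block scale) is averaged with tensors kept at higher precision, such as Q6_K.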
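The group-wise overhead arithmetic generalizes to one line. The sketch below assumes a simple layout of one scale (and optionally one bias) per group; `effective_bpw` is a hypothetical helper, and it deliberately ignores the two-level super-block scales that K-quants use.

```python
def effective_bpw(bits: int, group_size: int,
                  scale_bits: int = 16, bias_bits: int = 0) -> float:
    """Payload bits plus per-group metadata amortized over the group."""
    return bits + (scale_bits + bias_bits) / group_size


print(effective_bpw(4, 128))                # 4-bit, group-128, FP16 scale
print(effective_bpw(4, 32))                 # 4-bit, group-32, FP16 scale
print(effective_bpw(4, 64, bias_bits=16))   # MLX-style: FP16 scale + FP16 bias
```

Halving the group size doubles the metadata cost per weight, which is the quality-versus-overhead trade described in the list above.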
2. KV cache memory (runtime, context-dependent)
The KV cache is the runtime cost that scales with context. Every generated token must be remembered: its K and V projections per attention layer are cached so subsequent tokens don’t recompute them. At long context the KV cache can exceed the weights themselves.
Formula
kv_cache_bytes = 2 × num_layers × seq_len × num_kv_heads × head_dim × kv_dtype_bytes
(The leading 2 is for K and V. Multiply by batch size for multi-user serving; for local single-user, batch=1.)
MHA vs MQA vs GQA (why this number isn’t huge)
Vanilla Multi-Head Attention (MHA) has num_kv_heads = num_query_heads. Each query head has its own K and V.
Multi-Query Attention (MQA) (Shazeer, 2019) collapses to one K/V head shared by all queries — divides KV cache by num_query_heads, but hurts quality.
Grouped-Query Attention (GQA) (Ainslie et al., 2023, arXiv:2305.13245) — used by Llama 2 70B and all of Llama 3 — is the compromise: num_kv_heads < num_query_heads but > 1. Each group of query heads shares K/V.
- Llama 3 8B: 32 query heads, 8 KV heads → 4× smaller KV cache than MHA, essentially no quality loss.
- Llama 3 70B / 405B: 64 / 128 query heads, 8 KV heads → 8× / 16× reduction.
Worked examples (FP16 KV cache, batch=1)
Llama 3 8B — 32 layers, 8 KV heads, head_dim 128:
- Per-token: 2 × 32 × 8 × 128 × 2 = 131,072 B = 128 KB
- 4K context: 512 MiB
- 32K: 4.0 GiB
- 128K: 16.0 GiB
Llama 3 70B — 80 layers, 8 KV heads, head_dim 128:
- Per-token: 2 × 80 × 8 × 128 × 2 = 327,680 B = 320 KB
- 4K: 1.25 GiB
- 32K: 10 GiB
- 128K: 40 GiB
Llama 3.1 405B — 126 layers, 8 KV heads, head_dim 128:
- Per-token: 2 × 126 × 8 × 128 × 2 = 516,096 B = 504 KB
- 4K: 1.97 GiB
- 32K: 15.75 GiB
- 128K: 63 GiB — exceeds typical Mac Studio RAM on KV cache alone
Mixtral 8x7B — 32 layers, 8 KV heads, head_dim 128:
- Per-token: 128 KB
- 32K: 4 GiB
Qwen 2.5 7B — 28 layers, 4 KV heads, head_dim 128:
- Per-token: 2 × 28 × 4 × 128 × 2 = 57,344 B = 56 KB
- 32K: 1.75 GiB
- 128K: 7.0 GiB
Qwen 2.5 32B — 64 layers, 8 KV heads, head_dim 128:
- Per-token: 256 KB
- 32K: 8 GiB
- 128K: 32 GiB
KV cache grows linearly with sequence length. At 128K for a 70B model, the FP16 KV cache (~40 GiB) is comparable to the Q4_K_M weights (~43 GB). This is why long-context inference is so memory-hungry.
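The worked examples above all come from the same formula, so they are easy to regenerate. A minimal sketch, assuming batch=1 and an FP16 cache by default; the function names and the `MODELS` table are illustrative, with the layer and head counts taken from the examples above.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_dtype_bytes=2):
    """Leading 2 is for K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes


def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len,
                 kv_dtype_bytes=2, batch=1):
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                                   kv_dtype_bytes)
    return batch * seq_len * per_token / 2**30


# (num_layers, num_kv_heads, head_dim)
MODELS = {
    "Llama 3 8B": (32, 8, 128),
    "Llama 3 70B": (80, 8, 128),
    "Llama 3.1 405B": (126, 8, 128),
    "Qwen 2.5 7B": (28, 4, 128),
}

for name, arch in MODELS.items():
    sizes = [kv_cache_gib(*arch, seq_len=k * 1024) for k in (4, 32, 128)]
    print(f"{name:15s} {kv_bytes_per_token(*arch) // 1024:4d} KiB/token  "
          + "  ".join(f"{s:6.2f} GiB" for s in sizes))
```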
KV cache quantization
llama.cpp supports --cache-type-k and --cache-type-v with options f32, f16 (default), bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1.
Trade-offs from community benchmarks (llama.cpp discussion #5932):
- q8_0 K / q8_0 V: near-lossless, ~50% memory reduction. Required matched K/V types for fused FlashAttention.
- q8_0 K / q4_0 V: ~60% reduction at ~6.5 bpw average. Quality usually fine.
- q4_0 K / q4_0 V: ~75% reduction. Some quality loss on reasoning. Stretches context dramatically.
vLLM supports FP8 KV cache; production deployments increasingly use INT8 or FP8 by default.
A 70B model at 128K context: FP16 KV cache = 40 GB → q8 KV cache = 20 GB → q4 KV cache = 10 GB. The savings are huge and the quality cost is small.
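The round 50% and 75% figures treat q8_0 and q4_0 as exactly 8 and 4 bits. A sketch that includes the per-block FP16 scale (8.5 and 4.5 bpw, the same block formats listed in the GGUF table earlier) gives slightly larger caches. The helper name and the `CACHE_BPW` table are assumptions for illustration.

```python
# Bits per cached element, including one FP16 scale per 32-value block.
CACHE_BPW = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib_quant(num_layers, num_kv_heads, head_dim, seq_len,
                       k_type="f16", v_type="f16"):
    elems = num_layers * seq_len * num_kv_heads * head_dim   # per K or per V
    total_bits = elems * (CACHE_BPW[k_type] + CACHE_BPW[v_type])
    return total_bits / 8 / 2**30


llama_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)
for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
    gib = kv_cache_gib_quant(**llama_70b, k_type=k, v_type=v)
    print(f"K={k:5s} V={v:5s} {gib:6.2f} GiB")
```

For the 70B model at 128K this lands a little above the nominal 20 GB and 10 GB, which matters only when the budget is already tight.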
Avoiding quadratic growth — sliding window, ring attention, context shifting
KV cache scales linearly with context (memory). Attention computation is quadratic (each new token attends to all previous). Techniques to avoid both:
- Sliding window attention (Mistral) — only attend to the last W tokens. KV cache caps at W.
- StreamingLLM (Xiao et al., 2023, arXiv:2309.17453) — keep a few “attention sink” tokens at start plus a rolling window. Effectively-infinite streaming, bounded memory.
- Ring Attention (Liu et al., 2023) — distribute sequence dimension across devices, communicate KV blocks in a ring. Used to push training context past 1M tokens.
- Context shifting (llama.cpp): when context fills, drop oldest tokens and reuse the existing KV cache for the remainder. Avoids reprocessing the prompt every turn.
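The bounded-memory idea behind sliding windows and attention sinks can be sketched as a cache eviction policy. This is a toy model under stated assumptions: `RollingKVCache` is a hypothetical class, entries stand in for per-token K/V tensors, and the position re-indexing that StreamingLLM applies inside the cache is not modeled.

```python
from collections import deque


class RollingKVCache:
    """Keep the first `num_sinks` entries plus the most recent `window` entries."""

    def __init__(self, window: int, num_sinks: int = 0):
        self.num_sinks = num_sinks
        self.sinks = []
        self.recent = deque(maxlen=window)   # drops the oldest entry when full

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def entries(self):
        return self.sinks + list(self.recent)

    def __len__(self):
        return len(self.sinks) + len(self.recent)


cache = RollingKVCache(window=4, num_sinks=2)
for token_index in range(10):
    cache.append(token_index)
print(cache.entries())   # sink tokens followed by the rolling window
```

With `num_sinks=0` this reduces to plain sliding-window behavior: memory caps at `window` entries no matter how long generation runs.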
3. Activations memory (forward pass, peak)
For batch=1 inference, activations are nearly negligible compared to weights and KV cache:
- The hidden state (hidden_size × seq_len_prefill × dtype_bytes), reused across layers.
- The attention Q × Kᵀ score matrix (seq_len × seq_len × num_heads × dtype_bytes); this is the quadratic term FlashAttention eliminates.
- MLP intermediate activations (intermediate_size × seq_len × dtype_bytes).
Without FlashAttention, processing a 32K-token prompt for an 8B model would naively allocate a 32K × 32K × 32 × 2 B ≈ 69 GB (64 GiB) score matrix per layer. FlashAttention (Dao et al., 2022, arXiv:2205.14135) tiles into SRAM-sized blocks and reduces peak memory to O(N) instead of O(N²), dramatically reducing HBM reads/writes. FlashAttention-2 (Dao 2023, arXiv:2307.08691) improved parallelism; FlashAttention-3 (Shah, Dao 2024, tridao.me/blog/2024/flash3/) added async H100 features and FP8.
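To see how quickly the naive score matrix grows, the quadratic term can be evaluated directly. A minimal sketch, assuming FP16 scores and 32 attention heads as in the 8B example; `naive_score_matrix_bytes` is a hypothetical helper.

```python
def naive_score_matrix_bytes(seq_len: int, num_heads: int, dtype_bytes: int = 2) -> int:
    """Size of a fully materialized Q × Kᵀ score matrix for one layer."""
    return seq_len * seq_len * num_heads * dtype_bytes


for k in (4, 8, 32, 128):
    size = naive_score_matrix_bytes(k * 1024, num_heads=32)
    print(f"{k:4d}K tokens: {size / 2**30:8.0f} GiB")
```

Each doubling of the prompt length quadruples this term, whereas the KV cache only doubles.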
For training / fine-tuning, activations dominate — every layer’s forward activations must be kept for backward. Activation checkpointing trades recompute for memory by saving a subset and recomputing the rest. Standard formula (EleutherAI’s “Transformer Math 101”, blog.eleuther.ai/transformer-math/):
training_memory ≈ weights × (1 [params] + 1 [grads] + 2 [Adam m,v]) × dtype + activations
≈ ~20 bytes/param at mixed precision
For inference at batch=1, treat activations as a small constant ~0.5–2 GB.
4. The “what fits?” formula
total_memory ≈ weight_bytes
+ kv_cache_bytes_at_max_context
+ activation_overhead (~0.5–2 GB)
+ framework_overhead (~1–3 GB; llama.cpp lower, Python/HF higher)
+ OS + other apps (10–20 GB on macOS realistically)
Community heuristics (from r/LocalLLaMA wiki, Hugging Face accelerate docs):
- Leave 20–30% of system RAM for OS and other apps. On a 64 GB Mac, ~48 GB usable for the model stack.
- Quick estimate: usable_RAM ≈ system_RAM × 0.75, then model_budget = usable_RAM − kv_cache_at_max_context.
- Apple Silicon’s unified memory means the GPU can address up to ~75% of total RAM by default (macOS reserves the rest), tunable via sudo sysctl iogpu.wired_limit_mb.
Decision rule: pick the largest model whose weight_bytes + kv_cache_bytes_at_your_max_context < 0.7 × system_RAM.
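The decision rule is simple enough to encode directly. The sketch below is an illustration only: `fits` and `smallest_fitting_ram` are hypothetical names, the list of RAM sizes is an assumed set of configurations rather than a lookup from mac-hardware-lineup.md, and GB and GiB are mixed loosely, as in the rule itself.

```python
def fits(weight_gb: float, kv_gb: float, system_ram_gb: float,
         ram_fraction: float = 0.7) -> bool:
    """Decision rule: weights + KV cache must stay under 0.7 × system RAM."""
    return weight_gb + kv_gb < ram_fraction * system_ram_gb


def smallest_fitting_ram(weight_gb, kv_gb,
                         ram_sizes=(16, 24, 32, 48, 64, 96, 128, 192, 256, 512)):
    """Return the smallest RAM size (GB) that passes the rule, or None."""
    for ram in ram_sizes:
        if fits(weight_gb, kv_gb, ram):
            return ram
    return None


print(smallest_fitting_ram(4.89, 0.5))    # Llama 3 8B Q4_K_M at 4K
print(smallest_fitting_ram(42.8, 1.25))   # Llama 3 70B Q4_K_M at 4K
print(smallest_fitting_ram(42.8, 40.0))   # Llama 3 70B Q4_K_M at 128K, FP16 KV
```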
5. Memory bandwidth vs FLOPS — why bandwidth is the headline number for decode
The Roofline model
The Roofline model (Williams, Waterman, Patterson, CACM April 2009) bounds performance by either compute (peak FLOPS) or memory bandwidth, depending on a kernel’s arithmetic intensity (AI):
AI = FLOPs performed / bytes moved from memory
attainable_perf = min(peak_FLOPS, peak_bandwidth × AI)
Below a critical AI (“ridge point”), bandwidth dominates. Above it, compute dominates. This is Horace He’s central point in “Making Deep Learning Go Brrrr From First Principles” (https://horace.io/brrr_intro.html): modern GPUs grew FLOPS faster than bandwidth, so most non-matmul operations are bandwidth-bound and operator fusion is critical.
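The two Roofline equations translate directly into code. In this sketch the device numbers (30 TFLOPS peak, 800 GB/s) are placeholders chosen for illustration, not measured specifications, and the function names are assumptions.

```python
def attainable_flops(peak_flops: float, peak_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound: min(peak_FLOPS, peak_bandwidth × AI)."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """AI (FLOPs/byte) at which a kernel stops being bandwidth-bound."""
    return peak_flops / peak_bandwidth


PEAK_FLOPS = 30e12       # illustrative
PEAK_BANDWIDTH = 800e9   # bytes/s, illustrative

print(f"ridge point: {ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH):.1f} FLOPs/byte")
for ai in (1, 4, 37.5, 200):
    perf = attainable_flops(PEAK_FLOPS, PEAK_BANDWIDTH, ai)
    regime = "bandwidth-bound" if ai < ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH) else "compute-bound"
    print(f"AI={ai:6.1f}: {perf / 1e12:6.2f} TFLOPS attainable ({regime})")
```

A kernel with AI of a few FLOPs per byte sits far to the left of the ridge on hardware like this, which is the situation decode is in.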
Prefill vs decode
- Prefill (processing the input prompt): attention is computed over all input tokens at once. AI is high because each weight is reused across many tokens in one matmul. Compute-bound on modern hardware.
- Decode (generating one token at a time): batch dimension is effectively 1. Each weight is read once and used for a single multiply-add, so the matmul's AI is ~2 FLOPs per weight read (about 1 FLOP/byte at FP16, about 4 FLOPs/byte at 4-bit). Memory-bound on essentially all hardware.
Tokens-per-second from bandwidth (the headline formula)
For decode at batch=1, every weight must be read from RAM for every token:
time_per_token ≈ weight_bytes / memory_bandwidth
tokens_per_second ≈ memory_bandwidth / weight_bytes
(KV cache reads add a small term proportional to context; for short contexts it’s a few percent.)
Worked examples
Mac Studio M3 Ultra (800 GB/s):
- Llama 3 70B Q4_K_M (~42.8 GB) → 800 / 42.8 ≈ 18.7 tok/s theoretical. Real-world 10–15 after overhead.
- Llama 3 8B Q4_K_M (~4.9 GB) → 800 / 4.9 ≈ 163 tok/s theoretical. Real-world 80–120.
- Llama 3.1 405B Q4_K_M (~248 GB) → 800 / 248 ≈ 3.2 tok/s theoretical. Real-world ~2.
MacBook Pro M4 Max (546 GB/s):
- Llama 3 70B Q4 → 546 / 42.8 ≈ 12.8 tok/s theoretical.
- Llama 3 8B Q4 → 546 / 4.9 ≈ 111 tok/s theoretical.
MacBook Air M2 (~100 GB/s):
- Llama 3 8B Q4 → 100 / 4.9 ≈ 20 tok/s.
This is why the same model runs ~8× faster on M3 Ultra than M2 Air, even though both have “enough” RAM — bandwidth scales independently of capacity. Tim Dettmers has written the same analysis for GPUs (an H100 at ~3.35 TB/s would do 70B-Q4 at ~78 tok/s in a perfect world).
A working rule of thumb
For any decode-bound workload at batch=1:
tok/s ≈ (memory_bandwidth_GBps × 0.7) / weight_bytes_GB
The 0.7 factor accounts for real-world bandwidth utilization (refresh, contention, controller overhead). MLX-tuned kernels approach this; llama.cpp is typically 10–20% below it.
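The rule of thumb and the worked examples above reduce to one function. A minimal sketch, assuming batch=1 decode and ignoring KV cache reads; `decode_tok_per_s` is a hypothetical name and the bandwidth figures are the ones quoted in the worked examples.

```python
def decode_tok_per_s(bandwidth_gbps: float, weight_gb: float,
                     utilization: float = 0.7) -> float:
    """tok/s ≈ (memory_bandwidth × utilization) / weight_bytes."""
    return bandwidth_gbps * utilization / weight_gb


DEVICES = {"M3 Ultra": 800, "M4 Max": 546, "M2 Air": 100}
MODELS = {"Llama 3 8B Q4_K_M": 4.9, "Llama 3 70B Q4_K_M": 42.8}

for device, bandwidth in DEVICES.items():
    for model, gb in MODELS.items():
        peak = decode_tok_per_s(bandwidth, gb, utilization=1.0)
        planned = decode_tok_per_s(bandwidth, gb)
        print(f"{device:9s} {model:19s} {peak:6.1f} theoretical, {planned:6.1f} planned")
```

The output is a bandwidth ceiling, not a capacity check: the 70B rows are printed for every device even where the weights would not fit in RAM.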
6. Quantization quality vs memory — the quality cliff
Empirically, perplexity vs FP16 baseline (compiled from multiple Llama-family evaluations including the survey arXiv:2601.14277, “Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct”):
| Quant | Δ perplexity (vs FP16) | Practical quality | Use case |
|---|---|---|---|
| Q8_0 | ≈ 0 | indistinguishable from FP16 | when you have RAM and want zero risk |
| Q6_K | < +0.01 | indistinguishable on benchmarks | high-quality default |
| Q5_K_M | +0.02 to +0.04 | near-imperceptible | sweet spot if memory tight |
| Q4_K_M | +0.05 to +0.10 | minor degradation; mostly invisible in chat | community default |
| Q4_K_S | +0.07 to +0.15 | mild degradation, noticeable on coding/math | when 0.5 GB matters |
| Q3_K_M | +0.2 to +0.4 | visible on hard tasks | desperate |
| Q2_K | +0.5 to +1.5 | clearly worse | last resort to fit at all |
Community wisdom (consistent across kaitchup.substack.com, willitrunai.com, llama.cpp discussions, Maxime Labonne’s writeups):
- Q6_K and Q5_K_M are indistinguishable from FP16 on most benchmarks.
- Q4_K_M is the sweet spot for most local users.
- Below Q4 degradation becomes noticeable — especially on reasoning and code.
- Q2 is for desperate situations (running a 70B on 32 GB).
- For reasoning / coding models specifically, treat Q5_K_M as the practical floor.
Why calibration-data methods preserve quality at 4 bits
Naive round-to-nearest treats every weight equally. AWQ and GPTQ exploit the observation that LLM weights have outliers — a small fraction of channels (AWQ) or specific weights (GPTQ) have disproportionate output impact. Scaling salient channels before quantization (AWQ) or rounding with second-order error compensation (GPTQ) keeps far more meaningful information in the same 4 bits.
LLM.int8() (Dettmers et al., 2022) was the original observation: in models past 6.7B parameters, ~0.1% of activation dimensions are “emergent outliers” with magnitudes 20× the rest. Keeping them in FP16 while quantizing the rest to INT8 gives lossless 8-bit inference.
7. Speculative decoding, MoE, and other tricks
Speculative decoding
Independently introduced by Leviathan, Kalman, Matias (Google, Nov 2022, arXiv:2211.17192) and Chen et al. (DeepMind, Feb 2023, arXiv:2302.01318):
A small draft model (e.g., Llama 3.2 1B) autoregressively proposes K candidate tokens. The large target model (e.g., Llama 3 70B) verifies all K in parallel in a single forward pass — same matmuls, K query tokens instead of 1. The correct prefix is accepted; the first wrong token is resampled per the target’s distribution. Output distribution is provably identical to running the target alone.
Speed-up depends on draft quality (acceptance 50–80% typical) and the speed ratio. 2–3× decode throughput is realistic.
Memory cost: both models loaded simultaneously. 70B target + 1B draft at Q4 ≈ 42.8 + 0.6 = 43.4 GB — almost free.
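The dependence of speed-up on acceptance rate and draft cost can be made concrete with the expected-value formulas from Leviathan et al. The sketch assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that `cost_ratio` is the draft model's per-token time relative to the target's; the function names are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target forward pass."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Wall-clock speed-up once the draft model's own cost is charged."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


# For bandwidth-bound decode, cost_ratio is roughly draft_bytes / target_bytes.
cost_ratio = 0.6 / 42.8
for alpha in (0.5, 0.7, 0.8):
    print(f"acceptance {alpha:.0%}: "
          f"{expected_speedup(alpha, gamma=4, cost_ratio=cost_ratio):.2f}x")
```

This model treats verifying `gamma` tokens as costing the same as one decode step, which holds only while the target stays bandwidth-bound.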
Mixture of Experts (MoE)
MoE replaces the dense MLP in each transformer block with N “expert” MLPs and a small router. For each token, the router selects k experts (k=2 typical), so only k/N of the MLP weights are active per token.
Critical fact: the entire model must still be in memory — you don’t know which experts each token will route to.
Mixtral 8x7B (Jiang et al., arXiv:2401.04088):
- 8 experts/layer, top-2 routing.
- Total params: 46.7 B; active per token: 12.9 B.
- Memory at FP16: ~93 GB (all experts loaded).
- Memory at Q4_K_M: ~26–29 GB (28.5 GB by the 4.89 bpw formula used in section 8).
- Decode speed: bandwidth-bound on active params (~13B effective) → as fast as a dense 13B.
- Quality: matches or beats the dense Llama 2 70B on most benchmarks.
Mixtral 8x22B: 141B total / 39B active, ~80 GB at Q4_K_M, decodes like 39B dense.
DeepSeek-V3 / R1: 671B total / 37B active, ~410 GB at Q4_K_M (671 × 4.89 / 8), which is why Apple positions M3 Ultra 512 GB as the “DeepSeek R1 desktop”. Decode ~3–5 tok/s on M3 Ultra makes it usable but slow.
The memory math: weight_bytes = total_params × bpw / 8 (not active params). Speed math: tok/s ≈ bandwidth / (active_params × bpw / 8 + small_overhead).
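The split between memory math and speed math can be sketched as two functions that take different parameter counts. The names are hypothetical, the small overhead term is dropped, and the speed figure is the bandwidth ceiling rather than a real-world prediction.

```python
def moe_weight_gb(total_params: float, bpw: float) -> float:
    """Memory is set by total parameters: every expert must be resident."""
    return total_params * bpw / 8 / 1e9


def moe_decode_tok_per_s(active_params: float, bpw: float,
                         bandwidth_gbps: float, utilization: float = 1.0) -> float:
    """Speed is set by active parameters: only routed experts are read per token."""
    active_gb = active_params * bpw / 8 / 1e9
    return bandwidth_gbps * utilization / active_gb


# Mixtral 8x7B at Q4_K_M (4.89 bpw) on an 800 GB/s device.
print(f"weights: {moe_weight_gb(46.7e9, 4.89):.1f} GB")
print(f"decode ceiling: {moe_decode_tok_per_s(12.9e9, 4.89, 800):.0f} tok/s")
```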
Continuous batching, PagedAttention, FlashAttention-2/3
Server-side techniques less relevant for batch=1 local inference, but useful to know:
- PagedAttention (Kwon et al., SOSP 2023, arXiv:2309.06180) — the vLLM paper. KV cache stored in fixed-size pages (analogous to OS virtual memory), eliminating fragmentation. Without it, naive contiguous KV allocation wastes 60–80% in multi-user serving. With it, waste drops to <4% and throughput improves 2–3×.
- Continuous batching — swap completed sequences with new ones at every step rather than waiting for the batch. Doubles or triples server throughput.
- FlashAttention-2 (Dao 2023) — better parallelism, ~2× speed-up over v1.
- FlashAttention-3 (Shah et al. 2024) — Hopper-specific async + FP8. Achieves 75% of H100’s theoretical FLOPS.
For local batch=1 inference, you get these benefits via llama.cpp’s -fa (flash attention) flag, MLX’s fused attention kernels, or PyTorch SDPA backends.
8. Worked examples — full math
Conventions:
- KV cache at FP16 (the default; quantize to ~q8 for half, ~q4 for quarter).
- Activation overhead: 1 GB (rough).
- Framework overhead: 2 GB.
- Total = weights + KV_cache + 3 GB other.
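Under these conventions every table cell in this section is the same three-term sum. A minimal sketch for regenerating a row; `total_gb` is a hypothetical helper, and like the tables it adds decimal-GB weights to binary-GiB KV cache without converting.

```python
def total_gb(params_billions: float, bpw: float, kv_kib_per_token: float,
             context_tokens: int, other_gb: float = 3.0) -> float:
    """weights + FP16 KV cache + 3 GB (activations and framework overhead)."""
    weights = params_billions * bpw / 8
    kv = kv_kib_per_token * context_tokens / 2**20   # KiB → GiB
    return weights + kv + other_gb


# Llama 3.1 8B (128 KiB per token) at Q4_K_M across the three context columns.
for k in (4, 32, 128):
    print(f"+{k}K KV: {total_gb(8, 4.89, 128, k * 1024):.1f} GB")
```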
Qwen 2.5 7B
28 layers, 28 query heads, 4 KV heads, head_dim 128 → 56 KB per token.
| Format | bpw | Weights | +4K KV | +32K KV | +128K KV |
|---|---|---|---|---|---|
| FP16 | 16 | 14.0 GB | 0.22 → 17.2 GB | 1.75 → 18.8 GB | 7.0 → 24.0 GB |
| Q8_0 | 8.5 | 7.4 GB | 10.6 GB | 12.2 GB | 17.4 GB |
| Q6_K | 6.56 | 5.7 GB | 8.9 GB | 10.5 GB | 15.7 GB |
| Q4_K_M | 4.89 | 4.28 GB | 7.5 GB | 9.0 GB | 14.3 GB |
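Runs comfortably at 32K on 16 GB Macs at Q4 (~9 GB). 128K totals ~14 GB, which a 16 GB Mac cannot spare once the OS takes its share; use a 24 GB Mac or quantize the KV cache.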
Llama 3.1 8B
32 layers, 32 query heads, 8 KV heads, head_dim 128 → 128 KB per token.
| Format | bpw | Weights | +4K KV | +32K KV | +128K KV |
|---|---|---|---|---|---|
| FP16 | 16 | 16.0 GB | 0.5 → 19.5 GB | 4.0 → 23.0 GB | 16.0 → 35.0 GB |
| Q8_0 | 8.5 | 8.5 GB | 12.0 GB | 15.5 GB | 27.5 GB |
| Q6_K | 6.56 | 6.56 GB | 10.1 GB | 13.6 GB | 25.6 GB |
| Q4_K_M | 4.89 | 4.89 GB | 8.4 GB | 11.9 GB | 23.9 GB |
Q4 + 128K totals ~24 GB with an FP16 KV cache, which is tight on a 32 GB Mac. Quantizing the cache to q8_0 halves it to 8 GB and brings the total to ~16 GB.
Llama 3.3 70B
80 layers, 64 query heads, 8 KV heads, head_dim 128 → 320 KB per token.
| Format | bpw | Weights | +4K KV | +32K KV | +128K KV |
|---|---|---|---|---|---|
| FP16 | 16 | 140 GB | 1.25 → 144.3 GB | 10 → 153 GB | 40 → 183 GB |
| Q8_0 | 8.5 | 74.4 GB | 78.6 GB | 87.4 GB | 117 GB |
| Q6_K | 6.56 | 57.4 GB | 61.6 GB | 70.4 GB | 100 GB |
| Q4_K_M | 4.89 | 42.8 GB | 47.0 GB | 55.8 GB | 85.8 GB |
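Q4_K_M at 4K (~47 GB) fits on a 64 GB Mac, just under the default ~48 GB GPU limit. At 128K context you need 96+ GB or aggressive KV quantization: q4 KV is 10 GB instead of 40 GB, for ~56 GB total, which fits a 64 GB Mac only with iogpu.wired_limit_mb raised and little else running.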
Mixtral 8x7B
32 layers, 32 query heads, 8 KV heads, head_dim 128. 46.7 B total (this is what counts for memory).
| Format | bpw | Weights | +4K KV | +32K KV | +128K KV |
|---|---|---|---|---|---|
| FP16 | 16 | 93.4 GB | 0.5 → 96.9 GB | 4.0 → 100.4 GB | 16.0 → 112.4 GB |
| Q8_0 | 8.5 | 49.6 GB | 53.1 GB | 56.6 GB | 68.6 GB |
| Q6_K | 6.56 | 38.3 GB | 41.8 GB | 45.3 GB | 57.3 GB |
| Q4_K_M | 4.89 | 28.5 GB | 32.0 GB | 35.5 GB | 47.5 GB |
Decodes at the speed of a ~13B dense model (12.9 B active per token), despite needing 28+ GB. The MoE deal: slow to load, fast to run.
Llama 3.1 405B (only feasible on Mac Studio Ultra)
126 layers, 128 query heads, 8 KV heads, head_dim 128 → 504 KB per token.
| Format | bpw | Weights | +4K KV | +32K KV | +128K KV |
|---|---|---|---|---|---|
| FP16 | 16 | 810 GB | 2 → 815 GB | 16 → 829 GB | 63 → 876 GB |
| Q8_0 | 8.5 | 430 GB | 435 GB | 449 GB | 496 GB |
| Q4_K_M | 4.89 | 248 GB | 253 GB | 267 GB | 314 GB |
Even Q4 + 4K needs 253 GB. Mac Studio M3 Ultra 512 GB is the only consumer device that runs it. Q4 + 32K (~267 GB) is comfortable. 128K context (~314 GB with FP16 KV) still fits in 512 GB; quantizing KV to ~q4 (16 GB instead of 63 GB) restores headroom. Expect ~2 tok/s decode (bandwidth-bound: 800 / 248 ≈ 3.2 theoretical → 1.5–2 real).
Qwen 2.5 32B
64 layers, 40 query heads, 8 KV heads, head_dim 128 → 256 KB per token.
| Format | bpw | Weights | +4K KV | +32K KV | +128K KV |
|---|---|---|---|---|---|
| FP16 | 16 | 65 GB | 1 → 69 GB | 8 → 76 GB | 32 → 100 GB |
| Q8_0 | 8.5 | 34.5 GB | 38.5 GB | 45.5 GB | 69.5 GB |
| Q6_K | 6.56 | 26.6 GB | 30.6 GB | 37.6 GB | 61.6 GB |
| Q4_K_M | 4.89 | 19.9 GB | 23.9 GB | 30.9 GB | 54.9 GB |
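Q4_K_M at 32K totals ~31 GB, too tight for a 32 GB Mac unless the KV cache is quantized (q4 KV cuts it from 8 GB to ~2 GB); a 48 GB Mac runs it comfortably. Q6_K (near-FP16) at 32K needs ~38 GB, so 48 GB is the minimum. Full 128K on Q4 totals ~55 GB: a 64 GB Mac handles it only with KV quantization or a raised GPU limit.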
Specific learnings for Locara
- The manifest schema should accept (weights_bytes, kv_per_token_bytes, max_context) as primitives, not “min RAM.” This lets the runtime compute the real RAM requirement for the user’s actual context cap, not a single global number. An app that supports 128K context but defaults to 8K can run on a much smaller Mac if the user keeps context short.
- Bandwidth is the right primary device-class metric alongside RAM. A 64 GB M3 Max-14C (300 GB/s) and a 64 GB M3 Max-16C (400 GB/s) have meaningfully different LLM performance for the same model, so the manifest needs to know both. See mac-hardware-lineup.md for the per-SKU lookup table.
- Default to Q4_K_M for capacity-bound users; Q6_K for quality-bound users. Don’t try to be smarter than the community consensus. The runtime’s model catalog should ship both for any model where size allows.
- KV cache quantization should be on by default past 8K context. The math is unambiguous: at 32K context for a 70B model, FP16 KV cache costs 10 GB you almost certainly want for something else. Default to q8_0 K / q8_0 V and surface “use full precision KV” as a quality-app opt-in.
- Publish a “Llama 3 8B Q4_K_M expected ~N tok/s on your Mac” estimate at install time. Compute it from bandwidth × 0.7 / weight_bytes. Honest expected performance is the baseline of trust: users who expect 100 tok/s and get 20 churn; users who are told 25 tok/s and get 25 are happy.
- MoE models are a special manifest case. Memory = total_params; speed = active_params. Without that distinction the runtime will either size Mixtral by its 12.9 B active parameters and approve it on a Mac that cannot hold its 28+ GB of weights, or time it by its 46.7 B total and predict speeds far below what it delivers. The manifest schema needs total_params and active_params as separate fields for MoE.
- Speculative decoding requires both models manifested together. The pair (target, draft) is the deployment unit, not just the target. Runtime should compute combined memory cost.
- Track Apple’s official bandwidth as the upper bound and plan for 0.7× utilization. Real LLM decode hits 70–85% of theoretical on tuned kernels (MLX) and 60–75% on llama.cpp. Bandwidth × utilization / weight_bytes is the honest tok/s estimate.
- Reject 8 GB Mac configurations for any model >3B. The OS + browser baseline alone eats most of the 8 GB budget. A 7B model “technically fits” but will thrash. Manifest minimum should be 16 GB for most useful local models.
- The KV cache formula is the single most important piece of math Locara’s runtime needs internalized. Get the architecture-specific (num_layers, num_kv_heads, head_dim) numbers from each model’s config.json on HF, store them in the runtime’s model registry, and recompute at every install-time fit check.
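Pulling the KV parameters out of a Hugging Face config.json might look like the sketch below. The key names (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`, optional `head_dim`) are the ones used by Llama- and Qwen-style configs; other architectures may name them differently, and the function name and sample dict are assumptions for illustration.

```python
import json


def kv_per_token_bytes(config: dict, kv_dtype_bytes: int = 2) -> int:
    """Derive KV cache bytes per token from a Llama/Qwen-style config dict."""
    layers = config["num_hidden_layers"]
    q_heads = config["num_attention_heads"]
    # Configs without num_key_value_heads are plain MHA: one KV head per query head.
    kv_heads = config.get("num_key_value_heads", q_heads)
    # head_dim is explicit in some configs; otherwise derive it from hidden_size.
    head_dim = config.get("head_dim") or config["hidden_size"] // q_heads
    return 2 * layers * kv_heads * head_dim * kv_dtype_bytes


# Abbreviated stand-in for a downloaded config.json (Qwen 2.5 7B values).
raw = """{"hidden_size": 3584, "num_attention_heads": 28,
          "num_key_value_heads": 4, "num_hidden_layers": 28}"""
config = json.loads(raw)

per_token = kv_per_token_bytes(config)
print(f"{per_token} B per token, {per_token * 32 * 1024 / 2**30:.2f} GiB at 32K")
```

Storing the three architecture numbers rather than the derived byte count keeps the registry valid if the runtime later changes its default KV dtype.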
References
Foundational quantization and memory papers
- Williams, Waterman, Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures, CACM April 2009. https://cacm.acm.org/magazines/2009/4/22959-roofline-an-insightful-visual-performance-model-for-multicore-architectures/fulltext
- Dettmers et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, NeurIPS 2022. https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
- Dettmers et al. QLoRA, 2023. https://arxiv.org/abs/2305.14314
- Frantar et al. GPTQ, ICLR 2023. https://arxiv.org/abs/2210.17323
- Lin et al. AWQ, 2023. https://arxiv.org/abs/2306.00978
- Xiao et al. SmoothQuant, 2022. https://arxiv.org/abs/2211.10438
Attention and serving
- Dao et al. FlashAttention, NeurIPS 2022. https://arxiv.org/abs/2205.14135
- Dao. FlashAttention-2, 2023. https://arxiv.org/abs/2307.08691
- Shah et al. FlashAttention-3, 2024. https://tridao.me/blog/2024/flash3/
- Kwon et al. PagedAttention / vLLM, SOSP 2023. https://arxiv.org/abs/2309.06180
- Ainslie et al. GQA, 2023. https://arxiv.org/abs/2305.13245
- Leviathan, Kalman, Matias. Speculative Decoding, 2022. https://arxiv.org/abs/2211.17192
- Chen et al. (DeepMind). Accelerating LLM Decoding with Speculative Sampling, 2023. https://arxiv.org/abs/2302.01318
- Xiao et al. StreamingLLM, 2023. https://arxiv.org/abs/2309.17453
Model architecture papers
- Grattafiori et al. The Llama 3 Herd of Models, 2024. https://arxiv.org/abs/2407.21783
- Jiang et al. Mixtral of Experts, 2024. https://arxiv.org/abs/2401.04088
- DeepSeek-AI. DeepSeek-V3 Technical Report, 2024. https://arxiv.org/abs/2412.19437
Practical references
- Andrej Karpathy. nanoGPT, llm.c, State of GPT (Microsoft Build 2023). https://github.com/karpathy/nanoGPT, https://github.com/karpathy/llm.c
- Tim Dettmers. Practical hardware reasoning. https://timdettmers.com
- Horace He. Making Deep Learning Go Brrrr From First Principles. https://horace.io/brrr_intro.html
- EleutherAI (Anthony, Biderman, Schoelkopf). Transformer Math 101. https://blog.eleuther.ai/transformer-math/
- Awni Hannun (MLX). https://awnihannun.com, @awnihannun
- Simon Willison. https://simonwillison.net/tags/local-llms/
- Daniel Han / Unsloth. Fine-tuning memory math.
- Maxime Labonne. Quantization writeups.
- llama.cpp quantize README. https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md
- llama.cpp Tensor Encoding Schemes wiki. https://github.com/ggml-org/llama.cpp/wiki/Tensor-Encoding-Schemes
- llama.cpp server README (KV cache flags). https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
- llama.cpp discussion #5932 (KV cache quantization). https://github.com/ggml-org/llama.cpp/discussions/5932
- MLX quantization docs. https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.quantize.html
- Hugging Face Qwen 2.5-7B-Instruct config. https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json
- Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct. https://arxiv.org/html/2601.14277v1
- Sam McLeod. Bringing K/V Context Quantisation to Ollama. https://smcleod.net/2024/12/bringing-k-v-context-quantisation-to-ollama/
- Hugging Face. 4-bit Transformers with bitsandbytes. https://huggingface.co/blog/4bit-transformers-bitsandbytes
- apxml.com model spec pages (Llama 3.3 70B, Llama 3.1 405B, Qwen 2.5 32B, Mixtral 8x7B)