LLM Memory Math — Parameters, KV Cache, Bandwidth, and What Actually Fits

What this is: A first-principles reference for translating “this model has X billion parameters at context length C” into hard numbers: GB of memory to load, GB to run, tokens/sec expected on a given device. Covers weights, quantization, KV cache, activations, the bandwidth-vs-FLOPS regime, MoE, speculative decoding, and worked examples for the common open-weights models.

Why it matters: Every Locara manifest claim (“this app needs 12 GB”) and every Locara device card (“M4 Max 128 GB runs Llama 3 70B Q4 at ~9 tok/s”) has to be defensible from formulas, not vibes. App authors and users need to know why a model fits or doesn’t, not just that it does. This is the math that lets the runtime make honest predictions.

Most relevant to Locara: Pairs with chip-fundamentals.md (the bandwidth-vs-FLOPS story at the hardware level), mac-hardware-lineup.md (the per-device bandwidth and RAM numbers to plug in), and mac-llm-optimization.md (the practical playbook for keeping these numbers small in production).

1. Parameters → bytes (weight memory)

The base formula

weight_bytes = num_params × bytes_per_param

This is the lower bound on RAM to load the model — before any inference runs, before any context is processed.

Bytes per parameter, by dtype

dtype	bits	bytes/param	typical use
FP32	32	4	original training checkpoints
FP16 / BF16	16	2	“full precision” inference and training
FP8 (E4M3 / E5M2)	8	1	H100+ training/inference
INT8	8	1	bitsandbytes 8-bit, GPTQ-8
INT4 (Q4)	4	≈0.5 + overhead	bitsandbytes NF4, AWQ, GPTQ-4, GGUF Q4_*
INT3 / INT2	3 / 2	≈0.375 / 0.25 + overhead	aggressive compression

Worked examples, full precision:

Llama 3 8B at FP16 → 8 × 10⁹ × 2 = 16 GB
Llama 3 70B at FP16 → 70 × 10⁹ × 2 = 140 GB
Llama 3.1 405B at FP16 → 405 × 10⁹ × 2 = 810 GB

(File-on-disk size is slightly larger — usually <1% — for tokenizer, config, tensor metadata.)

Why “4 bits” is never exactly 4 bits

Every practical low-bit format stores weights in blocks, sharing a scale (and sometimes zero-point / min) across the block. That block metadata is the overhead. The honest unit is bits-per-weight (bpw).

Generic per-block affine quantization:

weight_real ≈ scale × weight_int + zero_point        (asymmetric)
weight_real ≈ scale × weight_int                     (symmetric)
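The two affine forms above can be sketched as a per-block round trip. This is a minimal pure-Python illustration, not the packing used by any particular format: the function names, the signed range for the symmetric case, and the choice of the block minimum as the zero-point are assumptions made for the example.

```python
def quantize_block(weights, bits=4, symmetric=True):
    """Quantize one block of floats; return (ints, scale, zero_point)."""
    if symmetric:
        qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit
        max_abs = max(abs(w) for w in weights)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        ints = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
        return ints, scale, 0.0
    levels = 2 ** bits - 1                      # 15 for unsigned 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    ints = [max(0, min(levels, round((w - lo) / scale))) for w in weights]
    return ints, scale, lo                      # zero_point = block minimum (assumption)


def dequantize_block(ints, scale, zero_point):
    """weight_real ≈ scale × weight_int + zero_point."""
    return [scale * q + zero_point for q in ints]


if __name__ == "__main__":
    block = [0.12, -0.40, 0.33, 0.05, -0.07, 0.29, -0.18, 0.01]
    for symmetric in (True, False):
        ints, scale, zp = quantize_block(block, bits=4, symmetric=symmetric)
        restored = dequantize_block(ints, scale, zp)
        worst = max(abs(a - b) for a, b in zip(block, restored))
        print(f"symmetric={symmetric} scale={scale:.4f} worst_error={worst:.4f}")
```

The worst-case error per weight is half a quantization step (scale / 2), which is why smaller blocks, each with a tighter scale, improve quality at the cost of more metadata.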

GGUF (llama.cpp) bits-per-weight

llama.cpp’s K-quants pack weights into “superblocks” of 256 weights, divided into sub-blocks of 16 or 32. Each sub-block has its own 6- or 8-bit scale plus a super-block-level FP16 scale. Per llama.cpp/tools/quantize/README.md and the Tensor-Encoding-Schemes wiki:

Quant	Theoretical bpw	Measured bpw on Llama 3.1 8B	Notes
Q2_K	2.56	~2.97	very aggressive
Q3_K_S	3.44	~3.16	small
Q3_K_M	~3.70	~3.64	medium
Q3_K_L	~3.90	~4.00	large
Q4_0	4.50 (legacy)	—	one fp16 scale per 32 weights
Q4_1	5.00 (legacy)	—	scale + min per 32 weights
Q4_K_S	~4.50	~4.67	k-quant 4-bit small
Q4_K_M	~4.50	~4.89	community sweet spot
Q5_K_S	~5.50	~5.57
Q5_K_M	~5.50	~5.70	near-lossless
Q6_K	~6.56	~6.56	indistinguishable from FP16 on benchmarks
Q8_0	~8.50	~8.50	per-block fp16 scale + 8-bit ints
FP16	16	16	baseline

Measured > theoretical because token embeddings and lm_head output projection are usually kept at higher precision (FP16 or Q8_0) for quality, dragging the average up. Q4_K_M is itself mixed under the hood — it uses Q6_K for some tensors.

Memory math: weight_bytes ≈ num_params × bpw / 8.

Llama 3 8B at Q4_K_M: 8 × 10⁹ × 4.89 / 8 ≈ 4.89 GB
Llama 3 70B at Q4_K_M: 70 × 10⁹ × 4.89 / 8 ≈ 42.8 GB
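The same arithmetic as a small helper, so any (parameter count, bpw) pair can be checked quickly. The function names are illustrative; GB here means 10⁹ bytes, matching the worked examples.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Bytes needed to hold the weights: num_params × bpw / 8."""
    return num_params * bits_per_weight / 8


def to_gb(num_bytes: float) -> float:
    """Decimal gigabytes (10^9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    models = [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]
    formats = [("FP16", 16.0), ("Q4_K_M", 4.89)]
    for name, params in models:
        for fmt, bpw in formats:
            print(f"{name} at {fmt}: {to_gb(weight_bytes(params, bpw)):.2f} GB")
```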

MLX bpw

MLX packs weights into uint32 with separate scales/biases side-cars. Default is 4-bit with group size 64 → ~4.5 bpw effective (4 + 16/64 scale + 16/64 bias).

MLX preset	bits	group	effective bpw
q2	2	64	~2.5
q3	3	64	~3.5
q4 (default)	4	64	~4.5
q6	6	64	~6.5
q8	8	64	~8.5

MLX also supports mixed-precision presets where attention stays higher-bit while MLP drops, giving averages of ~2.2–6.2 bpw.

“Smart” quants — AWQ, GPTQ, EXL2

These use calibration data (a few hundred sequences from C4 / WikiText) to identify salient weights and protect them during quantization:

GPTQ (Frantar et al., 2022, arXiv:2210.17323) uses approximate second-order Hessian information to choose quantization rounding that minimizes per-layer reconstruction error. Achieves 3–4-bit weights with minimal accuracy loss on OPT-175B / BLOOM-176B in hours.
AWQ (Lin et al., 2023, arXiv:2306.00978) observes that ~1% of weight channels, those aligned with the largest activation magnitudes, dominate error. Scaling those channels up before quantization (and compensating with the inverse scale on the activations) recovers most of the benefit of keeping them in higher precision while retaining a uniform 4-bit format.
SmoothQuant (Xiao et al., 2022) shifts quantization difficulty from activations to weights via channel-wise scaling, enabling INT8 weight + activation quantization.
EXL2 (ExLlamaV2) extends GPTQ with mixed-bitwidth allocation per layer based on a measurement pass — non-integer average bpw (e.g. 3.5, 4.65, 5.2).

Memory math is the same — bpw is what matters.

Per-tensor / per-channel / group-wise scales

Per-tensor: one scale per matrix. Smallest overhead, worst quality.
Per-channel: one scale per output channel (row of W). Standard for INT8 weights.
Group-wise (32 / 64 / 128): one scale per group of consecutive weights. The de facto standard for ≤4-bit. Smaller group → more overhead but higher quality.

For 4-bit + group-128 + FP16 scales: bpw = 4 + 16/128 = 4.125. With group-32: bpw = 4 + 16/32 = 4.5. Q4_K_M lands near 4.9 because its Q4_K tensors already cost ~4.5 bpw (32-weight sub-blocks with 6-bit scales plus FP16 super-block scales) and some tensors are kept at Q6_K.
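The overhead arithmetic generalizes to any group size and metadata width. The sketch below covers the plain group-wise case and, as a second function, a Q4_K-style super-block. The Q4_K layout used here (8 sub-blocks of 32 weights, a 6-bit scale and a 6-bit min per sub-block, one FP16 scale and one FP16 min per super-block) is my reading of the llama.cpp format and should be treated as an assumption; the function names are illustrative.

```python
def groupwise_bpw(bits, group_size, scale_bits=16, bias_bits=0):
    """Effective bits per weight for group-wise quantization."""
    return bits + (scale_bits + bias_bits) / group_size


def q4_k_style_bpw():
    """Effective bpw for an assumed Q4_K-style 256-weight super-block."""
    super_block = 256
    sub_blocks = 8                        # 8 sub-blocks of 32 weights
    payload_bits = super_block * 4        # the 4-bit weights themselves
    sub_meta_bits = sub_blocks * (6 + 6)  # 6-bit scale + 6-bit min each
    super_meta_bits = 2 * 16              # FP16 scale + FP16 min
    return (payload_bits + sub_meta_bits + super_meta_bits) / super_block


if __name__ == "__main__":
    print("4-bit, group 128:", groupwise_bpw(4, 128))
    print("4-bit, group 32: ", groupwise_bpw(4, 32))
    print("MLX-style 4-bit, group 64, scale + bias:", groupwise_bpw(4, 64, 16, 16))
    print("Q4_K-style super-block:", q4_k_style_bpw())
```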

2. KV cache memory (runtime, context-dependent)

The KV cache is the runtime cost that scales with context. Every token in the context, prompt and generated alike, must be remembered: its K and V projections per attention layer are cached so subsequent tokens don’t recompute them. At long context the KV cache can exceed the weights themselves.

Formula

kv_cache_bytes = 2 × num_layers × seq_len × num_kv_heads × head_dim × kv_dtype_bytes

(The leading 2 is for K and V. Multiply by batch size for multi-user serving; for local single-user, batch=1.)
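A direct transcription of the formula, useful for sanity-checking any architecture. The function name and the optional batch argument are illustrative; the Llama 3 figures plugged in are the layer, KV-head and head_dim values used in this document.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim,
                   kv_dtype_bytes=2, batch=1):
    """2 (K and V) × layers × tokens × KV heads × head_dim × bytes per element."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * kv_dtype_bytes * batch


if __name__ == "__main__":
    GIB = 2 ** 30
    # Llama 3 8B: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
    per_token = kv_cache_bytes(32, 1, 8, 128)
    print("Llama 3 8B per token:", per_token, "bytes")
    for context in (4096, 32768, 131072):
        size = kv_cache_bytes(32, context, 8, 128)
        print(f"  {context:>6} tokens: {size / GIB:.2f} GiB")
```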

MHA vs MQA vs GQA (why this number isn’t huge)

Vanilla Multi-Head Attention (MHA) has num_kv_heads = num_query_heads. Each query head has its own K and V.

Multi-Query Attention (MQA) (Shazeer, 2019) collapses to one K/V head shared by all queries — divides KV cache by num_query_heads, but hurts quality.

Grouped-Query Attention (GQA) (Ainslie et al., 2023, arXiv:2305.13245) — used by Llama 2 70B and all of Llama 3 — is the compromise: num_kv_heads < num_query_heads but > 1. Each group of query heads shares K/V.

Llama 3 8B: 32 query heads, 8 KV heads → 4× smaller KV cache than MHA, essentially no quality loss.
Llama 3 70B / 405B: 64 / 128 query heads, 8 KV heads → 8× / 16× reduction.

Worked examples (FP16 KV cache, batch=1)

Llama 3 8B — 32 layers, 8 KV heads, head_dim 128:

Per-token: 2 × 32 × 8 × 128 × 2 = 131,072 B = 128 KiB
4K context: 512 MiB
32K: 4.0 GiB
128K: 16.0 GiB

Llama 3 70B — 80 layers, 8 KV heads, head_dim 128:

Per-token: 2 × 80 × 8 × 128 × 2 = 327,680 B = 320 KiB
4K: 1.25 GiB
32K: 10 GiB
128K: 40 GiB

Llama 3.1 405B — 126 layers, 8 KV heads, head_dim 128:

Per-token: 2 × 126 × 8 × 128 × 2 = 516,096 B = 504 KiB
4K: 1.97 GiB
32K: 15.75 GiB
128K: 63 GiB — exceeds typical Mac Studio RAM on KV cache alone

Mixtral 8x7B — 32 layers, 8 KV heads, head_dim 128:

Per-token: 128 KiB
32K: 4 GiB

Qwen 2.5 7B — 28 layers, 4 KV heads, head_dim 128:

Per-token: 2 × 28 × 4 × 128 × 2 = 57,344 B = 56 KiB
32K: 1.75 GiB
128K: 7.0 GiB

Qwen 2.5 32B — 64 layers, 8 KV heads, head_dim 128:

Per-token: 256 KiB
32K: 8 GiB
128K: 32 GiB

KV cache grows linearly with sequence length. At 128K for a 70B model, the KV cache (~40 GB at FP16) is comparable to the Q4 weights (~40 GB). This is why long-context inference is so memory-hungry.

KV cache quantization

llama.cpp supports --cache-type-k and --cache-type-v with options f32, f16 (default), bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1.

Trade-offs from community benchmarks (llama.cpp discussion #5932):

q8_0 K / q8_0 V: near-lossless, ~47% memory reduction (8.5 bpw vs 16). Matched K/V types are required for fused FlashAttention.
q8_0 K / q4_0 V: ~59% reduction at ~6.5 bpw average. Quality usually fine.
q4_0 K / q4_0 V: ~72% reduction (4.5 bpw). Some quality loss on reasoning. Stretches context dramatically.

vLLM supports FP8 KV cache; production deployments increasingly use INT8 or FP8 by default.

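A 70B model at 128K context: FP16 KV cache = 40 GB → q8_0 KV cache ≈ 21 GB → q4_0 KV cache ≈ 11 GB (slightly more than a clean half and quarter because each 32-value block carries an FP16 scale). The savings are huge and the quality cost is small.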
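A sketch of the same estimate with separate K and V cache types. It assumes the cache formats cost the same bits per value as the weight formats of the same name in the GGUF table above (q8_0 at 8.5 bpw, q4_0 at 4.5 bpw), which is why the results come out slightly above a clean half or quarter of the FP16 size. The dictionary and function names are illustrative.

```python
# Assumed bits per cached value, taken from the bpw table for the same format names.
CACHE_BPW = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(num_layers, seq_len, num_kv_heads, head_dim,
                 type_k="f16", type_v="f16"):
    """KV cache size in GiB when K and V may use different cache types."""
    values_per_tensor = num_layers * seq_len * num_kv_heads * head_dim  # K or V alone
    total_bits = values_per_tensor * (CACHE_BPW[type_k] + CACHE_BPW[type_v])
    return total_bits / 8 / 2 ** 30


if __name__ == "__main__":
    # 70B-class model from this document: 80 layers, 8 KV heads, head_dim 128, 128K context.
    for type_k, type_v in [("f16", "f16"), ("q8_0", "q8_0"),
                           ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
        size = kv_cache_gib(80, 131072, 8, 128, type_k, type_v)
        print(f"K={type_k:<5} V={type_v:<5} {size:6.2f} GiB")
```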

Avoiding quadratic growth — sliding window, ring attention, context shifting

KV cache scales linearly with context (memory). Attention computation is quadratic (each new token attends to all previous). Techniques to avoid both:

Sliding window attention (Mistral) — only attend to the last W tokens. KV cache caps at W.
StreamingLLM (Xiao et al., 2023, arXiv:2309.17453) — keep a few “attention sink” tokens at start plus a rolling window. Effectively-infinite streaming, bounded memory.
Ring Attention (Liu et al., 2023) — distribute sequence dimension across devices, communicate KV blocks in a ring. Used to push training context past 1M tokens.
Context shifting (llama.cpp): when context fills, drop oldest tokens and reuse the existing KV cache for the remainder. Avoids reprocessing the prompt every turn.
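The memory effect of the sliding-window and attention-sink ideas can be shown with a toy eviction policy that tracks only which token positions stay cached. This is a sketch of the bookkeeping, not of any runtime's implementation; the class name and parameters are hypothetical, and real systems also have to handle positional encodings for the retained entries.

```python
from collections import deque


class SinkWindowCache:
    """Tracks which positions remain cached: n_sink pinned tokens plus a rolling window."""

    def __init__(self, n_sink, window):
        self.n_sink = n_sink
        self.sinks = []                      # first n_sink positions, never evicted
        self.recent = deque(maxlen=window)   # rolling window of the newest positions

    def append(self, position):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(position)
        else:
            self.recent.append(position)     # deque drops the oldest entry when full

    def kept(self):
        return self.sinks + list(self.recent)


if __name__ == "__main__":
    cache = SinkWindowCache(n_sink=4, window=8)
    for pos in range(100):
        cache.append(pos)
    print("cached positions:", cache.kept())
    print("entries held:", len(cache.kept()), "out of 100 tokens seen")
```

Setting n_sink=0 gives plain sliding-window behavior: the cache never holds more than `window` entries regardless of how many tokens have streamed through.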

3. Activations memory (forward pass, peak)

For batch=1 inference, activations are nearly negligible compared to weights and KV cache. The main components:

The hidden state (hidden_size × seq_len_prefill × dtype_bytes), reused across layers.
The attention Q × Kᵀ score matrix (seq_len × seq_len × num_heads × dtype_bytes) — this is the quadratic term FlashAttention eliminates.
MLP intermediate activations (intermediate_size × seq_len × dtype_bytes).

Without FlashAttention, processing a 32K-token prompt for an 8B model would naively allocate a 32K × 32K × 32 × 2 B = 64 GiB (≈ 69 GB) score matrix per layer. FlashAttention (Dao et al., 2022, arXiv:2205.14135) tiles into SRAM-sized blocks and reduces peak memory to O(N) instead of O(N²), dramatically reducing HBM reads/writes. FlashAttention-2 (Dao 2023, arXiv:2307.08691) improved parallelism; FlashAttention-3 (Shah, Dao 2024, tridao.me/blog/2024/flash3/) added async H100 features and FP8.
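The size of the naive score matrix is easy to compute and makes the quadratic term concrete. The function names are illustrative; the hidden size of 4096 used for the comparison line is the Llama 3 8B value and is not stated elsewhere in this document.

```python
def score_matrix_bytes(seq_len, num_heads, dtype_bytes=2):
    """Naive Q × Kᵀ score matrix for one layer: seq_len² entries per head."""
    return seq_len * seq_len * num_heads * dtype_bytes


def hidden_state_bytes(seq_len, hidden_size, dtype_bytes=2):
    """Hidden state for the whole prompt: linear in seq_len."""
    return seq_len * hidden_size * dtype_bytes


if __name__ == "__main__":
    GIB = 2 ** 30
    for seq_len in (4096, 32768):
        scores = score_matrix_bytes(seq_len, num_heads=32)
        hidden = hidden_state_bytes(seq_len, hidden_size=4096)
        print(f"{seq_len:>6} tokens: scores {scores / GIB:7.2f} GiB, "
              f"hidden state {hidden / GIB:5.3f} GiB")
```

Going from 4K to 32K multiplies the hidden state by 8 and the score matrix by 64, which is the gap that tiling removes from peak memory.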

For training / fine-tuning, activations dominate — every layer’s forward activations must be kept for backward. Activation checkpointing trades recompute for memory by saving a subset and recomputing the rest. Standard formula (EleutherAI’s “Transformer Math 101”, blog.eleuther.ai/transformer-math/):

training_memory ≈ weights × (1 [params] + 1 [grads] + 2 [Adam m,v]) × dtype + activations
                ≈ 16–20 bytes/param at mixed precision (FP32 master weights and Adam moments included)

For inference at batch=1, treat activations as a small constant ~0.5–2 GB.

4. The “what fits?” formula

total_memory ≈ weight_bytes
             + kv_cache_bytes_at_max_context
             + activation_overhead          (~0.5–2 GB)
             + framework_overhead           (~1–3 GB; llama.cpp lower, Python/HF higher)
             + OS + other apps              (10–20 GB on macOS realistically)

Community heuristics (from r/LocalLLaMA wiki, Hugging Face accelerate docs):

Leave 20–30% of system RAM for OS and other apps. On a 64 GB Mac, ~48 GB usable for the model stack.
Quick estimate: usable_RAM ≈ system_RAM × 0.75, then model_budget = usable_RAM − kv_cache_at_max_context.
Apple Silicon’s unified memory means the GPU can address up to ~75% of total RAM by default (macOS reserves the rest), tunable via sudo sysctl iogpu.wired_limit_mb.

Decision rule: pick the largest model whose weight_bytes + kv_cache_bytes_at_your_max_context < 0.7 × system_RAM.
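The decision rule can be applied mechanically to a candidate list. In the sketch below the candidates are (name, weight GB, KV GB at 4K context) tuples taken from the worked tables later in this document; the function name and the tuple layout are illustrative.

```python
def pick_largest(candidates, system_ram_gb, budget_fraction=0.7):
    """Return the candidate with the most weight bytes that satisfies
    weight_gb + kv_gb < budget_fraction × system_ram_gb, or None."""
    budget = budget_fraction * system_ram_gb
    fitting = [c for c in candidates if c[1] + c[2] < budget]
    return max(fitting, key=lambda c: c[1], default=None)


# (name, weight_gb, kv_gb at 4K context), figures from the worked examples.
CANDIDATES = [
    ("Qwen 2.5 7B Q4_K_M", 4.28, 0.22),
    ("Llama 3.1 8B Q4_K_M", 4.89, 0.5),
    ("Qwen 2.5 32B Q4_K_M", 19.9, 1.0),
    ("Llama 3.3 70B Q4_K_M", 42.8, 1.25),
]

if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        choice = pick_largest(CANDIDATES, ram)
        print(f"{ram:>3} GB Mac ->", choice[0] if choice else "nothing fits")
```

Raising the context length only changes the KV column, so the same function can be rerun per context cap rather than publishing a single global minimum.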

5. Memory bandwidth vs FLOPS — why bandwidth is the headline number for decode

The Roofline model

The Roofline model (Williams, Waterman, Patterson, CACM April 2009) bounds performance by either compute (peak FLOPS) or memory bandwidth, depending on a kernel’s arithmetic intensity (AI):

AI = FLOPs performed / bytes moved from memory
attainable_perf = min(peak_FLOPS, peak_bandwidth × AI)

Below a critical AI (“ridge point”), bandwidth dominates. Above it, compute dominates. This is Horace He’s central point in “Making Deep Learning Go Brrrr From First Principles” (https://horace.io/brrr_intro.html): modern GPUs grew FLOPS faster than bandwidth, so most non-matmul operations are bandwidth-bound and operator fusion is critical.
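The two Roofline expressions translate directly into code. The device figures in the example (30 TFLOPS, 800 GB/s) are round illustrative numbers chosen for the sketch, not the specification of any particular chip; the function names are likewise illustrative.

```python
def attainable_flops(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Roofline bound: min(peak_FLOPS, peak_bandwidth × AI)."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOPs/byte) where the two bounds meet."""
    return peak_flops / peak_bandwidth


if __name__ == "__main__":
    peak_flops = 30e12       # illustrative: 30 TFLOPS
    peak_bandwidth = 800e9   # illustrative: 800 GB/s
    print("ridge point:", ridge_point(peak_flops, peak_bandwidth), "FLOPs/byte")
    for ai in (1, 4, 37.5, 100):
        perf = attainable_flops(peak_flops, peak_bandwidth, ai)
        regime = "bandwidth-bound" if ai < ridge_point(peak_flops, peak_bandwidth) else "compute-bound"
        print(f"AI={ai:>5}: {perf / 1e12:5.1f} TFLOPS ({regime})")
```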

Prefill vs decode

Prefill (processing the input prompt): attention is computed over all input tokens at once. AI is high because each weight is reused across many tokens in one matmul. Compute-bound on modern hardware.
Decode (generating one token at a time): batch dimension is effectively 1. Each weight is used for exactly one multiply-add (2 FLOPs) per token, so the AI of the matmul is ~1 FLOP per byte at FP16 and still only ~4 FLOPs per byte at 4-bit. Memory-bound on essentially all hardware.

Tokens-per-second from bandwidth (the headline formula)

For decode at batch=1, every weight must be read from RAM for every token:

time_per_token ≈ weight_bytes / memory_bandwidth
tokens_per_second ≈ memory_bandwidth / weight_bytes

(KV cache reads add a small term proportional to context; for short contexts it’s a few percent.)

Worked examples

Mac Studio M3 Ultra (800 GB/s):

Llama 3 70B Q4_K_M (~42.8 GB) → 800 / 42.8 ≈ 18.7 tok/s theoretical. Real-world 10–15 after overhead.
Llama 3 8B Q4_K_M (~4.9 GB) → 800 / 4.9 ≈ 163 tok/s theoretical. Real-world 80–120.
Llama 3.1 405B Q4_K_M (~248 GB) → 800 / 248 ≈ 3.2 tok/s theoretical. Real-world ~2.

MacBook Pro M4 Max (546 GB/s):

Llama 3 70B Q4 → 546 / 42.8 ≈ 12.8 tok/s theoretical.
Llama 3 8B Q4 → 546 / 4.9 ≈ 111 tok/s theoretical.

MacBook Air M2 (~100 GB/s):

Llama 3 8B Q4 → 100 / 4.9 ≈ 20 tok/s.

This is why the same model runs ~8× faster on M3 Ultra than M2 Air, even though both have “enough” RAM — bandwidth scales independently of capacity. Tim Dettmers has written the same analysis for GPUs (an H100 at ~3.35 TB/s would do 70B-Q4 at ~78 tok/s in a perfect world).

A working rule of thumb

For any decode-bound workload at batch=1:

tok/s ≈ (memory_bandwidth_GBps × 0.7) / weight_bytes_GB

The 0.7 factor accounts for real-world bandwidth utilization (refresh, contention, controller overhead). MLX-tuned kernels approach this; llama.cpp is typically 10–20% below it.
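Both the theoretical ceiling and the rule-of-thumb estimate fit in one function. The name and the utilization parameter are illustrative; the bandwidth and weight figures are the ones used in the worked examples above.

```python
def decode_tok_per_s(bandwidth_gbps, weight_gb, utilization=1.0):
    """Bandwidth-bound decode estimate at batch=1.

    utilization=1.0 gives the theoretical ceiling; 0.7 gives the rule of thumb.
    """
    return bandwidth_gbps * utilization / weight_gb


if __name__ == "__main__":
    devices = [("M3 Ultra", 800), ("M4 Max", 546), ("M2 Air", 100)]
    models = [("Llama 3 70B Q4_K_M", 42.8), ("Llama 3 8B Q4_K_M", 4.9)]
    for device, bandwidth in devices:
        for model, weight_gb in models:
            ceiling = decode_tok_per_s(bandwidth, weight_gb)
            planned = decode_tok_per_s(bandwidth, weight_gb, utilization=0.7)
            print(f"{device:<9} {model:<20} ceiling {ceiling:6.1f}  planned {planned:6.1f} tok/s")
```

The printout ignores whether the model fits in RAM at all, so it should only be read after a fit check.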

6. Quantization quality vs memory — the quality cliff

Empirically, perplexity vs FP16 baseline (compiled from multiple Llama-family evaluations including the survey arXiv:2601.14277, “Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct”):

Quant	Δ perplexity (vs FP16)	Practical quality	Use case
Q8_0	≈ 0	indistinguishable from FP16	when you have RAM and want zero risk
Q6_K	< +0.01	indistinguishable on benchmarks	high-quality default
Q5_K_M	+0.02 to +0.04	near-imperceptible	sweet spot if memory tight
Q4_K_M	+0.05 to +0.10	minor degradation; mostly invisible in chat	community default
Q4_K_S	+0.07 to +0.15	mild degradation, noticeable on coding/math	when 0.5 GB matters
Q3_K_M	+0.2 to +0.4	visible on hard tasks	desperate
Q2_K	+0.5 to +1.5	clearly worse	last resort to fit at all

Community wisdom (consistent across kaitchup.substack.com, willitrunai.com, llama.cpp discussions, Maxime Labonne’s writeups):

Q6_K and Q5_K_M are indistinguishable from FP16 on most benchmarks.
Q4_K_M is the sweet spot for most local users.
Below Q4 degradation becomes noticeable — especially on reasoning and code.
Q2 is for desperate situations (running a 70B on 32 GB).
For reasoning / coding models specifically, treat Q5_K_M as the practical floor.

Why calibration-data methods preserve quality at 4 bits

Naive round-to-nearest treats every weight equally. AWQ and GPTQ exploit the observation that LLM weights have outliers — a small fraction of channels (AWQ) or specific weights (GPTQ) have disproportionate output impact. Scaling salient channels before quantization (AWQ) or rounding with second-order error compensation (GPTQ) keeps far more meaningful information in the same 4 bits.

LLM.int8() (Dettmers et al., 2022) was the original observation: in models past 6.7B parameters, ~0.1% of activation dimensions are “emergent outliers” with magnitudes 20× the rest. Keeping them in FP16 while quantizing the rest to INT8 gives lossless 8-bit inference.

7. Speculative decoding, MoE, and other tricks

Speculative decoding

Independently introduced by Leviathan, Kalman, Matias (Google, Nov 2022, arXiv:2211.17192) and Chen et al. (DeepMind, Feb 2023, arXiv:2302.01318):

A small draft model (e.g., Llama 3.2 1B) autoregressively proposes K candidate tokens. The large target model (e.g., Llama 3 70B) verifies all K in parallel in a single forward pass — same matmuls, K query tokens instead of 1. The correct prefix is accepted; the first wrong token is resampled per the target’s distribution. Output distribution is provably identical to running the target alone.

Speed-up depends on draft quality (acceptance 50–80% typical) and the speed ratio. 2–3× decode throughput is realistic.

Memory cost: both models loaded simultaneously. 70B target + 1B draft at Q4 ≈ 42.8 + 0.6 = 43.4 GB — almost free.
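How acceptance rate and draft cost combine into a speed-up can be estimated with the closed form from the Leviathan et al. analysis, as I understand it: with a per-token acceptance rate α (assumed independent across positions), γ drafted tokens per round, and a draft-to-target cost ratio c, the expected tokens per target pass is (1 − α^(γ+1)) / (1 − α) and the expected speed-up divides that by (γc + 1). The symbols and function names are not from this document and the independence assumption is a simplification.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens produced per target forward pass (accepted drafts
    plus the one token the target always contributes)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected decode speed-up when one draft step costs cost_ratio × a target step."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1)


if __name__ == "__main__":
    for alpha in (0.5, 0.65, 0.8):
        for gamma in (2, 4, 8):
            speedup = expected_speedup(alpha, gamma, cost_ratio=0.05)
            print(f"acceptance={alpha:.2f} drafted={gamma}: ~{speedup:.2f}x")
```

With acceptance in the 50–80% range and a cheap draft, the estimates land in roughly the 2–3× band quoted above; drafting more tokens helps only while acceptance stays high.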

Mixture of Experts (MoE)

MoE replaces the dense MLP in each transformer block with N “expert” MLPs and a small router. For each token, the router selects k experts (k=2 typical), so only k/N of the MLP weights are active per token.

Critical fact: the entire model must still be in memory — you don’t know which experts each token will route to.

Mixtral 8x7B (Jiang et al., arXiv:2401.04088):

8 experts/layer, top-2 routing.
Total params: 46.7 B; active per token: 12.9 B.
Memory at FP16: ~93 GB (all experts loaded).
Memory at Q4_K_M: ~28.5 GB by the 4.89 bpw rule (community GGUF files come in closer to 26 GB).
Decode speed: bandwidth-bound on active params (~13B effective) → as fast as a dense 13B.
Quality: matches or beats the dense Llama 2 70B on most benchmarks.

Mixtral 8x22B: 141B total / 39B active, ~86 GB at Q4_K_M, decodes like 39B dense.

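DeepSeek-V3 / R1: 671B total / 37B active, ~410 GB at Q4_K_M by the 4.89 bpw rule, which is why Apple positions M3 Ultra 512 GB as the “DeepSeek R1 desktop”. Decode reads only the ~37B active parameters (~22.6 GB) per token, so the bandwidth ceiling on M3 Ultra is about 800 / 22.6 ≈ 35 tok/s; real-world throughput lands below that ceiling.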

The memory math: weight_bytes = total_params × bpw / 8 (not active params). Speed math: tok/s ≈ bandwidth / (active_params × bpw / 8 + small_overhead).
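The split between the two formulas is easy to get backwards, so here it is as code: memory is sized from total parameters, speed from active parameters. The function name and return shape are illustrative, and the small overhead term is left out.

```python
def moe_estimates(total_params, active_params, bpw, bandwidth_gbps):
    """Return (weight_gb, theoretical_tok_per_s) for an MoE model.

    Memory uses total_params; decode speed uses active_params.
    """
    weight_gb = total_params * bpw / 8 / 1e9
    active_gb = active_params * bpw / 8 / 1e9
    return weight_gb, bandwidth_gbps / active_gb


if __name__ == "__main__":
    # Mixtral 8x7B at Q4_K_M on an 800 GB/s device.
    weight_gb, tok_s = moe_estimates(46.7e9, 12.9e9, 4.89, 800)
    dense_tok_s = 800 / weight_gb  # what a dense model of the same size would get
    print(f"weights: {weight_gb:.1f} GB")
    print(f"decode ceiling: {tok_s:.0f} tok/s (dense equivalent: {dense_tok_s:.0f} tok/s)")
```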

Continuous batching, PagedAttention, FlashAttention-2/3

Server-side techniques less relevant for batch=1 local inference, but useful to know:

PagedAttention (Kwon et al., SOSP 2023, arXiv:2309.06180) — the vLLM paper. KV cache stored in fixed-size pages (analogous to OS virtual memory), eliminating fragmentation. Without it, naive contiguous KV allocation wastes 60–80% in multi-user serving. With it, waste drops to <4% and throughput improves 2–3×.
Continuous batching — swap completed sequences with new ones at every step rather than waiting for the batch. Doubles or triples server throughput.
FlashAttention-2 (Dao 2023) — better parallelism, ~2× speed-up over v1.
FlashAttention-3 (Shah et al. 2024) — Hopper-specific async + FP8. Achieves 75% of H100’s theoretical FLOPS.

For local batch=1 inference, you get these benefits via llama.cpp’s -fa (flash attention) flag, MLX’s fused attention kernels, or PyTorch SDPA backends.

8. Worked examples — full math

Conventions:

KV cache at FP16 (the default; quantize to ~q8 for half, ~q4 for quarter).
Activation overhead: 1 GB (rough).
Framework overhead: 2 GB.
Total = weights + KV_cache + 3 GB other.
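Under these conventions any table cell below can be regenerated from four inputs. The sketch mirrors what the tables appear to do, which is to add the KV figure (computed in GiB) directly to the weight figure (in decimal GB) without converting; that mixing is an observation about the tables, not a recommendation. The function name is illustrative.

```python
def table_total_gb(num_params, bpw, kv_per_token_bytes, context_tokens, other_gb=3.0):
    """Total = weights + KV cache + 3 GB other, following the conventions above."""
    weights_gb = num_params * bpw / 8 / 1e9
    kv = kv_per_token_bytes * context_tokens / 2 ** 30   # GiB, added as-is
    return weights_gb + kv + other_gb


if __name__ == "__main__":
    # Llama 3.3 70B: 320 KiB of KV cache per token.
    kv_per_token = 320 * 1024
    for fmt, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.56), ("Q4_K_M", 4.89)]:
        cells = [table_total_gb(70e9, bpw, kv_per_token, ctx)
                 for ctx in (4096, 32768, 131072)]
        print(fmt.ljust(7), "  ".join(f"{c:6.1f} GB" for c in cells))
```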

Qwen 2.5 7B

28 layers, 28 query heads, 4 KV heads, head_dim 128 → 56 KiB per token.

Format	bpw	Weights	+4K KV	+32K KV	+128K KV
FP16	16	14.0 GB	0.22 → 17.2 GB	1.75 → 18.8 GB	7.0 → 24.0 GB
Q8_0	8.5	7.4 GB	10.6 GB	12.2 GB	17.4 GB
Q6_K	6.56	5.7 GB	8.9 GB	10.5 GB	15.7 GB
Q4_K_M	4.89	4.28 GB	7.5 GB	9.0 GB	14.3 GB

Runs comfortably at 32K on 16 GB Macs at Q4. 128K at Q4 totals ~14.3 GB, which is too tight for 16 GB once the OS is counted; plan on 24 GB, or quantize the KV cache.

Llama 3.1 8B

32 layers, 32 query heads, 8 KV heads, head_dim 128 → 128 KiB per token.

Format	bpw	Weights	+4K KV	+32K KV	+128K KV
FP16	16	16.0 GB	0.5 → 19.5 GB	4.0 → 23.0 GB	16.0 → 35.0 GB
Q8_0	8.5	8.5 GB	12.0 GB	15.5 GB	27.5 GB
Q6_K	6.56	6.56 GB	10.1 GB	13.6 GB	25.6 GB
Q4_K_M	4.89	4.89 GB	8.4 GB	11.9 GB	23.9 GB

Q4_K_M + 128K needs ~24 GB with FP16 KV, so it fits on a 32 GB Mac. On smaller machines 128K needs KV quantization: q8_0 KV brings the total to ~16 GB.

Llama 3.3 70B

80 layers, 64 query heads, 8 KV heads, head_dim 128 → 320 KiB per token.

Format	bpw	Weights	+4K KV	+32K KV	+128K KV
FP16	16	140 GB	1.25 → 144.3 GB	10 → 153 GB	40 → 183 GB
Q8_0	8.5	74.4 GB	78.6 GB	87.4 GB	117 GB
Q6_K	6.56	57.4 GB	61.6 GB	70.4 GB	100 GB
Q4_K_M	4.89	42.8 GB	47.0 GB	55.8 GB	85.8 GB

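Q4_K_M at 4K fits on a 64 GB Mac, though only just (~44 GB of weights plus KV against a 0.7 × 64 ≈ 45 GB budget). At 128K context, need 96+ GB: even aggressive KV quantization (q4 KV → ~11 GB instead of 40 GB → ~57 GB total) overshoots the default ~48 GB GPU allocation on a 64 GB machine unless iogpu.wired_limit_mb is raised.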

Mixtral 8x7B

32 layers, 32 query heads, 8 KV heads, head_dim 128. 46.7 B total (this is what counts for memory).

Format	bpw	Weights	+4K KV	+32K KV	+128K KV
FP16	16	93.4 GB	0.5 → 96.9 GB	4.0 → 100.4 GB	16.0 → 112.4 GB
Q8_0	8.5	49.6 GB	53.1 GB	56.6 GB	68.6 GB
Q6_K	6.56	38.3 GB	41.8 GB	45.3 GB	57.3 GB
Q4_K_M	4.89	28.5 GB	32.0 GB	35.5 GB	47.5 GB

Decodes at the speed of a ~13B dense model (12.9 B active per token), despite needing 28+ GB. The MoE deal: slow to load, fast to run.

Llama 3.1 405B (only feasible on Mac Studio Ultra)

126 layers, 128 query heads, 8 KV heads, head_dim 128 → 504 KiB per token.

Format	bpw	Weights	+4K KV	+32K KV	+128K KV
FP16	16	810 GB	2 → 815 GB	16 → 829 GB	63 → 876 GB
Q8_0	8.5	430 GB	435 GB	449 GB	496 GB
Q4_K_M	4.89	248 GB	253 GB	267 GB	314 GB

Even Q4 + 4K needs 253 GB. Mac Studio M3 Ultra 512 GB is the only consumer device that runs it. Q4 + 32K (~267 GB) is comfortable. 128K context at FP16 KV (~314 GB) still passes the 0.7 × system_RAM rule on 512 GB, and quantizing the KV cache to q4_0 (~18 GB instead of 63 GB) buys back most of that headroom. Expect ~2 tok/s decode (bandwidth-bound: 800 / 248 ≈ 3.2 theoretical → 1.5–2 real).

Qwen 2.5 32B

64 layers, 40 query heads, 8 KV heads, head_dim 128 → 256 KiB per token.

Format	bpw	Weights	+4K KV	+32K KV	+128K KV
FP16	16	65 GB	1 → 69 GB	8 → 76 GB	32 → 100 GB
Q8_0	8.5	34.5 GB	38.5 GB	45.5 GB	69.5 GB
Q6_K	6.56	26.6 GB	30.6 GB	37.6 GB	61.6 GB
Q4_K_M	4.89	19.9 GB	23.9 GB	30.9 GB	54.9 GB

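Q4_K_M at 4K (23.9 GB) fits on a 32 GB Mac; 32K (30.9 GB) fails the 0.7 × system_RAM rule there and wants a 48 GB Mac. Q6_K (near-FP16) at 32K (37.6 GB) is borderline on 48 GB and comfortable on 64 GB. Full 128K on Q4 (54.9 GB) needs a 96 GB Mac, or 64 GB with KV quantization.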

Specific learnings for Locara

The manifest schema should accept (weights_bytes, kv_per_token_bytes, max_context) as primitives, not “min RAM.” This lets the runtime compute the real RAM requirement for the user’s actual context cap, not a single global number. An app that supports 128K context but defaults to 8K can run on a much smaller Mac if the user keeps context short.
Bandwidth is the right primary device-class metric alongside RAM. An M3 Max-14C (300 GB/s) and an M3 Max-16C (400 GB/s) have meaningfully different LLM performance for the same model, so the manifest needs to know both bandwidth and RAM. See mac-hardware-lineup.md for the per-SKU lookup table.
Default to Q4_K_M for capacity-bound users; Q6_K for quality-bound users. Don’t try to be smarter than the community consensus. The runtime’s model catalog should ship both for any model where size allows.
KV cache quantization should be on by default past 8K context. The math is unambiguous — at 32K context for a 70B model, FP16 KV cache costs 10 GB you almost certainly want for something else. Default to q8_0 K / q8_0 V and surface “use full precision KV” as a quality-app opt-in.
Publish a “Llama 3 8B Q4_K_M expected ~N tok/s on your Mac” estimate at install time. Once the fit check passes, compute it as bandwidth × 0.7 / weight_bytes. Honest expected performance is the foundation of trust: users who expect 100 tok/s and get 20 churn; users who are told 25 tok/s and get 25 are happy.
MoE models are a special manifest case. Memory = total_params; speed = active_params. Without that distinction the runtime will either size memory from active params and try to load a model that cannot fit, or size speed from total params and under-promise decode speed by several times (about 3.6× for Mixtral 8x7B). The manifest schema needs total_params and active_params as separate fields for MoE.
Speculative decoding requires both models manifested together. The pair (target, draft) is the deployment unit, not just the target. Runtime should compute combined memory cost.
Track Apple’s official bandwidth as upper bound, plan for 0.7× utilization. Real LLM decode hits 70–85% of theoretical on tuned kernels (MLX) and 60–75% on llama.cpp. Bandwidth × utilization / weight_bytes is the honest tok/s estimate.
Reject 8 GB Mac configurations for any model >3B. The OS + browser baseline alone eats most of the 8 GB budget. A 7B model “technically fits” but will thrash. Manifest minimum should be 16 GB for most useful local models.
The KV cache formula is the single most important piece of math Locara’s runtime needs internalized. Get the architecture-specific (num_layers, num_kv_heads, head_dim) numbers from each model’s config.json on HF, store them in the runtime’s model registry, and recompute at every install-time fit check.
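Several of the points above (context-aware RAM checks, separate total and active parameter counts, architecture numbers pulled from config.json) can be combined into one registry entry. This is a sketch of one possible shape, not Locara's actual schema: the class and field names are hypothetical. The config keys read here (num_hidden_layers, num_key_value_heads, hidden_size, num_attention_heads, optional head_dim) are the ones used by Llama- and Qwen-style Hugging Face configs; other architectures may name them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """Hypothetical registry record holding the primitives for a fit check."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    total_params: float    # sizes memory
    active_params: float   # sizes decode speed (equals total_params for dense models)
    bpw: float

    @classmethod
    def from_hf_config(cls, config, total_params, bpw, active_params=None):
        # Some configs state head_dim explicitly; otherwise derive it.
        head_dim = config.get("head_dim") or (
            config["hidden_size"] // config["num_attention_heads"])
        return cls(
            num_layers=config["num_hidden_layers"],
            num_kv_heads=config["num_key_value_heads"],
            head_dim=head_dim,
            total_params=total_params,
            active_params=total_params if active_params is None else active_params,
            bpw=bpw,
        )

    def kv_per_token_bytes(self, kv_dtype_bytes=2):
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * kv_dtype_bytes

    def weights_bytes(self):
        return self.total_params * self.bpw / 8

    def required_bytes(self, context_tokens, kv_dtype_bytes=2):
        """Weights plus KV cache at the user's actual context cap."""
        return self.weights_bytes() + context_tokens * self.kv_per_token_bytes(kv_dtype_bytes)


if __name__ == "__main__":
    # Architecture fields as found in a Qwen 2.5 7B style config.json.
    config = {"hidden_size": 3584, "num_attention_heads": 28,
              "num_key_value_heads": 4, "num_hidden_layers": 28}
    entry = ModelEntry.from_hf_config(config, total_params=7e9, bpw=4.89)
    for context in (8192, 131072):
        print(f"{context:>6} tokens: {entry.required_bytes(context) / 1e9:.2f} GB")
```

Storing the primitives rather than a single “min RAM” figure lets the same entry answer the fit question for an 8K default and a 128K maximum without a second manifest.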

References

Foundational quantization and memory papers

Williams, Waterman, Patterson — Roofline: An Insightful Visual Performance Model for Multicore Architectures, CACM April 2009 — https://cacm.acm.org/magazines/2009/4/22959-roofline-an-insightful-visual-performance-model-for-multicore-architectures/fulltext
Dettmers et al. — LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, NeurIPS 2022 — https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
Dettmers et al. — QLoRA, 2023 — https://arxiv.org/abs/2305.14314
Frantar et al. — GPTQ, ICLR 2023 — https://arxiv.org/abs/2210.17323
Lin et al. — AWQ, 2023 — https://arxiv.org/abs/2306.00978
Xiao et al. — SmoothQuant, 2022 — https://arxiv.org/abs/2211.10438

Attention and serving

Dao et al. — FlashAttention, NeurIPS 2022 — https://arxiv.org/abs/2205.14135
Dao — FlashAttention-2, 2023 — https://arxiv.org/abs/2307.08691
Shah et al. — FlashAttention-3, 2024 — https://tridao.me/blog/2024/flash3/
Kwon et al. — PagedAttention / vLLM, SOSP 2023 — https://arxiv.org/abs/2309.06180
Ainslie et al. — GQA, 2023 — https://arxiv.org/abs/2305.13245
Leviathan, Kalman, Matias — Speculative Decoding, 2022 — https://arxiv.org/abs/2211.17192
Chen et al. (DeepMind) — Accelerating LLM Decoding with Speculative Sampling, 2023 — https://arxiv.org/abs/2302.01318
Xiao et al. — StreamingLLM, 2023 — https://arxiv.org/abs/2309.17453

Model architecture papers

Grattafiori et al. — The Llama 3 Herd of Models, 2024 — https://arxiv.org/abs/2407.21783
Jiang et al. — Mixtral of Experts, 2024 — https://arxiv.org/abs/2401.04088
DeepSeek-AI — DeepSeek-V3 Technical Report, 2024 — https://arxiv.org/abs/2412.19437

Practical references

Andrej Karpathy — nanoGPT, llm.c, State of GPT (Microsoft Build 2023). https://github.com/karpathy/nanoGPT, https://github.com/karpathy/llm.c
Tim Dettmers — https://timdettmers.com — practical hardware reasoning.
Horace He — Making Deep Learning Go Brrrr From First Principles — https://horace.io/brrr_intro.html
Quentin Anthony, Stella Biderman, Hailey Schoelkopf (EleutherAI) — Transformer Math 101 — https://blog.eleuther.ai/transformer-math/
Awni Hannun (MLX) — https://awnihannun.com, @awnihannun
Simon Willison — https://simonwillison.net/tags/local-llms/
Daniel Han / Unsloth — fine-tuning memory math
Maxime Labonne — quantization writeups
llama.cpp quantize README — https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md
llama.cpp Tensor Encoding Schemes wiki — https://github.com/ggml-org/llama.cpp/wiki/Tensor-Encoding-Schemes
llama.cpp server README (KV cache flags) — https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
llama.cpp discussion #5932 (KV cache quantization) — https://github.com/ggml-org/llama.cpp/discussions/5932
MLX quantization docs — https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.quantize.html
Hugging Face Qwen 2.5-7B-Instruct config — https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json
Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct — https://arxiv.org/html/2601.14277v1
Sam McLeod, Bringing K/V Context Quantisation to Ollama — https://smcleod.net/2024/12/bringing-k-v-context-quantisation-to-ollama/
Hugging Face, 4-bit Transformers with bitsandbytes — https://huggingface.co/blog/4bit-transformers-bitsandbytes
apxml.com model spec pages (Llama 3.3 70B, Llama 3.1 405B, Qwen 2.5 32B, Mixtral 8x7B)