LLM Inference Frameworks — Survey for Mac Local-AI Apps

What this is: A survey of the credible LLM inference engines and optimization packages in the universe, with a Mac-first lens. Per-engine capsules, a kernel/library layer beneath, a decision matrix, a quantization-format compatibility map, common confusions to avoid, and a “what to track” list. Sourced from maintainers and canonical papers — Gerganov on llama.cpp, Hannun on MLX, Tunney on llamafile, Tianqi Chen on MLC, Kwon on vLLM/PagedAttention, Dao on FlashAttention, Dettmers on bitsandbytes, Apple WWDC for Core ML / MLX / Foundation Models. Why it matters: Locara is Mac-first. Picking the right inference substrate is a load-bearing decision — wrong choice means 3× slower decode, 2× more battery, or a fork of code we’ll regret. Most “best inference engine” content online is cargo-cult listicle noise; this note is the structured map that lets us choose deliberately. Most relevant to Locara: Pairs with mlx.md (default engine deep dive), llama-cpp.md (fallback engine deep dive), mac-llm-optimization.md (practical runtime tuning), apple-acceleration-frameworks.md (the kernel-and-library layer), llm-memory-math.md (memory cost formulas that drive the decision matrix), mac-hardware-lineup.md (per-SKU constraints).

Part 1 — Taxonomy

Bucket	Mac-relevant?	Examples
1. Native Mac-optimized engines (Metal/UMA-first kernels)	Yes	MLX + MLX-LM, llama.cpp (Metal backend), MLC LLM (Metal compile target)
2. Server/datacenter engines (CUDA-first batched serving)	No (with caveats)	vLLM, SGLang, TGI, TensorRT-LLM, Aphrodite, LightLLM
3. Single-GPU consumer engines (NVIDIA-centric quant runtimes)	No	ExLlamaV2 / EXL3, AutoGPTQ, AWQ-runner
4. Cross-platform abstraction frameworks (compiled or portable)	Partial	MLC LLM (cross-target compiler), Candle, Burn, Ratchet, WebLLM, transformers + MPS
5. Wrappers / model servers / dev shells (UX layer over engines)	Yes	Ollama, LM Studio, Jan + Cortex.cpp, Open WebUI, GPT4All, LocalAI, KoboldCpp, text-generation-webui, llamafile
6. Apple’s own (OS-level)	Yes	Core ML, MPSGraph, Foundation Models framework (macOS 26+), Speech / SpeechAnalyzer
7. Specialized (single-purpose)	Partial	AirLLM (disk-offload), exo (distributed across Macs), Outlines / LMQL (structured gen)
8. Kernel and library layer (substrate, not full engines)	Partial	ggml, FlashAttention, PagedAttention, Marlin, Triton, CUTLASS, bitsandbytes kernels, Mojo

Two refinements worth flagging: WebLLM lives in bucket 4, not its own bucket — it’s the browser deployment target of the MLC compiler, not a separate engine. Foundation Models is genuinely its own bucket, not a wrapper — it’s an OS-supplied LLM with no replaceable substrate; you can’t swap models under it.

Part 2 — Per-engine capsules

llama.cpp

What it is: C/C++ inference engine for transformer LLMs; the substrate under nearly all “local LLM” tooling on Mac.
Maintainer: Georgi Gerganov (“ggerganov”); large committer base (slaren, ngxson, JohannesGaessler, etc.).
License + activity: MIT, multiple commits/day, “pull master” is the recommended cadence — no semver discipline.
Where it runs: macOS (Metal), Linux (CUDA/ROCm/Vulkan/SYCL/CANN), Windows, *BSD, Android. CPU on AVX-2/AVX-512/NEON.
Inference characteristics: Single-stream chat-shaped workloads; tunable batched serving via llama-server but not at vLLM-class throughput. Bandwidth-bound at batch=1 (chat).
Quantization: GGUF (own format) with K-quants (Q4_K_M, Q5_K_M, Q6_K, Q8_0), IQ-quants (IQ2/3/4_XS/NL), legacy Q4_0/Q4_1. KV-cache quant (q8_0, q4_0) via --cache-type-k/v with -fa.
Distinguishing feature: The portability/coverage axis — same model file runs everywhere; broadest backend matrix.
When to use it on Mac: As the universally-available fallback when MLX hasn’t ported a model. Recommended baseline configuration: -ngl 999 -t <num_perf_cores> -c 8192 -fa --cache-type-k q8_0 --cache-type-v q8_0 (see mac-llm-optimization.md).
When NOT to use it: When MLX has the same model and you only care about Apple Silicon. MLX wins ~20–87% throughput on M-series for sub-14B models per arXiv:2511.05502.
Maintainer voice: Project README; design discussions in ggml-org/llama.cpp discussions (#638 on mmap, #5932 on KV quantization). No formal design doc — the code is the spec.

MLX + MLX-LM

What it is: Apple’s open-source array framework + the LLM-specific package on top. NumPy-shaped Python API, Swift + C++ bindings.
Maintainer: Awni Hannun, Jagrit Digani, Angelos Katharopoulos, Ronan Collobert (Apple ML Research). Hannun is the public face.
License + activity: MIT, very active; first public commits Nov 2023, announced Dec 2023.
Where it runs: Apple Silicon only by design. No CUDA path; experimental CUDA-via-translation exists with no parity roadmap.
Inference characteristics: Throughput leader on Apple Silicon for sub-14B models. Lazy evaluation; kernel fusion at graph-build time. mlx_lm.server exposes OpenAI-compatible API.
Quantization: MLX-Q (own scheme; 2/3/4/6/8-bit, group-size 32/64). Not interoperable with GGUF — mlx-community maintains a parallel weight registry on Hugging Face.
Distinguishing feature: Unified memory as a first-class assumption — “arrays in MLX live in shared memory” and “operations on MLX arrays can be performed on any of the supported device type without performing data copies” [MLX docs, ml-explore.github.io/mlx]. NumPy-shaped API for fast research adoption.
When to use it on Mac: The default — for any Mac-first chat or generation workload on a supported model.
When NOT to use it: When you need cross-platform (Linux/Windows), when the model isn’t ported, when you want >70B without ports, or when you need an exotic quant.
Maintainer voice: Awni Hannun on X (@awnihannun) and his personal blog; the MLX README. “MLX is designed by machine learning researchers for machine learning researchers” [MLX README].

MLC LLM

What it is: A cross-platform inference engine built on the TVM / TVM Unity compiler stack. Compiles model graphs to per-target binaries.
Maintainer: Tianqi Chen and the MLC team (CMU/OctoML lineage; same group as TVM, XGBoost).
License + activity: Apache 2.0, active.
Where it runs: “AMD (Vulkan, ROCm), NVIDIA (Vulkan, CUDA), Apple (Metal), Intel (Vulkan); iOS (Metal on A-series), Android (OpenCL on Adreno/Mali); WebGPU and WebAssembly; Linux and Windows” [mlc-ai/mlc-llm README].
Inference characteristics: Designed for one-engine-everywhere via the MLCEngine, with an OpenAI-compatible API surface across REST / Python / JavaScript / iOS / Android. Cross-platform consistency, not always the throughput leader on any one platform.
Quantization: Group-quant (q3f16, q4f16, q4f32) compiled into the model artifact.
Distinguishing feature: Compiler-first lineage. Where llama.cpp and MLX rely on hand-written kernels per backend, MLC generates them from a single graph IR via TVM. Same approach that produced TVM and TensorIR.
When to use it on Mac: When you need the same model artifact running on Mac, iOS, Android, and web. The cross-platform story is uniquely strong.
When NOT to use it: When you only need Mac — MLX wins on raw throughput and ergonomics. The toolchain (compile step, Vulkan/Metal target selection, Python pipeline) is heavier than mlx_lm.generate(...).
Maintainer voice: Tianqi Chen’s papers (TVM 2018 OSDI; TensorIR 2023 ASPLOS); MLC docs; mlc.ai blog posts.

Ollama

What it is: Local LLM runtime + model registry. Go daemon (port 11434) wrapping llama.cpp; Modelfile packaging; ollama.com/library registry.
Maintainer: Originally indie founders (Jeffrey Morgan, Michael Chiang); now a VC-funded company.
License + activity: MIT; highly active.
Where it runs: macOS, Linux, Windows. Apple Silicon via llama.cpp Metal backend.
Inference characteristics: “Load on demand, evict naively.” Default OLLAMA_KEEP_ALIVE=5m. Limited memory arbitration when multiple apps share the daemon.
Quantization: Whatever llama.cpp supports (GGUF / K-quants / IQ-quants).
Distinguishing feature: Best-in-class developer DX (ollama run llama3 → chat). Content-addressed storage of weight blobs across model variants (the “Docker layer” trick). OpenAI-compatible API at /v1.
When to use it on Mac: When you want a daemon-style local server with the lowest friction for end users. The default choice for the wrapper layer.
When NOT to use it: When you need MLX-grade throughput, when you need tight per-app memory budgeting, or when you want to embed inference in your own process (use llama.cpp directly via llama-cpp-2 Rust binding).
Maintainer voice: Mostly product-marketing tone; the GitHub README and ollama.com/blog are the primary signal. No design-doc culture.

LM Studio

What it is: Polished proprietary desktop app for running local LLMs + a model browser + an OpenAI-compatible local server.
Maintainer: Element Labs (Yagil Burowski et al.).
License + activity: Proprietary (closed source); free for personal use. Active.
Where it runs: macOS, Windows, Linux.
Inference characteristics: Bundles llama.cpp; recent versions ship MLX as an alternate engine — the first third-party app to offer MLX as a first-class runtime path.
Quantization: GGUF + MLX-Q.
Distinguishing feature: UX polish. Best-of-breed for end-user model discovery, side-by-side chat, and quick experimentation.
When to use it on Mac: As a reference for what a polished local-LLM UX looks like; as the easy on-ramp for non-developers.
When NOT to use it: When you need open source, when you need to embed in another app, or when you need scriptable customization beyond what the API surface allows.
Maintainer voice: No primary citation available. The team has done conference appearances but no public design doc.

Jan + Cortex.cpp

What it is: Jan = Tauri-based open-source desktop assistant. Cortex.cpp = Jan’s own llama.cpp-based daemon (competitor to Ollama).
Maintainer: Menlo Research (formerly janhq); Vietnam-based team.
License + activity: Apache 2.0. Jan active; Cortex.cpp archived July 2025, development moved to menloresearch/llama.cpp.
Where it runs: Mac / Win / Linux.
Inference characteristics: llama.cpp under the hood; multi-provider (local + BYO-key cloud).
Quantization: GGUF.
Distinguishing feature: Tauri + Rust + TS stack validates that combination for local-AI apps. Plugin/extensions API exists (though immature).
When to use it: As a reference for Tauri-based local AI apps and for the “assistant” framing.
When NOT to use it: Don’t depend on Cortex.cpp — archived. Use llama.cpp directly.
Maintainer voice: Public roadmap on GitHub; team writes via Jan’s blog. Honest about tradeoffs.

llamafile

What it is: Single-file executable bundling a model + llama.cpp + chat UI; runs on Mac/Linux/Windows/FreeBSD with no install.
Maintainer: Justine Tunney (justine.lol), under Mozilla Ocho (Innovation Studio).
License + activity: Apache 2.0 (main) + MIT (llama.cpp/whisper.cpp mods); active, ~24k stars.
Where it runs: Everywhere via Cosmopolitan Libc / APE (Actually Portable Executables — one binary is simultaneously valid PE, ELF, Mach-O, and a shell script).
Inference characteristics: llama.cpp under the hood, but Tunney’s hand-tuned matmul kernels achieved meaningful prompt-eval wins that were upstreamed. “84 optimized matrix multiplication kernels … approximately 810 gigaflops on her Alderlake processor — exceeding Intel’s proprietary MKL library for matrices fitting in L2 cache” [Tunney, justine.lol/matmul].
Quantization: GGUF.
Distinguishing feature: Distribution-as-a-feature. Tunney’s argument is explicit: the local-AI bottleneck is distribution friction, not inference perf. “Quantization could become the bigger bottleneck … less need to trade away knowledge for speed” once kernels are sufficient [Tunney, justine.lol/matmul].
When to use it on Mac: When you want a “double-click and it works” demo, or to ship a self-contained tool. Strongly trips macOS Gatekeeper unless properly signed/notarized.
When NOT to use it: When you need updates (multi-GB executable, no patching), composition with other apps, or shared model storage.
Maintainer voice: Justine Tunney’s blog justine.lol is the canonical primary source; Mozilla’s hacks.mozilla.org/2023/11/introducing-llamafile/ is the announce post.

Open WebUI

What it is: Self-hostable web frontend for local LLMs; the polish layer above Ollama for users who want a ChatGPT-like UI.
Maintainer: Timothy Jaeryang Baek and contributors.
License + activity: MIT (with recent contributor-license changes); active.
Where it runs: Anywhere Docker/Python runs; Mac usually via Docker Desktop.
Inference characteristics: Frontend only — defers inference to Ollama / OpenAI-compatible servers.
Quantization: N/A (frontend).
Distinguishing feature: The dominant “Ollama-with-a-real-UI” combination. Multi-user, RAG, vision-input plumbing.
When to use it on Mac: When you want a deployable web UI for an Ollama backend, especially multi-user.
When NOT to use it: When you need a native Mac app — it’s a webapp wearing a desktop costume.
Maintainer voice: README + project docs; no design-doc culture.

vLLM

What it is: High-throughput server inference engine; canonical PagedAttention implementation.
Maintainer: Woosuk Kwon, Zhuohan Li (UC Berkeley → originated at Sky Computing Lab); large committer base now under PyTorch Foundation governance.
License + activity: Apache 2.0; very active.
Where it runs: NVIDIA primarily; AMD ROCm secondary. macOS support exists experimentally but is not target-grade — no Metal backend in tree.
Inference characteristics: Many-concurrent-user serving with continuous batching. “vLLM achieves up to 24x higher throughput compared to HF and up to 3.5x [than TGI]” [vllm.ai blog, 2023-06-20]. PagedAttention is the central trick.
Quantization: AWQ, GPTQ, FP8, INT8, Marlin kernels, INT4, BitsAndBytes (some via integrations).
Distinguishing feature: PagedAttention. “Partitions the KV cache of each sequence into blocks” that “do not need to be contiguous in memory space” — analogous to OS virtual memory; reduces fragmentation from “60–80%” to “under 4%” [vllm.ai blog].
When to use it on Mac: Don’t.
When NOT to use it: Mac. But the PagedAttention algorithm influences cache-management thinking everywhere, including llama.cpp’s evolving cache layer and MLX’s chunked-prefill work.
Maintainer voice: Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP 2023; vLLM blog.

SGLang

What it is: Server inference engine + frontend DSL for structured LLM programs; competes with vLLM on throughput.
Maintainer: Lianmin Zheng, Ying Sheng (LMSYS).
License + activity: Apache 2.0; active.
Where it runs: NVIDIA / AMD; Linux server. Not Mac.
Inference characteristics: Continuous batching + RadixAttention. “Up to 5 times higher throughput” vs vLLM in the SGLang benchmark [Zheng et al., lmsys.org/blog/2024-01-17-sglang].
Quantization: AWQ, GPTQ, FP8.
Distinguishing feature: RadixAttention — “automatic and efficient KV cache reuse across multiple LLM generation calls” via a radix tree, with LRU eviction. Solves the shared-prefix problem (system prompt + few-shot) as a runtime concern rather than an app concern [LMSYS blog].
When to use it on Mac: Don’t.
When NOT to use it: Mac. But the RadixAttention idea is directly transferable to Mac runtimes for shared-prefix caching, which today is a manual --prompt-cache knob in llama.cpp.
Maintainer voice: LMSYS blog; arXiv paper SGLang: Efficient Execution of Structured Language Model Programs (Zheng et al., 2024).

TGI (Text Generation Inference)

What it is: HuggingFace’s production-grade inference server.
Maintainer: Olivier Dehaene and HF inference team.
License + activity: Apache 2.0 (with HFOIL period in 2023–24, since reverted); active.
Where it runs: NVIDIA primarily; AMD, Inferentia, TPU, and llama.cpp CPU via the multi-backend architecture. Not Mac-targeted.
Inference characteristics: Rust HTTP and scheduling layer; Python modeling. Refactored in 2024 around a trait Backend so TGI is now a unified frontend over multiple inference engines including vLLM, TRT-LLM, llama.cpp, AWS Neuron, Google TPU [HF blog, tgi-multi-backend].
Quantization: AWQ, GPTQ, EETQ, BNB.
Distinguishing feature: The “single frontend for many backends” approach is conceptually the inverse of vLLM/SGLang — it’s an inference gateway, not an inference engine.
When to use it on Mac: No.
When NOT to use it: Mac. But study the Rust trait Backend pattern as a template for engine-abstraction in any local runtime.
Maintainer voice: HF blog (huggingface.co/blog), Olivier Dehaene’s conference talks.

TensorRT-LLM

What it is: NVIDIA’s optimizing compiler-based inference engine; the reference frontier perf number on NVIDIA hardware.
Maintainer: NVIDIA engineering.
License + activity: Apache 2.0; active.
Where it runs: NVIDIA only — H100, H200, Blackwell, B200, A100.
Inference characteristics: Kernel-fusion-heavy; multiple quantization paths including INT4 AWQ; targets the upper bound of NVIDIA throughput.
Quantization: AWQ-W4, GPTQ-W4, FP8, INT8 SmoothQuant, INT4 WeightOnly. Programmatic QuantConfig per build.
Distinguishing feature: Vendor-supplied kernel quality (CUTLASS substrate); first hardware to ship FP8 paths. The reference for “what’s achievable on this NVIDIA SKU.”
When to use it on Mac: No.
When NOT to use it: Mac. But it’s the perf ceiling reference for any cross-platform benchmark.
Maintainer voice: NVIDIA dev blog; GTC talks (search “TensorRT-LLM” on NVIDIA on-demand).

ExLlamaV2 / EXL3

What it is: Consumer-NVIDIA LLM inference library with its own quant format.
Maintainer: turboderp (single-developer-led, community contributors); under turboderp-org org.
License + activity: MIT; active. EXL3 is the successor format.
Where it runs: “Modern consumer-class GPUs” — NVIDIA 30/40/50-series with CUDA. Not Mac.
Inference characteristics: Tuned for batch=1/2 single-user scenarios. Benchmarks reference 3090 Ti, 4090.
Quantization: EXL2 / EXL3 — “supports 2-8 bit quantization that allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight” [turboderp-org/exllamav2 README]. Per-tensor bitrate optimization on a measurement set, akin to quant compression with profile data.
Distinguishing feature: EXL2/3’s variable bitrate quant is mathematically more flexible than fixed-block GGUF K-quants — better at extracting quality from a tight VRAM budget on NVIDIA.
When to use it on Mac: No native Mac support.
When NOT to use it: Mac. But the EXL2/3 quant philosophy (mixed precision per-layer, calibrated on a small dataset) is conceptually transferable.
Maintainer voice: GitHub README; turboderp’s discussions in the repo issues. No formal blog.

Candle

What it is: HuggingFace’s minimalist Rust ML framework.
Maintainer: Laurent Mazare (HF), contributors.
License + activity: Apache 2.0 / MIT; active.
Where it runs: CPU (with MKL on x86, Accelerate on macOS), CUDA (with NCCL multi-GPU), WebAssembly. Metal support is limited; you get CPU+Accelerate on Mac, not the Metal GPU acceleration MLX or llama.cpp give you.
Inference characteristics: “Simple syntax, looks and feels like PyTorch.” Eliminates Python from production hot path.
Quantization: GGUF (via candle-transformers), some int8 paths.
Distinguishing feature: Rust-native ML without GIL or Python deps. Lightweight serverless binaries.
When to use it on Mac: When you want a pure-Rust app and CPU-only inference is acceptable. Otherwise prefer llama-cpp-2 / MLX Swift bindings to actually use the GPU.
When NOT to use it: When you need Apple Silicon GPU acceleration for LLM-class workloads — the Metal story is weaker than llama.cpp’s.
Maintainer voice: README; Laurent Mazare on HF blog (search “candle huggingface.co/blog”).

Apple Foundation Models

What it is: Swift/Python developer SDK for the ~3B on-device LLM bundled with macOS 26 / iOS 19.
Maintainer: Apple.
License + activity: Proprietary; free for developers, no API key. macOS 26 (Tahoe) +. Active.
Where it runs: Apple Silicon Macs (M1+), A17 Pro+ iPhones, M-series iPads. Apple Intelligence eligibility gates the API.
Inference characteristics: Single shared 3B model across the OS — apps don’t ship their own weights. NPU + GPU coordination via Apple’s stack. Apple has not published detailed throughput numbers.
Quantization: Apple’s own (opaque). Optional LoRA adapter fine-tuning for app customization.
Distinguishing feature: OS-bundled model; no per-app weight bundle; PCC fallback for harder requests with attestation-verified server enclaves.
When to use it on Mac: For “summarize this,” “classify this,” “extract this” tasks where a 3B model is sufficient.
When NOT to use it: When you need >3B-class capability, when you need cross-platform, when you need a non-Apple model (open-source weights, your own fine-tune beyond LoRA, or specific architectures).
Maintainer voice: WWDC 2025 session videos; developer.apple.com/documentation/foundationmodels. Cross-ref apple-foundation-models.md.

Apple Core ML

What it is: Apple’s on-device ML deployment runtime — the official path for shipping any ML model in a Mac/iOS app.
Maintainer: Apple.
License + activity: Proprietary; built into every Apple OS.
Where it runs: All Apple OSes.
Inference characteristics: Tightly integrated with Neural Engine + GPU + CPU; supports static graph compilation via Core ML compiler (coremlcompiler).
Quantization: Per-axis linear quant, palettization, sparse weights, FP16/INT8/INT4 (via coremltools).
Distinguishing feature: Only path with first-class Neural Engine access. LLMs converted via coremltools.convert and run via MLModel API.
When to use it on Mac: For non-LLM ML tasks (vision, speech, classifiers) and for LLMs where you want NE acceleration and tight system integration. Whisper-CoreML is a strong example.
When NOT to use it: For interactive chat LLMs — Apple’s static-graph model trades the dynamic-batching / KV-cache flexibility that llama.cpp and MLX provide. Apple’s stance per the WWDC framing is that Core ML is the deployment runtime; MLX is for research and dynamic inference.
Maintainer voice: WWDC sessions; developer.apple.com/documentation/coreml; coremltools repo on GitHub. Cross-ref apple-acceleration-frameworks.md.

bitsandbytes

What it is: Python library of 8-bit and 4-bit operators for PyTorch; the workhorse quantization library in transformers + accelerate.
Maintainer: Tim Dettmers (UW → CMU, then independent); now broader committer base, governance under the bitsandbytes-foundation org.
License + activity: MIT; active.
Where it runs: CUDA primary; AMD ROCm, Intel XPU, MPS support landing. Limited Apple GPU support — historically MPS was a known gap.
Inference characteristics: Not a serving engine — a kernels-and-types library used during model loading. Underpins QLoRA fine-tuning.
Quantization: LLM.int8() (mixed-precision decomposition for outlier features), 4-bit NF4 / FP4 (the QLoRA quant), 8-bit Adam optimizer states.
Distinguishing feature: The LLM.int8() insight — “transformers at scale develop emergent outlier features … certain hidden dimensions with extremely large values that disrupt standard quantization.” Mixed-precision decomposition keeps outliers in FP16 and quantizes the other 99.9% to INT8 [Dettmers, timdettmers.com/2022/08/17/llm-int8-and-emergent-features].
When to use it on Mac: For QLoRA fine-tuning if a Mac-supported path exists (still limited as of 2026). For inference, MLX or llama.cpp quants are more Mac-native.
When NOT to use it: Pure Mac inference; the MPS path is not the production-grade target.
Maintainer voice: timdettmers.com; Dettmers/Belkada/Lewis/Raffel et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (NeurIPS 2022); Dettmers et al., QLoRA (NeurIPS 2023).

AirLLM

What it is: Disk-offloaded inference — runs huge models layer-by-layer from disk, well beyond GPU/RAM capacity.
Maintainer: Gavin Li (lyogavin, Anima AI).
License + activity: Apache 2.0; active.
Where it runs: Wherever PyTorch runs — Mac via MPS, CUDA, CPU.
Inference characteristics: “70B large language models to run inference on a single 4GB GPU card without quantization, distillation and pruning” [lyogavin/airllm README]. The cost is throughput — disk-streaming per token.
Quantization: Optional block-wise 4-bit / 8-bit.
Distinguishing feature: The disk-offload trick — decompose the model and stream layers in sequence. As cross-ref mac-llm-optimization.md notes, this is “punishingly slow” for normal decode since every layer is needed per token; viable mainly for batch prefill of long context.
When to use it on Mac: When the user explicitly wants a model too large for their RAM and is willing to accept very low tok/s (think 1 tok/s, not 30).
When NOT to use it: Any real interactive workload.
Maintainer voice: Gavin Li’s Medium posts and the repo README; no formal paper.

exo

What it is: Distributed inference across multiple consumer devices; shards a single model across a cluster of Macs (and other devices).
Maintainer: exo labs.
License + activity: Apache 2.0; active.
Where it runs: Mac / Linux / iPhone / iPad / Android. Auto-discovery.
Inference characteristics: Tensor parallelism with topology-aware placement. “1.8x speedup on 2 devices and 3.2x speedup on 4 devices” [exo-explore/exo README]. On macOS 26.2+, RDMA over Thunderbolt 5 reduces inter-device latency by ~99%.
Quantization: MLX / GGUF (depends on backend per device).
Distinguishing feature: Multi-Mac inference. Two Mac Studios + RDMA Thunderbolt 5 → effectively one big-RAM Mac for inference purposes.
When to use it on Mac: When you have multiple Macs and want to run a model that doesn’t fit on any single one (e.g., 405B at 4-bit needs ~200 GB; two 128 GB Mac Studios pooled work).
When NOT to use it: Single-machine use; latency-sensitive workloads (network hop adds latency even with RDMA).
Maintainer voice: GitHub README; exo labs Twitter; no formal paper as of writing.

transformers + MPS backend

What it is: HuggingFace’s transformers library running via PyTorch’s MPS (Metal Performance Shaders) backend.
Maintainer: HuggingFace + PyTorch teams.
License + activity: Apache 2.0; very active.
Where it runs: Apple Silicon via PyTorch MPS.
Inference characteristics: Reference correctness (matches the upstream model authors’ implementation) but historically the slowest path on Mac. Many ops fall back to CPU silently; KV-cache management is naive vs llama.cpp/MLX.
Quantization: bitsandbytes (limited on MPS), GPTQ, AWQ via optimum.
Distinguishing feature: Lowest-friction path from a HuggingFace model card to a working inference loop; correctness is the gold standard.
When to use it on Mac: Prototyping, debugging, or models that haven’t been ported to MLX/GGUF yet.
When NOT to use it: Production. The Mac perf gap vs MLX is large for chat workloads.
Maintainer voice: HF transformers repo; PyTorch MPS issues tracker.

WebLLM

What it is: In-browser LLM inference via WebGPU; deployment target of the MLC compiler.
Maintainer: Tianqi Chen’s group at MLC.
License + activity: Apache 2.0; active.
Where it runs: Any browser with WebGPU — Chrome stable, Safari 18+, Firefox behind a flag.
Inference characteristics: “High-performance, in-browser language model inference engine that leverages WebGPU for hardware acceleration” [webllm.mlc.ai]. Throughput is browser-and-GPU bound; small models (1B–8B) practical, larger ones constrained by 4 GB WebGPU buffer limits historically.
Quantization: MLC q4f16/q4f32 group quant.
Distinguishing feature: Zero-install for the end user — open a tab, run a model. No server, no API key, no upload.
When to use it on Mac: When the surface is the browser (web app, extension) and you want fully-client inference.
When NOT to use it: When you can ship a native app — the native Metal path will outperform WebGPU on the same hardware.
Maintainer voice: webllm.mlc.ai docs; MLC blog.

GPT4All

What it is: Consumer desktop app for local LLMs; ChatGPT-alternative framing.
Maintainer: Nomic AI (Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, Andriy Mulyar).
License + activity: MIT; active.
Where it runs: Mac / Windows / Linux desktops.
Inference characteristics: llama.cpp under the hood (their own fork); CPU-friendly defaults.
Quantization: GGUF.
Distinguishing feature: Was the first widely-distributed consumer “local ChatGPT” desktop app (mid-2023); now overshadowed by LM Studio / Jan but still maintained.
When to use it on Mac: When you want a simple desktop chat app with LocalDocs RAG.
When NOT to use it: Developer integration — Jan / Ollama / LM Studio have stronger APIs.
Maintainer voice: Nomic blog; GPT4All citation in the repo.

LocalAI

What it is: Self-hosted, OpenAI-API-compatible local AI server; multi-backend (llama.cpp, vLLM, transformers, diffusers, more — 36+ per README).
Maintainer: Ettore Di Giacinto (mudler).
License + activity: MIT; active.
Where it runs: Mac / Linux / Windows; primarily Docker-deployed.
Inference characteristics: Drop-in OpenAI API replacement; multi-modal (text, image, voice, embedding) via the backend abstraction.
Quantization: Whatever each backend supports.
Distinguishing feature: Broadest multi-modal coverage in a single OpenAI-API-compatible server. The “everything-shaped” local API.
When to use it on Mac: When you want a single API surface for chat + image + STT + TTS locally.
When NOT to use it: Pure chat — Ollama is lighter; pure performance — vLLM-direct is faster on Linux/NVIDIA.
Maintainer voice: mudler’s blog (mudler.pm); LocalAI docs.

Ratchet

What it is: Rust + WebGPU ML toolkit; “fast, cross-platform GPU-accelerated inference on native + browser.”
Maintainer: FL33TW00D under the huggingface org.
License + activity: Apache 2.0; active but small. Limited model coverage (Whisper, Phi 2/3, Moondream).
Where it runs: Anywhere WebGPU runs — including Mac natively via Metal-backed WebGPU.
Inference characteristics: Inference-only; quantization support; lazy compute graph.
Quantization: Per-model.
Distinguishing feature: Same Rust binary targets native and browser via WebGPU — a different cross-platform answer than llama.cpp.
When to use it on Mac: Niche — Rust + browser + ML in one stack.
When NOT to use it: Production LLM serving — model coverage too limited.
Maintainer voice: README; FL33TW00D on Twitter/X.

Cortex.cpp

Status: ARCHIVED 2025-07-04. Development moved to menloresearch/llama.cpp. Treat as legacy; don’t depend on it for new work.

Aphrodite Engine

What it is: vLLM fork with added quantization and sampling features; originated for PygmalionAI roleplay-chat serving.
Maintainer: PygmalionAI community.
License + activity: AGPL-3.0; active.
Where it runs: NVIDIA / AMD; Linux.
Inference characteristics: vLLM’s PagedAttention as the base. Adds AQLM, more quant formats, multi-LoRA, more sampling methods.
Quantization: AQLM, AWQ, GPTQ, Marlin, more.
Distinguishing feature: “Builds upon and integrates the exceptional work from various projects, primarily vLLM” [PygmalionAI/aphrodite-engine README] but with feature pile-ups vLLM proper rejects.
When to use it on Mac: No.
When NOT to use it: Mac. AGPL licensing also constrains adoption.

LightLLM

What it is: Python-based LLM inference and serving framework — “lightweight design, easy scalability, and high-speed performance.”
Maintainer: ModelTC (academic + industrial collaborators; ASPLOS ‘25 and ACL ‘25 papers on its scheduling and decoding work).
License + activity: Apache 2.0; active.
Where it runs: NVIDIA / AMD; Linux.
Inference characteristics: Continuous batching; some of its kernels are reused by vLLM upstream. Research-driven feature set.
Quantization: AWQ, GPTQ, FP8.
Distinguishing feature: Cross-fertilizes with vLLM both directions (“vLLM is listed as a project that uses some LightLLM’s kernel” — ModelTC/lightllm README).
When to use it on Mac: No.
When NOT to use it: Mac.

KoboldCpp

What it is: Pre-built llama.cpp fork bundled with a chat UI + roleplay / character-card features.
Maintainer: LostRuins / @concedo.
License + activity: AGPL-3.0; active.
Where it runs: Mac / Win / Linux; single-binary distribution.
Inference characteristics: llama.cpp under the hood.
Distinguishing feature: Roleplay + worldbuilding features that mainline llama.cpp doesn’t ship.
When to use it on Mac: Roleplay / fiction use cases.
When NOT to use it: Production app development — fork divergence, AGPL constraints.

text-generation-webui (“oobabooga”)

What it is: Multi-backend local LLM webui — historically the developer’s playground for trying many engines.
Maintainer: oobabooga (a16z-backed in 2023).
License + activity: AGPL-3.0; active (47k stars per repo).
Where it runs: Mac / Win / Linux.
Inference characteristics: Wraps llama.cpp, transformers, ExLlamaV3, TensorRT-LLM via a unified UI.
Distinguishing feature: Multi-backend swappability + an API. Older school but still influential.
When to use it on Mac: Experimenting with multiple backends from one UI.
When NOT to use it: Embedding in another app.

Part 3 — Kernel and library layer (one level down)

Engines aren’t monoliths — they compose kernels and libraries. Understanding the substrate explains why some engines are “fast” on certain hardware.

ggml

Maintainer: Georgi Gerganov.
Level: Tensor library — the layer beneath llama.cpp, whisper.cpp, stable-diffusion.cpp.
Design goals (from README): “Low-level cross-platform implementation; integer quantization support; broad hardware support; automatic differentiation; no third-party dependencies; zero memory allocations during runtime” [ggerganov/ggml README].
Used by: llama.cpp, whisper.cpp, stable-diffusion.cpp, bark.cpp, and a long tail of *.cpp apps.
Mac availability: Yes — Metal backend (ggml-metal.m).

FlashAttention 1 / 2 / 3

Maintainer: Tri Dao (Princeton, Together AI).
Level: Attention kernel.
Used by: vLLM, TGI, TensorRT-LLM, transformers (when CUDA), SGLang. llama.cpp’s CUDA path uses Flash-style attention; the Metal backend has a Flash-attention-shaped kernel toggled with -fa.
Mac availability: No direct port — FlashAttention is CUDA-specific (FA-2 Ampere, FA-3 Hopper-only). Mac equivalents are llama.cpp’s -fa Metal kernel and MLX’s fused attention.
Why it matters: Tri Dao’s FA-3 paper is the kernel-design state of the art. “Overlap overall computation and data movement via warp-specialization; interleave block-wise matmul and softmax operations; incoherent processing that leverages hardware support for FP8” → “1.5–2.0x faster than FlashAttention-2 with FP16 … 75% utilization of H100 theoretical max FLOPS” [Dao, tridao.me/blog/2024/flash3]. Mac equivalents lag this curve but follow the same architectural principles.

PagedAttention

Maintainer: Woosuk Kwon et al. (vLLM).
Level: KV-cache management algorithm + CUDA kernel.
Used by: vLLM (canonical), TensorRT-LLM (its own implementation), SGLang (combines with RadixAttention), TGI.
Mac availability: Influence-only — the algorithm is portable, but the optimized kernel is CUDA. llama.cpp’s KV cache uses unified contiguous allocation per slot, not paged blocks. As batched serving becomes relevant on Mac, expect paged-style management to land.

Marlin / Marlin-2

Maintainer: IST-DASLab (Elias Frantar, Dan Alistarh and team).
Level: FP16xINT4 matmul CUDA kernel.
Used by: vLLM, SGLang, TGI for INT4 inference paths.
Mac availability: No — Ampere/Hopper-specific. Mac’s equivalent is the K-quant / IQ-quant Metal kernels in ggml-metal.
Why it matters: Mathematically establishes that 4× weight-only quant can hold near-ideal speedup up to batch=16–32; the prior generation of 4-bit kernels died at batch=1–2. Sets the bar for what a good Metal Q4 matmul should look like.

Triton (OpenAI DSL)

Maintainer: Philippe Tillet (OpenAI). Now PyTorch’s primary kernel authoring path via torch.compile.
Level: Block-level GPU programming language; compiles to PTX (NVIDIA) / ROCm.
Used by: Most modern serving engines for some kernels. vLLM, SGLang, TGI all author or borrow Triton kernels.
Mac availability: No. Triton is CUDA/HIP-only. There is no Metal backend in mainline Triton; Apple Silicon kernel authoring on the Triton model would require a separate project.
Why it matters: Tillet’s model — “Blocked Program, Scalar Threads” vs CUDA’s “Scalar Program, Blocked Threads” — let the compiler handle “coalescing, thread swizzling, pre-fetching, and tensor core utilization” [triton-lang.org docs]. The fact that Mac has no equivalent (you write Metal Shading Language by hand) is part of why Mac kernel velocity lags NVIDIA’s.

CUTLASS

Maintainer: NVIDIA.
Level: C++ template library for GEMM and convolution; substrate beneath TensorRT, TensorRT-LLM, and many custom kernels.
Used by: TensorRT-LLM, vLLM’s Marlin path, custom CUDA kernels in research code.
Mac availability: No.

MPS / MPSGraph / MSL

Cross-reference apple-acceleration-frameworks.md. The Apple equivalent of CUTLASS / Triton / cuBLAS is a layered stack:

MSL (Metal Shading Language) — write a kernel manually.
MPS — Apple’s pre-baked GPU kernels (GEMM, attention primitives).
MPSGraph — graph-level ML primitives, like XLA/HLO for Apple GPUs.
BNNS / Accelerate / vDSP / AMX — CPU-side math libraries; AMX matters for prompt prefill on certain Mac chips.

Apple’s framing per WWDC: Core ML is the deployment runtime; MLX is research-and-dynamic; MPSGraph is the lower-level engine substrate. There is no Apple equivalent to Triton — kernel authoring goes through MSL or MLX’s higher-level fusion.

bitsandbytes kernels

Maintainer: Tim Dettmers + bitsandbytes-foundation contributors.
Level: 8-bit and 4-bit CUDA kernels exposed as PyTorch ops.
Used by: transformers + accelerate, QLoRA fine-tuning.
Mac availability: MPS work has landed in tree but is not the production-grade path. Functionally Mac-secondary.

Mojo / Mosaic / MAX

Maintainer: Modular Inc. (Chris Lattner et al.).
Level: A Python-superset systems language + an inference engine (MAX) trying to be the successor to Triton + CUDA + PyTorch all at once.
Used by: Modular’s MAX inference engine; some research adoption.
Mac availability: macOS support exists but Modular’s commercial focus is on NVIDIA/server. Watch but don’t bet on it for Mac LLM work in 2026.

Part 4 — Decision matrix

Use case	First-pick on Mac	First-pick elsewhere	Notes
Ship a Mac AI app to consumers	MLX (Swift bindings) with llama.cpp fallback	N/A	Pattern from `mlx.md`.
Max throughput on a single Mac Studio	MLX	vLLM (on H100)	MLX leads Mac throughput per arXiv:2511.05502.
Tiny embedded LLM inside another app	llama.cpp via `llama-cpp-2` Rust binding	Same	Linkable, one process, no daemon.
OpenAI-compatible API on localhost	Ollama (Mac), `mlx_lm.server`, or LocalAI	vLLM, TGI	Ollama for UX; mlx_lm.server for perf; LocalAI for multi-modal.
Fine-tune	MLX-LM LoRA, or QLoRA via bitsandbytes (if MPS path works)	bitsandbytes + transformers on NVIDIA	MLX-LM has working LoRA on Mac.
One codebase Win + Linux + Mac	MLC LLM, or llama.cpp via Rust binding	Same	MLC’s cross-target compile is its raison d’être.
Serve many concurrent users on a server	N/A on Mac	vLLM or SGLang	Continuous batching is a Linux/CUDA story.
Structured generation (JSON schema, grammar)	llama.cpp `--grammar` / `--json-schema`; MLX-LM (limited); Outlines wrapping llama.cpp	Outlines on vLLM; SGLang’s `regex`/`json` constraints	Apple Foundation Models has `@Generable` Swift macros as a first-party alternative.
Speculative decoding	llama.cpp `-md <draft>`; MLX `generate(draft_model=...)`	vLLM speculative decode	Pattern in `mac-llm-optimization.md` §3.
Multi-modal (vision + text)	llama.cpp llama-vlm; MLX-VLM	vLLM + LLaVA-style models	MLX-VLM covers Qwen-VL, Llama 3.2 Vision, Pixtral.
Voice in/out	Whisper.cpp / Whisper-MLX / Apple Speech (in) + `AVSpeechSynthesizer` or Kokoro (out)	Same; whisper.cpp universal	Cross-ref `whisper-and-stt-landscape.md`, `voice-to-voice-slms.md`.
Run inside a browser	WebLLM	Same	The MLC web target is the only credible browser path.
Multi-Mac for one big model	exo	N/A	Thunderbolt 5 RDMA on macOS 26.2+.
Models that don’t fit in RAM	AirLLM (degraded perf) or smaller quant	Same	Real answer: pick a smaller / more aggressively quantized model.
~3B-class summarization without shipping weights	Apple Foundation Models	N/A	Free, OS-bundled, no weight bundle to ship.

Part 5 — Quantization format compatibility map

Engine	GGUF	MLX-Q	AWQ	GPTQ	EXL2/3	bitsandbytes (NF4/INT8)	FP8	Native Core ML
llama.cpp	yes	no	partial (via converters)	no	no	no	partial (CUDA backend)	no
MLX / MLX-LM	no	yes	no	no	no	no	no	no
MLC LLM	no	no	partial	partial	no	no	partial	no
Ollama	yes (via llama.cpp)	no	no	no	no	no	no	no
LM Studio	yes	yes	no	no	no	no	no	no
Jan	yes	no	no	no	no	no	no	no
llamafile	yes	no	no	no	no	no	no	no
vLLM	no	no	yes	yes	no	yes	yes	no
SGLang	no	no	yes	yes	no	partial	yes	no
TGI	no	no	yes	yes	no	yes	yes	no
TensorRT-LLM	no	no	yes	yes	no	no	yes	no
ExLlamaV2/3	no	no	no	no	yes	no	no	no
Candle	yes	no	no	no	no	no	no	no
transformers + MPS	yes (loading only)	no	yes	yes	no	partial	no	no
WebLLM	no	no	no	no	no	no	no	no (MLC q4f16)
GPT4All	yes	no	no	no	no	no	no	no
LocalAI	yes	no	yes (via vLLM backend)	yes (via vLLM)	no	yes	yes	no
Apple Foundation Models	n/a	n/a	n/a	n/a	n/a	n/a	n/a	yes (opaque Apple format)
Apple Core ML	n/a	n/a	n/a	n/a	n/a	n/a	n/a	yes (palettized, per-axis linear, INT4)

Part 6 — Common confusions / what to ignore

“Engine X vs llama.cpp” benchmark posts are usually meaningless. They compare different quants (Q4_0 vs Q4_K_M vs MLX-Q4), different context lengths, different prompt-vs-generate ratios, and often a stale build of one side. Trust arXiv:2511.05502 and the maintainers’ own benchmarks; treat random Medium posts as noise.
“Best engine of 2024/2025/2026” listicles miss the architectural distinctions (server-batched vs local-single-stream). The question is wrong — the right question is “for what workload, on what hardware.”
Cortex.cpp is archived (July 2025). If you read about it in older notes, treat as legacy and use llama.cpp directly.
AutoGPTQ / GPTQ-for-LLaMA / AWQ-runner were the consumer NVIDIA quant runtimes pre-EXL2. Mostly absorbed into transformers + optimum or surpassed by EXL2/3. Use only if a specific model release insists.
MLC’s iOS / Android apps historically existed as demos but have been superseded by the unified MLCEngine path. Don’t try to use the old mlc-llm iOS app target — follow webllm.mlc.ai or the current iOS docs.
Cargo-cult wrapper-of-wrappers: Tools that wrap Ollama which wraps llama.cpp which wraps ggml — adding a fourth abstraction layer usually adds latency, hides bugs, and contributes nothing. Skip unless the wrapper provides a specific capability you need.
TensorRT-LLM’s research forks (e.g., DeepSpeed-MII, FasterTransformer) are NVIDIA-internal or research-only. Production NVIDIA = TensorRT-LLM, vLLM, or SGLang.
Comparing “llama.cpp Q4_0” against “MLX Q4” as if they’re the same — they’re not. Different rounding, different group sizes, different per-tensor metadata. Quality is comparable on standard benchmarks (per Hannun’s MLX-vs-GGUF benchmarks “within noise” on MMLU/HellaSwag) but the numbers are not bit-identical.
PyTorch MPS as a production target — it works for prototyping but the Mac perf gap is real. Don’t ship transformers+MPS into a consumer Mac app expecting MLX-class throughput.

Part 7 — Cross-cutting principles, from the legends

Gerganov on cross-platform vs Mac-optimized. No formal design doc from him — the code is the spec. The pattern across his discussions (#638 mmap, #5932 KV quant) is: keep the C/C++ core pure and dependency-free, push backends to compile flags, never compromise portability for a single-vendor win. The MLX lead is a deliberate trade — he could collapse Mac perf at the cost of cross-backend simplicity, and chose not to.
Hannun on why MLX exists alongside MPSGraph. No direct quoted comparison — the framing is implicit: MLX is “designed by machine learning researchers for machine learning researchers” [MLX README]. MPSGraph is Apple’s graph engine, not a programming surface researchers want to author against. MLX gives the NumPy/JAX shape with the unified-memory and lazy-eval primitives that the LLM workload specifically benefits from — and unlike MPSGraph it ships with a Python-first developer surface.
Tianqi Chen on compiler-first vs hand-written-kernel. Chen’s lineage (TVM 2018, TensorIR 2023, MLC) is a sustained bet that an IR + scheduling primitives + auto-tuning beats hand-written kernels at the long tail of hardware × model combinations. MLC’s unique value prop — one engine across NVIDIA / AMD / Intel / Apple / Adreno / Mali / WebGPU — is downstream of that bet. The cost is that on any single hardware, hand-written kernels (llama.cpp Metal, MLX, vLLM CUDA) tend to win on the most-common path.
Kwon on server inference as a different problem. vLLM’s whole thesis: server inference is a memory management problem, not a kernel problem — fragmentation wastes 60–80% of VRAM, PagedAttention recovers it [vllm.ai blog]. On Mac with batch=1 this lever doesn’t apply; the bottleneck is bandwidth, not fragmentation.
Dao on the kernel layer’s role. FA-3’s 75% H100 utilization at FP16 means attention is no longer the bottleneck on NVIDIA frontier hardware — the bottleneck shifts to GEMM, comms, and quant kernels. On Apple Silicon, where there is no FA-3-equivalent, attention is still a fertile optimization target. This is why MLX’s fused-attention work matters disproportionately to its perf lead.
Tunney on distribution as the bottleneck. Explicit thesis: “the bottleneck is mainly at the disk loading” / quality concerns matter once kernels are fast enough. Tunney’s llamafile bet was that getting the binary to the user mattered more than tok/s — and her matmul kernels (eventually upstreamed to llama.cpp) prove she can do both at once [Tunney, justine.lol/matmul].
Apple’s stance. Core ML is the deployment runtime; MLX is research and dynamic inference; Foundation Models is the OS-bundled LLM. There is no Apple equivalent to Triton — kernel authoring goes through MSL or MLX. Apple has not entered the “fastest open inference engine on Mac” race directly; they’re shipping the OS-bundled model and treating third-party LLMs as a system service via MLX.

Part 8 — What to track going forward

MLX releases: github.com/ml-explore/mlx and github.com/ml-explore/mlx-lm Releases tab; Awni Hannun on X (@awnihannun) and awnihannun.com. Cadence: multiple releases per month. Watch for: full fine-tuning (beyond LoRA), audio/vision parity with PyTorch, speculative-decode primitives.
llama.cpp releases: github.com/ggml-org/llama.cpp Releases + Discussions. No semver; pin commits. Watch for: Metal backend kernel updates, KV cache management innovations, GGUF spec changes.
MLC LLM: mlc.ai/blog, MLC engine releases. Watch for: WebGPU performance on Safari, iOS deployment polish.
vLLM: vllm.ai/blog and the vLLM PyTorch Foundation governance. Watch for: M-series support (unlikely but tracked).
SGLang: lmsys.org/blog, LMSYS Discord. Watch for: structured-generation primitives that local engines might adopt.
Apple Foundation Models: WWDC 2026 sessions (June). Watch for: larger model tier, additional adapter slots, multimodal expansion.
Tri Dao’s kernel work: tridao.me/blog, Together AI engineering blog. Watch for: FA-4 and whatever lands on Blackwell.
Tim Dettmers: timdettmers.com, bitsandbytes-foundation. Watch for: MPS quant kernels making it to production grade.
Justine Tunney: justine.lol, Mozilla Ocho. Watch for: post-llamafile distribution experiments and CPU kernel work.
exo labs: github.com/exo-explore/exo Releases, exo Twitter. Watch for: macOS 26.2 Thunderbolt 5 RDMA real-world numbers.
NVIDIA: NVIDIA GTC (spring + fall), NVIDIA Dev Blog, TensorRT-LLM Releases. Watch for: FP4 kernels, Blackwell architecture-specific paths — they set the upper-bound curve everyone else chases.
Modular (Mojo / MAX): modular.com/blog. Watch but don’t bet — multiple pivots already.

Specific learnings for Locara

Lock in MLX-default with llama.cpp-fallback on Mac. This was already the mlx.md recommendation; this survey reinforces it. The arXiv:2511.05502 throughput lead is real, MLX’s Swift bindings make embedding clean, and the mlx-community Hugging Face mirror covers the models Locara cares about. llama.cpp via the Rust llama-cpp-2 binding is the fallback for non-ported models.
The Locara runtime is the inference-engine arbiter — apps don’t pick. Per the manifest model, an app declares “I need text-to-text” and the runtime picks MLX vs llama.cpp based on availability of the specific model. This shields apps from engine churn (Cortex.cpp’s archival, future engine consolidation) and lets the runtime pool models across apps.
Don’t depend on Ollama as a hard dep. It’s fine as an optional model source (the runtime can read Ollama’s content-addressed store at ~/.ollama/models/), but Locara apps should be able to function without it installed. Ollama’s per-process daemon model conflicts with Locara’s per-app memory governance.
Don’t ship Cortex.cpp anywhere. Archived in July 2025. Even Jan moved off it. Use llama.cpp directly.
PagedAttention/RadixAttention are not Mac patterns yet — but the shared-prefix cache idea is. Locara should expose a runtime-level keep capability so apps can mark long system prompts + few-shot examples for the runtime to cache across calls. llama.cpp’s --prompt-cache is the primitive; the Locara runtime exposes it as app.cache.preserve(prefix:).
Apple Foundation Models is a real alternative for small tasks. Apps that need “classify this email” / “extract this date” / “summarise this paragraph” should use FM via Locara’s @Generable-equivalent abstraction — no weights to ship, no RAM cost. The manifest declares useFoundationModel: true and the runtime checks the user’s Apple Intelligence eligibility at install time.
WebLLM is the Locara browser-extension story. If a future Locara surface is a Safari/Chrome extension (rather than a Tauri app), WebLLM via MLC is the only credible path. Worth tracking but not v1.
exo is the “DeepSeek-V3 on two Macs” story. Apps that target the 405B / 671B class will eventually want this; Locara should manifest-declare distributable workloads so a power user with multiple Macs can spread inference automatically. Not v1, but architectural awareness now means the manifest won’t need a breaking change later.
The runtime should expose engine + kernel telemetry via signposts (per mac-performance-profiling.md). When a user reports slow tok/s, the support flow should be able to capture which engine was used, which model file (path + sha256), what KV cache type, and what recommendedMaxWorkingSetSize the GPU was operating against.
Educate app authors on the architectural distinctions. Most app developers don’t know the difference between “server inference” (PagedAttention, continuous batching, RadixAttention) and “local inference” (memory-bound batch=1 decode). The Locara docs should cover this directly — a short version of this note, focused on “you are not running a server.”

References

Primary maintainer voices:

Georgi Gerganov (llama.cpp / ggml): github.com/ggml-org/llama.cpp README + Discussions (#638 mmap, #5932 KV quant); github.com/ggerganov/ggml README.
Awni Hannun (MLX): github.com/ml-explore/mlx + ml-explore/mlx-lm READMEs; awnihannun.com; @awnihannun on X. “MLX is designed by machine learning researchers for machine learning researchers” — MLX README.
Justine Tunney (llamafile): justine.lol blog — especially justine.lol/matmul (matrix-multiplication kernel writeup), justine.lol/oneliners (single-file binaries). Mozilla hacks.mozilla.org/2023/11/introducing-llamafile/ announce post.
Tianqi Chen (MLC / TVM): TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 2018); TensorIR: An Abstraction for Automatic Tensorized Program Optimization (ASPLOS 2023). mlc.ai/blog. mlc-ai/mlc-llm README.
Woosuk Kwon, Zhuohan Li et al. (vLLM): Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). vllm.ai/blog.
Lianmin Zheng, Ying Sheng (SGLang): SGLang: Efficient Execution of Structured Language Model Programs (arXiv 2312.07104; published 2024). lmsys.org/blog/2024-01-17-sglang.
Olivier Dehaene (TGI): HuggingFace blog (huggingface.co/blog, search “tgi”); TGI repo at huggingface/text-generation-inference.
Tri Dao (FlashAttention): FlashAttention (NeurIPS 2022); FlashAttention-2 (arXiv 2307.08691); FlashAttention-3 (tridao.me/blog/2024/flash3).
Tim Dettmers (bitsandbytes): timdettmers.com/2022/08/17/llm-int8-and-emergent-features (LLM.int8() writeup); LLM.int8() paper (NeurIPS 2022); QLoRA paper (NeurIPS 2023).
Philippe Tillet (Triton): triton-lang.org docs; original OSDI / MLSys talks; CUDA MODE Discord/talks.
turboderp (ExLlamaV2/EXL3): turboderp-org/exllamav2 README; repo discussions.
Laurent Mazare (Candle): huggingface/candle README; HF blog posts on Candle.
Apple WWDC (Core ML / MLX / Foundation Models): WWDC 2017–2025 ML sessions; developer.apple.com/documentation/coreml; developer.apple.com/documentation/foundationmodels. Cross-ref apple-foundation-models.md, apple-acceleration-frameworks.md.
Gavin Li (AirLLM): lyogavin/airllm README; Anima AI Medium posts.
exo labs: github.com/exo-explore/exo README + Twitter/X.
mudler (LocalAI): mudler.pm blog; LocalAI docs.
Menlo Research (Jan / Cortex.cpp): github.com/menloresearch/jan README and blog; cortex.cpp archived 2025-07-04 per the repo notice.
Nomic (GPT4All): Nomic blog and the GPT4All citation in repo.

Benchmark / measurement references:

arXiv:2511.05502 Production-Grade Local LLM Inference on Apple Silicon — quoted MLX vs llama.cpp throughput on M-series.
vLLM blog (vllm.ai/blog) — PagedAttention vs TGI/HF throughput numbers.
LMSYS blog (lmsys.org/blog) — SGLang vs vLLM throughput numbers.
AnandTech archive (closed Aug 2024) — Andrei Frumusanu’s M-series core/throughput analyses.
Chips and Cheese (chipsandcheese.com) — M-series microarchitecture.

Caveats / gaps:

No primary maintainer quote for LM Studio. Element Labs has not published a design doc.
No primary maintainer quote for the WebLLM project specifically — folded into MLC LLM docs and Tianqi Chen’s lineage.
bitsandbytes MPS state in 2026 is fluid — primary source is the repo’s MPS PRs, not a settled doc.
exo’s RDMA-over-Thunderbolt-5 numbers are exo-labs-reported only; not independently verified.
arXiv:2511.05502 MLX-vs-llama.cpp throughput numbers — methodology hasn’t been independently audited but the qualitative MLX lead aligns with community reports.
The 8-bucket taxonomy in Part 1 is this note’s framing, not a maintainer’s. Useful for engineers picking a runtime; doesn’t correspond to any project’s self-description.