llama.cpp
What it is: A C/C++ inference engine for transformer LLMs, originally a port of Llama by Georgi Gerganov (“ggerganov”). Now the dominant open-source local inference runtime: powers Ollama, LM Studio, Jan, Open WebUI, and most of the “run a local model” tooling. The associated GGUF file format is the de facto local model packaging standard.

Status: MIT-licensed, extremely active (multiple commits per day), still ggerganov-led with a growing contributor base.

Most relevant to Locara: This is the actual kernel layer Locara runs on. Other notes (Ollama, LM Studio, Jan) are consumers of llama.cpp; this note is about the substrate itself.
Background
ggerganov posted the first commit to llama.cpp in March 2023, days after Llama 1’s weights leaked. The original goal was modest — get Llama running on a MacBook in pure C++. Within months it had Metal acceleration, CUDA support, and a quantization scheme (Q4) that put 7B-class models within reach of consumer hardware. The underlying tensor library (ggml) is now a separate project; llama.cpp is the inference layer on top.
By 2025 it’s the most portable local LLM runtime in existence: CPU (AVX2/AVX-512/NEON), Metal, CUDA, ROCm, Vulkan, SYCL, and CANN backends, all maintained in one codebase.
Key design decisions
- Pure C/C++, no Python. Self-contained. Cross-compiles to virtually anything.
- Custom tensor library (`ggml`). Hand-written kernels, no PyTorch dependency. Authored by ggerganov initially.
- GGUF file format: single-file model bundle containing weights + tokenizer + chat template + metadata. Successor to the original GGML format (which was deprecated cleanly in 2023).
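- K-quants (Q4_K_M, Q5_K_M, Q6_K, etc.): block-wise quantization in 256-weight super-blocks, with quantized scales per sub-block. Q8_0, usually listed alongside them, is the older and simpler format with one scale per 32-weight block. Q4_K_M is the de facto sensible default.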
- OpenAI-compatible server (`llama-server`): a built-in HTTP server exposing `/v1/chat/completions`.
- Tooling parity for additions: vision (LLaVA, llama-vlm), embeddings, speculative decoding, draft models, tool calling, structured output (JSON schema constraints) all bolted on as the model architectures evolved.
- MIT license. No CLA.
- Single-developer-led with a growing committer base. ggerganov sets direction; specific kernel and feature work is increasingly distributed (Slaren, ngxson, JohannesGaessler, many others).
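
To make the "one scale per block" idea from the quantization bullet concrete, here is a minimal sketch in plain Python. It is an illustration of the technique, not ggml's actual code: it assumes a Q8_0-style layout (32 weights per block, one scale, signed 8-bit values) and keeps the scale as a Python float rather than the fp16 value a real file stores. All function names are hypothetical.

```python
import math

BLOCK_SIZE = 32  # assumption: 32 weights per block, as in Q8_0-style formats


def quantize_block(weights):
    """Map one block of floats to (scale, int8 values in [-127, 127])."""
    amax = max(abs(w) for w in weights)
    # One scale per block; an all-zero block gets a dummy scale of 1.0.
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from a quantized block."""
    return [scale * v for v in q]


def quantize(weights, block_size=BLOCK_SIZE):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Concatenate the dequantized blocks back into one flat list."""
    out = []
    for scale, q in blocks:
        out.extend(dequantize_block(scale, q))
    return out


def max_abs_error(a, b):
    """Largest element-wise difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    weights = [0.5 * math.sin(i) for i in range(64)]
    restored = dequantize(quantize(weights))
    print("max abs error:", max_abs_error(weights, restored))
```

Because each block carries its own scale, an outlier only degrades precision for its 32 neighbours rather than for the whole tensor. The K-quant formats refine this with super-blocks and quantized scales, which this sketch does not attempt to reproduce.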
What worked
- First-mover advantage at the right moment. When Llama 1 weights dropped, llama.cpp was the only viable local runtime. By the time others appeared, llama.cpp was already the substrate they built on.
- Apple Silicon Metal backend was excellent and shipped early — single biggest reason “local LLMs on Mac” became real.
- GGUF adoption is essentially universal for community-quantized models. Hugging Face surfaces GGUF natively; model authors publish GGUF alongside safetensors.
- Cross-backend coverage is unmatched. No other local runtime supports as many hardware targets in one codebase.
- Aggressive performance work — kernel fusion, KV cache quantization, grouped-query attention, paged attention, draft model speculative decode all landed in 2023–25.
- Permissive licensing + visible activity = healthy contributor flywheel.
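
Part of why GGUF spread so easily is that a file is self-describing from its first bytes. The sketch below reads only the fixed-size header, assuming the little-endian layout used by GGUF version 2 and later (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count). The function name is hypothetical, and it does not parse the metadata or tensor tables that follow the header.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header from the first bytes of a file.

    Assumes little-endian byte order and the v2+ field widths.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to contain a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic)")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    # Synthetic header for demonstration; real use would read from a file:
    #   with open(path, "rb") as f: read_gguf_header(f.read(HEADER_SIZE))
    fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
    print(read_gguf_header(fake))
```

A cheap check like this is enough for a loader to reject a wrong or truncated download before handing the file to the runtime.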
What failed / criticisms
- API churn. The C API, the server’s HTTP API, and command-line flags have all changed multiple times. No semver discipline; downstream consumers pin commits.
- Bus factor. ggerganov is the linchpin. Other contributors are real but project shape comes from him. If he stepped away, direction would fragment.
- Documentation lags features. Examples in the repo are the primary documentation; the README is partial.
- MLX outperforms it on M-series for some workloads. llama.cpp’s Metal backend is good but not always the throughput leader on Apple Silicon.
- Subtle numeric bugs in the Metal backend have shipped historically (fp16 accumulation issues, split-k matmul edge cases). Less frequent now but a real-world cost.
- Weights licensing handwave — same story as Ollama. The runtime is permissively licensed; the models people run on it often aren’t.
- No formal release schedule. “Pull master” is the actual recommended cadence, with all that implies.
Specific learnings for Locara
- llama.cpp is the kernel layer to depend on, not own. Building competing C++ inference is years of work. Locara should consume llama.cpp through a Rust binding (e.g., `llama-cpp-2` / `llama_cpp` crate) and treat it as substrate. Locara is the layer above; the kernel layer is upstream’s job.
- Pin a specific commit per Locara release. llama.cpp doesn’t promise stability. The Locara runtime should pin a known-good commit, run a regression suite against it, and bump deliberately on a Locara cadence, not on every upstream commit.
- GGUF is the model file format to bet on. Don’t invent a Locara-native format. The model manifest references GGUF blobs in the content-addressed cache.
- MLX-default-with-llama.cpp-fallback is correct. MLX wins throughput on M-series for the workloads where it matters; llama.cpp wins portability and breadth. On Mac v1, MLX is the path of least surprise; llama.cpp is the floor that always works.
- Q4_K_M is the safe default to declare in the manifest. Pin specific quants per device class, validated. Don’t let app authors pick arbitrary quants without justification.
- Contribute upstream where Locara hits issues. Don’t fork unless absolutely necessary — the upstream cadence is faster than any fork’s. But have a plan if forking becomes unavoidable.
- Bus factor mitigation = treat upstream as critical infrastructure. Vendor a copy. Run mirrors. Have a continuity plan if the project’s leadership disperses. This is the same hardening Apple does for any single-vendor dependency.
- `llama-server` is not the daemon Locara should ship. It’s underpolished compared to a Rust-native server, and Locara’s daemon (when it eventually exists) wants tighter capability integration. Use llama.cpp as a library, not as a binary.
- Watch the kernel-fusion frontier. FlashAttention, paged attention, speculative decode, and draft models all happen in llama.cpp / MLX / vLLM, not in app frameworks. Locara’s job is to expose them via the manifest (e.g., “this app benefits from speculative decode with Llama-3-1B-draft”), not implement them.
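
Several of the learnings above (pin a commit, reference GGUF blobs by content address, restrict quants to a validated set) reduce to one load-time check. The following is a rough sketch of that check under invented assumptions: the `Manifest` and `ModelEntry` shapes, the field names, and the allow-list are all hypothetical and do not describe an existing Locara format.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical allow-list of quants validated for some device class.
ALLOWED_QUANTS = {"Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"}


@dataclass(frozen=True)
class ModelEntry:
    name: str
    quant: str
    sha256: str  # content address of the GGUF blob


@dataclass(frozen=True)
class Manifest:
    llama_cpp_commit: str  # upstream commit this release was validated against
    model: ModelEntry


def blob_digest(blob: bytes) -> str:
    """Content address of a blob: hex-encoded SHA-256 of its bytes."""
    return hashlib.sha256(blob).hexdigest()


def validate(manifest: Manifest, blob: bytes, runtime_commit: str) -> list:
    """Return a list of human-readable problems; empty means OK to load."""
    problems = []
    if runtime_commit != manifest.llama_cpp_commit:
        problems.append("runtime was built from a different llama.cpp commit")
    if manifest.model.quant not in ALLOWED_QUANTS:
        problems.append(f"quant {manifest.model.quant} is not validated")
    if blob_digest(blob) != manifest.model.sha256:
        problems.append("blob does not match its content address")
    return problems


if __name__ == "__main__":
    blob = b"placeholder bytes standing in for a GGUF file"
    entry = ModelEntry("example-model", "Q4_K_M", blob_digest(blob))
    manifest = Manifest(llama_cpp_commit="<pinned-commit>", model=entry)
    print(validate(manifest, blob, runtime_commit="<pinned-commit>"))
```

Returning a list of problems rather than raising on the first one lets a caller report every mismatch at once, which is useful when bumping the pinned commit and re-validating a batch of models.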
References
- https://github.com/ggerganov/llama.cpp
- https://github.com/ggerganov/ggml
- https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md (GGUF spec)
- https://huggingface.co/docs/hub/en/gguf (HF’s GGUF integration)
- ggerganov on Twitter/X
- llama.cpp’s k-quantization PRs and discussions (search “k-quants” in repo)
- Tri Dao’s FlashAttention papers — the upstream techniques llama.cpp eventually integrates