Locara

llama.cpp

What it is: A C/C++ inference engine for transformer LLMs, originally a port of Llama by Georgi Gerganov (“ggerganov”). Now the dominant open-source local inference runtime: powers Ollama, LM Studio, Jan, Open WebUI, and most of the “run a local model” tooling. The associated GGUF file format is the de facto local model packaging standard.

Status: MIT-licensed, extremely active (multiple commits per day), still ggerganov-led with a growing contributor base.

Most relevant to Locara: This is the actual kernel layer Locara runs on. Other notes (Ollama, LM Studio, Jan) are consumers of llama.cpp; this note is about the substrate itself.

Background

ggerganov posted the first commit to llama.cpp in March 2023, days after Llama 1’s weights leaked. The original goal was modest: get Llama running on a MacBook in plain C/C++. Within months it had Metal acceleration, CUDA support, and 4-bit quantization (Q4) that put 7B-class models within reach of consumer hardware. The underlying tensor library (ggml) is now a separate project; llama.cpp is the inference layer on top.

By 2025 it’s the most portable local LLM runtime in existence: CPU (AVX2/AVX-512/NEON), Metal, CUDA, ROCm, Vulkan, SYCL, and CANN backends, all maintained in one codebase.

Key design decisions

  • Pure C/C++, no Python. Self-contained. Cross-compiles to virtually anything.
  • Custom tensor library (ggml). Hand-written kernels, no PyTorch dependency. Authored by ggerganov initially.
  • GGUF file format. Single-file model bundle: weights + tokenizer + chat template + metadata. Successor to the original GGML format, which was deprecated in 2023.
  • K-quants (Q4_K_M, Q5_K_M, Q6_K, and others), alongside older formats such as Q8_0. Block-wise quantization with per-block scaling; the K-quants additionally group blocks into super-blocks with their own quantized scales. Q4_K_M is the de facto sensible default.
  • OpenAI-compatible server (llama-server). A built-in HTTP server exposing /v1/chat/completions.
  • Feature breadth. Vision (LLaVA-style multimodal models), embeddings, speculative decoding with draft models, tool calling, and structured output (JSON schema constraints) were all bolted on as model architectures evolved.
  • MIT license. No CLA.
  • Single-developer-led with a growing committer base. ggerganov sets direction; specific kernel and feature work is increasingly distributed (slaren, ngxson, JohannesGaessler, many others).
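
The "block-wise quantization with per-block scaling" idea in the list above can be sketched in a few lines. This is a minimal illustration of the simpler Q8_0-style layout (32 weights per block, one fp16 scale, 8-bit codes), not the K-quant super-block scheme and not ggml's actual kernel code; the function names are made up for this note.

```python
import struct

BLOCK = 32  # weights per block, as in the Q8_0 layout


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_block(weights):
    """Quantize one block of 32 floats to (fp16 scale, 32 signed 8-bit codes)."""
    assert len(weights) == BLOCK
    amax = max(abs(w) for w in weights)
    scale = to_fp16(amax / 127.0)  # the scale itself is stored in half precision
    if scale == 0.0:  # all-zero block, or a scale too small for fp16
        return 0.0, [0] * BLOCK
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from a scale and its integer codes."""
    return [scale * q for q in codes]


def bits_per_weight(code_bits=8, scale_bytes=2, block=BLOCK):
    """Storage cost per weight, including the amortized per-block scale."""
    return (scale_bytes * 8 + code_bits * block) / block


# Toy block of weights in [-0.32, 0.31]; the values are arbitrary.
weights = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Rough size of a 7B-parameter model with 4-bit codes and a 2-byte scale per block.
q4_size_gb = 7e9 * bits_per_weight(code_bits=4) / 8 / 1e9
```

The same arithmetic explains why 4-bit quantization mattered: 4-bit codes plus a 2-byte scale per 32 weights come to 4.5 bits per weight, so a 7B model is roughly 3.9 GB instead of about 14 GB at fp16.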

What worked

  • First-mover advantage at the right moment. When Llama 1 weights dropped, llama.cpp was the only viable local runtime. By the time others appeared, llama.cpp was already the substrate they built on.
  • Apple Silicon Metal backend was excellent and shipped early — single biggest reason “local LLMs on Mac” became real.
  • GGUF adoption is essentially universal for community-quantized models. Hugging Face surfaces GGUF natively; model authors publish GGUF alongside safetensors.
  • Cross-backend coverage is unmatched. No other local runtime supports as many hardware targets in one codebase.
  • Aggressive performance work: kernel fusion, KV cache quantization, grouped-query attention support, FlashAttention kernels, and draft-model speculative decoding all landed between 2023 and 2025.
  • Permissive licensing + visible activity = healthy contributor flywheel.
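
Of the performance techniques listed above, draft-model speculative decoding is probably the least self-explanatory. The sketch below shows the greedy variant with toy stand-in "models" (plain functions from a token list to the next token). It is an assumption-laden illustration of the idea, not llama.cpp's implementation: a cheap draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is kept.

```python
def greedy_decode(model, prompt, n_new):
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding: the output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    stats = {"verify_rounds": 0, "accepted": 0}
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(min(k, goal - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks the proposal. A real runtime scores all
        #    positions in one batched forward pass; this toy calls it per token.
        stats["verify_rounds"] += 1
        all_accepted = True
        for tok in proposal:
            expected = target(out)
            if expected != tok:
                out.append(expected)  # keep the target's token, drop the rest
                all_accepted = False
                break
            out.append(tok)
            stats["accepted"] += 1
        # 3. If everything matched, the same target pass yields one extra token.
        if all_accepted and len(out) < goal:
            out.append(target(out))
    return out, stats


# Toy models: the draft agrees with the target except at some positions.
def target_model(ctx):
    return (ctx[-1] * 7 + 3) % 50


def draft_model(ctx):
    tok = target_model(ctx)
    return (tok + 1) % 50 if len(ctx) % 5 == 0 else tok


baseline = greedy_decode(target_model, [1, 2, 3], 20)
spec, stats = speculative_decode(target_model, draft_model, [1, 2, 3], 20, k=4)
```

The speed-up comes from the number of target verification rounds being smaller than the number of generated tokens whenever the draft agrees often enough; correctness does not depend on the draft's quality.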

What failed / criticisms

  • API churn. The C API, the server’s HTTP API, and command-line flags have all changed multiple times. No semver discipline; downstream consumers pin commits.
  • Bus factor. ggerganov is the linchpin. Other contributors are real, but the project’s shape comes from him. If he stepped away, direction would likely fragment.
  • Documentation lags features. Examples in the repo are the primary documentation; the README is partial.
  • MLX outperforms it on M-series for some workloads. llama.cpp’s Metal backend is good but not always the throughput leader on Apple Silicon.
  • Subtle numeric bugs in the Metal backend have shipped historically (fp16 accumulation issues, split-k matmul edge cases). Less frequent now but a real-world cost.
  • Weights licensing handwave — same story as Ollama. The runtime is permissively licensed; the models people run on it often aren’t.
  • No formal release schedule. “Pull master” is the actual recommended cadence, with all that implies.
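
The specific Metal bugs mentioned above are not detailed in this note, but the general fp16 accumulation failure mode is easy to reproduce. The sketch below simulates half-precision rounding with Python's struct module (format "e") and compares an fp16 accumulator against a wider one; it illustrates the class of problem under that assumption, not any particular llama.cpp kernel.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_fp16_accumulator(values):
    """Accumulate in fp16: the running sum is rounded after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_wide_accumulator(values):
    """Accumulate fp16 inputs in a wider type and round once at the end."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)


ones = [1.0] * 4096
narrow = sum_fp16_accumulator(ones)  # stalls once the fp16 spacing exceeds 1.0
wide = sum_wide_accumulator(ones)
```

Above 2048 the spacing between adjacent fp16 values is 2, so adding 1.0 to an fp16 running sum no longer changes it. Long dot products in a matmul hit the same wall, which is why accumulator width matters and why a regression suite should compare outputs against a higher-precision reference with a tolerance.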

Specific learnings for Locara

  1. llama.cpp is the kernel layer to depend on, not own. Building a competing C++ inference engine is years of work. Locara should consume llama.cpp through a Rust binding (e.g., the llama-cpp-2 or llama_cpp crates) and treat it as substrate. Locara is the layer above; the kernel layer is upstream’s job.
  2. Pin a specific commit per Locara release. llama.cpp doesn’t promise stability. The Locara runtime should pin a known-good commit, run a regression suite against it, and bump deliberately on a Locara cadence — not on every upstream commit.
  3. GGUF is the model file format to bet on. Don’t invent a Locara-native format. The model manifest references GGUF blobs in the content-addressed cache.
  4. MLX-default-with-llama.cpp-fallback is correct. MLX wins throughput on M-series for the workloads where it matters; llama.cpp wins portability and breadth. On Mac v1, MLX is the path of least surprise; llama.cpp is the floor that always works.
  5. Q4_K_M is the safe default to declare in the manifest. Pin specific quants per device class, validated. Don’t let app authors pick arbitrary quants without justification.
  6. Contribute upstream where Locara hits issues. Don’t fork unless absolutely necessary — the upstream cadence is faster than any fork’s. But have a plan if forking becomes unavoidable.
  7. Bus factor mitigation = treat upstream as critical infrastructure. Vendor a copy. Run mirrors. Have a continuity plan if the project’s leadership disperses. This is the kind of hardening large platform vendors apply to any single-vendor dependency.
  8. llama-server is not the daemon Locara should ship. It’s underpolished compared to a Rust-native server, and Locara’s daemon (when it eventually exists) wants tighter capability integration. Use llama.cpp as a library, not as a binary.
  9. Watch the kernel and decoding frontier. FlashAttention, paged attention, and speculative decoding with draft models all happen in llama.cpp / MLX / vLLM, not in app frameworks. Locara’s job is to expose them via the manifest (e.g., “this app benefits from speculative decode with a Llama-3.2-1B draft model”), not implement them.
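
Learnings 3 to 5 can be made concrete with a small sketch: a manifest entry that names a GGUF blob by digest, a content-addressed cache that verifies it, and a backend picker that prefers MLX on Apple Silicon with llama.cpp as the floor. Everything here (the ModelRef fields, the cache layout, the function names) is hypothetical and only meant to show the shape, not a proposed Locara API. The one external fact relied on is that GGUF files begin with the four bytes "GGUF".

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


@dataclass(frozen=True)
class ModelRef:
    """Hypothetical manifest entry; field names are illustrative only."""
    name: str
    sha256: str
    quant: str = "Q4_K_M"  # validated default quant for this device class


def blob_path(cache_root: Path, digest: str) -> Path:
    """Blobs are addressed by content digest, never by model name."""
    return cache_root / "blobs" / "sha256" / digest


def file_digest(path: Path) -> str:
    """Stream the file through SHA-256 so multi-GB models are not loaded whole."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def store_blob(cache_root: Path, data: bytes) -> str:
    """Write bytes into the cache and return their digest."""
    digest = hashlib.sha256(data).hexdigest()
    path = blob_path(cache_root, digest)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_bytes(data)
    return digest


def verify_blob(cache_root: Path, ref: ModelRef) -> bool:
    """True only if the blob exists, looks like GGUF, and matches its digest."""
    path = blob_path(cache_root, ref.sha256)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        if f.read(4) != GGUF_MAGIC:
            return False
    return file_digest(path) == ref.sha256


def pick_backend(is_apple_silicon: bool, mlx_available: bool) -> str:
    """MLX by default on Apple Silicon; llama.cpp is the fallback everywhere."""
    return "mlx" if is_apple_silicon and mlx_available else "llama.cpp"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    fake_model = GGUF_MAGIC + b"\x03\x00\x00\x00" + b"placeholder, not a real model"
    digest = store_blob(cache, fake_model)
    present_ok = verify_blob(cache, ModelRef("demo", digest))
    missing_ok = verify_blob(cache, ModelRef("demo", "0" * 64))
    not_gguf = store_blob(cache, b"safetensors-ish bytes")
    wrong_format_ok = verify_blob(cache, ModelRef("demo", not_gguf))
```

Keying the cache on the digest means two apps that declare the same GGUF blob share one copy on disk, and a corrupted or swapped file fails verification before it ever reaches the runtime.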

References