MLX
What it is: Apple’s open-source array framework for machine learning on Apple Silicon. NumPy-like API in Python, Swift, and C++; lazy evaluation; dynamic computation graphs; designed from scratch around the unified memory architecture of M-series chips.
Status: MIT-licensed, very active. Authored by Apple’s machine learning research team (Awni Hannun et al.); first public commits November 2023, announced publicly December 5, 2023. Now the throughput leader for local LLM inference on Apple Silicon for sub-14B models.
Most relevant to Locara: This is Locara’s stated default inference path on Apple Silicon. The 2026 throughput data makes the “MLX-default-with-llama.cpp-fallback” choice load-bearing, not a coin flip.
Background
MLX appeared in late 2023 from Apple’s ML research group, developed with equal contribution by Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. Apple was unusual among major AI vendors in not previously having a public ML framework (Core ML is a deployment runtime, not a training framework); MLX was the first open invitation for the broader research community to develop on Apple Silicon, not just deploy to it.
The framework’s design choices follow directly from Apple Silicon’s physics: unified memory means no CPU↔GPU copies, so the runtime can hand any array to any device without serialization. Lazy evaluation lets the graph defer materialization until needed, fusing kernels opportunistically. The Python API is deliberately NumPy-shaped to make porting research code trivial.
By early 2026, MLX is the throughput leader on Apple Silicon for the model sizes Locara cares about. A widely cited 2025 benchmark study (arXiv 2511.05502, “Production-Grade Local LLM Inference on Apple Silicon”) puts MLX at ~230 tok/s vs llama.cpp’s ~150 tok/s on identical hardware for short-context workloads, with throughput leads of 20–87% over llama.cpp across models from Qwen3-0.6B to Nemotron-30B.
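To see how the two headline figures relate to the quoted range, the relative lead can be worked out directly. This is plain arithmetic on the approximate numbers quoted above, not a reproduction of the benchmark; the function name is illustrative.

```python
def relative_lead(fast_tok_s: float, slow_tok_s: float) -> float:
    """Return how much faster `fast_tok_s` is than `slow_tok_s`, in percent."""
    return (fast_tok_s / slow_tok_s - 1.0) * 100.0


# Approximate short-context figures quoted in the text (tokens per second).
mlx_tok_s = 230.0
llama_cpp_tok_s = 150.0

lead_pct = relative_lead(mlx_tok_s, llama_cpp_tok_s)
print(f"MLX lead: {lead_pct:.0f}%")  # about 53%, inside the quoted 20-87% range
```

The short-context pair therefore sits near the middle of the reported range rather than at either extreme.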
Key design decisions
- Unified memory as a first-class assumption. Arrays exist in a single address space; CPU and GPU paths share data without copies.
- Lazy evaluation. Computations build a graph; arrays only materialize when their values are needed (printed, evaluated explicitly with `mx.eval`, or converted to a NumPy array or Python scalar). Feeding an array to another op only extends the graph.
- Dynamic graphs. Shape changes don’t trigger expensive recompilation, unlike XLA/JAX.
- NumPy-shaped API in Python, with `mx` as the conventional alias (vs. `np`). Function-and-array style, not module-and-tensor.
- Swift and C++ bindings for native app integration. The Swift API mirrors Python.
- MLX-LM — the first-party LLM package: weight conversion, generation, fine-tuning, LoRA adapters.
- MIT license, public GitHub development, Apple maintainers actively review community PRs.
- mlx-community on Hugging Face — a parallel quantized-weights ecosystem to GGUF, with MLX-format weights uploaded by the community.
- No production CUDA path. MLX is designed around Apple Silicon. (There’s an experimental CUDA backend, but no commitment to parity with the Metal path.)
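The lazy-evaluation idea in the list above can be pictured with a toy model in plain Python. This is only a sketch of deferred graph evaluation: `LazyValue`, `executed`, and the op names are hypothetical, and nothing here reflects MLX's actual internals or API. In MLX itself the explicit trigger is `mx.eval`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

executed: List[str] = []  # records which ops actually ran, for illustration


@dataclass
class LazyValue:
    """A node in a toy computation graph (not MLX's real data structure)."""
    value: Optional[float] = None          # set for leaves, or after eval()
    op: Optional[Callable[..., float]] = None
    name: str = "leaf"
    inputs: Tuple["LazyValue", ...] = ()

    def __add__(self, other: "LazyValue") -> "LazyValue":
        # Building the node does no arithmetic; it only records the op.
        return LazyValue(op=lambda a, b: a + b, name="add", inputs=(self, other))

    def __mul__(self, other: "LazyValue") -> "LazyValue":
        return LazyValue(op=lambda a, b: a * b, name="mul", inputs=(self, other))

    def eval(self) -> float:
        """Materialize this node, evaluating its inputs first."""
        if self.value is None:
            args = [node.eval() for node in self.inputs]
            executed.append(self.name)
            self.value = self.op(*args)
        return self.value


a, b = LazyValue(2.0), LazyValue(3.0)
c = a * b + a                      # graph is built, nothing is computed yet
ops_before_eval = list(executed)   # still empty at this point
result = c.eval()                  # forces mul, then add
print(ops_before_eval, executed, result)
```

Because the whole expression is known before anything runs, a real runtime has the chance to schedule or fuse the pending operations, which is the property the design relies on.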
What worked
- NumPy-shaped API made adoption easy. Researchers could port code in hours, not weeks. The lift-and-shift story landed.
- Performance lead on Apple Silicon is real and growing. Multiple independent 2025–26 benchmarks place MLX ahead of llama.cpp on M1/M2/M3/M4 for the laptop-class model sizes most local apps target.
- Active model porting. mlx-community converts most major open-weights releases to MLX format within days.
- Fine-tuning works locally. LoRA on a 7B model is feasible on a 32 GB MacBook — a workflow llama.cpp doesn’t really support.
- Dual Python and Swift APIs means the same framework can drive a research notebook and a native macOS app.
- Apple investment is sustained. Releases are frequent (multiple per month). Community contributions are real but the maintainer cadence is set by Apple’s team.
What failed / criticisms
- Apple-Silicon-only is a hard cap on portability. Cross-platform Locara apps that target Mac + Linux + Windows will need a non-MLX path.
- Smaller community than PyTorch/JAX. Stack Overflow answers, blog posts, and tutorials are thinner. Edge cases require reading the source.
- Documentation lags features. The Python API is well-covered; Swift and the lower-level C++ extension surface less so.
- Quantization formats fragmented. MLX uses its own quant scheme (MLX-Q4, MLX-Q6, etc.) — incompatible with GGUF directly. mlx-community converts, but there are two parallel quant ecosystems now.
- Kernel authoring is less ergonomic than in JAX/Triton. MLX fuses operations through `mx.compile` and accepts hand-written Metal kernels through `mx.fast.metal_kernel`, but it doesn’t expose user-level kernel authoring as cleanly as Triton or JAX/Pallas.
- Some operator gaps remain. Newer architectures (state-space models, specific attention variants) sometimes require waiting weeks for ops to land.
- Closed roadmap. Major direction is set inside Apple, with limited public visibility into what’s coming next.
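To make the quantization point above more concrete, here is a minimal sketch of group-wise affine quantization, the general shape of scheme MLX uses: each small group of weights is stored as integer codes plus a per-group scale and offset. The group size, bit width, and function names are illustrative assumptions. Bit packing, and the different block layouts that GGUF defines, are not modeled, which is exactly why the two formats cannot be read interchangeably.

```python
from typing import List, Tuple


def quantize_group(values: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Map one group of weights to integer codes plus a scale and offset."""
    levels = (1 << bits) - 1                  # e.g. 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0         # constant group: any nonzero scale works
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, offset: float) -> List[float]:
    """Reconstruct approximate weights from codes, scale, and offset."""
    return [c * scale + offset for c in codes]


def quantize(weights: List[float], group_size: int = 4, bits: int = 4):
    """Quantize a flat weight list group by group."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


# Made-up weights, two groups of four.
weights = [0.12, -0.40, 0.33, 0.05, 1.50, 1.10, 0.90, 1.30]
groups = quantize(weights)
restored = [w for codes, scale, offset in groups
            for w in dequantize_group(codes, scale, offset)]
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f}")
```

The reconstruction error in each group is bounded by half that group's scale, so smaller groups or more bits trade storage for accuracy.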
Specific learnings for Locara
- MLX-default-with-llama.cpp-fallback is the right call, and the gap is bigger than the spec implies. A 20–87% throughput lead over llama.cpp on consumer Apple Silicon for sub-14B models means MLX isn’t a “tied option”; it’s the obvious default for any Mac-targeted Locara app. The fallback only matters when a specific model hasn’t been ported to MLX, or when the user is on a non-Mac platform.
- Adopt MLX-format weights as a first-class type in the manifest. Don’t only declare GGUF. The model manifest should have MLX-format paths for Apple Silicon and GGUF paths for portability, with content-addressed dedup across both.
- mlx-community on HF is the model registry to mirror. Locara’s content-addressed cache should pull from mlx-community for Mac-target apps and pick from the community list by default rather than re-quantizing in-house.
- The Swift API matters for native apps. A Locara app that uses Swift + MLX directly (vs. Python + MLX in a subprocess) gets cleaner performance and tighter UI integration. The framework’s Swift binding is the right shape for native Mac apps; lean into it.
- Don’t try to ship MLX inside the app bundle; depend on the system runtime. MLX is large (binaries + Metal kernels); apps should declare it as a system dependency the runtime provides, not vendor it. This mirrors how iOS apps don’t ship Core ML.
- The lazy-evaluation model rewards different code than llama.cpp. Apps doing custom math (RAG over local docs, embedding pipelines) on MLX should express work as large batched operations and evaluate once at the end, so the graph can fuse, rather than forcing evaluation inside a loop. This is a documentation/template concern for the SDK, not just a runtime one.
- Quantization fragmentation is a real cost. Locara should not declare an opinion on MLX-Q vs GGUF quants in the manifest — let the runtime pick the right one for the platform from the content-addressed store. But document the existence of two parallel quant ecosystems for app authors.
- Apple’s roadmap influences yours. If MLX adds first-class fine-tuning, audio, or speculative decode primitives, Locara apps built on those features get a free upgrade. Watch the release notes.
- Apple-only is a feature for the Mac-first phase. Locara’s “Mac-first” stance is technically aligned with MLX’s Apple-only stance. Phase 2 (cross-platform) is where the llama.cpp fallback becomes load-bearing.
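The manifest and fallback recommendations above can be sketched as follows. This is one possible shape under stated assumptions, not Locara's actual manifest format: `BlobStore`, `WeightVariant`, `ModelManifest`, `select_variant`, the paths, and the model name are all hypothetical, and the store is an in-memory stand-in for a real content-addressed cache.

```python
import hashlib
import platform
from dataclasses import dataclass
from typing import Dict, Optional


def content_address(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the blob's storage key."""
    return hashlib.sha256(data).hexdigest()


class BlobStore:
    """In-memory stand-in for a content-addressed cache."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = content_address(data)
        self._blobs.setdefault(digest, data)   # identical bytes are stored once
        return digest

    def __len__(self) -> int:
        return len(self._blobs)


@dataclass(frozen=True)
class WeightVariant:
    fmt: str      # "mlx" or "gguf"
    path: str     # hypothetical path inside the package
    sha256: str   # content address of the weight file


@dataclass(frozen=True)
class ModelManifest:
    name: str
    variants: Dict[str, WeightVariant]   # keyed by format


def is_apple_silicon(system: Optional[str] = None, machine: Optional[str] = None) -> bool:
    """Detect Apple Silicon; arguments override detection for testing."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


def select_variant(manifest: ModelManifest,
                   system: Optional[str] = None,
                   machine: Optional[str] = None) -> WeightVariant:
    """Prefer MLX weights on Apple Silicon, otherwise fall back to GGUF."""
    order = ["mlx", "gguf"] if is_apple_silicon(system, machine) else ["gguf"]
    for fmt in order:
        if fmt in manifest.variants:
            return manifest.variants[fmt]
    raise LookupError(f"{manifest.name}: no usable weights for this platform")


store = BlobStore()
mlx_digest = store.put(b"placeholder mlx weights")
gguf_digest = store.put(b"placeholder gguf weights")
store.put(b"placeholder gguf weights")  # duplicate upload, deduplicated

manifest = ModelManifest(
    name="example-model",
    variants={
        "mlx": WeightVariant("mlx", "weights/mlx/model.safetensors", mlx_digest),
        "gguf": WeightVariant("gguf", "weights/gguf/model.gguf", gguf_digest),
    },
)

print(select_variant(manifest, "Darwin", "arm64").fmt)   # mlx
print(select_variant(manifest, "Linux", "x86_64").fmt)   # gguf
print(len(store))                                        # 2
```

Keeping the format choice in `select_variant` rather than in the manifest matches the recommendation that app authors declare both variants and let the runtime pick. A manifest that only carries GGUF still resolves on a Mac, which is the fallback case.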
References
- https://github.com/ml-explore/mlx (main repo, MIT)
- https://github.com/ml-explore/mlx-examples
- https://github.com/ml-explore/mlx-lm (LLM-specific package)
- https://huggingface.co/mlx-community
- Awni Hannun on X: https://x.com/awnihannun
- “Apple drops new MLX machine learning framework for Apple silicon Macs” — 9to5Mac, 2023-12-06
- Simon Willison, “Run LLMs on macOS using llm-mlx and Apple’s MLX framework” (2025-02-15)
- Production-Grade Local LLM Inference on Apple Silicon (arXiv 2511.05502, 2025)
- MLX vs llama.cpp throughput comparisons — Contra Collective (2026), Groundy (2026)