Apple Acceleration Frameworks — Accelerate, Metal, MPS, ANE, AMX

What this is: A catalog-with-judgment for Apple’s hardware-accelerated frameworks (Accelerate / vDSP / BLAS / BNNS / AMX, Metal compute, MPS, MPSGraph, Core ML / ANE, Core Image, VideoToolbox, vImage). What each is for, when to reach for it, what alternatives exist, the mistakes to avoid. Sourced from canonical Apple WWDC sessions, Dougall Johnson’s reverse-engineering of Apple GPU and AMX, Eclectic Light’s power measurements, Warren Moore on Metal, Daniel Lemire on vectorization, Bill Dally on accelerator energy economics, Hennessy & Patterson on parallel architectures.

Why it matters: Every “I have this compute task: should I use Accelerate? Metal? MPS? ANE? Or hand-rolled NEON?” decision affects perf, energy, and how much code you maintain. Most LLM apps under-use Apple’s acceleration stack because the catalog isn’t obvious. The biggest single fact: AMX (Apple Matrix Coprocessor) is reachable only through Accelerate, so a hand-rolled NEON matmul leaves 2–4× of perf and ~3× of efficiency on the table.

Most relevant to Locara: Pairs with mac-llm-optimization.md (LLM-specific framework choices on Mac), mlx.md (why MLX bypasses MPSGraph and writes raw Metal), mac-hardware-lineup.md (per-chip GPU/ANE capabilities), macos-memory-management.md (Metal storage modes and recommendedMaxWorkingSetSize).

Part 1 — The compute fabric of Apple Silicon

A single Apple Silicon SoC is at least seven distinct compute units sharing one DRAM pool. Understanding which one your work lands on is the foundation for every framework choice below.

1.1 The compute units, exhaustive

Unit	What you reach it with	Notes
P-cores (Firestorm → Avalanche → Everest → Coyote)	C/C++/Swift/ObjC; NEON via `arm_neon.h`; AMX via Accelerate; sometimes auto-vectorized	M3 base = 4 P / M3 Max 16-core = 12 P. `sysctl hw.perflevel0.physicalcpu`.
E-cores (Icestorm → Blizzard → Sawtooth → Tupai)	Same toolchain, GCD QoS `.utility`/`.background` schedules here	No AMX on most M-series (per Eclectic Light, Finding and evaluating AMX co-processors — AMX is per P-core cluster, not per E-core cluster).
AMX (Apple Matrix Coprocessor)	Only through Accelerate (BLAS, vDSP_mmul, sparse, BNNS)	Undocumented arm64 ISA extension. Reverse-engineered by Dougall Johnson, Apple AMX Dec 2020; full instruction reference at corsix/amx.
Integrated GPU (G13 → G14 → G15 → G16)	Metal compute kernels (MSL), MPS, MPSGraph, Core Image, Core Animation, SwiftUI	32-thread SIMD-groups, ~208 KB register file per core, ~32 KB threadgroup memory (dougallj/applegpu, philipturner/metal-benchmarks).
Apple Neural Engine (ANE)	Only through Core ML (no ISA, no public kernel API)	16-core fixed-function tensor accelerator. FP16/INT8. Reaches up to ~38 TOPS on M4. Fully opaque.
Media engines (ProRes, h.264, h.265, AV1 on M3+)	VideoToolbox (`VTCompressionSession`, `VTDecompressionSession`), AVFoundation	All M3-family chips have a dedicated AV1 decoder (decode only); M4 added more.
ISP (Image Signal Processor)	Implicit via `AVFoundation` camera pipeline	Not user-programmable.

The key insight: of these seven units, only three are programmable by you with custom code (P-cores, E-cores, GPU). The other four expose their compute only via Apple’s libraries. AMX, the ANE, the media engines, and the ISP are all “private” silicon — your access goes through a framework or it doesn’t go.
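To see which CPU clusters a given machine exposes, the `hw.perflevel0.physicalcpu` key from the table can be read with `sysctlbyname`. A minimal sketch: `sysctlInt` is a hypothetical helper name, and the companion key `hw.perflevel1.physicalcpu` is assumed here to report the E-core count (it is not mentioned in the table above).

```swift
import Darwin

/// Reads a 32-bit integer sysctl value by name.
/// Returns nil if the key is absent on this machine.
func sysctlInt(_ name: String) -> Int? {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return nil }
    return Int(value)
}

let pCores = sysctlInt("hw.perflevel0.physicalcpu")   // performance cores
let eCores = sysctlInt("hw.perflevel1.physicalcpu")   // assumed: efficiency cores
print("P-cores:", pCores.map { String($0) } ?? "n/a",
      "E-cores:", eCores.map { String($0) } ?? "n/a")
```

The counts only tell you what is available. Which cluster a thread actually runs on is still decided by the scheduler and the QoS class you give it.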

1.2 What this means for a local-AI Mac app

Workload class	Lands on	Programming surface
Scalar/branchy code	P-core	C/Swift, no SIMD
Vector arithmetic (image, audio, embedding distance)	P-core NEON	`arm_neon.h`, `vDSP_`, `vImage_`
Big GEMM	AMX via Accelerate	`cblas_sgemm`, `vDSP_mmul`, `BNNSDirectApplyConvolution`
Custom GPU kernel (novel fused op)	GPU	Metal MSL `kernel` function
Standard ML op (matmul, conv, attention)	GPU	MPS / MPSGraph; routes to GPU
Small fixed-shape CNN/RNN	ANE via Core ML	`MLModelConfiguration.computeUnits = .all`
Transformer LLM (typical 2024+ pattern)	GPU (Metal)	MLX, llama.cpp Metal backend
Image filter pipeline	GPU via Core Image	`CIFilter` chain
h.264/h.265/ProRes/AV1 codec	Media engine	VideoToolbox

A common failure: an engineer hand-rolls NEON for a big GEMM. On Apple Silicon, this leaves AMX on the table, because AMX is not addressable from NEON or any user-mode ISA. You can’t write vfma_f32(...) and reach AMX. Only the libraries inside libBLAS.dylib, libBNNS.dylib, libvDSP.dylib, and libSparse.dylib know the secret instruction encodings (Dougall Johnson on X, Dec 2020; corsix/amx).

Part 2 — Accelerate.framework deep-dive

Accelerate is a single umbrella framework (#include <Accelerate/Accelerate.h> brings in everything) over half a dozen distinct libraries. Each library is the hand-tuned, vendor-optimized path for its domain. They share three properties: (a) they auto-detect the chip and dispatch to the best ISA path (NEON, AMX, sometimes even AVX on Intel), (b) they are zero-binary-cost — the dylibs are in the OS, you link against them, and (c) on Apple Silicon, for the supported sizes, they reach AMX without you knowing.

2.1 vDSP (`vDSP_*`)

Vector digital signal processing. Single- and double-precision. Used for FFT, convolution, biquad filters, vector arithmetic, polynomial evaluation, conversion between formats, sample-rate work.

FFTs: vDSP_DFT_*, vDSP_fft_*, and the older split-complex API. Apple recommends vDSP_DFT_* for new code (vDSP_DFT_zop_CreateSetup, vDSP_DFT_Execute). On Apple Silicon, the larger FFTs dispatch into AMX-accelerated kernels for the radix steps that map onto matrix shapes (Apple Accelerate vDSP Programming Guide; Eclectic Light, Finding and evaluating AMX co-processors).
Vector arithmetic: vDSP_vadd, vDSP_vmul, vDSP_dotpr. These compile to NEON FMA chains. AMX engages only for matrix multiplies (vDSP_mmul); element-wise vector ops stay on the P-core NEON pipe.
Convolution: vDSP_conv and the vDSP_imgfir family.
Biquad filters: vDSP_biquad_* — the canonical path for real-time audio filtering. The biquad cascade is hand-tuned to avoid the data-dependent recurrence stalls a naive scalar implementation would hit.

When you reach for vDSP: any audio pipeline, any embedding-distance loop, any “vectorize this loop” instinct. Cheaper than rolling your own NEON because Apple’s implementations also handle small-N tail cases without branch overhead.
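As an illustration of the embedding-distance case, here is a minimal sketch built from `vDSP_dotpr`, `vDSP_svesq`, and `vDSP_distancesq`. The function names `cosineSimilarity` and `squaredDistance` are hypothetical, and the choice to return 0 for a zero-length vector norm is an assumption of this sketch, not a vDSP convention.

```swift
import Accelerate

/// Cosine similarity of two equal-length embeddings using vDSP primitives.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && !a.isEmpty, "vectors must match and be non-empty")
    let n = vDSP_Length(a.count)
    var dot: Float = 0, aa: Float = 0, bb: Float = 0
    vDSP_dotpr(a, 1, b, 1, &dot, n)   // sum of a[i] * b[i]
    vDSP_svesq(a, 1, &aa, n)          // sum of a[i]^2
    vDSP_svesq(b, 1, &bb, n)          // sum of b[i]^2
    let denom = (aa * bb).squareRoot()
    return denom > 0 ? dot / denom : 0   // assumption: define similarity as 0 for a zero vector
}

/// Squared Euclidean distance between two equal-length vectors.
func squaredDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && !a.isEmpty)
    var d: Float = 0
    vDSP_distancesq(a, 1, b, 1, &d, vDSP_Length(a.count))
    return d
}

print(cosineSimilarity([1, 0], [0, 1]))   // 0.0
print(squaredDistance([0, 0], [3, 4]))    // 25.0
```

For nearest-neighbor ranking, the squared distance is usually enough; skipping the square root does not change the ordering.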

2.2 vImage (`vImage*`)

Image-processing primitives. Convolution (vImageConvolve_*), affine and perspective transforms (vImageAffineWarp_*), color-space conversion (vImageConvert_RGBA8888toBGRA8888), scaling (vImageScale_*), alpha compositing, histogram operations.

Tuned for both NEON and AVX2 — same API on Intel and Apple Silicon. On Apple Silicon, the heavy ops (convolution, color-matrix multiplication) dispatch through AMX in the cases where the kernel/image dimensions allow it.
The right call when you have a one-shot operation on a CVPixelBuffer or raw byte array and don’t want to set up Core Image. vImage operates directly on byte buffers (vImage_Buffer) — no Metal command queue, no GPU upload.
Where it loses to Core Image: multi-stage filter chains. Each vImage call materializes the intermediate buffer; Core Image fuses the chain on the GPU before materializing the final result.

2.3 BLAS / LAPACK (`cblas_`, `LAPACK`)

The big-matrix path. cblas_sgemm (single-precision GEMM), cblas_dgemm (double), the cblas_*gemv family for matrix-vector. LAPACK on top adds decompositions (sgetrf, sgeqrf, sgesvd).

Apple’s implementation auto-dispatches to AMX for the supported shapes. Per Dougall Johnson’s reverse-engineering notes and corsix/amx, AMX’s outer-product-and-accumulate semantics map directly onto GEMM’s inner loop: a single AMX instruction multiplies every element of a 64-byte X register by every element of a 64-byte Y register and accumulates the outer product into Z. That is a 16×16 tile for FP32 and a 32×32 tile for 16-bit types. The coprocessor also shows up in power measurements: a cblas_sgemm on a 512×512 matrix on an M3 draws ~8 watts of single-thread power vs ~2.5 watts for the equivalent NEON computation, because the AMX coprocessor is doing the work (Eclectic Light, Finding and evaluating AMX co-processors).
The throughput delta vs hand-rolled NEON GEMM is roughly 2–4× for FP32 on supported chips, increasing toward larger N as the AMX outer-product accumulates dominate the cost.
The “new BLAS” warning: Apple has shipped two BLAS/LAPACK interfaces since macOS 13.3. The legacy one is what a plain #include <Accelerate/Accelerate.h> gives you; the newer one is opt-in via #define ACCELERATE_NEW_LAPACK before the include. Adding #define ACCELERATE_LAPACK_ILP64 on top selects the ILP64 variant, where __LAPACK_int is int64_t instead of int32_t (64-bit indices). New code should use the new path, since the SDK marks the legacy interface as deprecated (Apple Bringing the latest BNNS and LAPACK functionality into your app).
NumPy / SciPy users: NumPy’s matmul (np.dot, the @ operator) and SciPy’s scipy.linalg.blas.sgemm hit AMX on macOS when the library is built against the Accelerate backend. NumPy wheels from PyPI default to Accelerate on Apple Silicon as of NumPy 2.0, on macOS 14 and later.
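For reference, a minimal sketch of the `cblas_sgemm` call from Swift, on row-major arrays. The wrapper name `matmul` is hypothetical. Whether a given call reaches AMX is decided inside Accelerate and cannot be observed from this code. On recent SDKs the legacy CBLAS interface may produce a deprecation warning unless the target is built with `ACCELERATE_NEW_LAPACK` defined.

```swift
import Accelerate

/// C (m×n) = A (m×k) × B (k×n), all row-major Float arrays.
func matmul(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n, "shape mismatch")
    var c = [Float](repeating: 0, count: m * n)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Int32(m), Int32(n), Int32(k),
                1.0,                 // alpha
                a, Int32(k),         // A and its leading dimension
                b, Int32(n),         // B and its leading dimension
                0.0,                 // beta: overwrite C rather than accumulate
                &c, Int32(n))        // C and its leading dimension
    return c
}

let a: [Float] = [1, 2, 3,
                  4, 5, 6]           // 2×3
let b: [Float] = [7, 8,
                  9, 10,
                  11, 12]            // 3×2
print(matmul(a, b, m: 2, n: 2, k: 3))   // [58.0, 64.0, 139.0, 154.0]
```

A 2×3 example like this is far below any plausible AMX crossover; it only shows the argument order. Benchmark at your real shapes.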

2.4 BNNS (`Basic Neural Network Subroutines`)

CPU-side small-neural-network inference. Pre-Core-ML; introduced 2016 (iOS 10 / macOS Sierra). API: BNNSConvolutionLayerParameters, BNNSFullyConnectedLayerParameters, BNNSFilterApply. Modern BNNS added BNNSGraph (WWDC 2024, macOS 15 / iOS 18): a small graph compiler for fused CPU inference.

Still useful for: tiny graphs (per-frame OCR layout, classical CV) where setting up Core ML or Metal is overkill.
Less useful for: transformer inference or anything bigger than a few hundred MFLOPs. Core ML handles those better.
AMX-accelerated: the BNNS GEMM and convolution paths use AMX when the layer shape allows.

2.5 Sparse (`SparseMatrixStructure`, `SparseFactor`, `SparseMultiply`)

Sparse linear algebra: CSC/CSR matrices, factorizations (Cholesky, LU, QR), iterative solvers. Used heavily inside RealityKit (for IK/physics) and SceneKit, but available to apps directly.

SparseMultiply triggers AMX use per Eclectic Light — the dense × sparse matvec inner loop runs through the matrix coprocessor.

2.6 Quadrature, Geometry, Simd (capital S)

Quadrature — numerical integration (quadrature_integrate). Mostly used in physics simulations.
Geometry — geometric algorithms (point-in-polygon, etc.).
simd.h — Apple’s “small SIMD” header (simd_float4, simd_quatf, matrix_float4x4). These are not AMX. They are NEON FMA chains for 4-element vector and 4×4-matrix math, sized for graphics transforms. Convenient for ARKit/RealityKit usage. Not for big-matrix work.
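A small sketch of the graphics-transform use that simd.h is sized for: build a 4×4 translation matrix and apply it to a homogeneous point. The helper name `translation` is hypothetical.

```swift
import simd

/// Builds a 4×4 translation matrix. simd matrices are column-major,
/// so the translation lives in the fourth column.
func translation(_ t: SIMD3<Float>) -> simd_float4x4 {
    var m = matrix_identity_float4x4
    m.columns.3 = SIMD4<Float>(t.x, t.y, t.z, 1)
    return m
}

let point = SIMD4<Float>(1, 2, 3, 1)                       // homogeneous point (w = 1)
let moved = translation(SIMD3<Float>(10, 0, -1)) * point   // matrix × column vector
print(moved)   // SIMD4<Float>(11.0, 2.0, 2.0, 1.0)
```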

2.7 The “AMX is reached only through Accelerate” rule

This is the most important practical fact about Apple Silicon’s compute fabric. AMX has no public ISA, no intrinsic, no Swift API. The only way your code uses it is inside one of Apple’s libraries — Accelerate’s BLAS, vDSP_mmul, BNNS, Sparse — or inside frameworks that themselves call those libraries (Core ML, MPSGraph in some cases, MLX in some cases).

Two consequences:

If you hand-roll NEON for matrix multiply, you lose AMX and burn 2–4× more power per flop (Eclectic Light power measurements). Always cblas_sgemm first; benchmark before deciding NEON is worth it.
Sometimes Accelerate doesn’t dispatch to AMX: the runtime makes a size-and-shape decision. Very small matmuls (sub-64×64) stay on NEON because the AMX setup overhead dominates, and very large matmuls may likewise take a different path. There’s no documented threshold; benchmark to find your specific crossover, or trust Apple’s heuristic.

The historical record: Dougall Johnson’s IDA plugin and notes (Dec 2020); Erik Engheim’s Medium writeup (Jan 2021); corsix/amx — the most complete public instruction reference, including X (amx0, 0x200 bytes), Y (amx1, 0x200 bytes), and Z (amx2, 0x1000 bytes, 64 rows of 64 bytes) register layouts.

Part 3 — Metal compute (when you need custom kernels)

Metal compute is your escape hatch when MPS doesn’t have the op you need, when you want to fuse multiple ops to avoid memory round-trips, or when you need full control over thread-group memory and warp-level reductions.

3.1 When Metal compute is the right answer

Custom kernel not in MPS / MPSGraph (e.g., a novel attention variant, your own quantization scheme, a non-standard FFT).
Fused operations across multiple steps — combine quant-dequant + matmul + activation in one kernel to keep intermediates in registers/threadgroup memory.
Workloads where you need bit-exact control (cryptography, hash, custom random number generation).

3.2 The mental model

The hardware (per dougallj/applegpu G13 documentation, philipturner/metal-benchmarks):

SIMD-group = 32 threads (called a “warp” in CUDA-speak). Fixed on G13/G14/G15/G16.
Threadgroup = multiple SIMD-groups that share threadgroup memory, max 1024 threads. The unit submitted by dispatchThreadgroups(_:threadsPerThreadgroup:).
Grid = many threadgroups. Total work.
Register file ~ 208 KB per core, dynamically allocated per SIMD-group on M3+ (the “Dynamic Caching” feature — Apple Explore GPU advancements in M3 and A17 Pro). Fewer registers per thread → more concurrent SIMD-groups → better latency hiding.
Threadgroup memory ≈ 32 KB per group on Apple GPUs (G13 spec; varies slightly on later families). The on-chip cache you control directly.
L1 data cache ~ 8 KB per core; L2 cache 768 KB–1 MB per core (per Philip Turner’s benchmarks). The latency gap between threadgroup memory and unified DRAM is roughly 10–20× — every tiling optimization exists to stay in threadgroup memory.

3.3 The API surface

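import Metal

// One-time setup: device, compiled kernel, pipeline. Force-unwraps keep the
// sketch short; production code should handle the nil and throwing cases.
let device = MTLCreateSystemDefaultDevice()!
let library = try device.makeDefaultLibrary(bundle: .main)
let function = library.makeFunction(name: "my_kernel")!
let pipeline = try device.makeComputePipelineState(function: function)

// Per-dispatch encoding. `input` and `output` (MTLBuffers) and `numGroups`
// (elementCount / 256, rounded up) are supplied by the caller.
let threadsPerGroup = 256
let queue = device.makeCommandQueue()!
let cmdBuffer = queue.makeCommandBuffer()!
let encoder = cmdBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(input, offset: 0, index: 0)
encoder.setBuffer(output, offset: 0, index: 1)
// The kernel's [[threadgroup(0)]] argument needs its size set on the host.
encoder.setThreadgroupMemoryLength(threadsPerGroup * MemoryLayout<Float>.stride, index: 0)
encoder.dispatchThreadgroups(
    MTLSize(width: numGroups, height: 1, depth: 1),
    threadsPerThreadgroup: MTLSize(width: threadsPerGroup, height: 1, depth: 1)
)
encoder.endEncoding()
cmdBuffer.commit()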

In MSL:

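#include <metal_stdlib>
using namespace metal;

kernel void my_kernel(
    device const float* in  [[buffer(0)]],
    device       float* out [[buffer(1)]],
    uint gid [[thread_position_in_grid]],
    uint lid [[thread_position_in_threadgroup]],
    threadgroup float* tg   [[threadgroup(0)]])   // sized by the host
{
    // Stage one element per thread into threadgroup memory. This assumes the
    // grid exactly covers the buffer; otherwise bounds-check gid first.
    tg[lid] = in[gid];
    // Make every thread's write visible before any thread reads tg.
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // ...
}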
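The host code above needs a threadgroup count and a threadgroup-memory size. A minimal sketch of that arithmetic, assuming a 1-D dispatch, one Float of threadgroup memory per thread, and the 32-wide SIMD-groups described in 3.2. `DispatchPlan` and `planDispatch` are hypothetical names.

```swift
/// Sizing for a 1-D compute dispatch.
struct DispatchPlan: Equatable {
    let threadgroups: Int               // width passed to dispatchThreadgroups
    let threadsPerThreadgroup: Int
    let simdGroupsPerThreadgroup: Int
    let threadgroupMemoryBytes: Int     // one Float per thread
}

func planDispatch(elementCount: Int,
                  threadsPerThreadgroup: Int = 256,
                  simdWidth: Int = 32) -> DispatchPlan {
    precondition(elementCount > 0 && threadsPerThreadgroup > 0 && simdWidth > 0)
    precondition(threadsPerThreadgroup % simdWidth == 0,
                 "keep the threadgroup a whole number of SIMD-groups")
    // Round up so the last partial block is still covered.
    let groups = (elementCount + threadsPerThreadgroup - 1) / threadsPerThreadgroup
    return DispatchPlan(
        threadgroups: groups,
        threadsPerThreadgroup: threadsPerThreadgroup,
        simdGroupsPerThreadgroup: threadsPerThreadgroup / simdWidth,
        threadgroupMemoryBytes: threadsPerThreadgroup * MemoryLayout<Float>.stride)
}

let plan = planDispatch(elementCount: 1000)
print(plan.threadgroups, plan.simdGroupsPerThreadgroup, plan.threadgroupMemoryBytes)
// 4 8 1024
```

Rounding up launches 1024 threads for 1000 elements, so the kernel has to bounds-check `gid` before touching the buffers.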

3.4 The optimization checklist (from Apple’s own talks)

From WWDC20 #10632 Optimize Metal Performance for Apple silicon Macs and WWDC22 #10066 Squeeze the most out of Apple GPUs with Metal performance counters:

Use 16-bit types when you can (half, short). Halves register pressure, doubles occupancy, often doubles ALU throughput. (Apple GPUs have native FP16 paths.)
Coalesce memory access — consecutive threads should access consecutive addresses. Strided access bottlenecks bandwidth.
Use SIMD-group reductions (simd_sum, simd_max, simd_shuffle_xor) instead of threadgroup-memory reductions when possible. SIMD-group ops happen in registers; threadgroup ops cost a barrier.
simdgroup_matrix: the tensor-style API (MSL 2.3 and later) for 8×8 matrix multiplies inside a SIMD-group. Maps to the GPU’s tensor pipeline; this is the “tensor core” of Apple GPUs (Apple Metal documentation).
Avoid threadgroup_barrier when you don’t need it — barriers serialize the SIMD-groups in a threadgroup.
Use MTLResourceHazardTrackingModeUntracked when you do your own fence management — skips the driver’s auto-dependency-tracking (WWDC22 #10106 Profile and optimize your game’s memory).
Call setPurgeableState(.volatile) on caches the GPU can rebuild. The kernel reclaims them under pressure.
Pool MTLBuffers rather than allocating per frame.
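To make the SIMD-group reduction item concrete, here is a sketch of a sum where each SIMD-group reduces in registers with `simd_sum` and writes one partial result, with no threadgroup memory and no barrier. Assumptions of this sketch: the kernel is compiled from a source string at run time, the final pass over the partial sums runs on the CPU, and the names `partial_sums`, `gpuSum`, and `MetalSketchError` are hypothetical.

```swift
import Metal

let reduceSource = """
#include <metal_stdlib>
using namespace metal;

kernel void partial_sums(device const float* in      [[buffer(0)]],
                         device       float* partial [[buffer(1)]],
                         constant     uint&  n       [[buffer(2)]],
                         uint gid       [[thread_position_in_grid]],
                         uint lane      [[thread_index_in_simdgroup]],
                         uint simdWidth [[threads_per_simdgroup]])
{
    float v = (gid < n) ? in[gid] : 0.0f;   // pad the tail with zeros
    float s = simd_sum(v);                  // reduce across the SIMD-group in registers
    if (lane == 0) partial[gid / simdWidth] = s;
}
"""

enum MetalSketchError: Error { case unavailable }

func gpuSum(_ input: [Float]) throws -> Float {
    precondition(!input.isEmpty)
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue() else { throw MetalSketchError.unavailable }
    let library = try device.makeLibrary(source: reduceSource, options: nil)
    guard let function = library.makeFunction(name: "partial_sums") else {
        throw MetalSketchError.unavailable
    }
    let pipeline = try device.makeComputePipelineState(function: function)

    let width = 256                                   // threads per threadgroup
    let groups = (input.count + width - 1) / width    // round up
    let partialCount = groups * width / pipeline.threadExecutionWidth
    var n = UInt32(input.count)

    guard let inBuf = device.makeBuffer(bytes: input,
                                        length: input.count * MemoryLayout<Float>.stride,
                                        options: .storageModeShared),
          let outBuf = device.makeBuffer(length: partialCount * MemoryLayout<Float>.stride,
                                         options: .storageModeShared),
          let cmd = queue.makeCommandBuffer(),
          let enc = cmd.makeComputeCommandEncoder() else { throw MetalSketchError.unavailable }

    enc.setComputePipelineState(pipeline)
    enc.setBuffer(inBuf, offset: 0, index: 0)
    enc.setBuffer(outBuf, offset: 0, index: 1)
    enc.setBytes(&n, length: MemoryLayout<UInt32>.size, index: 2)
    enc.dispatchThreadgroups(MTLSize(width: groups, height: 1, depth: 1),
                             threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
    enc.endEncoding()
    cmd.commit()
    cmd.waitUntilCompleted()   // acceptable in a one-off sketch, not in a per-frame path

    // Finish on the CPU: one partial sum per SIMD-group.
    let p = outBuf.contents().bindMemory(to: Float.self, capacity: partialCount)
    return UnsafeBufferPointer(start: p, count: partialCount).reduce(0, +)
}
```

The index `gid / simdWidth` identifies a SIMD-group only because the threadgroup width is a multiple of the SIMD width and the dispatch is 1-D. A production version would create the pipeline once and reuse buffers rather than rebuild them per call.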

3.5 Performance counters

Xcode’s GPU Frame Capture exposes the per-shader counters: ALU active %, memory stall %, register pressure, occupancy. WWDC22 #10066 Squeeze the most out of Apple GPUs walks through them. The dominant constraint on most LLM kernels is bandwidth, not ALU — confirming that the GEMV phase of decoder inference is bandwidth-bound on Apple Silicon, matching the practical numbers in mac-llm-optimization.md.

Part 4 — MPS and MPSGraph

4.1 MPS (Metal Performance Shaders) — the original

Released 2015 for iOS, 2017 for macOS (10.13). A library of hand-tuned Metal kernels for image and matrix work: MPSMatrixMultiplication, MPSImageConvolution, MPSImageGaussianBlur, MPSCNNConvolution. You wire them into your MTLCommandBuffer directly; they’re “fancy Metal kernels Apple already wrote.”

Use MPS when: you want one big op done well and don’t need a graph. MPSMatrixMultiplication is faster than most hand-rolled Metal GEMMs.
MPSNDArrayMatrixMultiplication is the modern n-dimensional matmul; takes MPSNDArray operands and handles batching.

4.2 MPSGraph (Metal Performance Shaders Graph) — the modern surface

Introduced at WWDC 2020 (macOS 11 / iOS 14). A computation-graph API: you describe the computation as a graph (MPSGraph, MPSGraphTensor, MPSGraphOperation), and Apple’s compiler fuses operations and schedules them across the GPU. This is the Apple-native ML graph framework, sitting beneath Core ML, PyTorch’s mps backend, TensorFlow-metal, and JAX-metal.

Key WWDC sessions:

WWDC21 #10152 Accelerate machine learning with Metal Performance Shaders Graph — introduction.
WWDC22 #10063 Accelerate machine learning with Metal — RNN ops, Philox random, shared events.
WWDC23 #10050 Optimize machine learning for Metal apps — MPSGraphPackage serialization (pre-compile graphs to skip launch overhead), mps-graph-tool CLI for converting CoreML/ONNX → MPSGraph, FP8/Int8 quantization fusion, bfloat16, 3D convolutions, FFT.

The MPSGraph compiler can fuse standard patterns (Conv2D+Bias+ReLU, dequant-matmul-quant, attention QK^T+softmax+V) into single kernels — the kind of fusion that hand-written Metal would require explicit work to achieve.

4.3 When MPSGraph beats hand-written Metal

Standard ML ops that the graph compiler already knows.
When you’ll run the graph repeatedly — the upfront compile pays off.
When you want automatic FP16/bfloat16 mixed precision.
When the graph is large enough that fusion pays back the framework overhead.

4.4 When MPSGraph loses

One-shot kernels with novel ops — graph compile overhead dominates.
Very small problems (the graph compile is ~10 ms; if your computation is 1 ms, it’s a loss).
When you need bit-exact control (custom quantization, deterministic reductions).
Anything not supported — adds a CPU-side fallback that can crater perf.

4.5 The relationship to Core ML and MLX

Core ML uses MPSGraph as one of its backends. When MLComputeUnits includes .cpuAndGPU and the model has GPU-compatible ops, Core ML lowers the model to MPSGraph internally.
MLX uses Metal directly, not MPSGraph. Per the MLX repo and Awni Hannun’s design talks, MLX wrote its own Metal kernels for the LLM-shaped workloads it targets — the team felt MPSGraph’s general-purpose path left perf on the table for their domain. This is a recurring pattern: the highest-performance ML libraries on Apple Silicon (MLX, parts of llama.cpp’s Metal backend) bypass MPSGraph for their hottest kernels and use raw Metal compute.

Part 5 — Core ML and the Apple Neural Engine

Core ML is the deployment framework. You convert your trained model (PyTorch, TensorFlow, ONNX) to .mlmodel / .mlpackage using coremltools, embed it in your app, and the runtime picks where to execute it. You can narrow the candidate set with computeUnits (5.1), but within that set the runtime, not you, chooses CPU vs GPU vs ANE per operation.

5.1 The compute-unit knob

import CoreML

let config = MLModelConfiguration()
config.computeUnits = .all  // or .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine
// `url` points at the compiled model (.mlmodelc), for example in your bundle.
let model = try MLModel(contentsOf: url, configuration: config)

.all lets the runtime use ANE first, then GPU, then CPU. Per Apple’s Core ML documentation and operator-support tables, the runtime walks the graph op-by-op and assigns each to whichever backend supports it.

5.2 Why your model often doesn’t land on ANE

The ANE is the most efficient compute on the chip if your graph fits, but the constraints are stringent (Apple research, Deploying Transformers on the Apple Neural Engine; ml-ane-transformers GitHub):

Limited operator set. Standard CNN/RNN ops are supported; anything novel falls back to GPU or CPU. One unsupported op in the middle of your graph can split it into chunks that cross compute units, paying transition cost per chunk.
FP16 (and INT8/INT4 with quantization) only. No FP32 on ANE.
Fixed batch shape. Dynamic-shape ops break ANE compilation.
Channel-first 4D format (B, C, 1, S). Apple’s ane-transformers paper recommends rewriting nn.Linear as nn.Conv2d to expose the 4D format the ANE prefers.
64-byte alignment on the last buffer axis. Apple notes: “Improper handling causes excessive padding — 32× memory overhead for FP16 and 64× for 8-bit precision.”
Tensor chunking. Split QKV into per-head tensors; ANE prefers small contiguous chunks over big strided ones (improves L2 residency, multicore utilization).
Avoid reshape/transpose. They trigger memory copies on ANE; use einsum formulas that map directly.

Apple’s reference implementation in ml-ane-transformers showed 10× speedup and 14× memory reduction for DistilBERT on iPhone 13 when rewritten to fit ANE constraints. Even then, 4 of 606 ops (embedding lookups) fell back to CPU.
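The 32× and 64× figures in the alignment item follow from simple rounding. A worked sketch, assuming the worst case is a last axis of length 1 that still gets padded to a full 64-byte row; the helper names are hypothetical.

```swift
/// Bytes one row of the last axis occupies after rounding up to the
/// alignment boundary (64 bytes, per the constraint above).
func paddedRowBytes(lastAxisCount: Int, bytesPerElement: Int, alignment: Int = 64) -> Int {
    precondition(lastAxisCount > 0 && bytesPerElement > 0 && alignment > 0)
    let raw = lastAxisCount * bytesPerElement
    return (raw + alignment - 1) / alignment * alignment
}

/// Ratio of padded size to useful size for one row.
func paddingOverhead(lastAxisCount: Int, bytesPerElement: Int) -> Double {
    let raw = lastAxisCount * bytesPerElement
    let padded = paddedRowBytes(lastAxisCount: lastAxisCount, bytesPerElement: bytesPerElement)
    return Double(padded) / Double(raw)
}

print(paddingOverhead(lastAxisCount: 1, bytesPerElement: 2))     // 32.0 (FP16, singleton last axis)
print(paddingOverhead(lastAxisCount: 1, bytesPerElement: 1))     // 64.0 (8-bit, singleton last axis)
print(paddingOverhead(lastAxisCount: 128, bytesPerElement: 2))   // 1.0  (256 bytes, already aligned)
```

This is why the (B, C, 1, S) layout puts the long sequence axis last: a long, 64-byte-multiple last axis wastes nothing, while a short one is mostly padding.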

5.3 Compile-time vs load-time vs run-time

Compile time (coremltools.convert(...)): converts the source-framework model to .mlpackage. Can take minutes. Done once, ahead of shipping.
Load time (MLModel.compileModel(at:) if you ship .mlmodel rather than .mlmodelc, or implicit on MLModel(contentsOf:) for .mlmodelc): the runtime materializes the model for the device. Takes 100ms–1s for a typical CNN; longer for big transformers. First-inference latency includes load time, so run one throwaway MLModel.prediction(from:) call during app startup to warm up; otherwise the user pays the load time on their first interaction.
Run time: per-inference cost.
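A minimal warm-up sketch for the load-time point, assuming you already have a compiled model URL and a representative input. `makeConfiguration` and `loadWarmModel` are hypothetical helper names; the Core ML calls themselves are the standard `MLModel(contentsOf:configuration:)` and `prediction(from:)`.

```swift
import CoreML

/// Builds a configuration restricted to the given compute units.
func makeConfiguration(units: MLComputeUnits) -> MLModelConfiguration {
    let config = MLModelConfiguration()
    config.computeUnits = units
    return config
}

/// Loads a compiled model (.mlmodelc) and runs one throwaway prediction so
/// the first user-visible inference does not pay the load cost.
func loadWarmModel(compiledModelURL: URL,
                   warmupInput: MLFeatureProvider,
                   units: MLComputeUnits = .all) throws -> MLModel {
    let model = try MLModel(contentsOf: compiledModelURL,
                            configuration: makeConfiguration(units: units))
    _ = try model.prediction(from: warmupInput)   // result discarded on purpose
    return model
}
```

Call this off the main thread during launch. The warm-up input only needs the right feature names and shapes; its values do not matter.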

5.4 Why MLX / llama.cpp bypass ANE for LLMs

Both target the GPU rather than the ANE. Reasons:

Operator gaps. Custom attention variants (sliding-window, grouped-query, multi-query), KV-cache management, dynamic batch — many don’t fit ANE’s fixed-shape model.
Debugging difficulty. ANE is opaque; you can’t profile it with Xcode’s GPU tools. When something is slow on ANE, you have minimal visibility.
Quantization formats. ANE supports specific quantization recipes; the GGUF/MLX quants don’t always map cleanly.
Dynamic shapes. LLM decoding has growing KV cache, variable sequence length — hostile to ANE’s static-shape preference.

The MLX team chose Metal compute as the substrate because they wanted full control over kernel fusion, quantization, and the KV cache — none of which ANE makes easy.

5.5 When ANE is the right answer

Small fixed-graph CNNs (face detection, person segmentation, scene classification).
Small RNNs / transformer encoders with fixed sequence length (text classification, intent detection).
Real-time vision (depth estimation, hand tracking) where the power budget is the constraint.
Whisper-style speech models distilled to ANE-friendly shape (Apple’s Speech framework on macOS 14+ uses ANE-resident models).
Anything at <1W where battery life matters and the workload is stable.

Part 6 — Core Image (and CIKernel)

Core Image is the image-processing pipeline framework. The mental model: you build a chain of filters (CIFilter instances) on a CIImage, then materialize via a CIContext. The chain is lazy — Core Image’s compiler analyzes the whole chain, fuses what it can into Metal kernels, and renders only the final output.

6.1 The pieces

CIImage — describes an image (file, pixel buffer, output of a filter). Doesn’t materialize until you ask.
CIFilter — a unit of work. Built-ins: CIGaussianBlur, CIColorMatrix, CIPerspectiveTransform, etc. (Hundreds; see CIFilter documentation.)
CIContext — the renderer. Owns Metal device, command queue, caches.
CIKernel / CIColorKernel / CIWarpKernel / CIBlendKernel — custom kernels you write in MSL. Each variety has different signature constraints:
- CIColorKernel: per-pixel, no neighbor access — fastest, best fused.
- CIWarpKernel: per-pixel coordinate transform — for affine/perspective.
- CIKernel: general — can sample neighbors, but the optimizer is more conservative.
- CIBlendKernel: two-input blending.

6.2 When Core Image beats hand-rolled Metal

Multi-filter chains: Core Image fuses them. A chain of three color filters becomes one shader pass.
Standard filters: Apple’s built-ins are well-tuned.
Lazy region-of-interest: only the pixels you actually display are computed.
Integration: works directly with CVPixelBuffer, IOSurface, CGImage, MTLTexture.
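A sketch of the lazy-chain model: two built-in filters are composed into a recipe, and nothing renders until the `CIContext` call. The function name `blurredMono` is hypothetical, and the solid-color test image stands in for a real input. Whether Core Image fuses these particular filters into one pass is up to its compiler and is not visible from this code.

```swift
import CoreImage

/// Blur, then desaturate. Returns a recipe; no pixels are computed here.
func blurredMono(_ input: CIImage, radius: Double = 4) -> CIImage {
    input
        .clampedToExtent()   // avoid transparent edges bleeding into the blur
        .applyingFilter("CIGaussianBlur", parameters: [kCIInputRadiusKey: radius])
        .applyingFilter("CIColorControls", parameters: [kCIInputSaturationKey: 0.0])
        .cropped(to: input.extent)   // back to the original bounds
}

let sourceImage = CIImage(color: CIColor.red)
    .cropped(to: CGRect(x: 0, y: 0, width: 64, height: 64))
let recipe = blurredMono(sourceImage)   // still lazy
let context = CIContext()               // expensive to create, so reuse it
let rendered = context.createCGImage(recipe, from: recipe.extent)   // rendering happens here
print(rendered?.width ?? 0, rendered?.height ?? 0)
```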

6.3 When it loses to hand-rolled Metal

When you need to leave the GPU mid-pipeline (e.g., extract a histogram for CPU processing).
For non-image compute (Core Image is image-only).
When you need control over thread-group memory and can’t express it in a CIKernel.
Very large kernels where fusion isn’t possible.

Part 7 — VideoToolbox and AVFoundation

The lowest level of video pipeline access on macOS. VideoToolbox owns the hardware video codec; AVFoundation is the higher-level capture/playback framework that uses it.

7.1 VideoToolbox

VTCompressionSession: hardware encode. Supports h.264, h.265/HEVC, and ProRes (kCMVideoCodecType_AppleProRes422 and variants).
VTDecompressionSession: hardware decode. Same codec set, plus AV1 on M3+.
The codecs are fixed-function silicon (the “media engines”). M-series chips have one (base from M2 on, and Pro), two (Max), or four (Ultra) ProRes encode/decode engines, plus separate h.264/HEVC engines. The base M1 has no ProRes engine.

7.2 Why this matters even for AI apps

Voice transcription apps that accept video files (e.g., user drops in a meeting recording): decode the video via VideoToolbox to access the audio track; the decode work is offloaded to silicon, leaving the CPU free for tokenization and the GPU free for inference. Software decode in ffmpeg would burn 1–2 cores and ~5W; hardware decode is sub-watt.
Apps that generate video output (slideshow exports, screen recordings, video summarization with overlay): hardware encode is essential for battery life.
AV1 decode on M3+ is hardware. On older chips, falling back to software decode for AV1 is painful on battery.
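Rather than hard-coding chip generations, an app can ask VideoToolbox whether a codec has a hardware decoder on the current machine. A minimal sketch using `VTIsHardwareDecodeSupported`; the function name `hardwareDecodeSupport` is hypothetical, and `kCMVideoCodecType_AV1` is assumed to be present in your SDK (it is only in recent ones).

```swift
import VideoToolbox
import CoreMedia

/// Reports, per codec, whether this machine has a hardware decoder.
func hardwareDecodeSupport() -> [String: Bool] {
    let codecs: [(name: String, type: CMVideoCodecType)] = [
        ("H.264", kCMVideoCodecType_H264),
        ("HEVC", kCMVideoCodecType_HEVC),
        ("AV1", kCMVideoCodecType_AV1),   // assumption: requires a recent SDK
    ]
    var report: [String: Bool] = [:]
    for codec in codecs {
        report[codec.name] = VTIsHardwareDecodeSupported(codec.type)
    }
    return report
}

for (name, supported) in hardwareDecodeSupport().sorted(by: { $0.key < $1.key }) {
    print(name, supported ? "hardware" : "software or unsupported")
}
```

The results depend on the machine, so use them to pick a code path at run time, for example warning the user before a long AV1 transcription job on a chip without the decoder.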

Part 8 — Decision matrix

The summary table for “I have this task; what do I reach for first?”

Task	First reach	Fallback	Notes
Big-matrix matmul FP32/FP64	Accelerate `cblas_sgemm`/`cblas_dgemm`	MPS `MPSNDArrayMatrixMultiplication`	AMX path inside Accelerate. Use `ACCELERATE_NEW_LAPACK` for new code.
Small matmul (<64×64)	Hand NEON or `simd.h`	Accelerate (overhead dominates)	Benchmark; no documented threshold.
FFT (>=1024 pts)	Accelerate `vDSP_DFT_*`	Metal MSL FFT, MPSGraph FFT	`vDSP_DFT_Execute` is the modern API.
Image convolution one-shot	vImage	Metal, Core Image	vImage on byte buffers; no GPU upload.
Image filter pipeline (multi-stage)	Core Image (`CIFilter` chain)	Metal MSL	Core Image fuses the chain.
Custom GPU kernel	Metal compute MSL	—	When MPS doesn’t have it or you need fusion.
Standard NN inference	Core ML (`.computeUnits = .all`)	MPSGraph for fine control	Lets runtime pick ANE/GPU/CPU.
Transformer inference (LLM)	MLX (Apple Silicon) or llama.cpp Metal	Core ML if a small ANE-friendly model	Both bypass ANE; see `mlx.md`.
Video encode/decode	VideoToolbox	ffmpeg software (avoid)	Hardware codec; sub-watt.
Audio resample, filter	Accelerate vDSP	`AVAudioConverter`	`vDSP_biquad_*` for filters.
Sparse linear algebra	Accelerate Sparse	—	`SparseMultiply` hits AMX.
Vector arithmetic (embed distance, etc.)	Accelerate vDSP	Hand NEON	`vDSP_distancesq`, `vDSP_dotpr`.
Color conversion	vImage	Core Image	vImage for one-shot; CI for in-pipeline.
4×4 matrix transform (graphics)	`simd.h`	—	NEON FMA chains; not AMX.
Per-frame compute on video	Metal + CVPixelBuffer + IOSurface	Core Image	Zero-copy with `IOSurfaceRef`.
Quantized weight inference	Core ML (small) / MLX / llama.cpp (LLM)	—	ANE Int8/Int4, MLX 4-bit, GGUF Q4_K_M.

Part 9 — Mistakes to avoid

A catalog. These each have cost the median local-AI-app developer time:

Hand-rolling NEON for matmul on Apple Silicon. You can’t reach AMX from NEON. Always try cblas_sgemm first; the 2–4× perf and the power saving are both real. (Eclectic Light power measurements.)
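Reaching for ANE when your model has one unsupported op. Core ML will split the graph and pay transition cost per chunk. Check where each op actually lands first, using the Core ML performance report in Xcode or MLComputePlan (macOS 14.4+ / iOS 17.4+). The coremltools.utils.evaluate_* helpers measure prediction accuracy, not op placement.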
Treating MPS as “old” and skipping straight to MPSGraph. MPS is fine for one-shot kernels; MPSGraph is the graph framework. They coexist.
Copying MTLResourceStorageMode.shared buffers to .private on Apple Silicon. This is the right pattern on Intel Macs with discrete GPUs; it is a pessimization on UMA. (Covered in macos-memory-management.md.)
Submitting empty Metal command buffers in a tight loop. Each commit costs CPU and partial GPU wake. Batch your work.
Allocating MTLBuffers per frame. Pool them. Or use MTLHeap for sub-allocation.
Not warming up the Metal compiler. First-shader compile is ~50–500 ms; cache pipelines at startup. Use MTLBinaryArchive to persist compiled pipelines across launches.
Not warming up Core ML. First inference includes model load and graph compile. Run a dummy inference at app launch.
Auto-vectorization assumption. Daniel Lemire and others have shown auto-vectorization is unreliable — even for scalar product loops, GCC/Clang behavior varies. If you need NEON, write the intrinsics, don’t pray for the compiler.
Using MTLResourceStorageMode.managed on Apple Silicon. It’s a discrete-GPU mode; on UMA it adds overhead with no benefit.
Loading Data(contentsOf:) for big weight files. Use .alwaysMapped. (Covered in macos-memory-management.md §2.4.)
Forgetting threadgroup_barrier between SIMD-group write and read of threadgroup memory (WWDC compute-kernel docs).
Using FP32 in Metal kernels where FP16 would suffice. Halves register pressure → 2× occupancy → often >2× throughput on bandwidth-bound kernels.
Calling Objective-C bridges in inference hot loops. Per Mike Ash, Performance Comparisons of Common Operations, objc_msgSend is ~2–4× slower than a vtable call (cached) and ~10× slower uncached. Keep the inner loop in C/C++/Swift, not bridging across.
Trusting cmdBuffer.waitUntilCompleted() in a per-frame UI path. Stalls the CPU. Use completion handlers and MTLSharedEvent.
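For the weight-file item in the list above, a minimal sketch of the mapped read. The helper name `mapWeights` and the temporary demo file are hypothetical; the point is only the `.alwaysMapped` option, which asks Foundation to map the file instead of copying it into the heap.

```swift
import Foundation

/// Opens a large file as memory-mapped Data; pages are faulted in on access.
func mapWeights(at url: URL) throws -> Data {
    try Data(contentsOf: url, options: .alwaysMapped)
}

// Demo with a 1 MiB stand-in for a weight file.
let demoURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("weights-demo-\(UUID().uuidString).bin")
try Data(repeating: 0x2A, count: 1 << 20).write(to: demoURL)

let weights = try mapWeights(at: demoURL)
let firstByte = weights.withUnsafeBytes { $0.load(as: UInt8.self) }
print(weights.count, firstByte)   // 1048576 42

try FileManager.default.removeItem(at: demoURL)
```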

Part 10 — The legends-say cross-cutting synthesis

Bill Dally (NVIDIA chief scientist): programmable GPUs become as efficient as fixed-function accelerators only when they grow specialized matrix instructions (Tensor Cores: HMMA, IMMA, QMMA). With ~hundreds of ops per instruction, the dispatch/decode overhead shrinks to ~15–20% (Christian Mills’ notes on Dally’s CUDA Mode talk). NVIDIA’s DLA still beats GPU energy efficiency 2.5× for stable workloads. Apple’s stack reflects this exactly: the ANE is the fixed-function path (most efficient if your graph fits), AMX is the matrix-instruction path inside CPU (programmable surface via Accelerate libraries), and simdgroup_matrix on the GPU is Apple’s tensor-core equivalent.

Hennessy & Patterson (Computer Architecture: A Quantitative Approach): the cost of launching a kernel must be amortized over enough work. Small kernels lose to large fused kernels. This is the reason MPSGraph and Core ML and MLX all do kernel fusion at graph build time — and the reason a hand-rolled chain of three Metal kernels is usually slower than the same computation expressed as one fused MSL kernel.

Daniel Lemire: measure, don’t assume (Lemire blog passim). The crossover between Accelerate and Metal, between NEON and AMX, between MPS and MPSGraph, is workload-specific. Auto-vectorization is unreliable, so let profiling decide where hand-written NEON is worth it. The “fasta” case study (Lemire 2018) shows that SIMD inside one core can beat multicore parallelism for the right workload, a useful counterweight to “just add threads.”

Apple’s throughline (across WWDC 2017–2024 Core ML sessions, WWDC 2020–2024 Metal sessions, WWDC 2024 MLX sessions): use the highest-level framework that covers your case. Accelerate over hand-NEON. MPSGraph over hand-Metal. Core ML over MPSGraph when possible. Each level fuses, schedules, and dispatches across compute units in ways you would otherwise have to handcode. The cost is loss of control; the win is portability across generations and Apple’s ongoing tuning.

Dougall Johnson’s revelation: there’s a whole compute unit (AMX) reachable through one channel (Accelerate). Don’t try to bypass it. The reverse-engineered AMX instruction encoding exists in the public record, but using it from your code means embedding undocumented opcodes that Apple can break in any macOS release. The supported path is cblas_sgemm. Apple has signaled that the ANE-style “AMX successor” is unifying with the GPU in M5+ (“GPU neural accelerators” per WWDC25 #315) — another reason not to commit to private instruction sequences.
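For reference, the supported call looks like this. The sketch reaches `cblas_sgemm` through `ctypes` purely so the example stays self-contained; a Swift or C caller would import Accelerate and call the same function directly. The `matmul` wrapper and its pure-Python fallback are illustrative assumptions, and whether a given call actually lands on AMX is decided inside Accelerate, not by the caller.

```python
import ctypes
import ctypes.util
import sys

CBLAS_ROW_MAJOR = 101   # CblasRowMajor in cblas.h
CBLAS_NO_TRANS = 111    # CblasNoTrans in cblas.h


def _load_cblas_sgemm():
    """Return Accelerate's cblas_sgemm, or None when Accelerate is unavailable."""
    if sys.platform != "darwin":
        return None
    path = ctypes.util.find_library("Accelerate")
    if path is None:
        return None
    try:
        fn = ctypes.CDLL(path).cblas_sgemm
    except (OSError, AttributeError):
        return None
    fptr = ctypes.POINTER(ctypes.c_float)
    # order, transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc
    fn.argtypes = [ctypes.c_int] * 6 + [
        ctypes.c_float, fptr, ctypes.c_int,
        fptr, ctypes.c_int,
        ctypes.c_float, fptr, ctypes.c_int,
    ]
    fn.restype = None
    return fn


def matmul(a, b, m, k, n):
    """Row-major C = A @ B for flat lists: A has m*k values, B has k*n values."""
    if len(a) != m * k or len(b) != k * n:
        raise ValueError("shape mismatch")
    sgemm = _load_cblas_sgemm()
    if sgemm is None:
        # Portable reference path for machines without Accelerate.
        return [sum(a[i * k + p] * b[p * n + j] for p in range(k))
                for i in range(m) for j in range(n)]
    a_buf = (ctypes.c_float * (m * k))(*a)
    b_buf = (ctypes.c_float * (k * n))(*b)
    c_buf = (ctypes.c_float * (m * n))()
    sgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS,
          m, n, k, 1.0, a_buf, k, b_buf, n, 0.0, c_buf, n)
    return list(c_buf)


print(matmul([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2))
```

The point of the wrapper shape is that the caller never sees an instruction encoding: the only contract is the BLAS signature, which stays stable across chip generations.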

Awni Hannun / MLX team: even with all the higher-level frameworks available, the very highest performance for new architectures (transformers, attention variants) still requires hand-written Metal kernels. MLX exists because MPSGraph and Core ML couldn’t deliver on LLM-specific workloads at the time of authoring. The two-level model — high-level framework for 80% of work, custom Metal for the hot 20% — is the practical pattern.

chipsandcheese / philipturner: the Apple GPU’s character is bandwidth-bound, FP16-favoring, register-pressure-sensitive. The architecture is dual-dispatch from 2 SIMDs (a PowerVR legacy), 32 threads per SIMD-group, ~208 KB register file per core, ~8 KB L1 / ~1 MB L2. Every kernel optimization that matters on Apple Silicon traces back to these numbers (philipturner/metal-benchmarks).
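The register-pressure point follows from simple arithmetic on the figures above. The sketch below is a back-of-the-envelope bound, assuming a hypothetical kernel that keeps 64 live values per thread; it ignores any hardware cap on resident threads and any threadgroup-memory limit, so treat the result as an upper bound from the register file alone.

```python
REGISTER_FILE_BYTES = 208 * 1024   # ~208 KB per GPU core, figure quoted above
THREADS_PER_SIMDGROUP = 32


def resident_simdgroups(values_per_thread, bytes_per_value):
    """Upper bound on SIMD-groups per core if registers are the only limit."""
    bytes_per_thread = values_per_thread * bytes_per_value
    threads = REGISTER_FILE_BYTES // bytes_per_thread
    return threads // THREADS_PER_SIMDGROUP


# Hypothetical kernel holding 64 live values per thread.
fp32 = resident_simdgroups(values_per_thread=64, bytes_per_value=4)
fp16 = resident_simdgroups(values_per_thread=64, bytes_per_value=2)
print(fp32, fp16)  # 26 52
```

Halving the bytes per value doubles the number of SIMD-groups the core can keep resident, which is what gives the scheduler more work to hide memory latency behind on a bandwidth-bound kernel.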

Justine Tunney: llamafile’s tinyBLAS is a deliberate experiment in beating Accelerate without Apple’s proprietary libraries. It comes close on M2 Ultra (where bandwidth dominates), but on smaller chips the AMX path inside Accelerate still wins. The lesson: if you’re shipping a closed-source binary, depend on Accelerate; if you’re shipping an open-source library that must work without Xcode, you’ll trade some perf for portability.

Specific learnings for Locara

The Locara SDK’s compute primitives must default to Accelerate, not hand-rolled NEON. Apps that want matrix math get Locara.linalg.matmul(...) which delegates to cblas_sgemm — apps don’t write Accelerate calls directly, but they get AMX for free.
MLX-default-with-llama.cpp-fallback on Mac stays the right call (already documented in mlx.md). This note reinforces it — MLX bypasses MPSGraph and ANE because the LLM-specific tuning matters more than Apple’s general-purpose graph framework.
Don’t let apps reach for ANE for LLM work. The Locara runtime should refuse to compile a Core ML transformer with MLComputeUnits.all unless the manifest declares it as a known-ANE-friendly model (the small text-classification subset). The default is GPU via MLX/llama.cpp.
Provide a Locara.image surface for image work that maps to Core Image (multi-stage) and vImage (one-shot). Apps that need OCR / face detect / segmentation / image transforms don’t reach for Metal directly — they use these higher-level frameworks via Locara’s API, which lets the runtime route to ANE when supported (vision frameworks already do this).
VideoToolbox is the right answer for the Transcribe app’s “drop a video file” path. Hardware decode → audio track extraction → Whisper inference. Don’t bundle ffmpeg unless absolutely necessary; the hardware codec is sub-watt and ffmpeg software decode burns ~5W.
Document the seven compute units in the Locara developer docs. Most app authors will never have heard “AMX” or known that ANE is opaque. The note above is the source material; the docs version should be 1/4 the length but cover the same map.
Lint against hand-rolled NEON matmul. Static analysis at build time: if an app contains a loop pattern that looks like an N×K×M GEMM in scalar or NEON code, warn and suggest Locara.linalg.matmul. Catch the most common single perf footgun.
Provide a Locara.diagnostics.amxAvailable() helper. Per-chip variations exist (M3 Ultra has 2 AMX clusters; some E-clusters lack AMX). Apps that want to make smart choices, and that must handle edge cases such as Intel Macs with no AMX at all, need this introspection.
Cross-link to mac-llm-optimization.md for LLM-specific choices. The decision matrix in Part 8 of this note is the general answer; the LLM-specific deep dive lives there.
Warm up Metal pipelines + Core ML models at app launch. The Locara runtime should automatically pre-compile/warm any model referenced in the manifest. First-inference latency is the worst UX defect; eliminating it is a runtime concern, not an app concern.
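The hand-rolled GEMM lint from the list above could start as a very rough textual heuristic. The sketch below is an assumption about how such a check might look, not an existing Locara tool: `looks_like_gemm` is a hypothetical name, it only understands braced `for` loops in C-like source, and it only catches the scalar multiply-accumulate form. NEON intrinsics would need a separate pattern.

```python
import re

# Tokens of interest: a for-header start, braces, and a multiply-accumulate
# statement such as `acc += a[i] * b[j]` (kept on one line, no braces inside).
TOKEN_RE = re.compile(
    r"(?P<for>\bfor\s*\()|(?P<open>\{)|(?P<close>\})|(?P<mac>\+=[^;{}\n]*\*)"
)


def looks_like_gemm(source):
    """True if a multiply-accumulate sits inside three or more nested for-loops."""
    brace_depth = 0
    loop_depths = []      # brace depth of each enclosing braced for-loop body
    pending_for = False   # saw `for (` and still waiting for its `{`
    for tok in TOKEN_RE.finditer(source):
        kind = tok.lastgroup
        if kind == "for":
            pending_for = True
        elif kind == "open":
            brace_depth += 1
            if pending_for:
                loop_depths.append(brace_depth)
                pending_for = False
        elif kind == "close":
            if loop_depths and loop_depths[-1] == brace_depth:
                loop_depths.pop()
            brace_depth -= 1
        elif kind == "mac" and len(loop_depths) >= 3:
            return True
    return False
```

A real lint would work on the compiler's AST rather than on text, but even this crude version separates a triple-nested accumulate (a likely GEMM) from a single-loop AXPY, which is enough to drive a "consider Locara.linalg.matmul" warning.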

References

Dougall Johnson:

Apple AMX further reading:

Apple GPU microarchitecture:

philipturner/metal-benchmarks — register file, FMA throughput, occupancy
Apple, Explore GPU advancements in M3 and A17 Pro (Tech Talk #111375) — Dynamic Caching
Beyond3D forum, Apple Dynamic Caching on M3 GPU

WWDC sessions (canonical):

WWDC 2017 #703 Introducing Core ML
WWDC 2018 #708 What’s New in Core ML, Part 1 (+ #709 Part 2)
WWDC 2020 #10632 Optimize Metal Performance for Apple silicon Macs
WWDC 2021 #10152 Accelerate machine learning with Metal Performance Shaders Graph
WWDC 2021 #10153 Create image processing apps powered by Apple silicon
WWDC 2022 #10063 Accelerate machine learning with Metal — RNN, Philox, MPSGraph shared events
WWDC 2022 #10066 Squeeze the most out of Apple GPUs with Metal performance counters
WWDC 2022 #10106 Profile and optimize your game’s memory — Metal Untracked mode, MTLHeap
WWDC 2023 #10050 Optimize machine learning for Metal apps — MPSGraphPackage, bfloat16, FP8, mps-graph-tool
WWDC 2024 #10142 Get started with MLX for Apple silicon (+ later in WWDC25 #315)

Apple primary documentation:

Accelerate framework
vDSP (DSP routines, FFT, vector arithmetic)
BNNS (CPU NN inference)
Bringing the latest BNNS and ILP64 LAPACK functionality into your app
Metal Performance Shaders
Metal Performance Shaders Graph
Core ML and MLComputeUnits
Core Image
VideoToolbox
MTLDevice family/feature-set tables

Apple Research / ML:

Apple Research, Deploying Transformers on the Apple Neural Engine (2022)
apple/ml-ane-transformers on GitHub
ml-explore/mlx — Awni Hannun et al.

Warren Moore / Metal:

metalbyexample.com — Warren Moore’s blog
Metal Programming Guide (Pearson, Warren Moore)

Daniel Lemire:

Bill Dally:

Mike Ash:

Performance Comparisons of Common Operations, 2016 Edition — objc_msgSend cost, dispatch overhead

Hennessy & Patterson: Computer Architecture: A Quantitative Approach, 6e (2017), Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures and Chapter 7 Domain-Specific Architectures — the canonical academic frame for the Apple-stack choices above.

Justine Tunney:

llamafile and tinyBLAS — open-source alternative to Accelerate’s BLAS path

Charlie Miller, Dion Blazakis, Dino Dai Zovi: iOS Hacker’s Handbook (Wiley, 2012) — chapters on the iOS kernel and userland set the stage for understanding how IOKit/IOAccel clients reach the GPU and ANE from userspace at the lowest level.

Contested / version-dependent:

AMX instruction set details — reverse-engineered; Apple has never published it. New instructions added across M-series generations; treat opcode tables as “M1 baseline” unless verified on your chip.
AMX-vs-NEON power numbers (Eclectic Light’s ~8W vs ~2.5W for matrix on M3 Pro): chip- and microbenchmark-specific.
Whether AMX is per-cluster or per-core: Eclectic Light’s measurements indicate per-P-cluster on M1/M2/M3 (one AMX per cluster, shared by the P-cores in that cluster). M4 may differ; not independently verified.
ANE TOPS figures (~38 TOPS on M4): Apple’s marketing number; real workloads on ANE-friendly graphs reach perhaps 30–60% of peak (roughly 11–23 TOPS).
MPSGraph operator coverage: ships new ops each macOS release; “what runs on MPSGraph today” is a moving target — always check the runtime, never the docs alone.
Whether ANE will survive in current form past M5: WWDC 2025 hinted at GPU-resident “neural accelerators” that overlap with ANE’s role; the architecture is in flux.