macOS Memory Management — Architecture and Optimization for Native Apps

What this is: A reference for how macOS actually manages memory (VM subsystem, unified memory, pressure, compressor, swap, jetsam, wired memory) and the established best practices for native macOS apps to be good memory citizens, especially under the hostile conditions of a heavy local-AI workload.

Why it matters: Locara’s apps run alongside the user’s Chrome / Slack / Xcode. Bad memory citizenship doesn’t just make the app slow; it forces the OS into compression and swap, which beachballs everything else the user is doing. The user blames the local-AI app, not the OS. The dominant failure mode for a local-LLM platform on consumer hardware is “we technically fit, but we made the user’s machine miserable.”

Most relevant to Locara: Foundational. Pairs with mac-llm-optimization.md for the LLM-specific layer above this, mac-hardware-lineup.md for the per-SKU memory tiers, and tauri.md / mac-app-store-sandbox.md for the runtime shell.

Part 1 — How macOS memory management actually works

1.1 The VM subsystem (Mach, XNU, page tables)

macOS inherits its virtual-memory subsystem from Mach (Carnegie Mellon, 1980s), layered with BSD-style memory APIs. The combined kernel is XNU, open-sourced as apple-oss-distributions/xnu.

Address space: 64-bit on both Intel and Apple Silicon. User processes have a very large virtual address space; the first page (__PAGEZERO) is unmapped to catch NULL dereferences.
Page size:
- Intel Macs (x86_64): 4 KB pages.
- Apple Silicon Macs (arm64) + iOS/iPadOS: 16 KB pages. This is real, not “emulated”: getpagesize() and sysconf(_SC_PAGESIZE) return 16384 for native arm64 processes. Larger pages reduce TLB pressure and page-table overhead but increase internal fragmentation, because every mapping rounds up to a 16 KB multiple. Per Tristan Ross’s writeup (https://tristanxr.com/post/why-16k-page-size/), a single block mapping one level above the leaf page tables covers 32 MB on Apple Silicon vs 2 MB on Intel.
VM regions: Each process has an ordered list of vm_map_entry structures pointing at vm_objects (the backing store). vmmap <pid> shows them: __TEXT (read-only code, COW from disk), __DATA, __LINKEDIT, MALLOC_* (large/small/tiny/nano zones), STACK[], IOKit, SUBMAP (shared regions).
Mach VM API: mach_vm_allocate, mach_vm_map, mach_vm_deallocate, mach_vm_protect, mach_vm_inherit, mach_vm_purgable_control — all take a vm_map_t (the task port) and operate in 64-bit. Almost all higher-level allocators (malloc, libdispatch, NSData) call into these.
The MD/MI split: the machine-dependent pmap layer manages page tables, TLBs, and ASIDs; the machine-independent vm_map layer is portable across architectures.
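
The page-size point above is easy to verify at runtime. The following is a minimal Swift sketch, not taken from any Apple sample: it reads the process page size and rounds a length up to whole pages. `roundUpToPage` is a hypothetical helper name introduced here for illustration.

```swift
import Foundation

/// Rounds `length` up to the next multiple of `pageSize`.
/// Hypothetical helper; `pageSize` must be a power of two.
func roundUpToPage(_ length: Int, pageSize: Int) -> Int {
    precondition(pageSize > 0 && pageSize & (pageSize - 1) == 0,
                 "page size must be a power of two")
    return (length + pageSize - 1) & ~(pageSize - 1)
}

// getpagesize() reports the page size the kernel uses for this process.
let pageSize = Int(getpagesize())
print("page size: \(pageSize) bytes")
print("100_000 bytes occupy \(roundUpToPage(100_000, pageSize: pageSize)) bytes of whole pages")
```

The same rounding matters later for any API that insists on page-multiple lengths.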

1.2 Unified Memory Architecture (UMA) on Apple Silicon

Apple Silicon Macs put DRAM dies on the SoC package and expose a single physical pool to the CPU complex (P+E cores), the integrated GPU, the Apple Neural Engine (ANE), the media engines, and ISP. There is no discrete VRAM and no PCIe transit between CPU and GPU.

Concrete consequences for local AI:

No CPU→GPU copy with MTLResourceStorageMode.shared. The same physical page is mapped into both CPU and GPU address spaces. This is why llama.cpp’s Metal backend, MLX, and ggml-metal can run models the size of system RAM without a staging buffer.
GPU-wired memory: pages used by the GPU are wired (not pageable) for the lifetime of an in-flight command buffer. On Apple Silicon they live in the same DRAM as everything else, just with a “do not page” flag.
MTLDevice.recommendedMaxWorkingSetSize is Apple’s exposed ceiling — “an approximation of how much memory, in bytes, a GPU device can allocate without affecting its runtime performance” (https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize). In practice this is ~66–75% of physical RAM, depending on macOS version and total memory. Exceeding it doesn’t error — Metal will swap GPU resources, killing throughput.
Why this differs from a PC: on a discrete-GPU PC, cudaMemcpy()-style staging buffers and PCIe transfers are unavoidable. On UMA Macs they are unnecessary and actively wasteful — copying a .shared buffer to a .private buffer is a pessimization on Apple Silicon, even though it’s a good pattern on Intel Macs with discrete AMD GPUs.
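
To see where a given machine sits in the ~66–75% range, the ceiling can be read directly from Metal. This is a minimal sketch assuming a Mac with a default Metal device; `workingSetFraction` is a hypothetical helper, and the printed fraction is whatever the running macOS version reports, not a fixed constant.

```swift
import Foundation
import Metal

/// Fraction of physical RAM that Metal recommends as the GPU working-set ceiling.
/// Hypothetical helper for illustration.
func workingSetFraction(limitBytes: UInt64, physicalBytes: UInt64) -> Double {
    guard physicalBytes > 0 else { return 0 }
    return Double(limitBytes) / Double(physicalBytes)
}

if let device = MTLCreateSystemDefaultDevice() {
    let physical = ProcessInfo.processInfo.physicalMemory
    let limit = device.recommendedMaxWorkingSetSize
    print("device: \(device.name), unified memory: \(device.hasUnifiedMemory)")
    print("recommendedMaxWorkingSetSize: \(limit / (1 << 20)) MiB")
    print("fraction of physical RAM: \(workingSetFraction(limitBytes: limit, physicalBytes: physical))")
    print("allocated by this process so far: \(device.currentAllocatedSize) bytes")
} else {
    print("no Metal device available")
}
```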

1.3 Memory pressure (the signal your app must respond to)

macOS exposes a tri-state pressure level — NORMAL / WARN / CRITICAL — via DISPATCH_SOURCE_TYPE_MEMORYPRESSURE (https://developer.apple.com/documentation/dispatch/dispatch_source_type_memorypressure).

Sysctl: kern.memorystatus_vm_pressure_level (1=normal, 2=warn, 4=critical).
The XNU memorystatus subsystem (apple-oss-distributions/xnu, doc/vm/memorystatus_notify.md) describes thresholds roughly: WARN at ~50% available RAM, CRITICAL at ~29–40% — but these have shifted across releases. Treat as ranges, not constants.
The kernel uses a vm_pressure_thread and EVFILT_VM kevents under the hood (per Jonathan Levin’s newosxbook.com/articles/MemoryPressure.html).
A subtle gotcha that Quinn “The Eskimo!” repeatedly notes on Apple Developer Forums (e.g., thread 85474): the dispatch source does not always report transitions back to NORMAL. If your app launches while pressure is already elevated, you won’t get a “current state” event. Architect on the assumption that you’ll be told “warn” or “critical” but may never be told you’re out of the woods.
Apple’s standing recommendation (Quinn, thread 118867 “On Free Memory”): “Use standard techniques — NSCache, purgeable memory, DISPATCH_SOURCE_TYPE_MEMORYPRESSURE — rather than any free-memory value.” Don’t roll your own pressure detection by querying %-free; use the dispatch source.
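
Because the dispatch source may not deliver the current state at launch, one option is to read the sysctl named above once at startup to seed the app's state, then rely on the dispatch source for everything after that. The sketch below assumes the sysctl is readable by an unprivileged process and is int-sized; `PressureLevel` and `currentPressureLevel` are hypothetical names. It complements the dispatch source and is not a polling replacement for it.

```swift
import Darwin

/// The three levels named in this section, keyed by the sysctl's raw values.
enum PressureLevel: Int32 {
    case normal = 1, warn = 2, critical = 4
}

/// Reads kern.memorystatus_vm_pressure_level once. Returns nil if the sysctl
/// is unavailable or reports a value outside the documented set.
func currentPressureLevel() -> PressureLevel? {
    var raw: Int32 = 0
    var size = MemoryLayout<Int32>.size
    guard sysctlbyname("kern.memorystatus_vm_pressure_level", &raw, &size, nil, 0) == 0 else {
        return nil
    }
    return PressureLevel(rawValue: raw)
}

if let level = currentPressureLevel() {
    print("pressure at launch: \(level)")
} else {
    print("pressure level unavailable; assume elevated and wait for events")
}
```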

1.4 Memory compressor (WKdm)

OS X 10.9 Mavericks (2013) replaced swap-first behavior with in-RAM compression of inactive pages, using the Wilson–Kaplan WKdm algorithm.

What gets compressed: pages from the inactive list — anonymous dirty pages (heap, stack, zero-fill) belonging to processes that haven’t touched them recently.
Compression ratio: typically ~2× (about 50%). Activity Monitor reports this as the “Compressed” line.
Why WKdm: it compresses page-sized blocks fast — Apple’s marketing claim was that compress+decompress beats reading from disk. Modern SSDs narrow that gap, but the energy cost of memory I/O vs. CPU work still favors compression on battery.
Compressor + swap are layered, not alternatives. When the compressor pool fills, pages can be paged out from the compressed pool to swap.
iOS uses the same compressor but historically did not swap. iPadOS 16 added swap on M-series iPads (alongside Stage Manager), with narrower behavior than macOS. When iOS exceeds its memory budget, jetsam kills processes rather than swapping.

1.5 Swap behavior

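Location: /private/var/vm/swapfile0, swapfile1, … created on demand rather than at boot. dynamic_pager(8) historically managed them; recent releases leave swap-file creation to the kernel.
Sizing: historically, swap files started around 64 MB and doubled up to a cap, governed by the dynamic_pager flags -S file_size, -H hi_water, -L low_water. Recent releases typically create fixed-size files (about 1 GB each), so treat the sizes as version-dependent.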
Encryption: swap is encrypted on all modern Macs — unconditionally with FileVault, and even without FileVault on Apple Silicon (SEP-managed disk key, per the Apple Platform Security Guide).
“Memory Pressure” vs “Memory Used”: Apple’s UI ranks pressure on a green/yellow/red gauge based on compressor and swap activity, not on bytes used. A Mac can show 95% memory used and zero pressure (the bulk is reclaimable cached files). Conversely a Mac can be in red with 60% used if there’s lots of active churn. Pressure is the metric that matters; Used is not.

1.6 Jetsam (the prioritized OOM killer)

Jetsam is the memorystatus subsystem in XNU — the kernel’s prioritized OOM-killer. Name from “jettison.”

iOS: aggressive. The kernel kills user processes that exceed their per-app footprint cap or when pressure rises.
macOS: jetsam exists but is opt-in at the per-process level and primarily targets backgrounded “managed” daemons via RunningBoard (Eclectic Light, “What does RunningBoard do?”, https://eclecticlight.co/2025/07/15/what-does-runningboard-do-2-managed-apps/). A foreground macOS app will not be killed for memory pressure. It will just see swap and beachballs.
Priority bands (defined in kern_memorystatus.h; values as of recent XNU sources, so check the header for your target release): JETSAM_PRIORITY_IDLE=0, _BACKGROUND_OPPORTUNISTIC=2, _BACKGROUND=3, _FOREGROUND_SUPPORT=9, _FOREGROUND=10 (default for active apps), _AUDIO_AND_ACCESSORY=12, _HOME=16, _CRITICAL=19. Jetsam kills from band 0 upward.
What it tracks: not just RSS but physical footprint = anonymous RSS + compressed bytes + IOKit-mapped. Read via memorystatus_control() syscall #440 (MEMORYSTATUS_CMD_GET_*).
Per-process limits: set via MEMORYSTATUS_CMD_SET_JETSAM_HIGH_WATER_MARK. Crossing it triggers an EXC_RESOURCE exception; the process can either crash (default) or get a soft warning.
Jetsam event reports: macOS/iOS emit .ips jetsam reports when killing for memory; documented at https://developer.apple.com/documentation/xcode/identifying-high-memory-use-with-jetsam-event-reports.

1.7 Wired memory

“Wired” pages (vm_page_wire_count in the kernel) cannot be paged or compressed. Categories:

Kernel data structures, kernel extensions, file-system buffers.
IOKit mappings and DMA-able buffers — including Metal command buffers and resources allocated via MTLDevice. These get wired for the duration of GPU access.
Pages locked via mlock(2) (rarely allowed for unprivileged apps), vm_wire, or IOKit pinning.
Page-table pages themselves at high memory usage.

Activity Monitor’s “Memory Wired” sums all of these and is generally not addressable by user apps directly — but it grows as Metal allocations grow. On Apple Silicon at high GPU residency, wired memory can dominate.

1.8 Activity Monitor categories — what they actually mean

Per https://support.apple.com/guide/activity-monitor/view-memory-usage-actmntr1004/mac:

App Memory — anonymous dirty memory used by running apps.
Wired Memory — kernel-pinned, unavailable to apps.
Compressed — WKdm-compressed pool bytes.
Cached Files — file-backed pages still in RAM but reclaimable. Not “used” in the traditional sense; the OS can reclaim them instantly. High Cached Files is healthy, not concerning.
Swap Used — bytes paged to disk.

A Mac with 90% memory “used” but mostly Cached Files is in good shape. Pressure, not Used, drives apps to action.

1.9 iOS-style limits on Mac Catalyst / Designed for iPad

Native macOS apps: no per-process hard memory cap by default. Can use all of physical RAM + swap, subject to compressor/swap dynamics.
Mac Catalyst apps: behave like native macOS. os_proc_available_memory() returns 0, meaning “no limit” (Apple Developer Forums thread 724195).
“Designed for iPad” apps on Apple Silicon Macs: jetsam-style 16 GB cap per process, replicating an M-series iPad limit. They receive memory warnings at 16 GB even on a 96 GB Studio.
iOS / iPadOS native: per-app limits vary by device — ~5 GB on iPad Pro M1/M2 by default. iOS 15 introduced the com.apple.developer.kernel.increased-memory-limit entitlement to raise the cap closer to physical RAM for select apps.

1.10 What’s actually addressable at each RAM tier

A 16 GB MacBook Air has 16 GB physical, but the budget visible to your app is significantly less:

~0.5–1.5 GB kernel + system daemons at idle.
GPU working-set cap (recommendedMaxWorkingSetSize) ≈ ~66–75% of physical RAM. On a 16 GB Mac that’s ~10–12 GB of Metal-addressable memory.
Override (unsupported): sudo sysctl iogpu.wired_limit_mb=N on Sonoma+ raises the cap. Pre-Sonoma: debug.iogpu.wired_limit in bytes. Community LLM users do this routinely on 64+ GB Macs; Apple doesn’t endorse it.

For a Locara app, the practical app-memory budget on a 16 GB Mac is ~9–11 GB after accounting for the OS, the foreground apps the user already has open, and the GPU working-set headroom that lets Metal stay performant.
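
The GPU working-set arithmetic above generalizes to other RAM tiers. A small sketch, using only the ~66–75% rule of thumb quoted in this section (an approximation, not a published constant); `gpuWorkingSetRangeGB` is a hypothetical helper name.

```swift
/// GPU working-set range for a given RAM tier, using the ~66–75% rule of thumb.
/// The fractions are this document's approximations and vary by macOS version.
func gpuWorkingSetRangeGB(physicalGB: Double) -> ClosedRange<Double> {
    (physicalGB * 0.66)...(physicalGB * 0.75)
}

for tier in [8.0, 16.0, 32.0, 64.0] {
    let range = gpuWorkingSetRangeGB(physicalGB: tier)
    print("\(Int(tier)) GB Mac: roughly \(range.lowerBound) to \(range.upperBound) GB Metal-addressable")
}
```

For the 16 GB tier this reproduces the ~10–12 GB figure; the practical app budget is lower still once foreground apps are counted.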

Part 2 — Best practices to optimize macOS app memory usage

2.1 Diagnostic tools

Tool	What it shows
`vm_stat [interval]`	Mach VM counters: free, active, inactive, speculative, wired, throttled, compressed pages; pageins/pageouts; faults. Reports page size.
`top -o mem`	Per-process RSS, virtual size, compressed, dirty, purgeable, instructions retired. `-stats` to pick columns.
Activity Monitor	Pressure gauge + App/Wired/Compressed/Cached/Swap; per-process Real Memory, Memory Compressed, Real Private Memory.
`vmmap <pid>`	Per-region VM dump: address, size, prot, sharing, swapped/dirty/resident. `vmmap -summary` for totals.
`footprint <pid>`	High-level breakdown by category — “App Memory”, “Compressed”, “IOKit”, “Graphics”. Mirrors what jetsam/RunningBoard see. Introduced in WWDC21 #10180.
`heap <pid>`	Walks malloc zones; class histograms, allocation sizes. `heap -diffFrom file.memgraph` is the gold flow for growth diagnosis. WWDC24 #10173.
`leaks <pid>`	Cycle detection in malloc allocations. `--outputGraph file.memgraph` saves a memgraph.
`malloc_history`	Recorded allocation stacks. Set `MallocStackLogging=1` env var.
`memory_pressure`	Simulator and observer. `sudo memory_pressure -S -l warn -s 60` simulates warn for 60 s; `-l critical` simulates critical. Without sudo you only observe.
`sysdiagnose`	Full system snapshot — memgraph dumps of all processes, jetsam logs, spindumps. Invoked from Activity Monitor or via key combo.
Instruments Allocations	Live history of heap + VM allocations with call trees; generation marks for “what grew between A and B.”
Instruments VM Tracker	Periodic snapshots of all VM regions, separating dirty vs swapped/compressed.
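Instruments Leaks	Periodic root-tracing snapshot; reports allocations unreachable from any root, including unreachable retain cycles. Misses abandoned memory that is still reachable (for example a cycle anchored by a live observer); use Memory Graph Debugger for those.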
Xcode Memory Graph Debugger (Debug > Debug Memory Graph)	Live object graph for retain-cycle hunting.

Canonical WWDC sessions:

WWDC 2018 #416 “iOS Memory Deep Dive” — defines footprint = dirty + compressed.
WWDC 2021 #10180 “Detect and diagnose memory issues” — introduces footprint CLI and the memgraph flow.
WWDC 2022 #10106 “Profile and optimize your game’s memory” — Metal apps focus.
WWDC 2024 #10173 “Analyze heap memory” — heap -diffFrom.

2.2 Lifecycle responses to memory pressure

macOS native (AppKit): there is no applicationDidReceiveMemoryWarning. Use the dispatch source:

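import Dispatch

// Placeholder hooks; replace with the app's own cache and model teardown.
func shrinkCaches() { /* evict rebuildable caches */ }
func dropAll() { /* release inference state */ }

// Keep `src` alive (for example as a property); a released source stops delivering events.
let src = DispatchSource.makeMemoryPressureSource(
    eventMask: [.warning, .critical],
    queue: .global()
)
src.setEventHandler {
    let event = src.data  // OptionSet: test with contains(), not equality
    if event.contains(.critical) {
        shrinkCaches(); dropAll()
    } else if event.contains(.warning) {
        shrinkCaches()
    }
}
src.resume()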

Mac Catalyst / UIKit: UIApplicationDelegate.applicationDidReceiveMemoryWarning(_:) and UIViewController.didReceiveMemoryWarning() work as on iOS.

NSCache auto-evicts under pressure because it listens to the dispatch source itself (NSHipster, https://nshipster.com/nscache/). Use NSCache, not your own dictionary, for caches that should be discardable.
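
A minimal sketch of the NSCache pattern for rebuildable data follows. The limits are assumed values chosen for illustration, and `cachedBlob` is a hypothetical helper; NSCache treats both limits as advisory and may evict at any time, so callers must always be able to rebuild.

```swift
import Foundation

// Cache of rebuildable blobs (for example tokenizer outputs), bounded by cost.
let blobCache = NSCache<NSString, NSData>()
blobCache.totalCostLimit = 64 * 1024 * 1024  // advisory byte ceiling (assumed value)
blobCache.countLimit = 1_000                 // advisory entry ceiling (assumed value)

/// Returns the cached blob for `key`, rebuilding and re-caching it on a miss.
func cachedBlob(for key: String, rebuild: () -> Data) -> Data {
    let nsKey = NSString(string: key)
    if let hit = blobCache.object(forKey: nsKey) {
        return Data(referencing: hit)
    }
    let fresh = rebuild()
    blobCache.setObject(NSData(data: fresh), forKey: nsKey, cost: fresh.count)
    return fresh
}

let tokens = cachedBlob(for: "prompt-42") { Data([1, 2, 3]) }
print("blob has \(tokens.count) bytes")
```

Passing the byte count as the cost lets totalCostLimit approximate a memory budget rather than an entry count.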

NSProcessInfo activity assertions disable App Nap and reduce sudden-termination eligibility while inference runs:

let token = ProcessInfo.processInfo.beginActivity(
    options: [.userInitiated, .latencyCritical, .idleSystemSleepDisabled],
    reason: "Local LLM inference"
)
// ... must hold strong reference; release when done.
ProcessInfo.processInfo.endActivity(token)

Reference: Jeff Johnson, “Prevent App Nap Programmatically” (https://lapcatsoftware.com/articles/prevent-app-nap.html).

2.3 Allocation patterns

Avoid heap fragmentation. macOS’s default malloc serves requests from size-classed regions (nano, tiny, small, large) with per-CPU magazines; large allocations (above roughly 128 KB) bypass the small-block allocator and get their own anonymous VM regions. Mixing many short-lived large objects with long-lived small ones creates fragmentation.
Pool large transient buffers. Reuse a single NSMutableData or MTLBuffer rather than allocate-deallocate per frame.
Tag VM regions with VM_MAKE_TAG. When calling mmap() directly with MAP_ANON, encode a tag into the fd argument: mmap(addr, size, prot, flags | MAP_ANON, VM_MAKE_TAG(MY_TAG), 0). vm_statistics.h predefines tags such as VM_MEMORY_MALLOC and VM_MEMORY_COREGRAPHICS and reserves VM_MEMORY_APPLICATION_SPECIFIC_1 through _16 (240–255) for app use; vmmap and Instruments group regions by tag, which makes large allocations attributable in profilers.
Prefer mmap over read for files bigger than a few MB (see 2.6 for the LLM case).
Wrap tight loops in @autoreleasepool (autoreleasepool { … } in Swift) to bound peak RSS. Without this, autoreleased allocations inside the loop accumulate until the enclosing pool drains, typically at the end of the current run-loop iteration.
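
The VM-tagging item above can be sketched in Swift as follows. VM_MAKE_TAG is a C macro and is not imported into Swift, so the sketch reimplements the shift by hand; `vmMakeTag` is a hypothetical helper, and tag 240 is assumed to be the first application-specific tag. How the region is labeled in vmmap output depends on the tool version.

```swift
import Darwin

/// Swift stand-in for the VM_MAKE_TAG macro: the tag goes in the top byte
/// of the value passed in mmap's fd slot.
func vmMakeTag(_ tag: UInt32) -> Int32 {
    Int32(bitPattern: tag << 24)
}

let appTag: UInt32 = 240                    // VM_MEMORY_APPLICATION_SPECIFIC_1
let regionLength = 4 * Int(getpagesize())

// Anonymous private mapping; with MAP_ANON the fd argument carries the tag.
let region = mmap(nil, regionLength, PROT_READ | PROT_WRITE,
                  MAP_ANON | MAP_PRIVATE, vmMakeTag(appTag), 0)
if region == MAP_FAILED {
    perror("mmap")
} else {
    // ... use the buffer; inspect with `vmmap <pid>` while it is mapped ...
    munmap(region, regionLength)
}
```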

2.4 Bundle vs heap memory (critical insight for LLM apps)

Executable code (__TEXT segment) is clean, file-backed from the code-signed bundle. Under pressure, the kernel can evict it and re-fetch from the bundle. Counts as “Cached Files,” not “App Memory.”
Embedded resources accessed via NSData(contentsOf:options:) with .alwaysMapped, or via mmap() directly, are also file-backed and behave the same way.
Critical for local AI: 4 GB of model weights mapped from a code-signed .app bundle or any file on disk counts as Cached Files, not App Memory. They are not “your dirty RAM” even though they are resident — exactly the property llama.cpp leverages (Justine Tunney, “Edge AI Just Got Faster”, https://justine.lol/mmap/).
Loading the same data with Data(contentsOf:) (no .alwaysMapped) creates a dirty anonymous copy in App Memory. Always pass .mappedIfSafe or .alwaysMapped when reading large model files.
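
The mapped-read advice above looks like this in practice. A minimal sketch: a small temporary file stands in for model weights, and `loadMapped` is a hypothetical helper name.

```swift
import Foundation

/// Reads a file as memory-mapped Data so resident pages stay clean and file-backed.
func loadMapped(_ url: URL) throws -> Data {
    try Data(contentsOf: url, options: .alwaysMapped)
}

// Demonstration with a small temporary file standing in for model weights.
let demoURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("weights-demo-\(UUID().uuidString).bin")
try Data(repeating: 0xAB, count: 64 * 1024).write(to: demoURL)

let weights = try loadMapped(demoURL)
print("mapped \(weights.count) bytes, first byte \(weights[0])")

// Unlinking is safe here: the mapping keeps the file contents reachable.
try FileManager.default.removeItem(at: demoURL)
```

With .mappedIfSafe, Foundation may fall back to a plain read for files it considers unsafe to map (for example on removable volumes), so .alwaysMapped is the stricter choice when the footprint property matters.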

2.5 Purgeable memory

Two layers:

NSPurgeableData / NSDiscardableContent (Cocoa) — refcounted access tokens:
```
if ([purg beginContentAccess]) {
    /* safe to use bytes */
    [purg endContentAccess];
} else {
    /* must regenerate */
}
[purg discardContentIfPossible]; // frees only if refcount == 0
```
When refcount hits zero, the VM may discard the bytes under pressure. NSCache with evictsObjectsWithDiscardedContent = YES integrates with this protocol.
Mach VM purgeable allocations — lower level. mach_vm_allocate with VM_FLAGS_PURGABLE, then mach_vm_purgable_control() toggles VM_PURGABLE_VOLATILE / VM_PURGABLE_NONVOLATILE. Volatile regions are first to be reclaimed; when reclaimed, they are discarded (zero-filled on next touch), not paged out.

Best for: image thumbnails, tokenizer caches, JIT-compiled kernels, anything regenerable from a deterministic source.

2.6 mmap vs read for large files — THE critical pattern for LLM weights

The default llama.cpp / ggml / MLX approach:

open() the model file (typically .gguf for llama.cpp, .safetensors for MLX).
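mmap(NULL, fileSize, PROT_READ, MAP_SHARED, fd, 0) establishes the virtual mapping. No file data is read yet, and page-table entries are filled in lazily as pages fault in. Even fully populated, the leaf page tables stay small: at 8 bytes per entry, a 20 GB model needs about 10 MB with 16 KB pages (about 40 MB with 4 KB pages on Intel).
As inference touches tensors, the kernel page-faults them in. On Apple Silicon, the same pages are simultaneously visible to the Metal GPU via shared storage mode if the buffer was created with MTLDevice.makeBuffer(bytesNoCopy:length:options: .storageModeShared, deallocator:). That call requires a page-aligned pointer and a length that is a multiple of the page size.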
Under pressure, the kernel can evict clean mmap’d pages without writing them anywhere — they’re re-fetched from disk on next touch.

References: Justine Tunney’s mmap writeup, plus llama.cpp discussions #638 and #9999 (https://github.com/ggml-org/llama.cpp/discussions/638, https://github.com/ggml-org/llama.cpp/discussions/9999).

Tuning hints:

posix_madvise(ptr, len, POSIX_MADV_SEQUENTIAL) for the model-load pass — triggers readahead.
posix_madvise(ptr, len, POSIX_MADV_RANDOM) for the inference pass when only some experts are active (sparse MoE).
The BSD-specific madvise() adds MADV_FREE_REUSE / MADV_FREE_REUSABLE (Apple-only) which lets malloc cooperatively hand pages back to the kernel.
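
The open/mmap/madvise sequence described above can be sketched in Swift as follows. This is an illustration of the pattern, not llama.cpp's actual loader; `mapModelFile` is a hypothetical name, and the hint is advisory, so the kernel is free to ignore it.

```swift
import Foundation

/// Maps a file read-only and hints sequential access for the first pass.
/// Returns nil on any failure. The caller must munmap the region when done.
func mapModelFile(path: String) -> UnsafeRawBufferPointer? {
    let fd = open(path, O_RDONLY)
    guard fd >= 0 else { return nil }
    defer { close(fd) }  // the mapping stays valid after the descriptor is closed

    var info = stat()
    guard fstat(fd, &info) == 0, info.st_size > 0 else { return nil }
    let length = Int(info.st_size)

    guard let base = mmap(nil, length, PROT_READ, MAP_SHARED, fd, 0),
          base != MAP_FAILED else { return nil }

    // Advisory only. Switch to POSIX_MADV_RANDOM before sparse access patterns.
    _ = posix_madvise(base, length, POSIX_MADV_SEQUENTIAL)
    return UnsafeRawBufferPointer(start: base, count: length)
}
```

A caller would keep the returned buffer for the model's lifetime and release it with munmap(UnsafeMutableRawPointer(mutating: buffer.baseAddress), buffer.count) on unload.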

Beware RSS reporting: when llama.cpp says it’s using 5.8 GB to host a 30 B model, that’s because mmap’d pages don’t all materialize in RSS until touched. The true “is this fitting” metric is physical footprint (use footprint <pid>) + pressure, not top’s RSS.
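
An app can also read its own physical footprint in-process rather than shelling out to the footprint tool. The sketch below uses task_info with the TASK_VM_INFO flavor; `physicalFootprint` is a hypothetical helper name, and the value is assumed to correspond to the headline number the footprint tool reports.

```swift
import Darwin

/// Returns this process's physical footprint in bytes, or nil on failure.
func physicalFootprint() -> UInt64? {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<natural_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) { infoPtr in
        // task_info expects the struct viewed as an array of integer_t.
        infoPtr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) { intPtr in
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), intPtr, &count)
        }
    }
    guard kr == KERN_SUCCESS else { return nil }
    return UInt64(info.phys_footprint)
}

if let bytes = physicalFootprint() {
    print("physical footprint: \(bytes / (1 << 20)) MiB")
}
```

Sampling this before and after mapping a model makes the clean-versus-dirty distinction visible: the footprint should move far less than RSS does.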

2.7 Metal storage modes (Apple-Silicon-specific)

MTLResourceStorageMode values:

Mode	macOS Intel (discrete GPU)	macOS Apple Silicon	Use case
`.shared`	Single copy in system RAM; the GPU reads it across PCIe (slow for GPU-heavy access)	CPU+GPU share the same physical page	Default on Apple Silicon for almost everything.
`.managed`	Two synchronized copies (VRAM + RAM); `didModifyRange:` / `synchronize:` required	Avoid; behaves like shared with overhead	Discrete-GPU only
`.private`	Pure VRAM	GPU-only pages in unified DRAM (no CPU mapping)	Render targets, intermediates, weights the CPU never touches
`.memoryless`	n/a	Render-target textures only (Apple GPUs, macOS 11+); contents live in on-chip tile memory	Tile-local render attachments

Apple-Silicon rules:

Use .shared as the default for any buffer the CPU needs to populate or read. Zero overhead.
Use .private only when the GPU is the sole consumer for the lifetime — e.g., intermediate activations during inference that you never copy back. Lets the GPU use optimal cache modes.
MTLResourceHazardTrackingModeUntracked lets you skip the driver’s automatic dependency tracking when you do your own fence/event management — a real win for inference loops (WWDC 22 #10106).
MTLHeap lets you reuse physical memory across MTLTexture/MTLBuffer resources whose lifetimes don’t overlap, dramatically cutting transient peak (Apple, “Reducing the memory footprint of Metal apps”).
recommendedMaxWorkingSetSize is the effective ceiling. Above it, GPU resources may be swapped out — performance cliff.
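
The storage-mode rules above map onto buffer creation roughly as follows. This is a minimal sketch assuming a Mac with a default Metal device; the buffer sizes are arbitrary, and the anonymous mapping stands in for an mmap'd weight file. If the no-copy wrap fails, the sketch does not clean up the mapping.

```swift
import Foundation
import Metal

let metalPageSize = Int(getpagesize())
let bufferLength = 4 * metalPageSize  // bytesNoCopy needs a page-multiple length

if let device = MTLCreateSystemDefaultDevice() {
    // CPU-populated data: shared storage, one physical copy on Apple Silicon.
    let input = device.makeBuffer(length: bufferLength, options: .storageModeShared)
    input?.contents().storeBytes(of: 1.0 as Float, as: Float.self)

    // GPU-only scratch: private storage, hazard tracking left to the app.
    let scratch = device.makeBuffer(
        length: bufferLength,
        options: [.storageModePrivate, .hazardTrackingModeUntracked])

    // Wrap existing page-aligned memory without copying it.
    if let raw = mmap(nil, bufferLength, PROT_READ | PROT_WRITE,
                      MAP_ANON | MAP_PRIVATE, -1, 0), raw != MAP_FAILED {
        let wrapped = device.makeBuffer(
            bytesNoCopy: raw, length: bufferLength, options: .storageModeShared,
            deallocator: { pointer, length in _ = munmap(pointer, length) })
        print("wrapped buffer: \(wrapped?.length ?? 0) bytes")
    }
    print("shared: \(input?.length ?? 0) bytes, private: \(scratch?.length ?? 0) bytes")
}
```

Untracked resources require the app to order GPU work itself with fences or events, so that option only pays off once the inference loop's dependencies are explicit.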

2.8 Background app suspension

App Nap (10.9+): an app gets napped when (a) it isn’t frontmost and its windows aren’t visible, (b) it hasn’t drawn for some seconds, (c) it isn’t playing audio, and (d) it holds no IOKit power assertion or NSProcessInfo activity assertion (Eclectic Light, “Did that app quit, or is it just napping?”). Napped apps have their timers coalesced more aggressively and may have their CPU throttled.
Sudden Termination: declared with NSSupportsSuddenTermination in Info.plist, then bracketed around critical work with NSProcessInfo.disableSuddenTermination() / .enableSuddenTermination(). Apps that opt in tell the OS “you can SIGKILL me without graceful shutdown,” which mainly speeds up logout and shutdown. Useful for stateless background services; dangerous for editors.
Automatic Termination (NSProcessInfo.disableAutomaticTermination(_:)): the OS may quit an idle app with no unsaved state. Requires NSSupportsAutomaticTermination in Info.plist.
Under memorystatus jetsam: opted-in processes (RunningBoard-managed) can be killed at low priority bands.

For a local-LLM app: hold an NSProcessInfo.beginActivity assertion while inference runs; release it when idle. Hold a strong reference to the returned token.

2.9 Common anti-patterns

Strong self capture in closures. Use [weak self] or [unowned self]. Mike Ash, “Dealing With Retain Cycles” (mikeash.com, Friday Q&A 2010-04-30).
Unbounded autorelease in tight loops. Wrap with autoreleasepool { … }.
Observers that outlive their targets. Block-based NotificationCenter observers retain captured state until explicitly removed. A Combine sink closure that captures self strongly does the same when self also stores the returned AnyCancellable.
Holding Data(contentsOf:) for huge files when .mappedIfSafe would suffice.
Using Dictionary/Array as a cache instead of NSCache — they don’t respond to pressure.
Allocating MTLBuffers per draw call. Pool them.
Loading PNG/HEIC as NSImage repeatedly instead of using ImageIO with kCGImageSourceShouldCache = false.
Background threads that never drain an autorelease pool. Autoreleased Cocoa objects created on a long-lived worker thread accumulate until the thread exits unless the work is wrapped in autoreleasepool { … }.
Misreading top RSS for mmap’d files (see 2.6).
Using .managed storage on Apple Silicon when .shared is what you want.

2.10 Profiling memory leaks

Instruments Allocations + Leaks workflow:

Allocations records every malloc/free with stack. Set MallocStackLogging=1 in scheme env vars for full backtraces.
Generation marks: click “Mark Generation” between two app states; the next pane shows only objects allocated between the marks. Catches “this view controller never released” bugs.
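Leaks is a periodic root-trace; any allocation unreachable from roots is a leak, including retain cycles that nothing else references. It does not catch abandoned memory that is still reachable from a root (a cache that only grows, or a cycle anchored by a live observer); that’s the job of generation marks and the Memory Graph Debugger.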
leaks --outputGraph file.memgraph + heap/vmmap/malloc_history on the file is the recommended offline flow for production samples (WWDC 21 #10180).

Retain cycles: Swift/ObjC ARC means most cycles arise from closures or delegates. Use [weak self] in closures and weak var delegate: for delegate properties. The Memory Graph Debugger surfaces these visually.

Practical playbook for a local-AI Mac app

mmap your weights with MAP_SHARED (POSIX), then wrap the mapping for the GPU with MTLDevice.makeBuffer(bytesNoCopy:length:options: .storageModeShared, deallocator:). Pass .alwaysMapped to NSData/Data reads.
Set posix_madvise hints: _SEQUENTIAL during load, _RANDOM during inference if access is sparse.
Allocate GPU resources with .shared storage mode. Reserve .private for things the CPU never touches.
Subscribe to DISPATCH_SOURCE_TYPE_MEMORYPRESSURE at startup. On WARN, drop non-essential caches or LRU-evict KV history. On CRITICAL, release inference state and notify the UI.
Use NSCache for any data you can rebuild (tokenizer outputs, embeddings) — not custom dictionaries.
Hold an NSProcessInfo.beginActivity token while inference runs.
Test under simulated pressure: sudo memory_pressure -S -l warn -s 60, then -l critical. Confirm your app survives without crashing.
Profile with the memgraph flow (leaks --outputGraph on a healthy run; then again on a suspected leak; heap -diffFrom).
Don’t trust RSS for mmap’d models. Use footprint <pid> for the metric jetsam actually uses.
Document recommendedMaxWorkingSetSize as your effective ceiling, and surface to power users that bumping iogpu.wired_limit_mb is an unsupported workaround for very large models.

Specific learnings for Locara

The Locara runtime should subscribe to memory pressure on behalf of every app. Apps run inside the runtime; if pressure rises, the runtime can evict KV cache, drop secondary models, and notify the app via a capability — rather than every app reinventing this. Per the entitlement model in mac-app-store-sandbox.md, the runtime owns system signals, the app receives normalized events.
The model loader must use mmap and .shared storage by default. Anything else is an immediate footgun on 16 GB Macs. Locara’s model-loading primitive should expose only the mmap path; a --no-mmap escape hatch is a foot-shotgun.
Quantize KV cache by default for any context past 8K. The math from llm-memory-math.md shows KV cache at 128K context exceeds the weights for most models. The runtime should default to q8_0 K/V quantization with an opt-out for quality-sensitive apps.
Refuse to install on Intel Macs. No UMA, no MLX, no Metal-shared-storage benefit. Locara v1 should detect (sysctl machdep.cpu.brand_string contains “Apple”) and refuse with a clear message.
Don’t pretend “Designed for iPad” on Mac is a fallback. It’s capped at 16 GB per process; Locara’s larger-model apps would hit the wall. Native macOS app target only.
Surface footprint numbers in app dev tools, not RSS. Reviewers and app authors will look at “memory used” — if Locara’s dev tooling shows RSS, they’ll panic about mmap’d model files that aren’t actually dirty. Show footprint and “compressed” separately, and explain Cached Files.
Memory tier is a permanent property of the user. Apple Silicon RAM is soldered. Locara’s onboarding should profile the user’s machine once, persist the result, and gate app install on it. Don’t surprise the user at runtime that “your Mac can’t run this app” after they’ve downloaded.
The runtime should call NSProcessInfo.beginActivity during model load and inference, with .userInitiated + .latencyCritical flags. Without this, App Nap can kick in mid-inference if the user backgrounds the window.
Document the iogpu.wired_limit_mb workaround for power users, but never apply it programmatically. Apple does not support raising the limit and the consequences (system instability, kernel panic in extreme cases) are real. Provide it as a documented manual step for the user who wants to push their 192 GB Mac Studio to run a 175B model.
Mac Catalyst is not the path. Catalyst avoids the 16 GB cap that “Designed for iPad” apps hit (see 1.9), but Locara’s apps need full macOS memory semantics: the full NSProcessInfo API and full dispatch-source pressure handling. Native AppKit (or Tauri with native AppKit shell) is the only viable path.
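
The KV-cache claim in the list above can be sanity-checked with back-of-envelope arithmetic. The sketch below uses illustrative dimensions for an 8B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128); these numbers are assumptions for the example, not values taken from llm-memory-math.md, and `kvCacheBytes` is a hypothetical helper.

```swift
/// KV-cache size in bytes:
/// 2 (K and V) x layers x KV heads x head dim x context tokens x bytes per element.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  context: Int, bytesPerElement: Double) -> Double {
    2.0 * Double(layers) * Double(kvHeads) * Double(headDim)
        * Double(context) * bytesPerElement
}

let contextTokens = 128 * 1024
let f16Bytes = kvCacheBytes(layers: 32, kvHeads: 8, headDim: 128,
                            context: contextTokens, bytesPerElement: 2.0)
// q8_0 stores 32 int8 values plus one 2-byte scale per block: 34 bytes per 32 elements.
let q8Bytes = kvCacheBytes(layers: 32, kvHeads: 8, headDim: 128,
                           context: contextTokens, bytesPerElement: 34.0 / 32.0)

let gib = Double(1 << 30)
print("f16 KV cache at 128K context: \(f16Bytes / gib) GiB")
print("q8_0 KV cache at 128K context: \(q8Bytes / gib) GiB")
```

Under these assumptions the f16 cache comes to 16 GiB and the q8_0 cache to 8.5 GiB, which is why the quantized default roughly halves the long-context cost.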

References

Apple primary documentation:

DISPATCH_SOURCE_TYPE_MEMORYPRESSURE — https://developer.apple.com/documentation/dispatch/dispatch_source_type_memorypressure
DISPATCH_MEMORYPRESSURE_CRITICAL — https://developer.apple.com/documentation/dispatch/dispatch_memorypressure_critical
Memory and Virtual Memory (Kernel Programming Guide) — https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/vm/vm.html
Viewing Virtual Memory Usage — https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/VMPages.html
Caching and Purgeable Memory — https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/CachingandPurgeableMemory.html
TN2434 Minimizing your app’s Memory Footprint — https://developer.apple.com/library/archive/technotes/tn2434/_index.html
Choosing a resource storage mode for Apple GPUs — https://developer.apple.com/documentation/metal/choosing-a-resource-storage-mode-for-apple-gpus
Reducing the memory footprint of Metal apps — https://developer.apple.com/documentation/metal/reducing-the-memory-footprint-of-metal-apps
recommendedMaxWorkingSetSize — https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize
os_proc_available_memory — https://developer.apple.com/documentation/os/3191911-os_proc_available_memory
Identifying high-memory use with jetsam event reports — https://developer.apple.com/documentation/xcode/identifying-high-memory-use-with-jetsam-event-reports
View memory usage in Activity Monitor — https://support.apple.com/guide/activity-monitor/view-memory-usage-actmntr1004/mac

WWDC sessions (most useful for this domain):

WWDC 2018 #416 “iOS Memory Deep Dive” — https://developer.apple.com/videos/play/wwdc2018/416/
WWDC 2020 “Explore the new system architecture of Apple silicon Macs” — UMA introduction
WWDC 2021 #10180 “Detect and diagnose memory issues” — footprint, memgraph CLI flow
WWDC 2021 #10254 “Tune CPU job scheduling for Apple silicon games” (P/E core scheduling)
WWDC 2022 #10106 “Profile and optimize your game’s memory” — Metal apps
WWDC 2024 #10173 “Analyze heap memory” — heap -diffFrom

XNU / open-source kernel:

xnu source — https://github.com/apple-oss-distributions/xnu
memorystatus_notify docs — https://github.com/apple-oss-distributions/xnu/blob/main/doc/vm/memorystatus_notify.md
kern_memorystatus.h (priority bands)

Authors and resources worth citing directly:

Quinn “The Eskimo!” — Apple Developer Forums. Threads 85474 (pressure level reporting quirks), 118867 (“On Free Memory”), 805580 (avoid querying % memory), 724195 (iPad-on-Mac 16 GB limit).
Mike Ash — mikeash.com/pyblog/ Friday Q&A series. Especially 2011-09-30 ARC, 2010-04-30 Retain Cycles, 2012-02-17 weak references.
Saagar Jha — saagarjha.com and github.com/saagarjha (VirtualApple; HN comments on macOS swap and Apple Silicon internals).
Jeff Johnson — lapcatsoftware.com/articles/prevent-app-nap.html and many other critical macOS resource-management writings.
Marcin Krzyżanowski — “Swift Runtime Performance” (blog.krzyzanowskim.com); Swift Forums on memory pools.
Jonathan Levin — *MacOS and OS Internals Vol I (User Mode) and Vol II (Kernel Mode) — canonical references for memorystatus, jetsam, sysctls, memorystatus_control() syscall. Also newosxbook.com (especially articles/MemoryPressure.html).
Amit Singh — Mac OS X Internals: A Systems Approach — older but still the best dead-tree XNU VM overview.
Hillegass & Preble — Cocoa Programming for Mac OS X 4e, memory management chapter.
Dalrymple & Hillegass — Advanced Mac OS X Programming, Mach VM chapter.

Community / industry blogs:

AppleInsider, “Compressed Memory in OS X 10.9 Mavericks” — https://appleinsider.com/articles/13/06/13/compressed-memory-in-os-x-109-mavericks-aims-to-free-ram-extend-battery-life
Justine Tunney, “Edge AI Just Got Faster” — https://justine.lol/mmap/
llama.cpp Discussion #638 (mmap design) — https://github.com/ggml-org/llama.cpp/discussions/638
llama.cpp Discussion #9999 (mmap RSS reporting) — https://github.com/ggml-org/llama.cpp/discussions/9999
Greggant, “How Memory Works in macOS” — https://blog.greggant.com/posts/2024/07/03/macos-memory-management.html
Eclectic Light Co. — “What does RunningBoard do?”, “Did that app quit, or is it just napping?”, “Why macOS has to change…”
NSHipster, “NSCache” — https://nshipster.com/nscache/
Greg Stencel, “Apple silicon limitations with usage on local LLM” — https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm.html
ivanopcode/devnote-override-macos-metal-vram-cap on GitHub — the iogpu.wired_limit_mb workaround
Tristan Ross, “Why 16k page size matters” — https://tristanxr.com/post/why-16k-page-size/
Eternal Storms, “Mac Developer Tip: How to Simulate Memory Pressure” — https://eternalstorms.wordpress.com

Contested / version-dependent:

Exact pressure WARN/CRITICAL thresholds (have changed across releases — subscribe, don’t hard-code).
GPU working-set fraction (~66% early Apple Silicon, ~75–80% on Sonoma+).
iogpu.wired_limit_mb (Sonoma+) vs debug.iogpu.wired_limit (pre-Sonoma).
iOS per-app memory limits (device- and OS-version-specific; not fully published).
Compressor algorithm details (confirmed WKdm in 2013; Apple has not publicly disclosed changes since).