macOS Memory Management — Architecture and Optimization for Native Apps
What this is: A reference for how macOS actually manages memory (VM subsystem, unified memory, pressure, compressor, swap, jetsam, wired memory) and the established best practices for native macOS apps to be good memory citizens — especially under the hostile conditions of a heavy local-AI workload.
Why it matters: Locara’s apps run alongside the user’s Chrome / Slack / Xcode. Bad memory citizenship doesn’t just make the app slow — it forces the OS into compression and swap, which beachballs everything else the user is doing. The user blames the local-AI app, not the OS. The dominant failure mode for a local-LLM platform on consumer hardware is “we technically fit, but we made the user’s machine miserable.”
Most relevant to Locara: Foundational. Pairs with mac-llm-optimization.md for the LLM-specific layer above this, mac-hardware-lineup.md for the per-SKU memory tiers, and tauri.md / mac-app-store-sandbox.md for the runtime shell.
Part 1 — How macOS memory management actually works
1.1 The VM subsystem (Mach, XNU, page tables)
macOS inherits its virtual-memory subsystem from Mach (Carnegie Mellon, 1980s), layered with BSD-style memory APIs. The combined kernel is XNU, open-sourced as apple-oss-distributions/xnu.
- Address space: 64-bit on both Intel and Apple Silicon. User processes have a very large virtual address space; the first page (__PAGEZERO) is unmapped to catch NULL dereferences.
- Page size:
  - Intel Macs (x86_64): 4 KB pages.
  - Apple Silicon Macs (arm64) + iOS/iPadOS: 16 KB pages. This is real, not “emulated”: getpagesize() and sysconf(_SC_PAGESIZE) return 16384. Larger pages reduce TLB pressure and page-table overhead but increase the internal-fragmentation cost per allocation. Per Tristan Ross’s writeup (https://tristanxr.com/post/why-16k-page-size/), a page-table block mapping (a level-2 block entry with the 16 KB granule) covers 32 MB on Apple Silicon vs 2 MB on Intel.
- VM regions: Each process has an ordered list of vm_map_entry structures pointing at vm_objects (the backing store). vmmap <pid> shows them: __TEXT (read-only code, COW from disk), __DATA, __LINKEDIT, MALLOC_* (large/small/tiny/nano zones), STACK[], IOKit, MALLOC_NANO, SUBMAP (shared regions).
- Mach VM API: mach_vm_allocate, mach_vm_map, mach_vm_deallocate, mach_vm_protect, mach_vm_inherit, mach_vm_purgable_control. All take a vm_map_t (the task port) and operate in 64-bit. Almost all higher-level allocators (malloc, libdispatch, NSData) call into these.
- The MD/MI split: the machine-dependent pmap layer manages page tables, TLBs, and ASIDs; the machine-independent vm_map layer is portable across architectures.
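A minimal sketch of what the page-size difference means for allocation code: it reads the runtime page size and rounds a requested length up to whole pages. The helper name roundUpToPage is hypothetical, not an Apple API.

```swift
import Foundation

/// Rounds `length` up to the next multiple of `pageSize` (hypothetical helper).
func roundUpToPage(_ length: Int, pageSize: Int) -> Int {
    precondition(pageSize > 0 && length >= 0)
    return (length + pageSize - 1) / pageSize * pageSize
}

// getpagesize() reports the VM page size of the running process.
let pageSize = Int(getpagesize())
print("page size: \(pageSize) bytes")

// With 16 KB pages, a 1-byte mapping still occupies one whole page once touched.
print(roundUpToPage(1, pageSize: 16_384))       // 16384
print(roundUpToPage(20_000, pageSize: 16_384))  // 32768
```

Passing the page size as a parameter keeps the arithmetic testable on any machine, whatever getpagesize() returns there.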
1.2 Unified Memory Architecture (UMA) on Apple Silicon
Apple Silicon Macs put DRAM dies on the SoC package and expose a single physical pool to the CPU complex (P+E cores), the integrated GPU, the Apple Neural Engine (ANE), the media engines, and ISP. There is no discrete VRAM and no PCIe transit between CPU and GPU.
Concrete consequences for local AI:
- No CPU→GPU copy with MTLResourceStorageMode.shared. The same physical page is mapped into both CPU and GPU address spaces. This is why llama.cpp’s Metal backend, MLX, and ggml-metal can run models the size of system RAM without a staging buffer.
- GPU-wired memory: pages used by the GPU are wired (not pageable) for the lifetime of an in-flight command buffer. On Apple Silicon they live in the same DRAM as everything else, just with a “do not page” flag.
- MTLDevice.recommendedMaxWorkingSetSize is Apple’s exposed ceiling: “an approximation of how much memory, in bytes, a GPU device can allocate without affecting its runtime performance” (https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize). In practice this is ~66–75% of physical RAM, depending on macOS version and total memory. Exceeding it doesn’t error; Metal will swap GPU resources, killing throughput.
- Why this differs from a PC: on a discrete-GPU PC, cudaMemcpy()-style staging buffers and PCIe transfers are unavoidable. On UMA Macs they are unnecessary and actively wasteful: copying a .shared buffer to a .private buffer is a pessimization on Apple Silicon, even though it’s a good pattern on Intel Macs with discrete AMD GPUs.
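As a rough illustration of the working-set ceiling described above, the sketch below checks a planned allocation against MTLDevice.recommendedMaxWorkingSetSize before committing to it. The function fitsWorkingSet and its 10% default headroom are assumptions for illustration; Metal itself does not enforce or expose such a check.

```swift
import Metal

/// Returns true if `bytes` of new GPU-visible allocations would stay under the
/// recommended working set while keeping `headroom` (0...1) in reserve.
/// Hypothetical helper; the headroom value is an assumption to tune.
func fitsWorkingSet(bytes: UInt64, allocated: UInt64, recommendedMax: UInt64,
                    headroom: Double = 0.10) -> Bool {
    let budget = UInt64(Double(recommendedMax) * (1.0 - headroom))
    return allocated + bytes <= budget
}

if let device = MTLCreateSystemDefaultDevice() {
    let ok = fitsWorkingSet(bytes: 4 << 30,  // a 4 GiB model
                            allocated: UInt64(device.currentAllocatedSize),
                            recommendedMax: device.recommendedMaxWorkingSetSize)
    print("unified memory:", device.hasUnifiedMemory, "4 GiB fits:", ok)
}
```

Because exceeding the ceiling degrades throughput rather than failing, a pre-flight check like this is the only point at which an app can decline to load a model gracefully.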
1.3 Memory pressure (the signal your app must respond to)
macOS exposes a tri-state pressure level — NORMAL / WARN / CRITICAL — via DISPATCH_SOURCE_TYPE_MEMORYPRESSURE (https://developer.apple.com/documentation/dispatch/dispatch_source_type_memorypressure).
- Sysctl: kern.memorystatus_vm_pressure_level (1=normal, 2=warn, 4=critical).
- The XNU memorystatus subsystem (apple-oss-distributions/xnu, doc/vm/memorystatus_notify.md) describes thresholds roughly: WARN at ~50% available RAM, CRITICAL at ~29–40%. These have shifted across releases, so treat them as ranges, not constants.
- The kernel uses a vm_pressure_thread and EVFILT_VM kevents under the hood (per Jonathan Levin’s newosxbook.com/articles/MemoryPressure.html).
- A subtle gotcha that Quinn “The Eskimo!” repeatedly notes on Apple Developer Forums (e.g., thread 85474): the dispatch source does not always report transitions back to NORMAL. If your app launches while pressure is already elevated, you won’t get a “current state” event. Architect on the assumption that you’ll be told “warn” or “critical” but may never be told you’re out of the woods.
- Apple’s standing recommendation (Quinn, thread 118867 “On Free Memory”): “Use standard techniques — NSCache, purgeable memory, DISPATCH_SOURCE_TYPE_MEMORYPRESSURE — rather than any free-memory value.” Don’t roll your own pressure detection by querying %-free; use the dispatch source.
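One way to cope with the missing "current state" event at launch is to read the sysctl named above once, purely to seed initial state, and then rely on the dispatch source for everything afterwards. This is a sketch under assumptions: the PressureLevel enum and currentPressureLevel() are hypothetical names, and whether the sysctl is readable from inside a given sandbox is not confirmed here.

```swift
import Darwin

/// Raw values follow the sysctl encoding listed above (1, 2, 4).
enum PressureLevel: Int32 { case normal = 1, warning = 2, critical = 4 }

/// Reads kern.memorystatus_vm_pressure_level once. Returns nil if the sysctl
/// is unavailable or reports a value outside the three listed above.
func currentPressureLevel() -> PressureLevel? {
    var raw: Int32 = 0
    var size = MemoryLayout<Int32>.size
    guard sysctlbyname("kern.memorystatus_vm_pressure_level",
                       &raw, &size, nil, 0) == 0 else { return nil }
    return PressureLevel(rawValue: raw)
}

// Seed startup state only; do not poll this in a loop.
let startupLevel = currentPressureLevel() ?? .normal
print("startup pressure level:", startupLevel)
```

This is not a substitute for the dispatch source, and polling it would run against the recommendation quoted above.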
1.4 Memory compressor (WKdm)
OS X 10.9 Mavericks (2013) replaced swap-first behavior with in-RAM compression of inactive pages, using the Wilson–Kaplan WKdm algorithm.
- What gets compressed: pages from the inactive list — anonymous dirty pages (heap, stack, zero-fill) belonging to processes that haven’t touched them recently.
- Compression ratio: typically ~2× (about 50%). Activity Monitor reports this as the “Compressed” line.
- Why WKdm: it compresses page-sized blocks fast — Apple’s marketing claim was that compress+decompress beats reading from disk. Modern SSDs narrow that gap, but the energy cost of memory I/O vs. CPU work still favors compression on battery.
- Compressor + swap are layered, not alternatives. When the compressor pool fills, pages can be paged out from the compressed pool to swap.
- iOS uses the same compressor but historically did not swap. iPadOS 16 added a limited virtual-memory swap on M-series iPads (introduced alongside Stage Manager), which is still narrower than macOS swap. When an iOS app exceeds its memory budget, jetsam kills processes rather than swapping.
1.5 Swap behavior
- Location: /private/var/vm/swapfile0, swapfile1, … created by dynamic_pager(8) at boot.
- Sizing: default swap files start around 64 MB and double up to a cap. dynamic_pager flags -S file_size, -H hi_water, -L low_water govern creation/retirement.
- Encryption: swap is encrypted on all modern Macs: unconditionally with FileVault, and even without FileVault on Apple Silicon (SEP-managed disk key, per the Apple Platform Security Guide).
- “Memory Pressure” vs “Memory Used”: Apple’s UI ranks pressure on a green/yellow/red gauge based on compressor and swap activity, not on bytes used. A Mac can show 95% memory used and zero pressure (the bulk is reclaimable cached files). Conversely a Mac can be in red with 60% used if there’s lots of active churn. Pressure is the metric that matters; Used is not.
1.6 Jetsam (the prioritized OOM killer)
Jetsam is the memorystatus subsystem in XNU — the kernel’s prioritized OOM-killer. Name from “jettison.”
- iOS: aggressive. The kernel kills user processes that exceed their per-app footprint cap or when pressure rises.
- macOS: jetsam exists but is opt-in at the per-process level and primarily targets backgrounded “managed” daemons via RunningBoard (Eclectic Light, “What does RunningBoard do?”, https://eclecticlight.co/2025/07/15/what-does-runningboard-do-2-managed-apps/). A foreground macOS app will not be killed for memory pressure. It will just see swap and beachballs.
- Priority bands (defined in kern_memorystatus.h; the values below follow recent open-source XNU, so check the header for your target release): JETSAM_PRIORITY_IDLE=0, _BACKGROUND_OPPORTUNISTIC=2, _BACKGROUND=3, _FOREGROUND_SUPPORT=9, _FOREGROUND=10 (default for active apps), _AUDIO_AND_ACCESSORY=12, _HOME=16, _CRITICAL=19. Jetsam kills from band 0 upward.
- What it tracks: not just RSS but physical footprint = anonymous RSS + compressed bytes + IOKit-mapped. Read via memorystatus_control() syscall #440 (MEMORYSTATUS_CMD_GET_*).
- Per-process limits: set via MEMORYSTATUS_CMD_SET_JETSAM_HIGH_WATER_MARK. Crossing it triggers an EXC_RESOURCE exception; the process can either crash (default) or get a soft warning.
- Jetsam event reports: macOS/iOS emit .ips jetsam reports when killing for memory; documented at https://developer.apple.com/documentation/xcode/identifying-high-memory-use-with-jetsam-event-reports.
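A process can read its own physical footprint without the private memorystatus_control() call by asking Mach for TASK_VM_INFO. The sketch below is one common way to do that; the function name physicalFootprint is hypothetical, and the result is assumed to correspond to the footprint figure discussed above.

```swift
import Darwin

/// Returns this process's physical footprint in bytes, or nil if the
/// task_info call fails. Hypothetical helper name.
func physicalFootprint() -> UInt64? {
    var info = task_vm_info_data_t()
    // task_info expects the buffer size in units of integer_t.
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<integer_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) { ptr in
        ptr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) { raw in
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), raw, &count)
        }
    }
    guard kr == KERN_SUCCESS else { return nil }
    return UInt64(clamping: info.phys_footprint)
}

if let bytes = physicalFootprint() {
    print("physical footprint: \(bytes / 1_048_576) MiB")
}
```

Logging this value next to RSS makes the difference between the two metrics visible during development.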
1.7 Wired memory
“Wired” pages (vm_page_wire_count in the kernel) cannot be paged or compressed. Categories:
- Kernel data structures, kernel extensions, file-system buffers.
- IOKit mappings and DMA-able buffers, including Metal command buffers and resources allocated via MTLDevice. These get wired for the duration of GPU access.
- Pages locked via mlock(2) (rarely allowed for unprivileged apps), vm_wire, or IOKit pinning.
- Page-table pages themselves at high memory usage.
Activity Monitor’s “Memory Wired” sums all of these and is generally not addressable by user apps directly — but it grows as Metal allocations grow. On Apple Silicon at high GPU residency, wired memory can dominate.
1.8 Activity Monitor categories — what they actually mean
Per https://support.apple.com/guide/activity-monitor/view-memory-usage-actmntr1004/mac:
- App Memory — anonymous dirty memory used by running apps.
- Wired Memory — kernel-pinned, unavailable to apps.
- Compressed — WKdm-compressed pool bytes.
- Cached Files — file-backed pages still in RAM but reclaimable. Not “used” in the traditional sense; the OS can reclaim them instantly. High Cached Files is healthy, not concerning.
- Swap Used — bytes paged to disk.
A Mac with 90% memory “used” but mostly Cached Files is in good shape. Pressure, not Used, drives apps to action.
1.9 iOS-style limits on Mac Catalyst / Designed for iPad
- Native macOS apps: no per-process hard memory cap by default. Can use all of physical RAM + swap, subject to compressor/swap dynamics.
- Mac Catalyst apps: behave like native macOS. os_proc_available_memory() returns 0, meaning “no limit” (Apple Developer Forums thread 724195).
- “Designed for iPad” apps on Apple Silicon Macs: jetsam-style 16 GB cap per process, replicating an M-series iPad limit. They receive memory warnings at 16 GB even on a 96 GB Studio.
- iOS / iPadOS native: per-app limits vary by device (about 5 GB on iPad Pro M1/M2 by default). iOS 15 introduced the com.apple.developer.kernel.increased-memory-limit entitlement to raise the cap closer to physical RAM for select apps.
1.10 What’s actually addressable at each RAM tier
A 16 GB MacBook Air has 16 GB physical, but the budget visible to your app is significantly less:
- ~0.5–1.5 GB kernel + system daemons at idle.
- GPU working-set cap (recommendedMaxWorkingSetSize) ≈ 66–75% of physical RAM. On a 16 GB Mac that’s ~10–12 GB of Metal-addressable memory.
- Override (unsupported): sudo sysctl iogpu.wired_limit_mb=N on Sonoma+ raises the cap. Pre-Sonoma: debug.iogpu.wired_limit in bytes. Community LLM users do this routinely on 64+ GB Macs; Apple doesn’t endorse it.
For a Locara app, the practical app-memory budget on a 16 GB Mac is ~9–11 GB after accounting for the OS, the foreground apps the user already has open, and the GPU working-set headroom that lets Metal stay performant.
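The tier arithmetic above can be written down so it is easy to rerun for other RAM sizes. This sketch uses only the ranges stated in this section (66–75% GPU fraction, 0.5–1.5 GB idle system overhead); the MemoryBudget type is hypothetical, and it deliberately leaves out the user's other foreground apps, which vary too much to model.

```swift
/// Hypothetical helper that turns the ranges quoted above into numbers.
struct MemoryBudget {
    let physicalGB: Double
    /// Fraction of RAM reported as the recommended GPU working set.
    let gpuFraction: ClosedRange<Double>
    /// Idle kernel + system daemon overhead, in GB.
    let systemGB: ClosedRange<Double>

    /// Range of Metal-addressable memory, in GB.
    var gpuCapGB: ClosedRange<Double> {
        (physicalGB * gpuFraction.lowerBound)...(physicalGB * gpuFraction.upperBound)
    }

    /// RAM left after idle system overhead, before other apps are counted.
    var afterSystemGB: ClosedRange<Double> {
        (physicalGB - systemGB.upperBound)...(physicalGB - systemGB.lowerBound)
    }
}

let air = MemoryBudget(physicalGB: 16, gpuFraction: 0.66...0.75, systemGB: 0.5...1.5)
print("GPU cap:", air.gpuCapGB.lowerBound, "to", air.gpuCapGB.upperBound, "GB")
print("After system:", air.afterSystemGB.lowerBound, "to", air.afterSystemGB.upperBound, "GB")
```

For 16 GB this gives a GPU cap of 10.56 to 12 GB and 14.5 to 15.5 GB after idle system overhead; the ~9–11 GB practical figure then follows once the user's other apps are subtracted.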
Part 2 — Best practices to optimize macOS app memory usage
2.1 Diagnostic tools
| Tool | What it shows |
|---|---|
| vm_stat [interval] | Mach VM counters: free, active, inactive, speculative, wired, throttled, compressed pages; pageins/pageouts; faults. Reports page size. |
| top -o mem | Per-process RSS, virtual size, compressed, dirty, purgeable, instructions retired. -stats to pick columns. |
| Activity Monitor | Pressure gauge + App/Wired/Compressed/Cached/Swap; per-process Real Memory, Memory Compressed, Real Private Memory. |
| vmmap <pid> | Per-region VM dump: address, size, prot, sharing, swapped/dirty/resident. vmmap -summary for totals. |
| footprint <pid> | High-level breakdown by category: “App Memory”, “Compressed”, “IOKit”, “Graphics”. Mirrors what jetsam/RunningBoard see. Introduced in WWDC21 #10180. |
| heap <pid> | Walks malloc zones; class histograms, allocation sizes. heap -diffFrom file.memgraph is the gold flow for growth diagnosis. WWDC24 #10173. |
| leaks <pid> | Cycle detection in malloc allocations. --outputGraph file.memgraph saves a memgraph. |
| malloc_history | Recorded allocation stacks. Set MallocStackLogging=1 env var. |
| memory_pressure | Simulator and observer. sudo memory_pressure -S -l warn -s 60 simulates warn for 60 s; -l critical simulates critical. Without sudo you only observe. |
| sysdiagnose | Full system snapshot: memgraph dumps of all processes, jetsam logs, spindumps. Invoked from Activity Monitor or via key combo. |
| Instruments Allocations | Live history of heap + VM allocations with call trees; generation marks for “what grew between A and B.” |
| Instruments VM Tracker | Periodic snapshots of all VM regions, separating dirty vs swapped/compressed. |
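| Instruments Leaks | Periodic root-tracing snapshot. Reports allocations unreachable from any root, including unreachable retain cycles. Misses abandoned memory that is still reachable (for example, a cycle kept alive by a timer, observer, or singleton); use generation marks or the Memory Graph Debugger for those. |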
| Xcode Memory Graph Debugger (Debug > Debug Memory Graph) | Live object graph for retain-cycle hunting. |
Canonical WWDC sessions:
- WWDC 2018 #416 “iOS Memory Deep Dive”: defines footprint = dirty + compressed.
- WWDC 2021 #10180 “Detect and diagnose memory issues”: introduces footprint CLI and the memgraph flow.
- WWDC 2022 #10106 “Profile and optimize your game’s memory”: Metal apps focus.
- WWDC 2024 #10173 “Analyze heap memory”: heap -diffFrom.
2.2 Lifecycle responses to memory pressure
macOS native (AppKit): there is no applicationDidReceiveMemoryWarning. Use the dispatch source:
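// Store src in a long-lived property so the source's lifetime is explicit.
let src = DispatchSource.makeMemoryPressureSource(
    eventMask: [.warning, .critical],
    queue: .global()
)
src.setEventHandler {
    // src.data is an option set, so test membership rather than equality.
    // shrinkCaches() and dropAll() are app-defined hooks.
    let event = src.data
    if event.contains(.critical) {
        shrinkCaches(); dropAll()
    } else if event.contains(.warning) {
        shrinkCaches()
    }
}
src.resume()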
Mac Catalyst / UIKit: UIApplicationDelegate.applicationDidReceiveMemoryWarning(_:) and UIViewController.didReceiveMemoryWarning() work as on iOS.
NSCache auto-evicts under pressure because it listens to the dispatch source itself (NSHipster, https://nshipster.com/nscache/). Use NSCache, not your own dictionary, for caches that should be discardable.
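A minimal sketch of the NSCache pattern for a regenerable value such as tokenizer output. TokenCache and its method names are hypothetical; the point is the cost limit and the fact that every read must tolerate a miss.

```swift
import Foundation

/// Hypothetical wrapper that caches token IDs keyed by prompt text.
final class TokenCache {
    private let cache = NSCache<NSString, NSArray>()

    init(maxBytes: Int) {
        // totalCostLimit is advisory: NSCache may evict earlier or later.
        cache.totalCostLimit = maxBytes
    }

    func store(_ tokens: [Int], for prompt: String) {
        // Cost is our own estimate of the entry's size in bytes.
        cache.setObject(tokens as NSArray, forKey: prompt as NSString,
                        cost: tokens.count * MemoryLayout<Int>.size)
    }

    /// Returns nil on a miss or after eviction; callers must re-tokenize.
    func tokens(for prompt: String) -> [Int]? {
        cache.object(forKey: prompt as NSString) as? [Int]
    }
}

let tokenCache = TokenCache(maxBytes: 64 << 20)
tokenCache.store([1, 15043, 3186], for: "Hello world")
let cached = tokenCache.tokens(for: "Hello world")
print(cached ?? "evicted, rebuild")
```

The token IDs here are placeholder values. The design choice worth noting is that the wrapper never promises a hit, which is what makes the cache safe to discard under pressure.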
NSProcessInfo activity assertions disable App Nap and reduce sudden-termination eligibility while inference runs:
let token = ProcessInfo.processInfo.beginActivity(
    options: [.userInitiated, .latencyCritical, .idleSystemSleepDisabled],
    reason: "Local LLM inference"
)
// ... run inference; hold a strong reference to token the whole time.
ProcessInfo.processInfo.endActivity(token)  // release when done
Reference: Jeff Johnson, “Prevent App Nap Programmatically” (https://lapcatsoftware.com/articles/prevent-app-nap.html).
2.3 Allocation patterns
- Avoid heap fragmentation. macOS’s default malloc has multiple zones (nano, tiny, small, large) per CPU; large allocations (above roughly 128 KB) bypass the small-block allocator and go to mmap of an anonymous region. Mixing many short-lived large objects with long-lived small ones creates fragmentation.
- Pool large transient buffers. Reuse a single NSMutableData or MTLBuffer rather than allocate-deallocate per frame.
- Tag VM regions with VM_MAKE_TAG. When calling mmap() directly for anonymous memory (MAP_ANON), encode a tag (1–255) into the fd argument: mmap(addr, size, prot, flags, VM_MAKE_TAG(MY_TAG), 0). Instruments and vmmap will display your tag name (after registration in vm_statistics.h; predefined tags exist for VM_MEMORY_MALLOC, VM_MEMORY_COREGRAPHICS, etc., and tags 240–255 are set aside as VM_MEMORY_APPLICATION_SPECIFIC_1 through _16). Makes large allocations attributable in profilers.
- Prefer mmap over read for files bigger than a few MB (see 2.6 for the LLM case).
- Wrap tight loops in @autoreleasepool to bound peak RSS. Without this, autoreleased allocations inside the loop accumulate until the next run-loop iteration.
2.4 Bundle vs heap memory (critical insight for LLM apps)
- Executable code (__TEXT segment) is clean, file-backed from the code-signed bundle. Under pressure, the kernel can evict it and re-fetch from the bundle. Counts as “Cached Files,” not “App Memory.”
- Embedded resources accessed via NSData(contentsOf:, options: .alwaysMapped) or mmap() directly are also file-backed and behave the same way.
- Critical for local AI: 4 GB of model weights mapped from a code-signed .app bundle or any file on disk counts as Cached Files, not App Memory. They are not “your dirty RAM” even though they are resident, which is exactly the property llama.cpp leverages (Justine Tunney, “Edge AI Just Got Faster”, https://justine.lol/mmap/).
- Loading the same data with Data(contentsOf:) (no .alwaysMapped) creates a dirty anonymous copy in App Memory. Always pass .mappedIfSafe or .alwaysMapped when reading large model files.
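To make the last point concrete, here is a small sketch that reads a file through Foundation with the mapped option. The temporary file stands in for a model file, and loadMapped is a hypothetical helper name.

```swift
import Foundation

/// Opens `url` as read-only, memory-mapped Data. `.alwaysMapped` asks
/// Foundation to map the file instead of copying it onto the heap.
func loadMapped(_ url: URL) throws -> Data {
    try Data(contentsOf: url, options: .alwaysMapped)
}

// Demo: a 1 MiB temporary file standing in for model weights.
let demoURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("weights-demo-\(UUID().uuidString).bin")
try Data(repeating: 0xAB, count: 1 << 20).write(to: demoURL)

let weights = try loadMapped(demoURL)
print(weights.count, weights[0])

// Clean up the demo file once it is no longer needed.
try? FileManager.default.removeItem(at: demoURL)
```

Comparing footprint <pid> output for this path against a plain Data(contentsOf:) read of the same file is a quick way to see the difference between file-backed and anonymous memory.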
2.5 Purgeable memory
Two layers:
- NSPurgeableData / NSDiscardableContent (Cocoa): refcounted access tokens:

      if ([purg beginContentAccess]) {
          /* safe to use bytes */
          [purg endContentAccess];
      } else {
          /* must regenerate */
      }
      [purg discardContentIfPossible]; // frees only if refcount == 0

  When refcount hits zero, the VM may discard the bytes under pressure. NSCache with evictsObjectsWithDiscardedContent = YES integrates with this protocol.
- Mach VM purgeable allocations: lower level. mach_vm_allocate with VM_FLAGS_PURGABLE, then mach_vm_purgable_control() toggles VM_PURGABLE_VOLATILE / VM_PURGABLE_NONVOLATILE. Volatile regions are first to be reclaimed; when reclaimed, they are discarded (zero-filled on next touch), not paged out.
Best for: image thumbnails, tokenizer caches, JIT-compiled kernels, anything regenerable from a deterministic source.
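The Cocoa layer described above can be wrapped so callers cannot forget the begin/end pairing. This is a sketch in Swift under assumptions: PurgeableBlob and withBytes are hypothetical names, and the example relies on NSPurgeableData starting out in the "accessed" state when created.

```swift
import Foundation

/// Hypothetical regenerable cache entry backed by purgeable memory.
final class PurgeableBlob {
    private let storage: NSPurgeableData

    init(bytes: Data) {
        storage = NSPurgeableData(data: bytes)  // starts with content access held
        storage.endContentAccess()              // now eligible for discard
    }

    /// Runs `body` on the bytes if they are still resident. Returns nil when
    /// the system has discarded them and the caller must regenerate.
    func withBytes<T>(_ body: (UnsafeRawBufferPointer) -> T) -> T? {
        guard storage.beginContentAccess() else { return nil }
        defer { storage.endContentAccess() }
        return body(UnsafeRawBufferPointer(start: storage.bytes,
                                           count: storage.length))
    }
}

let blob = PurgeableBlob(bytes: Data([1, 2, 3]))
if let sum = blob.withBytes({ $0.reduce(0) { $0 + Int($1) } }) {
    print("sum:", sum)
} else {
    print("discarded, regenerate")
}
```

The buffer pointer is only valid inside the closure, which mirrors the rule that bytes may be touched only between beginContentAccess and endContentAccess.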
2.6 mmap vs read for large files — THE critical pattern for LLM weights
The default llama.cpp / ggml / MLX approach:
- open() the model file (typically .gguf for llama.cpp, .safetensors for MLX).
- mmap(NULL, fileSize, PROT_READ, MAP_SHARED, fd, 0) establishes the virtual mapping. No file data is read yet, and page-table entries are filled in lazily as pages fault in. Fully populated, a 20 GB mapping needs about 10 MB of leaf page-table entries at 16 KB pages (8 bytes per entry), or about 40 MB at 4 KB pages.
- As inference touches tensors, the kernel page-faults them in. On Apple Silicon, the same pages are simultaneously visible to the Metal GPU via shared storage mode if the buffer was created with MTLDevice.makeBuffer(bytesNoCopy:length:options: .storageModeShared, deallocator:).
- Under pressure, the kernel can evict clean mmap’d pages without writing them anywhere; they’re re-fetched from disk on next touch.
References: Justine Tunney’s mmap writeup, plus llama.cpp discussions #638 and #9999 (https://github.com/ggml-org/llama.cpp/discussions/638, https://github.com/ggml-org/llama.cpp/discussions/9999).
Tuning hints:
- posix_madvise(ptr, len, POSIX_MADV_SEQUENTIAL) for the model-load pass, which triggers readahead.
- posix_madvise(ptr, len, POSIX_MADV_RANDOM) for the inference pass when only some experts are active (sparse MoE).
- The BSD-specific madvise() adds MADV_FREE_REUSE / MADV_FREE_REUSABLE (Apple-only), which let malloc cooperatively hand pages back to the kernel.
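The mapping steps and the two posix_madvise hints above can be sketched in a few lines of Swift calling the POSIX functions directly. The names mapModel and hintRandomAccess are hypothetical, and this is not how llama.cpp structures its loader; it only shows the sequence of calls.

```swift
import Foundation

/// Maps `path` read-only and shared. Returns nil on any failure.
/// The caller owns the mapping and should munmap() it when done.
func mapModel(path: String) -> UnsafeRawBufferPointer? {
    let fd = open(path, O_RDONLY)
    guard fd >= 0 else { return nil }
    defer { close(fd) }  // the mapping stays valid after the descriptor closes

    var st = stat()
    guard fstat(fd, &st) == 0, st.st_size > 0 else { return nil }
    let length = Int(st.st_size)

    guard let base = mmap(nil, length, PROT_READ, MAP_SHARED, fd, 0),
          base != MAP_FAILED else { return nil }

    // Load pass: the sequential hint encourages readahead. Advice is
    // best-effort, so the return value is deliberately ignored.
    _ = posix_madvise(base, length, POSIX_MADV_SEQUENTIAL)
    return UnsafeRawBufferPointer(start: base, count: length)
}

/// Switches the hint once inference starts touching tensors sparsely.
func hintRandomAccess(_ mapping: UnsafeRawBufferPointer) {
    guard let base = mapping.baseAddress else { return }
    _ = posix_madvise(UnsafeMutableRawPointer(mutating: base),
                      mapping.count, POSIX_MADV_RANDOM)
}
```

A caller would invoke mapModel once at load time, call hintRandomAccess when generation begins, and munmap the region when the model is unloaded.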
Beware RSS reporting: when llama.cpp says it’s using 5.8 GB to host a 30 B model, that’s because mmap’d pages don’t all materialize in RSS until touched. The true “is this fitting” metric is physical footprint (use footprint <pid>) + pressure, not top’s RSS.
2.7 Metal storage modes (Apple-Silicon-specific)
MTLResourceStorageMode values:
| Mode | macOS Intel (discrete GPU) | macOS Apple Silicon | Use case |
|---|---|---|---|
| .shared | CPU+GPU share system memory; GPU access crosses PCIe | CPU+GPU share the same physical page | Default on Apple Silicon for almost everything. |
| .managed | Two synchronized copies (VRAM + RAM); didModifyRange: / synchronize: required | Avoid; behaves like shared with overhead | Discrete-GPU only |
| .private | Pure VRAM | GPU-only pages in unified DRAM (no CPU mapping) | Render targets, intermediates, weights the CPU never touches |
| .memoryless | n/a | Apple GPUs only (macOS 11+ on Apple Silicon, plus iOS/iPadOS); textures only | Tile-local render attachments |
Apple-Silicon rules:
- Use .shared as the default for any buffer the CPU needs to populate or read. Zero overhead.
- Use .private only when the GPU is the sole consumer for the lifetime, e.g., intermediate activations during inference that you never copy back. Lets the GPU use optimal cache modes.
- MTLResourceHazardTrackingModeUntracked lets you skip the driver’s automatic dependency tracking when you do your own fence/event management, a real win for inference loops (WWDC 22 #10106).
- MTLHeap lets you reuse physical memory across MTLTexture/MTLBuffer resources whose lifetimes don’t overlap, dramatically cutting transient peak (Apple, “Reducing the memory footprint of Metal apps”).
- recommendedMaxWorkingSetSize is the effective ceiling. Above it, GPU resources may be swapped out, which is a performance cliff.
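For the .shared rule above, the zero-copy path for an existing mapping looks roughly like the sketch below. wrapMapping is a hypothetical helper; the alignment check reflects the page-alignment requirement of makeBuffer(bytesNoCopy:), and rounding the length up to whole pages assumes the region came from mmap, which always maps whole pages.

```swift
import Foundation
import Metal

/// Wraps an existing page-aligned mapping (for example, mmap'd weights) in a
/// shared-storage MTLBuffer without copying. Returns nil if the pointer is
/// misaligned, the region is too large, or Metal rejects the request.
func wrapMapping(device: MTLDevice,
                 base: UnsafeMutableRawPointer,
                 length: Int) -> MTLBuffer? {
    let page = Int(getpagesize())
    // bytesNoCopy requires a page-aligned pointer.
    guard Int(bitPattern: base) % page == 0 else { return nil }
    // Round up to whole pages (assumption: the region was created by mmap).
    let rounded = (length + page - 1) / page * page
    guard rounded <= device.maxBufferLength else { return nil }
    // deallocator nil: the caller keeps ownership of the mapping and must
    // unmap it only after the buffer has been released.
    return device.makeBuffer(bytesNoCopy: base, length: rounded,
                             options: .storageModeShared, deallocator: nil)
}
```

Models larger than maxBufferLength would need to be split across several buffers, which this sketch does not attempt.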
2.8 Background app suspension
- App Nap (10.9+): an app gets napped when (a) none of its visible windows is frontmost, (b) it hasn’t drawn for some seconds, (c) it isn’t playing audio, and (d) it hasn’t taken an IOKit power assertion or NSProcessInfo activity assertion (Eclectic Light, “Did that app quit, or is it just napping?”). Napped apps get more aggressive timer coalescing and may have their CPU throttled.
- Sudden Termination: controlled via NSProcessInfo.enableSuddenTermination() / .disableSuddenTermination(). Apps that opt in tell the OS “you can SIGKILL me without graceful shutdown if you need memory.” Useful for stateless background services; dangerous for editors.
- Automatic Termination (NSProcessInfo.disableAutomaticTermination(_:)): the OS may quit an idle app with no unsaved state. Requires NSSupportsAutomaticTermination in Info.plist.
- Under memorystatus jetsam: opted-in processes (RunningBoard-managed) can be killed at low priority bands.
For a local-LLM app: hold an NSProcessInfo.beginActivity assertion while inference runs; release it when idle. Hold a strong reference to the returned token.
2.9 Common anti-patterns
- Strong self capture in closures. Use [weak self] or [unowned self]. Mike Ash, “Dealing With Retain Cycles” (mikeash.com, Friday Q&A 2010-04-30).
- Unbounded autorelease in tight loops. Wrap with autoreleasepool { … }.
- Observers that outlive their targets. NotificationCenter blocks retain captured state until explicitly removed; same for Combine.AnyCancellable.
- Holding Data(contentsOf:) for huge files when .mappedIfSafe would suffice.
- Using Dictionary/Array as a cache instead of NSCache; they don’t respond to pressure.
- Allocating MTLBuffers per draw call. Pool them.
- Loading PNG/HEIC as NSImage repeatedly instead of using ImageIO with kCGImageSourceShouldCache = false.
- Background threads creating but never draining their own autorelease pool: Cocoa allocations made on non-main threads leak unless wrapped.
- Misreading top RSS for mmap’d files (see 2.6).
- Using .managed storage on Apple Silicon when .shared is what you want.
2.10 Profiling memory leaks
Instruments Allocations + Leaks workflow:
- Allocations records every malloc/free with stack. Set MallocStackLogging=1 in scheme env vars for full backtraces.
- Generation marks: click “Mark Generation” between two app states; the next pane shows only objects allocated between the marks. Catches “this view controller never released” bugs.
- Leaks is a periodic root-trace; any allocation unreachable from roots is a leak, including unreachable retain cycles. It does not catch memory that is still reachable but will never be used again (for example, a cycle anchored by a timer or observer); that’s the job of generation marks and the Memory Graph Debugger.
- leaks --outputGraph file.memgraph + heap / vmmap / malloc_history on the file is the recommended offline flow for production samples (WWDC 21 #10180).
Retain cycles: Swift/ObjC ARC means most cycles arise from closures or delegates. Use [weak self] in closures and weak var delegate: for delegate properties. The Memory Graph Debugger surfaces these visually.
Practical playbook for a local-AI Mac app
- mmap your weights with MAP_SHARED (POSIX) and hand the mapping to Metal via MTLDevice.makeBuffer(bytesNoCopy:length:options: .storageModeShared, deallocator:). Pass .alwaysMapped to NSData/Data reads.
- Set posix_madvise hints: _SEQUENTIAL during load, _RANDOM during inference if access is sparse.
- Allocate GPU resources with .shared storage mode. Reserve .private for things the CPU never touches.
- Subscribe to DISPATCH_SOURCE_TYPE_MEMORYPRESSURE at startup. On WARN, drop non-essential caches or LRU-evict KV history. On CRITICAL, release inference state and notify the UI.
- Use NSCache for any data you can rebuild (tokenizer outputs, embeddings), not custom dictionaries.
- Hold an NSProcessInfo.beginActivity token while inference runs.
- Test under simulated pressure: sudo memory_pressure -S -l warn -s 60, then -l critical. Confirm your app survives without crashing.
- Profile with the memgraph flow (leaks --outputGraph on a healthy run; then again on a suspected leak; heap -diffFrom).
- Don’t trust RSS for mmap’d models. Use footprint <pid> for the metric jetsam actually uses.
- Document recommendedMaxWorkingSetSize as your effective ceiling, and surface to power users that bumping iogpu.wired_limit_mb is an unsupported workaround for very large models.
Specific learnings for Locara
- The Locara runtime should subscribe to memory pressure on behalf of every app. Apps run inside the runtime; if pressure rises, the runtime can evict KV cache, drop secondary models, and notify the app via a capability, rather than every app reinventing this. Per the entitlement model in mac-app-store-sandbox.md, the runtime owns system signals, the app receives normalized events.
- The model loader must use mmap and .shared storage by default. Anything else is an immediate footgun on 16 GB Macs. Locara’s model-loading primitive should expose only the mmap path; an --no-mmap escape is a foot-shotgun.
- Quantize KV cache by default for any context past 8K. The math from llm-memory-math.md shows KV cache at 128K context exceeds the weights for most models. The runtime should default to q8_0 K/V quantization with an opt-out for quality-sensitive apps.
- Refuse to install on Intel Macs. No UMA, no MLX, no Metal-shared-storage benefit. Locara v1 should detect (sysctl machdep.cpu.brand_string contains “Apple”) and refuse with a clear message.
- Don’t pretend “Designed for iPad” on Mac is a fallback. It’s capped at 16 GB per process; Locara’s larger-model apps would hit the wall. Native macOS app target only.
- Surface footprint numbers in app dev tools, not RSS. Reviewers and app authors will look at “memory used”. If Locara’s dev tooling shows RSS, they’ll panic about mmap’d model files that aren’t actually dirty. Show footprint and “compressed” separately, and explain Cached Files.
- Memory tier is a permanent property of the user. Apple Silicon RAM is soldered. Locara’s onboarding should profile the user’s machine once, persist the result, and gate app install on it. Don’t surprise the user at runtime that “your Mac can’t run this app” after they’ve downloaded.
- The runtime should call NSProcessInfo.beginActivity during model load and inference, with .userInitiated + .latencyCritical flags. Without this, App Nap can kick in mid-inference if the user backgrounds the window.
- Document the iogpu.wired_limit_mb workaround for power users, but never apply it programmatically. Apple does not support raising the limit and the consequences (system instability, kernel panic in extreme cases) are real. Provide it as a documented manual step for the user who wants to push their 192 GB Mac Studio to run a 175B model.
- Mac Catalyst is not the path. Locara’s apps need full macOS memory semantics: the full NSProcessInfo API and full dispatch-source pressure handling, with no iPad-style per-process cap (per 1.9, the 16 GB cap applies to “Designed for iPad” apps rather than Catalyst). Native AppKit (or Tauri with native AppKit shell) is the only viable path.
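The KV-cache learning above can be checked with a short calculation. The formula is the standard one for a decoder-only transformer (one K and one V tensor per layer); the model shape below is a hypothetical 8B-class configuration chosen for illustration, not a figure taken from llm-memory-math.md. The q8_0 cost assumes the ggml block layout of 34 bytes per 32 values.

```swift
/// Bytes needed for the K and V caches of a decoder-only transformer.
/// bytesPerElement: 2.0 for f16; 34.0 / 32.0 for q8_0 (assumed block layout).
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  contextTokens: Int, bytesPerElement: Double) -> Double {
    // Leading 2 = one K tensor plus one V tensor per layer.
    2.0 * Double(layers * kvHeads * headDim * contextTokens) * bytesPerElement
}

let gib = 1_073_741_824.0

// Hypothetical shape: 32 layers, 8 KV heads, head dimension 128, 128K context.
let kvF16 = kvCacheBytes(layers: 32, kvHeads: 8, headDim: 128,
                         contextTokens: 131_072, bytesPerElement: 2.0)
let kvQ8 = kvCacheBytes(layers: 32, kvHeads: 8, headDim: 128,
                        contextTokens: 131_072, bytesPerElement: 34.0 / 32.0)

print(kvF16 / gib, kvQ8 / gib)  // 16.0 8.5
```

Under these assumptions the f16 cache alone is 16 GiB at 128K context, more than a 4-bit quantization of the same model's weights would occupy, and q8_0 roughly halves it.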
References
Apple primary documentation:
- DISPATCH_SOURCE_TYPE_MEMORYPRESSURE: https://developer.apple.com/documentation/dispatch/dispatch_source_type_memorypressure
- DISPATCH_MEMORYPRESSURE_CRITICAL: https://developer.apple.com/documentation/dispatch/dispatch_memorypressure_critical
- Memory and Virtual Memory (Kernel Programming Guide): https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/vm/vm.html
- Viewing Virtual Memory Usage: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/VMPages.html
- Caching and Purgeable Memory: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/CachingandPurgeableMemory.html
- TN2434 Minimizing your app’s Memory Footprint: https://developer.apple.com/library/archive/technotes/tn2434/_index.html
- Choosing a resource storage mode for Apple GPUs: https://developer.apple.com/documentation/metal/choosing-a-resource-storage-mode-for-apple-gpus
- Reducing the memory footprint of Metal apps: https://developer.apple.com/documentation/metal/reducing-the-memory-footprint-of-metal-apps
- recommendedMaxWorkingSetSize: https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize
- os_proc_available_memory: https://developer.apple.com/documentation/os/3191911-os_proc_available_memory
- Identifying high-memory use with jetsam event reports: https://developer.apple.com/documentation/xcode/identifying-high-memory-use-with-jetsam-event-reports
- View memory usage in Activity Monitor: https://support.apple.com/guide/activity-monitor/view-memory-usage-actmntr1004/mac
WWDC sessions (most useful for this domain):
- WWDC 2018 #416 “iOS Memory Deep Dive”: https://developer.apple.com/videos/play/wwdc2018/416/
- WWDC 2020 “Explore the new system architecture of Apple silicon Macs”: UMA introduction
- WWDC 2021 #10180 “Detect and diagnose memory issues”: footprint, memgraph CLI flow
- WWDC 2021 #10254 “Tune CPU job scheduling for Apple silicon Macs”: P/E core scheduling
- WWDC 2022 #10106 “Profile and optimize your game’s memory”: Metal apps
- WWDC 2024 #10173 “Analyze heap memory”: heap -diffFrom
XNU / open-source kernel:
- xnu source: https://github.com/apple-oss-distributions/xnu
- memorystatus_notify docs: https://github.com/apple-oss-distributions/xnu/blob/main/doc/vm/memorystatus_notify.md
- kern_memorystatus.h (priority bands)
Authors and resources worth citing directly:
- Quinn “The Eskimo!”: Apple Developer Forums. Threads 85474 (pressure level reporting quirks), 118867 (“On Free Memory”), 805580 (avoid querying % memory), 724195 (iPad-on-Mac 16 GB limit).
- Mike Ash: mikeash.com/pyblog/ Friday Q&A series. Especially 2011-09-30 ARC, 2010-04-30 Retain Cycles, 2012-02-17 weak references.
- Saagar Jha: saagarjha.com and github.com/saagarjha (VirtualApple; HN comments on macOS swap and Apple Silicon internals).
- Jeff Johnson: lapcatsoftware.com/articles/prevent-app-nap.html and many other critical macOS resource-management writings.
- Marcin Krzyżanowski: “Swift Runtime Performance” (blog.krzyzanowskim.com); Swift Forums on memory pools.
- Jonathan Levin: MacOS and iOS Internals (the *OS Internals series), Vol. I (User Mode) and Vol. II (Kernel Mode). Canonical references for memorystatus, jetsam, sysctls, and the memorystatus_control() syscall. Also newosxbook.com (especially articles/MemoryPressure.html).
- Amit Singh: Mac OS X Internals: A Systems Approach. Older but still the best dead-tree XNU VM overview.
- Hillegass & Preble: Cocoa Programming for Mac OS X 4e, memory management chapter.
- Dalrymple & Hillegass: Advanced Mac OS X Programming, Mach VM chapter.
Community / industry blogs:
- AppleInsider, “Compressed Memory in OS X 10.9 Mavericks”: https://appleinsider.com/articles/13/06/13/compressed-memory-in-os-x-109-mavericks-aims-to-free-ram-extend-battery-life
- Justine Tunney, “Edge AI Just Got Faster”: https://justine.lol/mmap/
- llama.cpp Discussion #638 (mmap design): https://github.com/ggml-org/llama.cpp/discussions/638
- llama.cpp Discussion #9999 (mmap RSS reporting): https://github.com/ggml-org/llama.cpp/discussions/9999
- Greggant, “How Memory Works in macOS”: https://blog.greggant.com/posts/2024/07/03/macos-memory-management.html
- Eclectic Light Co.: “What does RunningBoard do?”, “Did that app quit, or is it just napping?”, “Why macOS has to change…”
- NSHipster, “NSCache”: https://nshipster.com/nscache/
- Greg Stencel, “Apple silicon limitations with usage on local LLM”: https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm.html
- ivanopcode/devnote-override-macos-metal-vram-cap on GitHub: the iogpu.wired_limit_mb workaround
- Tristan Ross, “Why 16k page size matters”: https://tristanxr.com/post/why-16k-page-size/
- Eternal Storms, “Mac Developer Tip: How to Simulate Memory Pressure”: https://eternalstorms.wordpress.com
Contested / version-dependent:
- Exact pressure WARN/CRITICAL thresholds (have changed across releases; subscribe, don’t hard-code).
- GPU working-set fraction (~66% early Apple Silicon, ~75–80% on Sonoma+).
- iogpu.wired_limit_mb (Sonoma+) vs debug.iogpu.wired_limit (pre-Sonoma).
- iOS per-app memory limits (device- and OS-version-specific; not fully published).
- Compressor algorithm details (confirmed WKdm in 2013; Apple has not publicly disclosed changes since).