macOS Performance Profiling — Tools, Methodology, Recipes

What this is: A reference for the engineer who has just been handed a complaint of the form “your app makes my Mac slow.” It maps symptoms → tools → philosophy, and covers Instruments, the CLI toolbox, signposts, DTrace, off-CPU analysis, and the principled-measurement traditions of Gregg, Cantrill, Muratori, Lemire, and Fog.

Why it matters: Locara apps run alongside the user’s browser, Slack, Xcode, and possibly another Locara app. Bad citizenship doesn’t just slow your app; it beachballs theirs. The hardest performance bugs are the ones the user reports vaguely (“it feels slow”) and that don’t reproduce in your test harness. The diagnostic loop is what bridges that gap. Without a profiling vocabulary you’re guessing.

Most relevant to Locara: Pairs with macos-memory-management.md (what to do once you find a memory pathology; this note covers how to find pathologies in the first place), mac-power-and-thermals.md (energy-specific cases), and mac-llm-optimization.md (LLM-specific instrumentation).

Part 1 — The mental model

1.1 The USE method (Brendan Gregg)

Gregg’s USE method (Utilization, Saturation, Errors) is the canonical first-pass triage framework, introduced in 2012 and codified in Systems Performance (2nd ed., 2020), Chapter 2. Thesis in one sentence: “For every resource, check utilization, saturation, and errors.” [Gregg, USE Method, brendangregg.com/usemethod.html]

The discipline: enumerate resources (CPU, memory, disk, network, GPU) and ask the same three questions of each, before reaching for a profiler. This catches the saturated-but-not-utilized case (something queued behind a lock, the CPU looks fine) — the most common reason a naïve top reading lies to you.

On macOS the U/S/E mapping:

Resource	Utilization	Saturation	Errors
CPU	`top` `%CPU` per core; `powermetrics --samplers cpu_power` per-core duty cycle; Activity Monitor “CPU Usage”	`uptime` load average vs core count; Thread State Trace’s “Runnable” bars in Instruments	`top -o cpu` zombies; thermal throttling lines from `powermetrics --samplers thermal`
Memory	Activity Monitor “Memory Used” minus Cached Files; `vm_stat` active+wired+compressed	Memory pressure gauge (yellow/red); `sysctl kern.memorystatus_vm_pressure_level`; swap I/O in `vm_stat` pageins/pageouts; compressed pool size	jetsam `.ips` reports under `~/Library/Logs/DiagnosticReports/`; `EXC_RESOURCE` exceptions
Disk	`iostat -d -w 1` per-disk `KB/t`, `tps`, `MB/s`; Activity Monitor “Disk” tab	macOS `iostat` reports throughput only, not queue depth, so infer queuing from `fs_usage` showing serialized waits and from spindump output	I/O errors in `sudo dmesg`; SMART status via `smartctl` (third-party smartmontools); APFS errors in `log show --predicate 'subsystem == "com.apple.filesystem.apfs"'`
Network	`nettop -P` per-process bytes/s; Activity Monitor “Network” tab	Drops/retransmits in `netstat -s`; `tcpdump` showing TCP zero-window or DUP-ACKs	`ifconfig` error counters; `netstat -i` errors column
GPU	`powermetrics --samplers gpu_power` GPU % busy; Metal System Trace GPU Timeline	Command-buffer queue depth in Metal System Trace; `recommendedMaxWorkingSetSize` exceeded	Metal API validation errors; `MTLCommandBufferStatusError`; Xcode GPU Frame Capture’s “Issues” pane

The USE method’s value: it turns a vague complaint into a structured search. When you find one resource saturated, you have a place to point the deeper tools.

Subtle Gregg point: 100% utilization is not by itself a problem. A CPU at 100% on the right workload is good — you’re getting the chip you paid for. A CPU at 100% with non-zero saturation means there is queued work that is not getting through, which is when you start losing latency. The cleanest visualization is the run-queue length, not the % busy.
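As a rough illustration of the saturation check, the sketch below divides the 1-minute load average by the core count. The function name `cpu_saturation` and the 1.0 threshold are assumptions made for this example, not an Apple-defined limit; load average is only a coarse proxy for run-queue length.

```shell
# cpu_saturation LOAD1 NCPU
# Prints the 1-minute load average per core and a rough verdict.
# Treating a ratio above 1.0 as "saturated" is an illustrative assumption.
cpu_saturation() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        ratio = load / ncpu
        verdict = (ratio > 1.0) ? "saturated" : "ok"
        printf "load/core=%.2f %s\n", ratio, verdict
    }'
}

# Example with literal numbers: load average 12.40 on a 10-core machine.
cpu_saturation 12.40 10

# On macOS the inputs could come from sysctl, for example:
#   cpu_saturation "$(sysctl -n vm.loadavg | awk '{print $2}')" "$(sysctl -n hw.ncpu)"
```

A ratio well above 1.0 alongside a modest `%CPU` reading is the cue to look at Thread State Trace rather than Time Profiler.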

1.2 On-CPU vs off-CPU analysis (Gregg)

The most important conceptual distinction in the entire field. From Gregg’s Off-CPU Analysis (brendangregg.com/offcpuanalysis.html):

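On-CPU profiling samples the program counter while the thread is executing on a core. Time Profiler (Instruments), Linux perf record, and DTrace’s profile provider (for example the profile-997 probe) all do this. Answers: “what code is burning CPU?” Note that sample <pid> records the stack of every thread whether or not it is running, so its output behaves more like a wall-clock profile.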
Off-CPU profiling captures stacks at the moment a thread leaves the CPU — when it blocks on a syscall, a lock, a condition variable, I/O, or sleep — and the time spent off-CPU until it returns. Answers: “what code is waiting?”

Gregg’s classic example: tar reports 50.8 s wallclock but only 12.6 s of CPU. The remaining 38.2 s of “where did the time go” is invisible to on-CPU profiling. A pure CPU profile of a beachballing app shows nothing wrong precisely because the app is not on the CPU — it’s waiting on a lock, a read(), a network round-trip. Reaching for Time Profiler when the symptom is a hang is a category error.
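The arithmetic behind that example can be sketched as a small filter over `time -p` style output. The function name `off_cpu_report` is hypothetical, and the user/sys split in the sample input is made up; only the 50.8 s and 12.6 s totals come from the text above.

```shell
# off_cpu_report: read `time -p` style lines (real/user/sys, in seconds) on stdin
# and print how much of the wall-clock time was spent off-CPU.
off_cpu_report() {
    awk '
        $1 == "real" { real = $2 }
        $1 == "user" { user = $2 }
        $1 == "sys"  { sys  = $2 }
        END {
            cpu = user + sys
            off = real - cpu
            printf "wall=%.1fs cpu=%.1fs off-cpu=%.1fs (%.0f%%)\n", real, cpu, off, 100 * off / real
        }'
}

# Illustrative split of the 12.6 s of CPU time into user and sys.
printf 'real 50.8\nuser 1.0\nsys 11.6\n' | off_cpu_report
```

This only makes sense for a mostly single-threaded command: with several busy threads, user plus sys can exceed the wall-clock time and the subtraction goes negative.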

On macOS the off-CPU analogue is the Thread State Trace instrument (in System Trace and in the Hangs template) and the older System Call Trace. Both annotate threads with wait reason — “Blocked on kernel call”, “Preempted”, “Suspended”, “Mutex wait” — and let you correlate the blocking instant with the stack at that instant. [Apple WWDC23 #10248 “Analyze hangs with Instruments”]

1.3 The hierarchy: visible UI → main thread → background threads → kernel

When the user says “the app is slow,” they’re reporting an artifact of the UI thread. Work down this hierarchy:

Layer	First-line tool	What it tells you
Visible UI / rendering	Core Animation FPS instrument; Animation Hitches template; Quartz Debug	Frame rate, missed VSyncs, hitch count and duration
Main thread state	Time Profiler with main-thread filter; Hangs instrument; Thread State Trace	Whether main thread is busy (high %CPU) or blocked (low %CPU, waiting) — the diagnostic divide for hangs
Background threads / GCD queues	Time Profiler full-process; Swift Concurrency instrument (WWDC22 #110350); Thread Performance Checker	Priority inversions, queue saturation, non-UI work blocking the UI thread
Kernel / syscalls / I/O	System Trace; `fs_usage`; `latency` CLI; spindump; DTrace where SIP permits	What `read`/`write`/`mach_msg` calls take, page faults, scheduler decisions
Hardware (cache, branch, GPU)	Counters instrument; Metal System Trace + GPU counters; `powermetrics`; Xcode GPU Frame Capture	µarch behavior — IPC, cache misses, branch mispredicts, GPU stalls, thermal pressure

The discipline: do not skip levels. Reaching for dtrace when Activity Monitor would have told you “this app is using 800% CPU and the GPU is idle” wastes time. The diagnostic loop: form a hypothesis at the highest layer matching the symptom; confirm or refute with the cheapest tool at that layer; only then descend.

1.4 The Cantrill philosophy: observability vs debugging

Bryan Cantrill’s foundational distinction — articulated through DTrace and his ACM Queue writing — is that observability is not the same as logging. Logging records what the developer anticipated might go wrong. Observability “answers questions the developer never thought to ask.” [Cantrill, Hidden in Plain Sight, ACM Queue 2006]

Corollaries for desktop apps:

Instrument for production, not just dev. The performance bug that happens once in three weeks on a customer’s M1 Air will never reproduce under your test harness. Instrumentation must be on by default with negligible overhead when not actively collecting (DTrace’s design constraint), or it isn’t there when you need it.
Safety is non-negotiable. Cantrill’s rule: “If you run a tool and the system dies as a result, you will never be allowed to run that tool again.” On macOS this principle survives in os_signpost, Instruments’ system-trace sampling, the read-only nature of xctrace record. You can leave them on.
Don’t predict your questions in advance. DTrace’s tagline was “concise answers to arbitrary questions.” On macOS: prefer the System Trace template (captures everything — scheduler, syscalls, signposts, VM events, GCD queue activity) over a single-purpose template, when you don’t yet know what’s wrong. Filter on import, not on capture.

1.5 Muratori’s counterweight: reason about what the CPU is trying to do

Casey Muratori’s “Clean” Code, Horrible Performance (computerenhance.com, 2023) and his Performance-Aware Programming course supply the corrective: profiling tells you what is slow, but it does not tell you what the CPU is capable of. A code path profiled as “the slowest function in the trace” might already be running at 80% of the µarch’s theoretical throughput — in which case there is no win to be had there, and the profile is misleading you.

Muratori’s framing: most performance problems are not 2× problems, they are 100× problems. Differences of 100× from code organization are routine, and 10,000× is not atypical once you account for cache, prefetcher, branch predictor, and SIMD effects. [Muratori, “Clean” Code, Horrible Performance, 2023]

For the macOS workflow: when you find a hot function, before optimizing, ask:

What is the theoretical peak for this CPU (cycles/sec × IPC × SIMD width)?
Where is the data coming from — L1, L2, SLC, DRAM? powermetrics and the Counters instrument can hint; Agner Fog’s microarchitecture.pdf gives latencies.
Is the loop body branch-predictable? Lemire: modern CPUs learn branch patterns within ~10 trials on small inputs, so a microbenchmark that loops 1,000 times on the same data wildly overestimates branch-prediction quality. [Lemire 2019]
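A back-of-envelope version of the first question, as a sketch. The clock speed, issue width, and lane count below are hypothetical placeholders, not published figures for any Apple core; `peak_gops` and `percent_of_peak` are names invented for this example.

```shell
# peak_gops GHZ IPC LANES
# Theoretical peak in billions of element-operations per second:
# clock rate x SIMD instructions per cycle x lanes per instruction.
peak_gops() {
    awk -v ghz="$1" -v ipc="$2" -v lanes="$3" 'BEGIN { printf "%.1f\n", ghz * ipc * lanes }'
}

# percent_of_peak MEASURED_GOPS PEAK_GOPS
percent_of_peak() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.0f%%\n", 100 * m / p }'
}

# Hypothetical core: 3.2 GHz, 4 SIMD instructions per cycle, 4 float32 lanes per vector.
peak=$(peak_gops 3.2 4 4)
echo "peak: $peak Gops/s"

# A loop measured at 41 Gops/s is already near that ceiling.
percent_of_peak 41.0 "$peak"
```

If the measured figure lands near the ceiling, the hot function is not where the win is; if it lands at a few percent, the profile is pointing at a real opportunity.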

The Muratori/Lemire/Fog tradition is the reasoning layer above the profiler. You still need the profiler.

Part 2 — The Instruments cookbook (per symptom)

Instruments is a wrapper around the same underlying capture machinery (ktrace, os_signpost, kdebug, Core Animation kernel hooks, Metal’s GPU counters, Mach scheduler events). Each template is a curated bundle of “instruments” (lanes) for a workload. xctrace list templates enumerates them.

2.1 Beachballs and hangs

Template: Hangs (since Xcode 14) or Time Profiler with Thread State Trace added.

The diagnostic divide, drawn directly from WWDC23 #10248:

Busy main-thread hang → CPU usage on the main thread is high during the hang. Cause: synchronous heavy work on the UI thread.
Blocked main-thread hang → CPU on the main thread is low during the hang. Cause: a lock, an I/O, an XPC round-trip, a priority inversion.

The Hangs lane labels each hang interval; click it, drill into the Heaviest Stack Trace view filtered to main thread, then add the Thread State Trace lane to see why the thread left the CPU. For blocked hangs the leaf frame (the innermost call, shown at the bottom of the heaviest-stack view) will be __psynch_*, __semwait_*, mach_msg2_trap, or read/write. The off-CPU stack (the stack at the moment of blocking) is what you want, not the on-CPU samples.

CLI alternatives without Instruments:

sudo spindump <pid> 5 -file /tmp/spin.txt — captures 5 seconds of user+kernel stacks for the process. Output shows N samples per stack with frequency counts. Quinn at Apple DTS: “There are N samples and you get a tree showing the backtraces of those samples with frequency counts.” If 727 of 929 samples showed the main thread idle on mach_msg_trap, the bulk of the time was spent waiting.
sample <pid> 5 -file /tmp/sample.txt — userland-only sampling, ~1 ms default. Less invasive than spindump; suitable for production. Same tree-of-stacks output.
xctrace record --template Hangs --attach <pid> --time-limit 30s --output /tmp/hangs.trace then open in Instruments.

2.2 High CPU at idle

Template: Time Profiler, weight tree view, group by thread.

Look for:

Threads that should not exist when idle (timers firing every 100 ms; Combine subscriptions that haven’t been cancelled; runaway dispatch sources).
Methods named *Timer*Callback* or __select in tight loops.
Hot CFRunLoop modes other than kCFRunLoopDefaultMode — indicates a modal loop.

CLI workflow:

sample <pid> 10 -mayDie -file /tmp/idle.txt

Then grep the output for the per-thread headers (lines containing Thread_) and read the stacks beneath each one. A common idle-CPU bug is a Timer.scheduledTimer with a short interval whose target retain-cycles its owner, so it never invalidates.

2.3 Memory growth

Covered in depth in macos-memory-management.md §2.10. Summary:

leaks --outputGraph /tmp/before.memgraph <pid> at a quiescent baseline.
Drive the suspected leak path (open/close a document N times).
leaks --outputGraph /tmp/after.memgraph <pid>.
heap -diffFrom /tmp/before.memgraph /tmp/after.memgraph to see which object types grew.
With MallocStackLogging=1 in scheme env: malloc_history <pid> --callTree -invert <addr> gives allocation stacks.
For retain cycles that are still reachable from a live root (which leaks does not report, because it only flags memory that nothing references any more), use Xcode’s Debug Memory Graph, the memory-graph button in the debug bar. [Apple WWDC21 #10180, WWDC24 #10173]

In Instruments: Allocations for live history + Mark Generation buttons; VM Tracker for periodic VM-region snapshots; Leaks for periodic root-trace.

2.4 Slow startup

Template: App Launch (introduced in the WWDC19 Instruments refresh).

The instrument breaks launch into named phases — pre-main (dyld linking, ObjC class loading, static initializers in purple), then post-main (UIKit/AppKit init, first frame render in green). Apple’s published target: the first frame should appear within 400 ms of user tap. [WWDC19 #423]

To attribute time to your code, wrap startup phases with os_signpost:

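import os
// pointsOfInterest signposts appear in Instruments' Points of Interest lane.
let log = OSLog(subsystem: "com.locara.app", category: .pointsOfInterest)
func loadModelWithSignpost() {
    let id = OSSignpostID(log: log)
    os_signpost(.begin, log: log, name: "ModelLoad", signpostID: id)
    // defer ends the interval on every exit path of the function.
    defer { os_signpost(.end, log: log, name: "ModelLoad", signpostID: id) }
    loadModel()  // the startup phase being measured
}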

For deterministic clean-state startup: defaults write <bundle-id> ApplePersistenceIgnoreState YES and defaults write -g NSWindowRestoresWorkspaceAtLaunch -bool false.

2.5 Sluggish scrolling / animation hitches

Template: Animation Hitches (Instruments 12+) or the older Core Animation instrument.

A hitch is Apple’s defined term: any frame that arrives later than expected. Metric is hitch time ratio = hitch milliseconds per second of wall time. The Animation Hitches template records the render loop in three phases (commit, render, display) and flags which phase caused each missed frame. [Apple Tech Talks 10855–10857]
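The metric itself is a simple division, sketched here. The function name `hitch_time_ratio` and the sample hitch durations are hypothetical; in practice the per-hitch durations would come from the Animation Hitches recording.

```shell
# hitch_time_ratio DURATION_S
# Reads one hitch duration in milliseconds per line on stdin and prints
# total hitch time divided by the recording duration, in ms/s.
hitch_time_ratio() {
    awk -v dur="$1" '
        { total += $1 }
        END { printf "%.2f ms/s\n", total / dur }'
}

# Hypothetical recording: three hitches (8.3, 16.7 and 25.0 ms) during 10 s of scrolling.
printf '8.3\n16.7\n25.0\n' | hitch_time_ratio 10
```

Normalizing by wall time is what makes a 10-second scroll test comparable with a 60-second one.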

Common findings:

Offscreen render passes for cornerRadius + clipsToBounds + shadow* combinations. The renderer copies the layer contents to an offscreen texture, applies the effect, copies back — visible in the Core Animation instrument’s “Offscreen Stages” toggle.
Layer-backed views with frequently-changing bounds triggering layout invalidation up the tree.
drawRect:-implemented views taking longer than one frame at the user’s refresh rate (8.33 ms at 120 Hz on ProMotion, 16.67 ms at 60 Hz).

Live overlay: Quartz Debug (Apple’s Additional Tools for Xcode) provides FPS overlay, flash-on-redraw, and offscreen-render visualization for development.

2.6 High energy / battery drain

Template: Energy Log in Instruments; CLI is powermetrics. Full deep-dive in mac-power-and-thermals.md.

sudo powermetrics --samplers cpu_power,gpu_power,thermal,smc -i 1000 gives a rolling per-second breakdown: package power (mW), per-core P-state and duty cycle, thermal pressure (Nominal / Light / Moderate / Heavy / Trapping / Sleeping), DRAM bandwidth and power on Apple Silicon.

For root-cause work, take a sysdiagnose — it bundles an Energy log, jetsam log, and spindumps in one archive Apple’s bug-report flow accepts directly.

2.7 Slow disk I/O

Template: File Activity in Instruments; CLI is fs_usage and latency.

sudo fs_usage -w -f filesys <pid> shows every filesystem syscall the process makes, with a timestamp in the leftmost column and each call’s elapsed time (in seconds, to microsecond resolution) in a column near the right edge. Patterns to look for:

Many small reads on the same fd (missing buffering).
getattrlist storms (typical of -[NSFileManager attributesOfItemAtPath:error:] in a loop; use NSURL resource values with prefetch keys instead).
fcntl(F_FULLFSYNC) calls that take 10–100 ms on rotating media.

2.8 Network latency

Template: Network with two instruments — Network Connections (TCP/UDP state, per-connection RTT, retransmits) and HTTP Traffic (request/response timing including TLS, body sizes, headers). HTTP Traffic does not require a proxy certificate — it hooks URLSession and CFNetwork at the framework layer. [WWDC21 #10212]

For wire-level capture: sudo tcpdump -i en0 -w /tmp/capture.pcap port 443 then open in Wireshark. For an iOS device tethered to a Mac, use rvictl -s <UDID> to create a virtual interface.

2.9 SwiftUI re-renders / layout thrash

Template: SwiftUI (WWDC23 #10160 “Demystify SwiftUI performance”; expanded in WWDC25 #306).

The instrument records View.body evaluations grouped by identity. Diagnostic moves:

Bodies that re-evaluate when they shouldn’t → over-broad @Observable / @Published source of truth.
Identity churn (a view’s identity changes every update) → mis-keyed ForEach, or implicit identity from position.
Long bodies that should be split into smaller views (each body is recomputed atomically).

For AppKit: the Layout instrument; NSView.layout() time can be visualized per view.

2.10 GPU underutilization or stalls

Template: Metal System Trace in Instruments, plus Xcode GPU Frame Capture (Debug > Capture GPU Workload).

Metal System Trace shows the CPU encoding timeline, command-buffer commit-to-present timeline, the GPU’s actual execution of each encoder, and (on Apple Silicon) per-pipeline stage breakdown — vertex, fragment, compute, tile. GPU counters expose IPC, ALU utilization, texture-cache hit rate, bandwidth. [WWDC20 #10603 “Optimize Metal apps and games with GPU counters”]

Frame Capture goes further: freezes a frame, lets you scrub through every draw call, and (on Apple Silicon) gives per-line shader cost in the source view. The Shader Cost Graph (WWDC21 #10157) triages expensive shaders by aggregate cost.

For local-LLM workloads the common findings:

Command buffers waiting on previous-frame completion (insufficient parallelism between encode and execute) — fix with MTLResourceHazardTrackingModeUntracked and explicit fences.
Compute kernels not saturating ALUs because of memory-bound access patterns — use bandwidth and ALU counters to confirm; flatten data layout, use simdgroup_matrix ops.
recommendedMaxWorkingSetSize exceeded → resources get swapped, throughput drops by an order of magnitude. See macos-memory-management.md §1.2.

Part 3 — The CLI tools (without Xcode)

The toolbox for the “I SSH’d to my MacStadium runner” case, and the in-app diagnostic bundle a user can run for you.

Tool	Purpose	Flags worth remembering
`sample <pid> <sec>`	Userland sampling, ~1 ms cadence. Non-disruptive — safe in production. Tree of stacks with frequency.	`-mayDie`, `-file <path>`, `-fullPaths`
`spindump <pid> <sec>`	Like `sample` but includes kernel stacks (root required). Apple’s reference for “what was this app doing during the hang.”	optional third positional argument for the sampling interval in ms (default 10); `-noText`, `-onlyTarget`, `-stdout`
`sysdiagnose`	The full kitchen sink: `ps`, `fs_usage`, `vm_stat`, `spindump` of every process, jetsam log, energy log, logarchive. The Apple-preferred attachment for a Feedback Assistant bug.	`-f <dir>`; keychord `Shift+Ctrl+Option+Cmd+.` triggers it system-wide
`vm_stat <interval>`	Mach VM page counters. Pageins/pageouts/compressed indicate pressure history. Reports page size (4K Intel, 16K Apple Silicon).	`vm_stat 1` for per-second deltas
`top`	Real-time process view.	`top -o cpu`, `top -o mem`, `top -stats pid,command,cpu,mem,vsize,rsize,instrs,cycles,purg,cmprs`
`nettop`	Per-process network bytes/connections.	`nettop -P`; `nettop -P -L 1` for a one-shot CSV sample (`-J` selects which columns to include)
`fs_usage`	Per-process filesystem and pathname-aware syscall trace.	`sudo fs_usage -w -f filesys <pid>`; `-f network` for sockets; `-f exec` for spawns
`iostat`	Disk throughput + CPU split.	`iostat -d -w 1` disk; `iostat -C -w 1` CPU (lowercase `-c` is the sample count)
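`latency`	Real-time scheduling-latency observer; spots preemption stalls and interrupt-handler stretches. Root only.	`sudo latency -p <priority>` (`-p` takes a scheduling priority or range, not a pid; flags differ between OS releases, so check `man latency`)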
`dtrace`	The fully programmable tracer (SIP-gated; see Part 5).	`sudo dtrace -n 'syscall:::entry { @[probefunc] = count(); }'` (matching only `entry` avoids counting each call twice)
`powermetrics`	Per-second SoC telemetry — power, P-states, thermal, ANE residency. Root only.	`sudo powermetrics --samplers cpu_power,gpu_power,thermal,ane_power,smc -i 1000 -n 60`
`xctrace`	The CLI face of Instruments. CI-friendly, no Xcode UI.	`xctrace list templates`; `xctrace record --template "Time Profiler" --attach <pid> --time-limit 30s --output app.trace`; `xctrace export --input app.trace --xpath '//trace-toc[1]/run[1]/data[1]/table[@schema="time-profile"]'`
`heap <pid>`	Walks malloc zones; class histograms.	`heap -diffFrom before.memgraph after.memgraph` is the gold flow
`vmmap <pid>`	Per-region VM dump with dirty/swapped/resident.	`vmmap -summary <pid>`; works on memgraph files
`footprint <pid>`	High-level breakdown by category (App, Compressed, IOKit, Graphics). Mirrors what jetsam sees. WWDC21 #10180.	Works on memgraph files; `--all-processes` for system-wide
`leaks <pid>`	Finds malloc allocations that are no longer reachable, including unreachable retain cycles.	`--outputGraph file.memgraph`; `--list` for short form
`malloc_history <pid>`	Allocation stacks (requires `MallocStackLogging=1`).	`--callTree -invert`, `--allBySize`, `--allByCount`
`atos`	Symbolicate a raw address against a binary + dSYM.	`atos -o MyApp.app/Contents/MacOS/MyApp -arch arm64 -l 0x100000000 0x10000abcd`
`taskinfo <pid>`	Mach task info — port counts, VM stats, thread states.	`taskinfo -all <pid>`

A note on kernel_task: macOS exposes a special kernel-only PID 0 that absorbs system CPU. It also acts as a thermal sink: when SoC temperature is high, the scheduler runs kernel_task on cores to displace user work. High kernel_task CPU is therefore often a thermal symptom rather than a kernel bug. Cross-check with powermetrics --samplers thermal.
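Because `vm_stat` reports pages rather than bytes, a small filter helps when reading it against Activity Monitor. The sketch below assumes the one-shot `vm_stat` layout (a header carrying the page size, then `Pages ...:` lines ending in a period); the function name `vm_stat_mib` and the sample numbers are made up for illustration.

```shell
# vm_stat_mib: read one-shot vm_stat output on stdin and print active, wired and
# compressor-occupied memory in MiB, using the page size from the header line.
vm_stat_mib() {
    awk '
        /page size of/ {
            for (i = 1; i <= NF; i++) if ($i == "of") pagesize = $(i + 1)
        }
        /^Pages active:/                 { active = $NF + 0 }   # "+ 0" drops the trailing period
        /^Pages wired down:/             { wired = $NF + 0 }
        /^Pages occupied by compressor:/ { comp = $NF + 0 }
        END {
            mib = pagesize / 1048576
            printf "active=%.0f wired=%.0f compressed=%.0f total=%.0f MiB\n",
                   active * mib, wired * mib, comp * mib, (active + wired + comp) * mib
        }'
}

# Trimmed, made-up sample in the vm_stat layout; on a Mac, pipe `vm_stat` in instead.
vm_stat_mib <<'EOF'
Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                               10000.
Pages active:                             65536.
Pages wired down:                         32768.
Pages occupied by compressor:             16384.
EOF
```

The interval form (`vm_stat 1`) prints a columnar table instead, so this filter applies only to the one-shot output.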

Part 4 — Signposts and custom instrumentation

4.1 The `os_signpost` API

Introduced iOS 12 / macOS 10.14. Three signpost types (.begin, .end, .event) produce two shapes in a trace:

Interval — os_signpost(.begin, ...) and os_signpost(.end, ...), paired by name and ID. Instruments draws as a bar.
Event — single point in time. Instruments draws as a tick.

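Key API constraint: by default an interval’s begin and end are matched by name and signpost ID across the whole process, not per thread, so the two calls may run on different threads. Multiple intervals with the same name can be in flight simultaneously if they have distinct OSSignpostID values. [Apple, os_signpost.h]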

Categories worth knowing:

OSLog.Category.pointsOfInterest — Instruments default lane.
OSLog.Category.dynamicTracing — only recorded under Instruments; zero cost otherwise.
OSLog.Category.dynamicStackTracing — like dynamicTracing but captures backtraces on every signpost. Use sparingly; bigger overhead.

Recipe for measuring a phase:

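import os

let log = OSLog(subsystem: "com.locara.runtime", category: .pointsOfInterest)
// requestObject, modelName, tokenCount and ttftMs are assumed to be in scope.
// Deriving the ID from the request object keeps concurrent requests distinct.
let id = OSSignpostID(log: log, object: requestObject)
os_signpost(.begin, log: log, name: "Inference",
            signpostID: id, "model=%{public}@ tokens=%d",
            modelName, tokenCount)
// ... work ...
// The end call must reuse the same name and ID so Instruments can pair the interval.
os_signpost(.end, log: log, name: "Inference", signpostID: id,
            "ttft=%{public}.2fms", ttftMs)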

Format string is os_log-style with %{public} / %{private} privacy modifiers. Values marked %{private} are redacted in the log store unless a debugger is attached or private-data logging has been enabled, for example through a logging configuration profile.

4.2 `kdebug` — lower-level

kdebug is the kernel-level trace facility that ktrace, fs_usage, and System Trace are all built on. User code can post kdebug events via kdebug_trace() (<sys/kdebug.h>) for events that should appear interleaved with kernel events. Useful when you want your app’s marks to line up exactly with mach_msg events or scheduler events. os_signpost writes through kdebug under the hood; reach for kdebug_trace directly only when you need to match the kernel’s event taxonomy.

4.3 Right granularity

Signposts cost something (a few hundred ns each, dropping to ~30 ns with dynamicTracing outside a trace). Apple-recommended granularity:

Per significant user-visible action (open document, send message, run inference).
Per frame for animation diagnostics.
Per network request.
Per phase of startup.
Not per line, not per loop iteration, not per allocation — those swamp the trace and skew the measurement.
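To see why per-iteration signposts skew the measurement, multiply it out. The 300 ns figure below is an assumed value inside the “few hundred ns” range quoted above, and `signpost_overhead_ms` is a name invented for this sketch.

```shell
# signpost_overhead_ms COUNT NS_EACH
# Total time spent emitting COUNT signposts at NS_EACH nanoseconds apiece, in ms.
signpost_overhead_ms() {
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.4f\n", n * ns / 1000000 }'
}

# Assumed cost: 300 ns per signpost call.
signpost_overhead_ms 2 300          # one begin/end pair around an inference request
signpost_overhead_ms 20000000 300   # a begin/end pair inside a 10M-iteration loop
```

Under that assumption the per-request pair costs well under a microsecond, while the per-iteration pair adds seconds to the very loop being measured.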

4.4 Composing with Time Profiler

In a System Trace recording, your signposts appear as a lane parallel to the Time Profiler samples, Thread State Trace, syscalls, scheduler events. Select a signpost interval and the other lanes scope to it — so you can ask “during the ‘Inference’ interval, what were the most-sampled stacks across all threads, and which threads were blocked, and what syscalls did we make?” in one selection. This composition is the reason signposts are worth the effort — they turn “the slow function” into “the slow phase of my workload.”

Part 5 — DTrace as the power tool

5.1 The philosophy (Cantrill / Leventhal / Shapiro)

DTrace was designed by Cantrill, Shapiro, and Leventhal at Sun; it first shipped in Solaris in 2003 and is described in their USENIX 2004 paper Dynamic Instrumentation of Production Systems. Three design constraints:

Safety. Probes cannot crash the system, leak memory, or modify kernel state. A bounded VM evaluates probe actions.
Negligible cost when disabled. A disabled probe is a NOP, not a branch. Production can leave the framework loaded.
No prediction required. Anywhere there’s a function or syscall, attach a probe at runtime without recompiling.

The mental shift: stop thinking “what should I have logged?” and start thinking “what would I like to ask?” — and then just ask. The answer is in the system; you don’t have to plan in advance.

5.2 DTrace on macOS, post-SIP

Apple ported DTrace to OS X 10.5 (2007) and shipped 40+ DTraceToolkit scripts in /usr/bin — iosnoop, execsnoop, opensnoop, dtruss, errinfo. [Gregg, brendangregg.com/dtrace.html]

SIP changed this in 10.11 (2015). With SIP enabled (the default), DTrace cannot:

Trace SIP-protected system binaries (most of /usr/bin, all of /System/).
Set probes that would write to protected memory.
Use the syscall provider against protected processes.

What you can still do with SIP on:

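Probe your own (unsigned or developer-signed) binaries with the pid provider (pid$target when launching with -c or attaching with -p, or pid<PID> with a literal process ID).
Use a userland-only subset of providers (the profile provider, for example the profile-997 probe for CPU sampling, and USDT probes you compiled in).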
Run scripts that don’t touch protected code paths.

To restore full DTrace: boot to recoveryOS and csrutil enable --without dtrace. Not recommended for production machines — disables the protection that prevents kernel-mode DTrace exploits. Long-standing Apple Silicon bug (developer.apple.com/forums/thread/735939): wake-from-sleep can leave DTrace freezing if SIP is partially disabled.

5.3 Canonical DTrace scripts to know

From the DTraceToolkit (Gregg) and the macOS-bundled subset:

Script	Purpose
`iosnoop`	Live trace of every block I/O — pid, command, file, bytes, latency
`opensnoop`	Live trace of every `open(2)` — “what file did this app just touch?”
`execsnoop`	Live trace of every `fork+exec` — finds rogue subprocess spawns
`dtruss`	DTrace-based `strace` analogue — every syscall with arguments
`procsystime <pid>`	Per-process syscall time accounting
`errinfo`	Live trace of every syscall that returned an error, with errno
`hot`	Hottest user-stacks across the system, sampled
`hotuser`	Same but for one process
`pridist`	Per-priority distribution of running threads

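Even when SIP restricts live tracing of system binaries, many of these scripts still work against your own app’s syscalls and I/O paths. Scripts that depend on kernel-side providers may need the DTrace restriction lifted as described in 5.2, so verify each one on your OS version.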

Part 6 — Off-CPU analysis on macOS

Gregg’s distinction (Part 1.2) says on-CPU profilers miss blocking time. The macOS-specific construction:

6.1 The wallclock approach

A wallclock profiler samples every thread regardless of whether it’s on-CPU, attributing time by stack. spindump is exactly this — samples user+kernel stacks at a fixed interval (default 10 ms) for every thread of every process for the recording window, then aggregates. Output shows, for each thread, the percentage of samples each call stack represented. A thread blocked on mach_msg2_trap for 90% of the window shows 90% of samples in that stack, distinguishable from a thread running mach_msg2_trap 90% of the window by the kernel stack annotation.

6.2 Off-CPU flame graphs on macOS

There is no first-class “off-CPU flame graph” Instrument template, but you can construct one:

Record with System Trace template + Thread State Trace.
Export with xctrace export --input <recording>.trace --xpath '//table[@schema="thread-state"]', which writes XML.
Feed into Gregg’s flamegraph.pl (github.com/brendangregg/FlameGraph) by folding the rows into “frame;frame;frame duration_us” lines: a semicolon-joined stack, a space, then the value.
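The folding step can be sketched as follows. It assumes you have already flattened the exported XML into tab-separated lines of `duration_us<TAB>root;...;leaf`; that intermediate format and the function name `fold_offcpu` are inventions of this example, not something xctrace produces directly.

```shell
# fold_offcpu: read "<duration_us><TAB><root;...;leaf>" lines on stdin, sum the
# durations of identical stacks, and print "stack total_us" lines, which is the
# folded format flamegraph.pl consumes.
fold_offcpu() {
    awk -F '\t' '
        NF == 2 { total[$2] += $1 }
        END { for (s in total) printf "%s %d\n", s, total[s] }
    ' | sort
}

# Hypothetical blocked intervals: two reads under loadModel, one condition-variable wait.
printf '1200\tmain;loadModel;read\n300\tmain;render;__psynch_cvwait\n800\tmain;loadModel;read\n' | fold_offcpu
```

The result can then be rendered with something like `fold_offcpu < blocked.tsv | flamegraph.pl --color=io --countname=us > offcpu.svg`, where `blocked.tsv` is a placeholder file name.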

For day-to-day work, the Thread State Trace in Instruments is the equivalent visualization without the flame-graph rendering — shows each thread as a horizontal bar with “running” / “blocked-on-X” / “preempted” / “interrupted” colored segments and the stack at each transition.

6.3 Wallclock vs CPU-time

A 30-second hang where the main thread waits on a lock costs 0 ms of CPU time but 30,000 ms of wallclock. A pure CPU profile (Time Profiler with default settings) will show ~0 samples on the main thread during the hang, because no samples are taken when the thread is off-CPU. This is the single most common cause of “I profiled and nothing looked slow.” Fix: add Thread State Trace to your recording. It records every thread’s state transitions, so blocked time is visible whether or not the thread ever reaches a core.

Part 7 — Principles from the CPU-aware tradition

7.1 Muratori: “performance-aware programming”

Muratori’s framing in the Computer Enhance course is that performance is not a phase, it is a property of how you wrote the code in the first place. The categorical errors he catalogues:

Polymorphism in the hot loop. Virtual dispatch is fine; virtual dispatch in a 100M-iteration loop is not. The pattern: replace with an enum + switch when the hot path is the polymorphic dispatch itself.
Per-item allocation. Allocating an object per row of a table at first render can be orders of magnitude slower than allocating one array and indexing into it.
Pointer-chasing data structures (linked lists, trees of small nodes) when an array would do — defeats the prefetcher.

For macOS specifically: a Swift class hierarchy traversed in a tight loop is a Muratori-grade red flag. Marcin Krzyżanowski’s Mysterious Swift Performance (iOSConfSG 2017) makes the related point that unoptimized Swift (-Onone) can be 500×–1000× slower than -O Swift, primarily due to virtual dispatch overhead the optimizer otherwise devirtualizes. [Krzyżanowski, blog.krzyzanowskim.com/2017/11/28/swift-runtime-performance/]

7.2 Lemire: measurement discipline

Lemire’s blog (lemire.me) is a continuous case study in benchmark methodology. Distilled rules:

Always measure on the target. A benchmark on your M3 Max says nothing about the M1 Air. Apple Silicon generations have substantially different cache hierarchies and prefetcher behavior.
Compare to a baseline. “X takes 12 ms” is meaningless; “X takes 12 ms, baseline-no-op takes 11 ms, so X actually costs 1 ms” is information.
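Use monotonic timers and hardware counters, not calendar time. mach_absolute_time() (with mach_timebase_info for the conversion to ns), clock_gettime(CLOCK_UPTIME_RAW) (the CLOCK_UPTIME_RAW_APPROX variant is cheaper but coarser), and the Counters instrument’s INST_RETIRED / CPU_CYCLES. Elapsed time alone varies with system load, so cross-check it against cycle and instruction counts.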
Vary input size. Lemire’s branch-prediction findings: at 2,000 elements modern CPUs perfectly predict patterns within 10 trials; at 10,000 elements the same pattern is unpredictable. A microbenchmark that fixes one size lies.
Beware “warmup”: cache and branch-predictor state are part of the result. A warmed-up loop measures the warm case; if production hits this code cold, the warmup runs are the measurement that matters.
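The baseline rule above can be sketched in a few lines. Using the median to summarize repeated runs is a choice made for this example (it resists the occasional slow outlier), not a method the text attributes to Lemire; `median` and `net_cost_ms` are hypothetical helper names and the timings are made up.

```shell
# median: read one number per line on stdin, print the median.
median() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# net_cost_ms "MEASURED..." "BASELINE..."
# Each argument is a space-separated list of timings in ms.
# Prints median(measured) - median(baseline).
net_cost_ms() {
    # $1 and $2 are intentionally unquoted so the shell splits them into one value per line.
    m=$(printf '%s\n' $1 | median)
    b=$(printf '%s\n' $2 | median)
    awk -v m="$m" -v b="$b" 'BEGIN { printf "%.1f\n", m - b }'
}

# Made-up timings: the 14.8 ms outlier does not move the median.
net_cost_ms "12.3 11.9 12.0 14.8 12.0" "11.0 11.2 10.9 11.0 11.1"
```

Here the harness itself accounts for about 11 ms, so the code under test costs roughly 1 ms, which is the number worth optimizing against.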

7.3 Fog: microarchitecture awareness

Agner Fog’s manuals (agner.org/optimize) come in five volumes; transferable principles from the x86 manuals to Apple Silicon ARM:

Cache hierarchy you must reason about: L1 (32–64 KB per core on typical x86 parts, 64–128 KB of data cache per core on Apple Silicon, 3–4 cycle latency), L2 (a few MB per cluster, ~12 cycle), SLC (system-level cache, dozens of MB, ~40 cycle), DRAM (~80–120 ns, hundreds of cycles). Exact numbers vary by SKU; treat as orders of magnitude. Use sysctl hw for cache topology and the Counters instrument for measured miss rates.
TLB pressure. Apple Silicon’s 16 KB pages help here vs. Intel’s 4 KB — fewer pages for the same working set means fewer TLB entries needed. But large working sets (LLM weights) still blow the TLB; this is one reason mmap’d weights with sequential access patterns outperform random access.
False sharing. Two threads writing different fields of the same cache line serialize through the cache-coherence protocol. sysctl hw.cachelinesize reports 128 bytes on Apple Silicon, versus 64 on x86, so padding tuned for x86 is too small. Pad shared-counter structures to cache-line boundaries, using the queried line size.
Branch misprediction cost. Approximately 15–20 cycles on modern Apple cores (not verified per generation; Apple does not publish microarchitecture details — community has measured via timing). Avoid hard-to-predict branches in inner loops; prefer branchless idioms.
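Because the exact figures vary by SKU, it can help to query the topology at runtime rather than hard-code it. The following is a minimal sketch using sysctlbyname; the helper name sysctlInteger is an assumption for illustration, and the key list is only a starting point (some keys may be absent on a given machine, in which case the helper returns nil).

```swift
import Darwin

/// Reads an integer sysctl by name. Returns nil when the key does not exist
/// or is not integer-sized. Assumes a little-endian host, so 4-byte values
/// land correctly in the zero-initialized 8-byte buffer.
func sysctlInteger(_ name: String) -> Int64? {
    var value: Int64 = 0
    var size = MemoryLayout<Int64>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return nil }
    return value
}

// Keys relevant to the cache, TLB, and false-sharing points above.
let keys = ["hw.pagesize", "hw.cachelinesize", "hw.l1dcachesize",
            "hw.l2cachesize", "hw.memsize"]

for key in keys {
    if let value = sysctlInteger(key) {
        print("\(key) = \(value)")
    } else {
        print("\(key): not available on this machine")
    }
}
```

Use the reported hw.cachelinesize when deciding how far apart to place per-thread counters, and the reported hw.pagesize when estimating how many TLB entries a working set needs.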

7.4 The 100x framing

Muratori’s “100x problem” framing matters because most performance work is done with a 2× mindset — “this is 30% too slow, let me find a 30% speedup.” But the architectural decisions (polymorphism vs flat array, allocation per item vs pool, cache-line alignment, SIMD vs scalar) are 10×–10,000× decisions. The order of operations is: do the architectural work first (Muratori), then the measurement (Lemire) of µarch behavior (Fog), then profile (Gregg) the remaining bottleneck. Profiling first and trying to architect later often discovers that the entire data structure was wrong.
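To make the "polymorphism vs flat array" decision concrete, here is a small sketch loosely modeled on the shapes example from Muratori's article. The type names and the coefficient table are illustrative assumptions, not his code. Both versions compute the same total area; the first dispatches through a class hierarchy per element, the second walks a contiguous array of plain structs with a table lookup.

```swift
// Version 1: class hierarchy with dynamic dispatch per element.
class Shape {
    func area() -> Float { 0 }
}
final class Square: Shape {
    let side: Float
    init(side: Float) { self.side = side; super.init() }
    override func area() -> Float { side * side }
}
final class Rectangle: Shape {
    let width: Float, height: Float
    init(width: Float, height: Float) { self.width = width; self.height = height; super.init() }
    override func area() -> Float { width * height }
}
final class Triangle: Shape {
    let base: Float, height: Float
    init(base: Float, height: Float) { self.base = base; self.height = height; super.init() }
    override func area() -> Float { 0.5 * base * height }
}
final class Circle: Shape {
    let radius: Float
    init(radius: Float) { self.radius = radius; super.init() }
    override func area() -> Float { Float.pi * radius * radius }
}

func totalArea(_ shapes: [Shape]) -> Float {
    var total: Float = 0
    for shape in shapes { total += shape.area() }  // one virtual call per element
    return total
}

// Version 2: flat array of value types, area = coefficient * width * height.
enum Kind: Int { case square, rectangle, triangle, circle }

struct FlatShape {
    var kind: Kind
    var width: Float
    var height: Float
}

let areaCoefficient: [Float] = [1, 1, 0.5, Float.pi]  // indexed by Kind.rawValue

func totalArea(_ shapes: [FlatShape]) -> Float {
    var total: Float = 0
    for shape in shapes {
        total += areaCoefficient[shape.kind.rawValue] * shape.width * shape.height
    }
    return total
}

let viaClasses = totalArea([Square(side: 2), Rectangle(width: 2, height: 3),
                            Triangle(base: 4, height: 3), Circle(radius: 1)] as [Shape])
let viaFlatArray = totalArea([FlatShape(kind: .square, width: 2, height: 2),
                              FlatShape(kind: .rectangle, width: 2, height: 3),
                              FlatShape(kind: .triangle, width: 4, height: 3),
                              FlatShape(kind: .circle, width: 1, height: 1)])
print(viaClasses, viaFlatArray)
```

The sketch makes no timing claim; the size of the difference depends on element count, optimization level, and hardware, and should be measured on the target.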

Part 8 — Recipes (workflow chains)

8.1 “User reports the app feels sluggish”

Activity Monitor first. Is the app at >100% CPU? Is the GPU bar pegged? Is memory pressure red? These three pictures rule out 80% of cases in 5 seconds.
If CPU is high → Time Profiler (or sample <pid> 10). Look for the hot stack on the main thread.
If CPU is low but UI is frozen → Hangs template + Thread State Trace. Off-CPU analysis. Look for the blocking syscall at the bottom of the main-thread stack at the moment of the hang.
If both look fine but scrolling is janky → Animation Hitches template. Look for offscreen render passes and long commit-phase frames.
If nothing is hot but the app feels slow → SwiftUI/Layout instrument. Look for over-frequent body evaluations.
If still mysterious → sysdiagnose and read the spindump for the app + neighbors. Sometimes the sluggish app is innocent; another process is wiring all the memory and forcing yours into swap.

8.2 “Battery drains too fast when this app is running”

sudo powermetrics --samplers cpu_power,gpu_power,thermal,ane_power -i 1000 -n 60 — establish baseline. Package power draw with the app idle? Active?
Energy Log instrument. Captures the same data plus per-process attribution, network/GPS/audio activity flags.
High CPU power with low GPU power → look for a runaway thread. Time Profiler with the weight column sorted descending.
High GPU power without rendering → a Metal compute kernel is running unbatched. Metal System Trace will show command-buffer cadence.
High network/wake activity → nettop -P for the talker; pmset -g log | grep Wake for wake reasons.
Thermal pressure climbing into Moderate/Heavy → SoC is at thermal limits; kernel_task will rise to displace work. This is the OS protecting the chip; the fix is in your app’s workload shape (batching, sleep between bursts), not the OS.
Submit a sysdiagnose to yourself — the Energy bundle inside is what Apple’s bug-report tooling consumes.

Deeper recipe in mac-power-and-thermals.md §5.

8.3 “Memory keeps growing — leak hunt”

Activity Monitor’s “Memory” tab + column gear → enable “Real Mem”, “Real Private Mem”, “Compressed Mem”. Watch for steady growth.
footprint <pid> to see what kind of memory grew — App Memory, IOKit, Compressed, Wired. The footprint metric is what jetsam sees.
leaks --outputGraph=baseline.memgraph <pid> at a quiescent state.
Drive the suspected leak path (e.g., open and close a document 50×).
leaks --outputGraph=after.memgraph <pid> and heap -diffFrom=baseline.memgraph after.memgraph to see which class instances grew.
With MallocStackLogging=1 in scheme: malloc_history <pid> -callTree -invert gives the allocation stacks for the leaking class.
For cycles between two mutually-retaining unreachable objects: Xcode > Debug > Debug Memory Graph. Visual graph view surfaces them.
Verify the fix with the same memgraph diff workflow.
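For the cycle case in particular, it helps to know what the memgraph diff is expected to show. The sketch below is a self-contained illustration with hypothetical Document and Window types and a Census counter invented for the example. It drives an open-and-close path 50 times, once with a strong back-reference (every pair leaks) and once with a weak one (nothing survives).

```swift
/// Counts live instances so the leak is observable without external tools.
final class Census {
    var live = 0
}

final class Document {
    let census: Census
    var window: Window?                 // strong reference to the window
    init(census: Census) { self.census = census; census.live += 1 }
    deinit { census.live -= 1 }
}

final class Window {
    let census: Census
    var strongDocument: Document?       // strong back-reference: forms a cycle
    weak var weakDocument: Document?    // weak back-reference: no cycle
    init(census: Census) { self.census = census; census.live += 1 }
    deinit { census.live -= 1 }
}

/// Opens and closes a document `times` times and returns how many objects
/// are still alive afterwards.
func openAndClose(times: Int, weakBackReference: Bool) -> Int {
    let census = Census()
    for _ in 0..<times {
        let document = Document(census: census)
        let window = Window(census: census)
        document.window = window
        if weakBackReference {
            window.weakDocument = document
        } else {
            window.strongDocument = document
        }
        // Both locals go out of scope here; only a cycle keeps them alive.
    }
    return census.live
}

let leaked = openAndClose(times: 50, weakBackReference: false)
let fixed = openAndClose(times: 50, weakBackReference: true)
print("strong back-reference leaves \(leaked) objects; weak leaves \(fixed)")
```

In a real app the same pattern shows up in the memgraph diff as instance counts that grow by a fixed amount per iteration of the suspected path.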

8.4 “Random hangs during heavy AI inference”

Hangs template with the app running inference. Confirm: blocked-main hang (low CPU on main thread) or busy-main hang (high CPU)?
If busy-main: inference is on the main thread (common bug). Move to a dedicated DispatchQueue or actor. Verify with a follow-up trace.
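If blocked-main: main thread is waiting on something. Most common causes for AI workloads:
- Synchronous wait on an inference task from @MainActor code (a semaphore or DispatchGroup.wait() bridging into async work). A plain await suspends without blocking the thread, so make the call chain async end to end.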
- Lock contention on a shared tokenizer or sampler. Per-thread state.
- Metal command-buffer wait on the main thread (MTLCommandBuffer.waitUntilCompleted called from main). Convert to addCompletedHandler callbacks.
Metal System Trace in parallel: confirm the GPU is actually doing work, not stalled waiting for a previous buffer. Look for gaps in the GPU timeline.
sudo powermetrics --samplers thermal in another terminal: if thermal pressure is Heavy, the chip is being throttled and every operation slows. Inference becomes 3–5× slower under thermal throttling. Often masquerades as a hang.
Off-CPU flame graph or Thread State Trace: look for the worker thread also blocked — sometimes the hang is on the worker, and main is correctly waiting for it.
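The Metal wait case above can be sketched as follows. This is one possible shape, not a prescribed API: commitAndAwaitCompletion, submitAndAwait, and GPUSubmitError are names invented for the example. The idea is to register addCompletedHandler before committing and bridge the callback into Swift concurrency, so the caller suspends instead of blocking a thread in waitUntilCompleted.

```swift
import Metal

enum GPUSubmitError: Error {
    case noCommandBuffer
}

extension MTLCommandBuffer {
    /// Commits the buffer and suspends the caller until the GPU has finished,
    /// without blocking the calling thread.
    func commitAndAwaitCompletion() async {
        await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
            // The handler must be registered before commit().
            self.addCompletedHandler { _ in
                continuation.resume()
            }
            self.commit()
        }
    }
}

/// Encodes work with `encode`, submits it, and returns once the GPU is done.
/// Throws if the buffer cannot be created or completes with an error.
func submitAndAwait(on queue: MTLCommandQueue,
                    encode: (MTLCommandBuffer) -> Void) async throws {
    guard let buffer = queue.makeCommandBuffer() else {
        throw GPUSubmitError.noCommandBuffer
    }
    encode(buffer)
    await buffer.commitAndAwaitCompletion()
    if let error = buffer.error {
        throw error
    }
}
```

Called from @MainActor code as try await submitAndAwait(on: queue) { buffer in ... }, the main thread stays free to service events while the GPU runs; a follow-up Hangs trace should confirm the blocked-main interval is gone.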

8.5 “Slow first-token latency in a chat app”

The metric is TTFT (time-to-first-token). Workflow:

Bracket the whole chain with signposts:

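import os.signpost
let log = OSLog(subsystem: "com.example.app", category: .pointsOfInterest)  // placeholder subsystem
let id = OSSignpostID(log: log)
os_signpost(.begin, log: log, name: "TTFT", signpostID: id)
// ... user submit → tokenize → prefill → first token → render ...
os_signpost(.end, log: log, name: "TTFT", signpostID: id)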

Then sub-signposts for tokenize, prefill, firstDecode, firstRender.

Record with System Trace. Open in Instruments, select the TTFT interval, the other lanes scope to it. You now have main-thread samples, worker samples, GPU command-buffer execution, memory allocation, signposts — all aligned to that one window.
Decompose by signpost lane. Typical findings on first run:
- tokenize shouldn’t be more than a few ms; if it is, the tokenizer is loading lazily — preload.
- prefill dominates because the KV cache is being allocated and warmed; this is the GPU-bound phase. Metal System Trace will show the GPU pegged.
- firstDecode is the autoregressive single-token step; it should take tens of ms. If it takes hundreds, the weights are being paged in from disk: mmap itself is fine, but the first touch of each mapped page faults it in from disk. Pre-fault by reading the file once at load.
- firstRender is the SwiftUI layout cost; if surprising, it’s because a 1000-token-context view is being rendered to fit.
Memory mapping check: vmmap <pid> after first inference. Are the weights showing as MALLOC regions (bad — they’re dirty in App Memory) or as mapped file regions (good — they’re Cached Files, file-backed)? See macos-memory-management.md §2.4 and §2.6.
Thermal check: if TTFT degrades on the 3rd or 4th request, run sudo powermetrics --samplers thermal -i 1000 during a session and confirm thermal pressure climbs. The early requests heat the SoC; the later ones run throttled.
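The sub-signposts for the four phases can be sketched with OSSignposter (macOS 12 and later), which nests cleanly inside an outer TTFT interval. The subsystem string and the three stage functions below are placeholders invented for the example; a real pipeline would substitute its own tokenizer, prefill, and decode calls, and would emit firstRender from the UI layer.

```swift
import os

// Placeholder subsystem and category; use your app's identifiers.
let signposter = OSSignposter(subsystem: "com.example.chat", category: "Inference")

// Hypothetical stand-ins for the real pipeline stages.
func tokenize(_ prompt: String) -> [String] {
    prompt.split(separator: " ").map(String.init)
}
func prefill(_ tokens: [String]) -> Int {
    tokens.count
}
func decodeFirst(contextLength: Int) -> String {
    "token-\(contextLength)"
}

/// Runs the chain up to the first token, emitting one outer TTFT interval
/// and one nested interval per phase.
func firstToken(for prompt: String) -> String {
    let id = signposter.makeSignpostID()
    let ttft = signposter.beginInterval("TTFT", id: id)
    defer { signposter.endInterval("TTFT", ttft) }

    let tokens = signposter.withIntervalSignpost("tokenize", id: id) {
        tokenize(prompt)
    }
    let contextLength = signposter.withIntervalSignpost("prefill", id: id) {
        prefill(tokens)
    }
    return signposter.withIntervalSignpost("firstDecode", id: id) {
        decodeFirst(contextLength: contextLength)
    }
}

print(firstToken(for: "hello world"))
```

In Instruments each phase then appears as its own interval under the same signpost ID, so selecting the TTFT interval scopes the other lanes to it and the per-phase split is read directly off the timeline.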

Specific learnings for Locara

Ship signposts in the runtime. The Locara runtime should emit os_signpost intervals for every model load, inference call, token generation phase, and Metal command-buffer encode. Apps inherit these for free. Users who report “slow” can capture a sysdiagnose by pressing Shift+Ctrl+Option+Cmd+. and the next debugging round starts with answers, not questions.
Build the in-app “Diagnostics” pane around footprint, not top. Activity Monitor and top show RSS, which lies about mmap’d weights. The runtime should expose footprint-equivalent numbers (“App Memory”, “Compressed”, “GPU-wired”) and explain what each means.
Detect thermal throttling at runtime and surface it. NSProcessInfo.thermalState for a coarse nominal/fair/serious/critical indicator (or the Darwin notification kOSThermalNotificationPressureLevelName for the 5-level signal — see mac-power-and-thermals.md §2.6). If thermal state goes past fair during inference, surface a “your Mac is hot; performance will be reduced” notice rather than letting the user infer a bug.
Default to System Trace for the Locara dev tools, not Time Profiler. The System Trace template is what we want for an unknown-symptom diagnostic — captures scheduler, signposts, syscalls, allocations, CPU samples simultaneously, and filtering happens at view time.
Refuse to ship apps that block the main thread on inference. Static analysis (a build-time lint) plus a runtime sanity check (Thread.isMainThread assertion in the inference entry point in debug builds) catches the most common AI-app pathology before it ships.
Document the SIP/DTrace situation in the dev guide. Engineers who try dtruss myapp and get “DTrace cannot control executables signed with restricted entitlements” need to know it’s a SIP+library-validation interaction, not a Locara bug. The right workaround is signed-but-not-hardened development builds, not csrutil disable.
The runtime owns the sysdiagnose workflow. When a user files a bug, the Locara CLI should be able to capture xctrace record --template "Time Profiler" --attach <pid> automatically with permission, package it with the manifest hash, and let the user attach it to the report. Reduces back-and-forth.
Lean on Gregg’s USE method as the diagnostic frame in docs. “Is your app slow? Check Utilization, Saturation, Errors for CPU / Memory / Disk / Network / GPU.” It’s the cheapest cognitive frame to teach, and it generalizes.
Use Muratori’s 100x framing when prioritizing perf work. A 20% perf bug is a profiling task. A 100× perf bug is an architecture task. Don’t confuse them in the dev-team rituals.
Production telemetry via MetricKit, not custom instrumentation. Apple’s MXMetricManager reports CPU, animation hitch, app launch, and energy metrics in a privacy-respecting daily payload. Wire it into Locara from v1 — the alternative (rolling your own) is invasive and inaccurate. See mac-power-and-thermals.md §8.
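Two of the runtime guards in this list are small enough to sketch directly: surfacing thermal state past fair, and asserting in debug builds that inference is not entered on the main thread. The function names and the notice wording are assumptions for illustration; ProcessInfo.thermalState, ProcessInfo.thermalStateDidChangeNotification, and Thread.isMainThread are the Foundation APIs involved.

```swift
import Foundation

/// Maps the coarse thermal state to a user-facing notice.
/// Returns nil while the state is nominal or fair.
func thermalNotice(for state: ProcessInfo.ThermalState) -> String? {
    switch state {
    case .nominal, .fair:
        return nil
    case .serious:
        return "Your Mac is hot; performance will be reduced."
    case .critical:
        return "Your Mac is very hot; performance will be heavily reduced."
    @unknown default:
        return nil
    }
}

/// Debug-build guard for a blocking inference entry point (hypothetical name).
/// The assertion compiles out of release builds.
func runInferenceBlocking<T>(_ work: () -> T) -> T {
    assert(!Thread.isMainThread, "inference must not run on the main thread")
    return work()
}

// Observe changes; the block only fires while a run loop is servicing the
// main queue, as it is in a normal app.
let thermalObserver = NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil,
    queue: .main
) { _ in
    if let notice = thermalNotice(for: ProcessInfo.processInfo.thermalState) {
        print(notice)
    }
}

// Check once at startup as well, since no notification is posted for the
// state the process launches in.
if let notice = thermalNotice(for: ProcessInfo.processInfo.thermalState) {
    print(notice)
}
```

Keep thermalObserver alive for as long as the notices should be shown, and pass it to NotificationCenter.default.removeObserver when tearing down.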

References

Brendan Gregg:

Systems Performance: Enterprise and the Cloud, 2nd ed., Addison-Wesley, 2020. Especially Ch. 2 (Methodology), Ch. 5 (Applications), Ch. 6 (CPUs), Ch. 7 (Memory).
USE Method — https://www.brendangregg.com/usemethod.html
Off-CPU Analysis — https://www.brendangregg.com/offcpuanalysis.html
Flame Graphs — https://www.brendangregg.com/flamegraphs.html
DTrace Tools — https://www.brendangregg.com/dtrace.html
DTraceToolkit — https://www.brendangregg.com/dtracetoolkit.html
FlameGraph repo — https://github.com/brendangregg/FlameGraph

Bryan Cantrill / Adam Leventhal / Mike Shapiro:

Cantrill, Shapiro, Leventhal, Dynamic Instrumentation of Production Systems, USENIX ATC 2004.
Cantrill, Hidden in Plain Sight, ACM Queue Feb 2006.
Cantrill, The Observation Deck blog — https://bcantrill.dtrace.org/
Cantrill, DTrace at 21: Reflections on Fully-grown Software, Speaker Deck 2024.

Mike Ash:

Friday Q&A archive — https://www.mikeash.com/pyblog/
Especially the entries on Mach time, ARC, retain cycles, and run loops.

Quinn “The Eskimo!” (Apple DTS):

Apple Developer Forums (search “Quinn” + topic). Notable threads on spindump interpretation, hang debugging, network instrument behaviors.

Jonathan Levin:

MacOS and *OS Internals, Volumes I (User Mode), II (Kernel Mode), III (Security). Technologeeks Press.
newosxbook.com — sample chapters and tools.

Casey Muratori:

“Clean” Code, Horrible Performance — https://www.computerenhance.com/p/clean-code-horrible-performance
Performance-Aware Programming course — https://www.computerenhance.com/
Handmade Hero archive.

Daniel Lemire:

Blog — https://lemire.me/blog/
Microbenchmarking calls for idealized conditions (2018).
Benchmarking is hard: processors learn to predict branches (2019).
Mispredicted branches can multiply your running times (2019).

Agner Fog:

Optimization manuals — https://www.agner.org/optimize/ (Optimizing software in C++; Optimizing subroutines in assembly; The microarchitecture of Intel, AMD and VIA CPUs; Instruction tables; Calling conventions).

Hennessy & Patterson:

Computer Architecture: A Quantitative Approach, 6th ed., Morgan Kaufmann, 2017.

Apple WWDC sessions (most relevant for this domain):

WWDC 2018 #608 Metal Shader Debugging and Profiling.
WWDC 2019 #411 Getting Started with Instruments.
WWDC 2019 #423 Optimizing App Launch.
WWDC 2020 #10603 Optimize Metal apps and games with GPU counters.
WWDC 2020 #10077 Eliminate animation hitches with XCTest.
WWDC 2021 #10157 Discover Metal debugging, profiling, and asset creation tools.
WWDC 2021 #10180 Detect and diagnose memory issues.
WWDC 2021 #10212 Analyze HTTP traffic in Instruments.
WWDC 2022 #10082 Track down hangs with Xcode and on-device detection.
WWDC 2022 #10106 Profile and optimize your game’s memory.
WWDC 2022 #110350 Visualize and optimize Swift concurrency.
WWDC 2023 #10160 Demystify SwiftUI performance.
WWDC 2023 #10248 Analyze hangs with Instruments.
WWDC 2024 #10173 Analyze heap memory.
WWDC 2025 #306 Optimize SwiftUI performance with Instruments.
Apple Tech Talks 10855/10856/10857 — Render-loop hitch series.

Apple documentation:

Instruments Help and the Xcode “Recording Performance Data” set.
Understanding hangs in your app — https://developer.apple.com/documentation/xcode/understanding-hangs-in-your-app
Understanding hitches in your app — https://developer.apple.com/documentation/xcode/understanding-hitches-in-your-app
Diagnosing performance issues early — https://developer.apple.com/documentation/xcode/diagnosing-performance-issues-early
xctrace(1) man page — Keith Smiley’s mirror at https://keith.github.io/xcode-man-pages/xctrace.1.html.
Capturing a Metal workload in Xcode — https://developer.apple.com/documentation/xcode/capturing-a-metal-workload-in-xcode.

Sadun / Hillegass / Dalrymple:

Hillegass & Preble, Cocoa Programming for Mac OS X, 4e (2011) — Cocoa-layer performance idioms.
Dalrymple & Hillegass, Advanced Mac OS X Programming — Mach VM, IOKit, run loops.

Marcin Krzyżanowski:

The Mysterious Swift Performance / Slow Swift — Speaker Deck, iOSConfSG 2017.
Swift Runtime Performance — https://blog.krzyzanowskim.com/2017/11/28/swift-runtime-performance/.

Other community resources:

Poweruser blog, Using dtrace on MacOS with SIP enabled — https://poweruser.blog/using-dtrace-with-sip-enabled-3826a352e64b.
Eclectic Light Co. — series on SIP, csrutil, RunningBoard, launchd.
Use Your Loaf — WWDC viewing guides; pragmatic write-ups on App Launch and Time Profiler.
Donny Wals, Measuring performance with os_signpost.

Contested / version-dependent:

DTrace usability on Apple Silicon — broken in places on Ventura+ (developer.apple.com/forums/thread/735939); use Instruments equivalents where possible.
recommendedMaxWorkingSetSize fraction (~66–75% of RAM, varies by macOS version).
Cache line size on Apple Silicon (sysctl hw.cachelinesize reports 128 bytes, though 64 is still widely assumed; not formally documented).
Branch mispredict penalty on Apple Silicon (~15–20 cycles, measured by community; Apple has not published).
Apple does not publish detailed Apple Silicon microarchitecture data; treat numeric ranges as community-measured, not vendor-confirmed.