Locara

Chip Fundamentals — How Silicon Computes (and Why LLMs Run the Way They Do)

What this is: A primer on integrated circuits, from transistors and CMOS through architecture, memory hierarchy, and the LLM-specific bottlenecks. Written at the level of detail needed to make informed Locara decisions, not for a hardware-engineer audience.

Why it matters: Every “will this app run on the user’s machine?” question reduces to chip-level constraints. Memory bandwidth, capacity, quantization choices, and accelerator support all flow from physics and architecture decisions. Without a working mental model, our targeting decisions are guesses.

Most relevant to Locara: Foundational. Pairs with modern-chip-landscape.md for the current product lineup and computing-history.md for how we got here.

The physical layer (transistors)

A modern chip is monocrystalline silicon, roughly 1cm² (mobile) to 8cm² (data-center GPUs), with billions of transistors etched into ~100 patterned layers via lithography, doping, etching, and chemical-mechanical polishing. The “node” name (e.g., TSMC N3, Intel 18A) is largely marketing — actual smallest feature pitches are in the ~30–50nm range, much larger than the marketing number suggests.

The transistor is a switch. CMOS (Complementary Metal-Oxide-Semiconductor — the dominant logic family) uses a complementary NMOS+PMOS pair per gate. Logic 0 = ground, logic 1 = supply voltage (~0.7–1.1V on modern nodes). Switching takes picoseconds; each switch dissipates roughly CV² in dynamic energy, plus increasing static leakage as transistors shrink.

Three numbers shape every design tradeoff:

  • Frequency — switches per second on the slowest critical path. ~3–6 GHz for modern CPUs, lower for GPUs and mobile (~1–3 GHz).
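  • Power — frequency × voltage² × switching capacitance + static leakage. Dynamic power scales with V², and higher clocks usually need higher voltage, so power grows much faster than linearly with frequency, which is why high-clock parts get hot fast.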
  • Area — transistors per mm². More area = higher cost, but cost-per-transistor dropped ~30%/year for decades — Moore’s Law (Gordon Moore, Intel co-founder, 1965).
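As a rough illustration of the power bullet above, the sketch below evaluates the dynamic-power term and shows how a clock increase that also needs a voltage increase compounds. The activity factor, capacitance, and the +20% clock / +10% voltage figures are illustrative assumptions, not measurements of any real part, and static leakage is ignored.

```python
def dynamic_power_watts(activity, capacitance_f, voltage_v, frequency_hz):
    """Dynamic CMOS power: activity factor x switched capacitance x V^2 x f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def relative_power(freq_scale, volt_scale):
    """Ratio of new to old dynamic power after scaling clock and voltage."""
    return freq_scale * volt_scale ** 2


# Assumed, illustrative values (not taken from any datasheet).
base = dynamic_power_watts(activity=0.1, capacitance_f=1e-7,
                           voltage_v=0.9, frequency_hz=4e9)
boost = relative_power(freq_scale=1.2, volt_scale=1.1)

print(f"baseline dynamic power: {base:.1f} W")
print(f"+20% clock at +10% voltage: {boost:.2f}x the power")
```

Under these assumptions a 20% faster clock costs roughly 45% more power, which is the superlinear behavior the bullet describes.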

Two underlying physical “laws” governed the long boom:

  1. Moore’s Law — transistor density doubles every ~2 years. Slowing as of ~2015; TSMC’s roadmap still gets ~15–20% density gains per node, but per-transistor cost is now flat-to-rising.
  2. Dennard scaling (Robert Dennard, IBM, 1974) — power-per-area stayed constant as transistors shrank, so chips got faster and more efficient simultaneously. Broke around 2005. This is why CPU clocks plateaued (~4 GHz wall) and we shifted to multicore: free clock-speed lunches ended.
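The Dennard argument can be checked with a few lines of arithmetic. This is a simplified sketch of the textbook scaling rules (capacitance and voltage shrink by 1/k, frequency rises by k, density rises by k²); the function name and the choice of k = 1.4 are assumptions for illustration only.

```python
def shrink(k, voltage_scales=True):
    """Relative changes after shrinking linear dimensions by a factor k (k > 1)."""
    capacitance = 1.0 / k                         # smaller gates hold less charge
    voltage = 1.0 / k if voltage_scales else 1.0  # voltage stops scaling post-Dennard
    frequency = k                                 # shorter channels switch faster
    density = k ** 2                              # transistors per unit area
    power_per_transistor = capacitance * voltage ** 2 * frequency
    return {
        "frequency": frequency,
        "density": density,
        "power_density": power_per_transistor * density,
    }


classic = shrink(1.4)                        # voltage scales with dimensions
stalled = shrink(1.4, voltage_scales=False)  # voltage stuck (leakage limits)

print("classic Dennard power density:", round(classic["power_density"], 3))
print("voltage stalled power density:", round(stalled["power_density"], 3))
```

With voltage scaling, power density stays flat while frequency and density both rise. Once voltage stops scaling, the same shrink at a higher clock roughly doubles power density per node, so designers held clocks flat and spent the extra transistors on more cores.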

Hennessy and Patterson, in their canonical Computer Architecture: A Quantitative Approach, frame the post-Dennard era as the “iron triangle” of latency, throughput, and energy — every modern design trades across all three rather than getting all for free.

Microarchitecture and ISA

Above transistors sits microarchitecture: how transistors are organized into ALUs, register files, decoders, caches, branch predictors, and how those units coordinate. Above microarchitecture sits the Instruction Set Architecture (ISA) — the contract with software:

  • x86-64 (Intel/AMD) — CISC-derived, with the x86 family dominant on desktop/server for 30+ years; complex decoder, larger transistor budget for legacy support.
  • ARMv8 / ARMv9 (Apple, Qualcomm, MediaTek, Ampere) — RISC-derived, dominant on mobile, gaining on laptop and server; simpler decoder, more transistor budget left for cores and cache.
  • RISC-V — open ISA, growing in embedded/specialized; not yet at parity for general-purpose performance but trajectory is steady (Tenstorrent and SiFive are betting on it).

Key microarchitectural levers:

  • Pipelining — multiple instructions in flight simultaneously, each at a different stage.
  • Out-of-order (OoO) execution — reorder instructions dynamically to keep functional units busy.
  • Superscalar — multiple instructions issued/retired per cycle (modern cores: 4–10 wide).
  • Branch prediction — speculate at branches; rewind on mispredict. Modern predictors >95% accurate; the deeper the pipeline, the more this matters.
  • SIMD — single instruction, multiple data (Intel AVX-512, ARM NEON / SVE2). One vector op handles 8/16/32 elements.
  • SMT (hyperthreading) — two hardware threads share an OoO core, hiding stalls.
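Two of the levers above have simple back-of-envelope models: pipelining (cycle counts with and without overlap) and branch prediction (how mispredicts inflate cycles per instruction). The sketch below uses idealized textbook formulas; the stage count, branch fraction, mispredict rate, and penalty are assumed values chosen for illustration.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: fill once, then one instruction completes per cycle."""
    return n_stages + n_instructions - 1


def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Cycles per instruction once branch-mispredict flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


print(unpipelined_cycles(1000, 5))  # 5000
print(pipelined_cycles(1000, 5))    # 1004

# Assumed: a 4-wide core (base CPI 0.25), 20% branches,
# 95% predictor accuracy, 15-cycle flush penalty.
cpi = effective_cpi(base_cpi=0.25, branch_fraction=0.2,
                    mispredict_rate=0.05, penalty_cycles=15)
print(f"effective CPI: {cpi:.2f}")
```

Even at 95% accuracy, mispredicts add 0.15 cycles per instruction in this example, which is more than half of the base CPI. That is why deeper and wider cores invest so heavily in prediction.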

Apple’s recent cores (Firestorm, Avalanche, Everest, …) and Qualcomm’s Oryon are 8–10-wide decode/issue, larger than typical x86 cores (4–6 wide). Wider issue + more reorder buffer + huge L1 caches = more performance per clock, at the cost of area and power. Apple Silicon’s IPC lead is mostly explained by aggressive width plus huge L1 caches and SLC, not by a magic ISA advantage.

Memory hierarchy — the actual bottleneck

The chip core can switch in picoseconds; main memory is ~100ns away. The hierarchy bridges this gap:

| Level | Latency | Capacity | Bandwidth (per core) |
|---|---|---|---|
| Registers | ~1 cycle | ~256 B | terabytes/s effective |
| L1 cache | ~3–5 cycles | 32–192 KB | hundreds of GB/s |
| L2 cache | ~10–15 cycles | 256 KB – 4 MB | ~100 GB/s |
| L3 / SLC | ~30–50 cycles | 8–128 MB | shared |
| Main memory | ~80–200 ns (hundreds of cycles) | 8 GB – 1 TB | 50–1000 GB/s |
| NVMe SSD | ~100,000 cycles | TBs | 5–15 GB/s |
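To relate the nanosecond and cycle figures, and to see why caches make the gap tolerable, here is a small sketch. It converts a DRAM latency into core cycles and computes an average memory access time (AMAT). The hit latencies roughly follow the table; the miss rates and the 4 GHz clock are assumptions picked for illustration, since real values depend heavily on the workload.

```python
def ns_to_cycles(latency_ns, clock_ghz):
    """Latency in core cycles: nanoseconds times cycles per nanosecond."""
    return latency_ns * clock_ghz


def amat_cycles(levels, memory_cycles):
    """Average memory access time.

    levels: list of (hit_cycles, miss_rate) ordered from L1 outward.
    """
    penalty = memory_cycles
    for hit_cycles, miss_rate in reversed(levels):
        penalty = hit_cycles + miss_rate * penalty
    return penalty


dram_cycles = ns_to_cycles(latency_ns=100, clock_ghz=4.0)

# Assumed miss rates: 5% at L1, 30% at L2, 20% at L3.
cache_levels = [(4, 0.05), (12, 0.30), (40, 0.20)]
average = amat_cycles(cache_levels, dram_cycles)

print(f"DRAM latency: {dram_cycles:.0f} cycles")
print(f"average access time: {average:.1f} cycles")
```

With these assumed hit rates the average access costs only a few cycles even though DRAM is hundreds of cycles away. That holds only while the working set is cache-friendly, which LLM weight streaming is not.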

For LLM inference at small batch sizes (the typical local case), memory bandwidth and capacity dominate, not raw FLOPS. Decoder-only transformer inference at batch=1 has very low arithmetic intensity: ~2 FLOPs per byte loaded for the matmul-heavy decode path. A chip with 10 TFLOPS and 100 GB/s bandwidth can sustain only ~200 GFLOPS at that intensity, about 2% of peak, during token generation; the compute units mostly wait on DRAM.
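The bound in this paragraph is the standard roofline calculation: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. A minimal sketch using the figures quoted above (10 TFLOPS, 100 GB/s, 2 FLOPs per byte); the function and variable names are just for illustration.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline model: compute-bound or bandwidth-bound, whichever is lower."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


peak = 10e12        # 10 TFLOPS
bandwidth = 100e9   # 100 GB/s
intensity = 2.0     # FLOPs per byte loaded during batch=1 decode

achieved = attainable_flops(peak, bandwidth, intensity)
utilization = achieved / peak
ridge_point = peak / bandwidth  # intensity needed to become compute-bound

print(f"attainable: {achieved / 1e9:.0f} GFLOPS")
print(f"utilization of peak: {utilization:.0%}")
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
```

With these numbers the chip sustains about 200 GFLOPS, roughly 2% of peak. It would need around 100 FLOPs per byte, for example through large batches, before compute became the limit.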

This is why Apple Silicon’s unified memory architecture is a structural win for LLMs: CPU, GPU, and Neural Engine share one DRAM pool with full bandwidth. No PCIe transfer (~64 GB/s) bottleneck. M3 Max delivers ~400 GB/s; M3 Ultra ~800 GB/s; M4 Max ~546 GB/s. By comparison, an RTX 4090 has 1008 GB/s but only 24 GB capacity. It can run a quantized 13B model fast (13B at FP16 is ~26 GB and already overflows it) but cannot fit a 70B model (~140 GB at FP16) without offload, while an M2 Ultra with 192 GB of unified memory, or an M3 Ultra with up to 512 GB, can.
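The capacity side of this comparison is simple arithmetic: parameter count times bytes per weight, plus some headroom. The sketch below is a rough estimate under stated assumptions. The bytes-per-weight table ignores the per-block scales that real quantized formats carry, and the 4 GB overhead for KV cache and runtime is a placeholder, not a measured value.

```python
# Idealized bytes per weight; real quantized files are somewhat larger.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weights_gb(params_billions, precision):
    """Approximate weight footprint in GB (1e9 params x bytes / 1e9)."""
    return params_billions * BYTES_PER_WEIGHT[precision]


def fits_in_memory(params_billions, precision, capacity_gb, overhead_gb=4.0):
    """True if weights plus an assumed overhead fit in the given capacity."""
    return weights_gb(params_billions, precision) + overhead_gb <= capacity_gb


for params, precision, capacity in [
    (13, "fp16", 24),    # 26 GB of weights on a 24 GB card
    (13, "q4", 24),
    (70, "fp16", 24),
    (70, "fp16", 192),   # 140 GB of weights in 192 GB unified memory
]:
    verdict = "fits" if fits_in_memory(params, precision, capacity) else "does not fit"
    print(f"{params}B {precision} in {capacity} GB: {verdict}")
```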

Tim Dettmers and Tri Dao have written the most accessible material on this. Dao’s FlashAttention papers (2022, 2023, 2024) quantify why attention is memory-bound and how to fuse kernels to keep activations in SRAM rather than round-tripping to HBM/DRAM.
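The key numerical trick that lets attention be computed tile by tile, without materializing the full score row, is a running ("online") softmax. The sketch below shows only that rescaling idea for a single query in pure Python. It is not the FlashAttention kernel itself: there is no GPU tiling, no SRAM management, and the usual 1/sqrt(d) scaling is omitted. All data values are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Plain softmax attention for one query: needs every score at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_blockwise(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.5, 0.5], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]

full = attention_reference(q, keys, values)
tiled = attention_blockwise(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))
```

Because each block only needs the running maximum, running sum, and accumulator, the working state stays small enough to live in fast on-chip memory, which is the property the fused kernels exploit.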

Specialized accelerators

CPU cores are general-purpose, latency-optimized, branchy. Other compute classes trade differently:

  • GPU — thousands of small in-order SIMT cores, wider SIMD lanes (NVIDIA “warp” = 32 lanes; AMD “wavefront” = 32–64), and high-bandwidth memory (HBM stacks at TB/s). Optimized for throughput on dense parallel work. Started for graphics, generalized via CUDA (NVIDIA, 2007) — Jensen Huang’s bet on programmable GPUs predates the deep-learning era by years. Bill Dally (NVIDIA Chief Scientist) is the canonical voice on parallel architecture.
  • TPU (Google, designed by Norm Jouppi, Jonathan Ross et al., 2015–) — systolic arrays of multiply-accumulate units that pipeline matrix multiplication spatially. Among the highest matmul efficiency per watt of commercial chips; less flexible than GPUs.
  • NPU (Apple Neural Engine, Qualcomm Hexagon, Intel NPU, AMD XDNA) — small-scale matrix accelerators on consumer SoCs, designed for INT8/INT4 inference at low power. Excellent for fixed-graph CNN/transformer ops at single-digit watts; poor for novel ops or rapidly-changing graphs. Tooling fragmented across vendors.
  • ASICs (Groq LPU, Cerebras WSE, Tenstorrent Wormhole, Etched Sohu) — purpose-built for transformer inference at extreme efficiency. Cerebras makes an entire wafer into a single chip (~850K–900K cores on WSE-2/WSE-3, no off-chip DRAM); Groq optimizes for low-latency single-stream inference. Mostly inference-side bets.

The pattern: the workload defines the chip. CPUs for branchy logic, GPUs for arbitrary parallelism, NPUs for low-power tensor inference, ASICs for one specific architecture. There is no general “AI chip” winner; there is a portfolio.
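To make the "systolic array" idea from the TPU entry concrete, the sketch below simulates the schedule of an output-stationary array. Processing element (i, j) owns one output value, and on cycle t it consumes the operand pair whose index is k = t - i - j, which models operands arriving skewed from the left and top edges. This is a toy timing model under those assumptions, not a description of any specific TPU generation.

```python
def systolic_matmul(a, b):
    """Multiply a (m x k) by b (k x n) using a skewed systolic schedule.

    Returns the product and the number of cycles the array was active.
    """
    m, inner, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    cycles = m + n + inner - 2  # last operand pair reaches the far corner PE
    for t in range(cycles):
        for i in range(m):          # in hardware, all PEs work in parallel
            for j in range(n):
                k = t - i - j       # which operand pair reaches PE (i, j) now
                if 0 <= k < inner:
                    c[i][j] += a[i][k] * b[k][j]
    return c, cycles


a = [[1, 2, 3],
     [4, 5, 6]]
b = [[7, 8],
     [9, 10],
     [11, 12]]

product, cycles = systolic_matmul(a, b)
print(product)  # [[58, 64], [139, 154]]
print(cycles)   # 5
```

Each operand is fetched from memory once and then passed neighbor to neighbor, so the array performs m × n × k multiply-accumulates while reading only the input matrices. That data reuse is where the efficiency per watt comes from.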

The fab and its geopolitics

Designing chips ≠ making them. Fabless companies (Apple, NVIDIA, AMD, Qualcomm, MediaTek, Tesla) design and contract fabrication out. Foundries (TSMC, GlobalFoundries, Samsung Foundry) make chips for hire. IDMs (Intel, Samsung) do both.

Leading-edge fabs cost $20B+ to build. As of 2026, only TSMC, Samsung, and Intel operate at the leading edge. TSMC dominates — ~60% of overall foundry revenue, ~90% of leading-edge. ASML (Netherlands) is the only supplier of EUV lithography machines, the equipment required for sub-7nm production — a structural choke point in the entire industry.

Carver Mead and Lynn Conway’s Introduction to VLSI Systems (1980) is the textbook that made design-fab separation possible (and thus the fabless industry). Mead also coined the term “Moore’s Law.” Morris Chang founded TSMC in 1987 on the explicit thesis that pure-play foundry was a viable business model — the bet that built modern Taiwan.

Chris Miller’s Chip War (2022) is the canonical treatment of the geopolitics: geographic concentration (Taiwan, South Korea, US, Netherlands, Japan), US-China trade restrictions on EUV access, the CHIPS Act response. This is now a strategic-policy domain as much as a tech one.

ARM, RISC-V, and the open-ISA story

ARM (Sophie Wilson and Steve Furber, Acorn Computers, 1985, originally for the BBC Micro successor) became the dominant mobile ISA via licensing: ARM Holdings licenses the ISA and reference cores; licensees integrate them. Now under SoftBank (acquired 2016), publicly listed since 2023. Apple’s Mac transition to ARM (2020, M1) was the canonical proof point that ARM could match or beat x86 on laptop performance-per-watt. Apple uses an architectural license to design its own cores from scratch.

RISC-V (Krste Asanović, David Patterson et al., UC Berkeley, 2010–) is the open-ISA counterpart. No royalties and no license fees. Embedded use is widespread; high-performance use is lagging but the trajectory is steady. Tenstorrent (Jim Keller) is betting on RISC-V for AI cores. Long-term, a credible third pole.

Specific learnings for Locara

  1. Memory bandwidth is the headline LLM metric, not FLOPS. Locara’s per-device “fits + speed” estimate should be two checks: fits = model_size ≤ memory_capacity_after_overhead, and tok/s ≈ memory_bandwidth × utilization / bytes_per_token (for a dense model, bytes_per_token is roughly the size of the weights). Publish a “tok/s on Llama 3 8B Q4_K_M” table per supported device. Bandwidth-derived numbers match real-world results within ~20%.
  2. Apple Silicon’s unified memory is a structural advantage. No PCIe bottleneck, GPU sees full DRAM bandwidth, and a 96/128/192/256 GB Mac can run 70B at Q4. Locara’s Mac-first decision is correct on technical merits, not just on market posture.
  3. NPUs are not the path. Apple Neural Engine, Qualcomm Hexagon, Intel/AMD NPUs are fragmented in tooling, slow to support new ops, and great mainly for low-power steady-state inference of fixed graphs. Locara should treat them as opportunistic acceleration, not as the abstraction. GPU (Metal/CUDA/Vulkan) + CPU fallback is the durable kernel-layer split.
  4. Treat “the device” as a memory budget first. Manifest declares working-set requirement (requires.memoryGB: 12); runtime checks against sysctl hw.memsize minus a reserved overhead. Refusing to load with a clear error message beats slow-and-thrashing every time.
  5. Quantization is a bandwidth play more than a capacity play. Q4 weights are 4× smaller than FP16 → 4× less data to move per token → roughly 4× speedup on memory-bound inference (somewhat less in practice, since formats such as Q4_K_M also store per-block scales). Not free (quality drops, particularly on math/code). The Locara model manifest should pin a specific quant per device class, validated.
  6. Don’t pretend to abstract the chip away. Apps that target M-series get fused Metal kernels (MLX, llama.cpp Metal); on x86 + NVIDIA they get CUDA. The kernel layer is the runtime’s job; the framework’s job is to expose the device class honestly so apps can declare requirements and fail loud rather than fail slow.
  7. The fab roadmap is solid through ~2030. TSMC N3 (current), N2 (2026), A16 (2027), A14 (2028) are committed, with continued density and efficiency gains. Locara’s bet that consumer hardware keeps getting better assumes this trajectory, which is currently strong even as classical Moore’s Law slows.
  8. Hennessy & Patterson is the shared vocabulary. When Locara’s docs need to talk about latency, throughput, or bandwidth tradeoffs, use H&P language. Don’t invent new terms when the textbook ones exist.
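Items 1 and 4 above can be sketched as a small estimator: a capacity gate, a bandwidth-derived speed estimate, and a manifest check that fails loudly. Everything here is hypothetical. The `Device` class, function names, the 4 GB reserve, the 0.7 bandwidth utilization, and the ~4.9 GB model size are assumptions for illustration, not Locara's actual API or measured figures. On macOS the total memory figure could come from `sysctl -n hw.memsize`; the sketch takes it as a plain parameter so it stays platform-independent.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    memory_gb: float       # total RAM (or unified memory)
    bandwidth_gb_s: float  # peak memory bandwidth


def fits(device, model_gb, reserved_gb=4.0):
    """Capacity gate: model must fit after reserving OS/runtime overhead."""
    return model_gb <= device.memory_gb - reserved_gb


def estimated_tok_per_s(device, model_gb, utilization=0.7):
    """Decode streams roughly all weights once per token, so speed is
    approximately usable bandwidth divided by model size."""
    return device.bandwidth_gb_s * utilization / model_gb


def check_manifest(device, required_memory_gb, reserved_gb=4.0):
    """Refuse to load, with a clear message, instead of thrashing."""
    available = device.memory_gb - reserved_gb
    if required_memory_gb > available:
        raise MemoryError(
            f"{device.name}: app requires {required_memory_gb} GB but only "
            f"{available} GB is available after reserving {reserved_gb} GB"
        )


laptop = Device(name="hypothetical-64GB-400GBps", memory_gb=64.0, bandwidth_gb_s=400.0)
model_gb = 4.9  # assumed size of an 8B model at a 4-bit quant

if fits(laptop, model_gb):
    print(f"estimated decode speed: {estimated_tok_per_s(laptop, model_gb):.0f} tok/s")

small = Device(name="hypothetical-8GB", memory_gb=8.0, bandwidth_gb_s=100.0)
try:
    check_manifest(small, required_memory_gb=12.0)  # e.g. requires.memoryGB: 12
except MemoryError as err:
    print(f"refused: {err}")
```

The estimate is an upper-bound style heuristic: it ignores KV-cache traffic and prompt processing, so published tables should still be validated against measured runs.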

References

  • John Hennessy & David Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2017) — the canonical textbook. Both authors are Turing Award winners (2017) for RISC.
  • David Patterson & John Hennessy, Computer Organization and Design — gentler intro version, used in most undergrad systems courses.
  • Carver Mead & Lynn Conway, Introduction to VLSI Systems (1980) — origin text of the design methodology that enabled the fabless industry.
  • Chris Miller, Chip War (2022) — geopolitics, economics, and strategic history of semiconductors.
  • T.R. Reid, The Chip (1985, 2001 ed.) — the invention of the IC, Kilby vs. Noyce.
  • Jon Gertner, The Idea Factory (2012) — Bell Labs and the transistor.
  • Bill Dally, NVIDIA architecture talks (GTC keynotes, archived on YouTube).
  • Jim Keller interviews on Lex Fridman (#70, #162) — chip design philosophy from one of the field’s most respected designers (AMD K7/K8, Apple A4/A5, AMD Zen, Tesla FSD, Tenstorrent).
  • Lisa Su (AMD CEO) — public talks on CPU design and the AMD turnaround. Approachable lectures at MIT and Stanford.
  • Tim Dettmers blog (https://timdettmers.com) — practical LLM hardware reasoning.
  • Tri Dao, FlashAttention / FlashAttention-2 / FlashAttention-3 (2022, 2023, 2024) — quantifying memory-bound attention.
  • SemiAnalysis (Dylan Patel, https://semianalysis.com) — current industry analysis with deep technical credibility.
  • Fabricated Knowledge (Doug O’Laughlin) — semiconductor business and supply chain.
  • Chips and Cheese (https://chipsandcheese.com) — deep CPU microarchitecture analysis (a successor in spirit to the core deep-dives of the now-defunct AnandTech).