Open-Source vs Closed LLMs — Trends, Capability, and the Size-to-Quality Curve

What this is: The state of open-weights LLMs vs. closed/API-only LLMs as of early 2026: capability gaps, training-data and licensing differences, and the specific question of how small models keep getting smarter.
Why it matters: Locara’s whole pitch is “fully local.” That depends on open-weights models being capable enough to do useful work. The frontier-vs-open gap, the trajectory of small models, and the licensing landscape together determine what Locara apps can credibly promise.
Most relevant to Locara: Direct. Determines model selection, what app categories are feasible, and what the realistic performance ceiling is over the next 1–3 years.

The two camps

Closed / API-only

OpenAI: GPT-4 (Mar 2023) → GPT-4 Turbo → GPT-4o (May 2024, multimodal) → GPT-4.1, GPT-4.5 (2025); reasoning models o1 (Sept 2024) → o3 (Dec 2024) → o3-pro / o4 / GPT-5 (2025).
Anthropic: Claude 3 (Mar 2024) → 3.5 Sonnet (June 2024) → 3.7 Sonnet (Feb 2025) → Claude 4 family (May 2025) → Claude 4.5 → Opus 4.7 / Haiku 4.5 (early 2026 — Opus 4.7 is the model writing this note, FYI).
Google DeepMind: Gemini 1.5 (Feb 2024, 1M–2M token context) → Gemini 2.0 (Dec 2024, agentic) → Gemini 2.5 (early 2025, reasoning) → Gemini 3.0 (late 2025).
xAI: Grok 2, 3, 4 — competitive on benchmarks, less independently verified.
Mistral — hybrid: some weights open, the largest models API-only.

Open-weights

Meta Llama: Llama 1 (Feb 2023, leaked) → Llama 2 (July 2023, the watershed permissive-ish license) → Llama 3 (April 2024, 8B/70B) → 3.1 (July 2024, adds 405B) → 3.2 (Sept 2024, multimodal + 1B/3B small) → 3.3 (Dec 2024, 70B at 405B-class quality) → Llama 4 (April 2025, MoE).
Alibaba Qwen: Qwen 1 → Qwen 2 (June 2024) → Qwen 2.5 (Sept 2024: a series of 0.5B / 1.5B / 3B / 7B / 14B / 32B / 72B + Coder + Math variants) → Qwen 3 (from April 2025).
DeepSeek: V2 (May 2024, 236B MoE) → V2.5 → V3 (Dec 2024, 671B MoE, ~$5.6M training-cost claim) → R1 (Jan 2025, reasoning, MIT license) — the watershed for OSS reasoning.
Mistral: Mistral 7B, Mixtral 8x7B (Dec 2023), Mixtral 8x22B (April 2024), Mistral Small / NeMo, Codestral.
Google Gemma: Gemma 1 (Feb 2024), Gemma 2 (June 2024, 9B/27B), Gemma 3 (early 2025).
Microsoft Phi: Phi-2, Phi-3 (April 2024), Phi-3.5, Phi-4 (Dec 2024) — small-model exemplars built on synthetic + curated data.
Independent / smaller: Yi (01.AI), Cohere Command R, InternLM, DBRX (Databricks), OLMo (AI2 — fully open including data), SmolLM (Hugging Face), Falcon (TII).

The capability gap (state of play, early 2026)

Rough heuristic: open weights trail closed frontier by 6–12 months on equivalent capability, narrowing.

Key inflection points:

GPT-4 → Llama 3 70B (Mar 2023 → April 2024) — ~13 months. Llama 3 70B at original GPT-4 parity on most benchmarks.
Llama 3.3 70B (Dec 2024) at ~Llama 3.1 405B parity, via distillation and better training data. A 70B model reaching parity with a 405B model was the single most striking mid-2020s small-model result.
DeepSeek R1 (Jan 2025): open-weights reasoning model competitive with OpenAI o1 on math/code benchmarks. First serious dent in the closed-frontier monopoly on reasoning. The publicly documented training pipeline and low reported cost (~$5–6M claim, debated) caused a massive industry shock; Nvidia’s stock dropped ~17% on the news as the market re-evaluated AI capex assumptions.
Qwen 2.5 72B (Sept 2024) competitive with Claude 3.5 Sonnet on many tasks, deployable locally for free.

As of early 2026:

Best frontier closed: Claude Opus 4.7, GPT-5/o4, Gemini 3.0 — ahead on agentic/tool-use tasks, large-context coherence, and integrated multimodal.
Best open: Llama 4 family, Qwen 3, DeepSeek-R2 — within a few months on most benchmarks; at or ahead on math/code in some configurations.

The gap persists and matters for the highest-end use cases (frontier coding, deep multi-step reasoning, agentic workflows with many tool calls, very long context). For the use cases Locara cares most about (Q&A, transcription post-processing, local document analysis, personal-data tasks), open models are already good enough.

The size-to-quality curve

Two simultaneous trends:

Big-model capability frontier creeps forward — driven by more compute, better data, and reasoning RL.
Small-model capability has improved much faster than big-model capability. A modern 3B or 7B model is dramatically better than a 7B or 13B model from a year earlier. Drivers:
- Distillation from large teachers (Phi family is the textbook example).
- Better data — synthetic-data pipelines, curated/filtered web data (FineWeb, DCLM), removing low-value tokens.
- More training tokens per parameter — Chinchilla-optimal (Hoffmann et al., 2022) was ~20 tokens/param; modern small models train on 1000+ tokens/param (“over-training”) to squeeze more quality at fixed inference cost. Llama 3 8B was trained on 15T tokens.
- Quantization-aware training and better post-training quantization (PTQ) — Q4 of a strong 7B is now a real product, not a demo.
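
The over-training ratio in the tokens-per-parameter bullet is simple arithmetic. A minimal Python sketch, using only the figures quoted above (the helper name `tokens_per_param` is illustrative, not from any library):

```python
# Tokens-per-parameter arithmetic using only figures quoted in the text.
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate ratio from Hoffmann et al., 2022


def tokens_per_param(train_tokens: float, params: float) -> float:
    """Return how many training tokens were seen per model parameter."""
    return train_tokens / params


# Llama 3 8B: 15T training tokens over 8B parameters.
llama3_8b_ratio = tokens_per_param(15e12, 8e9)

# How far past the Chinchilla-optimal ratio that is.
overtraining_factor = llama3_8b_ratio / CHINCHILLA_TOKENS_PER_PARAM

print(f"Llama 3 8B: {llama3_8b_ratio:.0f} tokens/param")
print(f"Over-training factor vs Chinchilla: {overtraining_factor:.2f}x")
```

On these figures the 8B model sees 1,875 tokens per parameter, a little under 94 times the Chinchilla ratio, which is consistent with the “1000+ tokens/param” description.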

Concrete: Llama 3.2 1B and 3B are competitive with the previous year’s 7B models on many benchmarks; Qwen 2.5 3B punches above its weight; Phi-4 14B competes with general models 5–10× its size on reasoning. For consumer hardware, a ~3–7B model at Q4 is the sweet spot for most local applications, and that sweet spot is rising in capability roughly 1.5–2× per year (eyeballed, not a clean benchmark).
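
To make the “sweet spot” concrete, here is a rough weights-only memory estimate for Q4-quantized dense models. This is a sketch under a stated assumption: about 4.5 effective bits per weight for a Q4-style format (4-bit values plus per-block scale overhead). Real formats vary, and the figure excludes KV cache and runtime overhead. The function name `weights_gb` is illustrative.

```python
# Weights-only memory estimate for quantized dense models.
# ASSUMPTION: ~4.5 effective bits per weight for a Q4-style format
# (4-bit values plus per-block scale overhead). Real formats vary.
ASSUMED_Q4_BITS_PER_WEIGHT = 4.5


def weights_gb(params_billions: float,
               bits_per_weight: float = ASSUMED_Q4_BITS_PER_WEIGHT) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for size_b in (1, 3, 7, 14, 70):
    print(f"{size_b:>3}B at Q4: ~{weights_gb(size_b):.1f} GB of weights")
```

Under that assumption a 7B model needs roughly 4 GB for weights and a 70B model roughly 39 GB, which is why the 3–7B band suits typical consumer machines while 70B is limited to high-memory hardware.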

This is the strongest tailwind for local AI. The relevant question is no longer “is open good enough?” — it’s “for which tasks is open already good enough?”

What’s driving the open camp

Meta’s strategic open-weights bet. Mark Zuckerberg has been explicit: open weights commoditize the model layer, weakening competitors who depend on model differentiation, while letting Meta benefit from external improvements and ecosystem integration. Yann LeCun (Meta Chief AI Scientist; 2018 Turing Award with Hinton and Bengio) is the public-intellectual voice for open AI and the loudest skeptic of pure-LLM scaling narratives.
Chinese labs. Qwen (Alibaba), DeepSeek, Yi (01.AI), GLM (Zhipu), InternLM (Shanghai AI Lab) — collectively the most prolific open-weights publishers as of late 2024–2025. US export controls created strong incentives for Chinese AI self-sufficiency, and open-weights releases were one byproduct.
Allen Institute (AI2 OLMo): fully open, including training data and code. Less competitive on raw capability, but important for research reproducibility.
Mistral — French, open-weights for smaller models, hybrid commercial strategy.
Microsoft Phi — small-model exemplars proving that data quality beats scale for the lower bands.
Hugging Face as the distribution layer — weights, demos (Spaces), evaluation infra (Open LLM Leaderboard, lm-evaluation-harness).

What’s driving the closed camp

Frontier capability — the leading reasoning models and best multimodal models have been closed first. Training cost (~$100M+ for a frontier run) and inference cost (RL-heavy reasoning is expensive) are real moats.
Safety / alignment — Anthropic’s and OpenAI’s positioning leans on safety work being expensive and best done with revenue.
Productized integrations — ChatGPT, Claude.ai, Gemini in Google Workspace, Microsoft Copilot — the consumer surfaces are closed-API-driven.
Training data — closed labs claim training-data advantages (web crawl, partnerships, RLHF data); how true this is at the margin is debated.

Key voices

Andrej Karpathy — formerly OpenAI / Tesla. nanoGPT, llm.c, “Let’s Build GPT From Scratch” lecture series. The most accessible public technical voice on how LLMs work, and a credible voice for “the open-source gap is narrowing.”
Yann LeCun — Meta Chief AI Scientist. Loud and consistent advocate for open weights.
Geoffrey Hinton — godfather of deep learning. Nobel Prize in Physics (2024) for foundational neural-network work. Has shifted toward AI-safety advocacy in recent years.
Ilya Sutskever — OpenAI cofounder, then Safe Superintelligence (SSI) since mid-2024.
Demis Hassabis — DeepMind/Google. Nobel Prize in Chemistry (2024) for AlphaFold.
Dario Amodei (Anthropic CEO) — Machines of Loving Grace (essay, Oct 2024) is the most-cited statement of the bullish-but-cautious closed-AI worldview.
Sam Altman (OpenAI CEO) — public communicator-in-chief for the closed frontier.
Mark Zuckerberg — “Open Source AI Is the Path Forward” (Meta blog, July 2024) is the Meta-strategy primer.
Nathan Lambert (Allen AI / Interconnects newsletter) — best running analysis of open-vs-closed releases, RLHF/RLAIF.
Sebastian Raschka — practical LLM educator. Build a Large Language Model (From Scratch) book (2024).
Tri Dao — FlashAttention, Mamba; runs much of the systems-side work on faster training/inference.
Tim Dettmers — quantization pioneer (bitsandbytes, QLoRA), runs the canonical local-LLM hardware blog.
Hugo Touvron, Guillaume Lample, et al. — Llama paper authors; many have moved across labs (Meta → Mistral and others).
Soumith Chintala — PyTorch / Meta.
Jeremy Howard: fast.ai and Answer.AI; a pragmatic style of AI engineering. ULMFiT co-developer.

Trends to watch

Reasoning-model open replication. Post-DeepSeek R1, multiple open labs are racing to match/beat the closed reasoning frontier. Expect 3–6 month lag on reasoning capability rather than 12+ month.
Mixture-of-Experts (MoE) becomes default. DeepSeek V3 (671B total / 37B active), Llama 4 (MoE), Qwen MAX, Mixtral. MoE delivers “size-class capability” at a smaller active-parameter inference cost, which makes it a better fit for local hardware than an equivalent dense model, provided the total weights fit in RAM.
Small-model capability keeps rising. Phi-5, Llama 3.x small, Gemma 3 small — distillation pipelines from frontier teachers keep producing better 1–7B models. This is Locara’s most important tailwind.
Reasoning-trace efficiency. Compressing the “thinking” of reasoning models (long chains of thought) so they’re feasible on local hardware. The current approach is R1-distill style.
Agentic open models still trail closed (computer use, long tool chains). Closing slowly.
Local fine-tuning becomes mainstream. With QLoRA, Unsloth, and Apple’s MLX-LM, fine-tuning a 7–13B model on a single Mac is now real. That makes a long tail of personalization possible.
Licensing remains messy. The Llama community license is not OSI-approved, Qwen’s license is similar, and DeepSeek uses MIT. Locara’s app store has to enforce/tag license correctness on a per-model basis.

Specific learnings for Locara

Bet on open weights across all critical paths. Closed APIs are great for the capability frontier, but Locara’s pitch is local-first; the framework, runtime, and reference apps must all run on open weights.
Pin a small set of validated models per app category. Locara’s model manifest is curated. As of early 2026 the strongest defaults:
- General chat / Q&A: Qwen 2.5 7B Instruct Q4_K_M, Llama 3.2 3B Instruct, or Llama 3.3 70B Q4 for high-end Macs.
- Coding: Qwen 2.5 Coder 7B / 14B / 32B.
- Reasoning (where local can run it): R1-distill 7B / 14B; growing list.
- Vision: Qwen 2.5-VL 7B, Llama 3.2 11B Vision.
- Embeddings: bge-m3, nomic-embed-text-v1.5.
- STT: Whisper large-v3 or distil-whisper.
Track Hugging Face leaderboards but don’t trust them. Open LLM Leaderboard, LiveBench, Chatbot Arena — useful triangulation, but real product evaluation should be task-specific (e.g., “summarize this user’s documents accurately”). Build internal evals tied to Locara reference apps.
Plan for the small-model curve to keep rising. The 1–3B class will be approximately where 7B is today within ~12–18 months. Locara apps that work on a 7B today should plan for “good-enough-on-3B” within a year or two — that’s what opens up phone targets.
MoE complicates the device matrix. A 671B-total / 37B-active MoE model can’t be run on a 64 GB Mac at full-weight load even though active-param compute is reasonable. Capacity constraint is total params; speed constraint is active. Manifest must declare both.
License hygiene matters. The Llama community license has a “700M MAU” carve-out. Qwen’s commercial use is mostly fine but with caveats. Locara’s manifest must declare and propagate license; the app store must surface it to developers and end users.
The relevant comparison is “open local now” vs. “closed cloud now.” For privacy-sensitive use cases the question isn’t “is the open model as good as Claude Opus?” — it’s “is the open model good enough that not sending my data to a third party is worth it?” Increasingly: yes, for many tasks.
Don’t wrap closed APIs into Locara core. The OpenAI-API-compatible escape hatch is fine for user-supplied API keys (their data, their choice). The framework, runtime, and reference apps should not depend on closed APIs for any required path.
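
The manifest requirements in the learnings above (declare both total and active parameters, and carry the license through to the app store) can be sketched as a small record type. Everything here is hypothetical: the class, field names, license tags, the 4.5 bits-per-weight figure, and the 75% usable-RAM headroom are illustrative assumptions, not Locara’s actual manifest schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelManifestEntry:
    """Hypothetical manifest record; field names are illustrative only."""
    name: str
    total_params_b: float    # capacity constraint: all weights must be resident
    active_params_b: float   # speed constraint: parameters used per token
    bits_per_weight: float   # ASSUMED effective bits after quantization
    license_id: str          # tag surfaced to developers and end users

    @property
    def weights_gb(self) -> float:
        """Weights-only footprint in GB (1e9 bytes), driven by total params."""
        return self.total_params_b * self.bits_per_weight / 8


def fits_in_ram(entry: ModelManifestEntry, ram_gb: float,
                usable_fraction: float = 0.75) -> bool:
    """True if the weights fit in the ASSUMED usable share of system RAM."""
    return entry.weights_gb <= ram_gb * usable_fraction


# Placeholder entries; names and license tags are made up for illustration.
dense_7b = ModelManifestEntry("example-dense-7b", 7, 7, 4.5, "example-license-a")
moe_671b = ModelManifestEntry("example-moe-671b", 671, 37, 4.5, "example-license-b")

for entry in (dense_7b, moe_671b):
    verdict = "fits" if fits_in_ram(entry, ram_gb=64) else "does not fit"
    print(f"{entry.name}: ~{entry.weights_gb:.0f} GB of weights, "
          f"license={entry.license_id}, {verdict} in 64 GB")
```

The fit check deliberately uses total parameters, not active ones: the 671B-total MoE entry fails on a 64 GB machine even though its 37B active parameters would make per-token compute tractable.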

References

Yann LeCun — public talks and X/Twitter posts; A Path Towards Autonomous Machine Intelligence (position paper, 2022).
Andrej Karpathy — “Let’s Build GPT From Scratch” lecture series (YouTube, 2023); nanoGPT and llm.c repos; X posts on local-AI viability.
Mark Zuckerberg, “Open Source AI Is the Path Forward” (Meta blog, July 2024).
Dario Amodei, Machines of Loving Grace (essay, Oct 2024).
Nathan Lambert, Interconnects newsletter (https://interconnects.ai). Best running analysis of model releases.
Sebastian Raschka, Build a Large Language Model (From Scratch) (book, 2024); blog at sebastianraschka.com.
DeepSeek-V3 and DeepSeek-R1 technical reports (Dec 2024 / Jan 2025) — required reading for understanding the OSS reasoning watershed.
Llama 3 paper (Meta, 2024) — The Llama 3 Herd of Models. The most detailed published recipe for a frontier-class open model.
Qwen 2.5 technical report (Alibaba, 2024).
Phi-3 / Phi-4 technical reports (Microsoft).
Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, 2022) — the scaling-law paper that shaped the field.
Kaplan et al., Scaling Laws for Neural Language Models (OpenAI, 2020).
Hugging Face Open LLM Leaderboard, LiveBench (https://livebench.ai), Chatbot Arena (https://chat.lmsys.org).
r/LocalLLaMA — the community signal layer.