Locara

Open-Source vs Closed LLMs — Trends, Capability, and the Size-to-Quality Curve

What this is: The state of open-weights LLMs vs. closed/API-only LLMs as of early 2026, covering capability gaps, training-data and licensing differences, and the specific question of how small models keep getting smarter.

Why it matters: Locara’s whole pitch is “fully local.” That depends on open-weights models being capable enough to do useful work. The frontier-vs-open gap, the trajectory of small models, and the licensing landscape together determine what Locara apps can credibly promise.

Most relevant to Locara: Direct. Determines model selection, what app categories are feasible, and what the realistic performance ceiling is over the next 1–3 years.

The two camps

Closed / API-only

  • OpenAI: GPT-4 (Mar 2023) → GPT-4 Turbo → GPT-4o (May 2024, multimodal) → GPT-4.1, GPT-4.5 (2025); reasoning models o1 (Sept 2024) → o3 (Dec 2024) → o3-pro / o4 / GPT-5 (2025).
  • Anthropic: Claude 3 (Mar 2024) → 3.5 Sonnet (June 2024) → 3.7 Sonnet (Feb 2025) → Claude 4 family (May 2025) → Claude 4.5 → Opus 4.7 / Haiku 4.5 (early 2026 — Opus 4.7 is the model writing this note, FYI).
  • Google DeepMind: Gemini 1.5 (Feb 2024, 1M–2M token context) → Gemini 2.0 (Dec 2024, agentic) → Gemini 2.5 (early 2025, reasoning) → Gemini 3.0 (late 2025).
  • xAI: Grok 2, 3, 4 — competitive on benchmarks, less independently verified.
  • Mistral — hybrid: some weights open, the largest models API-only.

Open-weights

  • Meta Llama: Llama 1 (Feb 2023, leaked) → Llama 2 (July 2023, the watershed permissive-ish license) → Llama 3 (April 2024, 8B/70B) → 3.1 (July 2024, adds 405B) → 3.2 (Sept 2024, multimodal + 1B/3B small) → 3.3 (Dec 2024, 70B at 405B-class quality) → Llama 4 (April 2025, MoE).
  • Alibaba Qwen: Qwen 1 → Qwen 2 (June 2024) → Qwen 2.5 (Sept 2024: series of 0.5B / 1.5B / 3B / 7B / 14B / 32B / 72B + Coder + Math variants) → Qwen 3 (April 2025).
  • DeepSeek: V2 (May 2024, 236B MoE) → V2.5 → V3 (Dec 2024, 671B MoE, ~$5.6M training-cost claim) → R1 (Jan 2025, reasoning, MIT license) — the watershed for OSS reasoning.
  • Mistral: Mistral 7B, Mixtral 8x7B (Dec 2023), Mixtral 8x22B (April 2024), Mistral Small / NeMo, Codestral.
  • Google Gemma: Gemma 1 (Feb 2024), Gemma 2 (June 2024, 9B/27B), Gemma 3 (early 2025).
  • Microsoft Phi: Phi-2, Phi-3 (April 2024), Phi-3.5, Phi-4 (Dec 2024) — small-model exemplars built on synthetic + curated data.
  • Independent / smaller: Yi (01.AI), Cohere Command R, InternLM, DBRX (Databricks), OLMo (AI2 — fully open including data), SmolLM (Hugging Face), Falcon (TII).

The capability gap (state of play, early 2026)

Rough heuristic: open weights trail closed frontier by 6–12 months on equivalent capability, narrowing.

Key inflection points:

  • GPT-4 → Llama 3 70B (Mar 2023 → April 2024) — ~13 months. Llama 3 70B at original GPT-4 parity on most benchmarks.
  • Llama 3.3 70B (Dec 2024) at ~Llama 3.1 405B parity, via distillation + better training data. A 70B model reaching parity with a 405B model was the most striking single small-model result of the period.
  • DeepSeek R1 (Jan 2025): open-weights reasoning model competitive with OpenAI o1 on math/code benchmarks. First serious dent in the closed-frontier monopoly on reasoning. The training recipe was published in a technical report; the ~$5.6M cost claim comes from the V3 base model and is debated. Massive industry shock: Nvidia’s stock dropped ~17% on the news as the market re-evaluated AI capex assumptions.
  • Qwen 2.5 72B (Sept 2024) competitive with Claude 3.5 Sonnet on many tasks, deployable locally for free.
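
The lag figures behind the heuristic can be checked with plain date arithmetic. The sketch below uses only the month-level release dates quoted in this note (GPT-4, Llama 3 70B, o1, DeepSeek R1); the helper name months_between is illustrative, and the parity judgments themselves are this note's claims, not something the code verifies.

```python
from datetime import date


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later` (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


# (closed release, open release claimed to reach rough parity), month precision only.
pairs = {
    "GPT-4 -> Llama 3 70B": (date(2023, 3, 1), date(2024, 4, 1)),
    "o1 -> DeepSeek R1": (date(2024, 9, 1), date(2025, 1, 1)),
}

lags = {name: months_between(closed, opened) for name, (closed, opened) in pairs.items()}
for name, lag in lags.items():
    print(f"{name}: {lag} months")
```

On these two data points the lag shrinks from 13 months to 4, which is the direction the heuristic describes; two points are not a trend line.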

As of early 2026:

  • Best frontier closed: Claude Opus 4.7, GPT-5/o4, Gemini 3.0 — ahead on agentic/tool-use tasks, large-context coherence, and integrated multimodal.
  • Best open: Llama 4 family, Qwen 3, DeepSeek-R2 — within a few months on most benchmarks; at or ahead on math/code in some configurations.

The gap persists and matters for the highest-end use cases (frontier coding, deep multi-step reasoning, agentic workflows with many tool calls, very long context). For the use cases Locara cares most about (Q&A, transcription post-processing, local document analysis, personal-data tasks), the open models are already good enough.

The size-to-quality curve

Two simultaneous trends:

  1. Big-model capability frontier creeps forward — driven by more compute, better data, and reasoning RL.
  2. Small-model capability has improved much faster than big-model capability. A modern 3B or 7B model is dramatically better than a year-ago 7B or 13B. Drivers:
    • Distillation from large teachers (Phi family is the textbook example).
    • Better data — synthetic-data pipelines, curated/filtered web data (FineWeb, DCLM), removing low-value tokens.
    • More training tokens per parameter — Chinchilla-optimal (Hoffmann et al., 2022) was ~20 tokens/param; modern small models train on 1000+ tokens/param (“over-training”) to squeeze more quality at fixed inference cost. Llama 3 8B was trained on 15T tokens.
    • Quantization-aware training and better post-training quantization (PTQ) — Q4 of a strong 7B is now a real product, not a demo.
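
To make the over-training point concrete, here is the arithmetic for the Llama 3 8B figure quoted above. This is a minimal sketch: the function names are illustrative, and 8e9 is the nominal parameter count rather than an exact one.

```python
# Tokens-per-parameter ratio, using only figures quoted in this note.
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio (Hoffmann et al., 2022)


def tokens_per_param(train_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return train_tokens / params


def overtraining_factor(train_tokens: float, params: float) -> float:
    """How many times past the ~20 tokens/param rule of thumb a training run went."""
    return tokens_per_param(train_tokens, params) / CHINCHILLA_TOKENS_PER_PARAM


# Llama 3 8B: 15T training tokens, nominal 8B parameters.
llama3_8b_ratio = tokens_per_param(15e12, 8e9)
llama3_8b_factor = overtraining_factor(15e12, 8e9)
print(f"Llama 3 8B: {llama3_8b_ratio:.0f} tokens/param, "
      f"roughly {llama3_8b_factor:.0f}x the Chinchilla ratio")
```

The ratio comes out at 1875 tokens per parameter, which is where the "1000+" characterization comes from. The extra training compute is paid once, while the smaller model is cheaper on every inference.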

Concrete: Llama 3.2 1B and 3B are competitive with last year’s 7B on many benchmarks; Qwen 2.5 3B punches above its weight; Phi-4 14B competes with general models 5–10× its size on reasoning. For consumer hardware, the ~3–7B Q4 model is the sweet spot for most local applications, and that sweet spot is rising in capability roughly 1.5–2× per year (eyeballed, not a clean benchmark).
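
Why 3–7B at Q4 is the consumer sweet spot follows from a back-of-envelope weight-size estimate. The sketch below assumes an effective 4.5 bits per weight for 4-bit quantization (4-bit values plus scale overhead; real formats vary) and counts weights only, ignoring KV cache and runtime overhead. The function and constant names are illustrative.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


ASSUMED_Q4_BITS = 4.5  # assumption: 4-bit weights plus per-block scale overhead

for name, params in [("3B", 3e9), ("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_footprint_gb(params, 16)
    q4 = weight_footprint_gb(params, ASSUMED_Q4_BITS)
    print(f"{name}: fp16 ~{fp16:.1f} GB, Q4 ~{q4:.1f} GB")
```

Under these assumptions a 7B model drops from about 14 GB at fp16 to about 4 GB at Q4, small enough for laptops with 8–16 GB of memory, while a 70B model at Q4 still needs roughly 40 GB.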

This is the strongest tailwind for local AI. The relevant question is no longer “is open good enough?” — it’s “for which tasks is open already good enough?”

What’s driving the open camp

  • Meta’s strategic open-weights bet. Mark Zuckerberg has been explicit: open-weights commoditize the model layer, weakening competitors who depend on model differentiation, while letting Meta benefit from external improvements + ecosystem integration. Yann LeCun (Meta Chief AI Scientist; 2018 Turing Award with Hinton and Bengio) is the public-intellectual voice for open AI and the loudest skeptic of pure-LLM scaling narratives.
  • Chinese labs. Qwen (Alibaba), DeepSeek, Yi (01.AI), GLM (Zhipu), InternLM (Shanghai AI Lab) — collectively the most prolific open-weights publishers as of late 2024–2025. US export controls created strong incentives for Chinese AI self-sufficiency, and open-weights releases were one byproduct.
  • Allen Institute (AI2 OLMo) — fully-open including training data and code. Smaller-impact in raw capability, but important for research reproducibility.
  • Mistral — French, open-weights for smaller models, hybrid commercial strategy.
  • Microsoft Phi — small-model exemplars proving that data quality beats scale for the lower bands.
  • Hugging Face as the distribution layer — weights, demos (Spaces), evaluation infra (Open LLM Leaderboard, lm-evaluation-harness).

What’s driving the closed camp

  • Frontier capability — the leading reasoning models and best multimodal models have been closed first. Training cost (~$100M+ for a frontier run) and inference cost (RL-heavy reasoning is expensive) are real moats.
  • Safety / alignment — Anthropic’s and OpenAI’s positioning leans on safety work being expensive and best done with revenue.
  • Productized integrations — ChatGPT, Claude.ai, Gemini in Google Workspace, Microsoft Copilot — the consumer surfaces are closed-API-driven.
  • Training data — closed labs claim training-data advantages (web crawl, partnerships, RLHF data); how true this is at the margin is debated.

Key voices

  • Andrej Karpathy — formerly OpenAI / Tesla. nanoGPT, llm.c, “Let’s Build GPT From Scratch” lecture series. The most accessible public technical voice on how LLMs work, and a credible voice for “the open-source gap is narrowing.”
  • Yann LeCun — Meta Chief AI Scientist. Loud and consistent advocate for open weights.
  • Geoffrey Hinton — godfather of deep learning. Nobel Prize in Physics (2024) for foundational neural-network work. Has shifted toward AI-safety advocacy in recent years.
  • Ilya Sutskever — OpenAI cofounder, then Safe Superintelligence (SSI) since mid-2024.
  • Demis Hassabis — DeepMind/Google. Nobel Prize in Chemistry (2024) for AlphaFold.
  • Dario Amodei (Anthropic CEO): Machines of Loving Grace (essay, Oct 2024) is the most-cited statement of the bullish-but-cautious closed-AI worldview.
  • Sam Altman (OpenAI CEO) — public communicator-in-chief for the closed frontier.
  • Mark Zuckerberg — “Open Source AI Is the Path Forward” (Meta blog, July 2024) is the Meta-strategy primer.
  • Nathan Lambert (Allen AI / Interconnects newsletter) — best running analysis of open-vs-closed releases, RLHF/RLAIF.
  • Sebastian Raschka — practical LLM educator. Build a Large Language Model (From Scratch) book (2024).
  • Tri Dao — FlashAttention, Mamba; runs much of the systems-side work on faster training/inference.
  • Tim Dettmers — quantization pioneer (bitsandbytes, QLoRA), runs the canonical local-LLM hardware blog.
  • Hugo Touvron, Guillaume Lample, et al. — Llama paper authors; many have moved across labs (Meta → Mistral and others).
  • Soumith Chintala — PyTorch / Meta.
  • Jeremy Howard: fast.ai and Answer.AI style of pragmatic AI engineering. ULMFiT co-developer.

Trends to watch

  1. Reasoning-model open replication. Post-DeepSeek R1, multiple open labs are racing to match/beat the closed reasoning frontier. Expect 3–6 month lag on reasoning capability rather than 12+ month.
  2. Mixture-of-Experts (MoE) becomes default. DeepSeek V3 (671B total / 37B active), Llama 4 (MoE), Qwen MAX, Mixtral. MoE delivers “size-class capability” with smaller active-parameter inference cost, which makes it a better fit for local hardware than an equivalent dense model, provided the total weights fit in RAM.
  3. Small-model capability keeps rising. Phi-5, Llama 3.x small, Gemma 3 small — distillation pipelines from frontier teachers keep producing better 1–7B models. This is Locara’s most important tailwind.
  4. Reasoning-trace efficiency. Compressing the “thinking” of reasoning models (long chains of thought) so they’re feasible on local hardware. The current approach is R1-distill style: distilling a large reasoning model’s traces into smaller models.
  5. Agentic open models still trail closed (computer use, long tool chains). Closing slowly.
  6. Local fine-tuning becomes mainstream. With QLoRA, Unsloth, and Apple’s MLX-LM, fine-tuning a 7–13B model on a single Mac is now real. A long tail of personalization becomes possible.
  7. Licensing remains messy. Llama community license (not OSI-approved), Qwen’s similar, DeepSeek MIT. Locara’s app store has to enforce/tag license correctness on a per-model basis.

Specific learnings for Locara

  1. Bet on open weights across all critical paths. Closed APIs are great for the capability frontier, but Locara’s pitch is local-first; the framework, runtime, and reference apps must all run on open weights.
  2. Pin a small set of validated models per app category. Locara’s model manifest is curated. As of early 2026 the strongest defaults:
    • General chat / Q&A: Qwen 2.5 7B Instruct Q4_K_M, Llama 3.2 3B Instruct, or Llama 3.3 70B Q4 for high-end Macs.
    • Coding: Qwen 2.5 Coder 7B / 14B / 32B.
    • Reasoning (where local can run it): R1-distill 7B / 14B; growing list.
    • Vision: Qwen 2.5-VL 7B, Llama 3.2 11B Vision.
    • Embeddings: bge-m3, nomic-embed-text-v1.5.
    • STT: Whisper large-v3 or distil-whisper.
  3. Track Hugging Face leaderboards but don’t trust them. Open LLM Leaderboard, LiveBench, Chatbot Arena — useful triangulation, but real product evaluation should be task-specific (e.g., “summarize this user’s documents accurately”). Build internal evals tied to Locara reference apps.
  4. Plan for the small-model curve to keep rising. The 1–3B class will be approximately where 7B is today within ~12–18 months. Locara apps that work on a 7B today should plan for “good-enough-on-3B” within a year or two — that’s what opens up phone targets.
  5. MoE complicates the device matrix. A 671B-total / 37B-active MoE model can’t be run on a 64 GB Mac at full-weight load even though active-param compute is reasonable. Capacity constraint is total params; speed constraint is active. Manifest must declare both.
  6. License hygiene matters. The Llama community license has a “700M MAU” carve-out: companies above 700 million monthly active users must request a separate license from Meta. Qwen’s commercial use is mostly fine but with caveats. Locara’s manifest must declare and propagate license; the app store must surface it to developers and end users.
  7. The relevant comparison is “open local now” vs. “closed cloud now.” For privacy-sensitive use cases the question isn’t “is the open model as good as Claude Opus?” — it’s “is the open model good enough that not sending my data to a third party is worth it?” Increasingly: yes, for many tasks.
  8. Don’t wrap closed APIs into Locara core. The OpenAI-API-compatible escape hatch is fine for user-supplied API keys (their data, their choice). The framework, runtime, and reference apps should not depend on closed APIs for any required path.
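
Points 5 and 6 both put requirements on the model manifest: it must declare total and active parameters separately, and it must carry a license. A minimal sketch of such an entry follows. Every name here (ModelManifestEntry, the field names, the 0.75 usable-RAM headroom, the 4.5 bits-per-weight figure, the example license strings) is a hypothetical illustration, not Locara's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelManifestEntry:
    """Hypothetical manifest entry; field names are illustrative only."""
    name: str
    total_params: float     # governs memory capacity (all experts must be resident)
    active_params: float    # governs per-token compute (equals total for dense models)
    bits_per_weight: float  # effective bits after quantization
    license_id: str         # placeholder identifier, surfaced to developers and users

    def weights_gb(self) -> float:
        """Size of all weights in GB, based on total (not active) parameters."""
        return self.total_params * self.bits_per_weight / 8 / 1e9

    def fits_in(self, ram_gb: float, headroom: float = 0.75) -> bool:
        """True if the weights fit in the assumed usable fraction of RAM."""
        return self.weights_gb() <= ram_gb * headroom


def validate(entry: ModelManifestEntry) -> list[str]:
    """Return a list of problems; an empty list means the entry is well-formed."""
    problems = []
    if not entry.license_id:
        problems.append("license must be declared")
    if entry.active_params > entry.total_params:
        problems.append("active_params cannot exceed total_params")
    return problems


# Illustrative entries: a 671B-total / 37B-active MoE and a dense 7B, both at ~4.5 bits.
moe = ModelManifestEntry("example-moe", 671e9, 37e9, 4.5, "example-license-a")
dense = ModelManifestEntry("example-dense-7b", 7e9, 7e9, 4.5, "example-license-b")

for entry in (moe, dense):
    print(entry.name, f"{entry.weights_gb():.1f} GB",
          "fits 64 GB" if entry.fits_in(64) else "does not fit 64 GB",
          validate(entry))
```

Even under the assumed 4-bit quantization the MoE example needs several hundred GB for weights, so it fails the 64 GB check despite its modest active-parameter count. Keeping both numbers in the manifest lets the runtime answer "will it load?" and "how fast will it run?" separately.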

References

  • Yann LeCun — public talks and X/Twitter posts; A Path Towards Autonomous Machine Intelligence (position paper, 2022).
  • Andrej Karpathy — “Let’s Build GPT From Scratch” lecture series (YouTube, 2023); nanoGPT and llm.c repos; X posts on local-AI viability.
  • Mark Zuckerberg, “Open Source AI Is the Path Forward” (Meta blog, July 2024).
  • Dario Amodei, Machines of Loving Grace (essay, Oct 2024).
  • Nathan Lambert, Interconnects newsletter (https://interconnects.ai). Best running analysis of model releases.
  • Sebastian Raschka, Build a Large Language Model (From Scratch) (book, 2024); blog at sebastianraschka.com.
  • DeepSeek-V3 and DeepSeek-R1 technical reports (Dec 2024 / Jan 2025) — required reading for understanding the OSS reasoning watershed.
  • Llama 3 paper (Meta, 2024) — The Llama 3 Herd of Models. The most detailed published recipe for a frontier-class open model.
  • Qwen 2.5 technical report (Alibaba, 2024).
  • Phi-3 / Phi-4 technical reports (Microsoft).
  • Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, 2022) — the scaling-law paper that shaped the field.
  • Kaplan et al., Scaling Laws for Neural Language Models (OpenAI, 2020).
  • Hugging Face Open LLM Leaderboard, LiveBench (https://livebench.ai), Chatbot Arena (https://chat.lmsys.org).
  • r/LocalLLaMA — the community signal layer.