HF Breakout Models, Jun 15–22: GLM-5.2 and VibeThinker-3B
June 22, 2026 · 9:32 AM

Two MIT-licensed LLM breakouts this week: GLM-5.2 (753B, ~41×, frontier agentic coding) and VibeThinker-3B (3B, ~32×, 96.1% LeetCode).

Two models crossed the 10× download-growth bar this week (Jun 15–22), and both ship under MIT with no commercial restrictions. One has 753B parameters and set a new benchmark ceiling for open-weight LLMs. The other has 3B parameters and rivals models more than 200 times its size on competitive math. There were no image, audio, or multimodal breakouts this week.

Quick scan

| Model | Org | Params (total / active) | License | Growth | Latest downloads | Modality |
| --- | --- | --- | --- | --- | --- | --- |
| GLM-5.2 | Z.ai (formerly Zhipu AI) | 753B / ~40B active | MIT | ~41× in 4 days | 27.4k (Jun 21) [1] | LLM (text only) |
| VibeThinker-3B | WeiboAI | 3B | MIT | ~32× in 3 days | 32.4k (Jun 22) [2] | LLM (reasoning) |

LLMs

GLM-5.2 — 753B MoE, MIT, frontier coding and long-horizon agents

Z.ai (formerly Zhipu AI) released GLM-5.2 on June 16 under an MIT license: 753B total parameters, ~40B active per token, 1M token context, text-only input. [3] Downloads went from 666 (Jun 17) to 27.4k (Jun 21), roughly 41× in four days. [1] The GitHub repo gained 286 stars in a single day and entered Trending. [4]
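The growth multiple quoted above is simple to reproduce from the two download counts. The sketch below is only an arithmetic check; the helper names are illustrative, and the per-day factor assumes constant compounding over the four days, which the source does not claim.

```python
def growth_multiple(start: float, end: float) -> float:
    """Ratio of the later download count to the earlier one."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


def daily_factor(multiple: float, days: int) -> float:
    """Constant per-day factor that compounds to `multiple` over `days` days.

    Assumption: smooth compounding, used here only as a rough intuition aid.
    """
    return multiple ** (1 / days)


# Figures from the article: 666 downloads (Jun 17) -> 27.4k (Jun 21).
glm_multiple = growth_multiple(666, 27_400)
glm_daily = daily_factor(glm_multiple, days=4)

print(f"growth multiple: {glm_multiple:.1f}x")
print(f"implied daily factor: {glm_daily:.2f}x")
```

The same two helpers can be pointed at any pair of daily counts to check whether a model clears the 10× bar.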
What it does. The model is designed for agentic coding and long-horizon task execution: multi-hour autonomous runs, multi-file refactors, terminal operations, and MCP tool use. It is not a vision model; Z.ai's vision product is a separate closed-source line. Simon Willison described it as "probably the most powerful text-only open-weight LLM." [5] Jeremy Howard (fast.ai, via Latent Space) called it "at least as good as Opus 4.8 and GPT 5.5," with the absence of vision as the main gap. [4]
Benchmarks. On standard coding benchmarks GLM-5.2 leads all public open-weight models: SWE-bench Pro 62.1 (vs GPT-5.5 58.6), Terminal-Bench 2.1 81.0 (Cline called it the first open-weight model to cross 80% on that harness), and MCP-Atlas 76.8 (near Claude Opus 4.8's 77.8). [3]
Figure: GLM-5.2 performance across 8 coding and reasoning benchmarks. GLM-5.2 versus GLM-5.1, Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro across SWE-bench Pro, Terminal-Bench 2.1, NL2Repo, DeepSWE, ProgramBench, MCP-Atlas, Tool-Decathlon, and HLE. [3]
The clearest gains over GPT-5.5 come in long-horizon work, meaning tasks that require maintaining a coherent strategy over hours, not seconds: [3]
  • FrontierSWE Dominance: 74.4% (vs GPT-5.5 72.6%; trails Claude Opus 4.8 at 75.1%)
  • PostTrainBench: 34.3% (vs GPT-5.5 28.4%)
  • SWE-Marathon: 13.0% (vs GPT-5.5 12.0%)
It also topped Artificial Analysis' Intelligence Index v4.1 among open-weight models with a score of 51, ahead of MiniMax-M3 (44) and DeepSeek V4 Pro (44). [5] On the Design Arena it ranks #1 with Elo 1360, ahead of Claude Fable 5. On Code Arena (WebDev / Frontend) it sits at #2. [3]
The headline caution: GLM-5.2 trails Claude Opus 4.8 on SWE-bench Pro (62.1 vs 69.2), and some r/LocalLLaMA users find Kimi K2.7 better for their specific coding workflows. [6] Also worth knowing: Max inference mode roughly doubles token consumption versus High mode, so latency-sensitive pipelines should benchmark both. [3]
Architecture: IndexShare. The key efficiency innovation is IndexShare: every four transformer layers share a single lightweight indexer, so only the first layer in each group computes the full sparse-attention top-k; the other three reuse those indices. [7] At 1M token context, this cuts per-token FLOPs by 2.9×, making long-context inference affordable in practice rather than only nominally supported. [3]
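The index-reuse pattern described above can be illustrated with a toy sketch. This is not Z.ai's implementation: the scoring function, the layer loop, and every name below are hypothetical, and the real indexer and attention kernels are not described in the source. The only detail taken from the article is that each group of four layers computes the top-k once and the other three layers reuse it.

```python
from typing import Callable, List, Sequence, Tuple

GROUP_SIZE = 4  # from the article: every four layers share one indexer


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_tokens_per_layer(
    num_layers: int,
    score_fn: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Return the token indices each layer attends to and the indexer call count.

    `score_fn(layer)` stands in for the lightweight indexer (hypothetical).
    """
    per_layer: List[List[int]] = []
    indexer_calls = 0
    shared: List[int] = []
    for layer in range(num_layers):
        if layer % GROUP_SIZE == 0:
            # First layer of the group: run the indexer and pick the top-k.
            shared = top_k_indices(score_fn(layer), k)
            indexer_calls += 1
        # The remaining layers in the group reuse the same indices.
        per_layer.append(shared)
    return per_layer, indexer_calls


def toy_scores(layer: int) -> List[int]:
    """Deterministic toy relevance scores over 8 cached tokens."""
    return [(token * 7 + layer * 3) % 11 for token in range(8)]


indices_per_layer, calls = select_tokens_per_layer(8, toy_scores, k=3)
print(calls, indices_per_layer)
```

With eight layers the indexer runs twice instead of eight times, which is the source of the savings; how that translates into the reported 2.9× FLOP reduction depends on details the article does not give.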
License. MIT, confirmed across the Z.ai blog, HF model card, VentureBeat, and Latent Space. You can freely download, fine-tune, run commercially, and redistribute, with no regional restrictions. One social post claimed Apache 2.0; that is incorrect. [8] The timing matters: Claude Fable 5 went offline in some regions due to US export controls during the same week GLM-5.2 launched, which likely accelerated adoption. [4]
Deployment options.
| Path | Details |
| --- | --- |
| Z.ai API | $1.40 / $4.40 per 1M input / output tokens; cached input $0.26/1M [3] |
| OpenRouter | 19 providers, effective ~$0.98 / $3.08 per 1M tokens (with caching) [9] |
| Z.ai Coding Plan | Lite $12.60/mo, Pro $50.40/mo, Max $112/mo (annual); works with Claude Code, Cline, Kilo Code [8] |
| GGUF / local | Unsloth quants: UD-IQ1_M, UD-IQ2_M, UD-Q8_K_XL; llama.cpp compatible [10] |
| Inference frameworks | vLLM v0.23.0+, SGLang v0.5.13.post1+, KTransformers v0.5.12+, MLX (confirmed on Mac Studio M3 Ultra 192GB) [10] |
Hardware reality: FP8 full weights need ~744–890GB; 4-bit Q4_K_M needs ~476–500GB; 2-bit Q2_K_XL needs ~241–280GB; 1-bit dynamic needs ~176–180GB. [6] One r/LocalLLaMA user running a 5090 + 3090 Ti with UD-IQ1_M achieved 579 t/s prefill at 8K context and ~10.6 t/s decode. [11] There is no direct ollama pull entry; load the GGUF manually.
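To get a feel for what those local throughput numbers mean in wall-clock terms, here is a rough latency sketch. It assumes "8K context" means an 8,192-token prompt and that prefill and decode rates stay constant, neither of which the source states; the function name and the 1,000-token reply length are illustrative.

```python
PREFILL_TOKENS_PER_S = 579.0  # reported prefill rate (UD-IQ1_M, 5090 + 3090 Ti)
DECODE_TOKENS_PER_S = 10.6    # reported decode rate, same setup


def request_latency_s(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) under constant-rate assumptions."""
    prefill = prompt_tokens / PREFILL_TOKENS_PER_S
    decode = output_tokens / DECODE_TOKENS_PER_S
    return prefill, decode


# Assumed workload: 8,192-token prompt, 1,000-token reply.
prefill_s, decode_s = request_latency_s(8_192, 1_000)
print(f"prefill ~{prefill_s:.0f}s, decode ~{decode_s:.0f}s")
```

On these assumptions decode dominates: usable for batch or overnight jobs, slow for interactive coding.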
Cost comparison against closed models (per 1M output tokens): GLM-5.2 $4.40, Claude Opus 4.8 $25.00, GPT-5.5 $30.00, Claude Fable 5 $50.00. [8] On Artificial Analysis' AA-Briefcase benchmark (real multi-step knowledge work tasks), GLM-5.2 cost $2.40/task versus Fable 5's $31.00/task. [5]
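The price gaps can be turned into multiples with a few lines of arithmetic. The sketch below uses only the list prices and per-task costs quoted above; the dictionary and function names are illustrative, and real bills depend on input tokens, caching, and inference mode, which are ignored here.

```python
# Output-token list prices in USD per 1M tokens, as quoted in the article.
OUTPUT_PRICE_PER_M = {
    "GLM-5.2": 4.40,
    "Claude Opus 4.8": 25.00,
    "GPT-5.5": 30.00,
    "Claude Fable 5": 50.00,
}


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Cost of generating `output_tokens` tokens at list price."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


def price_multiple(model: str, baseline: str = "GLM-5.2") -> float:
    """How many times more expensive `model` is than `baseline` on output."""
    return OUTPUT_PRICE_PER_M[model] / OUTPUT_PRICE_PER_M[baseline]


multiples = {name: price_multiple(name) for name in OUTPUT_PRICE_PER_M}
briefcase_multiple = 31.00 / 2.40  # AA-Briefcase per-task cost, Fable 5 vs GLM-5.2

for name, value in multiples.items():
    print(f"{name}: {value:.1f}x GLM-5.2 output price")
print(f"AA-Briefcase per-task: {briefcase_multiple:.1f}x")
```

On output list price the closed models come out between roughly 6× and 11× more expensive, and the per-task gap on AA-Briefcase is somewhat wider than the per-token gap.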
Builder angle. If you're building an agentic coding product, autonomous code review pipeline, or multi-step terminal workflow, GLM-5.2 is the most capable MIT-licensed model available for those tasks today. API pricing ($0.98–$1.40/M input, $3.08–$4.40/M output) is roughly 6× cheaper than Claude Opus 4.8 and 11× cheaper than Claude Fable 5 on list-price output tokens, at comparable performance tiers. The Anthropic-compatible Z.ai API means Claude Code and other Anthropic-format tools can route to it with one environment variable change. The absence of vision is a real constraint for frontend screenshot analysis or UI-aware coding agents, so plan around it.

VibeThinker-3B — 3B, MIT, competitive programming reasoning

WeiboAI (Weibo's AI lab) released VibeThinker-3B on June 19 under MIT: a 3B model built on Qwen2.5-3B-Instruct and trained for verifiable reasoning tasks, namely competitive programming and olympiad math. [2] Downloads hit 32.4k by June 22, making it the week's most-downloaded new model. [2]
What it does. VibeThinker-3B is not a general-purpose assistant; it is optimized specifically for competitive programming and mathematical reasoning. The model card explicitly states that it is not recommended for agent-based programming or tool-calling tasks. Within that narrow domain the results are striking: a 96.1% acceptance rate across 8 LeetCode contest rounds (123/128 problems, first attempt), competitive with models hundreds of times larger. [2]
Figure: VibeThinker-3B vs larger models on IMO-AnswerBench, plotted against parameter scale. VibeThinker-3B (3B, 1×) scores 76.4% on IMO-AnswerBench, outperforming OpenReasoning-Nemotron (7B, 2.3×) at 60.6% and MiniMax M2.7 (229B, 76.3×) at 66.3%. DeepSeek V3.2 (671B, 223.7×) scores 78.3%. [2]
Selected benchmark results: [2]
  • IMO-AnswerBench (International Mathematical Olympiad answer benchmark): 76.4% (80.6% with CLR verification), competing with DeepSeek V3.2 (78.3%, 671B) and GLM-5 (82.5%, 744B)
  • LeetCode contests (Apr 25–May 31, 2026): 123/128 = 96.1% first-attempt acceptance
  • AIME 2026: 94.3
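The acceptance rate and the parameter-scale multiples quoted above follow directly from the raw figures. The sketch below recomputes them as a sanity check; the data structure is illustrative, and the multiples use total parameter counts relative to 3B, as the figure caption does.

```python
BASE_PARAMS_B = 3  # VibeThinker-3B, in billions of parameters

# (total parameters in billions, IMO-AnswerBench score in %), from the article.
MODELS = {
    "VibeThinker-3B": (3, 76.4),
    "OpenReasoning-Nemotron": (7, 60.6),
    "MiniMax M2.7": (229, 66.3),
    "DeepSeek V3.2": (671, 78.3),
}

# Size of each model relative to VibeThinker-3B.
scale = {name: params / BASE_PARAMS_B for name, (params, _) in MODELS.items()}

# Score gap to VibeThinker-3B in percentage points (positive = VibeThinker ahead).
gap = {name: MODELS["VibeThinker-3B"][1] - score for name, (_, score) in MODELS.items()}

# LeetCode contests: 123 of 128 problems accepted on the first attempt.
acceptance_pct = 123 / 128 * 100

for name in MODELS:
    print(f"{name}: {scale[name]:.1f}x params, gap {gap[name]:+.1f} pts")
print(f"LeetCode acceptance: {acceptance_pct:.1f}%")
```

Read this way, the only model in the comparison that beats VibeThinker-3B is about 224 times larger and leads by under two points.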
Training method. A 4-stage pipeline called SSP (Self-driven Staged Pipeline): (1) Curriculum SFT on progressively harder problems, (2) Multi-domain RL using MGPO (a multi-goal policy optimization method) across diverse reasoning domains, (3) Offline Self-Distillation from stronger teacher models, (4) Instruct RL to align outputs. Technical report: arXiv:2606.16140. [2]
Deployment. vLLM 0.10.1+, SGLang 0.4.9.post6+, transformers ≥ 4.54.0; Ollama-compatible via community GGUF (prithivMLmods/VibeThinker-3B-GGUF has 31.1k downloads). [2] At 3B parameters, this runs on a single consumer GPU: a 24GB card handles it comfortably in FP16, and quantized versions fit on 8GB.
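The GPU sizing claim can be checked with a back-of-the-envelope estimate of weight memory. This sketch assumes exactly 3e9 parameters and nominal bit widths, and it counts weights only; KV cache, activations, and quantization metadata add to the total, so treat the results as lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


VIBETHINKER_PARAMS = 3e9  # nominal parameter count (assumption: exactly 3B)

fp16_gb = weight_memory_gb(VIBETHINKER_PARAMS, 16)
int8_gb = weight_memory_gb(VIBETHINKER_PARAMS, 8)
int4_gb = weight_memory_gb(VIBETHINKER_PARAMS, 4)

# Memory left on a 24GB card for KV cache and activations when running FP16.
headroom_24gb = 24 - fp16_gb

print(f"FP16 {fp16_gb:.1f} GB, 8-bit {int8_gb:.1f} GB, 4-bit {int4_gb:.1f} GB")
print(f"headroom on a 24GB card at FP16: {headroom_24gb:.1f} GB")
```

On these assumptions FP16 weights take about a quarter of a 24GB card, and 4-bit weights leave most of an 8GB card free for long reasoning traces.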
Builder angle. The competitive use case is narrow but real: if your product scores or evaluates code (automated contest judging, interview screening, coding tutoring platforms), or you need a lightweight math reasoning component (adaptive problem generation, hint systems, solution verification), VibeThinker-3B delivers near-frontier accuracy at a fraction of the inference cost. A 3B model costs roughly 100× less to run than a 671B model at equivalent throughput. [12] Do not use it as a general coding assistant or agent: it was not trained for tool-calling or multi-step execution, and performance outside its training domain is untested.

On the radar

Laguna M.1 (poolside, Apache 2.0) dropped open weights this week: 225B total / 23B active MoE, 70 layers, 262K context. SWE-bench Verified 74.6%, SWE-bench Pro 49.2%. Downloads were 2,708 as of June 22, which shows solid interest but no viral momentum. [13] That is below the 10× breakout threshold this week; worth revisiting next week if growth continues.
Huawei openPangu 2.0 is confirmed for a June 30 open-source release: Pro (505B/18B) and Flash (92B/6B) variants, up to 512K context, announced at the Huawei Developer Conference (Jun 12–13). Weights are not yet on Hugging Face as of this writing. Watch for it in next week's issue.

Shape of the week

Two breakouts, both MIT, both coding-focused, at opposite ends of the size spectrum. GLM-5.2 is the stronger story: it pushes the open-weight frontier on long-horizon agentic coding while pricing API access at roughly one-sixth what closed frontier models charge. VibeThinker-3B makes a narrower but concrete case — a 3B model with the right training curriculum hitting 96.1% LeetCode acceptance at a hundredth of the inference cost of frontier-scale models.
One structural note for builders: both models are text-only. The absence of vision input in GLM-5.2 in particular has been the single most consistent community criticism this week. For frontend-facing products or any workflow that involves screenshots and UI, you still need a separate vision model in the stack.
Cover image: AI-generated.
