


1/3
June 12, 2026 · 4:25 PM
TheoremBench: Classical Math Exposes New Gaps in Lean 4 Provers
TheoremBench (arXiv:2606.09450, Jun 8 2026) introduces a new Lean 4 benchmark built from ~100 classical theorems (Wiedijk list), expanded into 1,142 instances in two formats: plain-main and premised (with explicit supporting subtheorems). Key result: explicit premises produce a 5× pass@64 lift for DeepSeek-Prover-V2-7B (5.3% → 26.3%) but zero gain for the non-reasoning SFT baseline. New finding: provers write 8–16× more tokens than reference proofs — brute-force tactic traces, not compact proof plans. Lean 4 kernel verification throughout. Single-lab evaluation; no SOTA agentic systems (LEAP, Goedel-Architect full scale) tested.
Gallery
A new Lean 4 benchmark from Skoltech, HSE University, AIRI, and Sberbank targets the regime between competition-style sprints and real-world Lean projects — classical mathematical theorems. The results reveal that current specialized provers struggle even more than competition numbers suggest.
What happened
TheoremBench (arXiv:2606.09450, submitted June 8 2026) introduces a Lean 4 benchmark built from ~100 classical theorems drawn from Freek Wiedijk's canonical list, expanded into 1,142 instances across two complementary dataset views:
- Plain-main: one standalone target theorem per instance
- Premised: each theorem expanded into a structured group including the main theorem plus automatically extracted supporting subtheorems, with prior results exposed as explicit premise binders
How it works
The benchmark construction pipeline parses raw Lean 4 formalizations of classical results, reconstructs required compilation context for each extracted instance, and verifies every instance against the Lean 4 kernel before inclusion. The "premised" format converts relevant prior Lean results from surrounding developments into explicit binder assumptions in theorem declarations — producing self-contained snippets that can be solved without reconstructing the full file context.
Four provers were evaluated: DeepSeek-Prover-V2-7B, Goedel-Prover-V2-8B, Kimina-Prover-Distill-8B, and Goedel-Prover-SFT (non-reasoning baseline). All candidates were sampled up to k=64, with every attempt checked by Lean 4.
New metrics introduced:
- Theorem-level coverage: fraction of supporting subtheorems proved within a parent theorem group
- Token-efficiency: ratio of generated proof tokens to ground-truth proof tokens (measures verbosity)
Key results
Loading stats card…
| Model | Plain-main pass@64 | Premised pass@64 | Median token-efficiency |
|---|---|---|---|
| DeepSeek-Prover-V2-7B | 5.3% | 26.3% | 7.8× |
| Goedel-Prover-V2-8B | 5.3% | 12.3% | 16× |
| Kimina-Prover-Distill-8B | 5.3% | 7.0% | 3.6× |
| Goedel-Prover-SFT | 3.5% | 3.5% | 1.44× |
Explicit premises produce a 5× lift for DeepSeek-Prover-V2-7B (5.3% → 26.3%) but essentially no gain for the non-reasoning SFT model — confirming the benefit is model-dependent, not mechanical.
Token-efficiency ratios are striking: Goedel-Prover-V2-8B generates proofs with a median 16× the token count of the reference proof. These are valid Lean proofs, but they are padded, inefficient tactic traces rather than compact proof plans.
Claim audit
| Dimension | Assessment |
|---|---|
| Verification | Lean 4 kernel throughout — machine-checkable, no human review substitution |
| Benchmark type | Classical theorems (Wiedijk list) — structural departure from miniF2F / PutnamBench competition style |
| Autonomy | Fully automated evaluation; no human-assisted proof attempts |
| Coverage | ~100 parent theorems → 1,142 instances; algebra, number theory, analysis, topology, combinatorics, probability |
| Independence | Single lab evaluation (Skoltech/HSE/AIRI/Sberbank) — no external replication yet reported |
| Key limitation | Models tested are smaller parameter variants (7B–8B); frontier agentic systems (LEAP, Goedel-Architect at full scale) not evaluated |
Context
TheoremBench sits between two already-covered benchmarks on this channel:
- MiniF2F / PutnamBench: competition problems, typically self-contained, Goedel-Architect already hits 99.2% and 88.8% respectively
- SorryDB: real-world sorry-closure from live Lean 4 projects, where Kimina hits only 1.0% pass@1
TheoremBench fills the gap: classical mathematical developments with dependency structure, not adversarially crafted competition problems and not messy real-world sorry holes. The premised variant mirrors how a real Lean development is structured — you have intermediate lemmas, and a prover that cannot exploit them is navigating blind.
The token-efficiency finding adds a new diagnostic axis: passing is necessary but not sufficient. Provers that write 16× the reference proof length are using brute-force tactic enumeration, not genuine proof understanding.
Sources
- arXiv preprint: https://arxiv.org/abs/2606.09450
- HTML version (figures): https://arxiv.org/html/2606.09450v1




Comments
Sign in to comment.