Measure the loop
The Benchmark for Recursive AI Improvement
Leaderboard
Score vs. cost under the same evaluation protocol. Each connected sweep shows how much autonomous improvement a model family produced as reasoning effort changes. 100% = a perfect predictor.
| # | Model | Improvement | Raw BPB | Cost / run | % / $ | Seeds |
|---|---|---|---|---|---|---|
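The page does not say how the Improvement and % / $ columns are derived. As a rough sketch under stated assumptions only: if Improvement is the fraction of a baseline's bits per byte (BPB) that a run removes, then a perfect predictor (0 BPB) scores 100%, and % / $ is that figure divided by the cost of the run. The function names, the baseline parameter, and the averaging over seeds are all hypothetical, not ARI's published scoring.

```python
from statistics import mean


def improvement_pct(baseline_bpb: float, achieved_bpb: float) -> float:
    """Percentage of the baseline's bits per byte removed by a run.

    Assumption: a perfect predictor reaches 0 BPB and so maps to 100%.
    """
    if baseline_bpb <= 0:
        raise ValueError("baseline_bpb must be positive")
    return 100.0 * (baseline_bpb - achieved_bpb) / baseline_bpb


def pct_per_dollar(improvement: float, cost_usd: float) -> float:
    """Improvement percentage bought per dollar of run cost."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return improvement / cost_usd


def mean_improvement_pct(baseline_bpb: float, seed_bpbs: list[float]) -> float:
    """Average the improvement over several seeds (assumed aggregation)."""
    return mean(improvement_pct(baseline_bpb, bpb) for bpb in seed_bpbs)


if __name__ == "__main__":
    # Made-up numbers for illustration; not real leaderboard data.
    gain = improvement_pct(baseline_bpb=2.0, achieved_bpb=1.5)
    print(gain, pct_per_dollar(gain, cost_usd=5.0))
```

Under this reading, a cheaper run with a smaller gain can still rank higher on % / $ than an expensive run with a larger one, which is the trade-off a score-vs-cost view is meant to expose.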
Why ARI matters
Human-level benchmarks ask whether AI can do the work. ARI asks whether AI can accelerate the creation of better AI.
The next step after human parity
Once systems can reason, code, and research at expert level, the important question becomes whether they can improve the systems behind those capabilities.
Measuring the rate of self-improvement
ARI focuses on the recursive cycle: inspect a starting system, find a better path, implement it, test it, and produce measurable improvement.
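The cycle described above can be pictured as a simple accept-if-better loop. This is a toy illustration, not ARI's harness: the `propose` and `score` callables, the numeric "system", and the lower-is-better score are all assumptions chosen so the sketch runs on its own.

```python
import random
from typing import Callable


def improve(
    start: float,
    propose: Callable[[float, random.Random], float],
    score: Callable[[float], float],
    steps: int,
    seed: int = 0,
) -> tuple[float, list[float]]:
    """Run an inspect, propose, test, keep-if-better loop.

    Returns the best system found and the best score after each step.
    Lower scores are treated as better (as with bits per byte).
    """
    rng = random.Random(seed)
    best, best_score = start, score(start)  # inspect the starting system
    history = [best_score]
    for _ in range(steps):
        candidate = propose(best, rng)      # find and implement a change
        candidate_score = score(candidate)  # test it
        if candidate_score < best_score:    # keep only measured gains
            best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, history


if __name__ == "__main__":
    # Toy stand-ins: the "system" is one number and the score is its
    # squared distance from 3.0.
    best, history = improve(
        start=0.0,
        propose=lambda x, rng: x + rng.uniform(-1.0, 1.0),
        score=lambda x: (x - 3.0) ** 2,
        steps=200,
    )
    print(best, history[0], history[-1])
```

Because a candidate is kept only when its measured score improves, the recorded history never gets worse, which mirrors the emphasis on verified gains rather than attempted changes.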
Designed for superintelligence
Every result comes from the same fixed protocol and private scoring target, so progress is comparable without turning the benchmark into a pure scale contest.
Built for frontier competition
What ARI rewards
Autonomous model-design judgment: choosing experiments, repairing failures, improving training behavior, and shipping a better system under fixed rules.
- Understand a constrained starting point
- Discover and execute useful improvements
- Convert iteration into verified score gains