Measure the loop

The Benchmark for Recursive AI Improvement

Leaderboard

Score vs. cost under the same evaluation protocol. Each connected sweep traces one model family across reasoning-effort settings, showing how much autonomous improvement it produced at each cost. A score of 100% corresponds to a perfect predictor.

Models ranked by recursive improvement score.
# | Model | Improvement | Raw BPB | Cost / run | % / $ | Seeds
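
As a rough illustration of how the leaderboard columns could relate to one another, the sketch below derives an improvement percentage from raw bits per byte (BPB) and divides it by run cost to get a "% / $" figure. The page does not publish its scoring formula, so everything here is an assumption: the normalization (fraction of a baseline BPB removed, so that a BPB of 0 maps to 100%), the baseline value, and the function names `improvement_pct` and `pct_per_dollar` are all hypothetical.

```python
def improvement_pct(baseline_bpb: float, final_bpb: float) -> float:
    """Share of the baseline BPB that was removed, as a percentage.

    ASSUMPTION: this normalization is a guess, chosen only because it maps
    a perfect predictor (0 bits per byte) to 100%. It is not the
    benchmark's published formula.
    """
    if baseline_bpb <= 0:
        raise ValueError("baseline_bpb must be positive")
    return 100.0 * (baseline_bpb - final_bpb) / baseline_bpb


def pct_per_dollar(improvement: float, cost_usd: float) -> float:
    """Improvement points bought per dollar of run cost (the '% / $' column)."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return improvement / cost_usd


# Hypothetical numbers, for illustration only.
gain = improvement_pct(baseline_bpb=1.2, final_bpb=0.9)
efficiency = pct_per_dollar(gain, cost_usd=5.0)
```

Under this assumed normalization, a model that is cheap per run can rank well on "% / $" while ranking lower on raw improvement, which is why the two columns are worth reading together.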

Why ARI matters

Human-level benchmarks ask whether AI can do the work. ARI asks whether AI can accelerate the creation of better AI.

01

The next step after human parity

Once systems can reason, code, and research at expert level, the important question becomes whether they can improve the systems behind those capabilities.

02

Measuring the rate of self-improvement

ARI focuses on the recursive cycle: inspect a starting system, find a better path, implement it, test it, and produce measurable improvement.
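
The cycle described above can be sketched as a simple accept-if-better loop. This is a toy stand-in, not the benchmark's harness: the "system" is a single number, the loss function replaces a real training-and-evaluation run, and the names `improvement_loop`, `toy_propose`, and `toy_loss` are hypothetical.

```python
import random
from typing import Callable, List, Tuple


def improvement_loop(
    start: float,
    propose: Callable[[float], float],
    evaluate: Callable[[float], float],
    budget: int,
) -> Tuple[float, float, List[float]]:
    """Hill-climbing stand-in for: inspect, propose, implement, test, keep."""
    best = start
    best_loss = evaluate(best)          # inspect the starting system
    history = [best_loss]
    for _ in range(budget):
        candidate = propose(best)       # find and implement a change
        loss = evaluate(candidate)      # test it
        if loss < best_loss:            # keep only measured improvements
            best, best_loss = candidate, loss
        history.append(best_loss)       # best loss so far after each iteration
    return best, best_loss, history


# Toy problem (assumption): lower loss is better, and the optimum is x = 3.
rng = random.Random(0)                  # fixed seed for repeatable runs


def toy_loss(x: float) -> float:
    return (x - 3.0) ** 2


def toy_propose(x: float) -> float:
    return x + rng.uniform(-0.5, 0.5)   # small random edit to the current best


best, best_loss, history = improvement_loop(0.0, toy_propose, toy_loss, budget=50)
```

Because a candidate is kept only when its measured loss drops, `history` never increases. The difference between its first and last entries is the kind of verified gain the benchmark describes.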

03

Designed for superintelligence

Every result comes from the same fixed protocol and private scoring target, so progress is comparable without turning the benchmark into a pure scale contest.

Built for frontier competition

What ARI rewards

Autonomous model-design judgment: choosing experiments, repairing failures, improving training behavior, and shipping a better system under fixed rules.

  • Understand a constrained starting point
  • Discover and execute useful improvements
  • Convert iteration into verified score gains