04
Agent Lab
Same task. One variable. Measured results. A multi-dimensional sweep of agentic configuration space.
1,736
Experiments run
11
Models tested
23
Config axes
$0.68
Avg cost/run
What is this?
Same task, same codebase, hundreds of configurations, each rerun until its confidence intervals are small. The frontier keeps moving, so the research is evergreen: every time models change or new tools ship, we run the sweep again.
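As a rough illustration of "run until confidence intervals are small", here is a minimal sketch of a stopping rule. It assumes a normal-approximation 95% interval on the mean score and a hypothetical target half-width; the function names and the threshold are illustrative and are not taken from the actual pipeline.

```python
from math import sqrt
from statistics import stdev


def ci_half_width(scores, z=1.96):
    """Half-width of an approximate 95% CI on the mean (normal approximation)."""
    if len(scores) < 2:
        # A single run gives no variance estimate, so the interval is unbounded.
        return float("inf")
    return z * stdev(scores) / sqrt(len(scores))


def needs_more_runs(scores, target_half_width):
    """True while the interval is still wider than the (hypothetical) target."""
    return ci_half_width(scores) > target_half_width


# Hypothetical composite scores from five repeats of one configuration.
scores = [70.0, 72.0, 68.0, 71.0, 69.0]
print(round(ci_half_width(scores), 2))
print(needs_more_runs(scores, target_half_width=2.0))
```

With few repeats, a Student's t quantile would be a more conservative choice than the fixed `z` used here.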
> Config axes
Model, effort level, prompt style, tools, linters, sub-agents, context, token budget, and more.
> Tasks
Tetris (full game build with gameplay verification, code analysis, and SonarQube scoring).
> Metrics
Gameplay correctness, code quality, structural integrity, cost, speed, and an overall composite score.
> Insights
Which variables move the needle?
Tornado chart showing the main effect of each config axis on outcomes,
sorted by impact size. Click any axis to see the per-value breakdown.
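One common way to get the bars of a tornado chart like this is to average the outcome per value of each axis and take the spread between the best and worst value. The sketch below assumes that definition of "main effect" and a hypothetical run record layout (plain dicts with a `score` key); the real scoring pipeline may differ.

```python
from collections import defaultdict
from statistics import mean


def main_effects(runs, axes, metric="score"):
    """Rank axes by the range of per-value mean outcomes (largest first)."""
    effects = {}
    for axis in axes:
        by_value = defaultdict(list)
        for run in runs:
            by_value[run[axis]].append(run[metric])
        # Per-value breakdown: mean outcome for each setting of this axis.
        means = {value: mean(vals) for value, vals in by_value.items()}
        spread = max(means.values()) - min(means.values())
        effects[axis] = (spread, means)
    return sorted(effects.items(), key=lambda kv: kv[1][0], reverse=True)


# Hypothetical runs over two axes; values and scores are made up.
runs = [
    {"model": "m1", "effort": "low", "score": 50.0},
    {"model": "m1", "effort": "high", "score": 60.0},
    {"model": "m2", "effort": "low", "score": 70.0},
    {"model": "m2", "effort": "high", "score": 80.0},
]
ranked = main_effects(runs, ["model", "effort"])
for axis, (spread, means) in ranked:
    print(axis, spread, means)
```

The range-of-means measure ignores interactions between axes, so it is best read as a first-pass ranking.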
> Explorer
Browse all 1,736 experiment runs
Filter by model, strategy, language, or effort level. Sort by score, cost,
or time. Expand any run to see the full configuration and all sub-scores.
> Efficiency
Cost vs. quality tradeoff
Scatter plot of every grid cell (one configuration, averaged across its runs), with the Pareto frontier
highlighted. Find the configurations that give the best results for the money.
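For readers unfamiliar with the term, a configuration is on the Pareto frontier when no other configuration is both cheaper and better. A minimal sketch of that selection follows, assuming lower cost and higher score are preferred; the `Cell` type and the sample numbers are hypothetical.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    name: str
    cost: float   # average USD per run (lower is better)
    score: float  # average composite score (higher is better)


def pareto_frontier(cells):
    """Return the cells not dominated by any cheaper-or-equal, better cell."""
    # Sort by cost ascending; on equal cost, put the higher score first.
    ordered = sorted(cells, key=lambda c: (c.cost, -c.score))
    frontier = []
    best_score = float("-inf")
    for cell in ordered:
        # Keep a cell only if it beats every cheaper cell seen so far.
        if cell.score > best_score:
            frontier.append(cell)
            best_score = cell.score
    return frontier


# Made-up example data.
cells = [
    Cell("a", 0.20, 55.0),
    Cell("b", 0.40, 70.0),
    Cell("c", 0.50, 65.0),  # dominated by b: costs more, scores less
    Cell("d", 0.90, 88.0),
    Cell("e", 1.10, 88.0),  # dominated by d: same score, higher cost
]
frontier_names = [c.name for c in pareto_frontier(cells)]
print(frontier_names)  # ['a', 'b', 'd']
```

Sorting first keeps the sweep at O(n log n), which matters little at this scale but avoids comparing every pair of cells.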
> Methodology
How the experiments work
The experiment grid, evaluation pipeline, scoring formula, and what the
agent sees when it runs. Full transparency on the benchmark design.