Agent Lab

Same task. One variable. Measured results. A multi-dimensional sweep of agentic configuration space.

1,736 experiments run

$0.68 avg cost/run

What is this?

Same task, same codebase, hundreds of configurations. Each configuration is rerun until its confidence intervals are small. The frontier keeps moving, so the research never goes stale: every time models change or new tools ship, we run the sweep again.
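As a rough illustration of "run until confidence intervals are small", here is a minimal sketch of a stopping rule. It assumes a normal-approximation 95% interval on the mean score; the function names, the `min_runs`/`max_runs` limits, and the target half-width are hypothetical and not taken from Agent Lab itself.

```python
import math
import statistics


def ci_half_width(scores, z=1.96):
    """Half-width of a normal-approximation 95% CI for the mean score."""
    if len(scores) < 2:
        return math.inf  # no variance estimate from a single run
    return z * statistics.stdev(scores) / math.sqrt(len(scores))


def run_until_confident(run_once, target_half_width, min_runs=5, max_runs=200):
    """Repeat one configuration until its CI is narrow enough.

    run_once is a hypothetical callable that executes a single
    experiment and returns a numeric score.
    """
    scores = []
    while len(scores) < max_runs:
        scores.append(run_once())
        if len(scores) >= min_runs and ci_half_width(scores) <= target_half_width:
            break
    return scores
```

The `max_runs` cap keeps a noisy configuration from consuming the budget indefinitely; the returned list can be shorter for stable configurations and longer for volatile ones.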

> Config axes

Model, effort level, prompt style, tools, linters, sub-agents, context, token budget, and more.
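To make the "one variable" idea concrete, the sketch below generates a sweep in which every configuration differs from a baseline on at most one axis. The axis names echo the list above, but the baseline, the axis values, and the `one_variable_sweep` helper are placeholders invented for illustration, not the project's real configuration.

```python
# Hypothetical baseline and axis values, for illustration only.
BASELINE = {"model": "model-a", "effort": "medium", "linters": False}

AXES = {
    "model": ["model-a", "model-b"],
    "effort": ["low", "medium", "high"],
    "linters": [False, True],
}


def one_variable_sweep(baseline, axes):
    """Return the baseline plus every config that changes exactly one axis."""
    configs = [dict(baseline)]
    for axis, values in axes.items():
        for value in values:
            if value != baseline[axis]:
                # Copy the baseline and override a single axis.
                configs.append({**baseline, axis: value})
    return configs


configs = one_variable_sweep(BASELINE, AXES)
```

Holding everything else at the baseline means any score difference between a variant and the baseline can be attributed to the single axis that changed.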

> Tasks

Tetris (full game build with gameplay verification, code analysis, and SonarQube scoring).

> Metrics

Gameplay correctness, code quality, structural integrity, cost, speed, and an overall composite score.
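One common way to build a composite from several metrics is a weighted average. The sketch below assumes every metric has already been normalized to the range 0 to 1 with higher meaning better (so cost and speed would need to be inverted first); the weights and metric keys are hypothetical, and the actual composite formula used here is not stated.

```python
def composite_score(metrics, weights):
    """Weighted average of normalized metrics (all assumed in [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * metrics[name] for name in weights) / total_weight


# Hypothetical normalized metrics and weights for a single run.
example_metrics = {"gameplay": 1.0, "code_quality": 0.5}
example_weights = {"gameplay": 1.0, "code_quality": 1.0}
example_composite = composite_score(example_metrics, example_weights)
```

Dividing by the total weight keeps the composite in the same 0 to 1 range as its inputs, whatever scale the weights use.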

> Insights

Which variables move the needle?

Browse all 1,736 experiment runs

Cost vs. quality tradeoff

How the experiments work