04

Agent Lab

Same task. One variable. Measured results. A multi-dimensional sweep of agentic configuration space.

1,736
Experiments run
11
Models tested
23
Config axes
$0.68
Avg cost/run

What is this?

Same task, same codebase, hundreds of configurations. We rerun each configuration until its confidence intervals are small. The frontier keeps moving, so the research is evergreen: every time models change or new tools ship, we run the sweep again.
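One way to picture "run until confidence intervals are small" is a sequential stopping rule. The sketch below is an illustration, not the lab's actual harness: `run_once`, the 95% normal-approximation interval, and the `max_half_width`, `min_runs`, and `max_runs` defaults are all assumptions.

```python
import math
import statistics


def run_until_confident(run_once, max_half_width=0.05, min_runs=5, max_runs=200, z=1.96):
    """Call run_once() repeatedly and stop once the CI half-width of the mean is small.

    run_once is a hypothetical callable returning one numeric score per run.
    The interval is a normal approximation (z = 1.96 for roughly 95%).
    Returns (mean, half_width, number_of_runs).
    """
    if min_runs < 2:
        raise ValueError("min_runs must be at least 2 to estimate a standard deviation")
    if max_runs < min_runs:
        raise ValueError("max_runs must be >= min_runs")

    scores = []
    half_width = float("inf")
    while len(scores) < max_runs:
        scores.append(run_once())
        if len(scores) >= min_runs:
            # Standard error of the mean, scaled to a CI half-width.
            half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
            if half_width <= max_half_width:
                break
    return statistics.fmean(scores), half_width, len(scores)
```

Noisy configurations consume more runs than stable ones under a rule like this, and `max_runs` caps the spend when a configuration never settles.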

> Config axes

Model, effort level, prompt style, tools, linters, sub-agents, context, token budget, and more.
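A "one variable" sweep over axes like these can be sketched as follows. Only the axis names come from the list above. The axis values, `BASELINE`, and the function name are hypothetical placeholders.

```python
# Hypothetical baseline configuration; the values are placeholders, not real settings.
BASELINE = {
    "model": "model-a",
    "effort": "medium",
    "prompt_style": "plain",
    "linters": False,
    "sub_agents": 0,
}

# Hypothetical candidate values per axis.
AXES = {
    "model": ["model-a", "model-b"],
    "effort": ["low", "medium", "high"],
    "prompt_style": ["plain", "structured"],
    "linters": [False, True],
    "sub_agents": [0, 2],
}


def one_variable_sweep(baseline, axes):
    """Return the baseline plus every config that differs from it on exactly one axis."""
    configs = [dict(baseline)]
    for axis, values in axes.items():
        for value in values:
            if value != baseline[axis]:
                # Copy the baseline and override a single axis.
                configs.append({**baseline, axis: value})
    return configs


if __name__ == "__main__":
    for config in one_variable_sweep(BASELINE, AXES):
        print(config)
```

Changing one axis at a time keeps each comparison attributable to a single variable. A full cross-product of every axis value would grow multiplicatively instead.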

> Tasks

Tetris (full game build with gameplay verification, code analysis, and SonarQube scoring).

> Metrics

Gameplay correctness, code quality, structural integrity, cost, speed, and an overall composite score.
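The page does not say how the composite score is computed. A weighted average of normalized metrics is one common approach, sketched here purely as an assumption: the weights, the metric keys, and the 0-to-1 normalization (with cost and speed already converted so that higher is better) are all hypothetical.

```python
# Hypothetical weights; the real weighting is not stated in the source.
WEIGHTS = {
    "gameplay": 0.4,
    "code_quality": 0.2,
    "structure": 0.2,
    "cost": 0.1,
    "speed": 0.1,
}


def composite_score(metrics, weights=WEIGHTS):
    """Weighted average of metrics that are assumed to be normalized to [0, 1].

    Cost and speed are assumed to be pre-converted so that higher is better.
    """
    missing = set(weights) - set(metrics)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total_weight


if __name__ == "__main__":
    example = {"gameplay": 1.0, "code_quality": 0.5, "structure": 0.5, "cost": 0.8, "speed": 0.6}
    print(round(composite_score(example), 3))
```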