04
Agent Lab
Sweeping the jagged frontier for the best agentic configurations: a multi-dimensional search of capability space, hunting for the setup changes that drive results and separating them from noise.
441
Experiments run
16
Config axes
3
Benchmark tasks
5
Quality metrics
How it works
Same task, same codebase, hundreds of configurations. Run until confidence intervals are small. The frontier keeps moving, so the research is evergreen. Every time models change or new tools ship, we run the sweep again.
> Config axes
Model, effort level, prompt style, tools, linters, sub-agents, context, token budget, and more.
> Tasks
Tetris (agent-friendly), REST API with auth (medium), CSV pipeline with edge cases (hard).
> Metrics
Structural quality, functional correctness, lint compliance, type safety, cost, and speed.
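The loop above — sweep every combination of config axes, rerun each until its confidence interval is tight, then rank — can be sketched roughly as follows. This is a hypothetical illustration, not the real harness: the axis names, the `run_once` scorer, and all thresholds are assumptions for the sketch.

```python
import itertools
import random
import statistics

# Hypothetical config axes (the real sweep covers 16 axes: model, effort
# level, prompt style, tools, linters, sub-agents, context, token budget, ...).
AXES = {
    "model": ["small", "large"],
    "effort": ["low", "high"],
    "prompt": ["terse", "verbose"],
}

def run_once(config, rng):
    """Stand-in for one agent run; returns a quality score in [0, 1].

    A real run would execute the task and combine the five quality metrics;
    here we fake a signal plus run-to-run noise so the sketch is runnable.
    """
    base = 0.5
    base += 0.2 if config["model"] == "large" else 0.0
    base += 0.1 if config["effort"] == "high" else 0.0
    return min(1.0, base + rng.gauss(0, 0.05))

def sweep(axes, ci_half_width=0.02, max_runs=200, seed=0):
    """Run each configuration until its 95% CI is narrow, then rank by mean."""
    rng = random.Random(seed)
    results = {}
    for values in itertools.product(*axes.values()):
        config = dict(zip(axes.keys(), values))
        scores = [run_once(config, rng) for _ in range(5)]  # minimum sample
        while len(scores) < max_runs:
            half = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
            if half <= ci_half_width:  # stop when the interval is tight
                break
            scores.append(run_once(config, rng))
        results[tuple(config.items())] = (statistics.fmean(scores), half)
    # Rank configurations by mean score, best first.
    return sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
```

Separating signal from noise is the point of the CI stopping rule: a config only counts as better once its interval no longer overlaps the alternatives.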
benchmark dashboard / results explorer