04

Agent Lab

Sweeping the jagged frontier for the best agentic configurations. A multi-dimensional search of capability space, hunting for the setup changes that drive results and separating them from noise.

441
Experiments run
16
Config axes
3
Benchmark tasks
5
Quality metrics

How it works

Same task, same codebase, hundreds of configurations. Each one runs repeatedly until its confidence intervals are small. The frontier keeps moving, so the research is evergreen: every time models change or new tools ship, we run the sweep again.
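The "run until confidence intervals are small" loop can be sketched as follows. This is a minimal illustration, not the lab's actual harness: `run_once` stands in for one full agent run that returns a score, and the 0.05 threshold and normal-approximation CI are assumptions for the sketch.

```python
import statistics

def ci_half_width(scores, z=1.96):
    """95% CI half-width of the mean, using the normal approximation."""
    if len(scores) < 2:
        return float("inf")
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return z * sem

def run_until_stable(run_once, threshold=0.05, max_runs=50):
    """Repeat one configuration until the CI on its mean score is tight,
    or until a run budget is exhausted."""
    scores = []
    while len(scores) < max_runs:
        scores.append(run_once())
        if ci_half_width(scores) < threshold:
            break
    return statistics.mean(scores), ci_half_width(scores)
```

The stopping rule is what makes config comparisons meaningful: a configuration that looks better after three noisy runs often isn't after thirty.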

> Config axes

Model, effort level, prompt style, tools, linters, sub-agents, context, token budget, and more.
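A multi-dimensional sweep over axes like these amounts to enumerating their Cartesian product. A minimal sketch, with hypothetical axis values (the real sweep covers 16 axes, so its grid is far larger):

```python
from itertools import product

# Hypothetical axis values for illustration only.
axes = {
    "model": ["model-a", "model-b"],
    "effort": ["low", "high"],
    "prompt_style": ["terse", "detailed"],
    "linter": [False, True],
}

# Cartesian product of all axis values -> one dict per configuration.
configs = [dict(zip(axes, values)) for values in product(*axes.values())]
```

In practice the full grid explodes combinatorially, which is why sweeps like this prune or subsample axes rather than run every cell.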

> Tasks

Tetris (agent-friendly), REST API with auth (medium), CSV pipeline with edge cases (hard).

> Metrics

Structural quality, functional correctness, lint compliance, type safety, cost, and speed.

benchmark dashboard / results explorer