04

Agent Lab

How the experiments work and what the scores mean.

What we do

Agent Lab runs controlled experiments across the full space of agentic configuration. Each run gives an LLM the same task with a specific combination of settings: model, effort level, prompt style, available tools, context strategy, and 18 other axes. Each valid combination is run multiple times and the results are averaged per configuration, so a single lucky or unlucky run does not decide the outcome.

The output of each run is evaluated automatically on multiple dimensions: does the generated code work, is it well-structured, does it pass linting, and how does an external quality tool rate it? The composite score combines the gameplay and SonarQube results into a single number that's comparable across runs; the other dimensions are tracked separately.

The experiment grid

The full grid has 23 configuration axes. Not all combinations are valid (some models don't support extended thinking, some tool combos are redundant), so the grid includes exclusion rules that skip impossible configurations.

| Axis | Values | What it tests |
| --- | --- | --- |
| model | 10 models across 3 providers | Raw capability differences between LLMs |
| effort | high, max | Extended thinking (max) vs standard reasoning |
| prompt_style | simple, detailed | Minimal prompt vs structured spec |
| language | typescript, javascript, unspecified | Whether specifying a language helps |
| strategy | none, plan_first, iterate, creative_validate, use_subagents, delegate, review, split_work | High-level approach instruction |
| tools (5 axes) | on/off for read, write, edit, glob, grep | Which tools the agent can use |
| linter | on, off | Whether linting feedback is available |
| playwright | off, available, instructed | Browser automation access level |
| context_file | none, provided | Pre-loaded context about the task |
| web_search | on, off | Internet access during the run |
| max_budget | low ($2), high ($10) | Token budget ceiling |
| tests_provided | none, a_few, many | Pre-written test suites |
| design_guidance | none, vague, specific | UI/UX direction level |
| architecture | none, separation, best_practices | Code structure guidance |
| error_checking | none, self_verify | Self-testing instruction |
| context_noise | clean + 14 noise levels | Irrelevant context injection |
| renderer | none, canvas, svg, dom, webgl | Rendering approach instruction |
| provider | anthropic, zai, openrouter | API provider routing |
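
To illustrate how a grid with exclusion rules might be generated, here is a minimal sketch. It is not Agent Lab's actual code: the model names are placeholders, only three of the axes are shown, and the single exclusion rule (one model lacking extended thinking) is assumed for the example.

```python
from itertools import product

# Hypothetical subset of the axes; model names are placeholders.
AXES = {
    "model": ["model_a", "model_b"],
    "effort": ["high", "max"],
    "linter": ["on", "off"],
}

# Assumed for illustration: model_b does not support extended thinking.
NO_EXTENDED_THINKING = {"model_b"}


def is_excluded(config):
    """Return True for combinations the grid should skip."""
    return config["effort"] == "max" and config["model"] in NO_EXTENDED_THINKING


def build_grid(axes):
    """Enumerate every combination of axis values, minus excluded ones."""
    names = list(axes)
    grid = []
    for values in product(*(axes[name] for name in names)):
        config = dict(zip(names, values))
        if not is_excluded(config):
            grid.append(config)
    return grid


grid = build_grid(AXES)
print(len(grid))  # 8 raw combinations, 2 excluded -> 6
```

Keeping the exclusion rules as a predicate over a full configuration means new rules can be added without changing how the grid is enumerated.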

The task

The current benchmark task is Tetris: build a fully playable browser-based Tetris game from scratch. This was chosen because it requires multiple capabilities simultaneously: game logic, rendering, user input handling, state management, and real-time animation. A 5-line function won't pass. The agent needs to build a complete, working application.

Future tasks (REST API with authentication, CSV pipeline with edge cases) will expand the benchmark surface, but the methodology stays the same: one task, one variable at a time, measured results.

Scoring

Each run produces a composite outcome score. The components are:

Gameplay (50% of composite)
A Playwright-based gameplay bot loads the built game, attempts to play it, and checks 26 functional criteria: does the game load, can you start it, do pieces move, do lines clear, does the score update, does game-over work, and so on. The bot actually plays the game for 30 seconds and verifies real-time behavior.
SonarQube (50% of composite)
The generated code is analyzed by SonarQube for bugs, code smells, security issues, and maintainability. The SonarQube score is derived from the quality gate results.
Structural
Basic structural checks: does index.html exist, does package.json exist, does the build succeed, does TypeScript compile without errors? Not weighted into the composite but tracked separately.
Code Quality
Static analysis: lines of code, dependency count, function length, nesting depth, naming consistency, duplication, separation of concerns. Scored independently.
Transcript
Analysis of the agent's own behavior: how many turns it took, wasted turns (documentation generation, ASCII art, unnecessary server starts), error rate, and tool usage patterns.
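
The 50/50 weighting above can be written as a short sketch. Two things here are assumptions rather than documented behavior: both components are taken to be on a 0-100 scale, and the gameplay score is taken to be the fraction of the 26 criteria that pass. The function names are hypothetical.

```python
GAMEPLAY_WEIGHT = 0.5   # from the text: gameplay is 50% of the composite
SONARQUBE_WEIGHT = 0.5  # from the text: SonarQube is 50% of the composite


def gameplay_score(criteria_passed, criteria_total=26):
    """Assumed mapping: percentage of functional criteria that passed."""
    return 100.0 * criteria_passed / criteria_total


def composite_score(gameplay, sonarqube):
    """Weighted sum of the two components, both assumed to be 0-100."""
    return GAMEPLAY_WEIGHT * gameplay + SONARQUBE_WEIGHT * sonarqube


# A run that passes 13 of 26 criteria and has a SonarQube score of 80.
composite = composite_score(gameplay_score(13), 80.0)
print(composite)  # 65.0
```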

Main effects analysis

The Insights page shows the main effect of each axis. For a given metric (score, cost, etc.), we compute the average for each cell (one unique configuration), then group cells by each axis value. The "effect" of a value is its group mean minus the grand mean. The "spread" of an axis is the range between its highest and lowest group means.

A large spread means that axis matters a lot. A small spread means changing that axis doesn't move the needle. The tornado chart sorts axes by spread so you can see at a glance which configuration choices have the biggest impact.
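
The calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the Insights page's code: the cell records, field names, and scores are made up, and the grand mean is assumed to be the mean over all cells.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-cell averages for a two-axis example.
cells = [
    {"config": {"effort": "high", "linter": "on"}, "score": 70.0},
    {"config": {"effort": "high", "linter": "off"}, "score": 60.0},
    {"config": {"effort": "max", "linter": "on"}, "score": 90.0},
    {"config": {"effort": "max", "linter": "off"}, "score": 80.0},
]


def main_effects(cells, metric="score"):
    """Per axis: effect of each value (group mean - grand mean) and spread."""
    grand_mean = mean(c[metric] for c in cells)
    groups = defaultdict(lambda: defaultdict(list))
    for c in cells:
        for axis, value in c["config"].items():
            groups[axis][value].append(c[metric])
    result = {}
    for axis, by_value in groups.items():
        group_means = {v: mean(xs) for v, xs in by_value.items()}
        result[axis] = {
            "effects": {v: m - grand_mean for v, m in group_means.items()},
            "spread": max(group_means.values()) - min(group_means.values()),
        }
    return result


effects = main_effects(cells)
# Tornado order: axes sorted by spread, largest first.
tornado = sorted(effects, key=lambda a: effects[a]["spread"], reverse=True)
print(tornado)  # ['effort', 'linter']
```

In this toy data the grand mean is 75, so `effort=max` has an effect of +10 and `linter=on` has an effect of +5; `effort` has the larger spread (20 vs 10) and sorts first.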

Efficiency frontier

The Efficiency page plots every cell on a scatter chart (one dot per unique configuration, averaged across its runs). By default, X is cost and Y is score. A cell is dominated if another cell has both a higher score and a lower cost. The Pareto frontier connects the cells that no other cell dominates.

The frontier tells you: for any given budget, these are the configurations that maximize your outcome. Or conversely: for any target score, these are the cheapest configurations that achieve it.
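
A minimal sketch of the frontier and the two ways of reading it is shown below. The cell data and function names are hypothetical. The dominance test follows the rule as stated here (strictly higher score and strictly lower cost), so under this sketch cells that tie on one axis do not dominate each other.

```python
def dominates(a, b):
    """a dominates b if a has both a higher score and a lower cost."""
    return a["score"] > b["score"] and a["cost"] < b["cost"]


def pareto_frontier(cells):
    """Cells not dominated by any other cell, ordered by cost."""
    frontier = [c for c in cells if not any(dominates(o, c) for o in cells)]
    return sorted(frontier, key=lambda c: c["cost"])


def best_within_budget(frontier, budget):
    """Highest-scoring frontier cell whose cost fits the budget, or None."""
    affordable = [c for c in frontier if c["cost"] <= budget]
    return max(affordable, key=lambda c: c["score"]) if affordable else None


def cheapest_reaching(frontier, target_score):
    """Cheapest frontier cell that reaches the target score, or None."""
    good = [c for c in frontier if c["score"] >= target_score]
    return min(good, key=lambda c: c["cost"]) if good else None


# Made-up cells: cost in dollars, score as the composite.
cells = [
    {"name": "a", "cost": 1.0, "score": 50.0},
    {"name": "b", "cost": 2.0, "score": 70.0},
    {"name": "c", "cost": 3.0, "score": 65.0},  # dominated by b
    {"name": "d", "cost": 4.0, "score": 90.0},
]

frontier = pareto_frontier(cells)
print([c["name"] for c in frontier])              # ['a', 'b', 'd']
print(best_within_budget(frontier, 3.0)["name"])  # b
print(cheapest_reaching(frontier, 80.0)["name"])  # d
```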

Limitations

Single task. The current benchmark only runs Tetris. Results may not generalize to other task types. More tasks are planned.

No interaction effects. The main effects analysis treats each axis independently. In reality, some axis combinations interact (model + strategy, effort + budget). The Explorer lets you filter to specific combinations, but the tornado chart doesn't show interactions.

Models change. These results are snapshots. When providers update their models, the numbers shift. The date range in the summary shows when the data was collected.

Low-n axes. Some axis values have very few cells (3 or fewer). Their effects are noisy and shouldn't be over-interpreted. The Explorer shows the raw run count so you can judge confidence yourself.
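
As a rough illustration of the low-n caveat, the sketch below flags axis values backed by three or fewer cells. The cell records and the function name are hypothetical; only the "3 or fewer" threshold comes from the text above.

```python
from collections import Counter

LOW_N_THRESHOLD = 3  # "3 or fewer" cells, per the limitation above


def low_n_values(cells, axis, threshold=LOW_N_THRESHOLD):
    """Return {value: cell count} for values of `axis` at or below the threshold."""
    counts = Counter(c["config"][axis] for c in cells)
    return {value: n for value, n in counts.items() if n <= threshold}


# Made-up cells: five with renderer=none, two with renderer=webgl.
cells = (
    [{"config": {"renderer": "none"}} for _ in range(5)]
    + [{"config": {"renderer": "webgl"}} for _ in range(2)]
)

flagged = low_n_values(cells, "renderer")
print(flagged)  # {'webgl': 2}
```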