Agent Lab: Methodology · Ship the Loop

What we do

Agent Lab runs controlled experiments across the full space of agentic configuration. Each run gives an LLM the same task with a specific combination of settings: model, effort level, prompt style, available tools, context strategy, and 18 other axes. Every valid combination is run multiple times so that its average reflects more than a single lucky or unlucky run.

The output of each run is evaluated automatically on multiple dimensions: does the generated code work, is it well-structured, does it pass linting, and how does an external quality tool rate it? The composite score combines the gameplay and SonarQube results into a single number that is comparable across runs; the other dimensions are tracked separately.

The experiment grid

The full grid has 23 configuration axes. Not all combinations are valid (some models don't support extended thinking, some tool combos are redundant), so the grid includes exclusion rules that skip impossible configurations.

Axis	Values	What it tests
model	10 models across 3 providers	Raw capability differences between LLMs
effort	high, max	Extended thinking (max) vs standard reasoning
prompt_style	simple, detailed	Minimal prompt vs structured spec
language	typescript, javascript, unspecified	Whether specifying a language helps
strategy	none, plan_first, iterate, creative_validate, use_subagents, delegate, review, split_work	High-level approach instruction
tools (5 axes)	on/off for read, write, edit, glob, grep	Which tools the agent can use
linter	on, off	Whether linting feedback is available
playwright	off, available, instructed	Browser automation access level
context_file	none, provided	Pre-loaded context about the task
web_search	on, off	Internet access during the run
max_budget	low ($2), high ($10)	Budget ceiling per run (USD)
tests_provided	none, a_few, many	Pre-written test suites
design_guidance	none, vague, specific	UI/UX direction level
architecture	none, separation, best_practices	Code structure guidance
error_checking	none, self_verify	Self-testing instruction
context_noise	clean + 14 noise levels	Irrelevant context injection
renderer	none, canvas, svg, dom, webgl	Rendering approach instruction
provider	anthropic, zai, openrouter	API provider routing
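
The grid expansion with exclusion rules can be sketched roughly as follows. This is a minimal illustration, not Agent Lab's actual implementation: the axis subset, the placeholder model names, and both exclusion rules are assumptions made up for the example.

```python
from itertools import product

# Hypothetical, heavily reduced grid: a few axes with a subset of values.
AXES = {
    "model": ["model_a", "model_b"],  # placeholder names, not real models
    "effort": ["high", "max"],
    "tool_edit": ["on", "off"],
    "tool_write": ["on", "off"],
}

# Each rule returns True when a configuration should be skipped.
# Both rules are invented for illustration.
EXCLUSION_RULES = [
    # Assumption: model_b does not support extended thinking (effort="max").
    lambda c: c["model"] == "model_b" and c["effort"] == "max",
    # Assumption: an agent with neither write nor edit cannot produce files.
    lambda c: c["tool_edit"] == "off" and c["tool_write"] == "off",
]


def expand_grid(axes, rules):
    """Yield every axis combination that no exclusion rule rejects."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        config = dict(zip(names, values))
        if not any(rule(config) for rule in rules):
            yield config


configs = list(expand_grid(AXES, EXCLUSION_RULES))
print(len(configs), "valid configurations out of 16")
```

Expressing rules as predicates over a whole configuration keeps each rule independent of the others, so adding an axis only requires adding rules that mention it.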

The task

The current benchmark task is Tetris: build a fully playable browser-based Tetris game from scratch. This was chosen because it requires multiple capabilities simultaneously: game logic, rendering, user input handling, state management, and real-time animation. A 5-line function won't pass. The agent needs to build a complete, working application.

Future tasks (REST API with authentication, CSV pipeline with edge cases) will expand the benchmark surface, but the methodology stays the same: one task, one variable at a time, measured results.

Scoring

Each run produces a composite outcome score built from two equally weighted components, plus three metrics that are tracked separately:

Gameplay (50% of composite): A Playwright-based gameplay bot loads the built game, attempts to play it, and checks 26 functional criteria: does the game load, can you start it, do pieces move, do lines clear, does the score update, does game-over work, and so on. The bot actually plays the game for 30 seconds and verifies real-time behavior.
SonarQube (50% of composite): The generated code is analyzed by SonarQube for bugs, code smells, security issues, and maintainability. The SonarQube score is derived from the quality gate results.
Structural: Basic checks that index.html exists, package.json exists, the build succeeds, and TypeScript compiles without errors. Not weighted into the composite but tracked separately.
Code Quality: Static analysis of lines of code, dependency count, function length, nesting depth, naming consistency, duplication, and separation of concerns. Scored independently.
Transcript: Analysis of the agent's own behavior: how many turns it took, wasted turns (documentation generation, ASCII art, unnecessary server starts), error rate, and tool usage patterns.
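
The 50/50 weighting above can be written as a short sketch. Two things here are assumptions rather than documented behavior: that the gameplay component is the fraction of the 26 criteria passed, and that the SonarQube component is already normalized to the range 0 to 1. The function and constant names are hypothetical.

```python
GAMEPLAY_WEIGHT = 0.5
SONARQUBE_WEIGHT = 0.5
GAMEPLAY_CRITERIA = 26  # number of functional criteria checked by the bot


def composite_score(criteria_passed: int, sonarqube_score: float) -> float:
    """Combine gameplay and SonarQube results into one number in [0, 1].

    Assumes gameplay is scored as the fraction of criteria passed and that
    sonarqube_score is already normalized to [0, 1].
    """
    if not 0 <= criteria_passed <= GAMEPLAY_CRITERIA:
        raise ValueError("criteria_passed must be between 0 and 26")
    if not 0.0 <= sonarqube_score <= 1.0:
        raise ValueError("sonarqube_score must be between 0 and 1")
    gameplay = criteria_passed / GAMEPLAY_CRITERIA
    return GAMEPLAY_WEIGHT * gameplay + SONARQUBE_WEIGHT * sonarqube_score


# A run that passes half the gameplay criteria with a perfect SonarQube score.
print(composite_score(13, 1.0))
```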

Main effects analysis

The Insights page shows the main effect of each axis. For a given metric (score, cost, etc.), we compute the per-cell average, then group cells by each axis value. The "effect" of a value is its group mean minus the grand mean. The "spread" of an axis is the range between its highest and lowest group means.

A large spread means that axis matters a lot. A small spread means changing that axis doesn't move the needle. The tornado chart sorts axes by spread so you can see at a glance which configuration choices have the biggest impact.
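
As a rough sketch of the computation described above: the cell data is invented, the function names are hypothetical, and the grand mean is assumed to be the unweighted mean of the per-cell averages.

```python
from collections import defaultdict
from statistics import mean


def main_effects(cells, axes, metric="score"):
    """Compute per-value effects and the spread for each axis.

    cells: list of dicts holding axis values plus a per-cell average metric.
    """
    # Assumption: the grand mean is the unweighted mean over cells.
    grand_mean = mean(cell[metric] for cell in cells)
    result = {}
    for axis in axes:
        groups = defaultdict(list)
        for cell in cells:
            groups[cell[axis]].append(cell[metric])
        group_means = {value: mean(xs) for value, xs in groups.items()}
        result[axis] = {
            # Effect of a value: its group mean minus the grand mean.
            "effects": {v: m - grand_mean for v, m in group_means.items()},
            # Spread of an axis: highest group mean minus lowest.
            "spread": max(group_means.values()) - min(group_means.values()),
        }
    return result


def tornado_order(effects):
    """Return axis names sorted by spread, largest first."""
    return sorted(effects, key=lambda axis: effects[axis]["spread"], reverse=True)


# Invented per-cell averages on a 0-100 scale.
cells = [
    {"effort": "high", "linter": "on", "score": 60},
    {"effort": "high", "linter": "off", "score": 50},
    {"effort": "max", "linter": "on", "score": 90},
    {"effort": "max", "linter": "off", "score": 80},
]

effects = main_effects(cells, ["linter", "effort"])
print(tornado_order(effects))
```

In this toy data the effort axis has a spread of 30 points and the linter axis a spread of 10, so effort sorts first in the tornado order.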

Efficiency frontier

The Efficiency page plots every cell on a scatter chart (one dot per unique configuration, averaged across its runs). By default, X is cost and Y is score. The Pareto frontier connects the cells that no other cell dominates. A cell is dominated if another cell has both a higher score and a lower cost.

The frontier tells you: for any given budget, these are the configurations that maximize your outcome. Or conversely: for any target score, these are the cheapest configurations that achieve it.
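
The dominance rule can be sketched in a few lines. This is an illustrative version using the definition stated above (strictly higher score and strictly lower cost); the cell names and numbers are invented, and the quadratic scan is chosen for clarity rather than speed.

```python
def pareto_frontier(cells):
    """Return the cells that no other cell dominates, sorted by cost.

    A cell is dominated if another cell has both a higher score
    and a lower cost.
    """
    def is_dominated(cell):
        return any(
            other["score"] > cell["score"] and other["cost"] < cell["cost"]
            for other in cells
        )

    frontier = [cell for cell in cells if not is_dominated(cell)]
    return sorted(frontier, key=lambda cell: cell["cost"])


# Invented cells: cost in dollars, score on a 0-100 scale.
cells = [
    {"name": "a", "cost": 1.0, "score": 40},
    {"name": "b", "cost": 2.0, "score": 70},
    {"name": "c", "cost": 3.0, "score": 60},  # dominated by b
    {"name": "d", "cost": 5.0, "score": 90},
]

frontier = pareto_frontier(cells)
print([cell["name"] for cell in frontier])
```

Because the comparison is strict on both axes, two cells that tie on cost or on score never dominate each other and can both appear on the frontier.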

Limitations

Single task. The current benchmark only runs Tetris. Results may not generalize to other task types. More tasks are planned.

No interaction effects. The main effects analysis treats each axis independently. In reality, some axis combinations interact (model + strategy, effort + budget). The Explorer lets you filter to specific combinations, but the tornado chart doesn't show interactions.

Models change. These results are snapshots. When providers update their models, the numbers shift. The date range in the summary shows when the data was collected.

Low-n axes. Some axis values have very few cells (3 or fewer). Their effects are noisy and shouldn't be over-interpreted. The Explorer shows the raw run count so you can judge confidence yourself.