What we do
Agent Lab runs controlled experiments across the full space of agentic configuration. Each run gives an LLM the same task with a specific combination of settings: model, effort level, prompt style, available tools, context strategy, and 18 other axes. Every combination is run multiple times, so each configuration is judged on its average across runs rather than on a single, possibly lucky, attempt.
The output of each run is evaluated automatically on multiple dimensions: does the generated code work, is it well-structured, does it pass linting, and how does an external quality tool rate it? The composite score reduces the run to a single number that is comparable across runs.
The experiment grid
The full grid has 23 configuration axes. Not all combinations are valid (some models don't support extended thinking, some tool combos are redundant), so the grid includes exclusion rules that skip impossible configurations.
| Axis | Values | What it tests |
|---|---|---|
| model | 10 models across 3 providers | Raw capability differences between LLMs |
| effort | high, max | Extended thinking (max) vs standard reasoning |
| prompt_style | simple, detailed | Minimal prompt vs structured spec |
| language | typescript, javascript, unspecified | Whether specifying a language helps |
| strategy | none, plan_first, iterate, creative_validate, use_subagents, delegate, review, split_work | High-level approach instruction |
| tools (5 axes) | on/off for read, write, edit, glob, grep | Which tools the agent can use |
| linter | on, off | Whether linting feedback is available |
| playwright | off, available, instructed | Browser automation access level |
| context_file | none, provided | Pre-loaded context about the task |
| web_search | on, off | Internet access during the run |
| max_budget | low ($2), high ($10) | Spending ceiling for the run |
| tests_provided | none, a_few, many | Pre-written test suites |
| design_guidance | none, vague, specific | UI/UX direction level |
| architecture | none, separation, best_practices | Code structure guidance |
| error_checking | none, self_verify | Self-testing instruction |
| context_noise | clean + 14 noise levels | Irrelevant context injection |
| renderer | none, canvas, svg, dom, webgl | Rendering approach instruction |
| provider | anthropic, zai, openrouter | API provider routing |
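The exclusion rules mentioned above can be pictured as predicates applied to a Cartesian product of axis values. The following is a minimal sketch, not Agent Lab's actual implementation: the four-axis subset, the model names, and both rules are made up for illustration.

```python
from itertools import product

# Hypothetical subset of axes; the real grid has many more.
AXES = {
    "model": ["model_a", "model_b"],
    "effort": ["high", "max"],
    "read": ["on", "off"],
    "grep": ["on", "off"],
}

# Assumption for illustration: model_b does not support extended thinking.
NO_EXTENDED_THINKING = {"model_b"}


def is_excluded(cell):
    """Return True if this combination should be skipped."""
    # Rule 1: effort=max requires a model with extended thinking.
    if cell["effort"] == "max" and cell["model"] in NO_EXTENDED_THINKING:
        return True
    # Rule 2 (invented): treat grep without read as a redundant tool combo.
    if cell["grep"] == "on" and cell["read"] == "off":
        return True
    return False


def build_grid(axes):
    """Enumerate every combination of axis values, dropping excluded ones."""
    names = list(axes)
    cells = (dict(zip(names, combo)) for combo in product(*axes.values()))
    return [cell for cell in cells if not is_excluded(cell)]


grid = build_grid(AXES)
print(len(grid), "valid cells out of", 2 ** len(AXES))
```

Keeping the rules as plain predicates means a new exclusion is one more `if` branch, and the full product never has to be stored in memory.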
The task
The current benchmark task is Tetris: build a fully playable browser-based Tetris game from scratch. This was chosen because it requires multiple capabilities simultaneously: game logic, rendering, user input handling, state management, and real-time animation. A 5-line function won't pass. The agent needs to build a complete, working application.
Future tasks (REST API with authentication, CSV pipeline with edge cases) will expand the benchmark surface, but the methodology stays the same: one task, one variable at a time, measured results.
Scoring
Each run produces a composite outcome score. The components are:
- Gameplay (50% of composite). A Playwright-based gameplay bot loads the built game, attempts to play it, and checks 26 functional criteria: does the game load, can you start it, do pieces move, do lines clear, does the score update, does game-over work, and so on. The bot actually plays the game for 30 seconds and verifies real-time behavior.
- SonarQube (50% of composite). The generated code is analyzed by SonarQube for bugs, code smells, security issues, and maintainability. The SonarQube score is derived from the quality gate results.
- Structural. Basic structural checks: does index.html exist, does package.json exist, does the build succeed, does TypeScript compile without errors? Not weighted into the composite but tracked separately.
- Code Quality. Static analysis: lines of code, dependency count, function length, nesting depth, naming consistency, duplication, separation of concerns. Scored independently.
- Transcript. Analysis of the agent's own behavior: how many turns it took, wasted turns (documentation generation, ASCII art, unnecessary server starts), error rate, and tool usage patterns.
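The 50/50 weighting of the two composite components can be written out as a small calculation. This is a sketch under stated assumptions: both components are taken to be normalized to the range 0 to 1, the gameplay score is assumed to be the fraction of the 26 criteria passed, and the function names are hypothetical. How the SonarQube score is actually derived from the quality gate is not specified here, so it is passed in as a ready-made number.

```python
GAMEPLAY_WEIGHT = 0.5
SONARQUBE_WEIGHT = 0.5


def gameplay_score(criteria_passed, criteria_total=26):
    """Assumed normalization: fraction of functional criteria passed."""
    return criteria_passed / criteria_total


def composite_score(gameplay, sonarqube):
    """Weighted sum of the two composite components, each on a 0..1 scale."""
    return GAMEPLAY_WEIGHT * gameplay + SONARQUBE_WEIGHT * sonarqube


# A run that passes 13 of 26 gameplay criteria with a SonarQube score of 0.8.
example = composite_score(gameplay_score(13), 0.8)
print(round(example, 2))
```

The structural, code quality, and transcript metrics do not appear in this function because they are tracked separately from the composite.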
Main effects analysis
The Insights page shows the main effect of each axis. For a given metric (score, cost, etc.), we compute the per-cell average, then group cells by each axis value. The "effect" of a value is its group mean minus the grand mean. The "spread" of an axis is the range between its highest and lowest group means.
A large spread means that axis matters a lot. A small spread means changing that axis doesn't move the needle. The tornado chart sorts axes by spread so you can see at a glance which configuration choices have the biggest impact.
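The effect and spread definitions above translate directly into a few lines of code. The sketch below assumes each cell is a dictionary holding its axis values plus an already-averaged metric, and that the grand mean is taken over cells; the field names and sample numbers are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def main_effects(cells, axis, metric="score"):
    """Return ({axis value: group mean - grand mean}, spread) for one axis."""
    grand_mean = mean(cell[metric] for cell in cells)
    groups = defaultdict(list)
    for cell in cells:
        groups[cell[axis]].append(cell[metric])
    group_means = {value: mean(xs) for value, xs in groups.items()}
    effects = {value: m - grand_mean for value, m in group_means.items()}
    spread = max(group_means.values()) - min(group_means.values())
    return effects, spread


# Made-up per-cell averages for a two-axis grid.
cells = [
    {"effort": "high", "linter": "on", "score": 60.0},
    {"effort": "high", "linter": "off", "score": 50.0},
    {"effort": "max", "linter": "on", "score": 80.0},
    {"effort": "max", "linter": "off", "score": 70.0},
]

# Tornado ordering: axes sorted by spread, largest first.
tornado = sorted(
    ["effort", "linter"],
    key=lambda axis: main_effects(cells, axis)[1],
    reverse=True,
)
print(tornado)
```

In this toy data the effort axis has a spread of 20 points and the linter axis 10, so effort sorts first.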
Efficiency frontier
The Efficiency page plots every cell on a scatter chart (one dot per unique configuration, averaged across its runs). By default, X is cost and Y is score. The Pareto frontier connects the cells that are not dominated by any other cell. A cell is dominated if another cell has both a higher score and a lower cost.
The frontier tells you: for any given budget, these are the configurations that maximize your outcome. Or conversely: for any target score, these are the cheapest configurations that achieve it.
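The dominance rule can be checked with a simple pairwise comparison. The following sketch uses the strict definition given above (another cell is both cheaper and higher scoring); the cell names and numbers are invented, and the quadratic scan is chosen for clarity rather than speed.

```python
def pareto_frontier(cells):
    """Return the non-dominated cells, sorted by increasing cost."""

    def is_dominated(cell):
        # Dominated: some other cell has a higher score AND a lower cost.
        return any(
            other["score"] > cell["score"] and other["cost"] < cell["cost"]
            for other in cells
        )

    survivors = [cell for cell in cells if not is_dominated(cell)]
    return sorted(survivors, key=lambda cell: cell["cost"])


# Made-up cells: C costs more than B but scores lower, so B dominates it.
cells = [
    {"name": "A", "cost": 1.0, "score": 50.0},
    {"name": "B", "cost": 2.0, "score": 70.0},
    {"name": "C", "cost": 3.0, "score": 60.0},
    {"name": "D", "cost": 5.0, "score": 90.0},
]

frontier = pareto_frontier(cells)
print([cell["name"] for cell in frontier])
```

Reading the result left to right gives the best available score at each budget level; reading it by score gives the cheapest cell that reaches each target.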
Limitations
Single task. The current benchmark only runs Tetris. Results may not generalize to other task types. More tasks are planned.
No interaction effects. The main effects analysis treats each axis independently. In reality, some axis combinations interact (model + strategy, effort + budget). The Explorer lets you filter to specific combinations, but the tornado chart doesn't show interactions.
Models change. These results are snapshots. When providers update their models, the numbers shift. The date range in the summary shows when the data was collected.
Low-n axes. Some axis values have very few cells (3 or fewer). Their effects are noisy and shouldn't be over-interpreted. The Explorer shows the raw run count so you can judge confidence yourself.
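A low-n check of the kind described above amounts to counting cells per axis value. This is a minimal sketch; the threshold of 3 comes from the text, while the function name and sample data are hypothetical.

```python
from collections import Counter


def low_n_values(cells, axis, threshold=3):
    """Return {axis value: cell count} for values with at most `threshold` cells."""
    counts = Counter(cell[axis] for cell in cells)
    return {value: n for value, n in counts.items() if n <= threshold}


# Made-up data: five canvas cells and only two webgl cells.
cells = [{"renderer": "canvas"}] * 5 + [{"renderer": "webgl"}] * 2
print(low_n_values(cells, "renderer"))
```

Values returned by this check are the ones whose effects should be read with the most caution.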