What we do
Every paper in this survey is read in full and evaluated against a structured 50+ question checklist. Each question is a boolean: does this paper satisfy this criterion or not? No subjective scales. No "3 out of 5." Just yes or no, with a written justification citing specific sections of the paper.
The questions cover 14 categories: artifacts, statistical methodology, evaluation design, claims and evidence, setup transparency, limitations, data integrity, conflicts of interest, contamination, human studies, cost and practicality, plus conditional modules for experimental rigor, data leakage, and survey methodology.
The composite score you see on each paper is computed deterministically from these boolean counts. Category scores are the percentage of applicable questions passed within each group. Categories are grouped into four dimensions (Rigor, Transparency, Claims, Integrity), and the composite is the harmonic mean of those dimension scores.
Why
Most claims in agentic AI trace back to papers with serious methodological gaps. A model "achieves 95% on SWE-bench" and the internet runs with it. Nobody checks whether the evaluation was contaminated, whether the scaffold did the heavy lifting, or whether the same task was run enough times to get a stable number.
75% of papers in this corpus report headline numbers without error bars. 66% overclaim relative to their evidence. 54% commit open-source theater (they claim openness but don't release enough to reproduce the work). These aren't edge cases. They're the baseline.
The survey exists so you can check the primary source before making decisions based on it. Every claim gets a paper. Every paper gets a structured assessment. The scores let you sort signal from noise at scale.
What the scores mean
Each paper gets four dimension scores and one composite. The composite is the harmonic mean of the dimension scores, with a floor of 5 applied to each input so that a single zero can't collapse the whole score. The corpus median composite is around 33.
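As a rough sketch of that calculation, assuming dimension scores on a 0-100 scale (the function name `composite` and its signature are illustrative, not the survey's actual code):

```python
def composite(dimension_scores, floor=5.0):
    """Harmonic mean of the dimension scores, with each input floored first.

    Hypothetical helper for illustration: `dimension_scores` is a list of
    numbers on a 0-100 scale, `floor` is the minimum value substituted for
    any lower score before the mean is taken.
    """
    if not dimension_scores:
        raise ValueError("at least one dimension score is required")
    # Floor each input so a zero cannot produce a division by zero
    # or drag the composite all the way to zero.
    floored = [max(score, floor) for score in dimension_scores]
    # Harmonic mean: count divided by the sum of reciprocals.
    return len(floored) / sum(1.0 / score for score in floored)


lopsided = composite([80, 80, 80, 0])   # strong on three, collapsed on one
balanced = composite([40, 40, 40, 40])  # consistently mediocre
```

In this sketch the lopsided paper ends up well below the balanced one, even though its arithmetic average would be higher.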
A composite of 60+ indicates strong, balanced reporting across all four dimensions. The paper released artifacts, disclosed its setup, supported its claims with evidence, and reported limitations honestly. That's rare.
A composite of 30 means the paper is passing in some dimensions but falling short in others. The harmonic mean punishes asymmetry: a paper that's strong on three dimensions but collapsed on one will score lower than one that's consistently mediocre across all four.
Below 15 indicates critical gaps in at least one dimension. The claims might still be true, but you can't verify them from what's published.
What the scores do NOT mean
The score is not a measure of whether the paper is correct, important, or interesting. A high-scoring paper can be methodologically sound and still wrong. A low-scoring paper can contain genuine insights buried under poor reporting.
The score doesn't measure novelty, impact, or relevance to your specific problem. It measures one thing: how well did the authors support their claims with transparent, verifiable methodology?
Non-empirical papers (surveys, position papers, frameworks) are scored on a reduced set of criteria. Their scores are marked with an asterisk and shouldn't be compared directly to empirical papers.
How scanning works
The primary evaluation instrument is Claude Opus. Each paper is read in full (not just the abstract) and evaluated against the complete checklist. Every answer includes a justification citing specific sections, tables, or figures. These justifications are visible in the paper detail view so you can audit any assessment.
This is an LLM-assisted systematic review, adapted from medical research methodology (Cochrane reviews, PRISMA guidelines). The key adaptation: we're assessing methodological quality, not synthesizing effect sizes. We're not asking "does AI help?" We're asking "did this study support its claims?"
The rubric has grown over time. The original version had 50 questions across 11 categories. v2 kept all 50 unchanged and added 3 conditional modules (experimental rigor, data leakage, survey methodology) plus one more question in claims-and-evidence, for a total of 57. Earlier scans are included in aggregate statistics: their 50 questions are scored normally, and the 7 additions are treated as absent (the same way v2 papers handle conditional modules that don't apply to their paper type).
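One way to picture the "treated as absent" rule is the sketch below. It assumes each scan stores its answers as `(applies, answer)` pairs keyed by question id; the ids `q01` to `q57` and the function name `normalize_scan` are made up for illustration.

```python
def normalize_scan(scan_answers, rubric_question_ids):
    """Return an (applies, answer) pair for every question in the rubric.

    Hypothetical helper: `scan_answers` maps question id -> (applies, answer).
    """
    normalized = {}
    for qid in rubric_question_ids:
        # A question the scan never saw is marked as not applicable, so it
        # drops out of the denominator instead of counting as a failure.
        normalized[qid] = scan_answers.get(qid, (False, False))
    return normalized


# 57 hypothetical question ids standing in for the v2 rubric.
V2_QUESTION_IDS = [f"q{n:02d}" for n in range(1, 58)]

# An earlier scan that only answered the original 50 questions.
v1_scan = {qid: (True, True) for qid in V2_QUESTION_IDS[:50]}

normalized = normalize_scan(v1_scan, V2_QUESTION_IDS)
applicable = [qid for qid, (applies, _) in normalized.items() if applies]
```

Here the earlier scan still contributes 50 applicable questions, and the 7 later additions neither help nor hurt it.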
Calibration
The checklist instrument went through three calibration rounds measuring inter-rater agreement between independent evaluations.
| Round | Papers | Agreement | What changed |
|---|---|---|---|
| 1 | 8 | 93.2% | Found NA boundary confusion (56% of disagreements) and generosity bias |
| 2 | 10 | 96.2% | Added explicit NA guidance, but boundary errors still 47% of disagreements |
| 3 | 60 | 97.0% | Redesigned to two-field format (applies + answer), eliminating the conflation |
The two-field design (applies: boolean, answer: boolean) was the breakthrough. The original yes/no/na format let the evaluator confuse "the paper didn't do this" (answer: no) with "this doesn't apply to this paper type" (applies: false). Splitting into two separate decisions structurally eliminated the dominant error mode.
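A minimal sketch of what a two-field answer record could look like, and how a category pass rate falls out of it. The class name, field names beyond `applies` and `answer`, and the question ids are assumptions for illustration, not the survey's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    question_id: str         # hypothetical identifier
    applies: bool            # does this question apply to this paper type?
    answer: bool             # if it applies, did the paper satisfy it?
    justification: str = ""  # cited sections, tables, or figures


def category_pass_rate(answers):
    """Percentage of applicable questions passed, or None if none apply."""
    applicable = [a for a in answers if a.applies]
    if not applicable:
        return None
    passed = sum(1 for a in applicable if a.answer)
    return 100.0 * passed / len(applicable)


answers = [
    Answer("code_released", applies=True, answer=True),
    Answer("seeds_reported", applies=True, answer=False),
    # Not applicable: excluded from the denominator, not counted as a fail.
    Answer("irb_approval", applies=False, answer=False),
]
rate = category_pass_rate(answers)
```

Because `applies` is decided separately, a "no" can only ever mean the paper didn't do it.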
Dimension scoring
The 14 rubric categories are grouped into four dimensions. Each dimension score is the unweighted mean of the applicable category pass rates within that group:
| Dimension | Categories |
|---|---|
| Rigor | Statistical methodology, evaluation design, experimental rigor, data leakage |
| Transparency | Artifacts, setup transparency, cost and practicality |
| Claims | Claims and evidence, limitations, contamination |
| Integrity | Data integrity, conflicts of interest, human studies, survey methodology |
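The grouping in the table can be sketched as a simple lookup plus an unweighted mean. The snake_case category keys and the function name `dimension_score` are assumptions for illustration; a category with no applicable questions is represented here as missing or `None`.

```python
# Category grouping copied from the table above; key spellings are illustrative.
DIMENSIONS = {
    "Rigor": ["statistical_methodology", "evaluation_design",
              "experimental_rigor", "data_leakage"],
    "Transparency": ["artifacts", "setup_transparency", "cost_and_practicality"],
    "Claims": ["claims_and_evidence", "limitations", "contamination"],
    "Integrity": ["data_integrity", "conflicts_of_interest",
                  "human_studies", "survey_methodology"],
}


def dimension_score(dimension, category_pass_rates):
    """Unweighted mean of the applicable category pass rates (0-100).

    `category_pass_rates` maps category -> pass rate, with None (or a
    missing key) meaning no question in that category applied.
    Returns None when no category in the dimension applies.
    """
    rates = [
        category_pass_rates[category]
        for category in DIMENSIONS[dimension]
        if category_pass_rates.get(category) is not None
    ]
    if not rates:
        return None
    return sum(rates) / len(rates)


example = dimension_score(
    "Transparency",
    {"artifacts": 100.0, "setup_transparency": 50.0, "cost_and_practicality": None},
)
```

Every applicable category carries the same weight, regardless of how many questions it contains.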
The composite score is the harmonic mean of the four dimension scores, with a floor of 5 applied to each input before the calculation. The floor prevents a single zero-scoring dimension from collapsing the composite entirely while still punishing it hard.
Why harmonic mean? Because it penalizes asymmetry. A paper that scores 80 on three dimensions but 5 on the fourth should not get a 60. With the harmonic mean, it doesn't. A strong composite requires balanced strength across all applicable dimensions.
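Worked through for that example, with the arithmetic mean shown for comparison:

```latex
\[
\text{arithmetic mean} = \frac{80 + 80 + 80 + 5}{4} = 61.25,
\qquad
\text{harmonic mean} = \frac{4}{\frac{1}{80} + \frac{1}{80} + \frac{1}{80} + \frac{1}{5}}
= \frac{4}{0.2375} \approx 16.8
\]
```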
Paper types determine which dimensions apply. Empirical papers (the majority) are scored on all four. Analytical papers skip Transparency (they don't produce artifacts or experimental setups). Theoretical papers skip Rigor (they don't run experiments). The harmonic mean adjusts: it's computed over however many dimensions apply.
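A sketch of how the applicable dimensions might be selected per paper type, using Python's standard `statistics.harmonic_mean`. The lowercase type keys, the mapping name, and the function name are assumptions for illustration.

```python
from statistics import harmonic_mean

# Which dimensions count for each paper type, per the paragraph above.
APPLICABLE_DIMENSIONS = {
    "empirical": ("Rigor", "Transparency", "Claims", "Integrity"),
    "analytical": ("Rigor", "Claims", "Integrity"),         # skips Transparency
    "theoretical": ("Transparency", "Claims", "Integrity"),  # skips Rigor
}


def composite_for(paper_type, dimension_scores, floor=5.0):
    """Harmonic mean over only the dimensions that apply to this paper type."""
    dims = APPLICABLE_DIMENSIONS[paper_type]
    # Floor each applicable score before taking the harmonic mean.
    return harmonic_mean([max(dimension_scores[d], floor) for d in dims])


scores = {"Rigor": 60, "Transparency": 0, "Claims": 60, "Integrity": 60}
as_analytical = composite_for("analytical", scores)  # Transparency ignored
as_empirical = composite_for("empirical", scores)    # Transparency floored to 5
```

The same dimension scores give a very different composite depending on whether the zero belongs to a dimension that applies.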
Pair validation
The rubric is validated against 16 ordering constraints: known-bad papers (retracted, methodologically empty) must score below known-good papers (landmark, methodology-reference). All 16 hold under the current rubric.
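A check of this kind can be sketched as a list of (known-bad, known-good) pairs tested against the current scores. The paper ids and the two scores below are illustrative placeholders, not the survey's actual constraint set.

```python
def violated_constraints(scores, constraints):
    """Return every (bad, good) pair where the known-bad paper does not
    score strictly below the known-good paper."""
    return [
        (bad, good)
        for bad, good in constraints
        if not scores[bad] < scores[good]
    ]


# Hypothetical ids and composite scores, for illustration only.
scores = {"known_bad_example": 14.3, "known_good_example": 36.3}
constraints = [("known_bad_example", "known_good_example")]

violations = violated_constraints(scores, constraints)
```

An empty result means every constraint holds. A non-empty one points at specific pairs to investigate, which leads back to the rubric questions rather than to any weights.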
No target scores. No optimizer. The only way to improve a bad score is to improve the paper (or improve the rubric questions). If a score looks wrong, you don't tune weights. You examine the rubric questions and decide whether they're asking the right things.
Why no weights?
Earlier versions of the scoring system used per-category weights learned from a labeled anchor set via numeric optimization. That approach treated the rubric like a trained classifier: if a score looked wrong, you adjusted the weights until it didn't. The problem: you're tuning outputs to match expectations rather than fixing the instrument.
The current system borrows from Cochrane RoB 2, GRADE, and AMSTAR-2. These are the standard frameworks for systematic reviews in medical research, and none of them use numeric optimization. Their philosophy: the rubric is a measurement instrument, not a model. Each question either captures a real methodological property or it doesn't. If a score doesn't reflect reality, the questions are wrong, not the weights.
This means every category counts equally within its dimension, and every dimension counts equally in the composite. There are no knobs to turn. That's the point.
Calibration specimens
A handful of papers from outside agentic AI sit in the corpus as reference specimens. They're scored with the same rubric but partitioned out of the aggregates above (they're not agentic AI papers and mixing them in would skew the numbers). Their purpose is to stress-test the instrument: if the rubric can't flag a known-bad paper or credit a known-landmark one, it isn't tough enough or sensitive enough for the real corpus either.
- Score: 36.3 Mixed
Foundational transformer architecture paper. Predates the agentic AI corpus; included as a reference benchmark for the rubric. Not an example of bad methodology (unlike Wakefield); more a landmark anchor.
- Score: 14.3 Theater
Added as a reference benchmark: a famously fraudulent paper to calibrate the scan instrument against. Outside the normal inclusion scope.
Benchmarks in the corpus
These papers introduce benchmarks used by the field, so they aren't evaluated against the rubric in the same way as the rest of the corpus. A benchmark paper's job is to define a measurement instrument; the methodology rubric grades how well studies report their use of instruments. Scoring the instruments themselves mixes categories, so they're partitioned out of the aggregates and listed here instead.
- Score: 25.8 Builder
- Score: 13.4 Mixed
- Score: 21.6 Mixed
- Score: 22.1 Mixed
- Score: 49.4 Builder
Limitations
The LLM cannot verify claims against external sources. If a paper says "our code is available at github.com/example/repo" the checklist marks code_released as true, but nobody checked whether that repo actually works.
Fabricated data would not be detected. The checklist assesses transparency and methodology, not the truthfulness of reported numbers.
The LLM's training data includes many of the papers being reviewed. This creates a potential bias toward charitable interpretation of familiar work. The strict "absence of evidence is false" rule and the calibration process mitigate but do not eliminate this.
This is a purposive sample, not a random one. The goal is coverage of the most influential and most cited papers in agentic AI, not statistical representativeness of everything published. Papers are sourced from arXiv, major ML conferences, Semantic Scholar, and community channels.
Archetypes
Papers are classified into archetypes based on their category score patterns. These are descriptive labels, not value judgments.
- Complete: Passes most criteria across all categories. Rare (about 7% of scored papers).
- Builder: Strong on artifacts and setup transparency, weaker on statistical methodology. Ships code, light on rigor.
- Theater: Claims openness and rigor but fails the specific checks. Says "we release our code" without actually doing it.
- Mixed: The most common archetype. Some categories strong, others weak. No consistent pattern.
- Minimal: Fails most criteria. Often short papers, workshop papers, or papers from venues without strong review standards.