SIGINT

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

2026 · arXiv · Builder · Composite: 25.8

Rigor 40 · Transparency 50 · Claims 20 · Integrity 17

benchmark-eval

Key findings

EcoGym introduces three economic simulation environments (Vending, Freelance, Operation) for evaluating LLM long-horizon planning over 365-day horizons. An evaluation of 11 LLMs finds that no single model dominates across all scenarios, with performance gaps stemming from trade-offs between strategic prioritization and execution efficiency. Expanding the context window does not consistently improve performance, the effectiveness of memory modules is highly model- and task-dependent, and top-tier models surpass a sparsely described human baseline in the Operation environment.

Claims (6)

strong: No single model consistently achieves superior performance across all three EcoGym scenarios.

moderate: Models exhibit significant suboptimality in either high-level strategies or efficient action execution.

moderate: Expanding the context window does not yield consistent performance gains.

moderate: Memory integration generally enhances performance but is not universally beneficial, and optimal memory type is model- and task-dependent.

weak: Thinking mode catalyzes universal performance elevation across model variants.

weak: Current SOTA LLMs have achieved super-human performance in specific long-horizon economic planning scenarios.

Red flags (6)

Single-run main results for two of three environments: The Freelance and Operation main results (Table 2) come from single runs, even though the paper acknowledges 'inherent instability' in long-horizon environments. The variance analysis in the appendix covers only a subset of models.

Vague human baseline with super-human claims: The human expert comparison leaves the number of participants, their expertise qualifications, the recruitment method, and the practice time unreported. A single average DAU number (1,404) from uncharacterized 'human experts' supports the claim of 'super-human performance.'

No statistical tests for comparative claims: All claims of model superiority are based on raw number comparisons without any significance testing, despite the paper demonstrating high variance in at least one environment (Vending).

Conflict of interest (company-designed benchmark): The OPPO AI Agent Team designed EcoGym, selected the evaluation environments, and evaluated all models. The paper provides no competing interests statement and does not acknowledge benchmark-designer bias.

No limitations section: The paper has no limitations, threats to validity, or scope boundaries section. Simulated economic environments are presented as 'realistic economic settings' without discussing the gap between simulation and real-world economic dynamics.

No cost reporting for large-scale evaluation: Evaluating 11 models across 3 environments with 365-day horizons, multiple ablations, and multiple runs incurs substantial API costs, but the paper reports no cost data.
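
To make the single-run and significance-testing flags concrete, here is a minimal sketch of one way a model-versus-model comparison could be tested when several runs per model are available. It is not the paper's method or this review's scoring procedure. The two-sided permutation test is an assumed choice of test, the function name `permutation_test` is hypothetical, and the per-run scores are invented placeholders, not results from EcoGym.

```python
import random


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means of two samples."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign run scores to the two groups.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-run scores, invented for illustration only (NOT from the paper).
model_a_runs = [1200.0, 1450.0, 980.0, 1510.0, 1330.0]
model_b_runs = [1100.0, 1390.0, 1020.0, 1480.0, 1250.0]

p_value = permutation_test(model_a_runs, model_b_runs)
gap = sum(model_a_runs) / len(model_a_runs) - sum(model_b_runs) / len(model_b_runs)
print(f"mean gap = {gap:.1f}, p = {p_value:.3f}")
```

With run-to-run spread this large relative to the gap in means, the test does not separate the two placeholder models, which is the situation a raw single-run comparison cannot reveal.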

Games detected

Big Numbers No Error Bars · Overclaiming · Open Source Theater · Contamination Dodge

Dimension scores

Rigor: 40.0

Transparency: 50.0

Claims: 20.0

Integrity: 16.7

Composite: 25.8 (harmonic mean)
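
The composite can be checked against the four dimension scores. The sketch below assumes it is an unweighted harmonic mean of those four values. The page says only "harmonic mean", so equal weighting is an assumption.

```python
from statistics import harmonic_mean

# Dimension scores as listed on this page.
scores = {"Rigor": 40.0, "Transparency": 50.0, "Claims": 20.0, "Integrity": 16.7}

# Assumption: unweighted harmonic mean over the four dimensions.
composite = harmonic_mean(list(scores.values()))
arithmetic = sum(scores.values()) / len(scores)

print(f"harmonic mean:   {composite:.1f}")
print(f"arithmetic mean: {arithmetic:.1f}")
```

Under that assumption the result rounds to the reported 25.8. The harmonic mean is pulled toward the lowest scores, so the weak Claims and Integrity scores drag the composite below the arithmetic mean of the same four numbers.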

Checklist (26/61 passed)

Category scores

artifacts

statistical methodology

evaluation design: 100

claims and evidence

setup transparency: 100

limitations and scope

data integrity

conflicts of interest

contamination

human studies

cost and practicality

experimental rigor

data leakage
