SIGINT
Paper detail
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
R 40 T 50 C 20 I 17
benchmark-eval
Key findings
EcoGym introduces three economic simulation environments (Vending, Freelance, Operation) for evaluating LLM long-horizon planning over 365-day horizons. Evaluating 11 LLMs reveals no single model dominates across all scenarios, with performance gaps stemming from trade-offs between strategic prioritization and execution efficiency. Context window expansion does not consistently improve performance, memory module effectiveness is highly model- and task-dependent, and top-tier models surpass a sparsely-described human baseline in the Operation environment.
Claims (6)
strong: No single model consistently achieves superior performance across all three EcoGym scenarios.
moderate: Models exhibit significant suboptimality in either high-level strategies or efficient action execution.
moderate: Expanding the context window does not yield consistent performance gains.
moderate: Memory integration generally enhances performance but is not universally beneficial, and optimal memory type is model- and task-dependent.
weak: Thinking mode catalyzes universal performance elevation across model variants.
weak: Current SOTA LLMs have achieved super-human performance in specific long-horizon economic planning scenarios.
Red flags (6)
Single-run main results for 2/3 environments: Freelance and Operation main results (Table 2) are from single runs despite the paper acknowledging 'inherent instability' in long-horizon environments. Variance analysis in appendix is only for a subset of models.
Vague human baseline with super-human claims: The human expert comparison omits critical details; the number of participants, expertise qualifications, recruitment method, and practice time are all unreported. A single average DAU number (1,404) from uncharacterized 'human experts' is the only support for the claim of 'super-human performance.'
No statistical tests for comparative claims: All claims of model superiority are based on raw number comparisons without any significance testing, despite the paper demonstrating high variance in at least one environment (Vending).
Conflict of interest: company-designed benchmark: The OPPO AI Agent Team designed EcoGym, selected the evaluation environments, and evaluated all models. No competing interests statement is provided and no benchmark designer bias is acknowledged.
No limitations section: The paper has no limitations, threats to validity, or scope boundaries section. Simulated economic environments are presented as 'realistic economic settings' without discussing the gap between simulation and real-world economic dynamics.
No cost reporting for large-scale evaluation: Evaluating 11 models across 3 environments with 365-day horizons, multiple ablations, and multiple runs requires substantial API costs, but no cost data is reported.
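The single-run and no-significance-testing flags above point at the same gap: with several runs per model, a comparison could be backed by a simple test instead of raw numbers. The sketch below is one way to do that, a two-sided permutation test on the difference in mean score. It is not the paper's method. The function name `permutation_test` and the per-run scores are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean score.

    Returns the estimated probability of seeing a difference at least as
    large as the observed one if model labels were assigned at random.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-run scores for two models; NOT taken from the paper.
model_a_runs = [1200.0, 950.0, 1430.0, 1100.0, 1010.0]
model_b_runs = [1350.0, 1420.0, 1180.0, 1500.0, 1290.0]

p_value = permutation_test(model_a_runs, model_b_runs)
print(f"estimated p-value: {p_value:.3f}")
```

A permutation test is used here because it makes no distributional assumption about run scores, which suits the small number of runs typical of expensive 365-day simulations. With only a handful of runs per model the test has little power, so a large p-value would indicate insufficient evidence rather than equal performance.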
Games detected
Big Numbers No Error Bars, Overclaiming, Open Source Theater, Contamination Dodge
Dimension scores
Composite: 25.8 (harmonic mean)
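As a rough check on how a harmonic-mean composite behaves, the sketch below assumes the composite is the unweighted harmonic mean of the four header scores (R 40, T 50, C 20, I 17). The page does not state which inputs or weights are used, so this is an assumption. The rounded inputs give about 26.0, close to but not identical to the reported 25.8; the gap may come from unrounded scores or a different weighting.

```python
from statistics import harmonic_mean, mean

# Dimension scores as displayed in the header line (assumed inputs).
dimension_scores = {"R": 40, "T": 50, "C": 20, "I": 17}

values = list(dimension_scores.values())
composite = harmonic_mean(values)   # n / sum(1 / x_i)
arithmetic = mean(values)           # shown only for comparison

print(f"harmonic mean:   {composite:.1f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

The harmonic mean is pulled toward the lowest scores, so a paper cannot offset a weak dimension with a strong one as easily as it could under an arithmetic mean.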
Checklist (26/61 passed)
Category scores
artifacts: 50
statistical methodology: 0
evaluation design: 100
claims and evidence: 40
setup transparency: 100
limitations and scope: 0
data integrity: 50
conflicts of interest: 25
contamination: 0
human studies: 0
cost and practicality: 0
experimental rigor: 50
data leakage: 25