SIGINT
Paper detail
TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models
R 59 T 42 C 53 I 47
benchmark-eval
Key findings
TestBench evaluates CodeLlama-13b, GPT-3.5, and GPT-4 on class-level Java test generation across 108 functions from 9 open-source projects. Larger models produce fewer syntax and compilation errors, with GPT-4 achieving 92.51% line coverage and a 26.10% mutation kill rate on passing tests. Providing class context improves compilation pass rates, but only GPT-4 benefits from full context; the smaller models regress due to noise. A heuristic repair strategy reduces GPT-3.5's syntax error rate from 97.84% to 4.38%.
Claims (5)
moderate: Larger models produce fewer syntax and compilation errors in generated test cases.
moderate: GPT-4 significantly outperforms other models in code coverage and mutation kill rate on passing test cases.
moderate: Providing context improves compilation pass rates, but only larger models benefit from richer (full) context.
strong: The heuristic repair strategy significantly reduces syntax errors in generated test cases.
strong: LLMs' ability to detect defects through generated test cases is somewhat limited.
Red flags (6)
No statistical tests despite comparative claims: The paper claims GPT-4 'significantly outperformed' others and makes multiple comparative claims across models and contexts, but uses no statistical significance tests. All conclusions are based on raw percentage comparisons.
No variance reported across 10 generations: Ten test cases are generated per function to 'minimize errors caused by incidental factors' but no variance, standard deviation, or confidence intervals are reported. The reader cannot assess result stability.
Missing hyperparameters: No temperature, top-p, or max_tokens settings are reported for any model. These significantly affect generation quality and make reproduction impossible.
Confounded model size claims: The paper attributes performance differences to model 'parameter size' but the three models differ in architecture, training data, instruction-tuning methods, and more. Model size is confounded with these factors.
No comparison with traditional test generation tools: EvoSuite and Randoop are discussed in related work but never compared against. This omits the most relevant baselines for test generation quality.
Manual function selection bias: Functions are 'manually selected' based on whether they 'frequently appear in real-world development scenarios.' This subjective criterion could bias the benchmark toward functions that favor LLM generation.
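The first two red flags concern missing variance and uncertainty reporting. A minimal sketch of what such reporting could look like for one model over its 10 generation rounds is shown here: the mean, the sample standard deviation, and a percentile bootstrap confidence interval. The function name `summarize_runs` and the input values are hypothetical placeholders for illustration; they are not taken from the paper.

```python
import random
from statistics import mean, stdev


def summarize_runs(values, n_boot=2000, alpha=0.05, seed=0):
    """Return (mean, sample std dev, percentile bootstrap CI) for a list of scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(values), stdev(values), (lo, hi)


# HYPOTHETICAL placeholder values: a compile-pass rate (%) for one model over
# 10 generation rounds. These are NOT numbers from the paper.
placeholder_runs = [61.0, 58.5, 63.2, 60.1, 59.4, 62.8, 57.9, 60.6, 61.7, 59.0]

run_mean, run_sd, (ci_lo, ci_hi) = summarize_runs(placeholder_runs)
print(f"mean={run_mean:.2f} sd={run_sd:.2f} 95% CI=({ci_lo:.2f}, {ci_hi:.2f})")
```

Reporting an interval like this per model and per context setting would let a reader judge whether gaps between models exceed run-to-run noise. A paired test over the same functions would be the natural next step for the comparative claims.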
Games detected
Big Numbers No Error Bars, Overclaiming, Open Source Theater, Moving Goalpost
Dimension scores
Composite: 49.4 (harmonic mean)
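The composite can be reproduced with a short calculation, assuming it is the harmonic mean of the four dimension scores shown under the title (R 59, T 42, C 53, I 47). That pairing is an inference from the card, not something it states explicitly.

```python
from statistics import harmonic_mean

# Dimension scores as listed on the card. Treating these four values as the
# inputs to the composite is an assumption.
dimension_scores = {"R": 59, "T": 42, "C": 53, "I": 47}

# Harmonic mean: n / sum(1 / x_i). It is pulled toward the lowest scores.
composite = harmonic_mean(list(dimension_scores.values()))
print(round(composite, 1))
```

The harmonic mean is lower than the arithmetic mean of the same four scores (50.25) because it weights weak dimensions more heavily, so a paper cannot offset one poor dimension with a strong one.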
Checklist (29/55 passed)
Category scores
artifacts: 50
statistical methodology: 20
evaluation design: 88.9
claims and evidence: 40
setup transparency: 75
limitations and scope: 66.7
data integrity: 100
conflicts of interest: 75
contamination: 66.7
cost and practicality: 0
experimental rigor: 25
data leakage: 0