SIGINT

Paper detail

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

2024 arXiv Builder Composite: 49.4

R 59 T 42 C 53 I 47

benchmark-eval

Key findings

TestBench evaluates CodeLlama-13b, GPT-3.5, and GPT-4 on class-level Java test generation across 108 functions from 9 open-source projects. Larger models produce fewer syntax and compilation errors, with GPT-4 achieving 92.51% line coverage and a 26.10% mutation kill rate on passing tests. Providing class context improves compilation pass rates, but only GPT-4 benefits from full context; the smaller models regress because the extra context acts as noise. A heuristic repair strategy reduces GPT-3.5's syntax error rate from 97.84% to 4.38%.

Claims (5)

moderate: Larger models produce fewer syntax and compilation errors in generated test cases.

moderate: GPT-4 significantly outperforms other models in code coverage and mutation kill rate on passing test cases.

moderate: Providing context improves compilation pass rates, but only larger models benefit from richer (full) context.

strong: The heuristic repair strategy significantly reduces syntax errors in generated test cases.

strong: LLMs' ability to detect defects through generated test cases is somewhat limited.

Red flags (6)

No statistical tests despite comparative claims: The paper claims GPT-4 'significantly outperformed' others and makes multiple comparative claims across models and contexts, but uses no statistical significance tests. All conclusions are based on raw percentage comparisons.

No variance reported across 10 generations: Ten test cases are generated per function to 'minimize errors caused by incidental factors' but no variance, standard deviation, or confidence intervals are reported. The reader cannot assess result stability.

Missing hyperparameters: No temperature, top-p, or max_tokens settings are reported for any model. These significantly affect generation quality and make reproduction impossible.

Confounded model size claims: The paper attributes performance differences to model 'parameter size' but the three models differ in architecture, training data, instruction-tuning methods, and more. Model size is confounded with these factors.

No comparison with traditional test generation tools: EvoSuite and Randoop are discussed in related work but never compared against. This omits the most relevant baselines for test generation quality.

Manual function selection bias: Functions are 'manually selected' based on whether they 'frequently appear in real-world development scenarios.' This subjective criterion could bias the benchmark toward functions that favor LLM generation.
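As an illustration of what the variance and significance flags above are asking for, the following is a minimal sketch of summarizing ten generations for one function with a mean, sample standard deviation, and a 95% confidence interval. The run values are made up for the example and are not taken from the paper; the function name `summarize` and the choice of a Student's t interval are assumptions, not the paper's or this site's method.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 9 degrees of freedom
# (i.e. for exactly 10 runs).
T_CRIT_95_DF9 = 2.262


def summarize(runs):
    """Return (mean, sample std dev, 95% CI) for exactly 10 repeated runs."""
    if len(runs) != 10:
        raise ValueError("this sketch hard-codes the t value for 10 runs")
    m = mean(runs)
    s = stdev(runs)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95_DF9 * s / sqrt(len(runs))
    return m, s, (m - half_width, m + half_width)


# HYPOTHETICAL line-coverage percentages for 10 generations of one function.
hypothetical_runs = [90.0, 92.0, 91.0, 93.0, 89.0, 94.0, 92.0, 91.0, 90.0, 93.0]

m, s, (lo, hi) = summarize(hypothetical_runs)
print(f"mean={m:.2f} std={s:.2f} 95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting the interval alongside each headline percentage would let a reader judge whether gaps between models exceed run-to-run noise.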

Games detected

Big Numbers No Error Bars · Overclaiming · Open Source Theater · Moving Goalpost

Dimension scores

Rigor: 58.5

Transparency: 41.7

Claims: 53.3

Integrity: 47.2

Composite: 49.4 (harmonic mean)
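The composite can be checked against the four dimension scores listed on this page. The sketch below assumes an unweighted harmonic mean (the page does not state any weighting); the names `dimension_scores` and `composite` are illustrative only.

```python
from statistics import harmonic_mean

# Dimension scores as listed on this page.
dimension_scores = {
    "Rigor": 58.5,
    "Transparency": 41.7,
    "Claims": 53.3,
    "Integrity": 47.2,
}


def composite(scores):
    """Unweighted harmonic mean of the dimension scores (equal weights assumed)."""
    return harmonic_mean(list(scores.values()))


print(f"Composite: {composite(dimension_scores):.1f}")
```

A harmonic mean is pulled toward the lowest score, so the weak Transparency score drags the composite below the arithmetic average of the four dimensions.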

Checklist (29/55 passed)

Category scores

artifacts

statistical methodology

evaluation design: 88.9

claims and evidence

setup transparency

limitations and scope: 66.7

data integrity: 100

conflicts of interest

contamination: 66.7

cost and practicality

experimental rigor

data leakage

arXiv · PDF · DOI · Code
