SIGINT

Paper detail

Interactive Code Generation via Test-Driven User-Intent Formalization

2022 · arXiv · Mixed · Composite: 21.6

Rigor 45 · Transparency 33 · Claims 80 · Integrity 8

benchmark-eval

Key findings

TiCoder demonstrates that interactive test-driven code generation using LLM-generated tests can substantially improve pass@1 accuracy: by 22.49 percentage points on MBPP (48.24% → 70.73%) and 24.79 percentage points on HumanEval (30.49% → 55.28%) with a single simulated user query. In the ablation, discriminative test ranking and dynamic test mutation are the most impactful components. The approach generates a user-accepted test within an average of 1.5–1.7 queries for 87–96% of benchmark examples.
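The gains quoted above are differences between two pass@1 values, so they are absolute (percentage-point) gains rather than relative ones. A quick check of that arithmetic, with the relative gains shown for comparison, might look like the sketch below. The helper name `gains` and the dictionary layout are illustrative, not taken from the paper.

```python
from typing import Dict, Tuple

# pass@1 values (in percent) quoted in the key findings: (baseline, with one query).
results: Dict[str, Tuple[float, float]] = {
    "MBPP": (48.24, 70.73),
    "HumanEval": (30.49, 55.28),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


summary = {name: gains(before, after) for name, (before, after) in results.items()}

for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute} points absolute, +{relative}% relative")
```

The absolute figures reproduce the 22.49 and 24.79 quoted above. The relative gains are considerably larger because both baselines are below 50%.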

Claims (6)

strong: TiCoder improves pass@1 from 48.24% to 70.73% on MBPP and from 30.49% to 55.28% on HumanEval with a single simulated user query.

strong: With 5 user queries, TiCoder achieves pass@1 of 85.95% on MBPP and 84.47% on HumanEval.

strong: TiCoder generates a user-accepted test within an average of 1.7 queries for 87.12% of MBPP and 1.5 queries for 95.73% of HumanEval examples.

moderate: Each component (test ranking, dynamic mutation, code ranking, static mutation, prompt design) contributes to TiCoder's effectiveness.

moderate: TiCoder outperforms CodeT, which uses no user interaction, on MBPP with a single query and on HumanEval with 2 queries.

strong: Test mutation techniques improve the pool of tests over purely LLM-generated tests, as shown by accuracy gains of 10.72% (MBPP) and 13.74% (HumanEval) after 5 interactions.

Red flags (5)

Simulated user evaluation only: All experiments use an oracle (the reference implementation) instead of real users. The cognitive cost and accuracy of real user responses are unknown. The paper acknowledges this but provides no user study, so the claimed 'interactive' benefit is unvalidated with actual users.

No error bars or variance on any result: All results come from a single cached Codex query. No repeated runs, no confidence intervals, no statistical tests. The stochasticity of Codex (acknowledged in Section VI) means results could vary substantially across runs.

No contamination analysis: Both MBPP and HumanEval were publicly available before Codex training. The high pass@100 values (89.70% MBPP, 90.85% HumanEval) could partly reflect memorization. No contamination check is performed.

Microsoft-OpenAI financial relationship: The majority of authors are Microsoft Research employees evaluating OpenAI's Codex model. Microsoft is a major investor in OpenAI. This conflict of interest is not disclosed or acknowledged.

Discontinued model: Codex code-davinci-002 was discontinued by OpenAI in March 2023. Results cannot be verified or replicated with the same model, and the cached outputs are not released.
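To make the first red flag concrete, here is a minimal sketch of what "simulated user" means in this kind of evaluation: an oracle backed by the reference implementation answers each test query, and candidate programs are pruned according to its verdict. This is an illustration under stated assumptions, not the paper's implementation. The names `simulated_user`, `passes`, and `prune`, the toy task, and the pruning rule are all hypothetical.

```python
from typing import Callable, List, Tuple

Program = Callable[[int], int]
Test = Tuple[int, int]  # (input, expected output)


def simulated_user(reference: Program, test: Test) -> bool:
    """Oracle stand-in for a user: accept a test iff the reference agrees with it."""
    x, expected = test
    return reference(x) == expected


def passes(program: Program, test: Test) -> bool:
    """Run a candidate on the test input; a crash counts as a failure."""
    x, expected = test
    try:
        return program(x) == expected
    except Exception:
        return False


def prune(candidates: List[Program], test: Test, accepted: bool) -> List[Program]:
    """Keep candidates whose behaviour on the test matches the verdict (assumed rule)."""
    return [p for p in candidates if passes(p, test) == accepted]


# Toy, hypothetical task: "return the square of x".
reference: Program = lambda x: x * x
candidates: List[Program] = [
    lambda x: x * x,
    lambda x: x + x,
    lambda x: x ** 2,
    lambda x: 2 * x,
]

test: Test = (3, 9)
verdict = simulated_user(reference, test)
survivors = prune(candidates, test, verdict)
print(verdict, len(survivors))
```

In this toy setup the test `(3, 9)` removes the two doubling candidates, whereas a test such as `(2, 4)` would remove none, since all four candidates return 4 on input 2. An oracle never hesitates or answers incorrectly, which is exactly what a real user study would need to measure.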

Games detected

Big Numbers No Error Bars · Overclaiming · Contamination Dodge · Trust Us

Dimension scores

Rigor: 44.6

Transparency: 33.3

Claims: 80.0

Integrity: 8.3

Composite: 21.6 (harmonic mean)
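The composite can be reproduced from the four dimension scores. The sketch below assumes an unweighted harmonic mean, which matches the displayed value; the variable names are illustrative.

```python
from statistics import harmonic_mean, mean

# Dimension scores as displayed on this page.
scores = {
    "Rigor": 44.6,
    "Transparency": 33.3,
    "Claims": 80.0,
    "Integrity": 8.3,
}

values = list(scores.values())
composite = harmonic_mean(values)   # n / sum(1 / x_i), unweighted (assumption)
arithmetic = mean(values)           # shown only for comparison

print(f"harmonic mean:   {composite:.1f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

The harmonic mean comes out at about 21.6, against an arithmetic mean of about 41.55. The harmonic mean is dominated by the smallest term, so the low Integrity score pulls the composite far below the other three dimensions.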

Checklist (22/54 passed)

Category scores

artifacts

statistical methodology

evaluation design: 66.7

claims and evidence

setup transparency: 100

limitations and scope: 100

data integrity: 66.7

conflicts of interest

contamination

cost and practicality

experimental rigor

data leakage
