Interactive Code Generation via Test-Driven User-Intent Formalization
R 45 T 33 C 80 I 8
benchmark-eval
Key findings
TiCoder demonstrates that interactive test-driven code generation using LLM-generated tests can substantially improve pass@1 accuracy: by 22.49 percentage points on MBPP (48.24% → 70.73%) and by 24.79 percentage points on HumanEval (30.49% → 55.28%) with a single simulated user query. In the ablation study, discriminative test ranking and dynamic test mutation are the most impactful components. The approach produces a user-accepted test within an average of 1.5–1.7 queries for 87–96% of benchmark examples.
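The headline gains are differences between two pass@1 values, so they are percentage points rather than relative percentages. The short sketch below recomputes both views from the four figures quoted above; the helper name `gains` is illustrative and not taken from the paper.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(after - before, 2)
    relative = round(100 * (after - before) / before, 2)
    return absolute, relative


# pass@1 before and after a single simulated user query, as quoted above
mbpp = gains(48.24, 70.73)
humaneval = gains(30.49, 55.28)

print("MBPP:", mbpp)
print("HumanEval:", humaneval)
```

Read this way, the absolute gains match the reported 22.49 and 24.79, while the relative improvements over the baseline are considerably larger (roughly 47% and 81%).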
Claims (6)
strong: TiCoder improves pass@1 from 48.24% to 70.73% on MBPP and from 30.49% to 55.28% on HumanEval with a single simulated user query.
strong: With 5 user queries, TiCoder achieves pass@1 of 85.95% on MBPP and 84.47% on HumanEval.
strong: TiCoder generates a user-accepted test within an average of 1.7 queries for 87.12% of MBPP and 1.5 queries for 95.73% of HumanEval examples.
moderate: Each component (test ranking, dynamic mutation, code ranking, static mutation, prompt design) contributes to TiCoder's effectiveness.
moderate: TiCoder outperforms CodeT, which uses no user interaction, on MBPP with a single query and on HumanEval with 2 queries.
strong: Test mutation techniques improve the pool of tests over purely LLM-generated tests, as shown by accuracy gains of 10.72% (MBPP) and 13.74% (HumanEval) after 5 interactions.
Red flags (5)
Simulated user evaluation only: All experiments use an oracle (reference implementation) instead of real users. The cognitive cost and accuracy of real user responses are unknown. The paper acknowledges this but provides no user study, making the claimed 'interactive' benefit unvalidated with actual users.
No error bars or variance on any result: All results come from a single cached Codex query. No repeated runs, no confidence intervals, no statistical tests. The stochasticity of Codex (acknowledged in Section VI) means results could vary substantially across runs.
No contamination analysis: Both MBPP and HumanEval were publicly available before Codex training. The high pass@100 values (89.70% MBPP, 90.85% HumanEval) could partly reflect memorization. No contamination check is performed.
Microsoft-OpenAI financial relationship: The majority of authors are Microsoft Research employees evaluating OpenAI's Codex model. Microsoft is a major investor in OpenAI. This conflict of interest is not disclosed or acknowledged.
Discontinued model: Codex code-davinci-002 was discontinued by OpenAI in March 2023. Results cannot be verified or replicated with the same model, and the cached outputs are not released.
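To make the first red flag concrete, here is a minimal sketch of what an oracle-simulated user looks like: the "user" accepts a proposed test exactly when the reference implementation agrees with it, and candidates inconsistent with that answer are dropped. Everything here is a toy assumption for illustration (the function names, the integer-to-integer task, and the balance-based `pick_test` heuristic standing in for discriminative test ranking); it is not the paper's implementation.

```python
from typing import Callable, List, Tuple

Candidate = Callable[[int], int]
Test = Tuple[int, int]  # (input, expected output)


def passes(candidate: Candidate, test: Test) -> bool:
    """True if the candidate returns the test's expected output."""
    x, expected = test
    try:
        return candidate(x) == expected
    except Exception:
        return False


def oracle_accepts(test: Test, reference: Candidate) -> bool:
    """Simulated user: accept a test iff the reference implementation agrees."""
    x, expected = test
    return reference(x) == expected


def pick_test(candidates: List[Candidate], pool: List[Test]) -> Test:
    """Toy ranking: prefer the test that splits the candidates most evenly."""
    def imbalance(test: Test) -> int:
        n_pass = sum(passes(c, test) for c in candidates)
        return abs(2 * n_pass - len(candidates))
    return min(pool, key=imbalance)


def interact(candidates: List[Candidate], tests: List[Test],
             reference: Candidate, max_queries: int) -> List[Candidate]:
    """Ask up to max_queries questions and prune inconsistent candidates."""
    remaining, pool = list(candidates), list(tests)
    for _ in range(max_queries):
        if not pool or len(remaining) <= 1:
            break
        test = pick_test(remaining, pool)
        pool.remove(test)
        accepted = oracle_accepts(test, reference)
        # Keep candidates whose behaviour matches the simulated user's answer.
        remaining = [c for c in remaining if passes(c, test) == accepted]
    return remaining


# Hypothetical candidate programs for the intent "square the input"
def square(x: int) -> int: return x * x
def double(x: int) -> int: return x * 2
def cube(x: int) -> int: return x ** 3

tests = [(2, 4), (3, 9), (3, 6)]
after_one = interact([square, double, cube], tests, reference=square, max_queries=1)
after_two = interact([square, double, cube], tests, reference=square, max_queries=2)
print([f.__name__ for f in after_one], [f.__name__ for f in after_two])
```

The oracle never hesitates and never answers incorrectly, which is precisely the property a real user cannot be assumed to have.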
Games detected
Big Numbers No Error Bars, Overclaiming, Contamination Dodge, Trust Us
Dimension scores
Composite: 21.6 (harmonic mean)
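A harmonic mean is pulled strongly toward its smallest input, which is why a single very low dimension can sink the composite. The sketch below assumes, without confirmation from this page, that the composite is built from the four header values R 45, T 33, C 80 and I 8 with equal weights. Under that assumption the result is about 21.05 rather than the reported 21.6, so the actual inputs or weighting presumably differ somewhat.

```python
from statistics import harmonic_mean, mean

# Assumption: these are the four dimension scores shown in the header line.
scores = {"R": 45, "T": 33, "C": 80, "I": 8}
values = list(scores.values())

composite = harmonic_mean(values)   # dominated by the lowest score (I = 8)
arithmetic = mean(values)           # shown for comparison only

print(f"harmonic mean:   {composite:.2f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

The arithmetic mean of the same values is 41.5, roughly double the harmonic mean, which illustrates how heavily this kind of composite penalizes one weak dimension.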
Checklist (22/54 passed)
Category scores
artifacts: 0
statistical methodology: 20
evaluation design: 66.7
claims and evidence: 60
setup transparency: 100
limitations and scope: 100
data integrity: 66.7
conflicts of interest: 25
contamination: 0
cost and practicality: 0
experimental rigor: 25
data leakage: 0