SIGINT

Paper detail

Granite Code Models: A Family of Open Foundation Models for Code Intelligence

2024 · arXiv · Mixed · Composite: 13.4
R 45 · T 58 · C 0 · I 17
benchmark-eval

Key findings

Granite Code models (3B-34B parameters), trained on 3.5-4.5T tokens across 116 programming languages, achieve competitive or state-of-the-art performance among open-source code LLMs on code generation, fixing, explanation, editing, and translation. Granite-8B-Code-Base outperforms CodeGemma-8B by ~12 points on the HumanEvalPack average despite training on fewer tokens (4.5T vs 7.5T). The models are particularly strong on code explanation and fixing, where specialized code models such as StarCoder2 and CodeGemma fall behind. All models are released under the Apache 2.0 license.

Claims (7)

strong: Granite-8B-Code-Base outperforms CodeGemma-8B by almost 12 points on the HumanEvalPack average (33.2% vs 21.3%), despite being trained on significantly fewer tokens (4.5T vs 7.5T).
moderate: Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs.
strong: Granite-8B-Code-Base outperforms Llama-3-8B-Base by ~12 points on GSM8K and ~6 points on MATH.
strong: Granite Code base models significantly outperform other SOTA base code LLMs on code explanation and fixing tasks.
weak: Domain-specific code models are more suitable for cost- and performance-sensitive enterprise environments than larger general-purpose models.
weak: Depth upscaling from the 20B model is effective for training the 34B model, with a small initial performance drop that is quickly recovered.
strong: Instruction tuning consistently improves function calling performance, with +17.88% overall accuracy from Granite-8B-Code-Base to Granite-8B-Code-Instruct.

Red flags (6)

Company evaluating own product: All 46 authors are IBM Research employees evaluating IBM's Granite Code models, which are part of IBM's commercial WatsonX product line. The paper acknowledges WatsonX Code Assistant in the references. This is a textbook conflict of interest where the funder has a direct commercial stake in positive results.
No contamination analysis: Training data includes GitHub code and publicly available datasets. Evaluation benchmarks (HumanEval 2021, MBPP 2021, etc.) have solutions widely available on GitHub. No decontamination, overlap analysis, or temporal leakage discussion is provided despite this being a known and serious issue for code LLM evaluation.
No error bars or statistical tests: All results are point estimates despite sampling-based evaluations. Numerous 'outperforms' claims are made by comparing raw numbers without any statistical significance testing. Small differences (e.g., 0.1% on HumanEvalSynthesize between Granite-20B and StarCoder2-15B) are treated as meaningful.
No limitations section: The paper has no limitations section, no threats to validity, and no scope boundaries. For a paper with 46 authors and strong enterprise positioning claims, the complete absence of self-critical analysis is concerning.
Enterprise claims without enterprise evaluation: The paper repeatedly claims models are 'optimized for enterprise software development workflows' and suitable for 'enterprise environments,' but all evaluation is on academic benchmarks. No real-world enterprise tasks, developer studies, or deployment metrics are provided.
Selective baseline presentation: Some tables include different model sets, making cross-table comparison difficult. CodeGemma-2B is not included in all evaluations where Granite-3B appears. The Llama-3-70B Python generation issue (footnote in Table 3) is flagged but no investigation is described.
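To give a sense of scale for the missing error bars, the sketch below computes the binomial standard error of a benchmark pass rate. It assumes 164 independent problems (the size of the original HumanEval set) scored with a single sample each. The function name and the 30% example pass rate are illustrative and are not figures from the paper.

```python
import math


def pass_rate_standard_error(pass_rate, n_problems):
    """Binomial standard error of a pass rate, on the same 0-1 scale."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if n_problems <= 0:
        raise ValueError("n_problems must be positive")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)


# Illustrative values only: 164 problems and a 30% pass rate (assumptions).
se = pass_rate_standard_error(0.30, 164)
half_width_95 = 1.96 * se  # normal-approximation 95% interval half-width
print(f"SE = {100 * se:.1f} points, 95% CI = +/-{100 * half_width_95:.1f} points")
```

Under these assumptions the uncertainty on a single score is on the order of a few points, which is much larger than a 0.1-point gap between two models.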

Games detected

Big Numbers No Error Bars · Overclaiming · Open Source Theater · Contamination Dodge · Moving Goalpost

Dimension scores

Rigor: 44.6
Transparency: 58.3
Claims: 0.0
Integrity: 16.7
Composite: 13.4 (harmonic mean)
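A plain harmonic mean is zero whenever any input is zero, so the reported composite of 13.4 suggests that zero scores are handled in some way the page does not describe. The sketch below assumes, purely for illustration, that each score is raised to a minimum floor before averaging. The function name `floored_harmonic_mean` and the floor value of 5 are hypothetical and do not come from the source.

```python
def floored_harmonic_mean(scores, floor=5.0):
    """Harmonic mean after raising every score to at least `floor`.

    `floor` is a hypothetical parameter: the source does not say how
    zero scores are treated.
    """
    if floor <= 0:
        raise ValueError("floor must be positive")
    clipped = [max(s, floor) for s in scores]
    return len(clipped) / sum(1.0 / s for s in clipped)


# Dimension scores as listed: Rigor, Transparency, Claims, Integrity.
dimension_scores = [44.6, 58.3, 0.0, 16.7]
composite = floored_harmonic_mean(dimension_scores)
print(round(composite, 1))  # prints 13.4 with the assumed floor of 5
```

With the assumed floor the result rounds to the reported composite, but other treatments of zero could produce the same figure, so this should be read as one possible reconstruction and not as the site's actual formula.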

Checklist (19/54 passed)

Category scores

artifacts: 50
statistical methodology: 20
evaluation design: 66.7
claims and evidence: 0
setup transparency: 75
limitations and scope: 0
data integrity: 66.7
conflicts of interest: 50
contamination: 0
cost and practicality: 50
experimental rigor: 25
data leakage: 0