Code Evaluation LLM Leaderboard (Open) — A transparent, community-driven benchmark for evaluating LLM code generation quality with a focus on open-source models.
Ranked by composite score across 7 evaluation dimensions. All evaluations use OSI-approved open-source tools and real rustc compilation.
📊 New: Detailed evaluation reports now available — view C source, generated Rust code, compilation errors, and full scoring breakdowns.
| # | Model | Type | Completion | Bug Detect | Code Review | Security | Docs | Refactor | Cross-Lang | Overall ▼ |
|---|---|---|---|---|---|---|---|---|---|---|
⚠️ Demo Data: Scores shown are illustrative proof-of-concept data. Production evaluations will use automated pipelines with full reproducibility.
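The page does not state how the Overall column is derived from the seven dimension columns. As a minimal sketch, assuming an unweighted mean over a shared numeric scale (both are assumptions, and the function names are hypothetical), the ranking could be computed like this:

```python
from statistics import mean

# Column names copied from the leaderboard header.
DIMENSIONS = [
    "Completion", "Bug Detect", "Code Review", "Security",
    "Docs", "Refactor", "Cross-Lang",
]


def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the seven dimension scores (assumed formula)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return round(mean(float(scores[d]) for d in DIMENSIONS), 1)


def rank(models: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (model, overall) pairs sorted by overall score, highest first."""
    totals = [(name, overall_score(s)) for name, s in models.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)
```

A weighted mean would be a one-line change in `overall_score`; the unweighted form is used here only because no weights are published.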
Visual comparison of model performance across evaluation dimensions.
CELLO evaluates C → Rust transpilation quality using real compiler verification and multi-dimensional scoring.
Real infrastructure code
Claude & Gemini generate Rust
Real compiler verification
6 dimensions (0-100 pts)
Full transparency (MD + HTML)
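The compiler verification step in the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not CELLO's actual harness: the function name, the file name `candidate.rs`, the 2021 edition default, and the choice to build each file as a library crate are all hypothetical. It relies only on standard `rustc` flags (`--edition`, `--crate-type`, `--emit`, `--out-dir`).

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def compiles_with_rustc(rust_source: str, edition: str = "2021") -> tuple[bool, str]:
    """Return (passed, stderr) after asking rustc to check one Rust file as a library."""
    rustc = shutil.which("rustc")
    if rustc is None:
        raise RuntimeError("rustc not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.rs"  # hypothetical file name
        src.write_text(rust_source, encoding="utf-8")
        # --emit metadata runs the compiler's analysis passes and writes
        # only crate metadata, skipping machine-code generation.
        result = subprocess.run(
            [rustc, "--edition", edition, "--crate-type", "lib",
             "--emit", "metadata", "--out-dir", tmp, str(src)],
            capture_output=True, text=True, timeout=120,
        )
    # rustc exits with a non-zero status when compilation fails.
    return result.returncode == 0, result.stderr
```

Returning the captured stderr alongside the pass/fail flag keeps the compilation errors available for the evaluation reports.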
Does it compile with rustc? Pass/fail verification.
Memory safety: unsafe blocks, error handling, Option types.
Idiomatic Rust patterns, documentation, naming conventions.
Functional equivalence, edge case handling, unit tests.
Code organization, public API clarity, modularity.
Efficiency patterns: minimal cloning, reference usage.
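Several of the signals named above (unsafe blocks, Option types, cloning) can be approximated with plain text matching. The sketch below is a crude heuristic written for illustration, not CELLO's scorer: the function name and the returned keys are hypothetical, and it only counts occurrences without turning them into points.

```python
import re


def static_signals(rust_source: str) -> dict[str, int]:
    """Count textual signals related to the safety and efficiency dimensions."""
    # Strip line comments so commented-out code is not counted.
    # Limitation: block comments and string literals are not handled.
    code = re.sub(r"//[^\n]*", "", rust_source)
    return {
        "unsafe": len(re.findall(r"\bunsafe\b", code)),
        "unwrap_calls": len(re.findall(r"\.unwrap\(\)", code)),
        "option_types": len(re.findall(r"\bOption<", code)),
        "result_types": len(re.findall(r"\bResult<", code)),
        "clone_calls": len(re.findall(r"\.clone\(\)", code)),
    }
```

A production scorer would more likely work on a parsed syntax tree, since regular expressions cannot tell an `unsafe` block from the same word inside a string.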
Current proof-of-concept: 3 C infrastructure projects (1.7 KB to 3.3 KB each) for C → Rust transpilation evaluation.
⚠️ Note: The D3.js and matplotlib projects below are aspirational future goals. Current evaluations use the 3 small C test files (string_utils, buffer, hashmap) available in the GitHub repository.
The definitive data visualization library for the web. Complex functional API, SVG/DOM manipulation, sophisticated mathematical algorithms for scales, projections, and force simulations.
D3.js-compatible data visualization library for Makepad's GPU-accelerated rendering. 40+ chart types including Bar, Line, Pie, Sankey, Force Graph, Treemap, Globe Map.
Python's comprehensive 2D plotting library. Massive OOP codebase with NumPy integration, extensive customization, and decades of battle-tested code.
Matplotlib-compatible plotting library for Makepad GPU framework. Translates Python's OOP plotting paradigm into Rust's ownership model.
JS↔Rust (D3↔makepad-d3) and Python↔Rust (matplotlib↔makepad-matplot) pairs test how well a model understands translation between languages
Functional (D3), OOP (matplotlib), Systems (Rust) — tests versatility across programming styles
Scales, projections, force layouts, tree algorithms — requires genuine mathematical understanding
Output can be visually verified — a chart either looks right or it doesn't
How CELLO fills the gaps left by current LLM code evaluation approaches.
| Feature | SonarSource | HumanEval | SWE-bench | BigCodeBench | CELLO |
|---|---|---|---|---|---|
| Test Cases | ~4,000 | 164 | 2,294 | 1,140 | 10,000+ |
| Open-Source Focus | ✗ | ✗ | ✗ | ~ | ✓ |
| Multi-Language | ✓ | ✗ | ✗ | ~ | ✓ |
| Code Quality Metrics | ✓ | ✗ | ✗ | ✗ | ✓ |
| Security Analysis | ✓ | ✗ | ✗ | ✗ | ✓ |
| Performance Eval | ✗ | ✗ | ✗ | ✗ | ✓ |
| Real-World Projects | ✗ | ✗ | ✓ | ✗ | ✓ |
| Community-Driven | ✗ | ✗ | ~ | ~ | ✓ |
| Fully Transparent | ✗ | ✓ | ✓ | ✓ | ✓ |
| Live/Continuous | ✗ | ✗ | ✗ | ✗ | ✓ |
| Cross-Language Eval | ✗ | ✗ | ✗ | ✗ | ✓ |
Existing benchmarks were designed for a world where LLMs were primarily closed-source products. The open-source AI revolution has created a vibrant ecosystem of models that developers can self-host, fine-tune, and integrate — but they lack reliable quality benchmarks.
SonarSource's benchmark, while pioneering, serves primarily as a marketing tool for their commercial SonarQube product. It uses a proprietary evaluation pipeline that cannot be independently reproduced, and focuses on models most developers cannot modify or self-host.
CELLO is built for the open-source community, by the open-source community. Every tool in the evaluation pipeline is OSI-approved. Every test case, prompt, and result is publicly available. Anyone can contribute, verify, and build upon the benchmark.