Live Benchmark — Open Source

CELLO

Code Evaluation LLM Leaderboard (Open) — A transparent, community-driven benchmark for evaluating LLM code generation quality with a focus on open-source models.

Models evaluated: 2
Evaluation metrics: 18
Languages: 2
Quality dimensions: 6
Last updated: Feb 16, 2026 (with real compilation) · v0.2.0-beta · 100% open-source tools · detailed reports

🏆 Model Leaderboard

Ranked by composite score across the seven task categories shown in the table. All evaluations use OSI-approved open-source tools and real rustc compilation.

📊 New: Detailed evaluation reports now available — view C source, generated Rust code, compilation errors, and full scoring breakdowns.

# | Model | Type | Completion | Bug Detect | Code Review | Security | Docs | Refactor | Cross-Lang | Overall ▼

⚠️ Demo Data: Scores shown are illustrative proof-of-concept data. Production evaluations will use automated pipelines with full reproducibility.

📊 Visualizations

Visual comparison of model performance across evaluation dimensions.

Overall Scores — All Models

Top 5 Models — Radar Comparison

Open Source vs Closed Source — Average by Category

Model Size vs. Performance

🔬 Evaluation Methodology

CELLO evaluates C → Rust transpilation quality using real compiler verification and multi-dimensional scoring.

📊 Evaluation Workflow

1. C Source Code: real infrastructure code
2. LLM Transpilation: Claude and Gemini generate Rust
3. rustc Compilation: real compiler verification
4. Quality Scoring: six dimensions, 0-100 points
5. Detailed Reports: full transparency (Markdown + HTML)
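Step 3, the compilation gate, could be wired up roughly as follows. This is a minimal sketch, not CELLO's actual harness: the helper names and the choice of `--emit=metadata` (type- and borrow-checking without code generation) are my assumptions about how a pass/fail check against a real rustc might look.

```python
import os
import shutil
import subprocess
import tempfile


def build_rustc_cmd(src_path: str, out_dir: str) -> list[str]:
    """Command line for a pass/fail compile check.

    --emit=metadata runs type- and borrow-checking without code
    generation, which is enough to gate on "does it compile".
    """
    return [
        "rustc", "--edition", "2021",
        "--crate-type", "lib", "--emit=metadata",
        "--out-dir", out_dir, src_path,
    ]


def compile_check(rust_source: str) -> tuple[bool, str]:
    """Write generated Rust to a temp file and run rustc on it.

    Returns (compiled, stderr) so failing diagnostics can be quoted
    verbatim in the detailed report (step 5). Assumes rustc is on PATH.
    """
    if shutil.which("rustc") is None:
        raise RuntimeError("rustc not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.rs")
        with open(src, "w") as f:
            f.write(rust_source)
        proc = subprocess.run(
            build_rustc_cmd(src, tmp), capture_output=True, text=True
        )
        return proc.returncode == 0, proc.stderr
```

Because the check invokes the real compiler rather than a parser, borrow-checker and type errors count as failures, which is what distinguishes compilation verification from a syntax check.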

🎯 Scoring Dimensions (100 Points Total)

Compilation (25%): Does it compile with rustc? Pass/fail verification.

🛡️ Safety (20%): Memory safety: unsafe blocks, error handling, Option types.

Quality (20%): Idiomatic Rust patterns, documentation, naming conventions.

✔️ Correctness (15%): Functional equivalence, edge-case handling, unit tests.

🔧 Maintainability (10%): Code organization, public API clarity, modularity.

Performance (10%): Efficiency patterns: minimal cloning, reference usage.
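Taken together, the six weights above define a single 0-100 composite. A minimal sketch of that weighted sum (the dimension keys are mine; CELLO's actual field names may differ):

```python
# Weights from the six scoring dimensions above; they sum to 1.0.
WEIGHTS = {
    "compilation": 0.25,
    "safety": 0.20,
    "quality": 0.20,
    "correctness": 0.15,
    "maintainability": 0.10,
    "performance": 0.10,
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each 0-100) -> 0-100 overall."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
```

For example, a submission scoring 100 everywhere except a failed compilation (0) lands at 75, reflecting how heavily the compile gate weighs in the composite.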

🔑 Key Principles

Reproducibility — All prompts, configs, and evaluation scripts are public
Real Compilation — Uses rustc 1.93+ for actual verification (not syntax check)
Transparency — Full source code, generated output, and errors in every report
Config-Driven — Add models/projects by editing YAML files (no code changes)
Open Source — Apache 2.0 license, community contributions welcome
Extensible — Modular provider pattern supports any LLM API
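The config-driven principle might look like the sketch below. The schema is hypothetical (shown as a Python dict standing in for a YAML file such as models.yaml); the point is that crossing models with projects is data, so adding either means editing config, not code.

```python
# Hypothetical config layout, illustrative only -- not CELLO's actual schema.
CONFIG = {
    "models": [
        {"name": "claude-sonnet", "provider": "anthropic"},
        {"name": "gemini-pro", "provider": "google"},
    ],
    # The three C test files named in the repository.
    "projects": ["string_utils", "buffer", "hashmap"],
}


def evaluation_jobs(config: dict) -> list[tuple[str, str]]:
    """Cross every configured model with every test project.

    Each (model, project) pair becomes one transpile-compile-score run.
    """
    return [
        (model["name"], project)
        for model in config["models"]
        for project in config["projects"]
    ]
```

With two models and three projects this yields six evaluation runs, matching the "2 Models Evaluated" figure at the top of the page.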

📁 Test Projects

Current proof of concept: three C infrastructure projects (1.7 KB to 3.3 KB) for C → Rust transpilation evaluation.

⚠️ Note: The D3.js and matplotlib projects below are aspirational future goals. Current evaluations use the three small C test files (string_utils, buffer, hashmap) available in the GitHub repository.

D3.js (JavaScript)

The definitive data visualization library for the web. Complex functional API, SVG/DOM manipulation, sophisticated mathematical algorithms for scales, projections, and force simulations.

  • ⭐ 109k+ stars • 📦 Modular architecture • 🧮 Heavy algorithmic content

makepad-d3 (Rust)

D3.js-compatible data visualization library for Makepad's GPU-accelerated rendering. 40+ chart types including Bar, Line, Pie, Sankey, Force Graph, Treemap, Globe Map.

  • 🔄 JS→Rust port • 🖥️ GPU rendering • 📊 D3-compatible API

matplotlib (Python)

Python's comprehensive 2D plotting library. Massive OOP codebase with NumPy integration, extensive customization, and decades of battle-tested code.

  • ⭐ 20k+ stars • 🐍 Pure Python + C extensions • 📈 Gold standard for scientific vis

makepad-matplot (Rust)

Matplotlib-compatible plotting library for Makepad GPU framework. Translates Python's OOP plotting paradigm into Rust's ownership model.

  • 🔄 Python→Rust port • 🖥️ GPU rendering • 📊 matplotlib-compatible API

🎯 Why These Projects?

Cross-Language Pairs

JS↔Rust (D3↔makepad-d3) and Python↔Rust (matplotlib↔makepad-matplot) test translation understanding

Multiple Paradigms

Functional (D3), OOP (matplotlib), Systems (Rust) — tests versatility across programming styles

Algorithmic Depth

Scales, projections, force layouts, tree algorithms — requires genuine mathematical understanding

Visual Verification

Output can be visually verified — a chart either looks right or it doesn't

⚖️ CELLO vs. Existing Benchmarks

How CELLO fills the gaps left by current LLM code evaluation approaches.

Test cases: SonarSource ~4,000 · HumanEval 164 · SWE-bench 2,294 · BigCodeBench 1,140 · CELLO 10,000+

Feature dimensions compared: open-source focus, multi-language support, code quality metrics, security analysis, performance evaluation, real-world projects, community-driven development, full transparency, live/continuous updates, and cross-language evaluation.

💡 Why CELLO Matters

Existing benchmarks were designed for a world where LLMs were primarily closed-source products. The open-source AI revolution has created a vibrant ecosystem of models that developers can self-host, fine-tune, and integrate — but they lack reliable quality benchmarks.

SonarSource's benchmark, while pioneering, serves primarily as a marketing tool for their commercial SonarQube product. It uses a proprietary evaluation pipeline that cannot be independently reproduced, and focuses on models most developers cannot modify or self-host.

CELLO is built for the open-source community, by the open-source community. Every tool in the evaluation pipeline is OSI-approved. Every test case, prompt, and result is publicly available. Anyone can contribute, verify, and build upon the benchmark.