CELLO — Code Evaluation LLM Leaderboard (Open)

🏆 Model Leaderboard

Ranked by composite score across 7 evaluation dimensions. All evaluations use OSI-approved open-source tools and real rustc compilation.

📊 New: Detailed evaluation reports now available — view C source, generated Rust code, compilation errors, and full scoring breakdowns.

#	Model	Type	Completion	Bug Detect	Code Review	Security	Docs	Refactor	Cross-Lang	Overall ▼

⚠️ Demo Data: Scores shown are illustrative proof-of-concept data. Production evaluations will use automated pipelines with full reproducibility.

📊 Visualizations

Visual comparison of model performance across evaluation dimensions.

Charts: Overall Scores (All Models); Top 5 Models (Radar Comparison); Open Source vs. Closed Source (Average by Category); Model Size vs. Performance.

🔬 Evaluation Methodology

CELLO evaluates C → Rust transpilation quality using real compiler verification and multi-dimensional scoring.

📊 Evaluation Workflow

1. C Source Code: Real infrastructure code

2. LLM Transpilation: Claude & Gemini generate Rust

3. Rustc Compilation: Real compiler verification

4. Quality Scoring: 6 dimensions (0-100 pts)

5. Detailed Reports: Full transparency (MD + HTML)
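The compilation step of the workflow can be sketched as follows. This is a minimal illustration, not the CELLO pipeline itself: the names `CompileResult`, `build_rustc_command`, and `compile_check` are hypothetical, and the choice of `--edition 2021` and `--crate-type lib` is an assumption made so that generated files without a `main` function can still be compiled.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CompileResult:
    ok: bool      # True when rustc exited with status 0
    stderr: str   # compiler diagnostics, kept for the report


def build_rustc_command(source: Path, out_dir: Path, rustc: str = "rustc") -> list[str]:
    # Assumed flags: compile as a library so no main() is required,
    # and write build artifacts into out_dir instead of the working directory.
    return [
        rustc,
        "--edition", "2021",
        "--crate-type", "lib",
        "--out-dir", str(out_dir),
        str(source),
    ]


def compile_check(rust_code: str, rustc: str = "rustc") -> CompileResult:
    """Write the generated Rust to a temp file and run the real compiler on it."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        source = tmp_path / "generated.rs"
        source.write_text(rust_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                build_rustc_command(source, tmp_path, rustc),
                capture_output=True,
                text=True,
                timeout=60,
            )
        except FileNotFoundError:
            return CompileResult(False, f"{rustc} not found on PATH")
        except subprocess.TimeoutExpired:
            return CompileResult(False, "rustc timed out")
        return CompileResult(proc.returncode == 0, proc.stderr)
```

Keeping the command builder separate from the runner makes the exact compiler invocation easy to log in a report, which fits the transparency goal described on this page.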

🎯 Scoring Dimensions (100 Points Total)

✓ Compilation (25%): Does it compile with rustc? Pass/fail verification.

🛡️ Safety (20%): Memory safety: unsafe blocks, error handling, Option types.

✨ Quality (20%): Idiomatic Rust patterns, documentation, naming conventions.

✔️ Correctness (15%): Functional equivalence, edge case handling, unit tests.

🔧 Maintainability (10%): Code organization, public API clarity, modularity.

⚡ Performance (10%): Efficiency patterns: minimal cloning, reference usage.
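As a rough illustration of how the six weights could combine into a single 0-100 score, here is a minimal sketch. The weights come from the list above; the function name `composite_score` and the assumption that each dimension is first normalised to a fraction between 0 and 1 are hypothetical, and the actual CELLO aggregation may differ.

```python
# Weights from the scoring dimensions above; they sum to 100.
WEIGHTS = {
    "compilation": 25,
    "safety": 20,
    "quality": 20,
    "correctness": 15,
    "maintainability": 10,
    "performance": 10,
}


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension fractions (each in [0.0, 1.0]) into a 0-100 total.

    Assumption: every dimension is normalised to 0..1 before weighting.
    """
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")

    total = 0.0
    for name, weight in WEIGHTS.items():
        fraction = dimension_scores[name]
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {fraction}")
        total += weight * fraction
    return total


example = {
    "compilation": 1.0,      # compiled successfully
    "safety": 0.5,
    "quality": 0.75,
    "correctness": 1.0,
    "maintainability": 0.5,
    "performance": 0.5,
}
print(composite_score(example))  # 25 + 10 + 15 + 15 + 5 + 5 = 75.0
```

Under this scheme a translation that fails to compile loses the full 25 compilation points but can still earn credit on the other dimensions; whether CELLO caps or zeroes the remaining dimensions in that case is not stated here.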

🔑 Key Principles

Reproducibility — All prompts, configs, and evaluation scripts are public

Real Compilation — Uses rustc 1.93+ for actual verification (not syntax check)

Transparency — Full source code, generated output, and errors in every report

Config-Driven — Add models/projects by editing YAML files (no code changes)

Open Source — Apache 2.0 license, community contributions welcome

Extensible — Modular provider pattern supports any LLM API
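The config-driven and provider-pattern principles above might look something like the sketch below. Everything in it is an assumption made for illustration: the `Provider` protocol, the `register` decorator, the `StubProvider` class, and the shape of the config dictionary (standing in for a parsed YAML file) are hypothetical and are not taken from the CELLO repository.

```python
from typing import Callable, Protocol


class Provider(Protocol):
    """Anything that can turn C source into Rust source."""

    def transpile(self, c_source: str) -> str: ...


# Maps a provider name, as it might appear in a config file,
# to a factory that builds the provider from its settings.
_REGISTRY: dict[str, Callable[[dict], Provider]] = {}


def register(name: str):
    """Class decorator that adds a provider factory to the registry."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


@register("stub")
class StubProvider:
    """Offline stand-in; a real provider would call an LLM API here."""

    def __init__(self, settings: dict):
        self.banner = settings.get("banner", "// stub output")

    def transpile(self, c_source: str) -> str:
        line_count = len(c_source.splitlines())
        return f"{self.banner}\n// {line_count} line(s) of C received\n"


def load_providers(config: dict) -> dict[str, Provider]:
    """Build one provider per model entry; unknown provider names raise KeyError."""
    providers = {}
    for entry in config["models"]:
        factory = _REGISTRY[entry["provider"]]
        providers[entry["name"]] = factory(entry.get("settings", {}))
    return providers


# Hypothetical config, shaped like what a parsed YAML file could contain.
config = {"models": [{"name": "demo", "provider": "stub", "settings": {"banner": "// hi"}}]}
providers = load_providers(config)
print(providers["demo"].transpile("int x;\nint y;\n"))
```

With a registry like this, adding a model is a matter of adding a config entry, and supporting a new API means registering one more class; the evaluation loop itself only depends on the `transpile` method.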

📁 Test Projects

Current proof-of-concept: three small C infrastructure files (1.7 KB to 3.3 KB) for C → Rust transpilation evaluation.

⚠️ Note: The D3.js and matplotlib projects below are future goals. Current evaluations use only the three C test files (string_utils, buffer, hashmap) available in the GitHub repository.

D3.js (JavaScript)

The definitive data visualization library for the web. Complex functional API, SVG/DOM manipulation, sophisticated mathematical algorithms for scales, projections, and force simulations.

⭐ 109k+ stars • 📦 Modular architecture • 🧮 Heavy algorithmic content

makepad-d3 (Rust)

D3.js-compatible data visualization library for Makepad's GPU-accelerated rendering. 40+ chart types including Bar, Line, Pie, Sankey, Force Graph, Treemap, Globe Map.

🔄 JS→Rust port • 🖥️ GPU rendering • 📊 D3-compatible API

matplotlib (Python)

Python's comprehensive 2D plotting library. Massive OOP codebase with NumPy integration, extensive customization, and decades of battle-tested code.

⭐ 20k+ stars • 🐍 Python + C extensions • 📈 Gold standard for scientific vis

makepad-matplot (Rust)

Matplotlib-compatible plotting library for Makepad GPU framework. Translates Python's OOP plotting paradigm into Rust's ownership model.

🔄 Python→Rust port • 🖥️ GPU rendering • 📊 matplotlib-compatible API

🎯 Why These Projects?

Cross-Language Pairs

JS↔Rust (D3↔makepad-d3) and Python↔Rust (matplotlib↔makepad-matplot) test translation understanding

Multiple Paradigms

Functional (D3), OOP (matplotlib), Systems (Rust) — tests versatility across programming styles

Algorithmic Depth

Scales, projections, force layouts, tree algorithms — requires genuine mathematical understanding

Visual Verification

Output can be visually verified — a chart either looks right or it doesn't

⚖️ CELLO vs. Existing Benchmarks

How CELLO fills the gaps left by current LLM code evaluation approaches.

Feature	SonarSource	HumanEval	SWE-bench	BigCodeBench	CELLO
Test Cases	~4,000	164	2,294	1,140	10,000+
Open-Source Focus	✗	✗	✗	~	✓
Multi-Language	✓	✗	✗	~	✓
Code Quality Metrics	✓	✗	✗	✗	✓
Security Analysis	✓	✗	✗	✗	✓
Performance Eval	✗	✗	✗	✗	✓
Real-World Projects	✗	✗	✓	✗	✓
Community-Driven	✗	✗	~	~	✓
Fully Transparent	✗	✓	✓	✓	✓
Live/Continuous	✗	✗	✗	✗	✓
Cross-Language Eval	✗	✗	✗	✗	✓

💡 Why CELLO Matters

Existing benchmarks were designed for a world where LLMs were primarily closed-source products. The open-source AI revolution has created a vibrant ecosystem of models that developers can self-host, fine-tune, and integrate, but these models still lack reliable code quality benchmarks.

SonarSource's benchmark, while pioneering, serves primarily as a marketing tool for their commercial SonarQube product. It uses a proprietary evaluation pipeline that cannot be independently reproduced, and focuses on models most developers cannot modify or self-host.

CELLO is built for the open-source community, by the open-source community. Every tool in the evaluation pipeline is OSI-approved. Every test case, prompt, and result is publicly available. Anyone can contribute, verify, and build upon the benchmark.