Benchmarks¶

SumoSpace ships with a reproducible evaluation framework. Every score is produced by a deterministic code-level verifier — no LLM grades another LLM's output.

What We Benchmark¶

The suite contains 5 tasks, each targeting a distinct class of autonomous coding work. All tasks run against the same fixture codebase — a small Python project in sumospace/benchmarks/fixtures/sample_project/.

#	Task	File	What the agent must do
1	`add_docstrings`	`utils.py`	Add a docstring to every undocumented function
2	`dead_code_removal`	`dead_code.py`	Remove unused functions and unused imports
3	`sync_to_async`	`sync_io.py`	Refactor all I/O functions to async/await
4	`fix_bugs`	`buggy.py`	Find and fix 4 distinct bugs
5	`explain_codebase`	(all)	Write a developer explanation of the codebase

Why these tasks? They cover the two main agent capabilities:

Write tasks (1–4) — the agent must produce a file mutation that passes a code-level check.
Read tasks (5) — the agent must produce a substantive, accurate explanation of real code.

How We Verify¶

No LLM scores LLM output. Every verifier is a pure Python function that inspects the AST or file contents directly.

Example: `add_docstrings` verifier¶

This is the exact code that produces the score:

sumospace/benchmarks/standalone.py

import ast

def verify_docstrings(workspace: Path) -> tuple[bool, float, str]:
    """Verify all functions in utils.py have docstrings."""
    target_file = workspace / "utils.py"
    if not target_file.exists():
        return False, 0.0, "utils.py not found"

    content = target_file.read_text(encoding="utf-8")
    tree = ast.parse(content)

    total_funcs = 0
    funcs_with_docs = 0

    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            total_funcs += 1
            if ast.get_docstring(node) is not None:
                funcs_with_docs += 1

    score = funcs_with_docs / total_funcs
    success = (funcs_with_docs == total_funcs)
    notes = f"{funcs_with_docs}/{total_funcs} functions have docstrings"
    return success, score, notes

The score is a float from 0.0 to 1.0. A score of 0.75 means 6 out of 8 functions received docstrings. success is only True when all functions have docstrings.

This approach means:

Results are reproducible — you can run the same benchmark on different hardware and get the same verdict.
Results are auditable — you can read the verifier code and understand exactly what passed.
Results are honest — a partial completion gets a partial score, not a binary pass/fail.

Run It Yourself¶

# Install
pip install sumospace

# Pull a model
ollama pull llama3:8b

# Run full benchmark (~2-3 hours on CPU, ~30 min on GPU)
sumo benchmark run --provider ollama --model llama3:8b

# Run one task only (~15 minutes)
sumo benchmark run --task add_docstrings --modes disabled,full --provider ollama --model llama3:8b

# View a saved result
cat benchmark_results/benchmark_*.md

# Re-render a JSON result as Markdown
sumo benchmark report benchmark_results/benchmark_20260507_120000.json

All results are saved to ./benchmark_results/ as both .md (human-readable) and .json (machine-readable).

Hardware Requirements¶

Hardware	Estimated time	Recommended model
CPU only, 8 GB RAM	2–3 hours	`phi3:mini`
CPU only, 16 GB RAM	~90 minutes	`llama3:8b`
GPU (RTX 3090+)	20–30 minutes	`llama3:8b`
GPU (A100)	8–12 minutes	`llama3:70b`

Small models

Models under 7B parameters (e.g. phi3:mini) lack the instruction-following fidelity to reliably pass committee JSON schema validation. Use --modes disabled to evaluate small models on raw task completion, bypassing the committee pipeline.

Results¶

Community validation in progress

We are collecting benchmark results across hardware configurations before publishing official numbers. We do not publish scores we cannot independently reproduce.

Once results are available, they will appear here in this format:

Task	disabled	plan_only	critique_only	full
`add_docstrings`	—	—	—	—
`dead_code_removal`	—	—	—	—
`sync_to_async`	—	—	—	—
`fix_bugs`	—	—	—	—
`explain_codebase`	—	—	—	—

Community Results¶

Run the benchmark and share your results:

Run the benchmark:

sumo benchmark run --provider ollama --model <your-model>

Open a GitHub issue titled [Benchmark] <model> on <hardware>.
Attach the full output .md file and the .json file from ./benchmark_results/.

Please include:

Model name and size
Hardware (CPU / GPU, RAM)
SumoSpace version (sumo --version)
The full output file