
Authoring a CUBE

So you want to wrap a benchmark as a CUBE. This guide walks through what that involves, the easiest way to start, and who to ping if your benchmark isn’t a clean fit.

The short version: you implement four Python classes (tool, task, benchmark, debug agent), run cube test to prove it works, and submit one YAML file to the registry. Most benchmarks fit this shape naturally once it clicks.

Three Claude Code skills cover the workflow end to end:

| Skill | Phase | What it does |
|---|---|---|
| /new-cube | Scaffold | Interviews you and writes the four classes + registry entry. (skill) |
| /review-cube | Audit | Installs the package, runs cube test, audits against invariants, produces a Blocking/Suggestions report. (skill) |
| /auto-cube | Iterate | Runs real LLMs against the cube, classifies failures (infra / scaffold / model / benchmark), ships fixes. This is the deep-debug pass that cube test cannot do. (skill, README) |

/auto-cube lives in cube-harness (the runtime); the other two live here. Each phase has its own section below.

What you’re building

A CUBE package exposes a benchmark through a uniform protocol so any CUBE-compatible harness can run it without custom integration. You implement four things:

| Layer | What it answers |
|---|---|
| Tool | What actions can the agent take? (e.g. click, type, run_shell) |
| Task | What’s the initial observation, and how do I score a solution? |
| Benchmark | What’s the list of tasks, and what shared setup do they need? |
| Debug | One deterministic solution per task; proves everything wires up correctly. |

Each layer has a small, well-defined contract. Formal per-layer specs live in openspec/specs/ when you want the contract-level detail.
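To make the four layers concrete, here is a minimal, self-contained sketch built around a toy counter benchmark. It deliberately does not import cube-standard: every class name, method signature, and the `run_debug` helper below is an illustrative stand-in, not the real base classes, whose actual contracts live in openspec/specs/.

```python
from dataclasses import dataclass, field


class CounterTool:
    """Tool layer (stand-in): the actions the agent can take."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> str:
        self.value += 1
        return f"counter is now {self.value}"


@dataclass
class CounterTask:
    """Task layer (stand-in): the opening observation and the scoring rule."""

    task_id: str
    target: int
    tool: CounterTool = field(default_factory=CounterTool)

    def reset(self) -> str:
        # Start every episode from a known state and return the opening observation.
        self.tool.value = 0
        return f"Increment the counter until it reads {self.target}."

    def evaluate(self) -> float:
        # Reward is 1.0 only when the counter sits exactly on the target.
        return 1.0 if self.tool.value == self.target else 0.0


class CounterBenchmark:
    """Benchmark layer (stand-in): the task list plus any shared setup."""

    def __init__(self, targets: dict[str, int]) -> None:
        self.targets = targets

    def tasks(self) -> list[CounterTask]:
        return [CounterTask(task_id, target) for task_id, target in self.targets.items()]


def run_debug(task: CounterTask) -> float:
    """Debug layer (stand-in): one deterministic solution for the task."""
    task.reset()
    for _ in range(task.target):
        task.tool.increment()
    return task.evaluate()


benchmark = CounterBenchmark({"reach-2": 2, "reach-5": 5})
rewards = {task.task_id: run_debug(task) for task in benchmark.tasks()}
print(rewards)  # {'reach-2': 1.0, 'reach-5': 1.0}
```

The point of the shape is the separation: the tool knows nothing about scoring, the task knows nothing about the task list, and the debug solution exercises all of them in one pass.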

Three ways to start

Option 1: Use the /new-cube skill

If you have Claude Code installed, open a workspace with cube-standard checked out and run:

/new-cube

The skill interviews you (what the agent does, what counts as a solved task, what resources you need), scaffolds the package, fills in the template TODOs, runs the debug suite, and produces a registry entry. You review and correct as it goes — it handles the boilerplate so you focus on the benchmark logic.

Option 2 — Copy the reference implementation

cp -r examples/counter-cube my-bench
cd my-bench && uv sync

counter-cube is the minimal real cube — increment a counter to reach a target. Every layer has a comment explaining its role. Rename the placeholders, replace the logic with yours.

Option 3 — Scaffold from the template

cube init my-bench
cd my-bench && uv sync

Blank slate with TODO markers at every decision point. Best if your benchmark’s shape doesn’t resemble counter-cube.

Implementation order

Work through the layers top-down. Each file has TODO comments pointing at what needs to change.

| # | File | What to fill in |
|---|---|---|
| 1 | tool.py | Check reusable tools first: for web agents use cube-browser-tool, for desktop/CUA use cube-computer-tool. Import directly or subclass to add benchmark-specific actions. Only subclass Tool from scratch if neither fits; mark methods with @tool_action. |
| 2 | task.py | Implement reset() (opening observation) and evaluate() (reward on termination) |
| 3 | benchmark.py | Fill BenchmarkMetadata and task_metadata (inline or CSV/JSON); implement _setup() / close() |
| 4 | debug.py | One deterministic action sequence per task; each must reach reward == 1.0 |
| 5 | pyproject.toml | Update name, description, and the cube.benchmarks entry-point |

A note on task metadata. If you have more than a handful of tasks, you’ll load them from task_metadata.csv or .json rather than inlining them in benchmark.py. If your task data starts life in a different shape (scraped from a website, exported from an existing benchmark repo, hand-curated in a spreadsheet), expect to write a small one-off conversion script as a pre-step. The /new-cube skill walks you through this; if you follow Option 2 or 3, you’ll handle it manually.
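A conversion pre-step of this kind can be very small. The sketch below assumes a spreadsheet exported as CSV with two hypothetical columns, "Task Name" and "Goal", and maps each row to a flat record. The output field names (`task_id`, `goal`) are placeholders for illustration, not the actual TaskMetadata schema; check the task spec for the fields your cube needs.

```python
import csv
import io
import json


def convert_rows(source_csv: str) -> list[dict]:
    """Map spreadsheet rows onto flat task records with slug-style ids."""
    records = []
    for row in csv.DictReader(io.StringIO(source_csv)):
        name = row["Task Name"].strip()
        records.append(
            {
                "task_id": name.lower().replace(" ", "-"),  # hypothetical field name
                "goal": row["Goal"].strip(),  # hypothetical field name
            }
        )
    return records


# Inline sample so the sketch runs as-is; a real script would read the exported file.
SAMPLE = (
    "Task Name,Goal\n"
    "Reach Two,Increment the counter to 2\n"
    "Reach Five,Increment the counter to 5\n"
)

records = convert_rows(SAMPLE)
print(json.dumps(records, indent=2))
```

In a real pre-step you would write the result next to the benchmark module (for example with `json.dump` into task_metadata.json) and commit the script alongside it so the conversion stays reproducible.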

Framework invariants are in the layer specs — return types and action wrapping in tool/spec.md, serialization and reward semantics in task/spec.md. Read them once before you’re deep in the code.

Validate

cube test my-bench

Every debug task must hit reward == 1.0. If one doesn’t, either the debug action sequence is wrong or the task’s evaluate() is wrong — catching this locally is the whole point of the debug suite.

Before you open a registry PR, self-audit with the /review-cube skill:

/review-cube ./my-bench

/review-cube installs your package, runs pytest, runs cube test, audits against cube-standard invariants, and produces a Blocking / Suggestions report. Resolve everything in the Blocking section before submitting. Registry CI catches the same issues later, but locally is faster and less public.

Prompt hints and per-task clarifications

A benchmark may provide two optional, agent-facing prompt strings that steer how harnesses present it, without rewriting the benchmark itself. The benchmark only stores them; cube-standard loads them, and a harness folds them into the agent config at experiment-design time.

Both live in an optional benchmark_clarifications.py sidecar next to the benchmark’s module (mirroring the task_metadata “files next to the module” convention), exposing two module-level names:

# my_cube/benchmark_clarifications.py
BENCHMARK_HINT = "Submit your final answer with final_step."

_SLIDER_TASKS = ["slider-1", "slider-2", "slider-3"]
TASK_CLARIFICATION = {tid: "After setting the values, click submit." for tid in _SLIDER_TASKS}

BENCHMARK_HINT — one concise paragraph for conventions a first-time reader would miss but that aren’t specific to any single task: a high-level workflow, the shape of a verifier, a recurring ambiguity. Keep it short and generic — a generalist agent should remain competitive without it; it exists so opt-in evaluations report on a level playing field, not so authors engineer prompts for one model.

TASK_CLARIFICATION — a {task_id: text} dict for individually brittle tasks whose original wording omits a step a reasonable LLM would not infer. Canonical example: a miniwob task whose objective reads “set slider to 32 and string value to ‘foo’” but whose verifier only rewards if the agent then clicks submit — a competent LLM would not click submit unprompted, and that is not really the LLM’s fault. Because it’s a .py, one clarification can be reused across many task ids (above) — that reuse is a deliberate regularizer, pushing clarifications to generalize rather than overfit per task.

A harness loads both via BenchmarkConfig.load_benchmark_clarifications() (returns (benchmark_hint, task_clarification); empty when no sidecar exists), then folds them into the agent config at experiment-design time — e.g.:

overlay = MyBenchmarkConfig.load_benchmark_clarifications()
agent = GennyConfig(
    benchmark_hint_prompt=overlay.benchmark_hint,
    task_clarification=overlay.task_clarification,  # pass {} to run without clarifications
    ...,
)

Applying the overlay is the recipe’s explicit choice — to run a clean baseline, simply don’t pass it. There is no separate on/off flag.

Because both fields are metadata, third-party harnesses see them through the same serialization the rest of TaskMetadata / BenchmarkMetadata uses — no extra plumbing on the cube side.
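For harness authors, the folding step can be pictured as plain string composition. The sketch below is a hypothetical illustration: `compose_prompt` and the prompt layout are assumptions made for this example, not part of cube-standard or of any particular harness, which may place the hint and clarification elsewhere (for instance in a system prompt).

```python
from typing import Optional

# Same values as the sidecar example above.
BENCHMARK_HINT = "Submit your final answer with final_step."
TASK_CLARIFICATION = {
    tid: "After setting the values, click submit."
    for tid in ["slider-1", "slider-2", "slider-3"]
}


def compose_prompt(
    task_id: str,
    goal: str,
    hint: str = "",
    clarifications: Optional[dict[str, str]] = None,
) -> str:
    """Append the task's clarification (if any) and the benchmark hint (if any) to the goal."""
    parts = [goal]
    clarification = (clarifications or {}).get(task_id)
    if clarification:
        parts.append(f"Clarification: {clarification}")
    if hint:
        parts.append(f"Benchmark hint: {hint}")
    return "\n\n".join(parts)


goal = "Set slider to 32 and string value to 'foo'."
baseline = compose_prompt("slider-1", goal)  # clean baseline: no overlay passed
with_overlay = compose_prompt("slider-1", goal, BENCHMARK_HINT, TASK_CLARIFICATION)
print(with_overlay)
```

Because the overlay is only applied when it is passed in, the baseline and the clarified run differ by exactly one explicit argument, which keeps the two conditions easy to report side by side.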

Iterate (real LLMs find what the debug suite can’t)

cube test validates with the Debug agent — deterministic action sequences, no LLM. Real LLMs find a different class of issue: infra flakes, scaffold bugs, tasks that are technically solvable but practically impossible, scoring that’s too strict or too lenient, hidden environmental assumptions, and sometimes problems in the benchmark itself (ambiguous prompts, broken ground truth, contaminated training data leaking into the task description).

/auto-cube

/auto-cube lives in cube-harness. It runs an iterative experiment loop: sweep models × tool configs across a task subset, dispatch the Investigator sub-agent on every trajectory, classify failures (infra / scaffold / model / benchmark), and ship fixes via the auto-fix methodology. You get back a REPORT.md session rollup, one Fix Report PR per issue, and design-debt issues for systemic signals.

Figure: Auto-CUBE outer loop (dispatch → per-experiment Investigator on each trajectory → analysis → interventions).
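The four failure classes are easier to reason about as a small data structure. The sketch below tallies per-trajectory verdicts by class. `FailureClass`, `tally`, and the sample verdicts are all hypothetical and made up for illustration; they are not the cube-harness API or real results.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    INFRA = "infra"
    SCAFFOLD = "scaffold"
    MODEL = "model"
    BENCHMARK = "benchmark"


def tally(verdicts: list[tuple[str, FailureClass]]) -> dict[str, int]:
    """Count verdicts per failure class, keeping classes with zero hits visible."""
    counts = Counter(cls.value for _, cls in verdicts)
    return {cls.value: counts.get(cls.value, 0) for cls in FailureClass}


# Made-up (task_id, verdict) pairs, for illustration only.
verdicts = [
    ("slider-1", FailureClass.BENCHMARK),
    ("slider-2", FailureClass.BENCHMARK),
    ("login-3", FailureClass.INFRA),
]
print(tally(verdicts))  # {'infra': 1, 'scaffold': 0, 'model': 0, 'benchmark': 2}
```

Keeping the zero counts in the output makes it obvious at a glance which classes a session did not surface at all.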

Recommended once before registry submission, even if cube test and /review-cube are green — at least one real-LLM session against a fresh cube usually surfaces something. See the Auto-CUBE skill README for the prompt template and setup.

Publish

Once cube test and /review-cube both pass, submit to the registry with one command:

cube registry add --submit

This generates a cube-registry-entry.yaml from your pyproject.toml, forks cube-registry, commits the entry, and opens a PR. Registry CI runs three hard gates (ownership-check, quick-compliance, LLM semantic review) plus an informational pre-merge slow-check. When all hard gates are green and the diff is path-isolated, the PR auto-merges. If the LLM review flags a CONCERN (typical causes: package not yet on PyPI so the page is empty, README doesn’t cover the cube subdirectory, author handles can’t be confirmed against the linked repo’s git history), the PR is labeled ready-for-review for a maintainer.

Run cube registry add without --submit first if you want to generate the YAML locally, edit it, and review the entry before opening the PR.

Your package also needs to land on PyPI for the registry’s compliance suite to install and run it — publish whenever you’re ready; it doesn’t have to come before the registry PR.

We’ll help you

Not every benchmark is a clean fit on first read. If you hit something awkward (the action space doesn’t compress into a single Tool, scoring needs human judgment, the infra requirements are unusual, the episode structure doesn’t match reset → step → evaluate), tell us before wrangling it on your own. Common awkwardness usually has an idiomatic solution we can point you at, and if yours is genuinely new we’d rather evolve the protocol than watch you paper over it.

Ways to reach us:

Deeper references