Terminal-Bench 2

terminalbench2 · v0.1.0

Tags: coding, os

Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy, query, modernize) with pytest-based validation. Each task hands the agent a Linux shell pre-loaded with a project, asks for a concrete deliverable (a fixed bug, a passing test, a compiled binary, an inferred answer), and verifies the result by running an upstream pytest test suite the agent never sees. Tasks span 16 categories with difficulty levels easy / medium / hard.
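The verification step described above (running a pytest suite the agent never sees and scoring the outcome) can be sketched roughly as follows. This is a minimal illustration, not the cube's actual implementation: the function names, the binary reward scheme, and the timeout value are all assumptions made for this example.

```python
import subprocess
import sys
from pathlib import Path


def reward_from_exit_code(returncode: int) -> float:
    """Map a pytest exit code to a binary reward (assumed scoring scheme)."""
    # pytest exits with 0 only when every collected test passed; any other
    # code (failures, errors, no tests collected) is treated as a failed task.
    return 1.0 if returncode == 0 else 0.0


def verify_task(workdir: Path, hidden_tests: Path, timeout_s: int = 600) -> float:
    """Run a hidden test suite against the agent's working directory.

    `workdir` and `hidden_tests` are hypothetical paths: the tests are assumed
    to be copied in only after the agent has finished, so it never sees them.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pytest", str(hidden_tests.resolve()), "-q"],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A suite that never finishes is scored as a failure in this sketch.
        return 0.0
    return reward_from_exit_code(result.returncode)
```

Keeping the exit-code mapping in its own function makes the scoring rule easy to inspect and test without launching pytest.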

By: @recursix (Alexandre Lacoste)

Install

pip install terminalbench2-cube

Version: 0.1.0
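After installing, one way to confirm that the expected release is present is to query the installed distribution metadata with the standard library. The helper names below are hypothetical and exist only for this sketch; the package name and version are the ones listed on this page.

```python
from importlib import metadata
from typing import Optional

PACKAGE = "terminalbench2-cube"   # distribution name from the install command
EXPECTED_VERSION = "0.1.0"        # version listed on this page


def installed_version(package: str = PACKAGE) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_install(found: Optional[str], expected: str = EXPECTED_VERSION) -> str:
    """Summarize whether the installed version matches the expected one."""
    if found is None:
        return "not installed"
    if found == expected:
        return "ok"
    return f"version mismatch: found {found}, expected {expected}"


if __name__ == "__main__":
    print(describe_install(installed_version()))
```

Recording this version alongside a run is useful because journal submissions are compared across package versions.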

Tasks: 89

Infra: local

Debug Task: Yes

Debug Agent: Yes

Feature Flags

async: no

streaming: no

multi_agent: no

multi_dim_reward: no

Legal

Wrapper license: MIT

Benchmark license: Apache-2.0 (self-reported; verify before commercial use). Source: https://github.com/harbor-framework/terminal-bench-2

License information is self-reported by the cube developer and has not been verified by the AI Alliance. Always consult the source URL and original benchmark authors for authoritative terms.

Slow check not yet run. Stress test results will appear here after the async compliance check completes.

Reproducibility journal (2 submissions)

This is a reproducibility journal — not a leaderboard.

Submissions document how reference agents and models score over time, across infrastructures, cube versions, and package versions. Use the journal to detect drift and validate environments. It is not a place to publish a new agent or a fine-tune in order to "win": there is no ranking, scores are self-reported, and submissions are unverified. To showcase a new agent or model, use ATLAS / EEE / your own benchmark page.

Legend: ✓ success · ✗ failure · ⏱ max-steps · 💥 system error · – missing
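Comparing two submissions to spot drift amounts to lining up per-task outcomes and listing the tasks where they differ. The sketch below assumes each submission is available as a mapping from task ID to an outcome label; the label strings, the function name, and the task IDs are hypothetical, and only the symbols come from the legend.

```python
from typing import Dict, List, Tuple

# Outcome label -> legend symbol. The label strings are assumed for this sketch.
SYMBOLS = {
    "success": "✓",
    "failure": "✗",
    "max_steps": "⏱",
    "system_error": "💥",
    "missing": "–",
}


def diff_outcomes(a: Dict[str, str], b: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (task_id, symbol_in_a, symbol_in_b) for tasks whose outcomes differ."""
    rows = []
    for task in sorted(set(a) | set(b)):
        # A task absent from one submission is reported as "missing" there.
        outcome_a = a.get(task, "missing")
        outcome_b = b.get(task, "missing")
        if outcome_a != outcome_b:
            rows.append((task, SYMBOLS[outcome_a], SYMBOLS[outcome_b]))
    return rows


# Made-up task IDs and outcomes, purely for illustration.
run_a = {"task-a": "success", "task-b": "failure", "task-c": "max_steps"}
run_b = {"task-a": "success", "task-b": "success"}

for task, sym_a, sym_b in diff_outcomes(run_a, run_b):
    print(f"{task}: {sym_a} -> {sym_b}")
```

A non-empty diff between two runs of the same agent and model on different infrastructure or package versions is the kind of drift the journal is meant to surface.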

Registry Entry (YAML)


id: terminalbench2
name: "Terminal-Bench 2"
version: "0.1.0"
description: >
  Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the
  CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy,
  query, modernize) with pytest-based validation. Each task hands the
  agent a Linux shell pre-loaded with a project, asks for a concrete
  deliverable (a fixed bug, a passing test, a compiled binary, an
  inferred answer), and verifies the result by running an upstream pytest
  test suite the agent never sees. Tasks span 16 categories with
  difficulty levels easy / medium / hard.
package: terminalbench2-cube
dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/terminalbench2-cube"

authors:
- github: recursix
  name: Alexandre Lacoste

legal:
  wrapper_license: MIT
  benchmark_license:
    reported: Apache-2.0
    source_url: "https://github.com/harbor-framework/terminal-bench-2"
    verified_by_original_authors: false

getting_started_url: "https://github.com/harbor-framework/terminal-bench-2"
tags:
- coding
- os
status: degraded
resources: []
task_count: 89
has_debug_task: true
has_debug_agent: true
action_space: []
features:
  async: false
  streaming: false
  multi_agent: false
  multi_dim_reward: false
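A registry entry like the one above can be sanity-checked once it has been loaded into a dictionary. The checks below are an illustrative sketch, not the registry's actual schema or validator: the function name and the specific rules are assumptions, and the dictionary reproduces only a subset of the fields shown in the YAML.

```python
from typing import Any, Dict, List

# Feature flags that appear in the entry above.
REQUIRED_FEATURES = ("async", "streaming", "multi_agent", "multi_dim_reward")


def check_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for key in ("id", "name", "version", "package"):
        value = entry.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty field: {key}")
    count = entry.get("task_count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(count, int) or isinstance(count, bool) or count <= 0:
        problems.append("task_count must be a positive integer")
    features = entry.get("features") or {}
    for flag in REQUIRED_FEATURES:
        if not isinstance(features.get(flag), bool):
            problems.append(f"feature flag must be a boolean: {flag}")
    return problems


# Subset of the entry above, written out as a Python dict.
entry = {
    "id": "terminalbench2",
    "name": "Terminal-Bench 2",
    "version": "0.1.0",
    "package": "terminalbench2-cube",
    "task_count": 89,
    "features": {
        "async": False,
        "streaming": False,
        "multi_agent": False,
        "multi_dim_reward": False,
    },
}

print(check_entry(entry))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with an entry in a single pass.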
