Terminal-Bench 2
terminalbench2 · v0.1.0
Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy, query, modernize) with pytest-based validation. Each task hands the agent a Linux shell pre-loaded with a project, asks for a concrete deliverable (a fixed bug, a passing test, a compiled binary, an inferred answer), and verifies the result by running an upstream pytest test suite the agent never sees. Tasks span 16 categories with difficulty levels easy / medium / hard.
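The validation scheme described above can be pictured with a minimal sketch. Everything in it is hypothetical: the task (leave a sorted, de-duplicated word list in `words_sorted.txt`), the file name, and the helper `check_deliverable` are invented for illustration and are not taken from the benchmark or from the cube's API. It only shows the general shape of a hidden pytest-style check that inspects the agent's deliverable once the episode is over.

```python
import tempfile
from pathlib import Path


def check_deliverable(workdir: Path) -> None:
    """Hypothetical hidden check; raises AssertionError if the deliverable is wrong."""
    out = workdir / "words_sorted.txt"  # hypothetical deliverable name
    assert out.exists(), "deliverable file is missing"
    lines = out.read_text().splitlines()
    assert lines, "deliverable is empty"
    assert lines == sorted(set(lines)), "lines must be sorted and unique"


def test_deliverable() -> None:
    """pytest would collect this function; the agent never sees this file."""
    # A real task would point at the project directory the agent worked in.
    # Here a throwaway directory stands in for the agent's output.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "words_sorted.txt").write_text("apple\nbanana\ncherry\n")
        check_deliverable(workdir)
```

Because the check looks only at the final state of the working directory, the agent is free to reach that state however it likes; it cannot, however, tailor its output to assertions it has never read.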
Install
pip install terminalbench2-cube
Version: 0.1.0 · PyPI page
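After installing, one way to confirm that the installed distribution matches the version listed here is to query the package metadata. This sketch uses only the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration, and the distribution name is the one from the install command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    expected = "0.1.0"  # version shown on this page
    found = installed_version("terminalbench2-cube")
    if found is None:
        print("terminalbench2-cube is not installed")
    elif found != expected:
        print(f"installed {found}, page lists {expected}")
    else:
        print(f"terminalbench2-cube {found} is installed")
```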
Reproducibility journal (2 submissions)
This is a reproducibility journal, not a leaderboard.
Submissions document how reference agents and models score over time, across infrastructures, cube versions, and package versions. Use the journal to detect drift and validate environments. It is not a place to publish a new agent or fine-tune to "win": there is no ranking, scores are self-reported, and submissions are unverified. To showcase a new agent or model, use ATLAS / EEE / your own benchmark page.
✓ success · ✗ failure · ⏱ max-steps · 💥 system error · – missing
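To illustrate the kind of drift detection the journal is meant for, here is a minimal sketch that compares per-task outcomes between two runs. It assumes each run is available as a mapping from task name to one of the outcome labels in the legend (success, failure, max-steps, system error, missing); that data shape, the function names, and the task names in the example are all assumptions, and the sample outcomes are placeholders rather than real results.

```python
from typing import Dict, List, Tuple

# Outcome labels as they appear in the legend above.
OUTCOMES = ("success", "failure", "max-steps", "system error", "missing")


def success_rate(run: Dict[str, str]) -> float:
    """Fraction of tasks in the run whose outcome is 'success'."""
    if not run:
        return 0.0
    return sum(1 for outcome in run.values() if outcome == "success") / len(run)


def changed_tasks(old: Dict[str, str], new: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """List (task, old outcome, new outcome) for every task whose outcome differs.

    A task that is absent from one run is treated as 'missing' in that run.
    """
    changes = []
    for task in sorted(set(old) | set(new)):
        before = old.get(task, "missing")
        after = new.get(task, "missing")
        if before != after:
            changes.append((task, before, after))
    return changes


if __name__ == "__main__":
    # Placeholder data, invented for this example only.
    old_run = {"task-a": "success", "task-b": "failure", "task-c": "success"}
    new_run = {"task-a": "success", "task-b": "success", "task-c": "system error"}
    for task, before, after in changed_tasks(old_run, new_run):
        print(f"{task}: {before} -> {after}")
    print(f"success rate: {success_rate(old_run):.2f} -> {success_rate(new_run):.2f}")
```

Comparing task by task matters because an aggregate score can stay flat while individual tasks flip, as in the placeholder data above; a shift from success to system error usually points at the environment rather than the agent.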
Registry Entry (YAML)
id: terminalbench2
name: "Terminal-Bench 2"
version: "0.1.0"
description: >
  Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the
  CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy,
  query, modernize) with pytest-based validation. Each task hands the
  agent a Linux shell pre-loaded with a project, asks for a concrete
  deliverable (a fixed bug, a passing test, a compiled binary, an
  inferred answer), and verifies the result by running an upstream pytest
  test suite the agent never sees. Tasks span 16 categories with
  difficulty levels easy / medium / hard.
package: terminalbench2-cube
dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/terminalbench2-cube"
authors:
  - github: recursix
    name: Alexandre Lacoste
legal:
  wrapper_license: MIT
  benchmark_license:
    reported: Apache-2.0
    source_url: "https://github.com/harbor-framework/terminal-bench-2"
    verified_by_original_authors: false
getting_started_url: "https://github.com/harbor-framework/terminal-bench-2"
tags:
  - coding
  - os
status: degraded
resources: []
task_count: 89
has_debug_task: true
has_debug_agent: true
action_space: []
features:
  async: false
  streaming: false
  multi_agent: false
  multi_dim_reward: false
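A registry entry like the one above can be sanity-checked once it has been loaded into a dictionary (for example with a YAML parser). The sketch below is illustrative only: the set of required keys and the rules it enforces are assumptions chosen for this example, not the registry's actual schema, and `check_entry` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Assumed for illustration; the real registry schema may require other keys.
REQUIRED_KEYS = ("id", "name", "version", "package", "task_count", "features")


def check_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]

    task_count = entry.get("task_count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(task_count, int) or isinstance(task_count, bool) or task_count <= 0:
        problems.append("task_count must be a positive integer")

    features = entry.get("features", {})
    if not isinstance(features, dict):
        problems.append("features must be a mapping")
    else:
        for name, value in features.items():
            if not isinstance(value, bool):
                problems.append(f"feature flag {name!r} must be true or false")
    return problems


# A subset of the entry above, written out as a plain dictionary.
entry = {
    "id": "terminalbench2",
    "name": "Terminal-Bench 2",
    "version": "0.1.0",
    "package": "terminalbench2-cube",
    "task_count": 89,
    "features": {
        "async": False,
        "streaming": False,
        "multi_agent": False,
        "multi_dim_reward": False,
    },
}

print(check_entry(entry))  # prints []
```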