CUBE Registry — Benchmark Catalog

Degraded 125 tasks

MiniWob++

MiniWob++ (Mini World of Bits++) is a collection of 125 browser-based web-interaction tasks ranging from simple button clicks to multi-step form filling. The CUBE wrapper starts a local HTTP server th…

web gui

Details → Paper Docs PyPI

Degraded 368 tasks

OSWorld

OSWorld benchmarks multimodal agents on open-ended computer tasks executed inside a real Ubuntu 22.04 desktop environment. Tasks span 369 scenarios across applications such as Chrome, LibreOffice, Thu…

os gui desktop multimodal

Details → Paper PyPI

Degraded 1895 tasks

SWE-bench Live

SWE-bench Live ported to the CUBE protocol: 1,895 continuously updated, contamination-resistant GitHub issue resolution tasks across many open-source repositories. Each task pairs a real issue with i…

coding science

Details → Paper Docs PyPI

Degraded 500 tasks

SWE-bench Verified

SWE-bench Verified ported to the CUBE protocol: 500 human-validated GitHub issues with test-based resolution criteria. A subset of the broader SWE-bench dataset, curated by Princeton and OpenAI, where every…

coding

Details → Paper Docs PyPI

Degraded 89 tasks

Terminal-Bench 2

Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy, query, modernize) with pytest-based validation. Each task hand…

coding os

Details → Docs PyPI

Active 812 tasks

WebArena Verified

WebArena Verified benchmarks agents on 812 verified web automation tasks across 6 realistic web platforms: Magento shopping admin and storefront, a Reddit clone (Postmill), GitLab CE, Wikipedia (Kiwix…

web gui

Details → Paper Docs PyPI

Degraded 333 tasks

WorkArena

WorkArena evaluates agents on enterprise service-desk workflows inside a real ServiceNow Personal Developer Instance. Tasks are organized into three levels: L1 atomic tasks (~33 unique tasks x multipl…

web gui

Details → Paper PyPI

Benchmark	ID	Version	Tags	Tasks	Infra	Debug	License	Status
MiniWob++	miniwob	1.0.0	web gui	125	local	✓	MIT	Degraded
OSWorld	osworld	0.2.0	os gui desktop multimodal	368	aws	✓	CC-BY-4.0	Degraded
SWE-bench Live	swebench-live	0.1.0	coding science	1895	local	✓	MIT	Degraded
SWE-bench Verified	swebench-verified	0.1.0	coding	500	local	✓	MIT	Degraded
Terminal-Bench 2	terminalbench2	0.1.0	coding os	89	local	✓	Apache-2.0	Degraded
WebArena Verified	webarena-verified	1.0.0	web gui	812	local	✓	Apache-2.0	Active
WorkArena	workarena	1.0.0	web gui	333	local	✓	Apache-2.0	Degraded
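
The summary table above can be loaded into a small in-memory structure for filtering by tag or status. The sketch below is illustrative only: the `Benchmark` dataclass and the `by_tag`, `by_status`, and `total_tasks` helpers are hypothetical names and are not part of the CUBE registry or any CUBE API. The rows are copied from the table, with the Debug column omitted because every row is checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """One row of the catalog table (hypothetical container, not a CUBE type)."""
    name: str
    id: str
    version: str
    tags: tuple[str, ...]
    tasks: int
    infra: str
    license: str
    status: str


# Rows transcribed from the summary table above.
CATALOG = [
    Benchmark("MiniWob++", "miniwob", "1.0.0", ("web", "gui"), 125, "local", "MIT", "Degraded"),
    Benchmark("OSWorld", "osworld", "0.2.0", ("os", "gui", "desktop", "multimodal"), 368, "aws", "CC-BY-4.0", "Degraded"),
    Benchmark("SWE-bench Live", "swebench-live", "0.1.0", ("coding", "science"), 1895, "local", "MIT", "Degraded"),
    Benchmark("SWE-bench Verified", "swebench-verified", "0.1.0", ("coding",), 500, "local", "MIT", "Degraded"),
    Benchmark("Terminal-Bench 2", "terminalbench2", "0.1.0", ("coding", "os"), 89, "local", "Apache-2.0", "Degraded"),
    Benchmark("WebArena Verified", "webarena-verified", "1.0.0", ("web", "gui"), 812, "local", "Apache-2.0", "Active"),
    Benchmark("WorkArena", "workarena", "1.0.0", ("web", "gui"), 333, "local", "Apache-2.0", "Degraded"),
]


def by_tag(catalog, tag):
    """Return the benchmarks that carry the given tag."""
    return [b for b in catalog if tag in b.tags]


def by_status(catalog, status):
    """Return the benchmarks whose status matches exactly."""
    return [b for b in catalog if b.status == status]


def total_tasks(catalog):
    """Sum the task counts over a list of benchmarks."""
    return sum(b.tasks for b in catalog)


if __name__ == "__main__":
    print([b.id for b in by_status(CATALOG, "Active")])
    print([b.id for b in by_tag(CATALOG, "coding")])
    print(total_tasks(by_tag(CATALOG, "web")))
```

Storing tags as a tuple keeps each row hashable and makes tag membership a simple `in` check. A real client would presumably read these fields from the registry itself instead of hard-coding them.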