Any CUBE-compliant evaluation platform can discover and run these benchmarks without custom integration.
pip install <package>
→ ready to evaluate
MiniWob++ (Mini World of Bits++) is a collection of 125 browser-based web-interaction tasks ranging from simple button clicks to multi-step form filling. The CUBE wrapper starts a local HTTP server th…
OSWorld benchmarks multimodal agents on open-ended computer tasks executed inside a real Ubuntu 22.04 desktop environment. Tasks span 369 scenarios across applications such as Chrome, LibreOffice, Thu…
SWE-bench Live ported to the CUBE protocol: 1,895 continuously updated, contamination-resistant GitHub issue-resolution tasks across many open-source repositories. Each task pairs a real issue with i…
SWE-bench Verified ported to the CUBE protocol — 500 human-validated GitHub issues with test-based resolution criteria. Princeton + OpenAI's curated subset of the broader SWE-bench dataset where every…
Terminal-Bench 2 (Laude Institute / Harbor Framework) ported to the CUBE protocol — 89 real-world terminal tasks (compile, debug, deploy, query, modernize) with pytest-based validation. Each task hand…
WebArena Verified benchmarks agents on 812 verified web automation tasks across 6 realistic web platforms: Magento shopping admin and storefront, a Reddit clone (Postmill), GitLab CE, Wikipedia (Kiwix…
WorkArena evaluates agents on enterprise service-desk workflows inside a real ServiceNow Personal Developer Instance. Tasks are organized into three levels: L1 atomic tasks (~33 unique tasks x multipl…
| Benchmark | Version | Tags | Tasks | Infra | Debug | License | Status |
|---|---|---|---|---|---|---|---|
| MiniWob++ (`miniwob`) | 1.0.0 | web, gui | 125 | local | ✓ | MIT | Degraded |
| OSWorld (`osworld`) | 0.2.0 | os, gui, desktop, multimodal | 368 | aws | ✓ | CC-BY-4.0 | Degraded |
| SWE-bench Live (`swebench-live`) | 0.1.0 | coding, science | 1895 | local | ✓ | MIT | Degraded |
| SWE-bench Verified (`swebench-verified`) | 0.1.0 | coding | 500 | local | ✓ | MIT | Degraded |
| Terminal-Bench 2 (`terminalbench2`) | 0.1.0 | coding, os | 89 | local | ✓ | Apache-2.0 | Degraded |
| WebArena Verified (`webarena-verified`) | 1.0.0 | web, gui | 812 | local | ✓ | Apache-2.0 | Active |
| WorkArena (`workarena`) | 1.0.0 | web, gui | 333 | local | ✓ | Apache-2.0 | Degraded |
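
For readers who want to work with the table programmatically, here is a minimal sketch that holds the rows above as plain Python data and filters them by tag, infrastructure, or status. The `Benchmark` class, the `REGISTRY` list, and `filter_benchmarks` are hypothetical names introduced for illustration. They are not part of the CUBE protocol or any of the listed packages, and the values are copied from the table as shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Benchmark:
    """One row of the table above (hypothetical container, not a CUBE type)."""
    name: str
    id: str
    version: str
    tags: Tuple[str, ...]
    tasks: int
    infra: str
    license: str
    status: str


# Values transcribed from the table; this is illustrative data, not a live registry.
REGISTRY: List[Benchmark] = [
    Benchmark("MiniWob++", "miniwob", "1.0.0", ("web", "gui"), 125, "local", "MIT", "Degraded"),
    Benchmark("OSWorld", "osworld", "0.2.0", ("os", "gui", "desktop", "multimodal"), 368, "aws", "CC-BY-4.0", "Degraded"),
    Benchmark("SWE-bench Live", "swebench-live", "0.1.0", ("coding", "science"), 1895, "local", "MIT", "Degraded"),
    Benchmark("SWE-bench Verified", "swebench-verified", "0.1.0", ("coding",), 500, "local", "MIT", "Degraded"),
    Benchmark("Terminal-Bench 2", "terminalbench2", "0.1.0", ("coding", "os"), 89, "local", "Apache-2.0", "Degraded"),
    Benchmark("WebArena Verified", "webarena-verified", "1.0.0", ("web", "gui"), 812, "local", "Apache-2.0", "Active"),
    Benchmark("WorkArena", "workarena", "1.0.0", ("web", "gui"), 333, "local", "Apache-2.0", "Degraded"),
]


def filter_benchmarks(
    registry: List[Benchmark],
    tag: Optional[str] = None,
    infra: Optional[str] = None,
    status: Optional[str] = None,
) -> List[Benchmark]:
    """Return rows matching every filter that is set; unset filters match all rows."""
    return [
        b for b in registry
        if (tag is None or tag in b.tags)
        and (infra is None or b.infra == infra)
        and (status is None or b.status == status)
    ]


# Web benchmarks that run on local infrastructure.
print([b.id for b in filter_benchmarks(REGISTRY, tag="web", infra="local")])
# → ['miniwob', 'webarena-verified', 'workarena']

# Total task count across benchmarks tagged "coding".
print(sum(b.tasks for b in filter_benchmarks(REGISTRY, tag="coding")))
# → 2484
```

A filter combination that matches no row simply returns an empty list.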