Any CUBE-compliant evaluation platform can discover and run these benchmarks without custom integration.
`pip install <package>` → ready to evaluate
MiniWob++ (Mini World of Bits++) is a collection of 125 browser-based web-interaction tasks ranging from simple button clicks to multi-step form filling. The CUBE wrapper starts a local HTTP server th…
OSWorld benchmarks multimodal agents on open-ended computer tasks executed inside a real Ubuntu 22.04 desktop environment. Tasks span 369 scenarios across applications such as Chrome, LibreOffice, Thu…
WebArena Verified benchmarks agents on 812 verified web automation tasks across 6 realistic web platforms: Magento shopping admin and storefront, a Reddit clone (Postmill), GitLab CE, Wikipedia (Kiwix…
WorkArena evaluates agents on enterprise service-desk workflows inside a real ServiceNow Personal Developer Instance. Tasks are organized into three levels: L1 atomic tasks (~33 unique tasks x multipl…
| Benchmark | Version | Tags | Tasks | Infra | Debug | License | Status |
|---|---|---|---|---|---|---|---|
| MiniWob++ (`miniwob`) | 1.0.0 | web, gui | 125 | local | ✓ | MIT | Active |
| OSWorld (`osworld`) | 0.2.0 | os, gui, desktop, multimodal | 368 | aws | ✓ | CC-BY-4.0 | Active |
| WebArena Verified (`webarena-verified`) | 1.0.0 | web, gui | 812 | local | ✓ | Apache-2.0 | Active |
| WorkArena (`workarena`) | 1.0.0 | web, gui | 333 | local | ✓ | Apache-2.0 | Active |
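
The catalog above can be filtered by tag and infrastructure. The following is a minimal sketch of that filtering in Python, using the values from the table. `BenchmarkEntry`, `CATALOG`, and `filter_benchmarks` are hypothetical names chosen for illustration; they are not part of the CUBE API, which this page does not describe.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkEntry:
    """One row of the catalog table (hypothetical structure, for illustration only)."""
    name: str
    package_id: str
    version: str
    tags: tuple[str, ...]
    tasks: int
    infra: str
    license: str


# Values copied from the table above.
CATALOG = [
    BenchmarkEntry("MiniWob++", "miniwob", "1.0.0", ("web", "gui"), 125, "local", "MIT"),
    BenchmarkEntry("OSWorld", "osworld", "0.2.0",
                   ("os", "gui", "desktop", "multimodal"), 368, "aws", "CC-BY-4.0"),
    BenchmarkEntry("WebArena Verified", "webarena-verified", "1.0.0",
                   ("web", "gui"), 812, "local", "Apache-2.0"),
    BenchmarkEntry("WorkArena", "workarena", "1.0.0", ("web", "gui"), 333, "local", "Apache-2.0"),
]


def filter_benchmarks(catalog, tag=None, infra=None):
    """Return entries matching every filter that is set; None means 'do not filter'."""
    return [
        entry for entry in catalog
        if (tag is None or tag in entry.tags)
        and (infra is None or entry.infra == infra)
    ]


# Web benchmarks that run on local infrastructure.
local_web = filter_benchmarks(CATALOG, tag="web", infra="local")
print([entry.package_id for entry in local_web])
# → ['miniwob', 'webarena-verified', 'workarena']
```

A filter combination that matches no row returns an empty list, which corresponds to the page's "no benchmarks match" state.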