WebArena Verified

webarena-verified · v1.0.0

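Paper: https://arxiv.org/abs/2406.11955
Getting started: https://github.com/The-AI-Alliance/cube-harness/tree/main/cubes/webarena-verified
Tags: web, gui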

WebArena Verified benchmarks agents on 812 verified web automation tasks across 6 realistic web platforms: Magento shopping admin and storefront, a Reddit clone (Postmill), GitLab CE, Wikipedia (Kiwix), and OpenStreetMap with routing. Tasks span three types: information retrieval, data mutation, and navigation. Agents interact via browser tools; network traces (HAR) are captured and used during evaluation.

By: @NicolasAG (Nicolas Gontier), @younik (Omar Younis), @recursix (Alexandre Lacoste), @manuel-delverme (Manuel Delverme)

Install

pip install webarena-verified-cube

Version: 1.0.0

Tasks: 812
Infra: local
Debug Task: Yes
Debug Agent: Yes

Feature Flags

async: false
streaming: false
multi_agent: false
multi_dim_reward: false

Legal

Wrapper license: MIT
Benchmark license: Apache-2.0 (self-reported; verify before commercial use)
Source: https://github.com/WebArena-Verified/webarena-verified/blob/main/LICENSE

Notices

Live Website Clone

Requires running Docker containers that clone 6 web platforms (Magento, GitLab, Reddit/Postmill, Wikipedia/Kiwix, OpenStreetMap/OSRM). Containers are ephemeral and isolated — no real user data is affected.

License information is self-reported by the cube developer and has not been verified by the AI Alliance. Always consult the source URL and original benchmark authors for authoritative terms.
Slow check not yet run. Stress test results will appear here after the async compliance check completes.

Parallelization

Mode: task-parallel
Max concurrent tasks: 4
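
As a rough illustration of what a task-parallel mode capped at 4 concurrent tasks can look like, the sketch below runs independent tasks through a bounded thread pool. It is not the cube's actual runner: `run_task` and `run_all` are hypothetical names, and the placeholder body stands in for whatever launches a browser agent against one task.

```python
from concurrent.futures import ThreadPoolExecutor

# Taken from the registry entry's max_concurrent_tasks value.
MAX_CONCURRENT_TASKS = 4


def run_task(task_id: int) -> dict:
    """Hypothetical placeholder: a real runner would drive an agent on this task."""
    return {"task_id": task_id, "done": True}


def run_all(task_ids, max_workers: int = MAX_CONCURRENT_TASKS) -> list:
    # The pool size caps how many tasks execute at once;
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, task_ids))


results = run_all(range(8))
```

Because each task is independent, the cap only bounds resource use (for example, how many browser sessions hit the Docker containers at once); it does not change per-task results.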

Registry Entry (YAML)

id: webarena-verified
name: "WebArena Verified"
version: "1.0.0"
description: >
  WebArena Verified benchmarks agents on 812 verified web automation tasks
  across 6 realistic web platforms: Magento shopping admin and storefront,
  a Reddit clone (Postmill), GitLab CE, Wikipedia (Kiwix), and OpenStreetMap
  with routing. Tasks span three types: information retrieval, data mutation,
  and navigation. Agents interact via browser tools; network traces (HAR) are
  captured and used during evaluation.
package: webarena-verified-cube
dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/webarena-verified"

authors:
- github: NicolasAG
  name: Nicolas Gontier
- github: younik
  name: Omar Younis
- github: recursix
  name: Alexandre Lacoste
- github: manuel-delverme
  name: Manuel Delverme

legal:
  wrapper_license: MIT
  benchmark_license:
    reported: Apache-2.0
    source_url: "https://github.com/WebArena-Verified/webarena-verified/blob/main/LICENSE"
    verified_by_original_authors: false
  notices:
  - type: live_website_clone
    description: "Requires running Docker containers that clone 6 web platforms (Magento, GitLab, Reddit/Postmill, Wikipedia/Kiwix,
      OpenStreetMap/OSRM). Containers are ephemeral and isolated — no real user data is affected."

tags:
- web
- gui

paper: "https://arxiv.org/abs/2406.11955"
getting_started_url: "https://github.com/The-AI-Alliance/cube-harness/tree/main/cubes/webarena-verified"
parallelization_mode: task-parallel
max_concurrent_tasks: 4
status: active
resources: []
task_count: 812
has_debug_task: true
has_debug_agent: true
action_space: []
features:
  async: false
  streaming: false
  multi_agent: false
  multi_dim_reward: false
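
A few fields in an entry like the one above lend themselves to simple sanity checks. The sketch below assumes the entry has already been loaded into a Python dict (for example with a YAML parser) and applies illustrative rules only; `check_entry` is a hypothetical helper, not the registry's actual schema validation.

```python
# A subset of the entry above, written as a plain dict for illustration.
entry = {
    "id": "webarena-verified",
    "version": "1.0.0",
    "parallelization_mode": "task-parallel",
    "max_concurrent_tasks": 4,
    "task_count": 812,
    "features": {
        "async": False,
        "streaming": False,
        "multi_agent": False,
        "multi_dim_reward": False,
    },
}


def check_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    # Assumed-required fields (an illustrative choice, not the official schema).
    for key in ("id", "version", "task_count", "max_concurrent_tasks", "features"):
        if key not in entry:
            problems.append(f"missing field: {key}")
    # Counts must be positive integers; bool is excluded because it subclasses int.
    for key in ("task_count", "max_concurrent_tasks"):
        value = entry.get(key)
        if key in entry and (isinstance(value, bool) or not isinstance(value, int) or value < 1):
            problems.append(f"{key} must be a positive integer")
    # Every feature flag should be an explicit boolean.
    for flag, value in entry.get("features", {}).items():
        if not isinstance(value, bool):
            problems.append(f"features.{flag} must be a boolean")
    return problems


problems = check_entry(entry)
```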