SWE-bench Verified

swebench-verified · v0.1.0

coding

SWE-bench Verified ported to the CUBE protocol: 500 human-validated GitHub issues with test-based resolution criteria. It is the subset of the broader SWE-bench dataset curated by Princeton and OpenAI, in which every task was manually checked for an unambiguous problem statement and a reliable test-based reward signal. The agent receives the problem statement and a git checkout at the base commit, and must produce a patch that makes the upstream fail_to_pass tests pass without breaking the pass_to_pass tests.
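The resolution criterion described above can be sketched as a small predicate. This is an illustrative sketch only, not the cube's actual scoring code: the function name `is_resolved`, the shape of `test_outcomes`, and the test ids are assumptions made for the example.

```python
from typing import Mapping, Sequence


def is_resolved(
    test_outcomes: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Return True only if every listed test passed after the patch was applied.

    test_outcomes maps a test id to True (passed) or False (failed).
    A test that is missing from test_outcomes is treated as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_outcomes.get(test_id, False) for test_id in required)


# Hypothetical test ids, for illustration only.
outcomes = {
    "tests/test_bug.py::test_fix": True,    # previously failing, now passing
    "tests/test_core.py::test_old": True,   # previously passing, still passing
}
resolved = is_resolved(
    outcomes,
    fail_to_pass=["tests/test_bug.py::test_fix"],
    pass_to_pass=["tests/test_core.py::test_old"],
)
```

Under this sketch a patch that fixes the target tests but breaks a single pass_to_pass test counts as unresolved, which matches the "without breaking" wording above.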

By: @NicolasAG (Nicolas Gontier), @recursix (Alexandre Lacoste), @josancamon19 (Joan Cabezas)

Install

pip install swebench-verified-cube

Version: 0.1.0 · PyPI page
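After installing, one way to confirm which version ended up in the environment is to query the package metadata with the standard library. This is a minimal sketch; the helper name `installed_version` is an assumption, and only the distribution name `swebench-verified-cube` comes from this page.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "swebench-verified-cube") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("swebench-verified-cube is not installed in this environment")
else:
    print(f"swebench-verified-cube {found} is installed")
```

Comparing the reported value against the version listed here (0.1.0) is a quick way to catch a stale environment.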

Tasks: 500

Infra: local

Debug Task: Yes

Debug Agent: Yes

Feature Flags

async: no

streaming: no

multi_agent: no

multi_dim_reward: no

Legal

Wrapper license: MIT

Benchmark license: MIT (self-reported; verify before commercial use). Source: https://github.com/SWE-bench/SWE-bench/blob/main/LICENSE

License information is self-reported by the cube developer and has not been verified by the AI Alliance. Always consult the source URL and original benchmark authors for authoritative terms.

Slow check not yet run. Stress test results will appear here after the async compliance check completes.

Reproducibility journal

This is a reproducibility journal — not a leaderboard.

Submissions document how reference agents and models score over time, across infrastructures, cube versions, and package versions. Use the journal to detect drift and validate environments. It is not a place to publish a new agent or fine-tune in order to "win": there is no ranking, scores are self-reported, and submissions are unverified. To showcase a new agent or model, use ATLAS / EEE / your own benchmark page.

No submissions yet. Be the first — see how to submit.

Registry Entry (YAML)

id: swebench-verified
name: "SWE-bench Verified"
version: "0.1.0"
description: >
  SWE-bench Verified ported to the CUBE protocol — 500 human-validated
  GitHub issues with test-based resolution criteria. Princeton + OpenAI's
  curated subset of the broader SWE-bench dataset where every task was
  manually checked for an unambiguous problem statement and a reliable
  test-based reward signal. The agent receives the problem statement +
  a git checkout at the base commit and must produce a patch that makes
  the upstream fail_to_pass tests pass without breaking pass_to_pass.
package: swebench-verified-cube
dev_install_url: "git+https://github.com/The-AI-Alliance/cube-harness#subdirectory=cubes/swebench-verified-cube"

authors:
- github: NicolasAG
  name: Nicolas Gontier
- github: recursix
  name: Alexandre Lacoste
- github: josancamon19
  name: Joan Cabezas

legal:
  wrapper_license: MIT
  benchmark_license:
    reported: MIT
    source_url: "https://github.com/SWE-bench/SWE-bench/blob/main/LICENSE"
    verified_by_original_authors: false

paper: "https://arxiv.org/abs/2310.06770"
getting_started_url: "https://openai.com/index/introducing-swe-bench-verified/"
tags:
- coding
status: degraded
resources: []
task_count: 500
has_debug_task: true
has_debug_agent: true
action_space: []
features:
  async: false
  streaming: false
  multi_agent: false
  multi_dim_reward: false
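A registry entry like the one above can be sanity-checked once it has been parsed into a dictionary. The sketch below is hypothetical: `check_entry` and its rules are assumptions for illustration, not the registry's actual validation, and the `ENTRY` dictionary simply mirrors a subset of the fields shown above so that no YAML parser is needed.

```python
from typing import Any, Dict, List

# A subset of the entry above, written out as a plain dictionary.
ENTRY: Dict[str, Any] = {
    "id": "swebench-verified",
    "version": "0.1.0",
    "package": "swebench-verified-cube",
    "task_count": 500,
    "has_debug_task": True,
    "has_debug_agent": True,
    "features": {
        "async": False,
        "streaming": False,
        "multi_agent": False,
        "multi_dim_reward": False,
    },
}


def check_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    for key in ("id", "version", "package"):
        if not isinstance(entry.get(key), str) or not entry[key]:
            problems.append(f"{key} must be a non-empty string")
    count = entry.get("task_count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count <= 0:
        problems.append("task_count must be a positive integer")
    for key in ("has_debug_task", "has_debug_agent"):
        if not isinstance(entry.get(key), bool):
            problems.append(f"{key} must be a boolean")
    features = entry.get("features")
    if not isinstance(features, dict):
        problems.append("features must be a mapping")
    else:
        for flag, value in features.items():
            if not isinstance(value, bool):
                problems.append(f"features.{flag} must be a boolean")
    return problems


problems = check_entry(ENTRY)
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in an entry in a single pass.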