# Toolkit stable workaround — background + poll exec

## Problem recap
`eai job exec` has a ~6–9 % per-call hang rate, with hangs clustered in bursts. All observed hangs show the CLI stuck in a blocking `read()` syscall on the server-response socket: the command completed on the cluster side, but the response never arrives. Retrying helps for short commands (seconds) but wastes huge amounts of time on long commands (pytest) because it re-runs the work.
## Design
Replace the single long-running `eai job exec <long-cmd>` with three short calls (a driver sketch follows the list):

1. **Kick off:** `eai job exec <id> -- bash -c "(<cmd> > /tmp/cmd.out 2>&1; echo $? > /tmp/cmd.rc) & disown"` — returns in < 1 s, so there is no long-lived response read to hang on.
2. **Poll:** `eai job exec <id> -- bash -c "test -f /tmp/cmd.rc && cat /tmp/cmd.rc"` — poll every 15–30 s; each call takes < 1 s and is retry-safe.
3. **Collect:** `eai job exec <id> -- bash -c "cat /tmp/cmd.out"` — a single short exec for the output. If the output is > 10 KB, stream it via chunked reads (`tail -c +N` in pieces).
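A minimal end-to-end sketch of the three calls, assuming the `eai` CLI invocation shown above. `eai_exec` and `run_background` are illustrative helper names, and this naive kick is deliberately not retry-safe yet; the lock in the next section fixes that:

```python
import subprocess
import time

def eai_exec(job_id: str, script: str, timeout: int = 30) -> str:
    # One short eai call; passing argv directly avoids an extra quoting layer.
    result = subprocess.run(
        ["eai", "job", "exec", job_id, "--", "bash", "-c", script],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout

def run_background(job_id: str, cmd: str, poll_interval: int = 15) -> tuple[int, str]:
    # 1. Kick off: fire-and-forget, returns in about a second.
    eai_exec(job_id, f"({cmd} > /tmp/cmd.out 2>&1; echo $? > /tmp/cmd.rc) & disown")
    # 2. Poll: stdout stays empty until /tmp/cmd.rc exists.
    while True:
        rc_text = eai_exec(job_id, "test -f /tmp/cmd.rc && cat /tmp/cmd.rc").strip()
        if rc_text:
            break
        time.sleep(poll_interval)
    # 3. Collect: one short exec for the output.
    return int(rc_text), eai_exec(job_id, "cat /tmp/cmd.out")
```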
## Idempotent kick (critical — observed in the wild)
An earlier version had the kick run this directly:

```
(<cmd> > /tmp/cmd.out 2>&1; echo $? > /tmp/cmd.rc) & disown
```
When the kick RPC hangs, `_run_eai` retries. If the first kick actually reached the container (which the bug report's H4 test confirms it often does), the retry re-runs the command. For non-idempotent commands like `git apply` this corrupts state: the second run fails because the patch is already applied, and the cube sees rc ≠ 0.

Observed failure: swebench_verified-toolkit passed with retry when only pytest used bg+poll but `git apply` still used plain exec. On a re-run where the retry hit during `git apply`, the retried second apply failed and reward = 0.
Fix — gate the kick with a POSIX-atomic `mkdir` lock:

```
if mkdir /tmp/cmd.lock 2>/dev/null; then
  (<cmd> > /tmp/cmd.out 2>&1; echo $? > /tmp/cmd.rc) & disown
  echo KICKED
else
  echo ALREADY_STARTED
fi
```
`mkdir` is atomic across concurrent processes: the first kick acquires the lock and fires the command; retries see the lock and no-op. The command therefore runs exactly once regardless of how many times the kick RPC is retried. Cleanup at the end removes the lock, marker, and output files together via `rm -rf`.
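The exactly-once behavior is easy to check locally without eai at all. A quick simulation, with `sleep 1` standing in for the real command:

```python
import subprocess

GUARDED_KICK = """
if mkdir /tmp/cmd.lock 2>/dev/null; then
  (sleep 1 > /tmp/cmd.out 2>&1; echo $? > /tmp/cmd.rc) & disown
  echo KICKED
else
  echo ALREADY_STARTED
fi
"""

# Run the kick twice, as a retrying RPC layer would; only the first fires.
for attempt in (1, 2):
    out = subprocess.run(
        ["bash", "-c", GUARDED_KICK], capture_output=True, text=True
    ).stdout.strip()
    print(f"kick attempt {attempt}: {out}")  # KICKED, then ALREADY_STARTED
```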
## Why this is stable
- Each exec call is short and cheap to retry. With a 6 % hang rate and 2 retries (3 attempts total), the chance that every attempt hangs is 0.06³ ≈ 0.02 %.
- Long-running work is decoupled from any RPC channel. The container runs the command with its own lifecycle; the CLI’s transient state doesn’t matter.
- No data loss on hang: with the `.rc` sentinel plus the output file, a hung poll call just means "check again later" — there is no state to recover.
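For concreteness, a hedged sketch of what the retry wrapper around each short call could look like. The doc refers to this layer as `_run_eai`; the name, backoff, and limits here are illustrative:

```python
import subprocess
import time

def run_eai(argv: list[str], attempts: int = 3, timeout: int = 30):
    # Each attempt is short, so a hung call is cheap to abandon and retry.
    # With p(hang) ≈ 0.06 per call, p(all 3 attempts hang) ≈ 0.06**3 ≈ 0.02 %.
    for i in range(attempts):
        try:
            return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            time.sleep(2 ** i)  # brief backoff before the next attempt
    raise RuntimeError(f"eai call hung on all {attempts} attempts: {argv}")
```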
## Proposed abstraction (cube-standard side)
```python
import time
import uuid
from abc import ABC


class Container(ABC):
    # Existing
    def exec(self, cmd, timeout=...) -> ExecResult: ...

    # New (default impl for backends that don't need it)
    def exec_long_running(self, cmd, *, timeout: int, poll_interval: int = 30) -> ExecResult:
        """Run `cmd` in the background in the container, polling for completion.

        Designed for commands that take minutes (pytest, heavy git, builds)
        where a single blocking exec would suffer RPC-layer timeouts or hangs.
        Each underlying exec call is short and retry-safe.

        Default: delegate to self.exec(cmd, timeout=timeout) — fine for
        backends with reliable exec streaming (LocalContainer, DaytonaContainer).
        Toolkit/slow backends override.
        """
        return self.exec(cmd, timeout=timeout)


class ToolkitContainer(Container):
    def exec_long_running(self, cmd, *, timeout, poll_interval=30):
        marker = f"/tmp/cube_lr_{uuid.uuid4().hex[:8]}"
        # Idempotent kick: the mkdir lock guarantees the command fires at
        # most once even if the exec RPC hangs and is retried.
        self.exec(
            f"if mkdir {marker}.lock 2>/dev/null; then "
            f"(timeout {timeout}s {cmd} > {marker}.out 2>&1; echo $? > {marker}.rc) & disown; "
            f"fi",
            timeout=30,
        )
        deadline = time.monotonic() + timeout + 60
        while time.monotonic() < deadline:
            # Short, retry-safe poll: prints the exit code once it exists.
            rc_result = self.exec(
                f"test -f {marker}.rc && cat {marker}.rc || echo PENDING",
                timeout=30,
            )
            rc_text = rc_result.stdout.strip()
            if rc_text != "PENDING":
                rc = int(rc_text)
                out = self.exec(f"cat {marker}.out", timeout=60).stdout
                # Remove lock + marker + output together.
                self.exec(f"rm -rf {marker}.lock {marker}.rc {marker}.out", timeout=30)
                return ExecResult(stdout=out, stderr="", exit_code=rc, duration_seconds=0.0)
            time.sleep(poll_interval)
        raise ContainerExecError(f"exec_long_running timed out after {timeout}s")
```
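A usage sketch against the proposed API; the constructor arguments and test command are hypothetical:

```python
container = ToolkitContainer(...)  # however the backend is constructed

# A 30-min pytest: one short kick, ~60 short polls, one short collect.
result = container.exec_long_running(
    "python -m pytest tests/ -q",
    timeout=1800,
    poll_interval=30,
)
if result.exit_code != 0:
    print(result.stdout)
```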
## Where to use it
Cube code that uses `self.tool.bash_unlimited(cmd, timeout=N)` for large N:

- `swebench-verified-cube.task._run_tests` (timeout=1800)
- `swebench-live-cube.task._run_test_cmds` (timeout=1800)
- `terminalbench-cube.task.evaluate` → `test.sh` (timeout=900)

The cube-side change is a one-liner per call site: replace the `self.tool.bash` call with `self.tool.bash_long_running` when the command is expected to take minutes (see the sketch below).
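The per-call-site diff, under the proposed `bash_long_running` name:

```python
# Before: one long blocking exec, hang-prone on Toolkit
result = self.tool.bash_unlimited(test_cmd, timeout=1800)

# After: background + poll under the hood, same return shape
result = self.tool.bash_long_running(test_cmd, timeout=1800)
```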
## Tradeoffs
- Single-process parallelism: the `& disown` forks a background job inside the container's `sleep infinity` process. If many long-running commands fire concurrently against one container, they'll race for the output files. Not a concern in our single-task model.
- Polling overhead: a 30 s poll interval means up to 30 s of delay between completion and observation. For a 15-min pytest run that's tolerable.
- Output size: `cat /tmp/cmd.out` would fail if the output exceeds 10 KB (the shell-arg limit applies on the return path too). Mitigation: stream the file in `dd if=/tmp/cmd.out bs=8192 count=N skip=M` chunks, as sketched below.
- Not universally needed: LocalContainer / DaytonaContainer / ModalContainer don't need this; their exec streams reliably. Only Toolkit opts in.
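A sketch of the chunked-read mitigation, built on the `Container.exec` above. `read_output_chunked` is an illustrative helper name, and it assumes textual output:

```python
def read_output_chunked(container, path: str, chunk: int = 8192) -> str:
    # Read the file one fixed-size block per exec call, so no single
    # response has to carry more than `chunk` bytes.
    parts: list[str] = []
    block = 0
    while True:
        res = container.exec(
            f"dd if={path} bs={chunk} count=1 skip={block} 2>/dev/null",
            timeout=30,
        )
        if not res.stdout:
            break  # past end of file
        parts.append(res.stdout)
        block += 1
    return "".join(parts)
```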
## Estimated impact on test matrix
- swebench_live-toolkit: the 30-min pytest hung the response channel in 2/2 attempts. With background + poll: one short exec (≈1 s) kicks it off; ~60 polls × 1 s = 60 s of CLI time over 15 min of wall time; each poll is individually retryable. Expected success rate: > 99 %.
- swebench_verified-toolkit: already borderline (passes with retry); would become rock-solid.
- terminalbench-toolkit: still fails due to the test.sh network issue, which is unrelated.
## Open questions
- Does `& disown` survive when the parent shell (the `bash -c` from eai exec) exits? Yes: `disown` removes the background job from the shell's job table, and once the shell exits the process is reparented to the container's PID 1 (the `sleep infinity` process).
- Should this be a Container abstraction or a cube-side helper? The Container layer is cleaner, and "long-running" is a legitimately generic notion, which argues for a Container method.
- Should polling use `inotifywait` instead of busy-polling? Not all images ship `inotifywait`; polling every 30 s is mildly wasteful but universally available.
