Rather than tailing for "Uvicorn running on", which only confirms the
api-wrapper's own HTTP listener is bound, watch for the api-wrapper's
new structured tokens that reflect actual end-to-end reachability:
MODEL_LOAD_LOG_MSG = ["BACKENDS_READY"]
MODEL_ERROR_LOG_MSGS includes:
- "BACKENDS_READY_TIMEOUT" (backends never came up)
- "BACKEND_UNRECOVERABLE" (CUDA fault latched on a backend)
- "Application startup failed" (kept; uvicorn's own ASGI failure)
Closes the race observed on a live test: the pyworker fired the
benchmark the moment uvicorn bound, every request inside the
api-wrapper hit Cannot-connect-to-host on ComfyUI, and the SDK
counted the resulting fast 502s as a fast worker (perf=200).
Tokens are emitted by ai-dock/comfyui-api-wrapper#11 and onward;
earlier wrapper versions won't emit BACKENDS_READY so warm-up stalls
indefinitely — pin to a wrapper that includes that change.
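
A minimal sketch of how the tokens above could be matched against a
log line. The token lists come from this message; classify_log_line is
a hypothetical helper, not the SDK's matcher, and whether the SDK
checks error patterns before load patterns is not stated here. The
sketch checks errors first because "BACKENDS_READY" is a substring of
"BACKENDS_READY_TIMEOUT".

```python
from typing import Optional

# Token lists as given in this commit message.
MODEL_LOAD_LOG_MSG = ["BACKENDS_READY"]
MODEL_ERROR_LOG_MSGS = [
    "BACKENDS_READY_TIMEOUT",
    "BACKEND_UNRECOVERABLE",
    "Application startup failed",
]


def classify_log_line(line: str) -> Optional[str]:
    """Return "error", "ready", or None for one log line.

    Hypothetical helper for illustration only.
    """
    # Errors are checked first: a ready-first substring check would
    # report "BACKENDS_READY_TIMEOUT" as a successful load.
    if any(token in line for token in MODEL_ERROR_LOG_MSGS):
        return "error"
    if any(token in line for token in MODEL_LOAD_LOG_MSG):
        return "ready"
    return None
```

With this ordering, a plain "Uvicorn running on" line is ignored and
only the structured tokens change the worker's state.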
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five issues raised by Copilot's review:
1. _resolve_benchmark_path's docstring/README claim that a set-but-
broken BENCHMARK_JSON_PATH falls through to the well-known tier,
but the implementation only handled "file missing". A path
pointing at a directory or holding malformed JSON dropped
straight to the SD1.5 fallback without consulting tier 3.
Replaced with a true tiered try-and-load: walk
(misc, env, well-known), attempt to load each, and fall through
to the next on any failure (missing, not a regular file,
unreadable, invalid JSON). The env-var case still surfaces a
warning so a typo doesn't fail silently.
2. int(os.getenv("BENCHMARK_TEST_WIDTH", ...)) crashed on non-int
values. Added _env_int helper that warns + returns default on
ValueError. Empty string also handled.
3. random.choice([]) on an empty test_prompts.txt raised IndexError.
_load_prompts now warns + uses a built-in _FALLBACK_PROMPT when
the file is missing or yields no non-blank lines.
4. README already claimed "missing or unreadable" fall-through; the
refactor in (1) makes the code match. No README change needed.
5. test_prompts.txt, restored verbatim from the pre-rewrite tree,
carried real-person and IP-laden prompts (Pope Francis, Iron Man,
Luke Skywalker, "Disney socialite"). Because these prompts are used
automatically during warm-up, they are a reputational and
safety-filter risk for the worker. Replaced with generic equivalents
that exercise the same workload characteristics (1 elderly figure on
motorcycle, 1 armoured hero with axe, etc.).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch MODEL_LOG_FILE from /var/log/portal/comfyui.log to
/var/log/portal/api-wrapper.log and MODEL_LOAD_LOG_MSG to "Uvicorn
running on". A live test instance showed the previous setup firing the
benchmark on ComfyUI's "To see the GUI go to:" line, which races
api-wrapper.sh: that script runs convert-workflows.sh (which itself
waits for ComfyUI ready and then converts workflows for several
seconds) before launching uvicorn. The benchmark hit a closed port
on :18288 and the SDK's __call_backend has no retry on connection
refused, locking the worker into a permanent error state.
Watching the api-wrapper log instead means the benchmark only fires
after uvicorn is bound and the pyworker_benchmark.json symlink is
already in place — no SDK changes required.
Trim MODEL_ERROR_LOG_MSGS down to "Application startup failed". The
old patterns were ComfyUI-specific (won't appear in api-wrapper.log)
and dangerous: ModelError is fatal, so "Value not in list:" matching
on an api-wrapper-style log would let one malformed client request
kill the worker. CUDA OOM is similarly off-limits (indistinguishable
from a too-greedy client request via substring match; the
benchmark-failure path already catches model-load OOM at boot). Set
MODEL_INFO_LOG_MSGS to an empty list, since the prior ComfyUI download
pattern can never match this log file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pyworker and convert-workflows.sh both unblock when ComfyUI is
ready, but conversion takes a few seconds longer — without a wait, the
first benchmark loses the race and silently drops to the SD1.5 fallback.
Wait up to BENCHMARK_WAIT_TIMEOUT (default 30s) for the symlink before
giving up. The wait fires only when we're actually about to use the
well-known tier (env var / misc/ paths short-circuit), only once per
process, and is skipped entirely off the base image (parent directory
absent), so non-base-image deployments don't pay the timeout.
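
The wait described above could be sketched as follows. The function
name, the poll interval and the module-level _waited flag are
assumptions made for illustration; only BENCHMARK_WAIT_TIMEOUT, its 30s
default and the skip conditions come from this message. The caller is
assumed to invoke it only when the well-known tier is about to be used.

```python
import os
import time
from pathlib import Path
from typing import Optional

_waited = False  # ensures the wait happens at most once per process


def _timeout_from_env(default: float = 30.0) -> float:
    """Parse BENCHMARK_WAIT_TIMEOUT, falling back to default on bad input."""
    raw = os.getenv("BENCHMARK_WAIT_TIMEOUT", "").strip()
    try:
        return float(raw) if raw else default
    except ValueError:
        return default


def wait_for_benchmark_symlink(
    path: Path, timeout: Optional[float] = None, poll: float = 0.5
) -> bool:
    """Return True once path is a readable file, False if we give up."""
    global _waited
    if path.is_file():
        return True
    # Off the base image the parent directory is absent, so skip the wait.
    if _waited or not path.parent.is_dir():
        return False
    _waited = True
    deadline = time.monotonic() + (
        _timeout_from_env() if timeout is None else timeout
    )
    while time.monotonic() < deadline:
        if path.is_file():
            return True
        time.sleep(poll)
    return path.is_file()
```

Path.is_file() follows symlinks, so a dangling symlink counts as "not
ready yet" rather than as success.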
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Read /opt/comfyui-api-wrapper/workflows/pyworker_benchmark.json when
neither misc/benchmark.json nor $BENCHMARK_JSON_PATH yields a usable
file. The vast.ai ComfyUI base image's convert-workflows.sh maintains
that path as a symlink to the first provisioned workflow, so on that
image the operator does not need to set BENCHMARK_JSON_PATH at all.
A set-but-broken $BENCHMARK_JSON_PATH now warns and falls through to
the well-known path instead of dropping straight to the SD1.5 fallback,
so a typo in the env var doesn't mask an otherwise-working benchmark.
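
A rough sketch of the lookup order described above, written as a
try-and-load walk over the three locations. The function names and the
well_known parameter are hypothetical; the paths, the env var name and
the warn-then-fall-through behaviour are taken from this message.

```python
import json
import logging
import os
from pathlib import Path
from typing import Any, Optional

log = logging.getLogger(__name__)

WELL_KNOWN = Path("/opt/comfyui-api-wrapper/workflows/pyworker_benchmark.json")


def _try_load(path: Path) -> Optional[Any]:
    """Return parsed JSON, or None if path is missing, not a file, or invalid."""
    if not path.is_file():
        return None
    try:
        with path.open(encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError subclass ValueError.
        return None


def resolve_benchmark(misc_path: Path, well_known: Path = WELL_KNOWN) -> Optional[Any]:
    """Walk misc -> env -> well-known; None means use the SD1.5 fallback."""
    candidates = [("misc", misc_path)]
    env_value = os.getenv("BENCHMARK_JSON_PATH", "").strip()
    if env_value:
        candidates.append(("env", Path(env_value)))
    candidates.append(("well-known", well_known))
    for tier, path in candidates:
        workflow = _try_load(path)
        if workflow is not None:
            return workflow
        if tier == "env":
            # Surface typos instead of failing silently.
            log.warning("BENCHMARK_JSON_PATH=%s is not usable; trying next tier", path)
    return None
```

A file whose entire content is the JSON literal null is treated as
unusable in this sketch, which seems acceptable for a workflow file.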
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
start_server.sh clones pyworker into /workspace/vast-pyworker after the
provisioning phase has run, so a provisioning script that wants to ship
a custom benchmark workflow cannot write to misc/benchmark.json — that
path doesn't exist yet at provisioning time, and pre-creating it would
make the subsequent clone fail.
Allow provisioning to drop the workflow anywhere (e.g. /workspace) and
point the worker at it via the BENCHMARK_JSON_PATH env var. The in-tree
file still takes precedence (so forks with a baked-in benchmark keep
working unchanged); the env var is consulted only as a second choice,
and a misconfigured path logs a warning rather than silently degrading
to the SD1.5 fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Use PyWorker SDK" rewrite (4380d98) replaced the dynamic
ComfyWorkflowData.for_test() benchmark logic with a hardcoded list of 11
SD1.5 Text2Image payloads, dropped misc/benchmark.json.example and
misc/test_prompts.txt, and stopped honouring the BENCHMARK_TEST_*
environment variables. The README's documented behaviour (custom
workflow via benchmark.json, env-var-tuned fallback) had no
implementation behind it.
Restore the original two-tier behaviour against the new SDK by passing
BenchmarkConfig(generator=make_benchmark_payload) instead of a static
dataset, splitting the load logic into a custom-workflow path and a
fallback path, and re-shipping the misc/ assets.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>