comfyui-json: key readiness off api-wrapper's BACKENDS_READY token

Rather than tailing for "Uvicorn running on", which only confirms the
api-wrapper's own HTTP listener is bound, watch for the api-wrapper's
new structured tokens that reflect actual end-to-end reachability:

  MODEL_LOAD_LOG_MSG  = ["BACKENDS_READY"]
  MODEL_ERROR_LOG_MSGS includes:
    - "BACKENDS_READY_TIMEOUT"   (backends never came up)
    - "BACKEND_UNRECOVERABLE"    (CUDA fault latched on a backend)
    - "Application startup failed" (kept; uvicorn's own ASGI failure)

Closes a race observed in a live test: the pyworker fired the
benchmark the moment uvicorn bound, every request inside the
api-wrapper hit Cannot-connect-to-host on ComfyUI, and the SDK
counted the resulting fast 502s as a fast worker (perf=200).

Tokens are emitted by ai-dock/comfyui-api-wrapper#11 onward. Earlier
wrapper versions never emit BACKENDS_READY, so warm-up stalls
indefinitely; pin to a wrapper that includes that change.
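
Side note on the token names: "BACKENDS_READY" is a substring of
"BACKENDS_READY_TIMEOUT". If the log patterns are matched as plain
substrings (an assumption; the matching code is not part of this
commit), a timeout line would satisfy the ready pattern as well. The
sketch below shows one way to keep the two apart by testing the error
patterns first. classify_line is a hypothetical helper for
illustration, not a vastai SDK function, and the sample log lines in
the usage note are made up.

```python
from typing import Optional

# Patterns as listed in this commit.
MODEL_LOAD_LOG_MSG = ["BACKENDS_READY"]
MODEL_ERROR_LOG_MSGS = [
    "BACKENDS_READY_TIMEOUT",
    "BACKEND_UNRECOVERABLE",
    "Application startup failed",
]


def classify_line(line: str) -> Optional[str]:
    """Hypothetical helper: return "error", "ready", or None for a log line.

    Assumes plain substring matching. Error patterns are tested first
    because "BACKENDS_READY" is contained in "BACKENDS_READY_TIMEOUT".
    """
    if any(msg in line for msg in MODEL_ERROR_LOG_MSGS):
        return "error"
    if any(msg in line for msg in MODEL_LOAD_LOG_MSG):
        return "ready"
    return None
```

With this ordering, a made-up line such as "wrapper: BACKENDS_READY_TIMEOUT"
classifies as "error" and "wrapper: BACKENDS_READY" as "ready". Whether
the SDK itself checks error patterns before load patterns is not shown
here and would need to be confirmed against its source.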

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rob Ballantyne
2026-05-08 09:46:45 +01:00
parent a5bcc3de5e
commit b52c654f09
+38 -22
@@ -33,44 +33,60 @@ from pathlib import Path
 from vastai import Worker, WorkerConfig, HandlerConfig, LogActionConfig, BenchmarkConfig
-# ComfyUI model configuration. The model server here is the ai-dock
+# ComfyUI model configuration. The model server is ai-dock's
 # comfyui-api-wrapper sitting in front of ComfyUI itself, not ComfyUI's
-# own port (18188). We watch the api-wrapper's log rather than ComfyUI's
-# because the api-wrapper runs convert-workflows.sh before launching
-# uvicorn — by the time uvicorn logs "Uvicorn running on ...", the
-# benchmark workflows are converted, the pyworker_benchmark.json symlink
-# exists, and :18288 is accepting connections. Watching ComfyUI's log
-# fires the benchmark too early (before the api-wrapper is reachable),
-# which the SDK can't recover from since __call_backend doesn't retry
-# connection-refused.
+# own port (18188). We tail the api-wrapper's log rather than ComfyUI's
+# and key off the api-wrapper's own structured readiness/fault signals:
+#
+# BACKENDS_READY — api-wrapper has confirmed every ComfyUI
+# backend passes HTTP+WS probes. Until
+# this fires, posting to /generate/sync
+# can hit "Cannot connect to host" inside
+# the api-wrapper, which the SDK can't
+# recover from since __call_backend
+# doesn't retry connection-refused.
+# BACKENDS_READY_TIMEOUT — backends never reachable within
+# api-wrapper's deadline. Worker is
+# unrecoverable; mark errored.
+# BACKEND_UNRECOVERABLE — CUDA fault / illegal memory access on a
+# backend's GPU. Same fate.
+# Application startup failed — uvicorn's own ASGI lifespan failed.
+#
+# These tokens are emitted by ai-dock/comfyui-api-wrapper >= the
+# "feat/backend-readiness-log-signals" change. Older wrappers won't
+# emit BACKENDS_READY, so warm-up will stall — pin the wrapper version
+# accordingly.
 MODEL_SERVER_URL = 'http://127.0.0.1'
 MODEL_SERVER_PORT = 18288
 MODEL_LOG_FILE = '/var/log/portal/api-wrapper.log'
 MODEL_HEALTHCHECK_ENDPOINT = "/health"
-# api-wrapper log messages
+# Trigger benchmark only after the full stack (api-wrapper + ComfyUI
+# backends) is reachable. See BACKENDS_READY in the comment above.
 MODEL_LOAD_LOG_MSG = [
-    "Uvicorn running on"
+    "BACKENDS_READY",
 ]
-# LogAction.ModelError is fatal: the SDK calls backend_errored() and the
-# worker is locked into a permanent error state. Patterns must therefore
-# only match conditions where the api-wrapper genuinely cannot serve any
-# request — supervisord restarts on uvicorn exit, so a real failure
-# self-heals rather than dragging the worker down.
+# LogAction.ModelError is fatal: the SDK calls backend_errored() and
+# locks the worker into a permanent error state. Patterns must
+# therefore only match conditions where the api-wrapper genuinely
+# cannot serve any request — supervisord restarts on uvicorn exit, so
+# a real failure self-heals rather than dragging the worker down.
 #
 # Notably *not* matched here:
 # - per-request errors (PreprocessWorker failures, ComfyUI workflow
 # validation, "Value not in list:") — one malformed client payload
 # would otherwise kill the worker
-# - "CUDA out of memory" — surfaces both as misconfigured GPU (which
-# the benchmark-failure path already catches via backend_errored)
-# and as a too-greedy client request, which is indistinguishable
-# from a substring match
+# - "CUDA out of memory" — surfaces both as a misconfigured GPU
+# (which the benchmark-failure path already catches via
+# backend_errored) and as a too-greedy client request, which is
+# indistinguishable from a substring match
 # - convert-workflows.sh warnings — that script is not load-bearing
-# for serving (uvicorn starts even if conversion partially failed)
+# for serving
 MODEL_ERROR_LOG_MSGS = [
-    "Application startup failed",  # uvicorn ASGI lifespan startup failed -> uvicorn exits
+    "BACKENDS_READY_TIMEOUT",  # backends never reachable
+    "BACKEND_UNRECOVERABLE",  # CUDA fault latched per backend
+    "Application startup failed",  # uvicorn ASGI lifespan startup failed
 ]
 # LogAction.Info is purely informational (echoes log lines into the vast