Pass over all three files to drop verbose expository commentary that
duplicated either the code or the README. Net: -284 lines.
README now reads top-to-bottom in roughly the order someone would need
the info: use case → how it works → endpoint params → API → healthcheck
→ deploy → demo. Endpoint params table uses the values actually tested
on alpha (min_load=0, target_util=1, max_queue_time=1,
target_queue_time=0.5, inactivity_timeout=10). Dropped the
"known autoscaler quirk" section now that alpha addresses it; kept the
--session-cost flag as a debugging knob.
worker.py and client.py keep the same behavior but trim long block
comments and multi-line docstrings the code didn't need.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the held /reserve approach in favour of the framework's session
primitive (max_sessions=1 + /session/create). Sessions are excluded from
the autoscaler's queue-wait math and don't suffer the cur_perf=0
degradation that a long-held request did, so this naturally produces the
"one request comes in and you get a worker; release and it scales back
down" model we were hand-rolling.
Server side:
- max_sessions=1; framework auto-registers /session/* routes
- Drop custom /reserve handler, _active_reservation event,
max_queue_time=0.0, MAX_RESERVATION_SECONDS, _perf_heartbeat
- Trivial /ping handler exists only to satisfy the framework's
"at least one handler with BenchmarkConfig" requirement (and to give
clients an extension/keepalive route)
- /release on the internal control port is kept as a convenience for
queue consumers that don't carry session_auth — calls the framework's
__close_session via name-mangling, which bypasses the session_auth
check but is fine for a localhost-only endpoint
- Workload/perf back to 100 (conventional)
Client side:
- Uses endpoint.session(cost, lifetime) instead of POST /reserve
- async with the SDK Session; close on exit posts /session/end with
proper auth → 200 success in metrics
- Demo and single modes both ride the same reserve() helper
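A minimal sketch of the client-side shape described above, using a stub in
place of the SDK. StubEndpoint, its recorded call strings, and the reserve()
signature are assumptions for illustration only; the real vastai-sdk session
API may differ. The point is that the async context manager ends the session
on exit, even when the body raises.

```python
import asyncio
from contextlib import asynccontextmanager


class StubEndpoint:
    """Hypothetical stand-in for an SDK endpoint; records the calls made."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    @asynccontextmanager
    async def session(self, cost: int, lifetime: float):
        # A real endpoint would create the session on the worker here.
        self.calls.append(f"create cost={cost} lifetime={lifetime}")
        try:
            yield self
        finally:
            # Ending the session on exit is what releases the worker.
            self.calls.append("end")


async def reserve(endpoint: StubEndpoint, hold_seconds: float,
                  lifetime: float, cost: int = 100) -> None:
    """Hold one worker for hold_seconds, releasing it even on error."""
    async with endpoint.session(cost=cost, lifetime=lifetime):
        await asyncio.sleep(hold_seconds)


async def demo() -> list[str]:
    endpoint = StubEndpoint()
    await reserve(endpoint, hold_seconds=0.01, lifetime=60.0)
    return endpoint.calls
```

Both the demo and single modes can call the same helper; only the hold
duration changes.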
Sessions landed in vastai-sdk 0.4.2 (commit ec9ef59, 2026-01-20).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While a /reserve is held, no requests complete, so workload_served stays
at 0 each metrics tick. The autoscaler sees cur_perf=0 against
max_perf=150, concludes the worker can't deliver claimed throughput,
downgrades it, and gets cautious about scaling up — so additional
/reserve requests pile up behind the held one instead of triggering a
new worker.
Add a 1Hz heartbeat coroutine that, while anything is in flight, sets
workload_served back to TARGET_PERF (150) and flags update_pending. The
metrics tick reads 150 and resets to 0; the heartbeat re-pins it before
the next tick. Net effect: the autoscaler sees a saturated worker
delivering at peak rate, which is the signal it needs to scale a new
worker up rather than queue.
The heartbeat needs the backend instance, which is only created inside
Worker(...) — stash a reference in a module-level dict between Worker()
and .run() so the lifecycle coroutine can reach it.
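The heartbeat can be sketched roughly as follows. Metrics, its in_flight
counter, and metrics_tick() are hypothetical stand-ins for the framework's
internals, and the demo uses a shortened interval so it finishes quickly.

```python
import asyncio
from dataclasses import dataclass

TARGET_PERF = 150.0  # peak rate named in this commit


@dataclass
class Metrics:
    """Hypothetical stand-in for the framework's metrics object."""
    workload_served: float = 0.0
    update_pending: bool = False
    in_flight: int = 0


async def perf_heartbeat(metrics: Metrics, interval: float = 1.0) -> None:
    """Re-pin workload_served while any request is in flight."""
    while True:
        if metrics.in_flight > 0:
            metrics.workload_served = TARGET_PERF
            metrics.update_pending = True
        await asyncio.sleep(interval)


def metrics_tick(metrics: Metrics) -> float:
    """Simulated tick: report the current value, then reset it to 0."""
    reported = metrics.workload_served
    metrics.workload_served = 0.0
    metrics.update_pending = False
    return reported


async def demo() -> list[float]:
    metrics = Metrics(in_flight=1)
    task = asyncio.create_task(perf_heartbeat(metrics, interval=0.01))
    reports = []
    for _ in range(3):
        await asyncio.sleep(0.05)  # several heartbeats fit in one tick
        reports.append(metrics_tick(metrics))
    task.cancel()
    return reports
```

Because the heartbeat interval is shorter than the tick interval, every tick
reads the pinned value rather than 0.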
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
asyncio.sleep(1.0) takes slightly more than 1s due to event loop
scheduling, so workload/time landed at ~99.x instead of 100. Pre-populate
the framework's .has_benchmark cache file with "100" before the benchmark
runs — __run_benchmark short-circuits to the cached value and skips the
time-based calculation entirely.
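A sketch of the pre-population step, assuming the cache is a plain text file
holding the number. The helper name and the cache location are hypothetical;
the demo writes into a temporary directory rather than the framework's real
path.

```python
import tempfile
from pathlib import Path


def seed_benchmark_cache(cache_path: Path, value: float = 100.0) -> None:
    """Write the benchmark value so a later benchmark run can reuse it."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # ":g" drops the trailing ".0", so 100.0 is written as "100".
    cache_path.write_text(f"{value:g}")


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / ".has_benchmark"  # assumed location
        seed_benchmark_cache(cache)
        return cache.read_text()
```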
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Using 1 as the workload unit confused the serverless capacity math. Set workload_calculator,
benchmark target throughput, and client cost all to 100 — the conventional
default the rest of the system expects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The startup benchmark previously returned instantly, producing
max_throughput around 339895. A null worker has no real throughput
concept (each reservation is a unitless slot), so sleep 1s during the
benchmark with workload=1 to record max_throughput ~= 1.0.
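Roughly, the benchmark reduces to workload divided by elapsed time. The
function below is an illustrative sketch, not the framework's benchmark hook;
null_benchmark and its parameters are assumed names.

```python
import asyncio
import time


async def null_benchmark(workload: float = 1.0,
                         hold_seconds: float = 1.0) -> float:
    """Sleep for hold_seconds and return workload / elapsed."""
    start = time.monotonic()
    await asyncio.sleep(hold_seconds)
    elapsed = time.monotonic() - start
    # The sleep typically overshoots a little, so the result usually lands
    # just under workload / hold_seconds.
    return workload / elapsed


def demo() -> float:
    # Scaled down so the example runs in about 50 ms; the ratio is still ~1.
    return asyncio.run(null_benchmark(workload=0.05, hold_seconds=0.05))
```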
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A held reservation runs for up to MAX_RESERVATION_SECONDS (default 1h), so
queueing a second /reserve behind it makes no sense — the wait would dwarf
any sane timeout. Set max_queue_time=0.0 so the framework rejects with 429 as
soon as another reservation is in flight, and serverless routes the request
to a free worker or scales a new one up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The held /reserve now waits on an asyncio.Event and resolves when the local
queue consumer POSTs /release on the internal control port (127.0.0.1:18999
by default). This produces a 200 success in metrics instead of the 499
cancellation you got from disconnecting the client. The duration cap stays
as a safety net for stuck consumers.
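The wait-with-cap pattern looks roughly like this. Reservation and its
method names are hypothetical; in the worker, release() would be invoked by
the /release handler and hold() by the /reserve handler.

```python
import asyncio


class Reservation:
    """Hypothetical holder for one in-flight reservation."""

    def __init__(self) -> None:
        self._released = asyncio.Event()

    def release(self) -> None:
        """Signal the held request to resolve."""
        self._released.set()

    async def hold(self, max_seconds: float) -> str:
        """Block until released, or until the duration cap elapses."""
        try:
            await asyncio.wait_for(self._released.wait(), timeout=max_seconds)
            return "released"
        except asyncio.TimeoutError:
            return "expired"


async def demo() -> tuple[str, str]:
    # Normal path: the consumer releases shortly after the hold starts.
    res = Reservation()
    asyncio.get_running_loop().call_later(0.01, res.release)
    first = await res.hold(max_seconds=1.0)

    # Stuck consumer: nobody releases, so the cap ends the hold.
    stuck = Reservation()
    second = await stuck.hold(max_seconds=0.02)
    return first, second
```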
The internal aiohttp server is now unconditional and hosts /release always;
the stub /health route is added only when BACKEND_HEALTH_URL is unset.
NULL_STUB_HEALTH_PORT is renamed to NULL_CONTROL_PORT to reflect the
broader role.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an in-process aiohttp stub on 127.0.0.1:18999/health so the framework's
periodic healthcheck has something live to talk to. Operators can override
with BACKEND_HEALTH_URL to point at their queue consumer's /health
endpoint, so the autoscaler marks the worker errored if the consumer dies.
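The override logic amounts to something like the following. The helper name
is hypothetical, and treating an empty BACKEND_HEALTH_URL as unset is an
assumption of this sketch.

```python
def resolve_health_url(env: dict[str, str], control_port: int = 18999) -> str:
    """Pick the operator's health URL if set, else the in-process stub."""
    override = env.get("BACKEND_HEALTH_URL", "").strip()
    if override:
        return override
    return f"http://127.0.0.1:{control_port}/health"
```

In the worker this would be called with os.environ; passing the mapping in
keeps the function easy to test.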
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A PyWorker that does not forward to any model server. POST /reserve holds
the worker busy until the client disconnects (or the duration cap elapses),
so users with their own job queue can drive Vast autoscaling without
exposing inbound model traffic on the instance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>