Previously the first release fired only 30s after the third reservation
started, so the autoscaler often hadn't even finished provisioning the
third worker yet. Default plateau to 300s so all three workers are
visibly running before scale-down begins; configurable via --plateau.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Using 1 confused the serverless capacity math. Set workload_calculator,
benchmark target throughput, and client cost all to 100 — the conventional
default the rest of the system expects.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three reservations 30s apart, each with a 90s duration. They end one at
a time, also 30s apart, then the client exits. Each reservation ends
via its duration cap (200 success) rather than the previous "cancel one,
leave two open" pattern that left two 499s pending.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the server-side workload_calculator (1.0) so the autoscaler routing
hint is consistent with what the worker reports. A null reservation is a
unitless slot — no reason for client cost to be 100.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three concurrent /reserve calls 30s apart, then cancel the first to show
the early-release path. The remaining two run until their duration cap.
Useful for watching scale-up/scale-down behaviour in the autoscaler
dashboard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A PyWorker that does not forward to any model server. POST /reserve holds
the worker busy until the client disconnects (or the duration cap elapses),
so users with their own job queue can drive Vast autoscaling without
exposing inbound model traffic on the instance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>