Reporting cost == max_perf puts an occupied worker at exactly 100% utilization, which the autoscaler reads as "at target, no action." The 3rd session_create then 429s on both active workers and stalls in the global queue instead of triggering a cold-worker activation (observed: 1→2 active scales fine, 2→3 does not). Bumping cost to 2 * max_perf makes each session look like more than one worker's work, so the autoscaler always keeps an extra active worker hot. Slight over-provisioning, but the 3rd reservation lands directly on a free worker rather than queueing. Expose --session-cost on the client so the value can be swept without edits. README documents the trade-off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Null PyWorker
A PyWorker that does nothing — it does not forward requests to any model server. Reservations are modelled as framework sessions: a request comes in and you get a worker; release and it scales back down.
When to use it
Use this worker when you want to drive Vast Serverless autoscaling but you do not want inbound requests to reach a model on the instance. Typical setup:
- You already have a job queue on your own infrastructure (Redis, SQS, NATS, etc.).
- A separate worker process on the Vast instance pulls work from that queue directly. The Vast PyWorker is not involved in the request/response path. Your consumer can be written in any language (Node.js, Go, Python, a compiled binary); this PyWorker is implementation-agnostic.
- You want one Vast worker per active queue consumer, and you want the Serverless autoscaler to spin instances up and down based on demand on your side.
How it works
- Reservations use the framework's session model. The SDK exposes endpoint.session(cost, lifetime), which POSTs to /session/create (a built-in framework route) and returns a Session object usable with async with. Closing the context (or calling await session.close()) POSTs to /session/end, which is counted as a normal success in metrics.
- max_sessions=1 on the worker side means a second /session/create against an already-occupied worker returns 429. Serverless routes that request to a free worker or scales a new one up.
- Sessions are excluded from queue-wait math (the framework filters if not request.is_session), so an occupied worker doesn't look like it has a request queue piling up. The autoscaler treats a session as occupancy, not as work-in-progress.
- lifecycle is used instead of model_log_file, so there is no log to tail and no model server to start. The worker reports itself ready immediately after a trivial benchmark.
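The occupancy rule described above (one session per worker, 429 for anyone else) can be pictured with a small toy model. This is a sketch for intuition only: ToyWorker and route are hypothetical names that are not part of the PyWorker framework or the Vast SDK, and real routing happens inside Serverless.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    """Hypothetical stand-in for a null worker with max_sessions=1."""
    name: str
    max_sessions: int = 1
    sessions: set = field(default_factory=set)

    def session_create(self, session_id: str) -> int:
        # A worker that is already at max_sessions refuses with 429.
        if len(self.sessions) >= self.max_sessions:
            return 429
        self.sessions.add(session_id)
        return 200

    def session_end(self, session_id: str) -> int:
        self.sessions.discard(session_id)
        return 200


def route(workers, session_id):
    """Return the name of the first worker that accepts, or None if all answer 429."""
    for worker in workers:
        if worker.session_create(session_id) == 200:
            return worker.name
    return None  # every worker is occupied: this is where a scale-up is needed


workers = [ToyWorker("w1"), ToyWorker("w2")]
placements = [route(workers, sid) for sid in ("s1", "s2", "s3")]
```

With two workers, the first two reservations land on w1 and w2 and the third finds no free worker, which is the situation the scaling parameters later in this README are tuned for.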
Healthchecking
The framework periodically GETs a healthcheck URL after startup; if it ever fails after the first success, the worker is marked errored and the autoscaler can decommission it. Two modes:
- Stub (default): the internal control server also answers GET /health with 200. Just enough to satisfy the framework while you wire up real consumers.
- Point at your queue consumer (recommended): set BACKEND_HEALTH_URL=http://127.0.0.1:9090/health (absolute URL) and the pyworker will healthcheck your consumer instead. If the consumer process crashes, the autoscaler will see the worker as broken.
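If your consumer is written in Python, one possible way to give it a health endpoint is a small standard-library HTTP server on a background thread. This is a minimal sketch, not something the PyWorker ships: HealthHandler and start_health_server are illustrative names, and port 9090 simply mirrors the example URL above.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the consumer's own logs free of healthcheck noise


def start_health_server(host="127.0.0.1", port=9090):
    """Serve GET /health on a daemon thread and return the server object."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call start_health_server() once at consumer startup. Because the thread is a daemon, it goes away when the process exits or crashes, so the healthcheck starts failing. A consumer whose main loop hangs while the process stays alive would still answer 200, so you may want the handler to check a heartbeat timestamp as well.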
API
Reservation: POST /session/create (external, signed)
Not implemented here — the framework provides this route automatically on every PyWorker. Use the SDK:
    import asyncio
    from vastai import Serverless

    async def main():
        async with Serverless() as client:
            endpoint = await client.get_endpoint(name="my-null-endpoint")
            async with endpoint.session(cost=100, lifetime=600) as s:
                # Worker is now reserved. Your queue dispatcher does whatever it
                # needs to do (typically: enqueue a job that mentions s.session_id).
                ...
            # `async with` exit posts to /session/end → 200 success in metrics

    asyncio.run(main())
Or raw HTTP (the SDK takes care of autoscaler signing for you, but the shape of the request is documented for non-Python clients):
    POST /session/create
    {
      "auth_data": { /* signed by autoscaler */ },
      "payload": {
        "lifetime": 600,
        "on_close_route": "https://your.callback/notify",
        "on_close_payload": {"job_id": "..."}
      }
    }
Release from a local consumer: POST /release (internal, localhost-only)
Closes the active session, regardless of who created it. No body, no
auth. Use this when the queue consumer doesn't have (and shouldn't need)
the session's session_auth:
curl -X POST http://127.0.0.1:18999/release
Responses:
- 200 {"released": true, "session_ids": ["..."]}: closed; the held client-side /session/create completes and counts as a success.
- 200 {"released": false, "reason": "no active session"}: nothing active, no-op.
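For a Python consumer, the curl call and the two response shapes above could be wrapped roughly as follows. This is a sketch under the assumption that the response body is JSON exactly as shown; release and describe_release are hypothetical helper names, not part of the worker.

```python
import json
import urllib.request


def describe_release(body: dict) -> str:
    """Turn a /release response body into a short log message."""
    if body.get("released"):
        ids = ", ".join(body.get("session_ids", []))
        return f"released session(s): {ids}"
    return f"nothing released: {body.get('reason', 'unknown')}"


def release(port: int = 18999, timeout: float = 5.0) -> dict:
    """POST to the internal /release route (no body, no auth) and decode the JSON reply."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/release", data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A consumer would typically call print(describe_release(release())) once its job has finished. Both outcomes return 200, so the body, not the status code, tells you whether a session was actually closed.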
For setups where the dispatcher can hand the consumer session_auth
(e.g. as part of the queue payload), the consumer can instead POST
/session/end on the framework's HTTP-only port
($WORKER_HTTP_PORT, default WORKER_PORT+1) — the standard, fully
authenticated release path.
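The port fallback described above can be written down as a tiny helper. This is only a sketch of the rule as stated in this README (WORKER_HTTP_PORT if set, otherwise WORKER_PORT + 1); worker_http_port is a hypothetical name, and no default for WORKER_PORT is assumed.

```python
import os


def worker_http_port(env=os.environ) -> int:
    """Resolve the framework's HTTP-only port from the environment."""
    explicit = env.get("WORKER_HTTP_PORT")
    if explicit:
        return int(explicit)
    # Fallback from the text: WORKER_PORT + 1. A missing WORKER_PORT raises
    # KeyError rather than guessing a value.
    return int(env["WORKER_PORT"]) + 1
```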
Environment variables
- BACKEND_HEALTH_URL: absolute URL the framework should healthcheck (e.g. http://127.0.0.1:9090/health). When set, the stub /health route is not registered on the internal server.
- NULL_CONTROL_PORT: port for the internal control server (hosts /release and optionally /health). Defaults to 18999.
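To make the interaction between the two variables concrete, here is one way the rules above could be expressed in Python. It is an illustrative sketch, not the worker's actual startup code: NullWorkerEnv and load_env are hypothetical names, and the absolute-URL check is an assumption about what "absolute" should mean here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class NullWorkerEnv:
    control_port: int
    backend_health_url: Optional[str]

    @property
    def register_stub_health(self) -> bool:
        # The stub /health route exists only when no backend URL is set.
        return self.backend_health_url is None


def load_env(env: Mapping[str, str] = os.environ) -> NullWorkerEnv:
    """Read NULL_CONTROL_PORT and BACKEND_HEALTH_URL with the documented defaults."""
    url = env.get("BACKEND_HEALTH_URL") or None
    if url is not None:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"BACKEND_HEALTH_URL must be an absolute URL, got {url!r}")
    return NullWorkerEnv(
        control_port=int(env.get("NULL_CONTROL_PORT", "18999")),
        backend_health_url=url,
    )
```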
Deploying on Vast Serverless
- Create a Serverless endpoint and point PYWORKER_REPO at this repository (or your fork).
- Set BACKEND=null in the template so start_server.sh runs workers.null.worker.
- There is no model server to configure; you can omit model-related env vars entirely.
- Run your own queue-consumer process on the instance alongside the
PyWorker. When it finishes its work:
curl -X POST http://127.0.0.1:18999/release
Endpoint scaling parameters
The null worker reports max_perf = 100, and the baseline reservation is a
session of cost = 100 (the session cost item below explains why
2 × max_perf is recommended instead). Set the endpoint accordingly:
- target_util = 1.0: required. The default of 0.9 reserves ~11% spare capacity, which for a unit-occupancy worker rounds up to a whole extra worker (e.g. min_load = 100 becomes 100 / 0.9 = 111.1 → 2 active workers instead of 1). With target_util = 1.0 the math is clean: min_load = 100 * N keeps exactly N workers active.
- min_load: set to 100 * N for N always-on workers (with target_util = 1.0).
- max_workers: cap on total reservations the endpoint can ever serve concurrently.
- Session cost = 2 × max_perf (e.g. 200 when max_perf = 100): recommended. Reporting cost = max_perf puts each occupied worker at exactly 100% utilization, which the autoscaler reads as "at target, no action needed." The third reservation then gets 429'd by both occupied workers and stalls in the autoscaler's global queue indefinitely instead of activating a cold worker.
  Bumping cost above max_perf makes each session look like more than one worker of work (cur_load / max_perf > 1.0), so the autoscaler keeps an extra active worker hot per session. Slight over-provisioning in exchange for predictable scale-up. The demo client defaults to --session-cost 200.
- max_queue_time = 0 (or very small, e.g. 0.1): required. The per-worker wait_time property used internally to reject requests filters sessions out, but the autoscaler computes its own queue-time estimate from cur_load / max_perf, and cur_load does include sessions. With defaults around 30s, an occupied null worker (cur_load = 100, max_perf = 100, queue estimate = 1s) looks "available" and the autoscaler keeps routing extra reservations there, getting 429s and queueing them instead of scaling up. Setting max_queue_time = 0 makes any in-flight load mark the worker "full" for routing.
- target_queue_time = 0: required. Aggressive scale-up trigger; combined with max_queue_time = 0, which keeps occupied workers off the routing table, this ensures the autoscaler provisions a new worker the moment all existing ones are occupied rather than queueing on its side. The queue-time math conceptually assumes work completes in proportion to load, which doesn't hold for sessions (they last hours, not cur_load / max_perf seconds). Zeroing both knobs tells the autoscaler "don't estimate when this worker will free up; route to a free one or make a new one."
- inactivity_timeout: works as expected. Idle (no active sessions) for N seconds → permitted to scale down past min_load.
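The arithmetic behind these settings can be checked with a few lines of Python. The functions below are a simplified model inferred from the numbers quoted in this section, not the autoscaler's real code; active_workers, utilization, and is_routable are hypothetical names, and the "estimate must not exceed max_queue_time" comparison is an assumption.

```python
import math


def active_workers(min_load: float, max_perf: float, target_util: float) -> int:
    """Workers needed so that min_load sits at or below the target utilization."""
    return math.ceil(min_load / target_util / max_perf)


def utilization(session_cost: float, max_perf: float) -> float:
    """Utilization of a worker holding one session (cur_load / max_perf)."""
    return session_cost / max_perf


def is_routable(cur_load: float, max_perf: float, max_queue_time: float) -> bool:
    """Simplified routing check: the queue-time estimate must not exceed max_queue_time."""
    return cur_load / max_perf <= max_queue_time


# min_load = 100 with the default target_util rounds up to a second worker.
assert active_workers(100, 100, 0.9) == 2
assert active_workers(100, 100, 1.0) == 1

# cost = max_perf sits exactly at target; cost = 2 * max_perf looks over target.
assert utilization(100, 100) == 1.0
assert utilization(200, 100) == 2.0

# An occupied worker still looks routable with a 30s budget, but not with 0.
assert is_routable(100, 100, 30) is True
assert is_routable(100, 100, 0) is False
assert is_routable(0, 100, 0) is True  # an idle worker stays routable
```

In this model, target_util = 1.0 removes the rounding surprise, a session cost above max_perf pushes utilization past the target so a scale-up is triggered, and max_queue_time = 0 is what stops an occupied worker from receiving further reservations.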
Client example
Single reservation (holds for 180s):
python -m workers.null.client --endpoint <ENDPOINT_NAME>
Staggered demo:
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
Starts three sessions 30s apart (all held concurrently), holds the
3-worker plateau for 5 minutes so the autoscaler has time to actually
provision the third worker before any scale-down starts, then closes
the sessions one at a time, also 30s apart, and exits. Every session
ends cleanly via the SDK's session.close() — 200 successes in
metrics, no cancellations.
Tune the timing with --interval and --plateau. To exercise the
local-release path, shell into a worker and run
curl -X POST http://127.0.0.1:18999/release.
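The timeline of the staggered demo can be worked out ahead of a run. The helper below is a sketch of the schedule as described above, not the demo client's actual code: demo_schedule is a hypothetical name, the units are assumed to be seconds, the plateau is assumed to start when the last session opens, and sessions are assumed to close in the order they were opened.

```python
def demo_schedule(sessions: int = 3, interval: float = 30.0, plateau: float = 300.0):
    """Return (opens, closes): seconds from start at which each session opens and closes."""
    opens = [i * interval for i in range(sessions)]
    first_close = opens[-1] + plateau  # plateau begins once the last session is open
    closes = [first_close + i * interval for i in range(sessions)]
    return opens, closes


opens, closes = demo_schedule()
print(opens)   # [0.0, 30.0, 60.0]
print(closes)  # [360.0, 390.0, 420.0]
```

Under these assumptions the default demo runs for about seven minutes, which is useful to know when choosing the lifetime of each reservation or comparing against autoscaler logs.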
Notes and caveats
- The reservation's lifetime caps how long the session can live without client activity. Set it comfortably longer than the work you expect to do, or have the client periodically POST /ping with session_id to extend it.
- The on_close_payload (passed at /session/create) is POSTed by the framework to on_close_route when the session ends. Useful for notifying your queue consumer that the reservation is closing.
- /release on the internal port is convenient but bypasses session_auth. If you need the standard authenticated release flow, pass session_auth to your consumer (e.g. through the queue payload) and have it POST to /session/end on the framework's HTTP port instead.