Null PyWorker
A PyWorker that does nothing — it does not forward requests to any model server. Reservations are modelled as framework sessions: a request comes in and you get a worker; release and it scales back down.
When to use it
Use this worker when you want to drive Vast Serverless autoscaling but you do not want inbound requests to reach a model on the instance. Typical setup:
- You already have a job queue on your own infrastructure (Redis, SQS, NATS, etc.).
- A separate worker process on the Vast instance pulls work from that queue directly. The Vast PyWorker is not involved in the request/response path. Your consumer can be written in any language (Node, Go, Python, a compiled binary); this PyWorker is implementation-agnostic.
- You want one Vast worker per active queue consumer, and you want the Serverless autoscaler to spin instances up and down based on demand on your side.
How it works
- Reservations use the framework's session model. The SDK exposes endpoint.session(cost, lifetime), which POSTs to /session/create (a built-in framework route) and returns a Session object usable in an async with block. Closing the context (or calling await session.close()) POSTs to /session/end, which is counted as a normal success in metrics.
- max_sessions=1 on the worker side means a second /session/create against an already-occupied worker returns 429. Serverless routes that request to a free worker or scales a new one up.
- Sessions are excluded from queue-wait math (the framework filters if not request.is_session), so an occupied worker doesn't look like it has a request queue piling up. The autoscaler treats a session as occupancy, not as work-in-progress.
- lifecycle is used instead of model_log_file, so there is no log to tail and no model server to start. The worker reports itself ready immediately after a trivial benchmark.
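The occupancy behaviour described above can be illustrated with a toy model. This is only a sketch: ToyWorker and route are hypothetical names invented for illustration, not framework classes, and the real routing and scale-up logic lives in Vast Serverless.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    """Hypothetical stand-in for one null worker (not the framework's class)."""
    max_sessions: int = 1
    sessions: set = field(default_factory=set)

    def create_session(self, session_id: str) -> int:
        # An occupied worker rejects a further reservation with 429.
        if len(self.sessions) >= self.max_sessions:
            return 429
        self.sessions.add(session_id)
        return 200

    def end_session(self, session_id: str) -> int:
        # Ending a session frees the worker for the next reservation.
        self.sessions.discard(session_id)
        return 200


def route(workers: list, session_id: str) -> int:
    """Place the session on the first free worker, adding one if all are busy.

    Returns the index of the worker that accepted the session.
    """
    for index, worker in enumerate(workers):
        if worker.create_session(session_id) == 200:
            return index
    # Every worker answered 429: model the scale-up as appending a new worker.
    workers.append(ToyWorker())
    workers[-1].create_session(session_id)
    return len(workers) - 1
```

With one worker in the pool, a first reservation lands on it, a second forces a new worker, and ending the first session makes worker 0 available again.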
Healthchecking
The framework periodically GETs a healthcheck URL after startup; if it ever fails after the first success, the worker is marked errored and the autoscaler can decommission it. Two modes:
- Stub (default): the internal control server also answers GET /health with 200. Just enough to satisfy the framework while you wire up real consumers.
- Point at your queue consumer (recommended): set BACKEND_HEALTH_URL=http://127.0.0.1:9090/health (absolute URL) and the PyWorker will healthcheck your consumer instead. If the consumer process crashes, the autoscaler will see the worker as broken.
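If your consumer is written in Python, a health endpoint for the second mode could look roughly like the sketch below. It uses only the standard library. HealthHandler and start_health_server are hypothetical names, and port 9090 is taken from the example URL above. What your consumer should actually verify before answering 200 is up to you.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and everything else with 404."""

    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def start_health_server(port: int = 9090) -> HTTPServer:
    """Serve /health on localhost in a daemon thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the server runs in a daemon thread of the consumer process, it stops answering when the consumer dies, which is what lets the framework notice a crashed consumer.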
API
Reservation: POST /session/create (external, signed)
Not implemented here — the framework provides this route automatically on every PyWorker. Use the SDK:
import asyncio
from vastai import Serverless

async def main():
    async with Serverless() as client:
        endpoint = await client.get_endpoint(name="my-null-endpoint")
        async with endpoint.session(cost=100, lifetime=600) as s:
            # Worker is now reserved. Your queue dispatcher does whatever it
            # needs to do (typically: enqueue a job that mentions s.session_id).
            ...
        # `async with` exit posts to /session/end → 200 success in metrics

asyncio.run(main())
Or raw HTTP (the SDK takes care of autoscaler signing for you, but the shape of the request is documented for non-Python clients):
POST /session/create
{
  "auth_data": { /* signed by autoscaler */ },
  "payload": {
    "lifetime": 600,
    "on_close_route": "https://your.callback/notify",
    "on_close_payload": {"job_id": "..."}
  }
}
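For reference, the request shape above could be assembled in Python as sketched below. The helper names are hypothetical. Treating on_close_route and on_close_payload as optional is an assumption this document does not confirm. The sketch does not produce auth_data: that value is signed by the autoscaler and has to be obtained from it.

```python
import json


def build_session_payload(lifetime, on_close_route=None, on_close_payload=None):
    """Build the "payload" object for /session/create.

    Assumption: the two on_close_* fields may be omitted when unused.
    """
    payload = {"lifetime": lifetime}
    if on_close_route is not None:
        payload["on_close_route"] = on_close_route
    if on_close_payload is not None:
        payload["on_close_payload"] = on_close_payload
    return payload


def build_request_body(auth_data, payload):
    """Serialize the full request body; auth_data must come from the autoscaler."""
    return json.dumps({"auth_data": auth_data, "payload": payload})
```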
Release from a local consumer: POST /release (internal, localhost-only)
Closes the active session, regardless of who created it. No body, no
auth. Use this when the queue consumer doesn't have (and shouldn't need)
the session's session_auth:
curl -X POST http://127.0.0.1:18999/release
Responses:
- 200 {"released": true, "session_ids": ["..."]}: closed; the held client-side /session/create completes and counts as a success.
- 200 {"released": false, "reason": "no active session"}: nothing active, no-op.
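A Python consumer could call /release and interpret the two responses along these lines. This is a minimal sketch using only the standard library. The function names are hypothetical, and the default port assumes NULL_CONTROL_PORT has not been changed from 18999.

```python
import json
import urllib.request


def interpret_release(body: str) -> str:
    """Turn a /release response body into a short human-readable summary."""
    data = json.loads(body)
    if data.get("released"):
        ids = ", ".join(data.get("session_ids", []))
        return f"released: {ids}"
    return f"not released: {data.get('reason', 'unknown')}"


def release(port: int = 18999, timeout: float = 5.0) -> str:
    """POST to the localhost-only /release route with an empty body, no auth."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/release", data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return interpret_release(response.read().decode("utf-8"))
```

Calling release() when nothing is active is harmless, since the route answers with a no-op response rather than an error.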
For setups where the dispatcher can hand the consumer session_auth
(e.g. as part of the queue payload), the consumer can instead POST
/session/end on the framework's HTTP-only port
($WORKER_HTTP_PORT, default WORKER_PORT+1) — the standard, fully
authenticated release path.
Environment variables
- BACKEND_HEALTH_URL: absolute URL the framework should healthcheck (e.g. http://127.0.0.1:9090/health). When set, the stub /health route is not registered on the internal server.
- NULL_CONTROL_PORT: port for the internal control server (hosts /release and optionally /health). Defaults to 18999.
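The semantics of the two variables can be summarised in a few lines of Python. This is an illustrative sketch, not the worker's actual code: read_null_worker_env and the keys of the returned dictionary are hypothetical, and rejecting non-absolute URLs with an exception is an assumption about how a misconfiguration might be surfaced.

```python
import os
from urllib.parse import urlparse


def read_null_worker_env(env=None):
    """Resolve the null worker settings from an environment mapping."""
    env = os.environ if env is None else env
    control_port = int(env.get("NULL_CONTROL_PORT", "18999"))
    health_url = env.get("BACKEND_HEALTH_URL") or None
    if health_url is not None:
        parsed = urlparse(health_url)
        # The URL must be absolute: a scheme and a host are both required.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"BACKEND_HEALTH_URL must be absolute: {health_url!r}")
    return {
        "control_port": control_port,
        "health_url": health_url,
        # The stub /health route is only needed when no backend URL is set.
        "register_stub_health": health_url is None,
    }
```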
Deploying on Vast Serverless
- Create a Serverless endpoint and point PYWORKER_REPO at this repository (or your fork).
- Set BACKEND=null in the template so start_server.sh runs workers.null.worker.
- There is no model server to configure; you can omit model-related env vars entirely.
- Run your own queue-consumer process on the instance alongside the
PyWorker. When it finishes its work:
curl -X POST http://127.0.0.1:18999/release
Endpoint scaling parameters
The null worker reports max_perf = 100 and each reservation is a
session of cost = 100. The intended model is one session = one
worker, scaling elastically from zero up to as many concurrent
sessions as you ask for.
- target_util = 1.0: required. The default of 0.9 reserves ~11% spare capacity, which for a unit-occupancy worker rounds up to a whole extra worker (e.g. min_load = 100 becomes 100 / 0.9 = 111.1 → 2 active workers instead of 1). With target_util = 1.0 the math is clean: min_load = 100 * N keeps exactly N workers active.
- min_load = 0: required for scale-to-zero. With min_load = 0 and a positive inactivity_timeout, the endpoint can scale down to zero active workers when no sessions exist.
- max_workers: cap on total reservations the endpoint can ever serve concurrently.
- inactivity_timeout: a positive value enables scale-to-zero after the configured number of seconds of no active sessions. Use alongside cold_workers = 0 to also drop the inactive pool.
- max_queue_time = 0 and target_queue_time = 0: recommended. The autoscaler computes per-worker queue-time as cur_load / max_perf and sessions are in cur_load. With the defaults (~30s), an occupied null worker (cur_load = 100, max_perf = 100, implied queue = 1s) looks "available" for routing, so a third reservation gets repeatedly 429'd and never triggers scale-up. Zeroing both knobs tells the autoscaler "don't estimate when this worker will free up; route to a free one or make a new one."
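The arithmetic behind these recommendations can be checked with a small model. The function names are hypothetical, and the rounding rule (ceiling of min_load / target_util / max_perf) is inferred from the worked example in this section rather than taken from the autoscaler's source, so treat it as an approximation.

```python
import math


def active_workers(min_load: float, target_util: float, max_perf: float = 100.0) -> int:
    """Estimate how many workers stay active for a given min_load.

    Assumption: the autoscaler provisions min_load / target_util worth of
    capacity and rounds up to whole workers of max_perf each.
    """
    if not 0 < target_util <= 1:
        raise ValueError("target_util must be in (0, 1]")
    required_capacity = min_load / target_util
    return math.ceil(required_capacity / max_perf)


def implied_queue_seconds(cur_load: float, max_perf: float = 100.0) -> float:
    """Per-worker queue-time estimate, computed as cur_load / max_perf."""
    return cur_load / max_perf


print(active_workers(100, 0.9))    # default target_util: one extra worker
print(active_workers(100, 1.0))    # recommended setting: exactly one worker
print(implied_queue_seconds(100))  # occupied worker with cost = 100
```

Under this model, the 0.9 default turns min_load = 100 into two active workers, while target_util = 1.0 keeps the one-session-per-worker mapping exact.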
Known autoscaler quirk
In current Vast Serverless, scale-up reliably fires for the 1→2 worker transition (the first 429 from an occupied worker activates a cold one), but the 2→3 transition often fails to fire — the third reservation 429s on both occupied workers and sits in the autoscaler's global queue indefinitely instead of activating a third cold worker. Scale-to-zero also has known issues.
Fixes are pending on the Vast side. Until they land, a temporary
workaround is to over-provision by reporting cost > max_perf on
session creation:
python -m workers.null.client --demo --session-cost 200
With cost = 200, max_perf = 100, each occupied worker reports
cur_load / max_perf = 2.0 — clearly over capacity, so the autoscaler
keeps one extra active worker warm per session. The next
/session/create lands on the warm worker directly with no queue.
This is a band-aid, not the design. The intended steady state
is cost = 100 with predictable elastic scale-up.
Client example
Single reservation (holds for 180s):
python -m workers.null.client --endpoint <ENDPOINT_NAME>
Staggered demo:
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
Starts three sessions 30s apart (all held concurrently), holds the
3-worker plateau for 5 minutes so the autoscaler has time to actually
provision the third worker before any scale-down starts, then closes
the sessions one at a time, also 30s apart, and exits. Every session
ends cleanly via the SDK's session.close() — 200 successes in
metrics, no cancellations.
Tune the timing with --interval and --plateau. To exercise the
local-release path, shell into a worker and run
curl -X POST http://127.0.0.1:18999/release.
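To see what a given --interval and --plateau imply, the demo's timeline can be sketched as below. demo_schedule is a hypothetical helper, not part of the client. It assumes both values are in seconds, that the plateau starts when the last session opens, and that sessions close in the order they were opened. The actual client may differ on these points.

```python
def demo_schedule(sessions: int = 3, interval: float = 30.0, plateau: float = 300.0):
    """Return (open_time, close_time) pairs in seconds from the demo's start.

    Assumptions: the plateau is measured from the last open, and sessions
    close in the same order they were opened, `interval` seconds apart.
    """
    if sessions < 1:
        raise ValueError("sessions must be at least 1")
    opens = [i * interval for i in range(sessions)]
    first_close = opens[-1] + plateau
    closes = [first_close + i * interval for i in range(sessions)]
    return list(zip(opens, closes))


for number, (opened, closed) in enumerate(demo_schedule(), start=1):
    print(f"session {number}: opens at {opened:.0f}s, closes at {closed:.0f}s")
```

With the defaults described above, this puts the whole run at roughly seven minutes, which is worth keeping in mind when choosing the sessions' lifetime.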
Notes and caveats
- The reservation's lifetime caps how long the session can live without client activity. Set it comfortably longer than the work you expect to do, or have the client periodically POST /ping with session_id to extend it.
- The on_close_route payload (passed at /session/create) is POSTed by the framework when the session ends. Useful for notifying your queue consumer that the reservation is closing.
- /release on the internal port is convenient but bypasses session_auth. If you need the standard authenticated release flow, pass session_auth to your consumer (e.g. through the queue payload) and have it POST to /session/end on the framework's HTTP port instead.