Null PyWorker
A PyWorker that does nothing — it does not forward requests to any model
server. Each HTTP POST to /reserve simply marks the worker as busy and holds
the request open until the user's queue consumer (running locally on the
instance) calls /release on the internal control port — or a safety
timeout elapses.
When to use it
Use this worker when you want to drive Vast Serverless autoscaling but you do not want inbound requests to reach a model on the instance. Typical setup:
- You already have a job queue on your own infrastructure (Redis, SQS, NATS, etc.).
- A separate worker process on the Vast instance pulls work from that queue directly. The Vast PyWorker is not involved in the request/response path.
- You want one Vast worker per active queue consumer, and you want the Serverless autoscaler to spin instances up and down based on demand on your side.
A request comes in and you get a worker. Release and it scales back down.
POST to /reserve and serverless gives you a worker, held busy for the
lifetime of the request. When your queue consumer is done, POST to
/release on the internal port (127.0.0.1:18999 by default) and the
held /reserve returns 200.
How it works
- allow_parallel_requests=False and max_queue_time=0.0, so one in-flight /reserve fully occupies the worker and any further request that lands on it is rejected with 429 immediately; serverless will route to a free worker or scale a new one up.
- lifecycle is used instead of model_log_file, so there is no log to tail and no model server to start. The worker reports itself ready immediately after the (trivial) benchmark.
- The /reserve handler is a remote_function rather than an HTTP proxy, so the framework never tries to forward the request anywhere; it just awaits an internal asyncio.Event.
- An internal aiohttp control server, bound to 127.0.0.1, hosts /release (and, when no external healthcheck URL is provided, a stub /health).
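The hold-until-released behavior can be sketched with plain asyncio. This is a minimal illustration under stated assumptions, not the worker's actual code: hold_reservation is a hypothetical name, and the returned dictionaries only mirror the response shapes documented in the API section.

```python
import asyncio


async def hold_reservation(released: asyncio.Event, duration: float) -> dict:
    """Wait until `released` is set or `duration` seconds pass (hypothetical helper)."""
    try:
        await asyncio.wait_for(released.wait(), timeout=duration)
        return {"released": "explicit"}
    except asyncio.TimeoutError:
        # Nobody called /release in time, so the safety cap ends the hold.
        return {"released": "duration_elapsed", "duration": duration}


async def main() -> None:
    released = asyncio.Event()
    # Simulate the consumer calling /release shortly after the hold starts.
    asyncio.get_running_loop().call_later(0.05, released.set)
    print(await hold_reservation(released, duration=1.0))
    # With nobody releasing, the duration cap ends the hold instead.
    print(await hold_reservation(asyncio.Event(), duration=0.05))


if __name__ == "__main__":
    asyncio.run(main())
```

Either way the awaiting coroutine returns exactly once, which matches the single response per /reserve request.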
Healthchecking
The framework periodically GETs a healthcheck URL after startup; if it ever fails after the first success, the worker is marked errored and the autoscaler can decommission it. Two modes:
- Stub (default): the internal control server also answers GET /health with 200. This is just enough to satisfy the framework while you wire up real consumers.
- Point at your queue consumer (recommended): set BACKEND_HEALTH_URL=http://127.0.0.1:9090/health (absolute URL) and the pyworker will healthcheck your consumer instead. If your consumer process crashes, the autoscaler will see the worker as broken.
Run your queue consumer on the instance alongside the PyWorker, expose a
plain /health endpoint on it, then set BACKEND_HEALTH_URL accordingly in
your template.
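A plain /health endpoint inside the consumer process might look like the sketch below. It uses only the Python standard library; HealthHandler and start_health_server are hypothetical names, the JSON body is an assumption (the text above only requires a successful response), and port 9090 simply matches the example URL.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/health":
            body = b'{"status": "ok"}'  # body content is an assumption
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args) -> None:
        pass  # keep periodic healthcheck polls out of the consumer's logs


def start_health_server(port: int = 9090) -> ThreadingHTTPServer:
    """Serve /health on localhost from a daemon thread of the consumer process."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the server runs in a daemon thread of the consumer itself, the endpoint stops answering when the consumer process dies, which is the failure the healthcheck is meant to surface.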
API
POST /reserve (external port, signed by the autoscaler)
Holds the worker busy until the reservation ends.
Request body (all fields optional):
{ "duration": 600 }
- duration (seconds, optional): safety cap on how long to hold the reservation if no /release arrives. Capped by MAX_RESERVATION_SECONDS (env var, default 3600). If omitted, defaults to that cap.
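The capping rule amounts to a one-line clamp. The helper below is a hypothetical illustration of that rule only; how the worker treats invalid values (negative or non-numeric) is not documented here, so the sketch does not try to handle them.

```python
from typing import Optional


def effective_duration(requested: Optional[float], cap: float = 3600.0) -> float:
    """Return how long a reservation may be held, given the requested duration."""
    if requested is None:
        return cap  # omitted duration falls back to the cap
    return min(float(requested), cap)  # never exceed MAX_RESERVATION_SECONDS
```

With the default cap, a request for 600 seconds is held for up to 600 seconds, while a request for 7200 seconds is held for up to 3600.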
Behavior:
- Returns 200 with {"released": "explicit", ...} when the local consumer POSTs /release on the internal port. This is the intended happy path; the request is counted as a success in metrics.
- Returns 200 with {"released": "duration_elapsed", "duration": <n>} if the duration cap fires (safety net for a stuck consumer).
- Returns 499 if the external client disconnects (counted as cancelled in metrics; avoid this and use /release instead).
- Returns 429 immediately if the worker is already holding a reservation (so serverless routes the request to a free worker instead of queueing).
POST /release (internal port, localhost-only)
Marks the active reservation as done. No body required. Idempotent:
curl -X POST http://127.0.0.1:18999/release
Responses:
- 200 {"released": true}: active reservation was released; the held /reserve will return {"released": "explicit"}.
- 200 {"released": false, "reason": "no active reservation"}: nothing was in flight, no-op.
Only processes on the Vast instance can reach this port. There is no authentication on it.
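A Python consumer can make the same call without shelling out to curl. The sketch below uses only the standard library; release_reservation is a hypothetical helper, and it assumes the response body is JSON as shown in the responses above.

```python
import json
import urllib.request


def release_reservation(port: int = 18999, timeout: float = 5.0) -> dict:
    """POST /release on the internal control port and return the decoded reply."""
    url = f"http://127.0.0.1:{port}/release"
    request = urllib.request.Request(url, data=b"", method="POST")
    # An empty ProxyHandler keeps localhost traffic away from any configured HTTP proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Since /release is idempotent, calling this helper when nothing is held is harmless; it just reports that there was no active reservation.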
Environment variables
- MAX_RESERVATION_SECONDS: upper bound on how long a single /reserve call can hold a worker if /release is never called. Defaults to 3600.
- BACKEND_HEALTH_URL: absolute URL the framework should healthcheck (e.g. http://127.0.0.1:9090/health). When set, the stub /health route is not registered on the internal server. When unset, the built-in stub is used.
- NULL_CONTROL_PORT: port for the internal control server (hosts /release and optionally /health). Defaults to 18999.
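For reference, these variables and their defaults could be read as in the sketch below. NullWorkerConfig and load_config are hypothetical names, not part of the worker, and treating an empty BACKEND_HEALTH_URL as unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NullWorkerConfig:
    max_reservation_seconds: float
    backend_health_url: Optional[str]
    control_port: int

    @property
    def uses_stub_health(self) -> bool:
        # The stub /health route is only used when no external URL is configured.
        return self.backend_health_url is None


def load_config(env: Mapping[str, str] = os.environ) -> NullWorkerConfig:
    """Read the three documented variables, applying the documented defaults."""
    return NullWorkerConfig(
        max_reservation_seconds=float(env.get("MAX_RESERVATION_SECONDS", "3600")),
        backend_health_url=env.get("BACKEND_HEALTH_URL") or None,  # empty == unset (assumption)
        control_port=int(env.get("NULL_CONTROL_PORT", "18999")),
    )
```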
Deploying on Vast Serverless
- Create a Serverless endpoint and point PYWORKER_REPO at this repository (or your fork).
- Set BACKEND=null in the template so start_server.sh runs workers.null.worker.
- There is no model server to configure; you can omit model-related env vars entirely.
- Run your own queue-consumer process on the instance alongside the PyWorker. When the consumer finishes its work it should:
  curl -X POST http://127.0.0.1:18999/release
  so the held /reserve returns success and the autoscaler can scale the worker down cleanly.
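One way to structure the consumer so that the release is never skipped is a try/finally around the work loop. This is a sketch, not a prescribed design: run_consumer, handle, and release are hypothetical names, and release stands for whatever performs the POST to /release.

```python
from typing import Callable, Iterable


def run_consumer(
    jobs: Iterable[str],
    handle: Callable[[str], None],
    release: Callable[[], object],
) -> int:
    """Process every job, then release the reservation exactly once."""
    done = 0
    try:
        for job in jobs:
            handle(job)
            done += 1
    finally:
        # Runs on normal exit and on errors, so a failed job does not
        # leave the worker held until the duration cap fires.
        release()
    return done
```

Releasing on failure as well as on success is a design choice: it frees the worker promptly, at the cost of reporting the reservation as a success even though the job failed.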
Client example
Single reservation:
python -m workers.null.client --endpoint <ENDPOINT_NAME> --duration 600
To exercise the full flow, shell into the worker and run
curl -X POST http://127.0.0.1:18999/release — the client returns with
{"released": "explicit", ...}.
Staggered demo:
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
Starts three reservations 30s apart (all held concurrently) with a 90s duration each. They scale down one at a time, also 30s apart, then the client exits — a clean trapezoidal load curve for watching scale-up and scale-down in the autoscaler dashboard. Each reservation ends via its duration cap (a 200 success in metrics).
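The shape of that load curve follows from the timings alone. The sketch below counts how many reservations are held at a given moment, assuming start offsets of 0, 30, and 60 seconds and ignoring instance scale-up latency; active_reservations is a hypothetical helper, not part of the client.

```python
def active_reservations(t: float, starts=(0, 30, 60), duration: float = 90) -> int:
    """Number of reservations held at time t (seconds since the first one started)."""
    return sum(1 for start in starts if start <= t < start + duration)


for t in range(0, 151, 30):
    print(t, active_reservations(t))
# 0 1
# 30 2
# 60 3
# 90 2
# 120 1
# 150 0
```

All three reservations overlap between 60s and 90s, which forms the flat top of the trapezoid.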
Notes and caveats
- The HTTP connection from the external caller must stay open for the full reservation. Make sure your client and any intermediate proxies allow long-lived requests (disable idle timeouts, retries, and connection reuse if necessary).
- If your client retries on timeout, you may end up provisioning duplicate workers. Configure duration generously and rely on /release from the consumer to end reservations promptly.
- Avoid disconnecting the external /reserve request as a way to release; that produces a 499 and is counted as a cancellation in Vast metrics. Always release via POST /release on the internal port.
- There is no streaming / heartbeat in the response; the request returns exactly once, when the reservation ends.