A PyWorker that does not forward to any model server. POST /reserve holds the worker busy until the client disconnects (or the duration cap elapses), so users with their own job queue can drive Vast autoscaling without exposing inbound model traffic on the instance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Null PyWorker
A PyWorker that does nothing — it does not forward requests to any model
server. Each HTTP POST to /reserve simply marks the worker as busy and holds
the request open until the caller disconnects (or a configured timeout
elapses).
When to use it
Use this worker when you want to drive Vast Serverless autoscaling but you do not want inbound requests to reach a model on the instance. Typical setup:
- You already have a job queue on your own infrastructure (Redis, SQS, NATS, etc.).
- A separate worker process on the Vast instance pulls work from that queue directly. The Vast PyWorker is not involved in the request/response path.
- You want one Vast worker per active queue consumer, and you want the Serverless autoscaler to spin instances up and down based on demand on your side.
For each job your side wants to run on a Vast instance, you POST once to
/reserve. The autoscaler will provision a worker if none is free; the
request stays open, keeping that worker counted as busy, until you close the
connection. When you close, the worker goes idle and the autoscaler is free
to scale it down.
How it works
- allow_parallel_requests=False, so one in-flight /reserve fully occupies the worker. Any second request that lands on the same worker queues (or is rejected with 429 after max_queue_time), pushing the autoscaler to provision more workers.
- lifecycle is used instead of model_log_file, so there is no log to tail and no model server to start. The worker reports itself ready immediately after the (trivial) benchmark.
- The handler is a remote_function rather than an HTTP proxy, so the framework never tries to forward the request anywhere.
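The hold-until-disconnect behavior can be pictured with a minimal asyncio sketch. This is not the worker's actual code: hold_reservation and the disconnected event are hypothetical stand-ins for whatever the PyWorker framework uses to signal that the caller has gone away. Only the status codes and the 200 body come from this README.

```python
import asyncio
from typing import Optional, Tuple


async def hold_reservation(
    duration: float, disconnected: asyncio.Event
) -> Tuple[int, Optional[dict]]:
    """Block until `duration` seconds pass or `disconnected` is set.

    Hypothetical helper: `disconnected` stands in for the framework's own
    client-disconnect signal, which this sketch does not try to reproduce.
    """
    try:
        # Wait for the disconnect signal, but give up after `duration` seconds.
        await asyncio.wait_for(disconnected.wait(), timeout=duration)
    except asyncio.TimeoutError:
        # The duration elapsed first: normal release.
        return 200, {"released": "duration_elapsed", "duration": duration}
    # The caller went away first: release immediately.
    return 499, None
```

Because the coroutine does nothing but wait, the worker stays marked busy for as long as it is pending, which is the entire point of the reservation.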
API
POST /reserve
Holds the worker busy for the lifetime of the request.
Request body (all fields optional):
{ "duration": 60 }
- duration (seconds, optional): how long to hold the reservation if the client does not disconnect first. Capped by MAX_RESERVATION_SECONDS (env var, default 3600). If omitted, defaults to the cap.
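The capping rule for duration can be sketched as follows. The function name effective_duration is hypothetical, and the sketch assumes the request body has already been parsed into a dict. It does not validate negative or non-numeric values, which the real worker may treat differently.

```python
import os
from typing import Mapping

DEFAULT_CAP_SECONDS = 3600  # default for MAX_RESERVATION_SECONDS per this README


def effective_duration(body: Mapping, environ: Mapping = os.environ) -> float:
    """Return how long a reservation would be held, in seconds.

    Hypothetical helper illustrating the rule: an omitted duration falls
    back to the cap, and a requested duration never exceeds the cap.
    """
    cap = float(environ.get("MAX_RESERVATION_SECONDS", DEFAULT_CAP_SECONDS))
    requested = body.get("duration")
    if requested is None:
        return cap
    return min(float(requested), cap)
```

For example, a body of { "duration": 7200 } with the default cap would be held for at most 3600 seconds under this rule.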
Behavior:
- Returns 200 with {"released": "duration_elapsed", "duration": <n>} when the duration elapses normally.
- Returns 499 when the client disconnects (the reservation is released immediately).
- Returns 429 if the worker is already busy and queue wait would exceed max_queue_time (30s by default).
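A caller or log-processing script might fold these status codes into a small set of outcomes. The sketch below is illustrative only: classify_reserve_result and its labels are hypothetical, not part of the worker. A client that has already disconnected would normally not read the 499 itself, so that branch is mostly useful when inspecting server-side records.

```python
from typing import Optional


def classify_reserve_result(status: int, body: Optional[dict] = None) -> str:
    """Map a /reserve status code to a hypothetical outcome label."""
    if status == 200 and (body or {}).get("released") == "duration_elapsed":
        return "elapsed"  # reservation ran for its full duration
    if status == 429:
        return "rejected_busy"  # queue wait exceeded max_queue_time
    if status == 499:
        return "client_closed"  # caller disconnected, worker released
    return "unexpected"
```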
Environment variables
- MAX_RESERVATION_SECONDS: upper bound on how long a single /reserve call can hold a worker. Defaults to 3600. Set lower if you want a tighter safety cap against stuck clients.
Deploying on Vast Serverless
- Create a Serverless endpoint and point PYWORKER_REPO at this repository (or your fork).
- Set BACKEND=null in the template so start_server.sh runs workers.null.worker.
- There is no model server to configure; you can omit model-related env vars entirely.
- Run your own queue-consumer process on the instance alongside the PyWorker (e.g. as a separate supervisor service started by the template).
Client example
python -m workers.null.client --endpoint <ENDPOINT_NAME> --duration 300
This will POST once to /reserve, which causes exactly one worker to be
provisioned (if none is free) and held busy for up to 300 seconds. Killing
the client process (Ctrl-C) drops the connection and releases the worker
early.
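To tie a reservation to the lifetime of a job, one option is to run the client as a child process and stop it when the job finishes. The sketch below assumes the CLI shown above is importable in the current environment. reserve_command and held_reservation are hypothetical helper names, and the sketch stops the child with terminate() rather than Ctrl-C, on the assumption that ending the process closes the connection either way.

```python
import contextlib
import subprocess
import sys
from typing import Iterator, List


def reserve_command(endpoint: str, duration: int) -> List[str]:
    """Build the client command line shown in this README."""
    return [
        sys.executable, "-m", "workers.null.client",
        "--endpoint", endpoint,
        "--duration", str(duration),
    ]


@contextlib.contextmanager
def held_reservation(cmd: List[str]) -> Iterator[subprocess.Popen]:
    """Run `cmd` for the duration of the with-block, then stop it.

    Stopping the client process drops its connection, which is what
    releases the worker early.
    """
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:  # still running: ask it to exit
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()  # last resort if it ignores terminate()
                proc.wait()
```

A queue consumer could then wrap each job in `with held_reservation(reserve_command("<ENDPOINT_NAME>", 300)):` so the worker is released as soon as the job body returns or raises.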
Notes and caveats
- The HTTP connection must stay open for the full reservation. Make sure your client and any intermediate proxies allow long-lived requests (disable idle timeouts, retries, and connection reuse if necessary).
- If your client retries on timeout, you may end up provisioning duplicate workers. Use idempotent semantics in your queue, or set duration to a finite value and accept release-on-elapse as the normal exit.
- There is no streaming / heartbeat in the response; the request returns exactly once, when the reservation ends.