# Null PyWorker

A PyWorker that does **nothing**. It never forwards requests to a model
server. Reservations are modelled as framework **sessions**: creating a
session reserves a worker, and releasing it lets the endpoint scale back down.

## When to use it

Use this worker when you want to drive Vast Serverless autoscaling but do
**not** want inbound requests to reach a model on the instance. Typical setup:

- You already have a job queue on your own infrastructure (Redis, SQS, NATS,
  etc.).
- A separate worker process on the Vast instance pulls work from that queue
  directly. The Vast PyWorker is not involved in the request/response path.
  Your consumer can be written in any language (Node.js, Go, Python, a
  compiled binary); this PyWorker is implementation-agnostic.
- You want one Vast worker per active queue consumer, and you want the
  Serverless autoscaler to spin instances up and down based on demand on
  *your* side.

## How it works

- Reservations use the framework's **session** model. The SDK exposes
  `endpoint.session(cost, lifetime)`, which POSTs to `/session/create` (a
  built-in framework route) and returns a `Session` object usable with
  `async with`. Closing the context (or calling `await session.close()`)
  POSTs to `/session/end`, which is counted as a normal success in metrics.
- `max_sessions=1` on the worker side means a second `/session/create`
  against an already-occupied worker returns `429`. Serverless then routes
  that request to a free worker or scales a new one up.
- Sessions are **excluded from queue-wait math** (the framework filters
  on `if not request.is_session`), so an occupied worker doesn't look like
  it has a request queue piling up. On the worker side, a session counts as
  occupancy, not as work-in-progress.
- `lifecycle` is used instead of `model_log_file`, so there is no log to
  tail and no model server to start. The worker reports itself ready
  immediately after a trivial benchmark.

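The `429`-then-reroute behaviour described above can be pictured with a toy model. This is only a sketch: `ToyWorker` and `reserve` are hypothetical names, not framework classes, and the "scale up" step here is just appending a new object to a list.

```python
from dataclasses import dataclass, field

MAX_SESSIONS = 1  # mirrors max_sessions=1 from the text


@dataclass
class ToyWorker:
    """Hypothetical stand-in for one null worker; not a framework class."""
    name: str
    sessions: set = field(default_factory=set)

    def session_create(self, session_id: str) -> int:
        """Return an HTTP-like status: 200 if reserved, 429 if occupied."""
        if len(self.sessions) >= MAX_SESSIONS:
            return 429
        self.sessions.add(session_id)
        return 200

    def session_end(self, session_id: str) -> int:
        """Free the worker again."""
        self.sessions.discard(session_id)
        return 200


def reserve(workers: list, session_id: str) -> str:
    """Try each worker in turn; add a new one if every worker answers 429."""
    for worker in workers:
        if worker.session_create(session_id) == 200:
            return worker.name
    new_worker = ToyWorker(name=f"worker-{len(workers) + 1}")
    workers.append(new_worker)
    new_worker.session_create(session_id)
    return new_worker.name


pool = [ToyWorker("worker-1")]
placements = [reserve(pool, sid) for sid in ("a", "b")]
```

In this model the second reservation cannot share `worker-1`, so the pool grows to two workers; ending session `"a"` makes `worker-1` reusable for the next reservation.
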
## Healthchecking

The framework periodically GETs a healthcheck URL after startup. If a check
fails at any point after the first success, the worker is marked errored and
the autoscaler can decommission it. Two modes:

- **Stub (default)**: the internal control server also answers
  `GET /health` with `200`. Just enough to satisfy the framework while
  you wire up real consumers.
- **Point at your queue consumer (recommended)**: set
  `BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (an absolute URL) and
  the PyWorker will healthcheck *your* consumer instead. If the consumer
  process crashes, the autoscaler will see the worker as broken.

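For the recommended mode, the consumer needs to answer `GET /health` itself. Below is a minimal sketch using only the Python standard library. The `consumer_alive` flag, the `503` status for an unhealthy consumer, and the JSON body are assumptions made for illustration; the text only establishes that a `200` satisfies the framework. Port `9090` matches the example URL above.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical liveness flag; a real consumer would clear it when its
# work loop dies or loses its queue connection.
consumer_alive = threading.Event()
consumer_alive.set()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        ok = consumer_alive.is_set()
        body = json.dumps({"ok": ok}).encode()
        # Assumption: any non-200 status is treated as a failed check.
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep periodic healthcheck polling out of the consumer's logs


def start_health_server(port: int = 9090) -> ThreadingHTTPServer:
    """Serve /health on localhost in a daemon thread and return the server."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A consumer would call `start_health_server()` once at startup. Because the server lives inside the consumer process, a crash of that process also takes the health endpoint down, which is the failure signal the recommended mode relies on.
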
## API

### Reservation: `POST /session/create` (external, signed)

Not implemented here: the framework provides this route automatically on
every PyWorker. Use the SDK:

```python
import asyncio

from vastai import Serverless


async def main():
    async with Serverless() as client:
        endpoint = await client.get_endpoint(name="my-null-endpoint")
        async with endpoint.session(cost=100, lifetime=600) as s:
            # Worker is now reserved. Your queue dispatcher does whatever it
            # needs to do (typically: enqueue a job that mentions s.session_id).
            ...
        # `async with` exit posts to /session/end → 200 success in metrics


asyncio.run(main())
```

Or raw HTTP (the SDK takes care of autoscaler signing for you, but the
shape of the request is documented here for non-Python clients):

```
POST /session/create
{
  "auth_data": { /* signed by autoscaler */ },
  "payload": {
    "lifetime": 600,
    "on_close_route": "https://your.callback/notify",
    "on_close_payload": {"job_id": "..."}
  }
}
```

### Release from a local consumer: `POST /release` (internal, localhost-only)

Closes the active session, regardless of who created it. No body, no
auth. Use this when the queue consumer doesn't have (and shouldn't need)
the session's `session_auth`:

```bash
curl -X POST http://127.0.0.1:18999/release
```

Responses:

- `200 {"released": true, "session_ids": ["..."]}`: the session was closed;
  the held client-side `/session/create` completes and counts as a success.
- `200 {"released": false, "reason": "no active session"}`: nothing was
  active, so the call is a no-op.

For setups where the dispatcher can hand the consumer `session_auth`
(e.g. as part of the queue payload), the consumer can instead POST
`/session/end` on the framework's HTTP-only port
(`$WORKER_HTTP_PORT`, default `WORKER_PORT+1`). This is the standard, fully
authenticated release path.

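A Python consumer can make the same `/release` call as the `curl` command without shelling out. The helper below is a sketch built on the standard library; `release_reservation` is a hypothetical name, and reading `NULL_CONTROL_PORT` from the consumer's own environment assumes the consumer is started with the same variables as the PyWorker.

```python
import json
import os
import urllib.request


def release_reservation(port=None, timeout=5.0) -> dict:
    """POST /release on the null worker's internal control server.

    Returns the decoded JSON body, e.g. {"released": true, ...}.
    """
    if port is None:
        # Assumption: the consumer sees the same env var as the PyWorker.
        port = int(os.environ.get("NULL_CONTROL_PORT", "18999"))
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/release",
        data=b"",  # the route takes no body
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode())
```

Both documented responses use status `200`, so callers should inspect the `released` field rather than the status code to tell a real release from a no-op.
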
## Environment variables

- `BACKEND_HEALTH_URL`: absolute URL the framework should healthcheck
  (e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health`
  route is not registered on the internal server.
- `NULL_CONTROL_PORT`: port for the internal control server (hosts
  `/release` and optionally `/health`). Defaults to `18999`.

## Deploying on Vast Serverless

1. Create a Serverless endpoint and point `PYWORKER_REPO` at this
   repository (or your fork).
2. Set `BACKEND=null` in the template so `start_server.sh` runs
   `workers.null.worker`.
3. There is no model server to configure; you can omit model-related env
   vars entirely.
4. Run your own queue-consumer process on the instance alongside the
   PyWorker. When it finishes its work, have it release the reservation:
   ```bash
   curl -X POST http://127.0.0.1:18999/release
   ```

### Endpoint scaling parameters

The null worker reports `max_perf = 100`, and each reservation is a
session whose baseline `cost` is `100`, i.e. one worker's worth of capacity
(a higher value is recommended below). Set the endpoint accordingly:

- **`target_util = 1.0`**: required. The default of `0.9` reserves
  ~11% spare capacity, which for a unit-occupancy worker rounds up to a
  whole extra worker (e.g. `min_load = 100` becomes `100 / 0.9 = 111.1`
  → 2 active workers instead of 1). With `target_util = 1.0` the math
  is clean: `min_load = 100 * N` keeps exactly `N` workers active.
- **`min_load`**: set to `100 * N` for `N` always-on workers (with
  `target_util = 1.0`).
- **`max_workers`**: cap on the number of reservations the endpoint can
  serve concurrently.
- **Session `cost = 2 × max_perf`** (e.g. `200` when `max_perf = 100`):
  recommended. Reporting `cost = max_perf` puts each occupied worker at
  exactly 100% utilization, which the autoscaler reads as "at target,
  no action needed." With two workers occupied, a third reservation then
  gets a `429` from both and stalls in the autoscaler's global queue
  indefinitely instead of activating a cold worker.

  Bumping `cost` above `max_perf` makes each session look like more than
  one worker's worth of work (`cur_load / max_perf > 1.0`), so the
  autoscaler keeps an extra active worker hot per session. This trades
  slight over-provisioning for predictable scale-up. The demo client
  defaults to `--session-cost 200`.
- **`max_queue_time = 0`** (or very small, e.g. `0.1`): required.
  The per-worker `wait_time` property used internally to reject
  requests filters sessions out, but the **autoscaler** computes its
  own queue-time estimate from `cur_load / max_perf`, and `cur_load`
  *does* include sessions. With defaults around 30s, an occupied null
  worker (`cur_load = 100`, `max_perf = 100`, queue estimate = 1s)
  looks "available", so the autoscaler keeps routing extra reservations
  there, getting `429`s and queueing them instead of scaling up. Setting
  `max_queue_time = 0` makes any in-flight load mark the worker "full"
  for routing.
- **`target_queue_time = 0`**: required. This is an aggressive scale-up
  trigger. Combined with `max_queue_time = 0`, which keeps occupied workers
  off the routing table, it ensures the autoscaler provisions a new worker
  the moment all existing ones are occupied rather than queueing on its
  side. The queue-time math conceptually assumes work *completes in
  proportion to load*, which doesn't hold for sessions (they can last
  hours, not `cur_load / max_perf` seconds). Zeroing both knobs tells
  the autoscaler "don't estimate when this worker will free up; route
  to a free one or make a new one."
- **`inactivity_timeout`**: works as expected. After N seconds idle (no
  active sessions), the endpoint is permitted to scale down past `min_load`.

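The arithmetic behind `target_util` and `max_queue_time` can be checked with a few lines of Python. This is a simplified model reconstructed from the numbers quoted above, not the autoscaler's actual code; the function names and the `<=` comparison are assumptions.

```python
import math


def active_workers(min_load: float, target_util: float, max_perf: float = 100.0) -> int:
    """Workers kept active to cover min_load at the given target utilization."""
    return math.ceil(min_load / target_util / max_perf)


def looks_available(cur_load: float, max_perf: float, max_queue_time: float) -> bool:
    """Simplified routing test: is the queue-time estimate within the limit?"""
    queue_estimate = cur_load / max_perf  # seconds, per the text
    return queue_estimate <= max_queue_time


# target_util: the default of 0.9 rounds one worker's load up to two workers.
default_util = active_workers(min_load=100, target_util=0.9)   # 2
clean_util = active_workers(min_load=100, target_util=1.0)     # 1

# max_queue_time: with a 30s limit an occupied worker still looks routable.
occupied_30s = looks_available(cur_load=100, max_perf=100, max_queue_time=30)  # True
occupied_0s = looks_available(cur_load=100, max_perf=100, max_queue_time=0)    # False
idle_0s = looks_available(cur_load=0, max_perf=100, max_queue_time=0)          # True
```

Under this model, `max_queue_time = 0` leaves only idle workers routable, which matches the stated goal of sending each new reservation to a free worker or to a newly provisioned one.
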
## Client example

Single reservation (holds for 180s):

```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME>
```

Staggered demo:

```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
```

The demo starts three sessions 30s apart (all held concurrently), holds the
3-worker plateau for 5 minutes so the autoscaler has time to actually
provision the third worker before any scale-down starts, then closes
the sessions one at a time, also 30s apart, and exits. Every session
ends cleanly via the SDK's `session.close()`, which shows up as `200`
successes in metrics with no cancellations.

Tune the timing with `--interval` and `--plateau`. To exercise the
local-release path, shell into a worker and run
`curl -X POST http://127.0.0.1:18999/release`.

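To see how `--interval` and `--plateau` shape the demo, the timeline can be computed directly. This sketch only reproduces the schedule described in the prose; `demo_schedule` is a hypothetical helper, not part of the client, and it assumes both flags are given in seconds with defaults of 30 and 300.

```python
def demo_schedule(sessions: int = 3, interval: int = 30, plateau: int = 300) -> list:
    """Return (seconds_from_start, action) pairs for the staggered demo."""
    # Sessions open one interval apart, starting at t=0.
    opens = [i * interval for i in range(sessions)]
    # The plateau is measured from the moment the last session opens.
    first_close = opens[-1] + plateau
    # Sessions then close one interval apart.
    closes = [first_close + i * interval for i in range(sessions)]
    return [(t, "open") for t in opens] + [(t, "close") for t in closes]


timeline = demo_schedule()
total_runtime = timeline[-1][0]  # 420 seconds with the assumed defaults
```

With the assumed defaults the sessions open at 0s, 30s and 60s and close at 360s, 390s and 420s, so a full demo run takes about seven minutes.
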
## Notes and caveats

- The reservation's lifetime caps how long the session can live without
  client activity. Set it comfortably longer than the work you expect to
  do, or have the client periodically POST `/ping` with `session_id` to
  extend it.
- The `on_close_payload` (passed at `/session/create`) is POSTed to
  `on_close_route` by the framework when the session ends. This is useful
  for notifying your queue consumer that the reservation is closing.
- `/release` on the internal port is convenient but bypasses
  `session_auth`. If you need the standard authenticated release flow,
  pass `session_auth` to your consumer (e.g. through the queue payload)
  and have it POST to `/session/end` on the framework's HTTP port
  instead.