Rewrite null pyworker on the framework session model

Drop the held-/reserve approach in favour of the framework's session
primitive (max_sessions=1 + /session/create). Sessions are excluded from
the autoscaler's queue-wait math and don't suffer the cur_perf=0
degradation that a long-held request did, so this naturally produces the
"one request comes in and you get a worker; release and it scales back
down" model we were hand-rolling.

Server side:
  - max_sessions=1; framework auto-registers /session/* routes
  - Drop custom /reserve handler, _active_reservation event,
    max_queue_time=0.0, MAX_RESERVATION_SECONDS, _perf_heartbeat
  - Trivial /ping handler exists only to satisfy the framework's
    "at least one handler with BenchmarkConfig" requirement (and to give
    clients an extension/keepalive route)
  - /release on the internal control port is kept as a convenience for
    queue consumers that don't carry session_auth; it calls the
    framework's __close_session via name-mangling, which bypasses the
    session_auth check but is fine for a localhost-only endpoint
  - Workload/perf back to 100 (conventional)

Client side:
  - Uses endpoint.session(cost, lifetime) instead of POST /reserve
  - async with the SDK Session; close on exit posts /session/end with
    proper auth → 200 success in metrics
  - Demo and single modes both ride the same reserve() helper

Sessions landed in vastai-sdk 0.4.2 (commit ec9ef59, 2026-01-20).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rob Ballantyne
2026-05-12 10:51:24 +01:00
parent 6c2f194b28
commit 6a562a1376
3 changed files with 206 additions and 252 deletions
+90 -94
@@ -1,10 +1,8 @@
# Null PyWorker
A PyWorker that does **nothing** — it does not forward requests to any model
server. Each HTTP POST to `/reserve` simply marks the worker as busy and holds
the request open until the user's queue consumer (running locally on the
instance) calls `/release` on the internal control port — or a safety
timeout elapses.
server. Reservations are modelled as framework **sessions**: a request
comes in and you get a worker; release and it scales back down.
## When to use it
@@ -15,32 +13,29 @@ Use this worker when you want to drive Vast Serverless autoscaling but you do
etc.).
- A separate worker process on the Vast instance pulls work from that queue
directly. The Vast PyWorker is not involved in the request/response path.
Your consumer can be written in any language (Node.js, Go, Python, a
compiled binary); this PyWorker is implementation-agnostic.
- You want one Vast worker per active queue consumer, and you want the
Serverless autoscaler to spin instances up and down based on demand on
*your* side.
A request comes in and you get a worker. Release and it scales back down.
POST to `/reserve` and serverless gives you a worker, held busy for the
lifetime of the request. When your queue consumer is done, POST to
`/release` on the internal port (`127.0.0.1:18999` by default) and the
held `/reserve` returns `200`.
## How it works
- `allow_parallel_requests=False` and `max_queue_time=0.0`, so one in-flight
`/reserve` fully occupies the worker and any further request that lands
on it is rejected with `429` immediately — serverless will route to a
free worker or scale a new one up.
- `lifecycle` is used instead of `model_log_file`, so there is no log to tail
and no model server to start. The worker reports itself ready immediately
after the (trivial) benchmark.
- The `/reserve` handler is a `remote_function` rather than an HTTP proxy, so
the framework never tries to forward the request anywhere — it just awaits
an internal `asyncio.Event`.
- An internal aiohttp control server, bound to `127.0.0.1`, hosts
`/release` (and, when no external healthcheck URL is provided, a stub
`/health`).
- Reservations use the framework's **session** model. The SDK exposes
`endpoint.session(cost, lifetime)` which POSTs to `/session/create` (a
built-in framework route) and returns a `Session` object usable as
`async with`. Closing the context (or calling `await session.close()`)
POSTs to `/session/end` — counted as a normal success in metrics.
- `max_sessions=1` on the worker side means a second `/session/create`
against an already-occupied worker returns `429`. Serverless routes
that request to a free worker or scales a new one up.
- Sessions are **excluded from queue-wait math** (the framework filters
`if not request.is_session`), so an occupied worker doesn't look like
it has a request queue piling up. The autoscaler treats a session as
occupancy, not as work-in-progress.
- `lifecycle` is used instead of `model_log_file`, so there is no log to
tail and no model server to start. The worker reports itself ready
immediately after a trivial benchmark.
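The occupancy rules in the list above can be sketched with a toy model. This is not the framework's code: `SessionGate`, `Request`, and `queued_requests` are hypothetical names invented for illustration. Only the `max_sessions` limit, the `429` status, and the `is_session` filter come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for a request tracked by a worker."""
    request_id: str
    is_session: bool = False


@dataclass
class SessionGate:
    """Toy model of a worker that admits at most `max_sessions` sessions."""
    max_sessions: int = 1
    active: dict = field(default_factory=dict)

    def create(self, session_id: str) -> int:
        # A full worker answers 429 so the caller is routed elsewhere.
        if len(self.active) >= self.max_sessions:
            return 429
        self.active[session_id] = Request(session_id, is_session=True)
        return 200

    def end(self, session_id: str) -> int:
        # Ending a session frees the slot for the next create call.
        self.active.pop(session_id, None)
        return 200


def queued_requests(requests):
    """Sessions count as occupancy, not queued work, so they are filtered out."""
    return [r for r in requests if not r.is_session]


gate = SessionGate(max_sessions=1)
first = gate.create("a")    # slot is free
second = gate.create("b")   # worker is already occupied
gate.end("a")
third = gate.create("b")    # slot was released
print(first, second, third)
```

In this model the second create is rejected while the first session is open, and a worker holding only a session reports an empty queue.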
## Healthchecking
@@ -49,48 +44,52 @@ fails after the first success, the worker is marked errored and the
autoscaler can decommission it. Two modes:
- **Stub (default)** — the internal control server also answers
`GET /health` with `200`. This is just enough to satisfy the framework
while you wire up real consumers.
`GET /health` with `200`. Just enough to satisfy the framework while
you wire up real consumers.
- **Point at your queue consumer (recommended)** — set
`BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and the
pyworker will healthcheck *your* consumer instead. If your consumer
`BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and
the pyworker will healthcheck *your* consumer instead. If the consumer
process crashes, the autoscaler will see the worker as broken.
Run your queue consumer on the instance alongside the PyWorker, expose a
plain `/health` endpoint on it, then set `BACKEND_HEALTH_URL` accordingly in
your template.
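For a Python consumer, a plain `/health` endpoint could look roughly like the sketch below. It uses only the standard library. `HealthHandler`, `start_health_server`, and `check_health` are hypothetical names, and port `9090` is taken from the example URL above rather than from any framework default.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep healthcheck polling out of the consumer's logs


def start_health_server(port: int = 9090) -> ThreadingHTTPServer:
    """Serve /health on localhost from a daemon thread and return the server."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check_health(port: int) -> int:
    """Return the HTTP status of GET /health on the given local port."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/health")
        return conn.getresponse().status
    finally:
        conn.close()
```

Calling `start_health_server()` once at consumer startup and setting `BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` would then tie the worker's health to the consumer process: if the process dies, the endpoint stops answering.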
## API
### `POST /reserve` (external port, signed by the autoscaler)
### Reservation: `POST /session/create` (external, signed)
Holds the worker busy until the reservation ends.
Not implemented here — the framework provides this route automatically on
every PyWorker. Use the SDK:
Request body (all fields optional):
```python
from vastai import Serverless
```json
{ "duration": 600 }
async with Serverless() as client:
endpoint = await client.get_endpoint(name="my-null-endpoint")
async with endpoint.session(cost=100, lifetime=600) as s:
# Worker is now reserved. Your queue dispatcher does whatever it
# needs to do (typically: enqueue a job that mentions s.session_id).
...
# `async with` exit posts to /session/end → 200 success in metrics
```
- `duration` (seconds, optional): safety cap on how long to hold the
reservation if no `/release` arrives. Capped by `MAX_RESERVATION_SECONDS`
(env var, default 3600). If omitted, defaults to that cap.
Or raw HTTP (the SDK takes care of autoscaler signing for you, but the
shape of the request is documented for non-Python clients):
Behavior:
```
POST /session/create
{
"auth_data": { /* signed by autoscaler */ },
"payload": {
"lifetime": 600,
"on_close_route": "https://your.callback/notify",
"on_close_payload": {"job_id": "..."}
}
}
```
- Returns `200` with `{"released": "explicit", ...}` when the local consumer
POSTs `/release` on the internal port. **This is the intended happy path
— the request is counted as a success in metrics.**
- Returns `200` with `{"released": "duration_elapsed", "duration": <n>}` if
the duration cap fires (safety net for a stuck consumer).
- Returns `499` if the external client disconnects (counted as cancelled in
metrics — avoid this; use `/release` instead).
- Returns `429` immediately if the worker is already holding a reservation
(so serverless routes the request to a free worker instead of queueing).
### Release from a local consumer: `POST /release` (internal, localhost-only)
### `POST /release` (internal port, localhost-only)
Marks the active reservation as done. No body required. Idempotent:
Closes the active session, regardless of who created it. No body, no
auth. Use this when the queue consumer doesn't have (and shouldn't need)
the session's `session_auth`:
```bash
curl -X POST http://127.0.0.1:18999/release
@@ -98,78 +97,75 @@ curl -X POST http://127.0.0.1:18999/release
Responses:
- `200 {"released": true}` — active reservation was released; the held
`/reserve` will return `{"released": "explicit"}`.
- `200 {"released": false, "reason": "no active reservation"}` — nothing was
in flight, no-op.
- `200 {"released": true, "session_ids": ["..."]}` — closed; the held
client-side `/session/create` completes and counts as a success.
- `200 {"released": false, "reason": "no active session"}` — nothing
active, no-op.
Only processes on the Vast instance can reach this port. There is no
authentication on it.
For setups where the dispatcher can hand the consumer `session_auth`
(e.g. as part of the queue payload), the consumer can instead POST
`/session/end` on the framework's HTTP-only port
(`$WORKER_HTTP_PORT`, default `WORKER_PORT+1`) — the standard, fully
authenticated release path.
## Environment variables
- `MAX_RESERVATION_SECONDS` — upper bound on how long a single `/reserve`
call can hold a worker if `/release` is never called. Defaults to `3600`.
- `BACKEND_HEALTH_URL` — absolute URL the framework should healthcheck
(e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health` route
is not registered on the internal server. When unset, the built-in stub
is used.
(e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health`
route is not registered on the internal server.
- `NULL_CONTROL_PORT` — port for the internal control server (hosts
`/release` and optionally `/health`). Defaults to `18999`.
## Deploying on Vast Serverless
1. Create a Serverless endpoint and point `PYWORKER_REPO` at this repository
(or your fork).
1. Create a Serverless endpoint and point `PYWORKER_REPO` at this
repository (or your fork).
2. Set `BACKEND=null` in the template so `start_server.sh` runs
`workers.null.worker`.
3. There is no model server to configure; you can omit model-related env vars
entirely.
3. There is no model server to configure; you can omit model-related env
vars entirely.
4. Run your own queue-consumer process on the instance alongside the
PyWorker. When the consumer finishes its work it should:
PyWorker. When it finishes its work:
```bash
curl -X POST http://127.0.0.1:18999/release
```
so the held `/reserve` returns success and the autoscaler can scale the
worker down cleanly.
## Client example
Single reservation:
Single reservation (holds for 180s):
```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME> --duration 600
python -m workers.null.client --endpoint <ENDPOINT_NAME>
```
To exercise the full flow, shell into the worker and run
`curl -X POST http://127.0.0.1:18999/release` — the client returns with
`{"released": "explicit", ...}`.
Staggered demo:
```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
```
Starts three reservations 30s apart (all held concurrently), holds the
Starts three sessions 30s apart (all held concurrently), holds the
3-worker plateau for 5 minutes so the autoscaler has time to actually
provision the third worker before any scale-down starts, then scales
down one worker at a time, also 30s apart, and exits.
provision the third worker before any scale-down starts, then closes
the sessions one at a time, also 30s apart, and exits. Every session
ends cleanly via the SDK's `session.close()` — `200` successes in
metrics, no cancellations.
Each reservation ends via its duration cap (a 200 success in metrics).
Tune the timing with `--interval` and `--plateau`.
Tune the timing with `--interval` and `--plateau`. To exercise the
local-release path, shell into a worker and run
`curl -X POST http://127.0.0.1:18999/release`.
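The demo timing can be worked out with a small calculation. This is a sketch of the arithmetic only, not the client's code: `demo_schedule` is a hypothetical helper, and it assumes the plateau is measured from the moment the last session opens and that the first close happens as soon as the plateau ends.

```python
def demo_schedule(sessions: int = 3, interval: float = 30.0, plateau: float = 300.0):
    """Return (open_times, close_times) in seconds from the start of the demo."""
    open_times = [i * interval for i in range(sessions)]
    plateau_start = open_times[-1]          # the last session has just opened
    first_close = plateau_start + plateau   # assumed: closes begin when the plateau ends
    close_times = [first_close + i * interval for i in range(sessions)]
    return open_times, close_times


opens, closes = demo_schedule()
print(opens)   # [0.0, 30.0, 60.0]
print(closes)  # [360.0, 390.0, 420.0]
```

Under these assumptions the default demo runs for about seven minutes, and shortening `--plateau` shortens the run by the same amount.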
## Notes and caveats
- The HTTP connection from the external caller must stay open for the full
reservation. Make sure your client and any intermediate proxies allow
long-lived requests (disable idle timeouts, retries, and connection
reuse if necessary).
- If your client retries on timeout, you may end up provisioning duplicate
workers. Configure `duration` generously and rely on `/release` from the
consumer to end reservations promptly.
- Avoid disconnecting the external `/reserve` request as a way to release —
that produces a `499` and is counted as a cancellation in Vast metrics.
Always release via `POST /release` on the internal port.
- There is no streaming / heartbeat in the response; the request returns
exactly once, when the reservation ends.
- The reservation's lifetime caps how long the session can live without
client activity. Set it comfortably longer than the work you expect to
do, or have the client periodically POST `/ping` with `session_id` to
extend it.
- The `on_close_payload` (passed at `/session/create` together with
`on_close_route`) is POSTed by the framework to `on_close_route` when
the session ends. Useful for notifying your queue consumer that the
reservation is closing.
- `/release` on the internal port is convenient but bypasses
`session_auth`. If you need the standard authenticated release flow,
pass `session_auth` to your consumer (e.g. through the queue payload)
and have it POST to `/session/end` on the framework's HTTP port
instead.
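A Python consumer can drive the local-release path without shelling out to `curl`. The sketch below uses only the standard library. `control_port` and `release` are hypothetical helper names, and reading `NULL_CONTROL_PORT` in the consumer assumes it shares the worker's environment. The route, the default port, and the `200` JSON replies are the ones documented above.

```python
import http.client
import json
import os


def control_port() -> int:
    """Port of the internal control server (NULL_CONTROL_PORT, default 18999)."""
    return int(os.environ.get("NULL_CONTROL_PORT", "18999"))


def release(port=None, timeout: float = 10.0) -> dict:
    """POST /release on the localhost control port and return the JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", port or control_port(), timeout=timeout)
    try:
        conn.request("POST", "/release")  # no body and no auth are required
        resp = conn.getresponse()
        body = resp.read()
        if resp.status != 200:
            raise RuntimeError(f"/release returned {resp.status}")
        return json.loads(body)
    finally:
        conn.close()
```

The reply's `released` field tells the consumer whether a session was actually closed (`true`) or whether nothing was active (`false`). Since the call is documented as a no-op in the second case, a consumer can call it unconditionally when it finishes its work.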