Add /release control endpoint to null pyworker

The held /reserve now waits on an asyncio.Event and resolves when the local
queue consumer POSTs /release on the internal control port (127.0.0.1:18999
by default). This produces a 200 success in metrics instead of the 499
cancellation you got from disconnecting the client. The duration cap stays
as a safety net for stuck consumers.

The internal aiohttp server is now unconditional and hosts /release always;
the stub /health route is added only when BACKEND_HEALTH_URL is unset.
NULL_STUB_HEALTH_PORT is renamed to NULL_CONTROL_PORT to reflect the
broader role.
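
A minimal sketch of the wait-with-cap pattern described above, for reviewers who want to see it in isolation. The names `hold_until_released` and `demo` are hypothetical and do not appear in the worker; only `asyncio.Event` and `asyncio.wait_for` are taken from the change itself.

```python
import asyncio


async def hold_until_released(event: asyncio.Event, cap_seconds: float) -> dict:
    """Block until `event` is set, or until the safety cap expires."""
    try:
        # wait_for cancels the inner wait and raises TimeoutError at the cap.
        await asyncio.wait_for(event.wait(), timeout=cap_seconds)
        return {"released": "explicit", "duration_cap": cap_seconds}
    except asyncio.TimeoutError:
        return {"released": "duration_elapsed", "duration": cap_seconds}


async def demo() -> tuple:
    # Case 1: something sets the event (standing in for POST /release).
    event = asyncio.Event()
    asyncio.get_running_loop().call_later(0.01, event.set)
    explicit = await hold_until_released(event, cap_seconds=1.0)

    # Case 2: nobody releases, so the cap fires.
    timed_out = await hold_until_released(asyncio.Event(), cap_seconds=0.01)
    return explicit, timed_out


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both outcomes return normally from the coroutine, which is why the held request can finish with a 200 in either case.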

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rob Ballantyne
2026-05-11 16:59:46 +01:00
parent 89761b378a
commit 254ccdf181
2 changed files with 159 additions and 84 deletions
+76 -41
@@ -2,8 +2,9 @@
 A PyWorker that does **nothing** — it does not forward requests to any model
 server. Each HTTP POST to `/reserve` simply marks the worker as busy and holds
-the request open until the caller disconnects (or a configured timeout
-elapses).
+the request open until the user's queue consumer (running locally on the
+instance) calls `/release` on the internal control port — or a safety
+timeout elapses.
 ## When to use it
@@ -18,11 +19,12 @@ Use this worker when you want to drive Vast Serverless autoscaling but you do
 Serverless autoscaler to spin instances up and down based on demand on
 *your* side.
-For each job your side wants to run on a Vast instance, you POST once to
-`/reserve`. The autoscaler will provision a worker if none is free; the
-request stays open, keeping that worker counted as busy, until you close the
-connection. When you close, the worker goes idle and the autoscaler is free
-to scale it down.
+For each batch of work your side wants on a Vast instance, you POST once to
+`/reserve`. The autoscaler provisions a worker if none is free; the request
+stays open, keeping that worker counted as busy. When your queue consumer
+finishes its work it POSTs `/release` on `127.0.0.1:18999` and the held
+`/reserve` returns `200`, so the request is recorded as a normal success in
+Vast metrics (not a cancellation).
 ## How it works
@@ -33,19 +35,22 @@ to scale it down.
 - `lifecycle` is used instead of `model_log_file`, so there is no log to tail
 and no model server to start. The worker reports itself ready immediately
 after the (trivial) benchmark.
-- The handler is a `remote_function` rather than an HTTP proxy, so the
-framework never tries to forward the request anywhere.
+- The `/reserve` handler is a `remote_function` rather than an HTTP proxy, so
+the framework never tries to forward the request anywhere — it just awaits
+an internal `asyncio.Event`.
+- An internal aiohttp control server, bound to `127.0.0.1`, hosts
+`/release` (and, when no external healthcheck URL is provided, a stub
+`/health`).
 ## Healthchecking
 The framework periodically GETs a healthcheck URL after startup; if it ever
 fails after the first success, the worker is marked errored and the
-autoscaler can decommission it. The null worker exposes two modes:
+autoscaler can decommission it. Two modes:
-- **Stub (default)** — a tiny HTTP server runs on
-`http://127.0.0.1:18999/health` (override the port with
-`NULL_STUB_HEALTH_PORT`) and always returns `200`. This is just enough to
-satisfy the framework while you wire up real consumers.
+- **Stub (default)** — the internal control server also answers
+`GET /health` with `200`. This is just enough to satisfy the framework
+while you wire up real consumers.
 - **Point at your queue consumer (recommended)** — set
 `BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and the
 pyworker will healthcheck *your* consumer instead. If your consumer
@@ -57,39 +62,60 @@ your template.
 ## API
-### `POST /reserve`
+### `POST /reserve` (external port, signed by the autoscaler)
-Holds the worker busy for the lifetime of the request.
+Holds the worker busy until the reservation ends.
 Request body (all fields optional):
 ```json
-{ "duration": 60 }
+{ "duration": 600 }
 ```
-- `duration` (seconds, optional): how long to hold the reservation if the
-client does not disconnect first. Capped by `MAX_RESERVATION_SECONDS` (env
-var, default 3600). If omitted, defaults to the cap.
+- `duration` (seconds, optional): safety cap on how long to hold the
+reservation if no `/release` arrives. Capped by `MAX_RESERVATION_SECONDS`
+(env var, default 3600). If omitted, defaults to that cap.
 Behavior:
-- Returns `200` with `{"released": "duration_elapsed", "duration": <n>}` when
-the duration elapses normally.
-- Returns `499` when the client disconnects (the reservation is released
-immediately).
+- Returns `200` with `{"released": "explicit", ...}` when the local consumer
+POSTs `/release` on the internal port. **This is the intended happy path
+— the request is counted as a success in metrics.**
+- Returns `200` with `{"released": "duration_elapsed", "duration": <n>}` if
+the duration cap fires (safety net for a stuck consumer).
+- Returns `499` if the external client disconnects (counted as cancelled in
+metrics — avoid this; use `/release` instead).
 - Returns `429` if the worker is already busy and queue wait would exceed
 `max_queue_time` (30s by default).
+### `POST /release` (internal port, localhost-only)
+Marks the active reservation as done. No body required. Idempotent:
+```bash
+curl -X POST http://127.0.0.1:18999/release
+```
+Responses:
+- `200 {"released": true}` — active reservation was released; the held
+`/reserve` will return `{"released": "explicit"}`.
+- `200 {"released": false, "reason": "no active reservation"}` — nothing was
+in flight, no-op.
+Only processes on the Vast instance can reach this port. There is no
+authentication on it.
 ## Environment variables
 - `MAX_RESERVATION_SECONDS` — upper bound on how long a single `/reserve`
-call can hold a worker. Defaults to `3600`. Set lower if you want a tighter
-safety cap against stuck clients.
+call can hold a worker if `/release` is never called. Defaults to `3600`.
 - `BACKEND_HEALTH_URL` — absolute URL the framework should healthcheck
-(e.g. `http://127.0.0.1:9090/health`). When set, the stub server does not
-run. When unset, the built-in stub is used.
-- `NULL_STUB_HEALTH_PORT` — port for the built-in stub healthcheck server.
-Defaults to `18999`. Only used when `BACKEND_HEALTH_URL` is unset.
+(e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health` route
+is not registered on the internal server. When unset, the built-in stub
+is used.
+- `NULL_CONTROL_PORT` — port for the internal control server (hosts
+`/release` and optionally `/health`). Defaults to `18999`.
 ## Deploying on Vast Serverless
@@ -100,26 +126,35 @@ Behavior:
 3. There is no model server to configure; you can omit model-related env vars
 entirely.
 4. Run your own queue-consumer process on the instance alongside the
-PyWorker (e.g. as a separate supervisor service started by the template).
+PyWorker. When the consumer finishes its work it should:
+```bash
+curl -X POST http://127.0.0.1:18999/release
+```
+so the held `/reserve` returns success and the autoscaler can scale the
+worker down cleanly.
 ## Client example
 ```bash
-python -m workers.null.client --endpoint <ENDPOINT_NAME> --duration 300
+python -m workers.null.client --endpoint <ENDPOINT_NAME> --duration 600
 ```
-This will POST once to `/reserve`, which causes exactly one worker to be
-provisioned (if none is free) and held busy for up to 300 seconds. Killing
-the client process (Ctrl-C) drops the connection and releases the worker
-early.
+This POSTs once to `/reserve`, which causes exactly one worker to be
+provisioned (if none is free) and held busy. To exercise the full flow,
+shell into the worker and run `curl -X POST http://127.0.0.1:18999/release`
+— the client will return with `{"released": "explicit", ...}`.
 ## Notes and caveats
-- The HTTP connection must stay open for the full reservation. Make sure
-your client and any intermediate proxies allow long-lived requests
-(disable idle timeouts, retries, and connection reuse if necessary).
+- The HTTP connection from the external caller must stay open for the full
+reservation. Make sure your client and any intermediate proxies allow
+long-lived requests (disable idle timeouts, retries, and connection
+reuse if necessary).
 - If your client retries on timeout, you may end up provisioning duplicate
-workers. Use idempotent semantics in *your* queue, or set `duration` to a
-finite value and accept release-on-elapse as the normal exit.
+workers. Configure `duration` generously and rely on `/release` from the
+consumer to end reservations promptly.
+- Avoid disconnecting the external `/reserve` request as a way to release —
+that produces a `499` and is counted as a cancellation in Vast metrics.
+Always release via `POST /release` on the internal port.
 - There is no streaming / heartbeat in the response; the request returns
 exactly once, when the reservation ends.
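
For consumers written in Python rather than shell, the `curl` call documented in the README changes above could be made with the standard library alone. This is a sketch under assumptions: the helper name `release_reservation` is hypothetical, and the default port assumes `NULL_CONTROL_PORT` has not been overridden.

```python
import json
import urllib.request


def release_reservation(
    host: str = "127.0.0.1", port: int = 18999, timeout: float = 5.0
) -> dict:
    """POST /release on the worker's internal control port and return the JSON reply."""
    request = urllib.request.Request(
        f"http://{host}:{port}/release", data=b"", method="POST"
    )
    # An empty ProxyHandler keeps this loopback call away from any
    # HTTP proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Typical use at the end of a consumer's work loop:
#     reply = release_reservation()
#     if not reply.get("released"):
#         ...  # nothing was held; per the README this is a harmless no-op
```

Because `/release` is documented as idempotent, calling the helper when no reservation is active should be safe; the reply then carries `"released": false`.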
+83 -43
@@ -2,6 +2,7 @@ import asyncio
 import logging
 import os
 from contextlib import asynccontextmanager
+from typing import Optional
 from urllib.parse import urlsplit
 from aiohttp import web
@@ -16,8 +17,8 @@ from vastai import (
 log = logging.getLogger(__file__)
-# Safety cap: if a client never disconnects and never sets `duration`, the
-# reservation is auto-released after this many seconds so a stuck client
+# Safety cap: if the user's queue consumer never calls /release, the
+# reservation is auto-released after this many seconds so a forgotten /release
 # can't pin a worker indefinitely. Override with MAX_RESERVATION_SECONDS.
 MAX_RESERVATION_SECONDS = float(os.environ.get("MAX_RESERVATION_SECONDS", 3600))
@@ -25,20 +26,19 @@ MAX_RESERVATION_SECONDS = float(os.environ.get("MAX_RESERVATION_SECONDS", 3600))
 # immediately during capacity estimation instead of sleeping.
 BENCHMARK_SENTINEL = "__null_worker_benchmark__"
-# Healthcheck wiring. The framework periodically GETs
-# `<model_server_url>:<model_server_port><model_healthcheck_url>` and marks the
-# worker errored if that ever fails after the first success. For the null
-# worker we either:
-#   * point at a URL the user supplies via BACKEND_HEALTH_URL — typically
-#     their own queue-consumer's health endpoint, so the autoscaler sees the
-#     worker as broken if the consumer dies, or
-#   * run a tiny built-in stub that always returns 200, so the framework has
-#     something live to talk to until the user wires up a real consumer.
-BACKEND_HEALTH_URL = os.environ.get("BACKEND_HEALTH_URL", "").strip()
-STUB_HEALTH_HOST = "127.0.0.1"
-STUB_HEALTH_PORT = int(os.environ.get("NULL_STUB_HEALTH_PORT", 18999))
+# Internal control server. Hosts:
+#   * POST /release — always available, marks the active reservation as
+#     done so the held /reserve returns 200 (success in metrics, not a
+#     cancellation).
+#   * GET /health — only when no external BACKEND_HEALTH_URL is set; the
+#     framework's healthcheck loop polls it so the worker has a live signal.
+# Bound to 127.0.0.1 so only processes on the instance can reach it.
+INTERNAL_HOST = "127.0.0.1"
+INTERNAL_PORT = int(os.environ.get("NULL_CONTROL_PORT", 18999))
 STUB_HEALTH_PATH = "/health"
+BACKEND_HEALTH_URL = os.environ.get("BACKEND_HEALTH_URL", "").strip()
 if BACKEND_HEALTH_URL:
     _parsed = urlsplit(BACKEND_HEALTH_URL)
     if not _parsed.scheme or not _parsed.hostname:
@@ -48,43 +48,73 @@ if BACKEND_HEALTH_URL:
     HEALTH_BASE_URL = f"{_parsed.scheme}://{_parsed.hostname}"
     HEALTH_PORT = _parsed.port or (443 if _parsed.scheme == "https" else 80)
     HEALTH_PATH = _parsed.path or "/"
-    USE_STUB = False
+    USE_STUB_HEALTH = False
 else:
-    HEALTH_BASE_URL = f"http://{STUB_HEALTH_HOST}"
-    HEALTH_PORT = STUB_HEALTH_PORT
+    HEALTH_BASE_URL = f"http://{INTERNAL_HOST}"
+    HEALTH_PORT = INTERNAL_PORT
     HEALTH_PATH = STUB_HEALTH_PATH
-    USE_STUB = True
+    USE_STUB_HEALTH = True
+# Singleton active reservation. `allow_parallel_requests=False` on the
+# /reserve handler guarantees the framework only runs one at a time per
+# worker, so a single slot is enough.
+_active_reservation: Optional[asyncio.Event] = None
+def _build_internal_app() -> web.Application:
+    app = web.Application()
+    async def release_handler(_request: web.Request) -> web.Response:
+        event = _active_reservation
+        if event is None:
+            return web.json_response(
+                {"released": False, "reason": "no active reservation"},
+                status=200,
+            )
+        event.set()
+        return web.json_response({"released": True}, status=200)
+    app.router.add_post("/release", release_handler)
+    if USE_STUB_HEALTH:
+        async def stub_health(_request: web.Request) -> web.Response:
+            return web.Response(status=200, text="ok")
+        app.router.add_get(STUB_HEALTH_PATH, stub_health)
+    return app
 @asynccontextmanager
 async def null_lifecycle():
-    runner = None
-    if USE_STUB:
-        async def stub_health(_request: web.Request) -> web.Response:
-            return web.Response(status=200, text="ok")
+    app = _build_internal_app()
+    runner = web.AppRunner(app)
+    await runner.setup()
+    site = web.TCPSite(runner, INTERNAL_HOST, INTERNAL_PORT)
+    await site.start()
-        app = web.Application()
-        app.router.add_get(STUB_HEALTH_PATH, stub_health)
-        runner = web.AppRunner(app)
-        await runner.setup()
-        site = web.TCPSite(runner, STUB_HEALTH_HOST, STUB_HEALTH_PORT)
-        await site.start()
-        log.info(
-            f"Null pyworker stub healthcheck listening on "
-            f"http://{STUB_HEALTH_HOST}:{STUB_HEALTH_PORT}{STUB_HEALTH_PATH} "
-            f"(override by setting BACKEND_HEALTH_URL)"
+    lines = [
+        f"Null pyworker internal control server: http://{INTERNAL_HOST}:{INTERNAL_PORT}",
+        f" POST /release - end the active reservation (call from your queue consumer)",
+    ]
+    if USE_STUB_HEALTH:
+        lines.append(
+            f" GET {STUB_HEALTH_PATH} - stub healthcheck (override with BACKEND_HEALTH_URL)"
         )
     else:
-        log.info(f"Null pyworker healthcheck pointing at {BACKEND_HEALTH_URL}")
+        lines.append(f"Framework healthcheck pointed at: {BACKEND_HEALTH_URL}")
+    log.info("\n".join(lines))
     try:
         yield
     finally:
-        if runner is not None:
-            await runner.cleanup()
+        await runner.cleanup()
 async def reserve_worker(**params: object) -> dict:
+    global _active_reservation
     if params.get(BENCHMARK_SENTINEL):
         return {"ok": True, "benchmark": True}
@@ -97,17 +127,27 @@ async def reserve_worker(**params: object) -> dict:
     except (TypeError, ValueError):
         duration = MAX_RESERVATION_SECONDS
+    event = asyncio.Event()
+    _active_reservation = event
     log.info(
-        f"Reservation acquired; holding worker busy for up to {duration:.1f}s "
-        f"(release early by disconnecting the HTTP request)"
+        f"Reservation acquired; awaiting POST /release on "
+        f"http://{INTERNAL_HOST}:{INTERNAL_PORT}/release "
+        f"(auto-release after {duration:.1f}s)"
     )
     try:
-        await asyncio.sleep(duration)
-        log.info("Reservation duration elapsed; releasing worker")
-        return {"released": "duration_elapsed", "duration": duration}
-    except asyncio.CancelledError:
-        log.info("Reservation released by client disconnect")
-        raise
+        try:
+            await asyncio.wait_for(event.wait(), timeout=duration)
+            log.info("Reservation released via /release")
+            return {"released": "explicit", "duration_cap": duration}
+        except asyncio.TimeoutError:
+            log.warning(
+                f"Reservation hit duration cap of {duration:.1f}s without "
+                f"explicit /release; releasing automatically"
+            )
+            return {"released": "duration_elapsed", "duration": duration}
+    finally:
+        if _active_reservation is event:
+            _active_reservation = None
 worker_config = WorkerConfig(