Rewrite null pyworker on the framework session model
Drop the held /reserve request approach in favour of the framework's session
primitive (max_sessions=1 + /session/create). Sessions are excluded from
the autoscaler's queue-wait math and don't suffer the cur_perf=0
degradation that a long-held request did, so this naturally produces the
"one request comes in and you get a worker; release and it scales back
down" model we were hand-rolling.
Server side:
- max_sessions=1; framework auto-registers /session/* routes
- Drop custom /reserve handler, _active_reservation event,
max_queue_time=0.0, MAX_RESERVATION_SECONDS, _perf_heartbeat
- Trivial /ping handler exists only to satisfy the framework's
"at least one handler with BenchmarkConfig" requirement (and to give
clients an extension/keepalive route)
- /release on the internal control port is kept as a convenience for
queue consumers that don't carry session_auth — calls the framework's
__close_session via name-mangling, which bypasses the session_auth
check but is fine for a localhost-only endpoint
- Workload/perf back to 100 (conventional)
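The name-mangling access used by /release can be illustrated with a minimal, self-contained sketch. The `Backend` class below is a hypothetical stand-in, not the framework's real class; only the attribute name `_Backend__close_session` and the `sessions` dict are taken from the diff, and the method body is an assumption made for illustration.

```python
import asyncio


class Backend:
    """Hypothetical stand-in for the framework's backend (assumption)."""

    def __init__(self) -> None:
        # One active session, mirroring max_sessions=1.
        self.sessions = {"abc123": object()}

    async def __close_session(self, session_id: str) -> bool:
        # A double-underscore name defined inside a class body is stored
        # by Python as _Backend__close_session.
        return self.sessions.pop(session_id, None) is not None


async def release_all(backend: Backend) -> list[str]:
    """Close every active session without presenting session_auth."""
    closed = []
    # Copy the keys first because closing a session mutates the dict.
    for sid in list(backend.sessions.keys()):
        # Outside the class, the mangled name has to be spelled out in full.
        if await backend._Backend__close_session(sid):
            closed.append(sid)
    return closed


if __name__ == "__main__":
    print(asyncio.run(release_all(Backend())))
```

Name mangling is a naming convention rather than an access control, so this call pattern depends on the framework keeping both the class name and the method name unchanged.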
Client side:
- Uses endpoint.session(cost, lifetime) instead of POST /reserve
- Wraps the SDK Session in an async-with block; exiting the block POSTs
/session/end with proper auth, which counts as a 200 success in metrics
- Demo and single modes both ride the same reserve() helper
Sessions landed in vastai-sdk 0.4.2 (commit ec9ef59, 2026-01-20).
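To make the demo timing concrete, here is a small worked sketch of the schedule that the `run_demo` docstring and `reserve()` in the diff describe. The function name `demo_schedule` is hypothetical, and the default values (30s interval, 300s plateau) are assumptions taken from the README wording ("30s apart", "5 minutes") rather than from the CLI defaults, which this diff does not show.

```python
def demo_schedule(n: int = 3, interval: float = 30.0, plateau: float = 300.0):
    """Return (hold, lifetime, start_offsets, release_offsets) in seconds."""
    # Every session holds for the same time, as in run_demo().
    hold = (n - 1) * interval + plateau
    # reserve() asks for a lifetime 60s longer than the hold, so the
    # session does not expire before it is closed.
    lifetime = hold + 60
    # Session i starts i * interval seconds after the first one.
    starts = [i * interval for i in range(n)]
    # Each session is released `hold` seconds after its own start.
    releases = [start + hold for start in starts]
    return hold, lifetime, starts, releases


if __name__ == "__main__":
    hold, lifetime, starts, releases = demo_schedule()
    print(f"hold={hold:.0f}s lifetime={lifetime:.0f}s")
    print("starts:  ", starts)
    print("releases:", releases)
```

With these assumed defaults, each session holds for 360s with a 420s lifetime; sessions start at 0s, 30s and 60s and are released at 360s, 390s and 420s, so the first release lands exactly `plateau` seconds after the last start.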
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+90
-94
@@ -1,10 +1,8 @@
|
|||||||
# Null PyWorker
|
# Null PyWorker
|
||||||
|
|
||||||
A PyWorker that does **nothing** — it does not forward requests to any model
|
A PyWorker that does **nothing** — it does not forward requests to any model
|
||||||
server. Each HTTP POST to `/reserve` simply marks the worker as busy and holds
|
server. Reservations are modelled as framework **sessions**: a request
|
||||||
the request open until the user's queue consumer (running locally on the
|
comes in and you get a worker; release and it scales back down.
|
||||||
instance) calls `/release` on the internal control port — or a safety
|
|
||||||
timeout elapses.
|
|
||||||
|
|
||||||
## When to use it
|
## When to use it
|
||||||
|
|
||||||
@@ -15,32 +13,29 @@ Use this worker when you want to drive Vast Serverless autoscaling but you do
|
|||||||
etc.).
|
etc.).
|
||||||
- A separate worker process on the Vast instance pulls work from that queue
|
- A separate worker process on the Vast instance pulls work from that queue
|
||||||
directly. The Vast PyWorker is not involved in the request/response path.
|
directly. The Vast PyWorker is not involved in the request/response path.
|
||||||
|
Your consumer can be any language — node, golang, python, a binary —
|
||||||
|
this PyWorker is implementation-agnostic.
|
||||||
- You want one Vast worker per active queue consumer, and you want the
|
- You want one Vast worker per active queue consumer, and you want the
|
||||||
Serverless autoscaler to spin instances up and down based on demand on
|
Serverless autoscaler to spin instances up and down based on demand on
|
||||||
*your* side.
|
*your* side.
|
||||||
|
|
||||||
A request comes in and you get a worker. Release and it scales back down.
|
|
||||||
|
|
||||||
POST to `/reserve` and serverless gives you a worker, held busy for the
|
|
||||||
lifetime of the request. When your queue consumer is done, POST to
|
|
||||||
`/release` on the internal port (`127.0.0.1:18999` by default) and the
|
|
||||||
held `/reserve` returns `200`.
|
|
||||||
|
|
||||||
## How it works
|
## How it works
|
||||||
|
|
||||||
- `allow_parallel_requests=False` and `max_queue_time=0.0`, so one in-flight
|
- Reservations use the framework's **session** model. The SDK exposes
|
||||||
`/reserve` fully occupies the worker and any further request that lands
|
`endpoint.session(cost, lifetime)` which POSTs to `/session/create` (a
|
||||||
on it is rejected with `429` immediately — serverless will route to a
|
built-in framework route) and returns a `Session` object usable as
|
||||||
free worker or scale a new one up.
|
`async with`. Closing the context (or calling `await session.close()`)
|
||||||
- `lifecycle` is used instead of `model_log_file`, so there is no log to tail
|
POSTs to `/session/end` — counted as a normal success in metrics.
|
||||||
and no model server to start. The worker reports itself ready immediately
|
- `max_sessions=1` on the worker side means a second `/session/create`
|
||||||
after the (trivial) benchmark.
|
against an already-occupied worker returns `429`. Serverless routes
|
||||||
- The `/reserve` handler is a `remote_function` rather than an HTTP proxy, so
|
that request to a free worker or scales a new one up.
|
||||||
the framework never tries to forward the request anywhere — it just awaits
|
- Sessions are **excluded from queue-wait math** (the framework filters
|
||||||
an internal `asyncio.Event`.
|
`if not request.is_session`), so an occupied worker doesn't look like
|
||||||
- An internal aiohttp control server, bound to `127.0.0.1`, hosts
|
it has a request queue piling up. The autoscaler treats a session as
|
||||||
`/release` (and, when no external healthcheck URL is provided, a stub
|
occupancy, not as work-in-progress.
|
||||||
`/health`).
|
- `lifecycle` is used instead of `model_log_file`, so there is no log to
|
||||||
|
tail and no model server to start. The worker reports itself ready
|
||||||
|
immediately after a trivial benchmark.
|
||||||
|
|
||||||
## Healthchecking
|
## Healthchecking
|
||||||
|
|
||||||
@@ -49,48 +44,52 @@ fails after the first success, the worker is marked errored and the
|
|||||||
autoscaler can decommission it. Two modes:
|
autoscaler can decommission it. Two modes:
|
||||||
|
|
||||||
- **Stub (default)** — the internal control server also answers
|
- **Stub (default)** — the internal control server also answers
|
||||||
`GET /health` with `200`. This is just enough to satisfy the framework
|
`GET /health` with `200`. Just enough to satisfy the framework while
|
||||||
while you wire up real consumers.
|
you wire up real consumers.
|
||||||
- **Point at your queue consumer (recommended)** — set
|
- **Point at your queue consumer (recommended)** — set
|
||||||
`BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and the
|
`BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and
|
||||||
pyworker will healthcheck *your* consumer instead. If your consumer
|
the pyworker will healthcheck *your* consumer instead. If the consumer
|
||||||
process crashes, the autoscaler will see the worker as broken.
|
process crashes, the autoscaler will see the worker as broken.
|
||||||
|
|
||||||
Run your queue consumer on the instance alongside the PyWorker, expose a
|
|
||||||
plain `/health` endpoint on it, then set `BACKEND_HEALTH_URL` accordingly in
|
|
||||||
your template.
|
|
||||||
|
|
||||||
## API
|
## API
|
||||||
|
|
||||||
### `POST /reserve` (external port, signed by the autoscaler)
|
### Reservation: `POST /session/create` (external, signed)
|
||||||
|
|
||||||
Holds the worker busy until the reservation ends.
|
Not implemented here — the framework provides this route automatically on
|
||||||
|
every PyWorker. Use the SDK:
|
||||||
|
|
||||||
Request body (all fields optional):
|
```python
|
||||||
|
from vastai import Serverless
|
||||||
|
|
||||||
```json
|
async with Serverless() as client:
|
||||||
{ "duration": 600 }
|
endpoint = await client.get_endpoint(name="my-null-endpoint")
|
||||||
|
async with endpoint.session(cost=100, lifetime=600) as s:
|
||||||
|
# Worker is now reserved. Your queue dispatcher does whatever it
|
||||||
|
# needs to do (typically: enqueue a job that mentions s.session_id).
|
||||||
|
...
|
||||||
|
# `async with` exit posts to /session/end → 200 success in metrics
|
||||||
```
|
```
|
||||||
|
|
||||||
- `duration` (seconds, optional): safety cap on how long to hold the
|
Or raw HTTP (the SDK takes care of autoscaler signing for you, but the
|
||||||
reservation if no `/release` arrives. Capped by `MAX_RESERVATION_SECONDS`
|
shape of the request is documented for non-Python clients):
|
||||||
(env var, default 3600). If omitted, defaults to that cap.
|
|
||||||
|
|
||||||
Behavior:
|
```
|
||||||
|
POST /session/create
|
||||||
|
{
|
||||||
|
"auth_data": { /* signed by autoscaler */ },
|
||||||
|
"payload": {
|
||||||
|
"lifetime": 600,
|
||||||
|
"on_close_route": "https://your.callback/notify",
|
||||||
|
"on_close_payload": {"job_id": "..."}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
- Returns `200` with `{"released": "explicit", ...}` when the local consumer
|
### Release from a local consumer: `POST /release` (internal, localhost-only)
|
||||||
POSTs `/release` on the internal port. **This is the intended happy path
|
|
||||||
— the request is counted as a success in metrics.**
|
|
||||||
- Returns `200` with `{"released": "duration_elapsed", "duration": <n>}` if
|
|
||||||
the duration cap fires (safety net for a stuck consumer).
|
|
||||||
- Returns `499` if the external client disconnects (counted as cancelled in
|
|
||||||
metrics — avoid this; use `/release` instead).
|
|
||||||
- Returns `429` immediately if the worker is already holding a reservation
|
|
||||||
(so serverless routes the request to a free worker instead of queueing).
|
|
||||||
|
|
||||||
### `POST /release` (internal port, localhost-only)
|
Closes the active session, regardless of who created it. No body, no
|
||||||
|
auth. Use this when the queue consumer doesn't have (and shouldn't need)
|
||||||
Marks the active reservation as done. No body required. Idempotent:
|
the session's `session_auth`:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -X POST http://127.0.0.1:18999/release
|
curl -X POST http://127.0.0.1:18999/release
|
||||||
@@ -98,78 +97,75 @@ curl -X POST http://127.0.0.1:18999/release
|
|||||||
|
|
||||||
Responses:
|
Responses:
|
||||||
|
|
||||||
- `200 {"released": true}` — active reservation was released; the held
|
- `200 {"released": true, "session_ids": ["..."]}` — closed; the held
|
||||||
`/reserve` will return `{"released": "explicit"}`.
|
client-side `/session/create` completes and counts as a success.
|
||||||
- `200 {"released": false, "reason": "no active reservation"}` — nothing was
|
- `200 {"released": false, "reason": "no active session"}` — nothing
|
||||||
in flight, no-op.
|
active, no-op.
|
||||||
|
|
||||||
Only processes on the Vast instance can reach this port. There is no
|
For setups where the dispatcher can hand the consumer `session_auth`
|
||||||
authentication on it.
|
(e.g. as part of the queue payload), the consumer can instead POST
|
||||||
|
`/session/end` on the framework's HTTP-only port
|
||||||
|
(`$WORKER_HTTP_PORT`, default `WORKER_PORT+1`) — the standard, fully
|
||||||
|
authenticated release path.
|
||||||
|
|
||||||
## Environment variables
|
## Environment variables
|
||||||
|
|
||||||
- `MAX_RESERVATION_SECONDS` — upper bound on how long a single `/reserve`
|
|
||||||
call can hold a worker if `/release` is never called. Defaults to `3600`.
|
|
||||||
- `BACKEND_HEALTH_URL` — absolute URL the framework should healthcheck
|
- `BACKEND_HEALTH_URL` — absolute URL the framework should healthcheck
|
||||||
(e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health` route
|
(e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health`
|
||||||
is not registered on the internal server. When unset, the built-in stub
|
route is not registered on the internal server.
|
||||||
is used.
|
|
||||||
- `NULL_CONTROL_PORT` — port for the internal control server (hosts
|
- `NULL_CONTROL_PORT` — port for the internal control server (hosts
|
||||||
`/release` and optionally `/health`). Defaults to `18999`.
|
`/release` and optionally `/health`). Defaults to `18999`.
|
||||||
|
|
||||||
## Deploying on Vast Serverless
|
## Deploying on Vast Serverless
|
||||||
|
|
||||||
1. Create a Serverless endpoint and point `PYWORKER_REPO` at this repository
|
1. Create a Serverless endpoint and point `PYWORKER_REPO` at this
|
||||||
(or your fork).
|
repository (or your fork).
|
||||||
2. Set `BACKEND=null` in the template so `start_server.sh` runs
|
2. Set `BACKEND=null` in the template so `start_server.sh` runs
|
||||||
`workers.null.worker`.
|
`workers.null.worker`.
|
||||||
3. There is no model server to configure; you can omit model-related env vars
|
3. There is no model server to configure; you can omit model-related env
|
||||||
entirely.
|
vars entirely.
|
||||||
4. Run your own queue-consumer process on the instance alongside the
|
4. Run your own queue-consumer process on the instance alongside the
|
||||||
PyWorker. When the consumer finishes its work it should:
|
PyWorker. When it finishes its work:
|
||||||
```bash
|
```bash
|
||||||
curl -X POST http://127.0.0.1:18999/release
|
curl -X POST http://127.0.0.1:18999/release
|
||||||
```
|
```
|
||||||
so the held `/reserve` returns success and the autoscaler can scale the
|
|
||||||
worker down cleanly.
|
|
||||||
|
|
||||||
## Client example
|
## Client example
|
||||||
|
|
||||||
Single reservation:
|
Single reservation (holds for 180s):
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m workers.null.client --endpoint <ENDPOINT_NAME> --duration 600
|
python -m workers.null.client --endpoint <ENDPOINT_NAME>
|
||||||
```
|
```
|
||||||
|
|
||||||
To exercise the full flow, shell into the worker and run
|
|
||||||
`curl -X POST http://127.0.0.1:18999/release` — the client returns with
|
|
||||||
`{"released": "explicit", ...}`.
|
|
||||||
|
|
||||||
Staggered demo:
|
Staggered demo:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
|
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
|
||||||
```
|
```
|
||||||
|
|
||||||
Starts three reservations 30s apart (all held concurrently), holds the
|
Starts three sessions 30s apart (all held concurrently), holds the
|
||||||
3-worker plateau for 5 minutes so the autoscaler has time to actually
|
3-worker plateau for 5 minutes so the autoscaler has time to actually
|
||||||
provision the third worker before any scale-down starts, then scales
|
provision the third worker before any scale-down starts, then closes
|
||||||
down one worker at a time, also 30s apart, and exits.
|
the sessions one at a time, also 30s apart, and exits. Every session
|
||||||
|
ends cleanly via the SDK's `session.close()` — `200` successes in
|
||||||
|
metrics, no cancellations.
|
||||||
|
|
||||||
Each reservation ends via its duration cap (a 200 success in metrics).
|
Tune the timing with `--interval` and `--plateau`. To exercise the
|
||||||
Tune the timing with `--interval` and `--plateau`.
|
local-release path, shell into a worker and run
|
||||||
|
`curl -X POST http://127.0.0.1:18999/release`.
|
||||||
|
|
||||||
## Notes and caveats
|
## Notes and caveats
|
||||||
|
|
||||||
- The HTTP connection from the external caller must stay open for the full
|
- The reservation's lifetime caps how long the session can live without
|
||||||
reservation. Make sure your client and any intermediate proxies allow
|
client activity. Set it comfortably longer than the work you expect to
|
||||||
long-lived requests (disable idle timeouts, retries, and connection
|
do, or have the client periodically POST `/ping` with `session_id` to
|
||||||
reuse if necessary).
|
extend.
|
||||||
- If your client retries on timeout, you may end up provisioning duplicate
|
- The `on_close_route` payload (passed at `/session/create`) is POSTed by
|
||||||
workers. Configure `duration` generously and rely on `/release` from the
|
the framework when the session ends. Useful for notifying your queue
|
||||||
consumer to end reservations promptly.
|
consumer that the reservation is closing.
|
||||||
- Avoid disconnecting the external `/reserve` request as a way to release —
|
- `/release` on the internal port is convenient but bypasses
|
||||||
that produces a `499` and is counted as a cancellation in Vast metrics.
|
`session_auth`. If you need the standard authenticated release flow,
|
||||||
Always release via `POST /release` on the internal port.
|
pass `session_auth` to your consumer (e.g. through the queue payload)
|
||||||
- There is no streaming / heartbeat in the response; the request returns
|
and have it POST to `/session/end` on the framework's HTTP port
|
||||||
exactly once, when the reservation ends.
|
instead.
|
||||||
|
|||||||
+42
-33
@@ -15,35 +15,42 @@ logging.basicConfig(
|
|||||||
log = logging.getLogger(__file__)
|
log = logging.getLogger(__file__)
|
||||||
|
|
||||||
ENDPOINT_NAME = "null-prod"
|
ENDPOINT_NAME = "null-prod"
|
||||||
|
SESSION_COST = 100
|
||||||
|
|
||||||
|
|
||||||
async def reserve(
|
async def reserve(
|
||||||
client: Serverless,
|
client: Serverless,
|
||||||
*,
|
*,
|
||||||
endpoint_name: str,
|
endpoint_name: str,
|
||||||
duration: float,
|
hold_for: float,
|
||||||
label: str = "reservation",
|
label: str = "session",
|
||||||
) -> dict:
|
) -> None:
|
||||||
"""Hold a Vast worker open for `duration` seconds (or until we disconnect).
|
"""Open a session, hold the worker for `hold_for` seconds, close cleanly.
|
||||||
|
|
||||||
The worker counts itself busy for the lifetime of this call. Returning
|
Uses the framework's session model — each session counts as one worker
|
||||||
here means the reservation has ended — either /release was called on
|
occupied, but unlike a held HTTP request it isn't poisoning the
|
||||||
the worker's internal control port, or the duration cap fired, or the
|
worker's throughput math. max_sessions=1 on the worker side means a
|
||||||
HTTP request was cancelled.
|
second /session/create against the same worker gets 429, so serverless
|
||||||
|
routes the second reservation to a free worker or scales a new one up.
|
||||||
"""
|
"""
|
||||||
endpoint = await client.get_endpoint(name=endpoint_name)
|
endpoint = await client.get_endpoint(name=endpoint_name)
|
||||||
payload = {"duration": duration}
|
# Session lifetime must outlast the hold. The framework expires sessions
|
||||||
|
# whose `expiration` (set to now + lifetime at creation) has passed; we
|
||||||
|
# don't make any keepalive requests so no extension happens.
|
||||||
|
lifetime = hold_for + 60
|
||||||
start = time.monotonic()
|
start = time.monotonic()
|
||||||
log.info("[%s] POST /reserve duration=%ss", label, duration)
|
log.info("[%s] creating session (lifetime=%.0fs, hold=%.0fs)", label, lifetime, hold_for)
|
||||||
|
async with endpoint.session(cost=SESSION_COST, lifetime=lifetime) as s:
|
||||||
|
log.info("[%s] session %s open", label, s.session_id)
|
||||||
try:
|
try:
|
||||||
resp = await endpoint.request("/reserve", payload, cost=150)
|
await asyncio.sleep(hold_for)
|
||||||
elapsed = time.monotonic() - start
|
log.info("[%s] hold complete, closing session", label)
|
||||||
log.info("[%s] returned after %.1fs: %s", label, elapsed, resp.get("response"))
|
|
||||||
return resp["response"]
|
|
||||||
except asyncio.CancelledError:
|
except asyncio.CancelledError:
|
||||||
elapsed = time.monotonic() - start
|
elapsed = time.monotonic() - start
|
||||||
log.info("[%s] cancelled after %.1fs (HTTP connection dropped)", label, elapsed)
|
log.info("[%s] cancelled after %.1fs, closing session", label, elapsed)
|
||||||
raise
|
raise
|
||||||
|
elapsed = time.monotonic() - start
|
||||||
|
log.info("[%s] session closed cleanly after %.1fs", label, elapsed)
|
||||||
|
|
||||||
|
|
||||||
async def run_demo(
|
async def run_demo(
|
||||||
@@ -53,38 +60,41 @@ async def run_demo(
|
|||||||
interval: float,
|
interval: float,
|
||||||
plateau: float,
|
plateau: float,
|
||||||
) -> None:
|
) -> None:
|
||||||
"""Trapezoidal load: ramp up three reservations, plateau, then scale down.
|
"""Trapezoidal load: ramp up three sessions, plateau, then scale down.
|
||||||
|
|
||||||
Start three reservations spaced `interval` seconds apart. Pick the
|
Start three sessions spaced `interval` seconds apart. Each holds for
|
||||||
duration so that the first release fires `plateau` seconds *after the
|
`(n-1)*interval + plateau` seconds, so the first release fires
|
||||||
last reservation started*, giving the autoscaler time to actually have
|
`plateau` seconds after the last session started — giving the
|
||||||
all three workers running before any of them begin to scale down.
|
autoscaler time to actually have all three workers running before any
|
||||||
Releases then fire `interval` seconds apart, matching the ramp-up.
|
scale-down begins. Releases then fire `interval` seconds apart,
|
||||||
|
matching the ramp-up.
|
||||||
|
|
||||||
Each reservation ends via its duration cap (a 200 success).
|
Each session ends via the SDK's `session.close()` on `async with` exit,
|
||||||
|
which posts to /session/end with proper auth — counted as a normal
|
||||||
|
success in metrics.
|
||||||
"""
|
"""
|
||||||
n = 3
|
n = 3
|
||||||
hold = (n - 1) * interval + plateau
|
hold = (n - 1) * interval + plateau
|
||||||
tasks: list[asyncio.Task] = []
|
tasks: list[asyncio.Task] = []
|
||||||
for i in range(1, n + 1):
|
for i in range(1, n + 1):
|
||||||
label = f"res-{i}"
|
label = f"res-{i}"
|
||||||
log.info("[%s] starting (auto-release after %.0fs)", label, hold)
|
log.info("[%s] starting (hold=%.0fs)", label, hold)
|
||||||
task = asyncio.create_task(
|
task = asyncio.create_task(
|
||||||
reserve(
|
reserve(
|
||||||
client,
|
client,
|
||||||
endpoint_name=endpoint_name,
|
endpoint_name=endpoint_name,
|
||||||
duration=hold,
|
hold_for=hold,
|
||||||
label=label,
|
label=label,
|
||||||
),
|
),
|
||||||
name=label,
|
name=label,
|
||||||
)
|
)
|
||||||
tasks.append(task)
|
tasks.append(task)
|
||||||
if i < n:
|
if i < n:
|
||||||
log.info("Waiting %.0fs before next reservation...", interval)
|
log.info("Waiting %.0fs before next session...", interval)
|
||||||
await asyncio.sleep(interval)
|
await asyncio.sleep(interval)
|
||||||
|
|
||||||
log.info(
|
log.info(
|
||||||
"All %d reservations in flight; holding plateau for %.0fs, "
|
"All %d sessions in flight; holding plateau for %.0fs, "
|
||||||
"then scaling down %.0fs apart",
|
"then scaling down %.0fs apart",
|
||||||
n,
|
n,
|
||||||
plateau,
|
plateau,
|
||||||
@@ -106,19 +116,19 @@ def build_arg_parser() -> argparse.ArgumentParser:
|
|||||||
"--duration",
|
"--duration",
|
||||||
type=float,
|
type=float,
|
||||||
default=180.0,
|
default=180.0,
|
||||||
help="Seconds to hold each worker busy (default: 180)",
|
help="Single-reserve mode: seconds to hold the worker (default: 180)",
|
||||||
)
|
)
|
||||||
|
|
||||||
modes = p.add_mutually_exclusive_group(required=False)
|
modes = p.add_mutually_exclusive_group(required=False)
|
||||||
modes.add_argument(
|
modes.add_argument(
|
||||||
"--reserve",
|
"--reserve",
|
||||||
action="store_true",
|
action="store_true",
|
||||||
help="Make a single /reserve call (default if no mode given)",
|
help="Make a single session (default if no mode given)",
|
||||||
)
|
)
|
||||||
modes.add_argument(
|
modes.add_argument(
|
||||||
"--demo",
|
"--demo",
|
||||||
action="store_true",
|
action="store_true",
|
||||||
help="Run the staggered 3-reservation demo, cancelling one mid-way",
|
help="Run the staggered 3-reservation trapezoid demo",
|
||||||
)
|
)
|
||||||
|
|
||||||
p.add_argument(
|
p.add_argument(
|
||||||
@@ -157,15 +167,14 @@ async def main_async():
|
|||||||
plateau=args.plateau,
|
plateau=args.plateau,
|
||||||
)
|
)
|
||||||
else:
|
else:
|
||||||
response = await reserve(
|
await reserve(
|
||||||
client,
|
client,
|
||||||
endpoint_name=args.endpoint,
|
endpoint_name=args.endpoint,
|
||||||
duration=args.duration,
|
hold_for=args.duration,
|
||||||
label="reservation",
|
label="reservation",
|
||||||
)
|
)
|
||||||
print(f"Reservation result: {response}")
|
|
||||||
except KeyboardInterrupt:
|
except KeyboardInterrupt:
|
||||||
log.info("Interrupted; dropping any in-flight reservations")
|
log.info("Interrupted; dropping any in-flight sessions")
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
log.error("Error: %s", e, exc_info=True)
|
log.error("Error: %s", e, exc_info=True)
|
||||||
sys.exit(1)
|
sys.exit(1)
|
||||||
|
|||||||
+70
-121
@@ -2,7 +2,6 @@ import asyncio
|
|||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
from contextlib import asynccontextmanager
|
from contextlib import asynccontextmanager
|
||||||
from typing import Optional
|
|
||||||
from urllib.parse import urlsplit
|
from urllib.parse import urlsplit
|
||||||
|
|
||||||
from aiohttp import web
|
from aiohttp import web
|
||||||
@@ -17,21 +16,21 @@ from vastai import (
|
|||||||
|
|
||||||
log = logging.getLogger(__file__)
|
log = logging.getLogger(__file__)
|
||||||
|
|
||||||
# Safety cap: if the user's queue consumer never calls /release, the
|
# Performance value pinned in the benchmark cache; sent to autoscaler as
|
||||||
# reservation is auto-released after this many seconds so a forgotten /release
|
# max_perf. Standardized at 100 — the conventional default the rest of the
|
||||||
# can't pin a worker indefinitely. Override with MAX_RESERVATION_SECONDS.
|
# serverless system expects.
|
||||||
MAX_RESERVATION_SECONDS = float(os.environ.get("MAX_RESERVATION_SECONDS", 3600))
|
TARGET_PERF = 100.0
|
||||||
|
|
||||||
# Marker the benchmark path sets so the same remote function can return
|
# Marker the benchmark path sets so the fallback /ping path returns
|
||||||
# immediately during capacity estimation instead of sleeping.
|
# immediately during the framework's startup benchmark.
|
||||||
BENCHMARK_SENTINEL = "__null_worker_benchmark__"
|
BENCHMARK_SENTINEL = "__null_worker_benchmark__"
|
||||||
|
|
||||||
# Internal control server. Hosts:
|
# Internal control server. Hosts:
|
||||||
# * POST /release — always available, marks the active reservation as
|
# * POST /release — releases the active reservation by closing the
|
||||||
# done so the held /reserve returns 200 (success in metrics, not a
|
# singleton session on this worker. Called by the user's queue
|
||||||
# cancellation).
|
# consumer when its work is done.
|
||||||
# * GET /health — only when no external BACKEND_HEALTH_URL is set; the
|
# * GET /health — only when BACKEND_HEALTH_URL is unset; gives the
|
||||||
# framework's healthcheck loop polls it so the worker has a live signal.
|
# framework's healthcheck loop something live to talk to.
|
||||||
# Bound to 127.0.0.1 so only processes on the instance can reach it.
|
# Bound to 127.0.0.1 so only processes on the instance can reach it.
|
||||||
INTERNAL_HOST = "127.0.0.1"
|
INTERNAL_HOST = "127.0.0.1"
|
||||||
INTERNAL_PORT = int(os.environ.get("NULL_CONTROL_PORT", 18999))
|
INTERNAL_PORT = int(os.environ.get("NULL_CONTROL_PORT", 18999))
|
||||||
@@ -56,62 +55,51 @@ else:
|
|||||||
USE_STUB_HEALTH = True
|
USE_STUB_HEALTH = True
|
||||||
|
|
||||||
|
|
||||||
# Workload reported per /reserve and target perf for the heartbeat below.
|
# Stashed after Worker(...) is constructed so /release can reach the
|
||||||
TARGET_PERF = 150.0
|
# framework's session machinery. Dict so the lifecycle closure picks up
|
||||||
|
# the assignment that happens before .run().
|
||||||
# Singleton active reservation. `allow_parallel_requests=False` on the
|
|
||||||
# /reserve handler guarantees the framework only runs one at a time per
|
|
||||||
# worker, so a single slot is enough.
|
|
||||||
_active_reservation: Optional[asyncio.Event] = None
|
|
||||||
|
|
||||||
# Backed in after Worker(...) is constructed so the heartbeat coroutine in
|
|
||||||
# null_lifecycle() can mutate backend.metrics. Stored in a dict so the
|
|
||||||
# lifecycle closure picks up the assignment that happens before .run().
|
|
||||||
_backend_ref: dict = {"backend": None}
|
_backend_ref: dict = {"backend": None}
|
||||||
|
|
||||||
|
|
||||||
async def _perf_heartbeat() -> None:
|
|
||||||
"""Keep cur_perf pegged to TARGET_PERF while a reservation is held.
|
|
||||||
|
|
||||||
Without this, workload_served stays at 0 while a /reserve is being held
|
|
||||||
open. The autoscaler observes cur_perf=0 against max_perf=150, decides
|
|
||||||
the worker can't deliver its claimed throughput, and downgrades it —
|
|
||||||
which makes it cautious about scaling up and prone to queueing
|
|
||||||
subsequent requests behind the held one instead of routing elsewhere.
|
|
||||||
|
|
||||||
Every second, if anything is in flight, set workload_served=TARGET_PERF
|
|
||||||
and mark update_pending so the metrics loop sends immediately. The
|
|
||||||
metrics tick resets workload_served back to 0 after sending; we
|
|
||||||
re-pin it next iteration.
|
|
||||||
"""
|
|
||||||
while True:
|
|
||||||
try:
|
|
||||||
await asyncio.sleep(1.0)
|
|
||||||
backend = _backend_ref.get("backend")
|
|
||||||
if backend is None:
|
|
||||||
continue
|
|
||||||
mm = backend.metrics.model_metrics
|
|
||||||
if mm.requests_working:
|
|
||||||
mm.workload_served = TARGET_PERF
|
|
||||||
backend.metrics.update_pending = True
|
|
||||||
except asyncio.CancelledError:
|
|
||||||
raise
|
|
||||||
except Exception as e:
|
|
||||||
log.debug(f"perf heartbeat error: {e}")
|
|
||||||
|
|
||||||
|
|
||||||
def _build_internal_app() -> web.Application:
|
def _build_internal_app() -> web.Application:
|
||||||
app = web.Application()
|
app = web.Application()
|
||||||
|
|
||||||
async def release_handler(_request: web.Request) -> web.Response:
|
async def release_handler(_request: web.Request) -> web.Response:
|
||||||
event = _active_reservation
|
"""End the active reservation (the singleton session on this worker).
|
||||||
if event is None:
|
|
||||||
|
max_sessions=1 means at most one session is active here. We call
|
||||||
|
the framework's internal __close_session via name-mangling to
|
||||||
|
bypass the session_auth check that /session/end normally requires.
|
||||||
|
That's intentional: this endpoint is localhost-only so trust is
|
||||||
|
assumed, and the user's consumer can release without having to
|
||||||
|
plumb session_auth through their queue.
|
||||||
|
|
||||||
|
__close_session reports the session metrics as a success, fires
|
||||||
|
on_close_route if configured, and pops the session from the dict.
|
||||||
|
"""
|
||||||
|
backend = _backend_ref.get("backend")
|
||||||
|
if backend is None:
|
||||||
return web.json_response(
|
return web.json_response(
|
||||||
{"released": False, "reason": "no active reservation"},
|
{"released": False, "reason": "backend not ready"},
|
||||||
|
status=503,
|
||||||
|
)
|
||||||
|
sids = list(backend.sessions.keys())
|
||||||
|
if not sids:
|
||||||
|
return web.json_response(
|
||||||
|
{"released": False, "reason": "no active session"},
|
||||||
|
status=200,
|
||||||
|
)
|
||||||
|
closed = []
|
||||||
|
for sid in sids:
|
||||||
|
try:
|
||||||
|
if await backend._Backend__close_session(sid):
|
||||||
|
closed.append(sid)
|
||||||
|
except Exception as e:
|
||||||
|
log.warning(f"Error closing session {sid}: {e}")
|
||||||
|
return web.json_response(
|
||||||
|
{"released": bool(closed), "session_ids": closed},
|
||||||
status=200,
|
status=200,
|
||||||
)
|
)
|
||||||
event.set()
|
|
||||||
return web.json_response({"released": True}, status=200)
|
|
||||||
|
|
||||||
app.router.add_post("/release", release_handler)
|
app.router.add_post("/release", release_handler)
|
||||||
|
|
||||||
@@ -126,18 +114,14 @@ def _build_internal_app() -> web.Application:
 
 @asynccontextmanager
 async def null_lifecycle():
-    # Pin max_throughput to exactly 100 by pre-populating the framework's
-    # benchmark cache file. The framework's __run_benchmark short-circuits
-    # to `float(file_contents)` when this file exists, bypassing the
-    # time-based calculation that would otherwise drift to ~99.x due to
-    # asyncio scheduling overhead. The filename matches the framework
-    # constant BENCHMARK_INDICATOR_FILE in
-    # vastai.serverless.server.lib.backend.
+    # Pin max_throughput to exactly TARGET_PERF by pre-populating the
+    # framework's benchmark cache file. __run_benchmark short-circuits to
+    # float(file_contents) when this file exists.
     try:
         with open(".has_benchmark", "w") as fh:
-            fh.write("150")
+            fh.write(str(int(TARGET_PERF)))
     except OSError as e:
-        log.warning(f"Could not pin benchmark cache to 150: {e}")
+        log.warning(f"Could not pin benchmark cache: {e}")
 
     app = _build_internal_app()
     runner = web.AppRunner(app)
@@ -145,8 +129,6 @@ async def null_lifecycle():
     site = web.TCPSite(runner, INTERNAL_HOST, INTERNAL_PORT)
     await site.start()
 
-    heartbeat = asyncio.create_task(_perf_heartbeat(), name="null-perf-heartbeat")
-
     lines = [
         f"Null pyworker internal control server: http://{INTERNAL_HOST}:{INTERNAL_PORT}",
         f" POST /release - end the active reservation (call from your queue consumer)",
@@ -157,60 +139,32 @@ async def null_lifecycle():
         )
     else:
         lines.append(f"Framework healthcheck pointed at: {BACKEND_HEALTH_URL}")
+    lines.append(
+        "Reservations use the framework session model. Clients POST to "
+        "/session/create via the SDK to acquire a worker; max_sessions=1 "
+        "so each worker holds at most one reservation."
+    )
     log.info("\n".join(lines))
 
     try:
         yield
     finally:
-        heartbeat.cancel()
-        try:
-            await heartbeat
-        except (asyncio.CancelledError, Exception):
-            pass
         await runner.cleanup()
 
 
-async def reserve_worker(**params: object) -> dict:
-    global _active_reservation
-
+async def ping(**params: object) -> dict:
+    """Trivial handler. Exists to satisfy the framework's requirement that
+    at least one HandlerConfig has a BenchmarkConfig, and to give clients
+    a route they can hit with session_id to extend their session TTL.
+    """
     if params.get(BENCHMARK_SENTINEL):
-        # Fallback path only — the lifecycle pre-populates .has_benchmark
-        # with "150" so __run_benchmark normally short-circuits and never
-        # invokes us. If the cache write failed, sleep ~1s so the
-        # time-based calculation lands near 150 (workload=150 / time~=1s).
+        # Fallback only — the lifecycle pre-pins .has_benchmark so
+        # __run_benchmark normally short-circuits and this never runs. If
+        # the cache write failed, sleep ~1s so the time-based throughput
+        # math lands near TARGET_PERF.
         await asyncio.sleep(1.0)
         return {"ok": True, "benchmark": True}
-
-    requested = params.get("duration")
-    if requested is None:
-        duration = MAX_RESERVATION_SECONDS
-    else:
-        try:
-            duration = max(0.0, min(float(requested), MAX_RESERVATION_SECONDS))
-        except (TypeError, ValueError):
-            duration = MAX_RESERVATION_SECONDS
-
-    event = asyncio.Event()
-    _active_reservation = event
-    log.info(
-        f"Reservation acquired; awaiting POST /release on "
-        f"http://{INTERNAL_HOST}:{INTERNAL_PORT}/release "
-        f"(auto-release after {duration:.1f}s)"
-    )
-    try:
-        try:
-            await asyncio.wait_for(event.wait(), timeout=duration)
-            log.info("Reservation released via /release")
-            return {"released": "explicit", "duration_cap": duration}
-        except asyncio.TimeoutError:
-            log.warning(
-                f"Reservation hit duration cap of {duration:.1f}s without "
-                f"explicit /release; releasing automatically"
-            )
-            return {"released": "duration_elapsed", "duration": duration}
-    finally:
-        if _active_reservation is event:
-            _active_reservation = None
+    return {"ok": True}
 
 
 worker_config = WorkerConfig(
@@ -218,17 +172,12 @@ worker_config = WorkerConfig(
     model_server_port=HEALTH_PORT,
     model_healthcheck_url=HEALTH_PATH,
     lifecycle=null_lifecycle(),
+    max_sessions=1,
     handlers=[
         HandlerConfig(
-            route="/reserve",
-            allow_parallel_requests=False,
-            # Reject (429) any /reserve that arrives while the worker is
-            # already busy. A held reservation lasts up to MAX_RESERVATION_
-            # SECONDS, so queueing behind it would mean hours of wait —
-            # better to bounce the request immediately so serverless routes
-            # it to a free worker (or spins up a new one).
-            max_queue_time=0.0,
-            remote_function=reserve_worker,
+            route="/ping",
+            allow_parallel_requests=True,
+            remote_function=ping,
             workload_calculator=lambda _payload: TARGET_PERF,
             benchmark_config=BenchmarkConfig(
                 generator=lambda: {BENCHMARK_SENTINEL: True},