Rewrite null pyworker on the framework session model

Drop the held /reserve request in favour of the framework's session
primitive (max_sessions=1 + /session/create). Sessions are excluded from
the autoscaler's queue-wait math and don't suffer the cur_perf=0
degradation that a long-held request did, so this naturally produces the
"one request comes in and you get a worker; release and it scales back
down" model we were hand-rolling.

Server side:
  - max_sessions=1; framework auto-registers /session/* routes
  - Drop custom /reserve handler, _active_reservation event,
    max_queue_time=0.0, MAX_RESERVATION_SECONDS, _perf_heartbeat
  - Trivial /ping handler exists only to satisfy the framework's
    "at least one handler with BenchmarkConfig" requirement (and to give
    clients an extension/keepalive route)
  - /release on the internal control port is kept as a convenience for
    queue consumers that don't carry session_auth — calls the framework's
    __close_session via name-mangling, which bypasses the session_auth
    check but is fine for a localhost-only endpoint
  - Workload/perf back to 100 (conventional)

Client side:
  - Uses endpoint.session(cost, lifetime) instead of POST /reserve
  - Wraps the SDK Session in async with; exiting the block posts
    /session/end with proper auth, which counts as a 200 success in metrics
  - Demo and single modes both ride the same reserve() helper

Sessions landed in vastai-sdk 0.4.2 (commit ec9ef59, 2026-01-20).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rob Ballantyne
2026-05-12 10:51:24 +01:00
parent 6c2f194b28
commit 6a562a1376
3 changed files with 206 additions and 252 deletions
+90 -94
@@ -1,10 +1,8 @@
# Null PyWorker # Null PyWorker
A PyWorker that does **nothing** — it does not forward requests to any model A PyWorker that does **nothing** — it does not forward requests to any model
server. Each HTTP POST to `/reserve` simply marks the worker as busy and holds server. Reservations are modelled as framework **sessions**: a request
the request open until the user's queue consumer (running locally on the comes in and you get a worker; release and it scales back down.
instance) calls `/release` on the internal control port — or a safety
timeout elapses.
## When to use it ## When to use it
@@ -15,32 +13,29 @@ Use this worker when you want to drive Vast Serverless autoscaling but you do
etc.). etc.).
- A separate worker process on the Vast instance pulls work from that queue - A separate worker process on the Vast instance pulls work from that queue
directly. The Vast PyWorker is not involved in the request/response path. directly. The Vast PyWorker is not involved in the request/response path.
Your consumer can be any language — node, golang, python, a binary —
this PyWorker is implementation-agnostic.
- You want one Vast worker per active queue consumer, and you want the - You want one Vast worker per active queue consumer, and you want the
Serverless autoscaler to spin instances up and down based on demand on Serverless autoscaler to spin instances up and down based on demand on
*your* side. *your* side.
A request comes in and you get a worker. Release and it scales back down.
POST to `/reserve` and serverless gives you a worker, held busy for the
lifetime of the request. When your queue consumer is done, POST to
`/release` on the internal port (`127.0.0.1:18999` by default) and the
held `/reserve` returns `200`.
## How it works ## How it works
- `allow_parallel_requests=False` and `max_queue_time=0.0`, so one in-flight - Reservations use the framework's **session** model. The SDK exposes
`/reserve` fully occupies the worker and any further request that lands `endpoint.session(cost, lifetime)` which POSTs to `/session/create` (a
on it is rejected with `429` immediately — serverless will route to a built-in framework route) and returns a `Session` object usable as
free worker or scale a new one up. `async with`. Closing the context (or calling `await session.close()`)
- `lifecycle` is used instead of `model_log_file`, so there is no log to tail POSTs to `/session/end` — counted as a normal success in metrics.
and no model server to start. The worker reports itself ready immediately - `max_sessions=1` on the worker side means a second `/session/create`
after the (trivial) benchmark. against an already-occupied worker returns `429`. Serverless routes
- The `/reserve` handler is a `remote_function` rather than an HTTP proxy, so that request to a free worker or scales a new one up.
the framework never tries to forward the request anywhere — it just awaits - Sessions are **excluded from queue-wait math** (the framework filters
an internal `asyncio.Event`. `if not request.is_session`), so an occupied worker doesn't look like
- An internal aiohttp control server, bound to `127.0.0.1`, hosts it has a request queue piling up. The autoscaler treats a session as
`/release` (and, when no external healthcheck URL is provided, a stub occupancy, not as work-in-progress.
`/health`). - `lifecycle` is used instead of `model_log_file`, so there is no log to
tail and no model server to start. The worker reports itself ready
immediately after a trivial benchmark.
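
The `max_sessions=1` behaviour described above can be pictured with a small stand-in registry. This is a sketch under assumptions, not the framework's implementation: `SessionRegistry`, `create`, `end`, and the `404` for an unknown session are hypothetical, and the real framework also handles signing, expiry, and metrics.

```python
import uuid


class SessionRegistry:
    """Hypothetical stand-in for the framework's per-worker session table."""

    def __init__(self, max_sessions: int = 1) -> None:
        self.max_sessions = max_sessions
        self.sessions: dict[str, float] = {}

    def create(self, lifetime: float) -> tuple[int, dict]:
        # At capacity: reject so the caller can be routed to another worker.
        if len(self.sessions) >= self.max_sessions:
            return 429, {"error": "worker occupied"}
        session_id = uuid.uuid4().hex
        self.sessions[session_id] = lifetime
        return 200, {"session_id": session_id}

    def end(self, session_id: str) -> tuple[int, dict]:
        # Ending an unknown session is reported to the caller, not raised.
        if self.sessions.pop(session_id, None) is None:
            return 404, {"error": "unknown session"}
        return 200, {"ended": session_id}
```

With a cap of one, the second `create` is rejected until the first session ends, which is the signal serverless uses to route elsewhere or scale up.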
## Healthchecking ## Healthchecking
@@ -49,48 +44,52 @@ fails after the first success, the worker is marked errored and the
autoscaler can decommission it. Two modes: autoscaler can decommission it. Two modes:
- **Stub (default)** — the internal control server also answers - **Stub (default)** — the internal control server also answers
`GET /health` with `200`. This is just enough to satisfy the framework `GET /health` with `200`. Just enough to satisfy the framework while
while you wire up real consumers. you wire up real consumers.
- **Point at your queue consumer (recommended)** — set - **Point at your queue consumer (recommended)** — set
`BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and the `BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and
pyworker will healthcheck *your* consumer instead. If your consumer the pyworker will healthcheck *your* consumer instead. If the consumer
process crashes, the autoscaler will see the worker as broken. process crashes, the autoscaler will see the worker as broken.
Run your queue consumer on the instance alongside the PyWorker, expose a
plain `/health` endpoint on it, then set `BACKEND_HEALTH_URL` accordingly in
your template.
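
The worker imports `urlsplit` and hands the framework a port and a path, but the parsing itself is not part of this diff. A minimal sketch of how an absolute `BACKEND_HEALTH_URL` could be split, assuming a hypothetical helper named `split_health_url`:

```python
from urllib.parse import urlsplit


def split_health_url(url: str) -> tuple[int, str]:
    """Return (port, path) for an absolute healthcheck URL (illustrative helper)."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"expected an absolute URL, got {url!r}")
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    return port, path
```

Rejecting relative values early is one way to surface a misconfigured template at startup rather than as a failing healthcheck later.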
## API ## API
### `POST /reserve` (external port, signed by the autoscaler) ### Reservation: `POST /session/create` (external, signed)
Holds the worker busy until the reservation ends. Not implemented here — the framework provides this route automatically on
every PyWorker. Use the SDK:
Request body (all fields optional): ```python
from vastai import Serverless
```json async with Serverless() as client:
{ "duration": 600 } endpoint = await client.get_endpoint(name="my-null-endpoint")
async with endpoint.session(cost=100, lifetime=600) as s:
# Worker is now reserved. Your queue dispatcher does whatever it
# needs to do (typically: enqueue a job that mentions s.session_id).
...
# `async with` exit posts to /session/end → 200 success in metrics
``` ```
- `duration` (seconds, optional): safety cap on how long to hold the Or raw HTTP (the SDK takes care of autoscaler signing for you, but the
reservation if no `/release` arrives. Capped by `MAX_RESERVATION_SECONDS` shape of the request is documented for non-Python clients):
(env var, default 3600). If omitted, defaults to that cap.
Behavior: ```
POST /session/create
{
"auth_data": { /* signed by autoscaler */ },
"payload": {
"lifetime": 600,
"on_close_route": "https://your.callback/notify",
"on_close_payload": {"job_id": "..."}
}
}
```
- Returns `200` with `{"released": "explicit", ...}` when the local consumer ### Release from a local consumer: `POST /release` (internal, localhost-only)
POSTs `/release` on the internal port. **This is the intended happy path
— the request is counted as a success in metrics.**
- Returns `200` with `{"released": "duration_elapsed", "duration": <n>}` if
the duration cap fires (safety net for a stuck consumer).
- Returns `499` if the external client disconnects (counted as cancelled in
metrics — avoid this; use `/release` instead).
- Returns `429` immediately if the worker is already holding a reservation
(so serverless routes the request to a free worker instead of queueing).
### `POST /release` (internal port, localhost-only) Closes the active session, regardless of who created it. No body, no
auth. Use this when the queue consumer doesn't have (and shouldn't need)
Marks the active reservation as done. No body required. Idempotent: the session's `session_auth`:
```bash ```bash
curl -X POST http://127.0.0.1:18999/release curl -X POST http://127.0.0.1:18999/release
@@ -98,78 +97,75 @@ curl -X POST http://127.0.0.1:18999/release
Responses: Responses:
- `200 {"released": true}` — active reservation was released; the held - `200 {"released": true, "session_ids": ["..."]}` — closed; the held
`/reserve` will return `{"released": "explicit"}`. client-side `/session/create` completes and counts as a success.
- `200 {"released": false, "reason": "no active reservation"}` — nothing was - `200 {"released": false, "reason": "no active session"}` — nothing
in flight, no-op. active, no-op.
Only processes on the Vast instance can reach this port. There is no For setups where the dispatcher can hand the consumer `session_auth`
authentication on it. (e.g. as part of the queue payload), the consumer can instead POST
`/session/end` on the framework's HTTP-only port
(`$WORKER_HTTP_PORT`, default `WORKER_PORT+1`) — the standard, fully
authenticated release path.
## Environment variables ## Environment variables
- `MAX_RESERVATION_SECONDS` — upper bound on how long a single `/reserve`
call can hold a worker if `/release` is never called. Defaults to `3600`.
- `BACKEND_HEALTH_URL` — absolute URL the framework should healthcheck - `BACKEND_HEALTH_URL` — absolute URL the framework should healthcheck
(e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health` route (e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health`
is not registered on the internal server. When unset, the built-in stub route is not registered on the internal server.
is used.
- `NULL_CONTROL_PORT` — port for the internal control server (hosts - `NULL_CONTROL_PORT` — port for the internal control server (hosts
`/release` and optionally `/health`). Defaults to `18999`. `/release` and optionally `/health`). Defaults to `18999`.
## Deploying on Vast Serverless ## Deploying on Vast Serverless
1. Create a Serverless endpoint and point `PYWORKER_REPO` at this repository 1. Create a Serverless endpoint and point `PYWORKER_REPO` at this
(or your fork). repository (or your fork).
2. Set `BACKEND=null` in the template so `start_server.sh` runs 2. Set `BACKEND=null` in the template so `start_server.sh` runs
`workers.null.worker`. `workers.null.worker`.
3. There is no model server to configure; you can omit model-related env vars 3. There is no model server to configure; you can omit model-related env
entirely. vars entirely.
4. Run your own queue-consumer process on the instance alongside the 4. Run your own queue-consumer process on the instance alongside the
PyWorker. When the consumer finishes its work it should: PyWorker. When it finishes its work:
```bash ```bash
curl -X POST http://127.0.0.1:18999/release curl -X POST http://127.0.0.1:18999/release
``` ```
so the held `/reserve` returns success and the autoscaler can scale the
worker down cleanly.
## Client example ## Client example
Single reservation: Single reservation (holds for 180s):
```bash ```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME> --duration 600 python -m workers.null.client --endpoint <ENDPOINT_NAME>
``` ```
To exercise the full flow, shell into the worker and run
`curl -X POST http://127.0.0.1:18999/release` — the client returns with
`{"released": "explicit", ...}`.
Staggered demo: Staggered demo:
```bash ```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
``` ```
Starts three reservations 30s apart (all held concurrently), holds the Starts three sessions 30s apart (all held concurrently), holds the
3-worker plateau for 5 minutes so the autoscaler has time to actually 3-worker plateau for 5 minutes so the autoscaler has time to actually
provision the third worker before any scale-down starts, then scales provision the third worker before any scale-down starts, then closes
down one worker at a time, also 30s apart, and exits. the sessions one at a time, also 30s apart, and exits. Every session
ends cleanly via the SDK's `session.close()` — `200` successes in
metrics, no cancellations.
Each reservation ends via its duration cap (a 200 success in metrics). Tune the timing with `--interval` and `--plateau`. To exercise the
Tune the timing with `--interval` and `--plateau`. local-release path, shell into a worker and run
`curl -X POST http://127.0.0.1:18999/release`.
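
The trapezoid timing follows from the hold formula in the client, `hold = (n - 1) * interval + plateau`. A small worked sketch of the resulting schedule (`demo_schedule` is an illustrative helper, not part of the client):

```python
def demo_schedule(n: int, interval: float, plateau: float) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds for each of the n sessions."""
    # Every session holds for the same duration; only the start is staggered.
    hold = (n - 1) * interval + plateau
    return [(i * interval, i * interval + hold) for i in range(n)]


# Three sessions, 30s apart, 300s plateau.
for start, end in demo_schedule(3, 30, 300):
    print(f"start={start:>4}s  end={end:>4}s")
```

With those values the sessions start at 0s, 30s, and 60s and end at 360s, 390s, and 420s, so the first close lands 300s after the last start and the closes are 30s apart.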
## Notes and caveats ## Notes and caveats
- The HTTP connection from the external caller must stay open for the full - The reservation's lifetime caps how long the session can live without
reservation. Make sure your client and any intermediate proxies allow client activity. Set it comfortably longer than the work you expect to
long-lived requests (disable idle timeouts, retries, and connection do, or have the client periodically POST `/ping` with `session_id` to
reuse if necessary). extend.
- If your client retries on timeout, you may end up provisioning duplicate - The `on_close_route` payload (passed at `/session/create`) is POSTed by
workers. Configure `duration` generously and rely on `/release` from the the framework when the session ends. Useful for notifying your queue
consumer to end reservations promptly. consumer that the reservation is closing.
- Avoid disconnecting the external `/reserve` request as a way to release — - `/release` on the internal port is convenient but bypasses
that produces a `499` and is counted as a cancellation in Vast metrics. `session_auth`. If you need the standard authenticated release flow,
Always release via `POST /release` on the internal port. pass `session_auth` to your consumer (e.g. through the queue payload)
- There is no streaming / heartbeat in the response; the request returns and have it POST to `/session/end` on the framework's HTTP port
exactly once, when the reservation ends. instead.
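
The localhost `/release` shortcut reaches the framework's private `__close_session` through Python name mangling, as the commit message describes. A minimal sketch of that mechanism follows; the `Backend` class here is a stand-in written for illustration, and only the mangling rule itself is standard Python.

```python
import asyncio


class Backend:
    """Stand-in for the framework backend (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.sessions = {"abc123": object()}

    async def __close_session(self, session_id: str) -> bool:
        # Inside a class body, __name is stored as _<ClassName>__name.
        return self.sessions.pop(session_id, None) is not None


async def release_all(backend: Backend) -> list[str]:
    closed = []
    # Copy the keys first: closing a session mutates the dict.
    for sid in list(backend.sessions.keys()):
        # From outside the class, the mangled name must be spelled out.
        if await backend._Backend__close_session(sid):
            closed.append(sid)
    return closed
```

Because the mangled name is an implementation detail, a call like this can break if the framework renames the class or the method, which is a reason to prefer the authenticated `/session/end` path where `session_auth` is available.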
+42 -33
@@ -15,35 +15,42 @@ logging.basicConfig(
log = logging.getLogger(__file__) log = logging.getLogger(__file__)
ENDPOINT_NAME = "null-prod" ENDPOINT_NAME = "null-prod"
SESSION_COST = 100
async def reserve( async def reserve(
client: Serverless, client: Serverless,
*, *,
endpoint_name: str, endpoint_name: str,
duration: float, hold_for: float,
label: str = "reservation", label: str = "session",
) -> dict: ) -> None:
"""Hold a Vast worker open for `duration` seconds (or until we disconnect). """Open a session, hold the worker for `hold_for` seconds, close cleanly.
The worker counts itself busy for the lifetime of this call. Returning Uses the framework's session model — each session counts as one worker
here means the reservation has ended — either /release was called on occupied, but unlike a held HTTP request it isn't poisoning the
the worker's internal control port, or the duration cap fired, or the worker's throughput math. max_sessions=1 on the worker side means a
HTTP request was cancelled. second /session/create against the same worker gets 429, so serverless
routes the second reservation to a free worker or scales a new one up.
""" """
endpoint = await client.get_endpoint(name=endpoint_name) endpoint = await client.get_endpoint(name=endpoint_name)
payload = {"duration": duration} # Session lifetime must outlast the hold. The framework expires sessions
# whose `expiration` (set to now + lifetime at creation) has passed; we
# don't make any keepalive requests so no extension happens.
lifetime = hold_for + 60
start = time.monotonic() start = time.monotonic()
log.info("[%s] POST /reserve duration=%ss", label, duration) log.info("[%s] creating session (lifetime=%.0fs, hold=%.0fs)", label, lifetime, hold_for)
async with endpoint.session(cost=SESSION_COST, lifetime=lifetime) as s:
log.info("[%s] session %s open", label, s.session_id)
try: try:
resp = await endpoint.request("/reserve", payload, cost=150) await asyncio.sleep(hold_for)
elapsed = time.monotonic() - start log.info("[%s] hold complete, closing session", label)
log.info("[%s] returned after %.1fs: %s", label, elapsed, resp.get("response"))
return resp["response"]
except asyncio.CancelledError: except asyncio.CancelledError:
elapsed = time.monotonic() - start elapsed = time.monotonic() - start
log.info("[%s] cancelled after %.1fs (HTTP connection dropped)", label, elapsed) log.info("[%s] cancelled after %.1fs, closing session", label, elapsed)
raise raise
elapsed = time.monotonic() - start
log.info("[%s] session closed cleanly after %.1fs", label, elapsed)
async def run_demo( async def run_demo(
@@ -53,38 +60,41 @@ async def run_demo(
interval: float, interval: float,
plateau: float, plateau: float,
) -> None: ) -> None:
"""Trapezoidal load: ramp up three reservations, plateau, then scale down. """Trapezoidal load: ramp up three sessions, plateau, then scale down.
Start three reservations spaced `interval` seconds apart. Pick the Start three sessions spaced `interval` seconds apart. Each holds for
duration so that the first release fires `plateau` seconds *after the `(n-1)*interval + plateau` seconds, so the first release fires
last reservation started*, giving the autoscaler time to actually have `plateau` seconds after the last session started, giving the
all three workers running before any of them begin to scale down. autoscaler time to actually have all three workers running before any
Releases then fire `interval` seconds apart, matching the ramp-up. scale-down begins. Releases then fire `interval` seconds apart,
matching the ramp-up.
Each reservation ends via its duration cap (a 200 success). Each session ends via the SDK's `session.close()` on `async with` exit,
which posts to /session/end with proper auth — counted as a normal
success in metrics.
""" """
n = 3 n = 3
hold = (n - 1) * interval + plateau hold = (n - 1) * interval + plateau
tasks: list[asyncio.Task] = [] tasks: list[asyncio.Task] = []
for i in range(1, n + 1): for i in range(1, n + 1):
label = f"res-{i}" label = f"res-{i}"
log.info("[%s] starting (auto-release after %.0fs)", label, hold) log.info("[%s] starting (hold=%.0fs)", label, hold)
task = asyncio.create_task( task = asyncio.create_task(
reserve( reserve(
client, client,
endpoint_name=endpoint_name, endpoint_name=endpoint_name,
duration=hold, hold_for=hold,
label=label, label=label,
), ),
name=label, name=label,
) )
tasks.append(task) tasks.append(task)
if i < n: if i < n:
log.info("Waiting %.0fs before next reservation...", interval) log.info("Waiting %.0fs before next session...", interval)
await asyncio.sleep(interval) await asyncio.sleep(interval)
log.info( log.info(
"All %d reservations in flight; holding plateau for %.0fs, " "All %d sessions in flight; holding plateau for %.0fs, "
"then scaling down %.0fs apart", "then scaling down %.0fs apart",
n, n,
plateau, plateau,
@@ -106,19 +116,19 @@ def build_arg_parser() -> argparse.ArgumentParser:
"--duration", "--duration",
type=float, type=float,
default=180.0, default=180.0,
help="Seconds to hold each worker busy (default: 180)", help="Single-reserve mode: seconds to hold the worker (default: 180)",
) )
modes = p.add_mutually_exclusive_group(required=False) modes = p.add_mutually_exclusive_group(required=False)
modes.add_argument( modes.add_argument(
"--reserve", "--reserve",
action="store_true", action="store_true",
help="Make a single /reserve call (default if no mode given)", help="Make a single session (default if no mode given)",
) )
modes.add_argument( modes.add_argument(
"--demo", "--demo",
action="store_true", action="store_true",
help="Run the staggered 3-reservation demo, cancelling one mid-way", help="Run the staggered 3-reservation trapezoid demo",
) )
p.add_argument( p.add_argument(
@@ -157,15 +167,14 @@ async def main_async():
plateau=args.plateau, plateau=args.plateau,
) )
else: else:
response = await reserve( await reserve(
client, client,
endpoint_name=args.endpoint, endpoint_name=args.endpoint,
duration=args.duration, hold_for=args.duration,
label="reservation", label="reservation",
) )
print(f"Reservation result: {response}")
except KeyboardInterrupt: except KeyboardInterrupt:
log.info("Interrupted; dropping any in-flight reservations") log.info("Interrupted; dropping any in-flight sessions")
except Exception as e: except Exception as e:
log.error("Error: %s", e, exc_info=True) log.error("Error: %s", e, exc_info=True)
sys.exit(1) sys.exit(1)
+70 -121
@@ -2,7 +2,6 @@ import asyncio
import logging import logging
import os import os
from contextlib import asynccontextmanager from contextlib import asynccontextmanager
from typing import Optional
from urllib.parse import urlsplit from urllib.parse import urlsplit
from aiohttp import web from aiohttp import web
@@ -17,21 +16,21 @@ from vastai import (
log = logging.getLogger(__file__) log = logging.getLogger(__file__)
# Safety cap: if the user's queue consumer never calls /release, the # Performance value pinned in the benchmark cache; sent to autoscaler as
# reservation is auto-released after this many seconds so a forgotten /release # max_perf. Standardized at 100 — the conventional default the rest of the
# can't pin a worker indefinitely. Override with MAX_RESERVATION_SECONDS. # serverless system expects.
MAX_RESERVATION_SECONDS = float(os.environ.get("MAX_RESERVATION_SECONDS", 3600)) TARGET_PERF = 100.0
# Marker the benchmark path sets so the same remote function can return # Marker the benchmark path sets so the fallback /ping path returns
# immediately during capacity estimation instead of sleeping. # immediately during the framework's startup benchmark.
BENCHMARK_SENTINEL = "__null_worker_benchmark__" BENCHMARK_SENTINEL = "__null_worker_benchmark__"
# Internal control server. Hosts: # Internal control server. Hosts:
# * POST /release — always available, marks the active reservation as # * POST /release — releases the active reservation by closing the
# done so the held /reserve returns 200 (success in metrics, not a # singleton session on this worker. Called by the user's queue
# cancellation). # consumer when its work is done.
# * GET /health — only when no external BACKEND_HEALTH_URL is set; the # * GET /health — only when BACKEND_HEALTH_URL is unset; gives the
# framework's healthcheck loop polls it so the worker has a live signal. # framework's healthcheck loop something live to talk to.
# Bound to 127.0.0.1 so only processes on the instance can reach it. # Bound to 127.0.0.1 so only processes on the instance can reach it.
INTERNAL_HOST = "127.0.0.1" INTERNAL_HOST = "127.0.0.1"
INTERNAL_PORT = int(os.environ.get("NULL_CONTROL_PORT", 18999)) INTERNAL_PORT = int(os.environ.get("NULL_CONTROL_PORT", 18999))
@@ -56,62 +55,51 @@ else:
USE_STUB_HEALTH = True USE_STUB_HEALTH = True
# Workload reported per /reserve and target perf for the heartbeat below. # Stashed after Worker(...) is constructed so /release can reach the
TARGET_PERF = 150.0 # framework's session machinery. Dict so the lifecycle closure picks up
# the assignment that happens before .run().
# Singleton active reservation. `allow_parallel_requests=False` on the
# /reserve handler guarantees the framework only runs one at a time per
# worker, so a single slot is enough.
_active_reservation: Optional[asyncio.Event] = None
# Backed in after Worker(...) is constructed so the heartbeat coroutine in
# null_lifecycle() can mutate backend.metrics. Stored in a dict so the
# lifecycle closure picks up the assignment that happens before .run().
_backend_ref: dict = {"backend": None} _backend_ref: dict = {"backend": None}
async def _perf_heartbeat() -> None:
"""Keep cur_perf pegged to TARGET_PERF while a reservation is held.
Without this, workload_served stays at 0 while a /reserve is being held
open. The autoscaler observes cur_perf=0 against max_perf=150, decides
the worker can't deliver its claimed throughput, and downgrades it —
which makes it cautious about scaling up and prone to queueing
subsequent requests behind the held one instead of routing elsewhere.
Every second, if anything is in flight, set workload_served=TARGET_PERF
and mark update_pending so the metrics loop sends immediately. The
metrics tick resets workload_served back to 0 after sending; we
re-pin it next iteration.
"""
while True:
try:
await asyncio.sleep(1.0)
backend = _backend_ref.get("backend")
if backend is None:
continue
mm = backend.metrics.model_metrics
if mm.requests_working:
mm.workload_served = TARGET_PERF
backend.metrics.update_pending = True
except asyncio.CancelledError:
raise
except Exception as e:
log.debug(f"perf heartbeat error: {e}")
def _build_internal_app() -> web.Application: def _build_internal_app() -> web.Application:
app = web.Application() app = web.Application()
async def release_handler(_request: web.Request) -> web.Response: async def release_handler(_request: web.Request) -> web.Response:
event = _active_reservation """End the active reservation (the singleton session on this worker).
if event is None:
max_sessions=1 means at most one session is active here. We call
the framework's internal __close_session via name-mangling to
bypass the session_auth check that /session/end normally requires.
That's intentional: this endpoint is localhost-only so trust is
assumed, and the user's consumer can release without having to
plumb session_auth through their queue.
__close_session reports the session metrics as a success, fires
on_close_route if configured, and pops the session from the dict.
"""
backend = _backend_ref.get("backend")
if backend is None:
return web.json_response( return web.json_response(
{"released": False, "reason": "no active reservation"}, {"released": False, "reason": "backend not ready"},
status=503,
)
sids = list(backend.sessions.keys())
if not sids:
return web.json_response(
{"released": False, "reason": "no active session"},
status=200,
)
closed = []
for sid in sids:
try:
if await backend._Backend__close_session(sid):
closed.append(sid)
except Exception as e:
log.warning(f"Error closing session {sid}: {e}")
return web.json_response(
{"released": bool(closed), "session_ids": closed},
status=200, status=200,
) )
event.set()
return web.json_response({"released": True}, status=200)
app.router.add_post("/release", release_handler) app.router.add_post("/release", release_handler)
@@ -126,18 +114,14 @@ def _build_internal_app() -> web.Application:
@asynccontextmanager @asynccontextmanager
async def null_lifecycle(): async def null_lifecycle():
# Pin max_throughput to exactly 100 by pre-populating the framework's # Pin max_throughput to exactly TARGET_PERF by pre-populating the
# benchmark cache file. The framework's __run_benchmark short-circuits # framework's benchmark cache file. __run_benchmark short-circuits to
# to `float(file_contents)` when this file exists, bypassing the # float(file_contents) when this file exists.
# time-based calculation that would otherwise drift to ~99.x due to
# asyncio scheduling overhead. The filename matches the framework
# constant BENCHMARK_INDICATOR_FILE in
# vastai.serverless.server.lib.backend.
try: try:
with open(".has_benchmark", "w") as fh: with open(".has_benchmark", "w") as fh:
fh.write("150") fh.write(str(int(TARGET_PERF)))
except OSError as e: except OSError as e:
log.warning(f"Could not pin benchmark cache to 150: {e}") log.warning(f"Could not pin benchmark cache: {e}")
app = _build_internal_app() app = _build_internal_app()
runner = web.AppRunner(app) runner = web.AppRunner(app)
@@ -145,8 +129,6 @@ async def null_lifecycle():
site = web.TCPSite(runner, INTERNAL_HOST, INTERNAL_PORT) site = web.TCPSite(runner, INTERNAL_HOST, INTERNAL_PORT)
await site.start() await site.start()
heartbeat = asyncio.create_task(_perf_heartbeat(), name="null-perf-heartbeat")
lines = [ lines = [
f"Null pyworker internal control server: http://{INTERNAL_HOST}:{INTERNAL_PORT}", f"Null pyworker internal control server: http://{INTERNAL_HOST}:{INTERNAL_PORT}",
f" POST /release - end the active reservation (call from your queue consumer)", f" POST /release - end the active reservation (call from your queue consumer)",
@@ -157,60 +139,32 @@ async def null_lifecycle():
) )
else: else:
lines.append(f"Framework healthcheck pointed at: {BACKEND_HEALTH_URL}") lines.append(f"Framework healthcheck pointed at: {BACKEND_HEALTH_URL}")
lines.append(
"Reservations use the framework session model. Clients POST to "
"/session/create via the SDK to acquire a worker; max_sessions=1 "
"so each worker holds at most one reservation."
)
log.info("\n".join(lines)) log.info("\n".join(lines))
try: try:
yield yield
finally: finally:
heartbeat.cancel()
try:
await heartbeat
except (asyncio.CancelledError, Exception):
pass
await runner.cleanup() await runner.cleanup()
async def reserve_worker(**params: object) -> dict: async def ping(**params: object) -> dict:
global _active_reservation """Trivial handler. Exists to satisfy the framework's requirement that
at least one HandlerConfig has a BenchmarkConfig, and to give clients
a route they can hit with session_id to extend their session TTL.
"""
if params.get(BENCHMARK_SENTINEL): if params.get(BENCHMARK_SENTINEL):
# Fallback path only — the lifecycle pre-populates .has_benchmark # Fallback only — the lifecycle pre-pins .has_benchmark so
# with "150" so __run_benchmark normally short-circuits and never # __run_benchmark normally short-circuits and this never runs. If
# invokes us. If the cache write failed, sleep ~1s so the # the cache write failed, sleep ~1s so the time-based throughput
# time-based calculation lands near 150 (workload=150 / time~=1s). # math lands near TARGET_PERF.
await asyncio.sleep(1.0) await asyncio.sleep(1.0)
return {"ok": True, "benchmark": True} return {"ok": True, "benchmark": True}
return {"ok": True}
requested = params.get("duration")
if requested is None:
duration = MAX_RESERVATION_SECONDS
else:
try:
duration = max(0.0, min(float(requested), MAX_RESERVATION_SECONDS))
except (TypeError, ValueError):
duration = MAX_RESERVATION_SECONDS
event = asyncio.Event()
_active_reservation = event
log.info(
f"Reservation acquired; awaiting POST /release on "
f"http://{INTERNAL_HOST}:{INTERNAL_PORT}/release "
f"(auto-release after {duration:.1f}s)"
)
try:
try:
await asyncio.wait_for(event.wait(), timeout=duration)
log.info("Reservation released via /release")
return {"released": "explicit", "duration_cap": duration}
except asyncio.TimeoutError:
log.warning(
f"Reservation hit duration cap of {duration:.1f}s without "
f"explicit /release; releasing automatically"
)
return {"released": "duration_elapsed", "duration": duration}
finally:
if _active_reservation is event:
_active_reservation = None
worker_config = WorkerConfig( worker_config = WorkerConfig(
@@ -218,17 +172,12 @@ worker_config = WorkerConfig(
model_server_port=HEALTH_PORT, model_server_port=HEALTH_PORT,
model_healthcheck_url=HEALTH_PATH, model_healthcheck_url=HEALTH_PATH,
lifecycle=null_lifecycle(), lifecycle=null_lifecycle(),
max_sessions=1,
handlers=[ handlers=[
HandlerConfig( HandlerConfig(
route="/reserve", route="/ping",
allow_parallel_requests=False, allow_parallel_requests=True,
# Reject (429) any /reserve that arrives while the worker is remote_function=ping,
# already busy. A held reservation lasts up to MAX_RESERVATION_
# SECONDS, so queueing behind it would mean hours of wait —
# better to bounce the request immediately so serverless routes
# it to a free worker (or spins up a new one).
max_queue_time=0.0,
remote_function=reserve_worker,
workload_calculator=lambda _payload: TARGET_PERF, workload_calculator=lambda _payload: TARGET_PERF,
benchmark_config=BenchmarkConfig( benchmark_config=BenchmarkConfig(
generator=lambda: {BENCHMARK_SENTINEL: True}, generator=lambda: {BENCHMARK_SENTINEL: True},