Rewrite null pyworker on the framework session model

Drop the held-/reserve approach in favour of the framework's session
primitive (max_sessions=1 + /session/create). Sessions are excluded from
the autoscaler's queue-wait math and don't suffer the cur_perf=0
degradation that a long-held request did, so this naturally produces the
"one request comes in and you get a worker; release and it scales back
down" model we were hand-rolling.

Server side:
  - max_sessions=1; framework auto-registers /session/* routes
  - Drop custom /reserve handler, _active_reservation event, max_queue_
    time=0.0, MAX_RESERVATION_SECONDS, _perf_heartbeat
  - Trivial /ping handler exists only to satisfy the framework's
    "at least one handler with BenchmarkConfig" requirement (and to give
    clients an extension/keepalive route)
  - /release on the internal control port is kept as a convenience for
    queue consumers that don't carry session_auth — calls the framework's
    __close_session via name-mangling, which bypasses the session_auth
    check but is fine for a localhost-only endpoint
  - Workload/perf back to 100 (conventional)

Client side:
  - Uses endpoint.session(cost, lifetime) instead of POST /reserve
  - async with the SDK Session; close on exit posts /session/end with
    proper auth → 200 success in metrics
  - Demo and single modes both ride the same reserve() helper

Sessions landed in vastai-sdk 0.4.2 (commit ec9ef59, 2026-01-20).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Rob Ballantyne
2026-05-12 10:51:24 +01:00
parent 6c2f194b28
commit 6a562a1376
3 changed files with 206 additions and 252 deletions
+90 -94
View File
@@ -1,10 +1,8 @@
# Null PyWorker
A PyWorker that does **nothing** — it does not forward requests to any model
server. Each HTTP POST to `/reserve` simply marks the worker as busy and holds
the request open until the user's queue consumer (running locally on the
instance) calls `/release` on the internal control port — or a safety
timeout elapses.
server. Reservations are modelled as framework **sessions**: a request
comes in and you get a worker; release and it scales back down.
## When to use it
@@ -15,32 +13,29 @@ Use this worker when you want to drive Vast Serverless autoscaling but you do
etc.).
- A separate worker process on the Vast instance pulls work from that queue
directly. The Vast PyWorker is not involved in the request/response path.
Your consumer can be any language — node, golang, python, a binary —
this PyWorker is implementation-agnostic.
- You want one Vast worker per active queue consumer, and you want the
Serverless autoscaler to spin instances up and down based on demand on
*your* side.
A request comes in and you get a worker. Release and it scales back down.
POST to `/reserve` and serverless gives you a worker, held busy for the
lifetime of the request. When your queue consumer is done, POST to
`/release` on the internal port (`127.0.0.1:18999` by default) and the
held `/reserve` returns `200`.
## How it works
- `allow_parallel_requests=False` and `max_queue_time=0.0`, so one in-flight
`/reserve` fully occupies the worker and any further request that lands
on it is rejected with `429` immediately — serverless will route to a
free worker or scale a new one up.
- `lifecycle` is used instead of `model_log_file`, so there is no log to tail
and no model server to start. The worker reports itself ready immediately
after the (trivial) benchmark.
- The `/reserve` handler is a `remote_function` rather than an HTTP proxy, so
the framework never tries to forward the request anywhere — it just awaits
an internal `asyncio.Event`.
- An internal aiohttp control server, bound to `127.0.0.1`, hosts
`/release` (and, when no external healthcheck URL is provided, a stub
`/health`).
- Reservations use the framework's **session** model. The SDK exposes
`endpoint.session(cost, lifetime)` which POSTs to `/session/create` (a
built-in framework route) and returns a `Session` object usable as
`async with`. Closing the context (or calling `await session.close()`)
POSTs to `/session/end` — counted as a normal success in metrics.
- `max_sessions=1` on the worker side means a second `/session/create`
against an already-occupied worker returns `429`. Serverless routes
that request to a free worker or scales a new one up.
- Sessions are **excluded from queue-wait math** (the framework filters
`if not request.is_session`), so an occupied worker doesn't look like
it has a request queue piling up. The autoscaler treats a session as
occupancy, not as work-in-progress.
- `lifecycle` is used instead of `model_log_file`, so there is no log to
tail and no model server to start. The worker reports itself ready
immediately after a trivial benchmark.
## Healthchecking
@@ -49,48 +44,52 @@ fails after the first success, the worker is marked errored and the
autoscaler can decommission it. Two modes:
- **Stub (default)** — the internal control server also answers
`GET /health` with `200`. This is just enough to satisfy the framework
while you wire up real consumers.
`GET /health` with `200`. Just enough to satisfy the framework while
you wire up real consumers.
- **Point at your queue consumer (recommended)** — set
`BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and the
pyworker will healthcheck *your* consumer instead. If your consumer
`BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and
the pyworker will healthcheck *your* consumer instead. If the consumer
process crashes, the autoscaler will see the worker as broken.
Run your queue consumer on the instance alongside the PyWorker, expose a
plain `/health` endpoint on it, then set `BACKEND_HEALTH_URL` accordingly in
your template.
## API
### `POST /reserve` (external port, signed by the autoscaler)
### Reservation: `POST /session/create` (external, signed)
Holds the worker busy until the reservation ends.
Not implemented here — the framework provides this route automatically on
every PyWorker. Use the SDK:
Request body (all fields optional):
```python
from vastai import Serverless
```json
{ "duration": 600 }
async with Serverless() as client:
endpoint = await client.get_endpoint(name="my-null-endpoint")
async with endpoint.session(cost=100, lifetime=600) as s:
# Worker is now reserved. Your queue dispatcher does whatever it
# needs to do (typically: enqueue a job that mentions s.session_id).
...
# `async with` exit posts to /session/end → 200 success in metrics
```
- `duration` (seconds, optional): safety cap on how long to hold the
reservation if no `/release` arrives. Capped by `MAX_RESERVATION_SECONDS`
(env var, default 3600). If omitted, defaults to that cap.
Or raw HTTP (the SDK takes care of autoscaler signing for you, but the
shape of the request is documented for non-Python clients):
Behavior:
```
POST /session/create
{
"auth_data": { /* signed by autoscaler */ },
"payload": {
"lifetime": 600,
"on_close_route": "https://your.callback/notify",
"on_close_payload": {"job_id": "..."}
}
}
```
- Returns `200` with `{"released": "explicit", ...}` when the local consumer
POSTs `/release` on the internal port. **This is the intended happy path
— the request is counted as a success in metrics.**
- Returns `200` with `{"released": "duration_elapsed", "duration": <n>}` if
the duration cap fires (safety net for a stuck consumer).
- Returns `499` if the external client disconnects (counted as cancelled in
metrics — avoid this; use `/release` instead).
- Returns `429` immediately if the worker is already holding a reservation
(so serverless routes the request to a free worker instead of queueing).
### Release from a local consumer: `POST /release` (internal, localhost-only)
### `POST /release` (internal port, localhost-only)
Marks the active reservation as done. No body required. Idempotent:
Closes the active session, regardless of who created it. No body, no
auth. Use this when the queue consumer doesn't have (and shouldn't need)
the session's `session_auth`:
```bash
curl -X POST http://127.0.0.1:18999/release
@@ -98,78 +97,75 @@ curl -X POST http://127.0.0.1:18999/release
Responses:
- `200 {"released": true}` — active reservation was released; the held
`/reserve` will return `{"released": "explicit"}`.
- `200 {"released": false, "reason": "no active reservation"}` — nothing was
in flight, no-op.
- `200 {"released": true, "session_ids": ["..."]}` — closed; the held
client-side `/session/create` completes and counts as a success.
- `200 {"released": false, "reason": "no active session"}` — nothing
active, no-op.
Only processes on the Vast instance can reach this port. There is no
authentication on it.
For setups where the dispatcher can hand the consumer `session_auth`
(e.g. as part of the queue payload), the consumer can instead POST
`/session/end` on the framework's HTTP-only port
(`$WORKER_HTTP_PORT`, default `WORKER_PORT+1`) — the standard, fully
authenticated release path.
## Environment variables
- `MAX_RESERVATION_SECONDS` — upper bound on how long a single `/reserve`
call can hold a worker if `/release` is never called. Defaults to `3600`.
- `BACKEND_HEALTH_URL` — absolute URL the framework should healthcheck
(e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health` route
is not registered on the internal server. When unset, the built-in stub
is used.
(e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health`
route is not registered on the internal server.
- `NULL_CONTROL_PORT` — port for the internal control server (hosts
`/release` and optionally `/health`). Defaults to `18999`.
## Deploying on Vast Serverless
1. Create a Serverless endpoint and point `PYWORKER_REPO` at this repository
(or your fork).
1. Create a Serverless endpoint and point `PYWORKER_REPO` at this
repository (or your fork).
2. Set `BACKEND=null` in the template so `start_server.sh` runs
`workers.null.worker`.
3. There is no model server to configure; you can omit model-related env vars
entirely.
3. There is no model server to configure; you can omit model-related env
vars entirely.
4. Run your own queue-consumer process on the instance alongside the
PyWorker. When the consumer finishes its work it should:
PyWorker. When it finishes its work:
```bash
curl -X POST http://127.0.0.1:18999/release
```
so the held `/reserve` returns success and the autoscaler can scale the
worker down cleanly.
## Client example
Single reservation:
Single reservation (holds for 180s):
```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME> --duration 600
python -m workers.null.client --endpoint <ENDPOINT_NAME>
```
To exercise the full flow, shell into the worker and run
`curl -X POST http://127.0.0.1:18999/release` — the client returns with
`{"released": "explicit", ...}`.
Staggered demo:
```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
```
Starts three reservations 30s apart (all held concurrently), holds the
Starts three sessions 30s apart (all held concurrently), holds the
3-worker plateau for 5 minutes so the autoscaler has time to actually
provision the third worker before any scale-down starts, then scales
down one worker at a time, also 30s apart, and exits.
provision the third worker before any scale-down starts, then closes
the sessions one at a time, also 30s apart, and exits. Every session
ends cleanly via the SDK's `session.close()` — `200` successes in
metrics, no cancellations.
Each reservation ends via its duration cap (a 200 success in metrics).
Tune the timing with `--interval` and `--plateau`.
Tune the timing with `--interval` and `--plateau`. To exercise the
local-release path, shell into a worker and run
`curl -X POST http://127.0.0.1:18999/release`.
## Notes and caveats
- The HTTP connection from the external caller must stay open for the full
reservation. Make sure your client and any intermediate proxies allow
long-lived requests (disable idle timeouts, retries, and connection
reuse if necessary).
- If your client retries on timeout, you may end up provisioning duplicate
workers. Configure `duration` generously and rely on `/release` from the
consumer to end reservations promptly.
- Avoid disconnecting the external `/reserve` request as a way to release —
that produces a `499` and is counted as a cancellation in Vast metrics.
Always release via `POST /release` on the internal port.
- There is no streaming / heartbeat in the response; the request returns
exactly once, when the reservation ends.
- The reservation's lifetime caps how long the session can live without
client activity. Set it comfortably longer than the work you expect to
do, or have the client periodically POST `/ping` with `session_id` to
extend.
- The `on_close_route` payload (passed at `/session/create`) is POSTed by
the framework when the session ends. Useful for notifying your queue
consumer that the reservation is closing.
- `/release` on the internal port is convenient but bypasses
`session_auth`. If you need the standard authenticated release flow,
pass `session_auth` to your consumer (e.g. through the queue payload)
and have it POST to `/session/end` on the framework's HTTP port
instead.
+42 -33
View File
@@ -15,35 +15,42 @@ logging.basicConfig(
log = logging.getLogger(__file__)
ENDPOINT_NAME = "null-prod"
SESSION_COST = 100
async def reserve(
client: Serverless,
*,
endpoint_name: str,
duration: float,
label: str = "reservation",
) -> dict:
"""Hold a Vast worker open for `duration` seconds (or until we disconnect).
hold_for: float,
label: str = "session",
) -> None:
"""Open a session, hold the worker for `hold_for` seconds, close cleanly.
The worker counts itself busy for the lifetime of this call. Returning
here means the reservation has ended — either /release was called on
the worker's internal control port, or the duration cap fired, or the
HTTP request was cancelled.
Uses the framework's session model — each session counts as one worker
occupied, but unlike a held HTTP request it isn't poisoning the
worker's throughput math. max_sessions=1 on the worker side means a
second /session/create against the same worker gets 429, so serverless
routes the second reservation to a free worker or scales a new one up.
"""
endpoint = await client.get_endpoint(name=endpoint_name)
payload = {"duration": duration}
# Session lifetime must outlast the hold. The framework expires sessions
# whose `expiration` (set to now + lifetime at creation) has passed; we
# don't make any keepalive requests so no extension happens.
lifetime = hold_for + 60
start = time.monotonic()
log.info("[%s] POST /reserve duration=%ss", label, duration)
log.info("[%s] creating session (lifetime=%.0fs, hold=%.0fs)", label, lifetime, hold_for)
async with endpoint.session(cost=SESSION_COST, lifetime=lifetime) as s:
log.info("[%s] session %s open", label, s.session_id)
try:
resp = await endpoint.request("/reserve", payload, cost=150)
elapsed = time.monotonic() - start
log.info("[%s] returned after %.1fs: %s", label, elapsed, resp.get("response"))
return resp["response"]
await asyncio.sleep(hold_for)
log.info("[%s] hold complete, closing session", label)
except asyncio.CancelledError:
elapsed = time.monotonic() - start
log.info("[%s] cancelled after %.1fs (HTTP connection dropped)", label, elapsed)
log.info("[%s] cancelled after %.1fs, closing session", label, elapsed)
raise
elapsed = time.monotonic() - start
log.info("[%s] session closed cleanly after %.1fs", label, elapsed)
async def run_demo(
@@ -53,38 +60,41 @@ async def run_demo(
interval: float,
plateau: float,
) -> None:
"""Trapezoidal load: ramp up three reservations, plateau, then scale down.
"""Trapezoidal load: ramp up three sessions, plateau, then scale down.
Start three reservations spaced `interval` seconds apart. Pick the
duration so that the first release fires `plateau` seconds *after the
last reservation started*, giving the autoscaler time to actually have
all three workers running before any of them begin to scale down.
Releases then fire `interval` seconds apart, matching the ramp-up.
Start three sessions spaced `interval` seconds apart. Each holds for
`(n-1)*interval + plateau` seconds, so the first release fires
`plateau` seconds after the last session started giving the
autoscaler time to actually have all three workers running before any
scale-down begins. Releases then fire `interval` seconds apart,
matching the ramp-up.
Each reservation ends via its duration cap (a 200 success).
Each session ends via the SDK's `session.close()` on `async with` exit,
which posts to /session/end with proper auth — counted as a normal
success in metrics.
"""
n = 3
hold = (n - 1) * interval + plateau
tasks: list[asyncio.Task] = []
for i in range(1, n + 1):
label = f"res-{i}"
log.info("[%s] starting (auto-release after %.0fs)", label, hold)
log.info("[%s] starting (hold=%.0fs)", label, hold)
task = asyncio.create_task(
reserve(
client,
endpoint_name=endpoint_name,
duration=hold,
hold_for=hold,
label=label,
),
name=label,
)
tasks.append(task)
if i < n:
log.info("Waiting %.0fs before next reservation...", interval)
log.info("Waiting %.0fs before next session...", interval)
await asyncio.sleep(interval)
log.info(
"All %d reservations in flight; holding plateau for %.0fs, "
"All %d sessions in flight; holding plateau for %.0fs, "
"then scaling down %.0fs apart",
n,
plateau,
@@ -106,19 +116,19 @@ def build_arg_parser() -> argparse.ArgumentParser:
"--duration",
type=float,
default=180.0,
help="Seconds to hold each worker busy (default: 180)",
help="Single-reserve mode: seconds to hold the worker (default: 180)",
)
modes = p.add_mutually_exclusive_group(required=False)
modes.add_argument(
"--reserve",
action="store_true",
help="Make a single /reserve call (default if no mode given)",
help="Make a single session (default if no mode given)",
)
modes.add_argument(
"--demo",
action="store_true",
help="Run the staggered 3-reservation demo, cancelling one mid-way",
help="Run the staggered 3-reservation trapezoid demo",
)
p.add_argument(
@@ -157,15 +167,14 @@ async def main_async():
plateau=args.plateau,
)
else:
response = await reserve(
await reserve(
client,
endpoint_name=args.endpoint,
duration=args.duration,
hold_for=args.duration,
label="reservation",
)
print(f"Reservation result: {response}")
except KeyboardInterrupt:
log.info("Interrupted; dropping any in-flight reservations")
log.info("Interrupted; dropping any in-flight sessions")
except Exception as e:
log.error("Error: %s", e, exc_info=True)
sys.exit(1)
+70 -121
View File
@@ -2,7 +2,6 @@ import asyncio
import logging
import os
from contextlib import asynccontextmanager
from typing import Optional
from urllib.parse import urlsplit
from aiohttp import web
@@ -17,21 +16,21 @@ from vastai import (
log = logging.getLogger(__file__)
# Safety cap: if the user's queue consumer never calls /release, the
# reservation is auto-released after this many seconds so a forgotten /release
# can't pin a worker indefinitely. Override with MAX_RESERVATION_SECONDS.
MAX_RESERVATION_SECONDS = float(os.environ.get("MAX_RESERVATION_SECONDS", 3600))
# Performance value pinned in the benchmark cache; sent to autoscaler as
# max_perf. Standardized at 100 — the conventional default the rest of the
# serverless system expects.
TARGET_PERF = 100.0
# Marker the benchmark path sets so the same remote function can return
# immediately during capacity estimation instead of sleeping.
# Marker the benchmark path sets so the fallback /ping path returns
# immediately during the framework's startup benchmark.
BENCHMARK_SENTINEL = "__null_worker_benchmark__"
# Internal control server. Hosts:
# * POST /release — always available, marks the active reservation as
# done so the held /reserve returns 200 (success in metrics, not a
# cancellation).
# * GET /health — only when no external BACKEND_HEALTH_URL is set; the
# framework's healthcheck loop polls it so the worker has a live signal.
# * POST /release — releases the active reservation by closing the
# singleton session on this worker. Called by the user's queue
# consumer when its work is done.
# * GET /health — only when BACKEND_HEALTH_URL is unset; gives the
# framework's healthcheck loop something live to talk to.
# Bound to 127.0.0.1 so only processes on the instance can reach it.
INTERNAL_HOST = "127.0.0.1"
INTERNAL_PORT = int(os.environ.get("NULL_CONTROL_PORT", 18999))
@@ -56,62 +55,51 @@ else:
USE_STUB_HEALTH = True
# Workload reported per /reserve and target perf for the heartbeat below.
TARGET_PERF = 150.0
# Singleton active reservation. `allow_parallel_requests=False` on the
# /reserve handler guarantees the framework only runs one at a time per
# worker, so a single slot is enough.
_active_reservation: Optional[asyncio.Event] = None
# Backed in after Worker(...) is constructed so the heartbeat coroutine in
# null_lifecycle() can mutate backend.metrics. Stored in a dict so the
# lifecycle closure picks up the assignment that happens before .run().
# Stashed after Worker(...) is constructed so /release can reach the
# framework's session machinery. Dict so the lifecycle closure picks up
# the assignment that happens before .run().
_backend_ref: dict = {"backend": None}
async def _perf_heartbeat() -> None:
"""Keep cur_perf pegged to TARGET_PERF while a reservation is held.
Without this, workload_served stays at 0 while a /reserve is being held
open. The autoscaler observes cur_perf=0 against max_perf=150, decides
the worker can't deliver its claimed throughput, and downgrades it —
which makes it cautious about scaling up and prone to queueing
subsequent requests behind the held one instead of routing elsewhere.
Every second, if anything is in flight, set workload_served=TARGET_PERF
and mark update_pending so the metrics loop sends immediately. The
metrics tick resets workload_served back to 0 after sending; we
re-pin it next iteration.
"""
while True:
try:
await asyncio.sleep(1.0)
backend = _backend_ref.get("backend")
if backend is None:
continue
mm = backend.metrics.model_metrics
if mm.requests_working:
mm.workload_served = TARGET_PERF
backend.metrics.update_pending = True
except asyncio.CancelledError:
raise
except Exception as e:
log.debug(f"perf heartbeat error: {e}")
def _build_internal_app() -> web.Application:
app = web.Application()
async def release_handler(_request: web.Request) -> web.Response:
event = _active_reservation
if event is None:
"""End the active reservation (the singleton session on this worker).
max_sessions=1 means at most one session is active here. We call
the framework's internal __close_session via name-mangling to
bypass the session_auth check that /session/end normally requires.
That's intentional: this endpoint is localhost-only so trust is
assumed, and the user's consumer can release without having to
plumb session_auth through their queue.
__close_session reports the session metrics as a success, fires
on_close_route if configured, and pops the session from the dict.
"""
backend = _backend_ref.get("backend")
if backend is None:
return web.json_response(
{"released": False, "reason": "no active reservation"},
{"released": False, "reason": "backend not ready"},
status=503,
)
sids = list(backend.sessions.keys())
if not sids:
return web.json_response(
{"released": False, "reason": "no active session"},
status=200,
)
closed = []
for sid in sids:
try:
if await backend._Backend__close_session(sid):
closed.append(sid)
except Exception as e:
log.warning(f"Error closing session {sid}: {e}")
return web.json_response(
{"released": bool(closed), "session_ids": closed},
status=200,
)
event.set()
return web.json_response({"released": True}, status=200)
app.router.add_post("/release", release_handler)
@@ -126,18 +114,14 @@ def _build_internal_app() -> web.Application:
@asynccontextmanager
async def null_lifecycle():
# Pin max_throughput to exactly 100 by pre-populating the framework's
# benchmark cache file. The framework's __run_benchmark short-circuits
# to `float(file_contents)` when this file exists, bypassing the
# time-based calculation that would otherwise drift to ~99.x due to
# asyncio scheduling overhead. The filename matches the framework
# constant BENCHMARK_INDICATOR_FILE in
# vastai.serverless.server.lib.backend.
# Pin max_throughput to exactly TARGET_PERF by pre-populating the
# framework's benchmark cache file. __run_benchmark short-circuits to
# float(file_contents) when this file exists.
try:
with open(".has_benchmark", "w") as fh:
fh.write("150")
fh.write(str(int(TARGET_PERF)))
except OSError as e:
log.warning(f"Could not pin benchmark cache to 150: {e}")
log.warning(f"Could not pin benchmark cache: {e}")
app = _build_internal_app()
runner = web.AppRunner(app)
@@ -145,8 +129,6 @@ async def null_lifecycle():
site = web.TCPSite(runner, INTERNAL_HOST, INTERNAL_PORT)
await site.start()
heartbeat = asyncio.create_task(_perf_heartbeat(), name="null-perf-heartbeat")
lines = [
f"Null pyworker internal control server: http://{INTERNAL_HOST}:{INTERNAL_PORT}",
f" POST /release - end the active reservation (call from your queue consumer)",
@@ -157,60 +139,32 @@ async def null_lifecycle():
)
else:
lines.append(f"Framework healthcheck pointed at: {BACKEND_HEALTH_URL}")
lines.append(
"Reservations use the framework session model. Clients POST to "
"/session/create via the SDK to acquire a worker; max_sessions=1 "
"so each worker holds at most one reservation."
)
log.info("\n".join(lines))
try:
yield
finally:
heartbeat.cancel()
try:
await heartbeat
except (asyncio.CancelledError, Exception):
pass
await runner.cleanup()
async def reserve_worker(**params: object) -> dict:
global _active_reservation
async def ping(**params: object) -> dict:
"""Trivial handler. Exists to satisfy the framework's requirement that
at least one HandlerConfig has a BenchmarkConfig, and to give clients
a route they can hit with session_id to extend their session TTL.
"""
if params.get(BENCHMARK_SENTINEL):
# Fallback path only — the lifecycle pre-populates .has_benchmark
# with "150" so __run_benchmark normally short-circuits and never
# invokes us. If the cache write failed, sleep ~1s so the
# time-based calculation lands near 150 (workload=150 / time~=1s).
# Fallback only — the lifecycle pre-pins .has_benchmark so
# __run_benchmark normally short-circuits and this never runs. If
# the cache write failed, sleep ~1s so the time-based throughput
# math lands near TARGET_PERF.
await asyncio.sleep(1.0)
return {"ok": True, "benchmark": True}
requested = params.get("duration")
if requested is None:
duration = MAX_RESERVATION_SECONDS
else:
try:
duration = max(0.0, min(float(requested), MAX_RESERVATION_SECONDS))
except (TypeError, ValueError):
duration = MAX_RESERVATION_SECONDS
event = asyncio.Event()
_active_reservation = event
log.info(
f"Reservation acquired; awaiting POST /release on "
f"http://{INTERNAL_HOST}:{INTERNAL_PORT}/release "
f"(auto-release after {duration:.1f}s)"
)
try:
try:
await asyncio.wait_for(event.wait(), timeout=duration)
log.info("Reservation released via /release")
return {"released": "explicit", "duration_cap": duration}
except asyncio.TimeoutError:
log.warning(
f"Reservation hit duration cap of {duration:.1f}s without "
f"explicit /release; releasing automatically"
)
return {"released": "duration_elapsed", "duration": duration}
finally:
if _active_reservation is event:
_active_reservation = None
return {"ok": True}
worker_config = WorkerConfig(
@@ -218,17 +172,12 @@ worker_config = WorkerConfig(
model_server_port=HEALTH_PORT,
model_healthcheck_url=HEALTH_PATH,
lifecycle=null_lifecycle(),
max_sessions=1,
handlers=[
HandlerConfig(
route="/reserve",
allow_parallel_requests=False,
# Reject (429) any /reserve that arrives while the worker is
# already busy. A held reservation lasts up to MAX_RESERVATION_
# SECONDS, so queueing behind it would mean hours of wait —
# better to bounce the request immediately so serverless routes
# it to a free worker (or spins up a new one).
max_queue_time=0.0,
remote_function=reserve_worker,
route="/ping",
allow_parallel_requests=True,
remote_function=ping,
workload_calculator=lambda _payload: TARGET_PERF,
benchmark_config=BenchmarkConfig(
generator=lambda: {BENCHMARK_SENTINEL: True},