Reject queued /reserve immediately on busy null workers

A held reservation runs for up to MAX_RESERVATION_SECONDS (default 1h), so
queueing a second /reserve behind it makes no sense — the wait would dwarf
any sane timeout. Set max_queue_time=0.0 so the framework rejects with 429 as
soon as another reservation is in flight, and serverless routes the request
to a free worker or scales a new one up.
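
The reasoning above can be sketched as a toy model. This is an illustration
only, not the framework's code: `reserve_outcome` and its parameters are
hypothetical names, and it assumes a queued request is rejected once its wait
reaches `max_queue_time`.

```python
from typing import Tuple


def reserve_outcome(reservation_remaining: float,
                    max_queue_time: float) -> Tuple[int, float]:
    """Toy model (hypothetical helper, not the framework API).

    Returns (HTTP status, seconds the new /reserve spent waiting) for a
    request that lands on a single worker with parallel requests disabled.
    reservation_remaining is 0.0 when the worker is idle.
    """
    if reservation_remaining <= max_queue_time:
        # The worker is idle, or frees up within the queue budget.
        return 200, reservation_remaining
    # The held reservation outlasts the budget: the request is rejected
    # once the budget is used up.
    return 429, max_queue_time


# A worker holding a fresh 1h reservation, old 30s queue budget:
# the caller waits the full 30s only to be rejected.
print(reserve_outcome(3600.0, 30.0))
# Same worker with max_queue_time=0.0: rejected with no wait.
print(reserve_outcome(3600.0, 0.0))
# Idle worker: accepted right away.
print(reserve_outcome(0.0, 0.0))
```

Under these assumptions the old setting only delays the same `429`, which is
why a zero queue budget is the better fit for long-held reservations.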

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rob Ballantyne
2026-05-11 17:05:02 +01:00
parent 3668d948be
commit ed0db198c3
2 changed files with 12 additions and 7 deletions
+6 -6
@@ -28,10 +28,10 @@ held `/reserve` returns `200`.
## How it works
-- `allow_parallel_requests=False`, so one in-flight `/reserve` fully occupies
-  the worker. Any second request that lands on the same worker queues (or is
-  rejected with `429` after `max_queue_time`), pushing the autoscaler to
-  provision more workers.
+- `allow_parallel_requests=False` and `max_queue_time=0.0`, so one in-flight
+  `/reserve` fully occupies the worker and any further request that lands
+  on it is rejected with `429` immediately — serverless will route to a
+  free worker or scale a new one up.
- `lifecycle` is used instead of `model_log_file`, so there is no log to tail
and no model server to start. The worker reports itself ready immediately
after the (trivial) benchmark.
@@ -85,8 +85,8 @@ Behavior:
the duration cap fires (safety net for a stuck consumer).
- Returns `499` if the external client disconnects (counted as cancelled in
metrics — avoid this; use `/release` instead).
-- Returns `429` if the worker is already busy and queue wait would exceed
-  `max_queue_time` (30s by default).
+- Returns `429` immediately if the worker is already holding a reservation
+  (so serverless routes the request to a free worker instead of queueing).
### `POST /release` (internal port, localhost-only)