Revert default session cost to 100; document the over-provision as a workaround
cost = max_perf = 100 is the intended steady-state semantics: one session = one worker, scaling elastically from zero. Reverting the default so the design reads correctly even where current autoscaler bugs make it misbehave (2→3 scale-up not firing reliably, scale-to-zero issues; fixes pending on the Vast side).

README now describes the intended model first (clean unit occupancy, scale-to-zero via inactivity_timeout + min_load=0), then flags the known autoscaler quirk and presents --session-cost 200 as a temporary band-aid until the Vast fixes land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+41
-35
@@ -133,50 +133,56 @@ authenticated release path.
 ### Endpoint scaling parameters
 
 The null worker reports `max_perf = 100` and each reservation is a
-session of `cost = 100`. Set the endpoint accordingly:
+session of `cost = 100`. The intended model is **one session = one
+worker**, scaling elastically from zero up to as many concurrent
+sessions as you ask for.
 
 - **`target_util = 1.0`** — required. The default of `0.9` reserves
 ~11% spare capacity, which for a unit-occupancy worker rounds up to a
 whole extra worker (e.g. `min_load = 100` becomes `100 / 0.9 = 111.1`
 → 2 active workers instead of 1). With `target_util = 1.0` the math
 is clean: `min_load = 100 * N` keeps exactly `N` workers active.
-- **`min_load`** — set to `100 * N` for `N` always-on workers (with
-`target_util = 1.0`).
+- **`min_load = 0`** — required for scale-to-zero. With `min_load = 0`
+and a positive `inactivity_timeout`, the endpoint can scale down to
+zero active workers when no sessions exist.
 - **`max_workers`** — cap on total reservations the endpoint can ever
 serve concurrently.
-- **Session `cost = 2 × max_perf`** (e.g. `200` when `max_perf = 100`) —
-recommended. Reporting `cost = max_perf` puts each occupied worker at
-exactly 100% utilization, which the autoscaler reads as "at target,
-no action needed." The third reservation then gets 429'd by both
-occupied workers and stalls in the autoscaler's global queue
-indefinitely instead of activating a cold worker.
+- **`inactivity_timeout`** — positive value enables scale-to-zero
+after the configured number of seconds of no active sessions. Use
+alongside `cold_workers = 0` to also drop the inactive pool.
+- **`max_queue_time = 0`** and **`target_queue_time = 0`** —
+recommended. The autoscaler computes per-worker queue-time as
+`cur_load / max_perf` and sessions *are* in `cur_load`. With the
+defaults (~30s), an occupied null worker (`cur_load = 100`,
+`max_perf = 100`, implied queue = 1s) looks "available" for routing,
+so a third reservation gets repeatedly 429'd and never triggers
+scale-up. Zeroing both knobs tells the autoscaler "don't estimate
+when this worker will free up; route to a free one or make a new
+one."
 
-Bumping `cost` above `max_perf` makes each session look like more than
-one worker of work (`cur_load / max_perf > 1.0`), so the autoscaler
-keeps an extra active worker hot per session. Slight over-provisioning
-in exchange for predictable scale-up. The demo client defaults to
-`--session-cost 200`.
-- **`max_queue_time = 0`** (or very small, e.g. `0.1`) — required.
-The per-worker `wait_time` property used internally to reject
-requests filters sessions out, but the **autoscaler** computes its
-own queue-time estimate from `cur_load / max_perf` — and `cur_load`
-*does* include sessions. With defaults around 30s, an occupied null
-worker (`cur_load = 100`, `max_perf = 100`, queue estimate = 1s)
-looks "available" and the autoscaler keeps routing extra reservations
-there, getting 429s and queueing them instead of scaling up. Setting
-`max_queue_time = 0` makes any in-flight load mark the worker "full"
-for routing.
-- **`target_queue_time = 0`** — required. Aggressive scale-up trigger;
-with `max_queue_time = 0` to keep occupied workers off the routing
-table, this ensures the autoscaler provisions a new worker the
-moment all existing ones are occupied rather than queueing on its
-side. The queue-time math conceptually assumes work *completes in
-proportion to load*, which doesn't hold for sessions (they last
-hours, not `cur_load / max_perf` seconds). Zeroing both knobs tells
-the autoscaler "don't estimate when this worker will free up; route
-to a free one or make a new one."
-- **`inactivity_timeout`** — works as expected: idle (no active
-sessions) for N seconds → permitted to scale down past `min_load`.
+#### Known autoscaler quirk
+
+In current Vast Serverless, scale-up reliably fires for the 1→2
+worker transition (the first 429 from an occupied worker activates a
+cold one), but **the 2→3 transition often fails to fire** — the
+third reservation 429s on both occupied workers and sits in the
+autoscaler's global queue indefinitely instead of activating a third
+cold worker. Scale-to-zero also has known issues.
+
+Fixes are pending on the Vast side. Until they land, a temporary
+workaround is to over-provision by reporting `cost > max_perf` on
+session creation:
+
+```bash
+python -m workers.null.client --demo --session-cost 200
+```
+
+With `cost = 200, max_perf = 100`, each occupied worker reports
+`cur_load / max_perf = 2.0` — clearly over capacity, so the autoscaler
+keeps one extra active worker warm per session. The next
+`/session/create` lands on the warm worker directly with no queue.
+**This is a band-aid, not the design.** The intended steady state
+is `cost = 100` with predictable elastic scale-up.
 
 ## Client example
 
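The arithmetic in the README hunk above can be sketched in a few lines of Python. This is only a toy model of the numbers the README quotes, not the Vast autoscaler's actual logic: the function names are hypothetical, and the `<=` comparison against `max_queue_time` is an assumption made for illustration.

```python
import math

MAX_PERF = 100  # value the null worker reports, per the README


def workers_for_min_load(min_load: float, target_util: float,
                         max_perf: float = MAX_PERF) -> int:
    """Active workers implied by min_load at a given target utilization.

    Models the README example: 100 / 0.9 = 111.1 units of capacity,
    which rounds up to 2 workers of max_perf = 100 each.
    """
    if not 0 < target_util <= 1:
        raise ValueError("target_util must be in (0, 1]")
    return math.ceil(min_load / target_util / max_perf)


def implied_queue_seconds(cur_load: float, max_perf: float = MAX_PERF) -> float:
    """Queue-time estimate the README attributes to the autoscaler."""
    return cur_load / max_perf


def looks_available(cur_load: float, max_queue_time: float,
                    max_perf: float = MAX_PERF) -> bool:
    """Assumed routing check: a worker is routable while its estimate
    does not exceed max_queue_time (the exact comparison is a guess)."""
    return implied_queue_seconds(cur_load, max_perf) <= max_queue_time


print(workers_for_min_load(100, 0.9))   # default target_util
print(workers_for_min_load(100, 1.0))   # recommended target_util
print(looks_available(100, 30))         # occupied worker, ~30s default
print(looks_available(100, 0))          # occupied worker, zeroed knob
```

Under these assumptions an occupied worker with `cur_load = 100` still passes the routing check at a 30-second `max_queue_time` but fails it at `0`, which is the behavior the README relies on when it recommends zeroing both queue-time knobs.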
@@ -15,12 +15,12 @@ logging.basicConfig(
 log = logging.getLogger(__file__)
 
 ENDPOINT_NAME = "null-prod"
-# Default cost passed to /session/create. Bumping this above the worker's
-# max_perf (100) is how you tell the autoscaler "each session is more than
-# one worker of work" — keeps an extra active worker warm and ready, so
-# the next /session/create lands on a free worker instead of queueing.
-# See README "Endpoint scaling parameters" for the math.
-DEFAULT_SESSION_COST = 200
+# Default cost passed to /session/create. 100 matches the worker's
+# max_perf for clean unit-occupancy semantics: one session = one worker.
+# If you hit autoscaler scale-up issues (queueing past the 2nd active
+# worker), --session-cost 200 is a temporary over-provisioning workaround
+# until the known autoscaler fixes land.
+DEFAULT_SESSION_COST = 100
 
 
 async def reserve(
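The diff does not show how the client exposes `DEFAULT_SESSION_COST` on the command line. A minimal sketch of one way the `--session-cost` and `--demo` flags could be wired with `argparse` follows; `build_parser` and `is_over_provisioned` are hypothetical names, and the real client may be structured differently.

```python
import argparse

MAX_PERF = 100              # what the null worker reports, per the README
DEFAULT_SESSION_COST = 100  # reverted default: one session = one worker


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI wiring; only the flag names come from the README."""
    parser = argparse.ArgumentParser(prog="workers.null.client")
    parser.add_argument("--demo", action="store_true",
                        help="run the demo flow")
    parser.add_argument("--session-cost", type=int,
                        default=DEFAULT_SESSION_COST,
                        help="cost passed to /session/create")
    return parser


def is_over_provisioned(cost: int, max_perf: int = MAX_PERF) -> bool:
    """True when a session reports more load than one worker supplies,
    i.e. the temporary workaround is in effect."""
    return cost > max_perf


if __name__ == "__main__":
    args = build_parser().parse_args()
    if is_over_provisioned(args.session_cost):
        print("note: cost > max_perf is the temporary autoscaler workaround")
```

With this wiring, running without flags uses the intended `cost = 100`, and `--session-cost 200` opts in to the workaround explicitly rather than making it the default.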