Default null pyworker session cost to 2x max_perf

Reporting cost == max_perf puts an occupied worker at exactly 100%
utilization, which the autoscaler reads as "at target, no action."
The 3rd session_create then 429s on both active workers and stalls in
the global queue instead of triggering a cold-worker activation
(observed: 1→2 active scales fine, 2→3 does not).

Bumping cost to 2 * max_perf makes each session look like more than
one worker's work, so the autoscaler always keeps an extra active
worker hot. Slight over-provisioning, but the 3rd reservation lands
directly on a free worker rather than queueing.
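The threshold arithmetic above, as a toy model. This is a sketch, not the autoscaler's code. It assumes an occupied worker reports utilization `cost / max_perf` and that only utilization strictly above `target_util = 1.0` counts as a reason to scale. The helper names `occupied_worker_util` and `autoscaler_reading` are hypothetical.

```python
# Toy model of the threshold described above; NOT the real autoscaler.
# Assumptions: an occupied worker reports utilization = session_cost / max_perf,
# and only utilization strictly above target_util triggers scale-up.


def occupied_worker_util(session_cost: float, max_perf: float) -> float:
    """Utilization reported by a worker holding one session (hypothetical helper)."""
    return session_cost / max_perf


def autoscaler_reading(util: float, target_util: float = 1.0) -> str:
    """Classify a utilization reading against the target (hypothetical helper)."""
    if util > target_util:
        return "above target: scale up"
    if util == target_util:
        return "at target: no action"
    return "below target: no action"


if __name__ == "__main__":
    max_perf = 100.0
    # Compare cost == max_perf against the new 2x default.
    for cost in (max_perf, 2 * max_perf):
        util = occupied_worker_util(cost, max_perf)
        print(f"cost={cost:g} util={util:g} -> {autoscaler_reading(util)}")
```

With `max_perf = 100`, a cost of 100 lands exactly on the target, while 200 reads as twice the target, which is the margin the model treats as a scale-up signal.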

Expose --session-cost on the client so the value can be swept without
code edits. README documents the trade-off.
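One way a client could expose such a flag, sketched with the standard `argparse` module. The default of `200` matches the README's stated default for `max_perf = 100`. The `build_parser` helper, the parser description, and the choice of `float` for the value are assumptions, not the demo client's actual code.

```python
import argparse

# Assumed default: 2 x max_perf with max_perf = 100.
DEFAULT_SESSION_COST = 200.0


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the per-session cost as a CLI flag."""
    parser = argparse.ArgumentParser(description="Demo session client (sketch).")
    parser.add_argument(
        "--session-cost",
        type=float,
        default=DEFAULT_SESSION_COST,
        help="Cost reported per session; 2 x max_perf keeps a spare worker hot.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse maps --session-cost to the attribute session_cost.
    print(f"session cost: {args.session_cost:g}")
```

Sweeping then only means passing different values, e.g. `--session-cost 150`, on successive runs.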

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rob Ballantyne
2026-05-12 11:31:26 +01:00
parent 01eff874d8
commit 1d2caaf554
2 changed files with 39 additions and 3 deletions
@@ -144,6 +144,18 @@ session of `cost = 100`. Set the endpoint accordingly:
`target_util = 1.0`).
- **`max_workers`** — cap on total reservations the endpoint can ever
serve concurrently.
- **Session `cost = 2 × max_perf`** (e.g. `200` when `max_perf = 100`) —
recommended. Reporting `cost = max_perf` puts each occupied worker at
exactly 100% utilization, which the autoscaler reads as "at target,
no action needed." The third reservation then gets 429'd by both
occupied workers and stalls in the autoscaler's global queue
indefinitely instead of activating a cold worker.
Bumping `cost` above `max_perf` makes each session look like more than
one worker of work (`cur_load / max_perf > 1.0`), so the autoscaler
keeps an extra active worker hot per session. Slight over-provisioning
in exchange for predictable scale-up. The demo client defaults to
`--session-cost 200`.
- **`max_queue_time = 0`** (or very small, e.g. `0.1`) — required.
The per-worker `wait_time` property used internally to reject
requests filters sessions out, but the **autoscaler** computes its