Commit Graph

215 Commits

Author SHA1 Message Date
Rob Ballantyne 1d2caaf554 Default null pyworker session cost to 2x max_perf
Reporting cost == max_perf puts an occupied worker at exactly 100%
utilization, which the autoscaler reads as "at target, no action."
The 3rd session_create then 429s on both active workers and stalls in
the global queue instead of triggering a cold-worker activation
(observed: 1→2 active scales fine, 2→3 does not).

Bumping cost to 2 * max_perf makes each session look like more than
one worker's work, so the autoscaler always keeps an extra active
worker hot. Slight over-provisioning, but the 3rd reservation lands
directly on a free worker rather than queueing.

Expose --session-cost on the client so the value can be swept without
edits. README documents the trade-off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:31:26 +01:00
Rob Ballantyne 01eff874d8 Correct queue-time guidance for null pyworker endpoints
Earlier note claimed max_queue_time / target_queue_time were no-ops
because the worker's internal wait_time property filters sessions out.
That filter only affects per-worker rejection on a given handler — the
autoscaler doesn't see the property and computes its own queue-time
estimate from cur_load / max_perf, which *does* include sessions.

With defaults around 30s, an occupied null worker (cur_load=100,
max_perf=100, implied queue=1s) still looks "available" to the
autoscaler, so a third reservation gets queued on an existing worker
via repeated 429-retries instead of triggering scale-up.

Fix: set max_queue_time = 0 and target_queue_time = 0 on the endpoint.
Any in-flight load marks the worker "full" for routing, and any
observed queue time triggers immediate scale-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:14:20 +01:00
Rob Ballantyne d51f04a176 Await endpoint.session() in null pyworker client
endpoint.session() forwards to start_endpoint_session, which is async
def — so the call returns a coroutine, not a Session, despite the
SDK's return-type annotation. Use 'async with await endpoint.session(...)'
so the coroutine resolves to a Session before entering the context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:07:32 +01:00
Rob Ballantyne ef248ef695 Document endpoint scaling parameters for null pyworker
Add a scaling-parameters section to the README covering target_util=1.0
(the critical one — the default 0.9 silently rounds up to one extra
worker), min_load math, and why max_queue_time / target_queue_time
don't matter here (sessions are filtered from wait_time so both signals
stay at zero).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:06:04 +01:00
Rob Ballantyne 6a562a1376 Rewrite null pyworker on the framework session model
Drop the held-/reserve approach in favour of the framework's session
primitive (max_sessions=1 + /session/create). Sessions are excluded from
the autoscaler's queue-wait math and don't suffer the cur_perf=0
degradation that a long-held request did, so this naturally produces the
"one request comes in and you get a worker; release and it scales back
down" model we were hand-rolling.

Server side:
  - max_sessions=1; framework auto-registers /session/* routes
  - Drop custom /reserve handler, _active_reservation event, max_queue_
    time=0.0, MAX_RESERVATION_SECONDS, _perf_heartbeat
  - Trivial /ping handler exists only to satisfy the framework's
    "at least one handler with BenchmarkConfig" requirement (and to give
    clients an extension/keepalive route)
  - /release on the internal control port is kept as a convenience for
    queue consumers that don't carry session_auth — calls the framework's
    __close_session via name-mangling, which bypasses the session_auth
    check but is fine for a localhost-only endpoint
  - Workload/perf back to 100 (conventional)

Client side:
  - Uses endpoint.session(cost, lifetime) instead of POST /reserve
  - async with the SDK Session; close on exit posts /session/end with
    proper auth → 200 success in metrics
  - Demo and single modes both ride the same reserve() helper

Sessions landed in vastai-sdk 0.4.2 (commit ec9ef59, 2026-01-20).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:51:24 +01:00
Rob Ballantyne 6c2f194b28 Add perf heartbeat to keep null pyworker reporting peak throughput
While a /reserve is held, no requests complete so workload_served stays
at 0 each metrics tick. The autoscaler sees cur_perf=0 against
max_perf=150, concludes the worker can't deliver claimed throughput,
downgrades it, and gets cautious about scaling up — so additional
/reserve requests pile up behind the held one instead of triggering a
new worker.

Add a 1Hz heartbeat coroutine that, while anything is in flight, sets
workload_served back to TARGET_PERF (150) and flags update_pending. The
metrics tick reads 150 and resets to 0; the heartbeat re-pins it before
the next tick. Net effect: the autoscaler sees a saturated worker
delivering at peak rate, which is the signal it needs to scale a new
worker up rather than queue.

The heartbeat needs the backend instance, which is only created inside
Worker(...) — stash a reference in a module-level dict between Worker()
and .run() so the lifecycle coroutine can reach it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:35:18 +01:00
Rob Ballantyne 2aada7b210 Add --plateau to null pyworker demo (default 5min)
Previously the first release fired only 30s after the third reservation
started, so the autoscaler often hadn't even finished provisioning the
third worker yet. Default plateau to 300s so all three workers are
visibly running before scale-down begins; configurable via --plateau.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:26:31 +01:00
Rob Ballantyne 8df562e243 Standardize null pyworker load/perf on 150
Bump workload_calculator, benchmark cache value, and client cost from 100
to 150.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:17:57 +01:00
Rob Ballantyne 4eef5e22af Pin null pyworker max_throughput to exactly 100
asyncio.sleep(1.0) takes slightly more than 1s due to event loop
scheduling, so workload/time landed at ~99.x instead of 100. Pre-populate
the framework's .has_benchmark cache file with "100" before the benchmark
runs — __run_benchmark short-circuits to the cached value and skips the
time-based calculation entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:13:16 +01:00
Rob Ballantyne 9d969e376e Standardize null pyworker load/perf on 100
Using 1 confused the serverless capacity math. Set workload_calculator,
benchmark target throughput, and client cost all to 100 — the conventional
default the rest of the system expects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:09:16 +01:00
Rob Ballantyne ef3f34a515 Restructure null pyworker --demo as a clean trapezoid
Three reservations 30s apart, each with a 90s duration. They end one at
a time, also 30s apart, then the client exits. Each reservation ends
via its duration cap (200 success) rather than the previous "cancel one,
leave two open" pattern that left two 499s pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:00:46 +01:00
Rob Ballantyne 147bf2597a Set null pyworker client cost to 1
Match the server-side workload_calculator (1.0) so the autoscaler routing
hint is consistent with what the worker reports. A null reservation is a
unitless slot — no reason for client cost to be 100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:47:19 +01:00
Rob Ballantyne dc423e2999 Pin null pyworker benchmark to ~1.0 throughput
The startup benchmark previously returned instantly, producing
max_throughput around 339895. A null worker has no real throughput
concept (each reservation is a unitless slot), so sleep 1s during the
benchmark with workload=1 to record max_throughput ~= 1.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:22:45 +01:00
Rob Ballantyne 463f3de8ea Add staggered --demo mode to null pyworker client
Three concurrent /reserve calls 30s apart, then cancel the first to show
the early-release path. The remaining two run until their duration cap.
Useful for watching scale-up/scale-down behaviour in the autoscaler
dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:08:44 +01:00
Rob Ballantyne ed0db198c3 Reject queued /reserve immediately on busy null workers
A held reservation runs for up to MAX_RESERVATION_SECONDS (default 1h), so
queueing a second /reserve behind it makes no sense — the wait would dwarf
any sane timeout. Set max_queue_time=0.0 so the framework rejects 429 as
soon as another reservation is in flight, and serverless routes the request
to a free worker or scales a new one up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:05:02 +01:00
Rob Ballantyne 3668d948be Simplify null pyworker README intro to serverless terminology
Drop the "autoscaler provisions a worker if none is free" phrasing in
favor of the simpler "request comes in and you get a worker; release and
it scales back down."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:02:41 +01:00
Rob Ballantyne 254ccdf181 Add /release control endpoint to null pyworker
The held /reserve now waits on an asyncio.Event and resolves when the local
queue consumer POSTs /release on the internal control port (127.0.0.1:18999
by default). This produces a 200 success in metrics instead of the 499
cancellation you got from disconnecting the client. The duration cap stays
as a safety net for stuck consumers.

The internal aiohttp server is now unconditional and hosts /release always;
the stub /health route is added only when BACKEND_HEALTH_URL is unset.
NULL_STUB_HEALTH_PORT is renamed to NULL_CONTROL_PORT to reflect the
broader role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:59:46 +01:00
Rob Ballantyne 89761b378a Wire null pyworker healthcheck to a stub (and optional user URL)
Adds an in-process aiohttp stub on 127.0.0.1:18999/health so the framework's
periodic healthcheck has something live to talk to. Operators can override
with BACKEND_HEALTH_URL to point at their queue consumer's /health
endpoint, so the autoscaler marks the worker errored if the consumer dies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:53:26 +01:00
Rob Ballantyne 18974873e5 Add null pyworker for queue-driven autoscaling
A PyWorker that does not forward to any model server. POST /reserve holds
the worker busy until the client disconnects (or the duration cap elapses),
so users with their own job queue can drive Vast autoscaling without
exposing inbound model traffic on the instance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:48:52 +01:00
Lucas Armand 9bc9ba11c5 Increase TGI benchmark tokens to 500 2026-04-30 14:04:39 -07:00
LucasArmandVast 48fdc65e3d Update to vastai package (#84) 2026-04-14 10:41:31 -07:00
LucasArmandVast 2cd97315cd Add nltk requirement for openai worker (#83)
* Add nltk requirement for openai worker

* pin version
2026-04-13 11:30:06 -07:00
Lucas Armand 83c31e25a9 Add force update detection 2026-03-31 13:46:22 -07:00
Lucas Armand fbe1dca6fa more env_path fixes 2026-03-30 16:28:51 -07:00
Lucas Armand 4c3120dbc5 allow override env_path 2026-03-30 16:25:01 -07:00
Lucas Armand d7d9b915f6 allow break system packages 2026-03-30 16:09:17 -07:00
Lucas Armand 4660b337fb Check for USE_SYSTEM_PYTHON 2026-03-30 14:46:38 -07:00
edgaratvast 7506ecb6b5 directly invoke one stop shop setup executable exported by vastai pip package for deployments (#82) 2026-03-26 10:59:49 -07:00
LucasArmandVast 50633c5003 Update deployments script with retries. (#81) 2026-03-23 14:58:32 -07:00
LucasArmandVast 2e8f18276f Add beta deployments script (#80) 2026-03-23 14:14:06 -07:00
Scott Darden eba9c480eb Merge pull request #79 from vast-ai/update-requirements
Updated requirements to only require vastai-sdk
2026-01-14 12:07:33 -08:00
Lucas Armand aaca1c9645 Updated requirements to only require vastai-sdk 2026-01-14 10:47:07 -08:00
LucasArmandVast f319db6bd5 flag for model log rotate (#78) 2026-01-12 17:03:18 -08:00
LucasArmandVast 4d786b4d17 SDK Versioning Improvements (#77)
* Add SDK_BRANCH
2026-01-02 10:23:07 -08:00
LucasArmandVast bd3e0032a1 Add SDK version checking (#76) 2025-12-17 21:01:52 -08:00
Lucas Armand e02f4bc943 Lowered concurrency of vLLM and TGI benchmarks 2025-12-17 11:55:33 -08:00
Lucas Armand bcb04b9a32 add missing comma 2025-12-17 11:40:40 -08:00
Lucas Armand 9daf171487 Increase queue limits for vLLM and TGI 2025-12-17 11:38:55 -08:00
LucasArmandVast 29f836eb1a Backwards compatible vLLM payload (#75)
* Support old vLLM payloads
2025-12-15 19:58:02 -08:00
LucasArmandVast 4380d98c01 Use PyWorker SDK (#67)
* Change PyWorker to Worker SDK
* Moved /lib to vast-sdk (https://github.com/vast-ai/vast-sdk)
2025-12-15 19:33:03 -08:00
Abiola Akinnubi 2ce741a8b7 Merge pull request #74 from vast-ai/AUTO-912
Mark pyworkers as "Error" if startup script fails. to avoid silent fail that waits for autoscaler.
2025-12-11 17:05:13 -08:00
Abiola Akinnubi 4ecc07032f Mark pyworkers as "Error" if startup script fails. to avoid silent fail that waits for autoscaler. 2025-12-11 12:51:56 -08:00
edgaratvast df61e6e946 correct version pin for aiohttp (#73)
Co-authored-by: Edgar Lin <edgarlin2000@gmail.com>
2025-12-10 19:34:52 -08:00
LucasArmandVast 70f8a8f534 Merge pull request #72 from vast-ai/hotfix-pin-pycares
Hotfix: pin pycares
2025-12-10 20:41:44 -05:00
Lucas Armand 7be8aa6397 pin pycares 2025-12-10 17:38:03 -08:00
Colter-Downing 138fc3ac47 Merge pull request #71 from vast-ai/AUTO-comfyui-updates
Auto comfyui updates
2025-12-04 10:55:12 -08:00
Colter Downing 222ac2a0dd default endpoint name 2025-12-04 10:54:55 -08:00
Colter Downing 40aed9b5f8 adding s3 as an option 2025-12-04 10:52:57 -08:00
Colter Downing d4d36bf86e done with comfy updates 2025-12-03 20:45:55 -08:00
Colter Downing e839cfc6e8 include view in API wrapper 2025-12-03 20:22:45 -08:00