Rob Ballantyne 47ad0ebe0a Add --instance flag to null pyworker client

Lets the demo target run-alpha.vast.ai (or candidate/local) without
editing code. Defaults to prod; respects VAST_INSTANCE env var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 11:40:51 +01:00

__init__.py

Add null pyworker for queue-driven autoscaling

2026-05-11 16:48:52 +01:00

client.py

Add --instance flag to null pyworker client

2026-05-12 11:40:51 +01:00

README.md

Revert default session cost to 100; document the over-provision as a workaround

2026-05-12 11:34:52 +01:00

worker.py

Rewrite null pyworker on the framework session model

2026-05-12 10:51:24 +01:00

README.md

Null PyWorker

A PyWorker that does nothing: it does not forward requests to any model server. Reservations are modelled as framework sessions. Opening a session reserves a worker; releasing it lets the endpoint scale back down.

When to use it

Use this worker when you want to drive Vast Serverless autoscaling but you do not want inbound requests to reach a model on the instance. Typical setup:

You already have a job queue on your own infrastructure (Redis, SQS, NATS, etc.).
A separate worker process on the Vast instance pulls work from that queue directly. The Vast PyWorker is not involved in the request/response path. Your consumer can be written in any language (Node.js, Go, Python, a compiled binary); this PyWorker is implementation-agnostic.
You want one Vast worker per active queue consumer, and you want the Serverless autoscaler to spin instances up and down based on demand on your side.

How it works

Reservations use the framework's session model. The SDK exposes endpoint.session(cost, lifetime), which POSTs to /session/create (a built-in framework route) and returns a Session object usable as an async context manager (async with). Closing the context (or calling await session.close()) POSTs to /session/end, which is counted as a normal success in metrics.
max_sessions=1 on the worker side means a second /session/create against an already-occupied worker returns 429. Serverless routes that request to a free worker or scales a new one up.
Sessions are excluded from queue-wait math (the framework only counts requests where request.is_session is false), so an occupied worker doesn't look like it has a request queue piling up. The autoscaler treats a session as occupancy, not as work-in-progress.
lifecycle is used instead of model_log_file, so there is no log to tail and no model server to start. The worker reports itself ready immediately after a trivial benchmark.
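
The occupancy rule above can be illustrated with a toy model. This is only a sketch of the behaviour described in this README, not the framework's or the autoscaler's code: `ToyWorker` and `route` are hypothetical names, and real routing and scale-up happen inside Vast Serverless.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyWorker:
    """Hypothetical stand-in for one null worker running with max_sessions=1."""
    name: str
    session_id: Optional[str] = None

    def create_session(self, session_id: str) -> int:
        # An occupied worker refuses a second session with 429.
        if self.session_id is not None:
            return 429
        self.session_id = session_id
        return 200

    def end_session(self) -> int:
        # Releasing the session frees the worker for the next reservation.
        self.session_id = None
        return 200


def route(workers: List[ToyWorker], session_id: str) -> str:
    """Place the session on the first free worker, else add a new worker."""
    for worker in workers:
        if worker.create_session(session_id) == 200:
            return worker.name
    # Every worker answered 429: model scale-up by appending a fresh worker.
    new_worker = ToyWorker(name=f"worker-{len(workers) + 1}")
    workers.append(new_worker)
    new_worker.create_session(session_id)
    return new_worker.name
```

With a pool of one worker, the first reservation lands on it, a second one is refused with 429 and causes a second worker to be added, and ending the first session makes that worker available again.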

Healthchecking

The framework periodically GETs a healthcheck URL after startup; if it ever fails after the first success, the worker is marked errored and the autoscaler can decommission it. Two modes:

Stub (default) — the internal control server also answers GET /health with 200. Just enough to satisfy the framework while you wire up real consumers.
Point at your queue consumer (recommended) — set BACKEND_HEALTH_URL=http://127.0.0.1:9090/health (absolute URL) and the pyworker will healthcheck your consumer instead. If the consumer process crashes, the autoscaler will see the worker as broken.
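
If your consumer is written in Python, a health endpoint for the second mode could look roughly like the sketch below. It uses only the standard library; `HealthHandler` and `start_health_server` are hypothetical names, and the port 9090 simply mirrors the example URL above. A real consumer would probably report failure when its queue connection is down rather than always answering 200.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep periodic healthcheck polling out of the consumer's logs


def start_health_server(port: int = 9090) -> ThreadingHTTPServer:
    """Serve GET /health on 127.0.0.1 in a daemon thread; return the server."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call `start_health_server()` once during consumer startup. Because the thread is a daemon, the endpoint disappears when the consumer process dies, which is exactly the signal the autoscaler needs.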

API

Reservation: `POST /session/create` (external, signed)

Not implemented here — the framework provides this route automatically on every PyWorker. Use the SDK:

```python
import asyncio

from vastai import Serverless


async def main():
    async with Serverless() as client:
        endpoint = await client.get_endpoint(name="my-null-endpoint")
        async with endpoint.session(cost=100, lifetime=600) as s:
            # Worker is now reserved. Your queue dispatcher does whatever it
            # needs to do (typically: enqueue a job that mentions s.session_id).
            ...
        # `async with` exit posts to /session/end → 200 success in metrics


asyncio.run(main())
```

Or raw HTTP (the SDK takes care of autoscaler signing for you, but the shape of the request is documented for non-Python clients):

```
POST /session/create
{
  "auth_data": { /* signed by autoscaler */ },
  "payload": {
    "lifetime": 600,
    "on_close_route": "https://your.callback/notify",
    "on_close_payload": {"job_id": "..."}
  }
}
```

Release from a local consumer: `POST /release` (internal, localhost-only)

Closes the active session, regardless of who created it. No body, no auth. Use this when the queue consumer doesn't have (and shouldn't need) the session's session_auth:

```
curl -X POST http://127.0.0.1:18999/release
```

Responses:

200 {"released": true, "session_ids": ["..."]} — closed; the held client-side /session/create completes and counts as a success.
200 {"released": false, "reason": "no active session"} — nothing active, no-op.
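
A Python consumer can make the same call without shelling out to curl. The helper below is a minimal sketch based only on the two response shapes listed above; `release_reservation` is a hypothetical name, and the default port assumes NULL_CONTROL_PORT has not been changed.

```python
import http.client
import json


def release_reservation(port: int = 18999, timeout: float = 5.0) -> bool:
    """POST /release on the local control server.

    Returns True if an active session was closed, False if there was none.
    """
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("POST", "/release")  # no body, no auth
        resp = conn.getresponse()
        status = resp.status
        raw = resp.read()
    finally:
        conn.close()
    if status != 200:
        raise RuntimeError(f"unexpected status {status} from /release")
    data = json.loads(raw.decode("utf-8"))
    return bool(data.get("released"))
```

Treating the "no active session" reply as a normal `False` result rather than an error makes it safe to call the helper from cleanup code that may run more than once.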

For setups where the dispatcher can hand the consumer session_auth (e.g. as part of the queue payload), the consumer can instead POST /session/end on the framework's HTTP-only port ($WORKER_HTTP_PORT, default WORKER_PORT+1) — the standard, fully authenticated release path.

Environment variables

BACKEND_HEALTH_URL — absolute URL the framework should healthcheck (e.g. http://127.0.0.1:9090/health). When set, the stub /health route is not registered on the internal server.
NULL_CONTROL_PORT — port for the internal control server (hosts /release and optionally /health). Defaults to 18999.
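
As an illustration of how these two variables interact, the sketch below resolves them into a small config object. It is not the code in worker.py: `NullWorkerConfig` and `load_config` are hypothetical names, and treating an empty BACKEND_HEALTH_URL as unset and rejecting non-http(s) values are assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NullWorkerConfig:
    control_port: int
    backend_health_url: Optional[str]

    @property
    def registers_stub_health(self) -> bool:
        # The stub /health route only exists when no backend URL is given.
        return self.backend_health_url is None


def load_config(env: Mapping[str, str] = os.environ) -> NullWorkerConfig:
    """Read NULL_CONTROL_PORT and BACKEND_HEALTH_URL with the documented defaults."""
    url = env.get("BACKEND_HEALTH_URL") or None  # assumption: empty means unset
    if url is not None and not url.startswith(("http://", "https://")):
        raise ValueError("BACKEND_HEALTH_URL must be an absolute URL")
    return NullWorkerConfig(
        control_port=int(env.get("NULL_CONTROL_PORT", "18999")),
        backend_health_url=url,
    )
```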

Deploying on Vast Serverless

Create a Serverless endpoint and point PYWORKER_REPO at this repository (or your fork).
Set BACKEND=null in the template so start_server.sh runs workers.null.worker.
There is no model server to configure; you can omit model-related env vars entirely.
Run your own queue-consumer process on the instance alongside the PyWorker. When it finishes its work:
```
curl -X POST http://127.0.0.1:18999/release
```

Endpoint scaling parameters

The null worker reports max_perf = 100 and each reservation is a session of cost = 100. The intended model is one session = one worker, scaling elastically from zero up to as many concurrent sessions as you ask for.

target_util = 1.0 — required. The default of 0.9 reserves ~11% spare capacity, which for a unit-occupancy worker rounds up to a whole extra worker (e.g. min_load = 100 becomes 100 / 0.9 = 111.1 → 2 active workers instead of 1). With target_util = 1.0 the math is clean: min_load = 100 * N keeps exactly N workers active.
min_load = 0 — required for scale-to-zero. With min_load = 0 and a positive inactivity_timeout, the endpoint can scale down to zero active workers when no sessions exist.
max_workers — cap on total reservations the endpoint can ever serve concurrently.
inactivity_timeout — positive value enables scale-to-zero after the configured number of seconds of no active sessions. Use alongside cold_workers = 0 to also drop the inactive pool.
max_queue_time = 0 and target_queue_time = 0 — recommended. The autoscaler computes per-worker queue-time as cur_load / max_perf and sessions are in cur_load. With the defaults (~30s), an occupied null worker (cur_load = 100, max_perf = 100, implied queue = 1s) looks "available" for routing, so a third reservation gets repeatedly 429'd and never triggers scale-up. Zeroing both knobs tells the autoscaler "don't estimate when this worker will free up; route to a free one or make a new one."
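
The arithmetic behind these settings can be checked with a few lines of Python. This is a simplified model of the numbers quoted above, not the autoscaler's actual algorithm; `active_workers` and `implied_queue_seconds` are hypothetical helper names.

```python
import math


def active_workers(load: float, max_perf: float = 100.0, target_util: float = 1.0) -> int:
    """Workers needed so that `load` fits within target_util of total capacity."""
    if load <= 0:
        return 0
    # round() guards against float noise such as 3.0000000000000004.
    return math.ceil(round(load / target_util / max_perf, 9))


def implied_queue_seconds(cur_load: float, max_perf: float = 100.0) -> float:
    """Per-worker queue-time estimate, computed as cur_load / max_perf."""
    return cur_load / max_perf
```

Under this model, `active_workers(100, target_util=0.9)` gives 2 while `active_workers(100)` gives 1, which is the rounding effect that makes target_util = 1.0 necessary. `implied_queue_seconds(100)` gives 1.0, far below a 30s queue-time target, which is why an occupied worker still looks routable with the default knobs.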

Known autoscaler quirk

In current Vast Serverless, scale-up reliably fires for the 1→2 worker transition (the first 429 from an occupied worker activates a cold one), but the 2→3 transition often fails to fire — the third reservation 429s on both occupied workers and sits in the autoscaler's global queue indefinitely instead of activating a third cold worker. Scale-to-zero also has known issues.

Fixes are pending on the Vast side. Until they land, a temporary workaround is to over-provision by reporting cost > max_perf on session creation:

```
python -m workers.null.client --demo --session-cost 200
```

With cost = 200, max_perf = 100, each occupied worker reports cur_load / max_perf = 2.0 — clearly over capacity, so the autoscaler keeps one extra active worker warm per session. The next /session/create lands on the warm worker directly with no queue. This is a band-aid, not the design. The intended steady state is cost = 100 with predictable elastic scale-up.

Client example

Single reservation (holds for 180s):

```
python -m workers.null.client --endpoint <ENDPOINT_NAME>
```

Staggered demo:

```
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
```

Starts three sessions 30s apart (all held concurrently), then holds the 3-worker plateau for 5 minutes so the autoscaler has time to provision the third worker before any scale-down starts. It then closes the sessions one at a time, also 30s apart, and exits. Every session ends cleanly via the SDK's session.close(), so metrics show 200 successes and no cancellations.

Tune the timing with --interval and --plateau. To exercise the local-release path, shell into a worker and run curl -X POST http://127.0.0.1:18999/release.
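
To see how the interval and plateau values shape the demo, the timeline can be computed up front. The sketch below is not taken from client.py: `demo_schedule` is a hypothetical helper, and it assumes the plateau is measured from the last session opening and that sessions close in the order they were opened.

```python
from typing import List, Tuple


def demo_schedule(
    sessions: int = 3, interval: float = 30.0, plateau: float = 300.0
) -> List[Tuple[float, str, int]]:
    """Return (seconds_from_start, action, session_index) events in time order."""
    events: List[Tuple[float, str, int]] = []
    # Ramp up: open one session every `interval` seconds.
    for i in range(sessions):
        events.append((i * interval, "open", i))
    # Plateau starts once the last session is open (assumption).
    last_open = (sessions - 1) * interval
    # Ramp down: close one session every `interval` seconds (assumed order).
    for i in range(sessions):
        events.append((last_open + plateau + i * interval, "close", i))
    return events
```

With the values described above, sessions open at 0s, 30s and 60s and close at 360s, 390s and 420s, so a full demo run takes about seven minutes under these assumptions.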

Notes and caveats

The reservation's lifetime caps how long the session can live without client activity. Set it comfortably longer than the work you expect to do, or have the client periodically POST /ping with session_id to extend.
The on_close_payload (passed at /session/create) is POSTed by the framework to on_close_route when the session ends. Useful for notifying your queue consumer that the reservation is closing.
/release on the internal port is convenient but bypasses session_auth. If you need the standard authenticated release flow, pass session_auth to your consumer (e.g. through the queue payload) and have it POST to /session/end on the framework's HTTP port instead.
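
A receiver for the close notification might look like the sketch below. It assumes the framework sends on_close_payload as a JSON request body, which this README does not spell out, so verify the shape against real traffic before relying on it. `OnCloseHandler`, `start_on_close_receiver` and the in-memory `closed_jobs` list are hypothetical; the example binds to 127.0.0.1 only to keep the sketch local, whereas a real on_close_route must be reachable from the worker.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

closed_jobs = []  # payloads received so far (hypothetical in-memory store)


class OnCloseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length) if length else b""
        try:
            # Assumption: the body is the on_close_payload encoded as JSON.
            payload = json.loads(raw) if raw else {}
        except ValueError:
            self.send_error(400, "body is not valid JSON")
            return
        closed_jobs.append(payload)
        self.send_response(204)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence default request logging


def start_on_close_receiver(port: int = 0) -> ThreadingHTTPServer:
    """Accept close notifications in a daemon thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), OnCloseHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```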
