
# Null PyWorker

A PyWorker that does **nothing**: it does not forward requests to any model
server. Reservations are modelled as framework **sessions**: creating a
session reserves a worker, and releasing it lets the worker scale back down.

## When to use it

Use this worker when you want to drive Vast Serverless autoscaling but you do
**not** want inbound requests to reach a model on the instance. Typical setup:

- You already have a job queue on your own infrastructure (Redis, SQS, NATS,
  etc.).
- A separate worker process on the Vast instance pulls work from that queue
  directly. The Vast PyWorker is not involved in the request/response path.
  Your consumer can be written in any language (Node.js, Go, Python, a
  binary); this PyWorker is implementation-agnostic.
- You want one Vast worker per active queue consumer, and you want the
  Serverless autoscaler to spin instances up and down based on demand on
  *your* side.

## How it works

- Reservations use the framework's **session** model. The SDK exposes
  `endpoint.session(cost, lifetime)` which POSTs to `/session/create` (a
  built-in framework route) and returns a `Session` object usable as
  `async with`. Closing the context (or calling `await session.close()`)
  POSTs to `/session/end` — counted as a normal success in metrics.
- `max_sessions=1` on the worker side means a second `/session/create`
  against an already-occupied worker returns `429`. Serverless routes
  that request to a free worker or scales a new one up.
- Sessions are **excluded from queue-wait math** (the framework filters
  `if not request.is_session`), so an occupied worker doesn't look like
  it has a request queue piling up. The autoscaler treats a session as
  occupancy, not as work-in-progress.
- `lifecycle` is used instead of `model_log_file`, so there is no log to
  tail and no model server to start. The worker reports itself ready
  immediately after a trivial benchmark.
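
The `max_sessions=1` behaviour can be pictured with a toy model. The sketch below is not framework code: `ToyWorker` and `reserve` are hypothetical names, and the loop only imitates the "route to a free worker, otherwise scale up" decision described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyWorker:
    """Toy stand-in for one null worker; not the framework's implementation."""

    name: str
    max_sessions: int = 1
    sessions: set = field(default_factory=set)

    def create_session(self, session_id: str) -> int:
        # A full worker rejects the reservation, mirroring the 429 described above.
        if len(self.sessions) >= self.max_sessions:
            return 429
        self.sessions.add(session_id)
        return 200

    def end_session(self, session_id: str) -> int:
        # Ending a session frees the slot for the next reservation.
        self.sessions.discard(session_id)
        return 200


def reserve(workers: list, session_id: str) -> Optional[str]:
    """Try each worker in turn; return the name of the one that accepted."""
    for worker in workers:
        if worker.create_session(session_id) == 200:
            return worker.name
    return None  # every worker is occupied: a real endpoint would scale up here
```

With two toy workers, the first two reservations land on separate workers and the third finds no free slot until one session ends.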

## Healthchecking

The framework periodically GETs a healthcheck URL after startup; if it ever
fails after the first success, the worker is marked errored and the
autoscaler can decommission it. Two modes:

- **Stub (default)** — the internal control server also answers
  `GET /health` with `200`. Just enough to satisfy the framework while
  you wire up real consumers.
- **Point at your queue consumer (recommended)** — set
  `BACKEND_HEALTH_URL=http://127.0.0.1:9090/health` (absolute URL) and
  the pyworker will healthcheck *your* consumer instead. If the consumer
  process crashes, the autoscaler will see the worker as broken.
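
For the recommended mode, the consumer has to expose a health route of its own. A minimal sketch using only the Python standard library follows. It assumes a Python consumer; `HealthHandler` and `make_health_server` are hypothetical names, and port `9090` is taken from the example URL above. A consumer in another language would expose an equivalent route.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer 200 on /health only; anything else is a 404.
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep healthcheck polling out of the consumer's logs


def make_health_server(port: int = 9090) -> ThreadingHTTPServer:
    """Bind a loopback-only health server; the caller decides how to run it."""
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
```

Running `make_health_server().serve_forever()` in a daemon thread next to the queue loop means the route stops answering when the consumer process dies, which is the failure signal the healthcheck relies on.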

## API

### Reservation: `POST /session/create`  (external, signed)

Not implemented here — the framework provides this route automatically on
every PyWorker. Use the SDK:

```python
from vastai import Serverless

async with Serverless() as client:
    endpoint = await client.get_endpoint(name="my-null-endpoint")
    async with endpoint.session(cost=100, lifetime=600) as s:
        # Worker is now reserved. Your queue dispatcher does whatever it
        # needs to do (typically: enqueue a job that mentions s.session_id).
        ...
    # `async with` exit posts to /session/end → 200 success in metrics
```

Or raw HTTP (the SDK takes care of autoscaler signing for you, but the
shape of the request is documented for non-Python clients):

```
POST /session/create
{
  "auth_data": { /* signed by autoscaler */ },
  "payload": {
    "lifetime": 600,
    "on_close_route": "https://your.callback/notify",
    "on_close_payload": {"job_id": "..."}
  }
}
```

### Release from a local consumer: `POST /release`  (internal, localhost-only)

Closes the active session, regardless of who created it. No body, no
auth. Use this when the queue consumer doesn't have (and shouldn't need)
the session's `session_auth`:

```bash
curl -X POST http://127.0.0.1:18999/release
```

Responses:

- `200 {"released": true, "session_ids": ["..."]}` — closed; the held
  client-side `/session/create` completes and counts as a success.
- `200 {"released": false, "reason": "no active session"}` — nothing
  active, no-op.
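
From a Python consumer, the same call might look like the sketch below. `release_reservation` is a hypothetical helper; it assumes the response body is the JSON shown above and reads the port from `NULL_CONTROL_PORT` with the documented default of `18999`.

```python
import json
import os
import urllib.request


def release_reservation(port=None, timeout=5.0):
    """POST /release on the local control server and return the parsed JSON reply."""
    if port is None:
        # NULL_CONTROL_PORT and its 18999 default come from this README.
        port = int(os.environ.get("NULL_CONTROL_PORT", "18999"))
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/release", data=b"", method="POST"
    )
    # An empty ProxyHandler keeps loopback traffic away from any configured HTTP proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling it from a `finally` block after the job finishes releases the worker even when the job itself raised.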

For setups where the dispatcher can hand the consumer `session_auth`
(e.g. as part of the queue payload), the consumer can instead POST
`/session/end` on the framework's HTTP-only port
(`$WORKER_HTTP_PORT`, default `WORKER_PORT+1`) — the standard, fully
authenticated release path.

## Environment variables

- `BACKEND_HEALTH_URL` — absolute URL the framework should healthcheck
  (e.g. `http://127.0.0.1:9090/health`). When set, the stub `/health`
  route is not registered on the internal server.
- `NULL_CONTROL_PORT` — port for the internal control server (hosts
  `/release` and optionally `/health`). Defaults to `18999`.

## Deploying on Vast Serverless

1. Create a Serverless endpoint and point `PYWORKER_REPO` at this
   repository (or your fork).
2. Set `BACKEND=null` in the template so `start_server.sh` runs
   `workers.null.worker`.
3. There is no model server to configure; you can omit model-related env
   vars entirely.
4. Run your own queue-consumer process on the instance alongside the
   PyWorker. When it finishes its work:
   ```bash
   curl -X POST http://127.0.0.1:18999/release
   ```

### Endpoint scaling parameters

The null worker reports `max_perf = 100`, and a reservation nominally
occupies one full worker (`cost = 100`; the session-cost bullet below
explains why `200` is recommended). Set the endpoint accordingly:

- **`target_util = 1.0`** — required. The default of `0.9` reserves
  ~11% spare capacity, which for a unit-occupancy worker rounds up to a
  whole extra worker (e.g. `min_load = 100` becomes `100 / 0.9 = 111.1`
  → 2 active workers instead of 1). With `target_util = 1.0` the math
  is clean: `min_load = 100 * N` keeps exactly `N` workers active.
- **`min_load`** — set to `100 * N` for `N` always-on workers (with
  `target_util = 1.0`).
- **`max_workers`** — cap on total reservations the endpoint can ever
  serve concurrently.
- **Session `cost = 2 × max_perf`** (e.g. `200` when `max_perf = 100`) —
  recommended. Reporting `cost = max_perf` puts each occupied worker at
  exactly 100% utilization, which the autoscaler reads as "at target,
  no action needed." With two workers already occupied, for example, a
  third reservation gets a `429` from both and stalls in the autoscaler's
  global queue indefinitely instead of activating a cold worker.

  Bumping `cost` above `max_perf` makes each session look like more than
  one worker of work (`cur_load / max_perf > 1.0`), so the autoscaler
  keeps an extra active worker hot per session. This trades slight
  over-provisioning for predictable scale-up. The demo client defaults to
  `--session-cost 200`.
- **`max_queue_time = 0`** (or very small, e.g. `0.1`) — required.
  The per-worker `wait_time` property used internally to reject
  requests filters sessions out, but the **autoscaler** computes its
  own queue-time estimate from `cur_load / max_perf` — and `cur_load`
  *does* include sessions. With defaults around 30s, an occupied null
  worker (`cur_load = 100`, `max_perf = 100`, queue estimate = 1s)
  looks "available" and the autoscaler keeps routing extra reservations
  there, getting 429s and queueing them instead of scaling up. Setting
  `max_queue_time = 0` makes any in-flight load mark the worker "full"
  for routing.
- **`target_queue_time = 0`** — required. Aggressive scale-up trigger;
  combined with `max_queue_time = 0`, which keeps occupied workers off the
  routing table, this ensures the autoscaler provisions a new worker the
  moment all existing ones are occupied rather than queueing on its
  side. The queue-time math conceptually assumes work *completes in
  proportion to load*, which doesn't hold for sessions (they last
  hours, not `cur_load / max_perf` seconds). Zeroing both knobs tells
  the autoscaler "don't estimate when this worker will free up; route
  to a free one or make a new one."
- **`inactivity_timeout`** — works as expected: idle (no active
  sessions) for N seconds → permitted to scale down past `min_load`.
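
The arithmetic behind these settings can be checked with a short sketch. It assumes the autoscaler rounds the required worker count up, as the `111.1` → 2 example implies, and treats `cur_load / max_perf` as both the utilization ratio and the queue-time estimate in seconds. `active_workers` and `load_ratio` are hypothetical helper names, not autoscaler APIs.

```python
import math


def active_workers(min_load: float, target_util: float, max_perf: float = 100.0) -> int:
    """Workers needed to cover min_load at target_util (assumes rounding up)."""
    return math.ceil(min_load / target_util / max_perf)


def load_ratio(cur_load: float, max_perf: float = 100.0) -> float:
    """cur_load / max_perf, read as utilization or as a queue estimate in seconds."""
    return cur_load / max_perf


# target_util = 0.9 turns one worker's worth of min_load into two workers.
assert active_workers(100, 0.9) == 2
# target_util = 1.0 keeps exactly N workers for min_load = 100 * N.
assert active_workers(100, 1.0) == 1
assert active_workers(300, 1.0) == 3
# cost = max_perf sits exactly at target; cost = 2 * max_perf reads as overloaded.
assert load_ratio(100) == 1.0
assert load_ratio(200) == 2.0
```

The last two lines show why a session cost of `200` triggers scale-up while `100` does not: only a ratio above `1.0` looks like unmet demand.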

## Client example

Single reservation (holds for 180s):

```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME>
```

Staggered demo:

```bash
python -m workers.null.client --endpoint <ENDPOINT_NAME> --demo
```

Starts three sessions 30s apart (all held concurrently), holds the
3-worker plateau for 5 minutes so the autoscaler has time to actually
provision the third worker before any scale-down starts, then closes
the sessions one at a time, also 30s apart, and exits. Every session
ends cleanly via the SDK's `session.close()` — `200` successes in
metrics, no cancellations.

Tune the timing with `--interval` and `--plateau`. To exercise the
local-release path, shell into a worker and run
`curl -X POST http://127.0.0.1:18999/release`.
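
As a rough picture of the demo timeline, the sketch below computes when each session opens and closes. It assumes defaults of 30s for `--interval` and 300s for `--plateau` (inferred from the description above), that the plateau starts when the last session opens, and that sessions close in the order they opened. `demo_schedule` is a hypothetical helper, not part of the client.

```python
def demo_schedule(sessions: int = 3, interval: float = 30.0, plateau: float = 300.0):
    """Return (open_at, close_at) offsets in seconds for each session."""
    last_open = (sessions - 1) * interval   # when the final session starts
    first_close = last_open + plateau       # plateau is held with all sessions open
    return [(i * interval, first_close + i * interval) for i in range(sessions)]


for index, (open_at, close_at) in enumerate(demo_schedule(), start=1):
    print(f"session {index}: opens at {open_at:.0f}s, closes at {close_at:.0f}s")
# session 1: opens at 0s, closes at 360s
# session 2: opens at 30s, closes at 390s
# session 3: opens at 60s, closes at 420s
```

Under these assumptions the full demo takes about seven minutes, which is the window to watch in the endpoint's metrics.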

## Notes and caveats

- The reservation's lifetime caps how long the session can live without
  client activity. Set it comfortably longer than the work you expect to
  do, or have the client periodically POST `/ping` with `session_id` to
  extend it.
- The `on_close_payload` passed at `/session/create` is POSTed by the
  framework to `on_close_route` when the session ends. Useful for
  notifying your queue consumer that the reservation is closing.
- `/release` on the internal port is convenient but bypasses
  `session_auth`. If you need the standard authenticated release flow,
  pass `session_auth` to your consumer (e.g. through the queue payload)
  and have it POST to `/session/end` on the framework's HTTP port
  instead.
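
A receiver for the close notification could be sketched as follows. This assumes the framework sends the `on_close_payload` as a JSON request body, which this README does not confirm. The handler name, the in-memory `closed_jobs` list, and port `8081` are illustrative only.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

closed_jobs = []  # payloads received so far; a real consumer would act on them


class OnCloseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: the body is the JSON on_close_payload from /session/create.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length) if length else b"{}"
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            self.send_error(400, "body is not valid JSON")
            return
        closed_jobs.append(payload)
        self.send_response(204)  # acknowledged, nothing to return
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging


def make_on_close_server(host: str = "127.0.0.1", port: int = 8081) -> ThreadingHTTPServer:
    """Bind the receiver; the caller runs serve_forever() where it suits."""
    return ThreadingHTTPServer((host, port), OnCloseHandler)
```

The `on_close_route` given at `/session/create` would then point at wherever this server is reachable from the worker.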