Follow the getting started guide for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.

Client Setup (Demo)

Clone the PyWorker repository to your local machine and install the necessary requirements for running the test client.

git clone https://github.com/vast-ai/pyworker
cd pyworker
pip install uv
uv venv -p 3.12
source .venv/bin/activate
uv pip install -r requirements.txt

Using the Test Client

The test client demonstrates both streaming and non-streaming generation using TGI's native API.

First, set your API key as an environment variable:

export VAST_API_KEY=<your_api_key>
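If you write your own script instead of using the test client, it may help to fail early when the key is missing. The following is a minimal sketch, not the PyWorker client's actual code; the helper name load_api_key is hypothetical, and only the VAST_API_KEY variable name comes from this guide.

```python
import os


def load_api_key(env=os.environ):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; the test client may read the
    variable differently.
    """
    key = env.get("VAST_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "VAST_API_KEY is not set; run: export VAST_API_KEY=<your_api_key>"
        )
    return key
```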

The --endpoint flag is optional. If not provided, it defaults to my-tgi-endpoint.

Generate (Streaming)

Calls /generate_stream and prints the response as it streams:

python -m workers.tgi.client --generate-stream --endpoint <ENDPOINT_NAME>

Generate (Non-Streaming)

Calls /generate and prints the JSON response:

python -m workers.tgi.client --generate --endpoint <ENDPOINT_NAME>

Interactive Session (Streaming)

Starts an interactive session with streaming responses. Type quit to exit.

python -m workers.tgi.client --interactive --endpoint <ENDPOINT_NAME>

API Endpoints

TGI provides two primary endpoints:

Generate (Non-Streaming)

/generate - Returns the complete generated text in a single response.

{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "return_full_text": false
  }
}
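The payload above can be assembled in Python before it is sent. This is a sketch under stated assumptions: build_generate_payload is a hypothetical helper, not part of PyWorker or TGI, and it only mirrors the fields shown in this guide. How the request is routed to your endpoint is left out.

```python
import json


def build_generate_payload(prompt, max_new_tokens=1024, temperature=0.7):
    """Build a request body shaped like the /generate example in this guide.

    Hypothetical helper for illustration only.
    """
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            # Python's False is serialized as JSON false.
            "return_full_text": False,
        },
    }


# Serialize to the JSON text that would be sent as the request body.
body = json.dumps(build_generate_payload("Your prompt here"))
```

Keeping max_new_tokens as an explicit argument makes it easy to lower it for quick tests, since it is the parameter that most affects request duration.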

Generate Stream (Streaming)

/generate_stream - Streams the response token by token.

{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "do_sample": true,
    "return_full_text": false
  }
}
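A streamed response has to be reassembled on the client side. The sketch below assumes the stream arrives as server-sent events, with each event on a line starting with data: and carrying a JSON object that has a token field containing text and special keys. That event shape is an assumption and should be checked against your TGI version. iter_stream_tokens is a hypothetical helper, and the sample lines are made up for illustration.

```python
import json


def iter_stream_tokens(lines):
    """Yield token text from an iterable of server-sent-event lines.

    Assumes each event looks like: data:{"token": {"text": "...", ...}, ...}
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token") or {}
        if token.get("special"):
            continue  # skip special tokens such as end-of-sequence markers
        text = token.get("text")
        if text:
            yield text


# Illustrative sample lines, not captured from a real endpoint.
sample = [
    'data:{"token": {"id": 1, "text": "Hello", "special": false}, "generated_text": null}',
    "",
    b'data:{"token": {"id": 2, "text": " world", "special": false}, "generated_text": "Hello world"}',
]
print("".join(iter_stream_tokens(sample)))
```

Because the function accepts any iterable of str or bytes lines, it can be fed from an HTTP client's line iterator or from a saved log.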

Performance Notes

The max_new_tokens parameter, rather than the prompt size, has the largest impact on request duration. For example, if an instance is benchmarked at 100 tokens per second, a request with max_new_tokens = 200 takes approximately 2 seconds to complete (200 tokens / 100 tokens per second).
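The arithmetic in this example can be written as a small helper for rough capacity planning. This is only a sketch: estimate_generation_seconds is a hypothetical name, and the estimate assumes the request generates all max_new_tokens tokens at the benchmarked rate.

```python
def estimate_generation_seconds(max_new_tokens, tokens_per_second):
    """Estimate request duration as tokens requested / benchmarked throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


print(estimate_generation_seconds(200, 100))  # 2.0
```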