# HuggingFace TGI PyWorker

This is the base PyWorker for HuggingFace Text Generation Inference (TGI) servers. See the [Serverless documentation](https://docs.vast.ai/serverless) for guides and how-to's.

## Instance Setup

1. Pick a template

This worker is compatible with any TGI backend. You can use our template or create your own.

- [HuggingFace TGI](https://cloud.vast.ai/?ref_id=62897&creator_id=62897&name=TGI%20(Serverless))

The template can be configured via the template interface. You may want to change the model or startup arguments.

2. Follow the [getting started guide](https://docs.vast.ai/documentation/serverless/quickstart) for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.

## Client Setup (Demo)

1. Clone the PyWorker repository to your local machine and install the necessary requirements for running the test client.

```bash
git clone https://github.com/vast-ai/pyworker
cd pyworker
pip install uv
uv venv -p 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
```

## Using the Test Client

The test client demonstrates both streaming and non-streaming generation using TGI's native API.

First, set your API key as an environment variable:

```bash
export VAST_API_KEY=<your_api_key>
```

The `--endpoint` flag is optional. If not provided, it defaults to `my-tgi-endpoint`.

### Generate (Streaming)

Calls `/generate_stream` and prints the response as it streams:

```bash
python -m workers.tgi.client --generate-stream --endpoint <ENDPOINT_NAME>
```

### Generate (Non-Streaming)

Calls `/generate` and prints the JSON response:

```bash
python -m workers.tgi.client --generate --endpoint <ENDPOINT_NAME>
```

### Interactive Session (Streaming)

Starts an interactive session with streaming responses. Type `quit` to exit.

```bash
python -m workers.tgi.client --interactive --endpoint <ENDPOINT_NAME>
```

## API Endpoints

TGI provides two primary endpoints:

### Generate (Non-Streaming)

`/generate` - Returns the complete generated text in a single response.

```json
{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "return_full_text": false
  }
}
```
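
As a minimal sketch, the request body above can be built in Python as follows. The helper name `build_generate_payload` is hypothetical and is not part of the PyWorker; how the body is sent to your endpoint depends on your client, so this only shows how the JSON is constructed.

```python
import json


def build_generate_payload(prompt: str, max_new_tokens: int = 1024,
                           temperature: float = 0.7) -> dict:
    """Build a /generate request body (hypothetical helper, for illustration)."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "return_full_text": False,  # return only the newly generated text
        },
    }


payload = build_generate_payload("Your prompt here")
print(json.dumps(payload, indent=2))
```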

### Generate Stream (Streaming)

`/generate_stream` - Streams the response token by token.

```json
{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "do_sample": true,
    "return_full_text": false
  }
}
```
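
The sketch below shows one way to reassemble text from a streamed response. It assumes the stream arrives as Server-Sent Events, where each `data:` line carries a JSON object with a `token` field, which is how TGI formats `/generate_stream` output; whether your client receives the lines in exactly this form is an assumption. The function name `iter_stream_tokens` and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield token text from SSE lines of the form 'data:{...}'."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token") or {}
        if token.get("special"):
            continue  # skip special tokens such as end-of-sequence markers
        yield token.get("text", "")


# Hypothetical sample lines, shaped like a TGI stream.
sample = [
    'data:{"token":{"id":1,"text":"Hello","logprob":-0.1,"special":false},"generated_text":null}',
    "",
    'data:{"token":{"id":2,"text":" world","logprob":-0.2,"special":false},"generated_text":"Hello world"}',
]
print("".join(iter_stream_tokens(sample)))
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read as they arrive.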

## Performance Notes

The `max_new_tokens` parameter (not the prompt size) primarily impacts performance. For example, if an instance is benchmarked to generate 100 tokens per second, a request with `max_new_tokens = 200` will take approximately 2 seconds to complete, assuming the model generates all 200 tokens.
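
The arithmetic in this example can be written as a small helper. This is a rough estimate only: the function name `estimate_generation_seconds` is hypothetical, and it ignores prompt processing, queueing, and network time.

```python
def estimate_generation_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Upper-bound estimate: time to generate max_new_tokens at a benchmarked rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


# 200 tokens at 100 tokens per second
print(estimate_generation_seconds(200, 100.0))  # 2.0
```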