HuggingFace TGI PyWorker

This is the base PyWorker for HuggingFace Text Generation Inference (TGI) servers. See the Serverless documentation for guides and how-tos.

Instance Setup

  1. Pick a template

This worker is compatible with any TGI backend. We have a template you can use or you can create your own.

The template can be configured via the template interface. You may want to change the model or startup arguments.

  2. Follow the getting started guide for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.

Client Setup (Demo)

  1. Clone the PyWorker repository to your local machine and install the necessary requirements for running the test client.
git clone https://github.com/vast-ai/pyworker
cd pyworker
pip install uv
uv venv -p 3.12
source .venv/bin/activate
uv pip install -r requirements.txt

Using the Test Client

The test client demonstrates both streaming and non-streaming generation using TGI's native API.

First, set your API key as an environment variable:

export VAST_API_KEY=<your_api_key>

The --endpoint flag is optional. If not provided, it defaults to my-tgi-endpoint.

Generate (Streaming)

Calls /generate_stream and prints the response as it streams:

python -m workers.tgi.client --generate-stream --endpoint <ENDPOINT_NAME>

Generate (Non-Streaming)

Calls /generate and returns a JSON response:

python -m workers.tgi.client --generate --endpoint <ENDPOINT_NAME>

Interactive Session (Streaming)

Interactive session with streaming responses. Type quit to exit.

python -m workers.tgi.client --interactive --endpoint <ENDPOINT_NAME>

API Endpoints

TGI provides two primary endpoints:

Generate (Non-Streaming)

/generate - Returns the complete generated text in a single response.

{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "return_full_text": false
  }
}
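
As a minimal sketch, the request body above could be assembled in Python as follows. The helper name build_generate_payload is hypothetical and is not part of the PyWorker client; the code only builds and serializes the JSON and does not show how a request is routed to an endpoint.

```python
import json


def build_generate_payload(prompt, max_new_tokens=1024, temperature=0.7):
    """Build a /generate request body (hypothetical helper, not PyWorker API)."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "return_full_text": False,
        },
    }


payload = build_generate_payload("Your prompt here")
body = json.dumps(payload)  # JSON string to send as the request body
print(body)
```

Keeping the defaults in one helper makes it easy to vary max_new_tokens per request without retyping the rest of the body.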

Generate Stream (Streaming)

/generate_stream - Streams the response token by token.

{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "do_sample": true,
    "return_full_text": false
  }
}
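
A rough sketch of how a client might reassemble streamed text. It assumes the stream arrives as server-sent events whose data: lines each carry a JSON object with a token.text field and a token.special flag; check this against the TGI version you deploy. The helper iter_token_text is hypothetical, and the sample lines are invented for illustration rather than captured from a real server.

```python
import json


def iter_token_text(lines):
    """Yield token text from SSE-style lines (assumed format: 'data:{json}')."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token") or {}
        if token.get("special"):
            continue  # assumed flag for control tokens such as end-of-sequence
        yield token.get("text", "")


# Invented sample events, not real server output.
sample = [
    'data:{"token": {"text": "Hello", "special": false}}',
    "",
    'data:{"token": {"text": " world", "special": false}}',
    'data:{"token": {"text": "</s>", "special": true}}',
]
streamed = "".join(iter_token_text(sample))
print(streamed)
```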

Performance Notes

The max_new_tokens parameter (not the prompt size) primarily impacts performance. For example, if an instance is benchmarked to process 100 tokens per second, a request with max_new_tokens = 200 will take approximately 2 seconds to complete.
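
The arithmetic in this example can be sketched as a small helper. The function name estimate_generation_seconds is hypothetical, and the result should be read as a rough upper bound, since generation can stop before max_new_tokens is reached.

```python
def estimate_generation_seconds(max_new_tokens, tokens_per_second):
    """Rough estimate: token budget divided by benchmarked throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


# The example from the text: 200 tokens at 100 tokens per second.
estimate = estimate_generation_seconds(200, 100)
print(estimate)  # 2.0
```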