HuggingFace TGI PyWorker
This is the base PyWorker for HuggingFace Text Generation Inference (TGI) servers. See the Serverless documentation for guides and how-tos.
Instance Setup
- Pick a template
This worker is compatible with any TGI backend. We provide a template you can use, or you can create your own.
The template can be configured via the template interface. You may want to change the model or startup arguments.
- Follow the getting started guide for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.
Client Setup (Demo)
- Clone the PyWorker repository to your local machine and install the necessary requirements for running the test client.
git clone https://github.com/vast-ai/pyworker
cd pyworker
pip install uv
uv venv -p 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
Using the Test Client
The test client demonstrates both streaming and non-streaming generation using TGI's native API.
First, set your API key as an environment variable:
export VAST_API_KEY=<your_api_key>
The --endpoint flag is optional. If not provided, it defaults to my-tgi-endpoint.
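To illustrate the behavior described above, here is a minimal sketch of how a client might read the API key and apply the endpoint default. It is not the test client's actual code: parse_client_args and load_api_key are hypothetical names, and only the --endpoint flag, its default value, and the VAST_API_KEY variable come from this document.

```python
import argparse
import os

DEFAULT_ENDPOINT = "my-tgi-endpoint"  # default named in this document


def parse_client_args(argv=None):
    """Parse a hypothetical subset of the client's flags."""
    parser = argparse.ArgumentParser(description="TGI test client (sketch)")
    parser.add_argument(
        "--endpoint",
        default=DEFAULT_ENDPOINT,
        help="endpoint name; optional",
    )
    return parser.parse_args(argv)


def load_api_key(environ=os.environ):
    """Return VAST_API_KEY, or raise a clear error if it is unset."""
    key = environ.get("VAST_API_KEY")
    if not key:
        raise RuntimeError("VAST_API_KEY is not set; export it first")
    return key
```

Failing early on a missing key tends to give a clearer error than an authentication failure later in the request.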
Generate (Streaming)
Calls /generate_stream and prints the response as it streams:
python -m workers.tgi.client --generate-stream --endpoint <ENDPOINT_NAME>
Generate (Non-Streaming)
Calls /generate and returns a JSON response:
python -m workers.tgi.client --generate --endpoint <ENDPOINT_NAME>
Interactive Session (Streaming)
Interactive session with streaming responses. Type quit to exit.
python -m workers.tgi.client --interactive --endpoint <ENDPOINT_NAME>
API Endpoints
TGI provides two primary endpoints:
Generate (Non-Streaming)
/generate - Returns the complete output in a single response.
{
"inputs": "Your prompt here",
"parameters": {
"max_new_tokens": 1024,
"temperature": 0.7,
"return_full_text": false
}
}
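The request body above can be assembled in Python as follows. This is a sketch only: build_generate_payload is a hypothetical helper, not part of the PyWorker repository. It covers constructing the body. How the body is sent depends on your serverless setup and is not shown.

```python
import json


def build_generate_payload(prompt, max_new_tokens=1024, temperature=0.7):
    """Build a request body matching the /generate example."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "return_full_text": False,  # serialized as JSON false
        },
    }


# Serialize to the JSON text that would go in the request body.
body = json.dumps(build_generate_payload("Your prompt here"))
```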
Generate Stream (Streaming)
/generate_stream - Streams the response token by token.
{
"inputs": "Your prompt here",
"parameters": {
"max_new_tokens": 1024,
"temperature": 0.7,
"do_sample": true,
"return_full_text": false
}
}
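A streamed response has to be reassembled on the client side. The sketch below assumes the server emits server-sent events whose data: lines carry a JSON object with a token.text field, which is how TGI's streaming output is commonly described. Verify the event shape against your TGI version. extract_token_text is a hypothetical helper, and the sample lines are made up for illustration.

```python
import json


def extract_token_text(line):
    """Return the token text from one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, other SSE fields
    event = json.loads(line[len("data:"):])
    token = event.get("token") or {}
    if token.get("special"):
        return None  # skip special tokens such as end-of-sequence
    return token.get("text")


# Illustrative lines in the assumed event shape (not real server output).
sample_lines = [
    'data:{"token":{"id":1,"text":"Hello","logprob":-0.1,"special":false}}',
    "",
    'data:{"token":{"id":2,"text":" world","logprob":-0.2,"special":false}}',
]
text = "".join(t for t in map(extract_token_text, sample_lines) if t)
```

Parsing line by line keeps the helper independent of any particular HTTP library, so it can be fed from whatever streaming iterator your client provides.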
Performance Notes
The max_new_tokens parameter, rather than the prompt size, is the main factor in request performance. For example, if an instance is benchmarked at 100 tokens per second, a request with max_new_tokens = 200 takes approximately 2 seconds when generation runs to the full token limit.
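The arithmetic in this example can be written as a small helper. This is a rough estimate only, assuming generation runs to the full max_new_tokens limit and throughput matches the benchmark. estimate_generation_seconds is a hypothetical name, not a PyWorker function.

```python
def estimate_generation_seconds(max_new_tokens, tokens_per_second):
    """Estimate completion time as token budget divided by throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


# The example from the text: 200 tokens at 100 tokens per second.
seconds = estimate_generation_seconds(200, 100)
```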