# HuggingFace TGI PyWorker

This is the base PyWorker for HuggingFace Text Generation Inference (TGI) servers. See the [Serverless documentation](https://docs.vast.ai/serverless) for guides and how-to's.

## Instance Setup

1. Pick a template

   This worker is compatible with any TGI backend. You can use our template or create your own.

   - [HuggingFace TGI](https://cloud.vast.ai/?ref_id=62897&creator_id=62897&name=TGI%20(Serverless))

   The template can be configured via the template interface. You may want to change the model or startup arguments.

2. Follow the [getting started guide](https://docs.vast.ai/documentation/serverless/quickstart) for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.

## Client Setup (Demo)

1. Clone the PyWorker repository to your local machine and install the requirements for running the test client.

```bash
git clone https://github.com/vast-ai/pyworker
cd pyworker
pip install uv
uv venv -p 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
```

## Using the Test Client

The test client demonstrates both streaming and non-streaming generation using TGI's native API.

First, set your API key as an environment variable:

```bash
export VAST_API_KEY=
```

The `--endpoint` flag is optional. If not provided, it defaults to `my-tgi-endpoint`.

### Generate (Streaming)

Calls `/generate_stream` and streams the response:

```bash
python -m workers.tgi.client --generate-stream --endpoint
```

### Generate (Non-Streaming)

Calls `/generate` and returns a JSON response:

```bash
python -m workers.tgi.client --generate --endpoint
```

### Interactive Session (Streaming)

Starts an interactive session with streaming responses. Type `quit` to exit.

```bash
python -m workers.tgi.client --interactive --endpoint
```

## API Endpoints

TGI provides two primary endpoints:

### Generate (Non-Streaming)

`/generate` - Returns the complete response in a single request.

```json
{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "return_full_text": false
  }
}
```

### Generate Stream (Streaming)

`/generate_stream` - Streams the response token by token.

```json
{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "do_sample": true,
    "return_full_text": false
  }
}
```

## Performance Notes

The `max_new_tokens` parameter (not the prompt size) primarily impacts performance. For example, if an instance is benchmarked to process 100 tokens per second, a request with `max_new_tokens = 200` will take approximately 2 seconds to complete.
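
The estimate above can be reproduced with a small calculation. The following is a minimal sketch, not part of PyWorker or TGI: the helper names `build_generate_payload` and `estimate_seconds` are hypothetical and exist only for illustration. It builds a request body in the shape shown for `/generate` and divides `max_new_tokens` by a benchmarked throughput. It assumes generation runs for the full token budget, so treat the result as a rough upper bound rather than a measured latency.

```python
import json


def build_generate_payload(prompt, max_new_tokens=1024, temperature=0.7):
    """Build a request body shaped like the `/generate` example above.

    Hypothetical helper for illustration; not part of PyWorker or TGI.
    """
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "return_full_text": False,
        },
    }


def estimate_seconds(max_new_tokens, tokens_per_second):
    """Rough generation-time estimate: token budget divided by throughput.

    Assumes the model generates all `max_new_tokens`; a response that
    stops earlier would finish sooner.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


if __name__ == "__main__":
    payload = build_generate_payload("Your prompt here", max_new_tokens=200)
    print(json.dumps(payload, indent=2))

    # Figures from the example in the text: 200 tokens at 100 tokens/second.
    seconds = estimate_seconds(payload["parameters"]["max_new_tokens"], 100)
    print(f"Estimated generation time: {seconds:.1f} s")
```

Lowering `max_new_tokens` in the payload is the most direct way to reduce the estimated time per request, since the prompt size does not enter the calculation.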