1movmedia/pyworker

Colter Downing 6b5b1341a7 update tgi client

2025-12-03 18:38:42 -08:00

__init__.py

initial commit

2024-09-04 11:19:30 -07:00

client.py

update tgi client

2025-12-03 18:38:42 -08:00

data_types.py

update tokenizers

2025-06-10 17:07:38 -07:00

README.md

update tgi client

2025-12-03 18:38:42 -08:00

server.py

Endpoint update pr one (#1)

2025-06-02 18:43:27 -07:00

test_load.py

Merge pull request #1 from Nader-gator/main

2024-09-12 11:27:48 -07:00

README.md

HuggingFace TGI PyWorker

This is the base PyWorker for HuggingFace Text Generation Inference (TGI) servers. See the Serverless documentation for guides and how-tos.

Instance Setup

Pick a template

This worker is compatible with any TGI backend. We provide a template you can use, or you can create your own.

HuggingFace TGI

The template can be configured via the template interface. You may want to change the model or startup arguments.

Follow the getting started guide for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.

Client Setup (Demo)

Clone the PyWorker repository to your local machine and install the necessary requirements for running the test client.

git clone https://github.com/vast-ai/pyworker
cd pyworker
pip install uv
uv venv -p 3.12
source .venv/bin/activate
uv pip install -r requirements.txt

Using the Test Client

The test client demonstrates both streaming and non-streaming generation using TGI's native API.

First, set your API key as an environment variable:

export VAST_API_KEY=<your_api_key>
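A client can check for this variable before doing any network work. The following is a minimal sketch, not the code in client.py; the helper name read_api_key is hypothetical, and only the variable name VAST_API_KEY comes from this README.

```python
import os


def read_api_key(env=None):
    """Return the API key from VAST_API_KEY, or raise if it is missing.

    `env` defaults to the process environment; passing a plain dict
    makes the helper easy to exercise without touching os.environ.
    (Hypothetical helper, not part of the PyWorker repository.)
    """
    env = os.environ if env is None else env
    key = env.get("VAST_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "VAST_API_KEY is not set; export it before running the client"
        )
    return key
```

Failing early with a clear message avoids a less obvious authentication error later in the request.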

The --endpoint flag is optional. If not provided, it defaults to my-tgi-endpoint.

Generate (Streaming)

Call /generate_stream and receive a streaming response:

python -m workers.tgi.client --generate-stream --endpoint <ENDPOINT_NAME>

Generate (Non-Streaming)

Call /generate and receive a JSON response:

python -m workers.tgi.client --generate --endpoint <ENDPOINT_NAME>

Interactive Session (Streaming)

Interactive session with streaming responses. Type quit to exit.

python -m workers.tgi.client --interactive --endpoint <ENDPOINT_NAME>
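For readers who want to see how a command line like the ones above can be declared, here is a minimal argparse sketch. It is an illustration, not the actual contents of client.py: the flag names and the my-tgi-endpoint default come from this README, while the function name build_parser and the choice to make the three modes mutually exclusive are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the flags described in this README (illustrative)."""
    parser = argparse.ArgumentParser(description="TGI test client (sketch)")
    # --endpoint is optional and falls back to the documented default.
    parser.add_argument(
        "--endpoint",
        default="my-tgi-endpoint",
        help="name of the serverless endpoint to call",
    )
    # Assumption: exactly one mode is selected per run.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--generate", action="store_true",
                      help="non-streaming call to /generate")
    mode.add_argument("--generate-stream", action="store_true",
                      help="streaming call to /generate_stream")
    mode.add_argument("--interactive", action="store_true",
                      help="interactive streaming session")
    return parser
```

Calling build_parser().parse_args(["--generate"]) yields a namespace whose endpoint attribute is the default name, since no --endpoint was given.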

API Endpoints

TGI provides two primary endpoints:

Generate (Non-Streaming)

/generate - Returns the complete generated text in a single JSON response.

{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "return_full_text": false
  }
}
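As a rough illustration of how this payload might be sent, the sketch below builds the same JSON body and posts it to a TGI server using only the standard library. It assumes you are talking to a TGI server directly at a base URL you supply (base_url is a placeholder, not an address from this README); requests routed through the serverless endpoint may need additional fields that are not shown here. The function names are hypothetical.

```python
import json
import urllib.request


def build_generate_payload(prompt, max_new_tokens=1024, temperature=0.7):
    """Build the request body shown above for TGI's /generate route."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "return_full_text": False,
        },
    }


def generate(base_url, prompt, timeout=60.0):
    """POST the payload to <base_url>/generate and return the generated text.

    Assumes a directly reachable TGI server; base_url is supplied by the caller.
    """
    body = json.dumps(build_generate_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # TGI answers with a JSON object that carries "generated_text".
        return json.loads(response.read().decode("utf-8"))["generated_text"]
```

Keeping the payload builder separate from the network call makes it easy to inspect or log the exact body before sending it.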

Generate Stream (Streaming)

/generate_stream - Streams the response token by token.

{
  "inputs": "Your prompt here",
  "parameters": {
    "max_new_tokens": 1024,
    "temperature": 0.7,
    "do_sample": true,
    "return_full_text": false
  }
}
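TGI delivers the streamed response as server-sent events, one data line per token. The sketch below shows one way to pull the token text out of such a stream; it works on any iterable of lines so it can be tried without a server. The function name iter_token_texts and the sample events are illustrative, not output captured from a real endpoint.

```python
import json


def iter_token_texts(lines):
    """Yield the text of each non-special token from TGI-style SSE lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        event = json.loads(line[len("data:"):].strip())
        token = event.get("token") or {}
        if token.get("special"):
            continue  # special tokens (e.g. end-of-sequence) are not user text
        text = token.get("text")
        if text:
            yield text


# Illustrative events shaped like TGI's stream; values are made up.
sample = [
    'data:{"token":{"id":1,"text":"Hello","logprob":-0.1,"special":false},"generated_text":null}',
    "",
    'data:{"token":{"id":2,"text":" world","logprob":-0.2,"special":false},"generated_text":null}',
    'data:{"token":{"id":3,"text":"</s>","logprob":-0.3,"special":true},"generated_text":"Hello world"}',
]
print("".join(iter_token_texts(sample)))  # Hello world
```

In a real client the lines would come from the HTTP response body, read incrementally so each token can be printed as it arrives.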

Performance Notes

The max_new_tokens parameter, not the prompt size, is the main driver of request latency. For example, if an instance is benchmarked at 100 tokens per second, a request with max_new_tokens = 200 will take approximately 2 seconds (200 / 100) if it generates all 200 tokens.
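The arithmetic in this note can be written as a small helper. This is a back-of-the-envelope sketch under the same simplifying assumption as the example (a constant benchmarked token rate, and a request that generates every allowed token); the function name is hypothetical.

```python
def estimate_generation_seconds(max_new_tokens, tokens_per_second):
    """Upper-bound generation time: token budget divided by benchmarked rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


print(estimate_generation_seconds(200, 100))  # 2.0
```

A request that stops before reaching max_new_tokens finishes sooner, so the result is best read as a worst case for a given token budget.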