Compare commits
4 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 6b5b1341a7 | |
| | 8be92c03de | |
| | 2f543c01ad | |
| | 0bcd2219ea | |
@@ -8,14 +8,13 @@ This is the base PyWorker for OpenAI compatible inference servers. See the [Ser
 
 This worker is compatible with any backend API that properly implements the `/v1/completions` and `/v1/chat/completions` endpoints. We currently have three templates you can choose from but you can also create your own without having to modify the PyWorker.
 
-- [vLLM](https://cloud.vast.ai/?ref_id=62897&creator_id=62897&name=vLLM%20%2B%20Qwen%2FQwen3-8B%20(Serverless)) (recommended)
+- [vLLM](https://cloud.vast.ai/?ref_id=62897&creator_id=62897&name=vLLM%20(Serverless)) (recommended)
 - [Ollama](https://cloud.vast.ai/?ref_id=62897&creator_id=62897&name=Ollama%20%2B%20Qwen3%3A32b%20(Serverless))
-- [HuggingFace TGI](https://cloud.vast.ai/?ref_id=62897&creator_id=62897&name=TGI%20%2B%20Qwen3-8B%20(Serverless))
 
 
 All of these templates can be configured via the template interface. You may want to change the model or startup arguments, depending on the template you selected.
 
-2. Follow the [getting started guide](https://docs.vast.ai/serverless/getting-started) for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.
+2. Follow the [getting started guide](https://docs.vast.ai/documentation/serverless/quickstart) for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.
 
 ## Client Setup (Demo)
 
@@ -35,6 +35,7 @@ backend = Backend(
     model_server_url=os.environ["MODEL_SERVER_URL"],
     model_log_file=os.environ["MODEL_LOG"],
     allow_parallel_requests=True,
+    max_wait_time=600.0,
     benchmark_handler=CompletionsHandler(benchmark_runs=3, benchmark_words=256),
     log_actions=[
         *[(LogAction.ModelLoaded, info_msg) for info_msg in MODEL_SERVER_START_LOG_MSG],
+93
-9
@@ -1,19 +1,103 @@
-This is the base PyWorker for TGI, designed to create PyWorkers that can utilize various LLMs. It offers two primary endpoints:
+# HuggingFace TGI PyWorker
 
-1. `generate`: Generates the LLM's response to a given prompt in a single request.
-2. `generate_stream`: Streams the LLM's response token by token.
+This is the base PyWorker for HuggingFace Text Generation Inference (TGI) servers. See the [Serverless documentation](https://docs.vast.ai/serverless) for guides and how-to's.
 
-Both endpoints use the following API payload format:
+## Instance Setup
 
+1. Pick a template
+
+This worker is compatible with any TGI backend. We have a template you can use or you can create your own.
+
+- [HuggingFace TGI](https://cloud.vast.ai/?ref_id=62897&creator_id=62897&name=TGI%20(Serverless))
+
+The template can be configured via the template interface. You may want to change the model or startup arguments.
+
+2. Follow the [getting started guide](https://docs.vast.ai/documentation/serverless/quickstart) for help with configuring your serverless setup. For testing, we recommend that you use the default options presented by the web interface.
+
+## Client Setup (Demo)
+
+1. Clone the PyWorker repository to your local machine and install the necessary requirements for running the test client.
+
+```bash
+git clone https://github.com/vast-ai/pyworker
+cd pyworker
+pip install uv
+uv venv -p 3.12
+source .venv/bin/activate
+uv pip install -r requirements.txt
+```
+
+## Using the Test Client
+
+The test client demonstrates both streaming and non-streaming generation using TGI's native API.
+
+First, set your API key as an environment variable:
+
+```bash
+export VAST_API_KEY=<your_api_key>
+```
+
+The `--endpoint` flag is optional. If not provided, it defaults to `my-tgi-endpoint`.
+
+### Generate (Streaming)
+
+Call to `/generate_stream` with streaming response:
+
+```bash
+python -m workers.tgi.client --generate-stream --endpoint <ENDPOINT_NAME>
+```
+
+### Generate (Non-Streaming)
+
+Call to `/generate` with json response:
+
+```bash
+python -m workers.tgi.client --generate --endpoint <ENDPOINT_NAME>
+```
+
+### Interactive Session (Streaming)
+
+Interactive session with streaming responses. Type `quit` to exit.
+
+```bash
+python -m workers.tgi.client --interactive --endpoint <ENDPOINT_NAME>
+```
+
+## API Endpoints
+
+TGI provides two primary endpoints:
+
+### Generate (Non-Streaming)
+
+`/generate` - Returns the complete response in a single request.
+
 ```json
 {
-  "inputs": "PROMPT",
+  "inputs": "Your prompt here",
   "parameters": {
-    "max_new_tokens": 250
+    "max_new_tokens": 1024,
+    "temperature": 0.7,
+    "return_full_text": false
   }
 }
 ```
 
-Note that the max_new_tokens parameter, rather than the prompt size, impacts performance. For example, if an
-instance is benchmarked to process 100 tokens per second, a request with max_new_tokens = 200 will take
-approximately 2 seconds to complete.
+### Generate Stream (Streaming)
+
+`/generate_stream` - Streams the response token by token.
+
+```json
+{
+  "inputs": "Your prompt here",
+  "parameters": {
+    "max_new_tokens": 1024,
+    "temperature": 0.7,
+    "do_sample": true,
+    "return_full_text": false
+  }
+}
+```
+
+## Performance Notes
+
+The `max_new_tokens` parameter (not the prompt size) primarily impacts performance. For example, if an instance is benchmarked to process 100 tokens per second, a request with `max_new_tokens = 200` will take approximately 2 seconds to complete.
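
Review note: the rule of thumb in the Performance Notes line above can be written as a one-line calculation. The sketch below assumes generation runs at the benchmarked rate and that the model produces the full `max_new_tokens`; `estimate_generation_seconds` is a hypothetical helper for illustration and is not part of the PyWorker or the `vastai` SDK.

```python
def estimate_generation_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Rough upper-bound latency: requested tokens divided by benchmarked throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


# The README's own example: 200 tokens at 100 tokens/s.
print(estimate_generation_seconds(200, 100.0))  # 2.0
```

The same figure is what the client passes as `cost` to `endpoint.request`, so a larger `max_new_tokens` raises both the estimated latency and the declared request cost.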
+186
-25
@@ -1,61 +1,222 @@
+import logging
+import json
+import os
+import sys
+import argparse
+
 from vastai import Serverless
 import asyncio
 
-ENDPOINT_NAME = "my-tgi-endpoint" # Change this to match your endpoint name
+# ---------------------- Logging ----------------------
+logging.basicConfig(
+    level=logging.DEBUG,
+    format="%(asctime)s[%(levelname)-5s] %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+)
+log = logging.getLogger(__file__)
+
+# ---------------------- Defaults ----------------------
+DEFAULT_PROMPT = "Think step by step: Tell me about the Python programming language."
+
+ENDPOINT_NAME = "TGI-Prod2" # change this to your TGI endpoint name
 MAX_TOKENS = 1024
-PROMPT = "Think step by step: Tell me about the Python programming language."
+DEFAULT_TEMPERATURE = 0.7
 
-async def call_generate(client: Serverless) -> None:
-    endpoint = await client.get_endpoint(name=ENDPOINT_NAME)
+
+# ---------------------- API Calls ----------------------
+async def call_generate(client: Serverless, *, endpoint_name: str, prompt: str, **kwargs) -> dict:
+    """Non-streaming generation via /generate endpoint"""
+    endpoint = await client.get_endpoint(name=endpoint_name)
 
     payload = {
-        "inputs": PROMPT,
+        "inputs": prompt,
         "parameters": {
-            "max_new_tokens": MAX_TOKENS,
-            "temperature": 0.7,
-            "return_full_text": False
+            "max_new_tokens": kwargs.get("max_tokens", MAX_TOKENS),
+            "temperature": kwargs.get("temperature", DEFAULT_TEMPERATURE),
+            "return_full_text": False,
         }
     }
-    resp = await endpoint.request("/generate", payload, cost=MAX_TOKENS)
-    print(resp["response"]["generated_text"])
+    log.debug("POST /generate %s", json.dumps(payload)[:500])
+    resp = await endpoint.request("/generate", payload, cost=payload["parameters"]["max_new_tokens"])
+    return resp["response"]
 
 
-async def call_generate_stream(client: Serverless) -> None:
-    endpoint = await client.get_endpoint(name=ENDPOINT_NAME)
+async def call_generate_stream(client: Serverless, *, endpoint_name: str, prompt: str, **kwargs):
+    """Streaming generation via /generate_stream endpoint"""
+    endpoint = await client.get_endpoint(name=endpoint_name)
 
     payload = {
-        "inputs": PROMPT,
+        "inputs": prompt,
         "parameters": {
-            "max_new_tokens": MAX_TOKENS,
-            "temperature": 0.7,
+            "max_new_tokens": kwargs.get("max_tokens", MAX_TOKENS),
+            "temperature": kwargs.get("temperature", DEFAULT_TEMPERATURE),
             "do_sample": True,
             "return_full_text": False,
         }
     }
+    log.debug("STREAM /generate_stream %s", json.dumps(payload)[:500])
     resp = await endpoint.request(
         "/generate_stream",
         payload,
-        cost=MAX_TOKENS,
+        cost=payload["parameters"]["max_new_tokens"],
         stream=True,
     )
-    stream = resp["response"]
+    return resp["response"] # async generator
 
 
+# ---------------------- Demo Runner ----------------------
+class APIDemo:
+    """Demo and testing functionality for the TGI API client"""
+
+    def __init__(self, client: Serverless, endpoint_name: str):
+        self.client = client
+        self.endpoint_name = endpoint_name
+
+    async def handle_streaming_response(self, stream) -> str:
+        """Process streaming response and print tokens"""
+        full_response = ""
         printed_answer = False
 
         async for event in stream:
             tok = (event.get("token") or {}).get("text")
             if tok:
                 if not printed_answer:
                     printed_answer = True
-                    print("Answer:\n", end="", flush=True)
+                    print("\n💬 Response: ", end="", flush=True)
                 print(tok, end="", flush=True)
+                full_response += tok
 
-async def main():
+        print() # newline
+        if printed_answer:
+            print(f"\nStreaming completed. Response tokens: {len(full_response.split())}")
+
+        return full_response
+
+    async def demo_generate(self) -> None:
+        """Demo non-streaming generation"""
+        print("=" * 60)
+        print("GENERATE DEMO (NON-STREAMING)")
+        print("=" * 60)
+
+        response = await call_generate(
+            client=self.client,
+            endpoint_name=self.endpoint_name,
+            prompt=DEFAULT_PROMPT,
+            max_tokens=MAX_TOKENS,
+            temperature=DEFAULT_TEMPERATURE,
+        )
+
+        print(f"\n💬 Response: {response.get('generated_text', '')}")
+        print(f"\nFull Response:\n{json.dumps(response, indent=2)}")
+
+    async def demo_generate_stream(self) -> None:
+        """Demo streaming generation"""
+        print("=" * 60)
+        print("GENERATE DEMO (STREAMING)")
+        print("=" * 60)
+
+        stream = await call_generate_stream(
+            client=self.client,
+            endpoint_name=self.endpoint_name,
+            prompt=DEFAULT_PROMPT,
+            max_tokens=MAX_TOKENS,
+            temperature=DEFAULT_TEMPERATURE,
+        )
+
+        try:
+            await self.handle_streaming_response(stream)
+        except Exception as e:
+            log.error("\nError during streaming: %s", e, exc_info=True)
+
+    async def interactive_chat(self) -> None:
+        """Interactive session with streaming generation"""
+        print("=" * 60)
+        print("INTERACTIVE STREAMING SESSION")
+        print("=" * 60)
+        print(f"Using endpoint: {self.endpoint_name}")
+        print("Type 'quit' to exit")
+        print()
+
+        while True:
+            try:
+                user_input = input("You: ").strip()
+
+                if user_input.lower() == "quit":
+                    print("👋 Goodbye!")
+                    break
+                elif not user_input:
+                    continue
+
+                print("Assistant: ", end="", flush=True)
+                stream = await call_generate_stream(
+                    client=self.client,
+                    endpoint_name=self.endpoint_name,
+                    prompt=user_input,
+                    max_tokens=MAX_TOKENS,
+                    temperature=DEFAULT_TEMPERATURE,
+                )
+
+                full_response = ""
+                async for event in stream:
+                    tok = (event.get("token") or {}).get("text")
+                    if tok:
+                        print(tok, end="", flush=True)
+                        full_response += tok
+                print() # newline
+
+            except KeyboardInterrupt:
+                print("\n👋 Session interrupted. Goodbye!")
+                break
+            except Exception as e:
+                log.error("\nError: %s", e)
+                continue
+
+
+# ---------------------- CLI ----------------------
+def build_arg_parser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(description="Vast TGI Demo (Serverless SDK)")
+    p.add_argument("--endpoint", default=ENDPOINT_NAME, help=f"Vast endpoint name (default: {ENDPOINT_NAME})")
+
+    modes = p.add_mutually_exclusive_group(required=False)
+    modes.add_argument("--generate", action="store_true", help="Test generate endpoint (non-streaming)")
+    modes.add_argument("--generate-stream", action="store_true", help="Test generate endpoint with streaming")
+    modes.add_argument("--interactive", action="store_true", help="Start interactive streaming session")
+    return p
+
+
+async def main_async():
+    args = build_arg_parser().parse_args()
+
+    selected = sum([args.generate, args.generate_stream, args.interactive])
+    if selected == 0:
+        print("Please specify exactly one test mode:")
+        print(" --generate : Test generate endpoint (non-streaming)")
+        print(" --generate-stream : Test generate endpoint with streaming")
+        print(" --interactive : Start interactive streaming session")
+        print(f"\nExample: python {os.path.basename(sys.argv[0])} --generate-stream --endpoint my-tgi-endpoint")
+        sys.exit(1)
+    elif selected > 1:
+        print("Please specify exactly one test mode")
+        sys.exit(1)
+
+    print("=" * 60)
+    print(f"Using endpoint: {args.endpoint}")
+
+    try:
         async with Serverless() as client:
-            await call_generate(client)
-            await call_generate_stream(client)
+            demo = APIDemo(client, args.endpoint)
+
+            if args.generate:
+                await demo.demo_generate()
+            elif args.generate_stream:
+                await demo.demo_generate_stream()
+            elif args.interactive:
+                await demo.interactive_chat()
+
+    except Exception as e:
+        log.error("Error during test: %s", e, exc_info=True)
+        sys.exit(1)
 
 
 if __name__ == "__main__":
-    asyncio.run(main())
+    asyncio.run(main_async())
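
Review note: both streaming paths in this diff share one pattern, reading `event["token"]["text"]` from each streamed event and concatenating the pieces. The sketch below isolates that loop so it can be run without a Vast endpoint. `fake_stream` and `collect_stream_text` are hypothetical names, and the event shape is assumed from the code in this diff rather than taken from TGI or SDK documentation. It counts token events directly, whereas `len(full_response.split())` in the diff counts whitespace-separated words.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream() -> AsyncIterator[dict]:
    """Hypothetical stand-in for the async generator returned by the endpoint."""
    for event in (
        {"token": {"text": "Hello"}},
        {"token": {"text": ","}},
        {"token": None},          # an event with an empty token is skipped
        {"details": "done"},      # so is an event with no "token" key
        {"token": {"text": " world"}},
    ):
        yield event


async def collect_stream_text(stream: AsyncIterator[dict]) -> tuple[str, int]:
    """Concatenate token text from a stream and count the token events seen."""
    parts: list[str] = []
    async for event in stream:
        tok = (event.get("token") or {}).get("text")
        if tok:
            parts.append(tok)
    return "".join(parts), len(parts)


text, n_tokens = asyncio.run(collect_stream_text(fake_stream()))
print(text, n_tokens)  # Hello, world 3
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, and returning the event count keeps the "tokens" figure tied to what the server actually streamed.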