This is the base PyWorker for TGI, designed to create PyWorkers that can utilize various LLMs. It offers two primary endpoints:
generate: Generates the LLM's response to a given prompt in a single request.
generate_stream: Streams the LLM's response token by token.
Both endpoints use the following API payload format:
{
  "inputs": "PROMPT",
  "parameters": {
    "max_new_tokens": 250
  }
}
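As a minimal sketch of how a client might build and send this payload, the Python below uses only the standard library. The helper names `build_payload` and `post_generate`, the `base_url` argument, and the `/generate` URL path are assumptions made for illustration; the actual address and routing depend on how the PyWorker is deployed.

```python
import json
import urllib.request


def build_payload(prompt: str, max_new_tokens: int = 250) -> dict:
    """Build the request body shared by generate and generate_stream."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def post_generate(base_url: str, prompt: str, max_new_tokens: int = 250) -> bytes:
    """POST the payload and return the raw response body.

    Assumption: the endpoint is reachable at <base_url>/generate.
    """
    body = json.dumps(build_payload(prompt, max_new_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Keeping payload construction separate from the network call makes it possible to reuse `build_payload` for generate_stream, which takes the same body.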
Note that the time a request takes is determined by the max_new_tokens parameter rather than by the prompt size. For example, if an instance is benchmarked to process 100 tokens per second, a request with max_new_tokens = 200 will take approximately 2 seconds to complete.
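The arithmetic in this example can be sketched as a small helper. The function name `estimate_seconds` is hypothetical, and the estimate assumes the model generates the full max_new_tokens at the benchmarked rate, so treat it as a rough approximation rather than a guarantee.

```python
def estimate_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Approximate request duration as tokens requested / benchmarked throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


# The example from the text: 200 tokens at 100 tokens per second.
print(estimate_seconds(200, 100))  # 2.0
```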