workers/tgi/README.md

This is the base PyWorker for TGI (Text Generation Inference). Use it as the starting point for PyWorkers that serve various LLMs. It exposes two endpoints:

1. `generate`: Generates the LLM's response to a given prompt in a single request.
2. `generate_stream`: Streams the LLM's response token by token.

Both endpoints use the following API payload format:

```json
{
  "inputs": "PROMPT",
  "parameters": {
    "max_new_tokens": 250
  }
}
```
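
As a minimal sketch of how a client might assemble this payload, the helpers below build the request body and prepare a JSON POST for either endpoint. The names `build_payload` and `build_request` and the base URL are hypothetical and used only for illustration. A real deployment may require extra routing or authentication fields that are not shown here.

```python
import json
import urllib.request


def build_payload(prompt: str, max_new_tokens: int = 250) -> dict:
    """Build the request body shared by `generate` and `generate_stream`."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def build_request(base_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request for the given endpoint.

    `base_url` is an assumption: substitute the address of your own worker.
    """
    if endpoint not in ("generate", "generate_stream"):
        raise ValueError(f"unknown endpoint: {endpoint}")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`. For `generate_stream`, read the response incrementally instead of all at once.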

Note that the `max_new_tokens` parameter, rather than the prompt size, determines how long a request takes. For example, if an
instance is benchmarked at 100 tokens per second, a request with `max_new_tokens = 200` will take
approximately 2 seconds to complete.
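
The arithmetic behind that estimate can be sketched as follows. The function name `estimate_seconds` is hypothetical. It assumes that throughput stays constant at the benchmarked rate and that the model generates the full `max_new_tokens`.

```python
def estimate_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Estimate request duration as requested tokens divided by throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


# The example from the text: 200 tokens at 100 tokens per second.
print(estimate_seconds(200, 100))  # 2.0
```

This is an upper-bound style estimate: a response that stops before reaching `max_new_tokens` would finish sooner.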