Documentation Index
Fetch the complete documentation index at: https://na-36-handover-docs-v2-into-docs-v2-dev-20260518.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
The Livepeer AI gateway exposes nine batch pipelines and one LLM pipeline through HTTP POST endpoints. Each pipeline accepts a JSON request body keyed by
model_id and pipeline-specific fields, and returns a JSON response with the result. Real-time video AI (live-video-to-video) runs through the trickle protocol and is covered separately in the real-time AI overview.
For warm models, VRAM requirements, and architecture support per pipeline, see model support. For SDK wrappers, see AI SDKs.
Shared conventions
Base URL: Any Livepeer gateway endpoint. The community gateway athttps://dream-gateway.livepeer.cloud accepts unauthenticated requests for development.
Authentication: Bearer token when the gateway requires it. The community gateway does not require a token.
Request format: POST /<pipeline-endpoint> with Content-Type: application/json.
model_id field: Every pipeline accepts a model_id field specifying the Hugging Face model ID (or Ollama model ID for LLM). Omitting model_id uses the pipeline’s default warm model.
Error responses: 400 for malformed requests, 422 for validation errors (invalid model_id, missing required fields), 500 for inference failures. Error bodies include a detail field with the failure reason.
Cold model latency: If no orchestrator has the requested model warm in GPU memory, the first request triggers a model load (30 seconds to 5 minutes depending on model size). Subsequent requests to the same model on the same orchestrator are immediate.
Pipeline reference
text-to-image
text-to-image
Generate images from text prompts using diffusion models (SDXL, SD 1.5, Flux).
Response: JSON object with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Hugging Face model ID. Default: SG161222/RealVisXL_V4.0_Lightning |
prompt | string | Yes | Text prompt for generation |
negative_prompt | string | No | Terms to avoid in generation |
width | integer | No | Output width in pixels (default: 1024) |
height | integer | No | Output height in pixels (default: 1024) |
guidance_scale | number | No | Classifier-free guidance scale (default: 7.5) |
num_inference_steps | integer | No | Denoising steps (default depends on model; Lightning models use 4-8) |
seed | integer | No | Random seed for reproducibility |
num_images_per_prompt | integer | No | Number of images to generate (default: 1) |
safety_check | boolean | No | Run NSFW safety filter (default: true) |
images array. Each image is a { url, seed } object.image-to-image
image-to-image
Transform images using style transfer, enhancement, or img2img diffusion.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: timbrooks/instruct-pix2pix |
image | file | Yes | Input image (multipart form upload) |
prompt | string | Yes | Transformation instruction |
strength | number | No | How much to transform (0.0 = no change, 1.0 = full regeneration) |
guidance_scale | number | No | Guidance scale (default: 7.5) |
num_inference_steps | integer | No | Denoising steps |
seed | integer | No | Random seed |
safety_check | boolean | No | NSFW filter (default: true) |
images array, same format as text-to-image.image-to-image uses
multipart/form-data, not application/json. The image is uploaded as a file field.image-to-video
image-to-video
Animate a still image into a short video clip using Stable Video Diffusion.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: stabilityai/stable-video-diffusion-img2vid-xt |
image | file | Yes | Input image (multipart form upload) |
fps | integer | No | Output frames per second (default: 6) |
motion_bucket_id | integer | No | Motion intensity (0-255; default: 127) |
seed | integer | No | Random seed |
safety_check | boolean | No | NSFW filter (default: true) |
frames array containing frame URLs, or a video URL.SVD outputs 14-25 frames at 576x1024 resolution. Text prompts are not used; the image is the sole conditioning input.
image-to-text
image-to-text
Generate captions or descriptions for images using BLIP or vision-language models.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: Salesforce/blip-image-captioning-large |
image | file | Yes | Input image (multipart form upload) |
prompt | string | No | Optional prompt to guide caption content |
text field containing the generated caption.audio-to-text
audio-to-text
Transcribe audio to text with per-chunk timestamps using Whisper.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: openai/whisper-large-v3 |
audio | file | Yes | Audio file (mp4, webm, mp3, flac, wav, m4a). Max 50 MB. |
text (full transcript) and chunks array (per-segment timestamps and text).text-to-speech
text-to-speech
Generate natural speech from text using Parler-TTS.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: parler-tts/parler-tts-large-v1 |
text | string | Yes | Text to synthesise. Max ~600 characters; chunk longer text. |
description | string | No | Voice characteristics (speaker identity, style, audio quality) |
audio object containing a URL to the generated audio file.Requires a pipeline-specific AI Runner container. Not all orchestrators have this pipeline active.
upscale
upscale
Upscale low-resolution images using the SD x4-Upscaler (4x super-resolution).
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: stabilityai/stable-diffusion-x4-upscaler |
image | file | Yes | Input image (multipart form upload) |
prompt | string | No | Optional quality guidance prompt |
seed | integer | No | Random seed |
safety_check | boolean | No | NSFW filter (default: true) |
images array, same format as text-to-image.segment-anything-2
segment-anything-2
Promptable visual segmentation for images using SAM 2 (Meta AI).
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: facebook/sam2-hiera-large |
image | file | Yes | Input image |
point_coords | array | No | Point prompts as [[x,y], ...] |
point_labels | array | No | Labels for points (1 = foreground, 0 = background) |
box | array | No | Bounding box prompt [x1, y1, x2, y2] |
masks, scores, and logits arrays.llm
llm
OpenAI-compatible chat completions using Ollama-based runner.
Response: OpenAI-compatible chat completion object with
| Field | Type | Required | Description |
|---|---|---|---|
model | string | Yes | Ollama-compatible model ID |
messages | array | Yes | OpenAI-format message array (role + content) |
max_tokens | integer | No | Maximum output tokens |
temperature | number | No | Sampling temperature (0.0-2.0) |
stream | boolean | No | Stream response tokens (SSE) |
choices[0].message.content.The LLM pipeline is in beta. The request format follows the OpenAI
/v1/chat/completions shape. Supported models include Meta-Llama-3.1-8B-Instruct (warm, 8 GB VRAM), Mistral-7B-Instruct-v0.3, Gemma-2-9b-it, and Qwen2.5-7B-Instruct.Operational notes
Multipart vs JSON. Pipelines that accept file uploads (image-to-image, image-to-video, image-to-text, audio-to-text, upscale, segment-anything-2) usemultipart/form-data. Pipelines that accept only text input (text-to-image, text-to-speech, LLM) use application/json.
Gateway selection. The community gateway routes to whichever orchestrator in the active set has the requested model warm. For production, operate a self-hosted gateway with -maxPricePerUnit to control costs, or use a gateway provider with an API key.
safety_check filter. Enabled by default on image-generating pipelines. Set to false to disable. The filter runs on the orchestrator side; disabling it does not affect content moderation policies that the gateway operator may enforce.
The AI quickstart walks through the first inference call end-to-end with error handling.