By the end of this tutorial you’ll have a Next.js 15 chatbot that takes user messages, streams responses from the Livepeer LLM pipeline token-by-token, and maintains conversation history. The LLM pipeline is OpenAI-compatible at the wire level: it accepts messages arrays, returns choices[0].delta.content chunks, and behaves like any other chat completions endpoint. The orchestrator pool runs Ollama-backed inference on GPUs as small as 8 GB.
This is the Persona 1 activation moment for text inference. The image generation tutorial proved the batch path; this one proves the streaming path. The wire format you’ll handle here works against any OpenAI-compatible endpoint, which means swapping providers is a URL change.
Required Tools
- Node.js 20 or later
- npm, pnpm, or yarn
- A code editor
dream-gateway.livepeer.cloud accepts unauthenticated POSTs to the LLM endpoint for experimentation.
Project Bootstrap
Streaming Route Handler
Server actions can’t stream responses cleanly. Route handlers can; the standard pattern for chat is a POST /api/chat handler that proxies the request to the LLM endpoint and pipes the SSE response back to the client.
Save as src/app/api/chat/route.ts:
export const runtime = 'edge' runs the handler on Edge runtime, which keeps cold-start low and streams responses without buffering. The stream: true flag in the request body asks the LLM endpoint for Server-Sent Events instead of a single JSON response. The handler pipes the response body directly through; no SSE parsing on the server side, no JSON deserialisation. The browser parses the stream.
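As a rough illustration of the handler described above, the sketch below proxies the request and pipes the upstream body straight through. The host and the LIVEPEER_GATEWAY_URL variable come from this page; the `/llm` path, the `buildUpstreamBody` helper, and the error payload are assumptions made for the example and should be checked against the gateway's API reference.

```typescript
// src/app/api/chat/route.ts (sketch, not the tutorial's original listing)
export const runtime = 'edge';

type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Assumption: the base URL is read from LIVEPEER_GATEWAY_URL, falling back to
// the community gateway. The "/llm" path is a guess for illustration only.
const GATEWAY_URL =
  process.env.LIVEPEER_GATEWAY_URL ?? 'https://dream-gateway.livepeer.cloud';
const LLM_PATH = '/llm';
const DEFAULT_MODEL = 'meta-llama/Meta-Llama-3.1-8B-Instruct';

// Hypothetical helper: builds the OpenAI-style body with streaming enabled.
export function buildUpstreamBody(
  messages: ChatMessage[],
  model: string = DEFAULT_MODEL,
): string {
  return JSON.stringify({ model, messages, stream: true });
}

export async function POST(req: Request): Promise<Response> {
  const { messages, model } = (await req.json()) as {
    messages: ChatMessage[];
    model?: string;
  };

  const upstream = await fetch(`${GATEWAY_URL}${LLM_PATH}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: buildUpstreamBody(messages, model),
  });

  // Surface upstream failures as a 502 instead of piping an error body.
  if (!upstream.ok || !upstream.body) {
    return new Response(
      JSON.stringify({ error: `Gateway responded with ${upstream.status}` }),
      { status: 502, headers: { 'Content-Type': 'application/json' } },
    );
  }

  // Pipe the SSE bytes through untouched; the browser does the parsing.
  return new Response(upstream.body, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    },
  });
}
```

Keeping the body construction in a small pure function makes the request shape easy to check without touching the network.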
SSE Wire Format
The LLM endpoint streams chunks as data: lines. Each data: line is one token (or a small group of tokens) wrapped in OpenAI’s chat completions chunk shape. The final chunk has empty content and finish_reason: "stop". The client concatenates the content fields as they arrive and renders them incrementally.
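To make the chunk shape concrete, here is a small sketch with three hand-written sample lines and a helper that extracts the token from each. The sample payloads are invented for illustration and contain only the fields this section names; the `parseDataLine` helper is hypothetical, and the `[DONE]` sentinel is handled because OpenAI's own API sends it, not because this gateway is confirmed to.

```typescript
// Only the fields this tutorial reads; real chunks carry more metadata.
type ChatChunk = {
  choices?: { delta?: { content?: string }; finish_reason?: string | null }[];
};

type ParsedLine = { token: string; done: boolean };

// Illustrative sample lines (made up, not captured from the gateway).
export const sampleLines: string[] = [
  'data: {"choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}',
  'data: {"choices":[{"index":0,"delta":{"content":"lo"},"finish_reason":null}]}',
  'data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"stop"}]}',
];

// Parses one SSE line. Returns null for blank lines, comment lines,
// and payloads that are not valid JSON.
export function parseDataLine(line: string): ParsedLine | null {
  const trimmed = line.trim();
  if (!trimmed.startsWith('data:')) return null;

  const payload = trimmed.slice('data:'.length).trim();
  if (payload === '') return null;
  // Assumption: an OpenAI-style end-of-stream sentinel may appear.
  if (payload === '[DONE]') return { token: '', done: true };

  try {
    const chunk = JSON.parse(payload) as ChatChunk;
    const choice = chunk.choices?.[0];
    if (!choice) return null;
    return {
      token: choice.delta?.content ?? '',
      done: choice.finish_reason === 'stop',
    };
  } catch {
    return null; // malformed JSON: skip rather than crash the stream
  }
}
```

Concatenating the `token` values from the three sample lines yields `Hello`, and the last line reports `done: true`.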
Chat UI Component
The UI maintains a list of messages and appends to the last assistant message as tokens stream in. Save as src/app/components/Chat.tsx:
buffer handles the case where a chunk lands mid-line. For each complete data: line, the handler parses the JSON, extracts the token from choices[0].delta.content, and appends it to the last assistant message. The loop exits when finish_reason: "stop" arrives.
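The buffering step can be sketched on its own, separate from the React component. The names `drainLines` and `readLines` are hypothetical; the idea is only that everything after the last newline is held back until the next network chunk completes it.

```typescript
// Splits accumulated text into complete lines plus the trailing partial line.
export function drainLines(buffer: string): { lines: string[]; rest: string } {
  const parts = buffer.split('\n');
  // The last element is either '' (buffer ended on a newline) or a partial line.
  const rest = parts.pop() ?? '';
  return { lines: parts, rest };
}

// Reads a byte stream and yields one complete line at a time.
export async function* readLines(
  body: ReadableStream<Uint8Array>,
): AsyncGenerator<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    buffer += decoder.decode(value, { stream: true });
    const { lines, rest } = drainLines(buffer);
    buffer = rest;
    for (const line of lines) yield line;
  }

  // Flush any bytes still held by the decoder, then emit a final unterminated line.
  buffer += decoder.decode();
  if (buffer.length > 0) yield buffer;
}
```

In a client component this might be consumed with `for await (const line of readLines(res.body))`, handing each line to the JSON parsing step. Lines may still carry a trailing `\r`, so trimming before parsing is a reasonable precaution.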
Page Composition
Save as src/app/page.tsx:
Open http://localhost:3000. Type a message, hit Send, and tokens stream into the response bubble.
Model Selection
The community gateway routes any model value to whichever orchestrator has the requested weights warm. Llama 3.1 8B Instruct is the default warm model on the network. Three other Ollama-compatible models are commonly available:
| Model | VRAM | Notes |
|---|---|---|
| meta-llama/Meta-Llama-3.1-8B-Instruct | 8 GB | Warm default, fastest first response |
| mistralai/Mistral-7B-Instruct-v0.3 | 8 GB | Strong instruction-following |
| google/gemma-2-9b-it | 10 GB | Google’s open instruction model |
| Qwen/Qwen2.5-7B-Instruct | 8 GB | Strong on code and reasoning |
Production Considerations
The community gateway is shaped for experimentation. Production chat needs four changes.
Authentication. Swap to a paid gateway and add Authorization: Bearer ${process.env.LIVEPEER_API_KEY} to the fetch headers in the route handler.
Conversation persistence. The current implementation holds messages in client state, which means refresh loses the conversation. Persist to a database keyed by user and session.
Token usage and rate limits. The LLM pipeline charges per token of output. Add a per-user token budget enforced server-side, and a per-IP rate limit on the route handler.
Cold-start handling. If the requested model is cold, the first response can take a few minutes. Add a warming request on app start that sends a one-token completion in the background, so by the time a user opens chat the model is ready.
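For the per-IP rate limit mentioned above, one possible starting point is a fixed-window counter. This is a minimal in-memory sketch, assuming a single server instance; the class name, limits, and window size are illustrative, and on Edge or serverless deployments the counters would need to live in a shared store because instances do not share memory.

```typescript
type Bucket = { start: number; count: number };

// Fixed-window limiter: at most `limit` requests per key in each window.
export class FixedWindowLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the request is allowed. `now` is injectable for tests.
  allow(key: string, now: number = Date.now()): boolean {
    const bucket = this.buckets.get(key);

    // First request for this key, or the previous window has expired.
    if (!bucket || now - bucket.start >= this.windowMs) {
      this.buckets.set(key, { start: now, count: 1 });
      return true;
    }

    if (bucket.count >= this.limit) return false;
    bucket.count += 1;
    return true;
  }
}

// Example configuration (hypothetical numbers): 20 requests per minute per IP.
export const chatLimiter = new FixedWindowLimiter(20, 60_000);
```

In the route handler, the key could be taken from a header such as `x-forwarded-for`, returning a 429 response when `allow` is false. The same structure works for a per-user token budget by counting output tokens instead of requests.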
Full hardening guidance is in Production Hardening.
Common Errors
Gateway returns 502 immediately
The route handler couldn’t reach the gateway. Confirm LIVEPEER_GATEWAY_URL is set; the Edge runtime doesn’t read variables from .env.local in production unless they’re declared in next.config.ts or as Edge-runtime env vars.
Stream starts then stalls mid-response
The orchestrator timed out or the model unloaded. Retry the request; the network routes to a different orchestrator on retry.
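Retrying can be automated with a small wrapper. The sketch below is a generic helper, not part of the tutorial's original code; the name `withRetry`, the attempt count, and the delay are assumptions. It retries the whole request, so if a stream has already delivered tokens, the caller should discard the partial text before the next attempt.

```typescript
// Calls `attempt` up to `maxAttempts` times and returns the first success.
// The attempt number (starting at 1) is passed in for logging or backoff.
export async function withRetry<T>(
  attempt: (n: number) => Promise<T>,
  maxAttempts: number = 3,
  delayMs: number = 500,
): Promise<T> {
  let lastError: unknown;

  for (let n = 1; n <= maxAttempts; n++) {
    try {
      return await attempt(n);
    } catch (err) {
      lastError = err;
      // Pause between attempts, but not after the final failure.
      if (n < maxAttempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }

  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

A call might look like `await withRetry(() => fetch('/api/chat', { method: 'POST', body }))`, with the caller throwing inside the callback when the response is not OK so that a failed attempt triggers the next one.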
Tokens arrive in big chunks instead of streaming
A proxy (Cloudflare, nginx, Vercel) is buffering. Confirm the Cache-Control: no-cache and Content-Type: text/event-stream headers are set on the response. For Cloudflare, disable response buffering on the route.
JSON.parse fails on some chunks
Some chunks contain comments or empty lines. The handler skips empty lines and wraps parse in try/catch; if you see frequent parse errors, log the raw line to identify the format drift.
Cold model load takes minutes on first request
Expected for non-warm models. Either use the warm default (meta-llama/Meta-Llama-3.1-8B-Instruct) or send a warming request on app start.
Change the model field to try Mistral, Gemma, or Qwen variants.
Next Steps
Eliza Plugin Tutorial
Build a full agent with character files, RAG, and multi-agent swarms.
AI Pipelines
The other ten pipelines: image gen, audio, vision, segmentation.
Model Support
Warm models, VRAM requirements, custom model paths.
Production Hardening
Rate limits, auth, observability, cold-start handling.