A complete rearchitecture of the VideoSDK AI voice pipeline.
We've been building AI voice agents for a while now. And the more we built, the more we ran into the same wall: the pipeline was in the way.
You couldn't swap a voice. You couldn't intercept what the LLM sees. You couldn't mix a custom STT with a realtime model. And when something broke in production, there was nothing to look at - no traces, no metrics, no logs.
So we rebuilt everything.
Today we're releasing Prism: Agents V1.0.0, a stable, production-ready rearchitecture of the VideoSDK Agents framework. It is not backward compatible with v0.x.
Full release notes and migration guide →
What was broken in v0.x
The old framework had two pipeline classes: CascadingPipeline for STT → LLM → TTS chains, and RealtimePipeline for speech-to-speech models. Every new capability we wanted to add was fighting the architecture.
Hybrid mode: running a custom STT into a realtime LLM was impossible. The two pipelines had no way to talk to each other. If you didn't like the voice of your realtime model, you were stuck with it. If you wanted to clean or normalize a transcript before inference, there was no hook to do it. And observability was completely absent.
Adding features meant breaking things. So instead of patching it, we redesigned from scratch.
Agent V1 architecture
The Agent Session orchestrates the entire workflow, combining the Agent with a Pipeline for real-time communication. The unified Pipeline automatically detects the best mode based on the components you provide, whether that's a full cascade STT → LLM → TTS setup, a realtime speech-to-speech model, or a hybrid of the two.
- Agent - This is the base class for defining your agent's identity and behavior. Here, you can configure custom instructions, manage its state, and register function tools.
- Pipeline - This unified component manages the real-time flow of audio and data between the user and the AI models. It auto-detects the optimal mode based on the components you provide:
- Cascade Mode - Provide STT, LLM, TTS, VAD, and Turn Detector for maximum flexibility and control over each processing stage.
- Realtime Mode - Provide a realtime model (e.g., OpenAI Realtime, Google Gemini Live, AWS Nova Sonic) for lowest-latency speech-to-speech processing.
- Hybrid Mode - Combine a realtime model with an external STT (for knowledge base support) or external TTS (for custom voice support).
- Agent Session - This component brings together the agent and pipeline to manage the agent's lifecycle within a VideoSDK meeting.
- Pipeline Hooks - A middleware system for intercepting and processing data at any stage of the pipeline. Use hooks for custom STT/TTS processing, observing or modifying LLM output, lifecycle events, and more.
One Pipeline Class
The core change in V1 is the replacement of CascadingPipeline and RealtimePipeline with a single Pipeline class.
# Before
from videosdk.agents import CascadingPipeline, RealtimePipeline
pipeline = CascadingPipeline(stt=..., llm=..., tts=..., vad=..., turn_detector=...)
pipeline = RealtimePipeline(llm=OpenAIRealtime(...))
# After
from videosdk.agents import Pipeline
pipeline = Pipeline(stt=..., llm=..., tts=..., vad=..., turn_detector=...)
pipeline = Pipeline(llm=OpenAIRealtime(...))
Pass any combination of components. The PipelineOrchestrator analyzes what you've given it and automatically selects the correct execution mode: cascade, realtime, or hybrid. You never configure it directly. You just pass components.
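To make the auto-detection concrete, here is a simplified, illustrative sketch of the kind of selection logic an orchestrator like this performs. The function name and rules are assumptions for illustration, not the actual VideoSDK internals:

```python
# Illustrative only: a simplified sketch of mode selection.
# Names and rules are assumed, not the real PipelineOrchestrator code.

def detect_mode(stt=None, llm=None, tts=None, llm_is_realtime=False) -> str:
    """Pick an execution mode from the components provided."""
    if llm_is_realtime:
        # A realtime model plus an external STT or TTS is hybrid mode.
        if stt is not None or tts is not None:
            return "hybrid"
        return "realtime"
    # Otherwise the components form a classic cascade.
    return "cascade"

print(detect_mode(llm_is_realtime=True))                    # → realtime
print(detect_mode(stt="deepgram", llm_is_realtime=True))    # → hybrid
print(detect_mode(stt="stt", llm="llm", tts="tts"))         # → cascade
```

The real orchestrator considers more components (VAD, turn detector, vision), but the shape of the decision is the same: the mode falls out of what you pass in.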
Three Modes, One Interface
Cascade Mode - full control over every stage.
pipeline = Pipeline(
    stt=DeepgramSTT(),
    llm=GoogleLLM(),
    tts=CartesiaTTS(),
    vad=SileroVAD(),
    turn_detector=TurnDetector(),
)
Realtime Mode - lowest latency, single model for the full voice pipeline.
pipeline = Pipeline(
    llm=GeminiRealtime(
        model="gemini-3.1-flash-live-preview",
        config=GeminiLiveConfig(voice="Leda", response_modalities=["AUDIO"]),
    )
)
Supported realtime models: OpenAIRealtime, GeminiRealtime, AWSNovaSonic, AzureVoiceLive
Hybrid Mode - this is what was impossible before.
# Bring your own STT, use a realtime LLM
pipeline = Pipeline(stt=DeepgramSTT(), llm=OpenAIRealtime(...))
# Use a realtime LLM, bring your own voice
pipeline = Pipeline(llm=OpenAIRealtime(...), tts=ElevenLabsTTS(...))
You are no longer bound by what the model provider gives you. Don't like the default voice? Swap it. Want your own transcription layer feeding a realtime model? Done.
Flexible Agent Composition
V1 also unlocks the ability to run partial pipelines for specific use cases.
Pipeline(stt=...) # Transcription agent
Pipeline(llm=...) # Text chatbot
Pipeline(stt=..., llm=..., tts=...) # Voice + chat
Pipeline(stt=..., llm=..., tts=..., vad=..., turn_detector=...) # Full voice agent
Pipeline(llm=OpenAIRealtime(...)) # Realtime voice agent
Same class, same structure, different capabilities depending on what you pass in.
Pipeline Hooks
ConversationalFlow is removed. In its place is @pipeline.on(...), a decorator-based hooks system that lets you intercept and transform data at any stage, without subclassing anything.
@pipeline.on("stt")
async def on_transcript(text: str) -> str:
    return text.strip()  # normalize before LLM

@pipeline.on("tts")
async def on_tts(text: str) -> str:
    return text.replace("SDK", "S D K")  # fix pronunciation

@pipeline.on("llm")
async def on_llm(messages):
    yield "Transferring you now."  # bypass LLM entirely
Hooks are available at every stage: stt, tts, llm, vision_frame, user_turn_start, user_turn_end, agent_turn_start, agent_turn_end. You can intercept raw audio streams at the stt and tts hooks — this is audio-level control, not just text.
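Because a hook body is just an async callable, its logic can be developed and unit-tested as plain Python before it is ever registered with @pipeline.on(...). A runnable sketch of a pronunciation-fixing transformation of the kind you'd attach to the tts hook (the function name and replacement table are illustrative, not part of the SDK):

```python
import asyncio

# A hook body is an ordinary async callable; this transformation can be
# tested on its own, then registered with @pipeline.on("tts").
async def spell_out_acronyms(text: str) -> str:
    """Expand acronyms the TTS voice tends to mispronounce."""
    replacements = {"SDK": "S D K", "API": "A P I"}
    for word, spoken in replacements.items():
        text = text.replace(word, spoken)
    return text

print(asyncio.run(spell_out_acronyms("Install the SDK and call the API.")))
# → Install the S D K and call the A P I.
```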
Hooks walkthrough →
Observability, Built In
Every V1 pipeline ships with per-component metrics, structured logging, and OpenTelemetry tracing across cascade, realtime, and hybrid modes. No extra setup. Configure custom endpoints via RoomOptions.
This was the biggest missing piece in v0.x. When something breaks in production, you now have something to look at.
Docs MCP Server
Query VideoSDK documentation directly from your AI agent. The MCP server gives instant access to SDK references, implementation guides, and even source-level details inside your workflow. No manual searching, no context switching. Built for MCP-compatible agents like Claude, Cursor, or your own.
{
  "mcpServers": {
    "videosdkAgentDocs": {
      "serverUrl": "https://mcp.videosdk.live/mcp"
    }
  }
}
Explore the MCP server docs →
Agent Skills
Extend your agent with reusable capabilities. Define tools, actions, and behaviors once, and plug them into any pipeline. From API calls to complex workflows, Agent Skills let your agents do more than just respond - they can act, integrate, and automate.
Explore the Agent Skills docs →
Latency
The pipeline stages in V1 run concurrently. STT, LLM, and TTS never block each other. TTS audio streams to the room as soon as the first chunk is generated, not after full synthesis. In realtime mode, raw audio is routed directly to the model with zero transcription overhead.
Interruptions are detected at the earliest possible point in the pipeline. In-flight LLM and TTS generation is cancelled immediately on user speech. Avatar audio flushes cleanly on interrupt with no residual artifacts.
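Conceptually, this barge-in behavior amounts to cancelling in-flight generation tasks the moment user speech is detected. A minimal asyncio sketch of the pattern (this is an illustration of the concept, not SDK internals):

```python
import asyncio

async def synthesize_reply() -> None:
    """Stands in for an in-flight LLM + TTS generation."""
    await asyncio.sleep(10)  # pretend this is slow synthesis

async def main() -> str:
    task = asyncio.create_task(synthesize_reply())
    await asyncio.sleep(0.01)  # user starts speaking mid-generation
    task.cancel()              # barge-in: cancel generation immediately
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "completed"

print(asyncio.run(main()))  # → cancelled
```

The framework applies the same idea across every stage at once, so the user never hears stale audio after interrupting.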
We benchmarked V1 against the fastest pipelines in the industry. The numbers are in the release notes.
22 Production-Ready Templates
We've shipped 22 production-ready agent templates covering the most common real-world use cases. Your logic. Your pipeline. These are the starting points.
Browse all examples →
Breaking Changes
| v0.x | V1.0.0 |
|---|---|
| CascadingPipeline | Pipeline |
| RealtimePipeline | Pipeline |
| ConversationalFlow | @pipeline.on(...) hooks |
Function tools, agent lifecycle, AgentSession, WorkerJob, fallback providers, MCP tools, knowledge base, VAD, and turn detection all continue to work as before.
Full migration guide →
Resources
- Learn how to deploy your agents.
- Follow the docs to start building your AI Voice Agents today.
- Contact our sales team to explore solutions tailored to your needs.
- 👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
