Building AI voice agents has always been powerful, but slow. The models were there: STT, LLMs, TTS, and the tools to use them. But maintaining accounts across multiple vendors for speech recognition, language models, and speech synthesis, each with its own keys, quotas, billing, and APIs, was a major challenge.
Today, that changes.
We’re thrilled to announce Inferencing in VideoSDK AI Voice Agents: a unified way to run STT, LLM, TTS, and Realtime models directly inside your voice pipeline, without managing multiple accounts, available through the Agent Runtime Dashboard and the Python Agents SDK.
Inferencing works in both the CascadingPipeline and the RealtimePipeline, giving you full flexibility to build modular or fully streaming voice agents. Whether you want incremental transcripts, staged execution, or fully native realtime audio, Inferencing makes it easy.
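The difference between the two pipeline shapes can be sketched with plain Python stubs (a conceptual mock, not the VideoSDK API): a cascading pipeline runs discrete STT → LLM → TTS stages whose intermediate outputs you can inspect, while a realtime pipeline hands the whole turn to one native-audio model.

```python
# Conceptual mock of the two pipeline shapes; NOT the VideoSDK API.

# Toy stage implementations so the sketch runs end to end.
def stt(audio: bytes) -> str:
    return audio.decode()

def llm(text: str) -> str:
    return f"echo: {text}"

def tts(text: str) -> bytes:
    return text.encode()

def realtime_model(audio: bytes) -> bytes:
    return b"echo: " + audio

def cascading_turn(audio: bytes) -> bytes:
    """Stage-by-stage: STT -> LLM -> TTS, each output inspectable."""
    transcript = stt(audio)   # incremental transcripts live here
    reply = llm(transcript)   # staged execution: swap or log any stage
    return tts(reply)

def realtime_turn(audio: bytes) -> bytes:
    """A single native-audio model handles the whole turn."""
    return realtime_model(audio)

print(cascading_turn(b"hello"))  # b'echo: hello'
```

The trade-off mirrors the real pipelines: the cascading shape gives you modular control between stages, the realtime shape minimizes hops for latency.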
What is VideoSDK Inference?
VideoSDK Inference is a managed gateway that gives you access to multiple AI models, all without providing your own API keys for providers like Sarvam AI or Google Gemini.
Authentication, routing, retries, and billing are handled by VideoSDK; usage is simply charged against your VideoSDK account balance.
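Conceptually, the gateway sits between your agent and each provider. A minimal sketch of the retry behavior it takes off your plate (an illustration of the idea, not VideoSDK internals):

```python
import time

def call_with_retries(request_fn, max_attempts=3, backoff_s=0.1):
    """Retry a provider call with exponential backoff -- the kind of
    transient-failure handling the Inference gateway does for you."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_s * 2 ** attempt)

# Toy provider that fails once, then succeeds.
calls = {"n": 0}
def flaky_provider():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient upstream error")
    return "ok"

print(call_with_retries(flaky_provider))  # ok
```

With the gateway, none of this lives in your agent code: a failed upstream call is retried and rerouted before your pipeline ever sees it.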
Supported Categories
- STT: Sarvam, Google, Deepgram
- LLMs: Google Gemini
- TTS: Sarvam, Google, Cartesia
- Realtime: Gemini Native Audio
Inferencing via Agent Runtime Dashboard
Inferencing in VideoSDK is now fully accessible through the dashboard, giving developers direct control over model selection and pipeline configuration without needing to manage infrastructure manually.
From the dashboard, developers can:
- Select STT, LLM, TTS, or Realtime models and enable them in the pipeline with a single click.
- Switch providers instantly, allowing rapid experimentation and iteration.
- Attach deployment endpoints for web or telephony, making the agent immediately accessible to users.
With this approach, ideas move from configuration to live, interactive conversations in minutes, making it possible to test new workflows, swap models, or iterate on conversational design almost instantly.
Inferencing via Code (Agents SDK)
With VideoSDK Inferencing, developers can now integrate STT, LLM, TTS, and Realtime models directly into their voice agents, all handled inside VideoSDK. This enables rapid experimentation, modular pipelines, and low-latency real-time conversations.
Installation
The Inference plugin is included in the core VideoSDK Agents SDK. Install it via:
pip install videosdk-agents
Importing Inference Classes
You can import the Inference classes directly from videosdk.agents.inference:
from videosdk.agents.inference import STT, LLM, TTS, Realtime
CascadingPipeline Example
The CascadingPipeline is ideal for modular, stage-by-stage processing. Here’s an example of building a simple agent using STT, LLM, and TTS via the VideoSDK Inference Gateway:
from videosdk.agents import CascadingPipeline
from videosdk.plugins.silero import SileroVAD  # SileroVAD ships as a separate VideoSDK plugin

pipeline = CascadingPipeline(
stt=STT.sarvam(model_id="saarika:v2.5", language="en-IN"),
llm=LLM.google(model="gemini-2.5-flash"),
tts=TTS.sarvam(model_id="bulbul:v2", speaker="anushka", language="en-IN"),
vad=SileroVAD()
)
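The modularity this buys can be shown with a plain-Python mock (toy stand-ins, not the SDK's classes): each stage is just a value passed to the pipeline, so replacing a provider, say moving STT from Sarvam to Deepgram (both in the supported list), is a one-line change.

```python
# Toy stand-ins for the inference factories; illustrates the swap pattern only.
class Stage:
    def __init__(self, provider: str, **params):
        self.provider = provider
        self.params = params

def build_pipeline(stt: Stage, llm: Stage, tts: Stage) -> dict:
    """A plain dict stands in for CascadingPipeline in this mock."""
    return {"stt": stt, "llm": llm, "tts": tts}

pipeline = build_pipeline(
    stt=Stage("sarvam", model_id="saarika:v2.5", language="en-IN"),
    llm=Stage("google", model="gemini-2.5-flash"),
    tts=Stage("sarvam", model_id="bulbul:v2", speaker="anushka"),
)

# Swap one stage without touching the rest of the pipeline.
pipeline["stt"] = Stage("deepgram", language="en-US")
print(pipeline["stt"].provider)  # deepgram
```

In the real SDK the same idea applies: because each stage is a constructor argument, changing providers means changing one factory call, not restructuring the agent.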
RealTimePipeline Example
For low-latency, fully streaming voice agents, the RealTimePipeline handles Realtime inference with minimal delay. Here’s an example using Gemini Live Native Audio:
from videosdk.agents import RealTimePipeline

pipeline = RealTimePipeline(
model=Realtime.gemini(
model="gemini-2.5-flash-native-audio-preview-12-2025",
voice="Puck",
language_code="en-US",
response_modalities=["AUDIO"],
temperature=0.7
)
)
With this approach, developers retain:
- Full programmatic control over pipeline stages, model parameters, and execution behavior.
- Modular provider replacement, making it easy to swap STT, LLM, or TTS engines.
The result: a fully configurable, production-ready AI voice agent that can be deployed in minutes.
Conclusion
Voice AI is no longer limited by model capability. It’s limited by how fast you can deploy it. With Inferencing in VideoSDK AI Voice Agents, deployment becomes effortless. Whether through the dashboard or programmatically via the SDK, you can build, select, enable, and go live in minutes.
The era of modular, low-latency, real-time voice agents is here. With Inferencing, your ideas move from concept to conversation faster than ever.
Build. Select. Configure. Go live.
Resources and Next Steps
- Read the Inference documentation for more information.
- Learn how to deploy your AI Agents.
- Sign up at VideoSDK Dashboard
- 👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
