Introduction to AI Voice Agents in Streaming Audio Generation
What is an AI Voice Agent?
An AI Voice Agent is a software entity that interacts with users through voice. It processes spoken language, understands the intent, and responds appropriately, typically using natural language processing (NLP) techniques. These agents are increasingly prevalent across industries, including customer service, smart home devices, and, more recently, audio streaming.
Why are they important for the streaming audio generation industry?
In the streaming audio generation industry, AI Voice Agents can enhance user experience by providing real-time assistance, automating tasks, and offering personalized content recommendations. They can also help in setting up streaming configurations, troubleshooting issues, and providing insights into audio technologies.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Large Language Model (LLM): Processes the transcribed text to understand intent and generate an appropriate response (see the conceptual sketch after this list).
- Text-to-Speech (TTS): Converts text responses back into spoken language.
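These three stages run in a loop for every user turn. The sketch below is purely conceptual and not VideoSDK-specific; each placeholder function stands in for a real provider call (for example Deepgram, GPT-4o, or ElevenLabs) that the pipeline you build later in this tutorial wires up for you.

```python
# Conceptual sketch of the cascaded STT -> LLM -> TTS loop a voice agent runs
# for each user turn. All three functions are placeholders, not real SDK calls.

def speech_to_text(audio: bytes) -> str:
    """Placeholder: a real STT provider (e.g. Deepgram) would transcribe the audio here."""
    return "what is adaptive bitrate streaming?"

def generate_reply(transcript: str) -> str:
    """Placeholder: a real LLM (e.g. GPT-4o) would reason over the transcript here."""
    return f"You asked about: {transcript}"

def text_to_speech(reply: str) -> bytes:
    """Placeholder: a real TTS provider (e.g. ElevenLabs) would synthesize speech here."""
    return reply.encode("utf-8")

def handle_turn(user_audio: bytes) -> bytes:
    transcript = speech_to_text(user_audio)  # 1. STT: audio in, text out
    reply = generate_reply(transcript)       # 2. LLM: text in, text out
    return text_to_speech(reply)             # 3. TTS: text in, audio out

if __name__ == "__main__":
    print(handle_turn(b"\x00\x01"))
```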
What You'll Build in This Tutorial
In this tutorial, you will build a fully functional AI Voice Agent using the VideoSDK framework. The agent will be capable of generating high-quality audio streams in real time, explaining audio streaming processes, and assisting users with setup and troubleshooting.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent you'll build follows a structured data flow: it captures user speech, processes it through a series of components, and generates a spoken response. This flow is managed by the VideoSDK framework, which provides the necessary tools and plugins.

Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, responsible for handling user interactions.
- CascadingPipeline: Manages the flow of audio processing, linking the STT, LLM, and TTS components.
- VAD & Turn Detector: These components help the agent determine when to listen and when to speak, ensuring smooth turn-taking.
Setting Up the Development Environment
Prerequisites
Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account, which you can create at app.videosdk.live.
Step 1: Create a Virtual Environment
To keep your project dependencies isolated, create a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
Step 2: Install Required Packages
Install the necessary packages using pip:
```bash
pip install videosdk-agents videosdk-plugins
```
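The code in this tutorial also imports provider-specific plugins (Silero, Turn Detector, Deepgram, OpenAI, ElevenLabs). Depending on how the plugin distribution is packaged, these may ship as separate packages; the package names below are assumptions inferred from the import paths used later, so check the VideoSDK docs or PyPI if an import fails.

```bash
# Assumed per-provider plugin packages (names inferred from the import paths);
# skip any that are already bundled with videosdk-plugins.
pip install videosdk-plugins-silero videosdk-plugins-turn-detector \
            videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
```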
Step 3: Configure API Keys in a .env file
Create a .env file in your project directory and add your VideoSDK API key:

```
VIDEOSDK_API_KEY=your_api_key_here
```
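The Deepgram, OpenAI, and ElevenLabs plugins will also need their own credentials. By convention these are read from environment variables such as DEEPGRAM_API_KEY, OPENAI_API_KEY, and ELEVENLABS_API_KEY; the exact variable names are an assumption here, so confirm them in each plugin's documentation. Below is a minimal sketch for loading the .env file and failing fast if a key is missing, assuming the python-dotenv package (pip install python-dotenv).

```python
# Minimal sketch: load .env and verify the expected keys are present.
# The provider variable names are conventional assumptions; the VideoSDK
# plugins may also read them from the environment automatically.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

required = ["VIDEOSDK_API_KEY", "DEEPGRAM_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY"]
missing = [key for key in required if not os.getenv(key)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```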
Building the AI Voice Agent: A Step-by-Step Guide
Complete Code Block
Here is the complete code to set up your AI Voice Agent using the VideoSDK framework:
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session starts without delay
pre_download_model()

agent_instructions = """
{
  "persona": "Innovative Audio Streaming Specialist",
  "capabilities": [
    "Generate high-quality audio streams in real-time based on user input.",
    "Provide detailed explanations of audio streaming processes and technologies.",
    "Assist users in setting up and optimizing their audio streaming setups.",
    "Offer troubleshooting advice for common audio streaming issues."
  ],
  "constraints": [
    "You are not a certified audio engineer and should advise users to consult professionals for complex technical issues.",
    "You cannot provide legal advice regarding audio content rights and licensing.",
    "Ensure user privacy by not storing or sharing any personal data."
  ]
}
"""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the cascading pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Step 4.1: Generating a VideoSDK Meeting ID
To generate a meeting ID, use the following curl command:

```bash
curl -X POST "https://api.videosdk.live/v1/rooms" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"My Meeting Room"}'
```

This command returns a JSON response containing the meeting ID, which you can use to join or create sessions.
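If you prefer to create rooms from Python, the snippet below mirrors the same request using the requests library (pip install requests). It reuses the endpoint and headers from the curl command above; the roomId field name in the response is an assumption, so inspect the actual JSON if it differs.

```python
# Python equivalent of the curl call above, assuming the same endpoint and headers.
import os
import requests

response = requests.post(
    "https://api.videosdk.live/v1/rooms",
    headers={
        "Authorization": f"Bearer {os.environ['VIDEOSDK_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={"name": "My Meeting Room"},
)
response.raise_for_status()
print(response.json().get("roomId"))  # field name is an assumption; print response.json() to check
```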
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class is where you define the behavior of your AI Voice Agent. It inherits from the Agent class and uses agent_instructions to guide its interactions.

```python
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is crucial because it defines how audio is processed. It links the STT, LLM, TTS, VAD, and TurnDetector plugins to create a seamless interaction flow.

```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```
Step 4.4: Managing the Session and Startup Logic
The start_session function manages the lifecycle of the agent's session, ensuring it connects, starts, and shuts down gracefully.

```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
```
The make_context function defines the room options and prepares the job context for the agent.

```python
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)
```

The main block initializes and starts the agent job.
```python
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Running and Testing the Agent
Step 5.1: Running the Python Script
To run your agent, execute the script:
```bash
python main.py
```
Step 5.2: Interacting with the Agent in the AI Agent Playground
Once the script is running, you will see a playground link in the console. Open this link in your browser to interact with your agent. Speak into your microphone, and the agent will process your input and respond accordingly.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can enhance your agent by integrating custom tools and plugins. The VideoSDK framework supports various plugins for STT, LLM, and TTS, allowing you to tailor the agent's capabilities to your needs.
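For example, you might expose a Python function as a tool the LLM can call during a conversation. The function_tool decorator and the tools parameter shown below are assumptions, so check the VideoSDK agents documentation for the exact names and registration mechanism.

```python
# Hypothetical custom tool. The function_tool decorator and the tools= parameter
# are assumptions; verify the exact API in the VideoSDK agents documentation.
from videosdk.agents import Agent, function_tool

@function_tool
async def get_stream_status(stream_id: str) -> dict:
    """Report the health of an audio stream so the agent can answer questions about it."""
    # Replace this stub with a real lookup against your streaming backend.
    return {"stream_id": stream_id, "status": "live", "bitrate_kbps": 128}

class StreamingToolsAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=agent_instructions,
            tools=[get_stream_status],  # assumed keyword for registering tools
        )
```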
Exploring Other Plugins
Consider experimenting with different plugins for STT, LLM, and TTS to optimize performance and quality. VideoSDK offers a range of options, including Cartesia, Deepgram, and ElevenLabs.
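Swapping a stage is usually a one-line change to the pipeline. The Cartesia import path and class name below are assumptions based on the naming pattern of the plugins used earlier, so verify them against the VideoSDK plugin docs before relying on them.

```python
# Illustrative TTS swap; only the tts= line changes versus the original pipeline.
# The cartesia plugin import path and CartesiaTTS class name are assumptions.
from videosdk.plugins.cartesia import CartesiaTTS

pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=CartesiaTTS(),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```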
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly set in the .env file. Check for any typos or missing keys.
Audio Input/Output Problems
Verify your microphone and speaker settings. Ensure the correct devices are selected in your system settings.
Dependency and Version Conflicts
If you encounter dependency issues, ensure all packages are up-to-date and compatible with Python 3.11+.
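A quick sanity check before digging deeper:

```bash
python --version                 # should report 3.11 or newer
pip list | grep -i videosdk      # confirm which videosdk packages and versions are installed
pip install --upgrade videosdk-agents videosdk-plugins
```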
Conclusion
Summary of What You've Built
In this tutorial, you created a fully functional AI Voice Agent capable of real-time audio streaming and interaction. You learned how to set up the development environment, build the agent, and test it in a playground environment.
Next Steps and Further Learning
Explore additional plugins and customizations to enhance your agent's capabilities. Consider integrating more advanced features or deploying your AI Voice Agent in a production environment.