Introduction to AI Voice Agents in Voice Bot Builder
In today's fast-evolving technological landscape, AI voice agents have become pivotal in enhancing user interaction across various platforms. These agents are designed to interpret human speech, process it, and respond in a way that simulates human-like conversation. This tutorial will guide you through building a voice bot using the VideoSDK framework, a powerful tool for creating interactive voice applications.
What is an AI Voice Agent?
An AI voice agent is a software entity capable of understanding and responding to human speech. It utilizes technologies like Speech-to-Text (STT), Text-to-Speech (TTS), and Large Language Models (LLM) to process input and generate responses. These agents are increasingly used in customer service, virtual assistants, and automated systems.
Why are they important for the Voice Bot Builder industry?
In the voice bot builder industry, AI voice agents provide a seamless way to automate interactions, improve customer service, and enhance user experience. They can handle a variety of tasks such as answering queries, providing information, and even executing commands, making them invaluable for building efficient and responsive voice applications.
Core Components of a Voice Agent
The core components of a voice agent include:
- STT (Speech-to-Text): Converts spoken language into text.
- TTS (Text-to-Speech): Converts text back into spoken language.
- LLM (Large Language Model): Processes the text to understand the request and generate a response.
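To see how these three components hand data to one another, here is a toy, self-contained sketch of one conversational turn. Every function below is a hypothetical stub standing in for a real service, not a VideoSDK API:

```python
# Conceptual sketch of one turn through a cascading voice pipeline.
# All three stage functions are hypothetical stubs, not real VideoSDK calls.

def speech_to_text(audio: bytes) -> str:
    """STT stage: transcribe raw audio into text (stubbed here)."""
    return "what is the weather today"

def generate_reply(transcript: str) -> str:
    """LLM stage: turn the transcript into a response (stubbed here)."""
    return f"You asked: {transcript!r}. Let me check."

def text_to_speech(text: str) -> bytes:
    """TTS stage: synthesize audio from text (stubbed as plain bytes)."""
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    transcript = speech_to_text(audio)   # STT: audio -> text
    reply = generate_reply(transcript)   # LLM: text -> text
    return text_to_speech(reply)         # TTS: text -> audio

audio_out = handle_turn(b"\x00\x01")
print(audio_out)
```

In a real agent each stage is a streaming service with its own latency budget, but the data flow is exactly this chain.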
What You'll Build in This Tutorial
In this tutorial, you'll learn how to build a simple yet effective voice bot using the VideoSDK framework. We'll cover everything from setting up your development environment to running and testing your voice agent.
Architecture and Core Concepts
Understanding the architecture and core concepts of AI voice agents is crucial for building effective applications. For a comprehensive understanding, refer to the AI voice Agent core components overview.
High-Level Architecture Overview
The architecture of a voice bot involves several key components working together to process and respond to user input. At a high level, each turn flows as: user speech → VAD detects speech → STT transcribes audio to text → LLM generates a response → TTS synthesizes audio → playback to the user.
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot. It handles interactions and manages the conversation flow.
- CascadingPipeline: The audio-processing flow, where each component (STT, LLM, TTS) plays a critical role in transforming and understanding the user's input. Learn more about the Cascading pipeline in AI voice Agents.
- VAD & TurnDetector: These components help the agent know when to listen and when to respond, ensuring smooth and natural interactions. Discover more about the Turn detector for AI voice Agents.
Setting Up the Development Environment
Before we dive into building the voice agent, let's set up the necessary development environment.
Prerequisites
To get started, ensure you have the following:
- Python 3.11+
- A VideoSDK account, which you can create at app.videosdk.live
Step 1: Create a Virtual Environment
Creating a virtual environment helps manage dependencies and avoid conflicts. Run the following commands:
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
Step 2: Install Required Packages
Install the necessary packages using pip:
pip install videosdk
Step 3: Configure API Keys in a .env file
Create a .env file in your project directory and add your VideoSDK API keys:
VIDEOSDK_API_KEY=your_api_key_here
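Before launching the agent, it is worth confirming the key is actually picked up. The sketch below is a minimal stdlib-only loader that assumes simple KEY=value lines; in practice a library such as python-dotenv does the same job with full quoting rules:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Parse simple KEY=value lines from a .env file into os.environ.

    Minimal sketch: no quoting or escaping rules, existing variables win.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env()
if os.environ.get("VIDEOSDK_API_KEY"):
    print("VIDEOSDK_API_KEY loaded")
else:
    print("VIDEOSDK_API_KEY is missing - check your .env file")
```

Running this once at startup gives a clear error message instead of a failed API call deep inside the session.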
Building the AI Voice Agent: A Step-by-Step Guide
Let's build the AI voice agent using the complete code provided below. We’ll then break it down to understand each part.
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Pre-downloading the Turn Detector model
pre_download_model()

agent_instructions = "You are a 'voice bot builder' assistant designed to help users create and deploy voice bots efficiently using the VideoSDK framework. Your persona is that of a knowledgeable and supportive tech guide. Your capabilities include providing step-by-step guidance on setting up a voice bot, explaining the features of the VideoSDK framework, and offering troubleshooting tips for common issues. You can also suggest best practices for optimizing voice bot performance and user engagement. However, you are not a substitute for professional technical support, and users should be directed to consult VideoSDK's official documentation or support team for complex technical issues. Additionally, you must remind users to comply with privacy laws and regulations when deploying voice bots."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self): await self.session.say("Hello! How can I help?")

    async def on_exit(self): await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Step 4.1: Generating a VideoSDK Meeting ID
To interact with your voice bot, you'll need a meeting ID. You can generate one using the following curl command:
curl -X POST \
  https://api.videosdk.live/v1/meetings \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json'
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class is where you define the behavior of your voice agent. It inherits from the Agent class and implements methods like on_enter and on_exit to handle session events.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self): await self.session.say("Hello! How can I help?")

    async def on_exit(self): await self.session.say("Goodbye!")
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is central to processing user input. It chains together components for STT, LLM, and TTS, allowing seamless conversion from speech to text and back to speech.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
Step 4.4: Managing the Session and Startup Logic
The start_session function initializes the agent session and manages its lifecycle. It connects the session, starts it, and keeps it running until manually terminated. For more details on managing sessions, refer to AI voice Agent Sessions.
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()
The make_context function sets up the job context, which includes options for the room where the agent will operate.
def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)
Finally, the main entry point starts the job using the WorkerJob class.
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Running and Testing the Agent
Once you've built your agent, it's time to test it.
Step 5.1: Running the Python Script
Run your script with the following command:
python main.py
Step 5.2: Interacting with the Agent in the Playground
After starting the script, you'll find a playground link in the console. Use this link to join the session and interact with your voice bot.
Advanced Features and Customizations
Extending Functionality with Custom Tools
The VideoSDK framework allows you to extend your voice bot's functionality by integrating custom tools. This enables you to add specialized features tailored to your application's needs.
Exploring Other Plugins
While this tutorial uses specific plugins for STT, LLM, and TTS, the VideoSDK framework supports a variety of other options. Explore these to find the best fit for your project.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the .env file. Double-check the authorization headers in your requests.
Audio Input/Output Problems
Verify your microphone and speaker settings. Ensure they are properly configured and accessible by the application.
Dependency and Version Conflicts
Ensure all dependencies are installed in the virtual environment. Check for version conflicts and resolve them by updating or downgrading packages as needed.
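pip itself can surface version conflicts directly; inside the activated virtual environment, these commands are a quick first check:

```shell
# Show installed packages and versions (compare against your requirements)
python3 -m pip list

# Report unsatisfied or conflicting dependencies; exits non-zero if any are found
python3 -m pip check || echo "conflicts found - see messages above"
```

`pip check` pinpoints which package pair disagrees, so you know exactly what to upgrade or downgrade.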
Conclusion
Summary of What You've Built
Congratulations! You've successfully built a basic AI voice agent using the VideoSDK framework. You've learned how to set up the development environment, create the agent, and test it in a live session.
Next Steps and Further Learning
To further enhance your voice bot, consider exploring additional plugins and custom tools. Continue learning by diving into the VideoSDK documentation and experimenting with more advanced features. For deployment guidance, refer to AI voice Agent deployment.