Introduction to AI Voice Agents for Businesses
AI voice agents are systems that converse with users in natural language. Businesses use them to automate customer service, scheduling, and routine information requests. A voice agent has three core components: Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). In this tutorial, you will build a business-focused AI voice agent using the VideoSDK framework.
Architecture and Core Concepts
AI voice agents turn user speech into a spoken response through a chain of stages. For a comprehensive understanding of these stages, refer to the AI voice Agent core components overview. At a high level, audio flows from STT, which transcribes the speech, to the LLM, which generates a reply, to TTS, which speaks it.
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class that represents your bot.
- CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS. Learn more about the Cascading pipeline in AI voice Agents.
- VAD & TurnDetector: Ensure the agent listens and responds at appropriate times. Discover more about the Turn detector for AI voice Agents.
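To make the STT to LLM to TTS flow concrete, here is a minimal sketch of a cascade built from plain Python functions. It is not VideoSDK's implementation: `ToyCascade`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins that pass bytes and strings around so the ordering of the stages is easy to see.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyCascade:
    """Hypothetical stand-in for a cascading pipeline (not the VideoSDK class)."""
    stt: Callable[[bytes], str]   # audio -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio

    def respond(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)   # stage 1: speech to text
        reply = self.llm(transcript)   # stage 2: generate a reply
        return self.tts(reply)         # stage 3: text to speech


# Fake stages: "audio" is just UTF-8 text so the example runs anywhere.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


cascade = ToyCascade(stt=fake_stt, llm=fake_llm, tts=fake_tts)
result = cascade.respond(b"book a meeting")
print(result)
```

Because each stage is just a callable, swapping one provider for another only changes the function passed in, which is the same idea behind choosing plugins for the real pipeline.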
Setting Up the Development Environment
Prerequisites
Ensure you have Python 3.11+ and a VideoSDK account at app.videosdk.live.
Step 1: Create a Virtual Environment
Create a virtual environment to manage your project dependencies:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
Step 2: Install Required Packages
Install the necessary packages using pip:
pip install videosdk-agents videosdk-plugins
Step 3: Configure API Keys
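Create a .env file in your project directory and add your API keys:
VIDEOSDK_API_KEY=your_api_key_here
The Deepgram, OpenAI, and ElevenLabs plugins used later in this tutorial need credentials for those services as well; check each plugin's documentation for the variable name it expects.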
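A quick way to catch a missing or empty key before starting the agent is to parse the file and check it yourself. The sketch below uses only the standard library; `parse_env_text` and `missing_keys` are hypothetical helper names, and the parser handles only simple `KEY=value` lines. In a real project a package such as python-dotenv is a common choice for loading `.env` files.

```python
def parse_env_text(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required: list, env: dict) -> list:
    """Return the required keys that are absent or empty in env."""
    return [key for key in required if not env.get(key, "").strip()]


# Example with inline text; in practice, read the contents of your .env file.
sample = "# credentials\nVIDEOSDK_API_KEY=your_api_key_here\n"
env = parse_env_text(sample)
print(missing_keys(["VIDEOSDK_API_KEY"], env))
```

An empty list means every required key has a value. Extend the `required` list with whatever variable names your chosen plugins document.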
Building the AI Voice Agent: A Step-by-Step Guide
Here is the complete code to build your AI Voice Agent:
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Fetch the turn-detector model up front so the first session does not wait on a download.
pre_download_model()

agent_instructions = "You are a professional and efficient voice agent designed specifically for businesses. Your primary role is to assist business clients by providing accurate information, scheduling meetings, and managing inquiries related to business operations. You can handle a wide range of business-related queries, including product information, service details, and customer support. However, you must always maintain a professional tone and ensure that all interactions are conducted with the utmost respect and confidentiality. You are not authorized to provide financial advice or make business decisions on behalf of the company. Always remind users to consult with a qualified professional for any financial or strategic business decisions. Your goal is to enhance business efficiency and customer satisfaction through seamless voice interactions."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Greet the user when the agent joins the session.
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Say goodbye when the agent leaves the session.
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Speech-to-text, language model, and text-to-speech, plus VAD and turn detection.
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session alive until the process is stopped.
        await asyncio.Event().wait()
    finally:
        # Always release the session and the connection, even on errors.
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Step 4.1: Generating a VideoSDK Meeting ID
To interact with your agent, you need a meeting ID. Use the following curl command to generate one:
curl -X POST https://api.videosdk.live/v1/meetings -H "Authorization: Bearer YOUR_API_KEY"
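If you would rather make this call from Python, the same request can be assembled with the standard library. The sketch below only builds the request object, using the URL and header from the curl command above; `build_meeting_request` is a hypothetical helper name, and the shape of the response is not covered here, so consult the API documentation before parsing it.

```python
import urllib.request


def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build the POST request from the curl command (does not send it)."""
    return urllib.request.Request(
        "https://api.videosdk.live/v1/meetings",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


req = build_meeting_request("YOUR_API_KEY")
print(req.get_method(), req.full_url)

# To actually send it (requires network access and a valid key):
# with urllib.request.urlopen(req) as response:
#     body = response.read().decode("utf-8")
```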
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class, providing custom behavior for entering and exiting interactions:
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
Step 4.3: Defining the Core Pipeline
The CascadingPipeline orchestrates the flow of data through various plugins, including the Deepgram STT Plugin for voice agent, OpenAI LLM Plugin for voice agent, and ElevenLabs TTS Plugin for voice agent:
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
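The `threshold` arguments control how readily the agent decides that someone is speaking or has finished. As a rough illustration of thresholding in voice activity detection, the sketch below groups per-frame speech probabilities into speech segments. It assumes the threshold is a cutoff on a per-frame speech probability; `speech_segments` is a hypothetical helper, and the exact semantics inside the SileroVAD and TurnDetector plugins may differ.

```python
def speech_segments(probabilities, threshold=0.35):
    """Return (start, end) frame-index pairs where probability >= threshold.

    `end` is exclusive. Assumes one speech probability per audio frame.
    """
    segments = []
    start = None
    for index, probability in enumerate(probabilities):
        if probability >= threshold and start is None:
            start = index                      # speech begins
        elif probability < threshold and start is not None:
            segments.append((start, index))    # speech ends
            start = None
    if start is not None:                      # still speaking at the end
        segments.append((start, len(probabilities)))
    return segments


frames = [0.1, 0.4, 0.9, 0.2, 0.36, 0.3]
print(speech_segments(frames, threshold=0.35))
print(speech_segments(frames, threshold=0.8))
```

A lower threshold picks up quieter or less certain speech at the cost of more false triggers, while a higher one is stricter. Tune the values against recordings from your own environment.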
Step 4.4: Managing the Session and Startup Logic
The start_session function initializes and manages the agent session, as detailed in AI voice Agent Sessions:
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # `pipeline` is the CascadingPipeline from Step 4.3; the complete code builds it at this point.
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
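The `try`/`finally` around `asyncio.Event().wait()` is what keeps the agent running and still guarantees cleanup. The event is never set, so the coroutine blocks until the task is cancelled, and the `finally` block runs on the way out. Here is a minimal sketch of that pattern using only the standard library; `run_until_cancelled` and `demo` are hypothetical names, and a list stands in for the real session and connection.

```python
import asyncio


async def run_until_cancelled(log: list) -> None:
    log.append("start")
    try:
        # The event is never set, so this waits until the task is cancelled.
        await asyncio.Event().wait()
    finally:
        # Runs on cancellation or error; this is where resources get released.
        log.append("cleanup")


async def demo() -> list:
    log = []
    task = asyncio.create_task(run_until_cancelled(log))
    await asyncio.sleep(0)      # yield once so the task starts and begins waiting
    task.cancel()               # simulate the process being stopped
    try:
        await task
    except asyncio.CancelledError:
        pass                    # expected: the task was cancelled on purpose
    return log


events = asyncio.run(demo())
print(events)
```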
The make_context function sets up the job context with room options:
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)
Running and Testing the Agent
Step 5.1: Running the Python Script
Save the complete code as main.py and execute it with:
python main.py
Step 5.2: Interacting with the Agent in the Playground
Once the script is running, find the playground link in the console output. Join the session and interact with your agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
Enhance your agent by integrating custom tools using the function_tool concept.
Exploring Other Plugins
Explore alternative STT, LLM, and TTS plugins to tailor the agent to your needs. For a quick setup, refer to the Voice Agent Quick Start Guide.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the .env file.
Audio Input/Output Problems
Check your audio device settings and ensure they are correctly configured.
Dependency and Version Conflicts
Verify that all dependencies are installed with compatible versions.
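One way to see what is actually installed is to query package metadata from Python. The sketch below uses the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper, and the package names are the ones from the install step, so adjust the list to match what you installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["videosdk-agents", "videosdk-plugins"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside the activated virtual environment confirms that the packages were installed there rather than into a different Python interpreter.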
Conclusion
Summary of What You've Built
You have built a business-focused AI voice agent with VideoSDK that transcribes speech with Deepgram STT, generates replies with an OpenAI LLM, and speaks them with ElevenLabs TTS.
Next Steps and Further Learning
Explore additional features and plugins to enhance your agent's capabilities. For more guidance, revisit the Voice Agent Quick Start Guide.
FAQ