Introduction to AI Voice Agents for BPO
What is an AI Voice Agent?
AI Voice Agents are intelligent software systems that can understand, process, and respond to human speech in real time. They leverage automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) technologies to interact with users over the phone or other voice channels.
Why are they important for the BPO industry?
Business Process Outsourcing (BPO) companies handle large volumes of customer interactions. AI Voice Agents help BPOs scale their operations, reduce costs, and provide 24/7 support. They can handle routine queries, process transactions, and escalate complex issues to human agents, all while maintaining a consistent and professional tone.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Natural Language Understanding (NLU/LLM): Interprets the meaning of the text.
- Text-to-Speech (TTS): Converts the agent's response back into natural-sounding speech.
- Voice Activity Detection (VAD) & Turn Detection: Determines when the user is speaking and when it's the agent's turn to respond.
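Conceptually, these components form a cascade in which each stage's output feeds the next. The sketch below illustrates that data flow with trivial stand-in functions; it is not the VideoSDK API (which this tutorial covers later), just the shape of the pipeline.

```python
# Illustrative cascade: each stage's output is the next stage's input.
# The stt/llm/tts arguments are stand-ins for real plugin calls.
def voice_agent_turn(audio_chunk, stt, llm, tts):
    transcript = stt(audio_chunk)   # Speech-to-Text: audio -> text
    reply_text = llm(transcript)    # LLM: text -> response text
    return tts(reply_text)          # Text-to-Speech: response text -> audio

# Trivial stand-ins to show the flow end to end:
fake_stt = lambda audio: "what are your opening hours"
fake_llm = lambda text: f"You asked: '{text}'. We are open 24/7."
fake_tts = lambda text: f"<audio:{text}>"

print(voice_agent_turn(b"...", fake_stt, fake_llm, fake_tts))
```

In the real framework, VAD and turn detection decide *when* this cascade fires; the cascade itself stays the same.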
If you're new to building these systems, the Voice Agent Quick Start Guide provides a step-by-step introduction to get you started quickly.
What You'll Build in This Tutorial
In this tutorial, you'll build a fully functional AI Voice Agent tailored for BPO use cases using the VideoSDK AI Agents framework. You'll learn how to set up your environment, implement the agent, and test it in a real-time voice playground.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent processes audio in a pipeline: user speech is captured, transcribed to text, interpreted by a language model, and then synthesized back to speech for the response. Each component is modular and can be swapped for different plugins. For a detailed explanation of these elements, see the AI Voice Agent core components overview.
Data Flow Sequence
```mermaid
sequenceDiagram
    participant User
    participant AgentSession
    participant CascadingPipeline
    participant DeepgramSTT
    participant OpenAILLM
    participant ElevenLabsTTS
    participant VAD
    participant TurnDetector
    User->>AgentSession: Speaks
    AgentSession->>VAD: Detects speech activity
    VAD->>TurnDetector: Detects turn end
    TurnDetector->>DeepgramSTT: Sends audio for transcription
    DeepgramSTT->>OpenAILLM: Sends transcript
    OpenAILLM->>ElevenLabsTTS: Gets response
    ElevenLabsTTS->>AgentSession: Plays audio response
    AgentSession->>User: Responds
```
The cascading pipeline in AI Voice Agents is central to this process, ensuring seamless integration and flow between each plugin and component.
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class that defines the agent's persona and behavior.
- CascadingPipeline: Manages the flow of audio through STT, LLM, TTS, VAD, and turn detection.
- VAD & TurnDetector: Ensure the agent listens and responds at the right moments, creating a natural conversation flow.
To learn more about managing real-time agent interactions, refer to the AI Voice Agent Sessions documentation.
Setting Up the Development Environment
Prerequisites
- Python 3.11+ (ensure compatibility with VideoSDK agents)
- A VideoSDK Account: Sign up and access your dashboard to obtain API credentials.
Step 1: Create a Virtual Environment
It's best practice to isolate your project dependencies.
```bash
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Step 2: Install Required Packages
Install the VideoSDK AI Agents framework and required plugins.
```bash
pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
```
Step 3: Configure API Keys in a .env File
Create a `.env` file in your project directory and add your API keys.

```env
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```
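The framework and plugins read these values as environment variables. A loader such as python-dotenv is commonly used to pull them in from the file; purely to illustrate what that loading involves, here is a minimal standard-library sketch (not part of the VideoSDK framework):

```python
import os

def parse_env(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def load_env(path: str = ".env") -> None:
    """Load variables from a .env file into os.environ without overwriting."""
    with open(path) as f:
        for key, value in parse_env(f.read()).items():
            os.environ.setdefault(key, value)
```

In practice, prefer `from dotenv import load_dotenv` from the python-dotenv package over rolling your own.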
Building the AI Voice Agent: A Step-by-Step Guide
Let's look at the complete, runnable code for the AI Voice Agent. We'll then break down each section to understand how it works.
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are an efficient and professional AI Voice Agent designed specifically for Business Process Outsourcing (BPO) environments. Your persona is that of a courteous, knowledgeable, and patient customer service representative. Your primary capabilities include: answering customer queries related to products or services, handling basic troubleshooting, processing simple transactions, escalating complex issues to human agents, and providing information about company policies and procedures. You must always maintain a polite and empathetic tone, ensure customer data privacy, and strictly adhere to provided scripts and compliance guidelines. You are not authorized to make decisions outside predefined protocols, provide personal opinions, or handle sensitive financial or legal matters. Always inform the customer when you are escalating their issue to a human agent. If you are unsure or unable to assist, politely suggest that a human representative will follow up. Never collect or store sensitive personal information beyond what is explicitly permitted by company policy."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Now, let's break down each part of the code and explain its function.
Step 4.1: Generating a VideoSDK Meeting ID
Before you can run your agent, you'll need a meeting room where the agent can interact with users. You can generate a meeting ID using the VideoSDK API.
```bash
curl -X POST \
  -H "Authorization: YOUR_VIDEOSDK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"region":"sg001"}' \
  https://api.videosdk.live/v2/rooms
```
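If you prefer to script this step, the same request can be made from Python with only the standard library. This is a sketch based on the curl call above; the `Authorization` header carries your VideoSDK API key, and the `region` value is the one used in the curl example.

```python
import json
import urllib.request

API_ENDPOINT = "https://api.videosdk.live/v2/rooms"

def build_room_request(api_key: str, region: str = "sg001") -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    return urllib.request.Request(
        API_ENDPOINT,
        data=json.dumps({"region": region}).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )

def create_room(api_key: str) -> str:
    """Send the request and return the new room's ID from the JSON response."""
    with urllib.request.urlopen(build_room_request(api_key)) as resp:
        return json.load(resp)["roomId"]
```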
The response will include a `roomId` you can use. For testing, you can let the agent auto-create the room by omitting `room_id` in `RoomOptions`.
Step 4.2: Creating the Custom Agent Class (MyVoiceAgent)
The agent's persona and behavior are defined in a custom class that inherits from `Agent`.

```python
agent_instructions = "You are an efficient and professional AI Voice Agent designed specifically for Business Process Outsourcing (BPO) environments. ..."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```
- The `agent_instructions` string guides the LLM on how the agent should behave.
- `on_enter` and `on_exit` provide greetings and farewells when the session starts and ends.
Step 4.3: Defining the Core Pipeline (CascadingPipeline and plugins)
The pipeline orchestrates the flow of audio and text through the agent.
```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```
- STT: Deepgram's "nova-2" model for English transcription. For more details on integrating this, see the Deepgram STT Plugin documentation.
- LLM: OpenAI's GPT-4o for natural language understanding and response generation. Learn more about configuration in the OpenAI LLM Plugin documentation.
- TTS: ElevenLabs for high-quality voice synthesis. See the ElevenLabs TTS Plugin documentation for setup instructions.
- VAD: SileroVAD detects when the user is speaking. For implementation details, see the Silero Voice Activity Detection documentation.
- TurnDetector: Determines when the user's turn ends, so the agent can respond. Read more in the Turn Detector documentation.
Step 4.4: Managing the Session and Startup Logic
The session brings together the agent, pipeline, and conversation flow, and manages their lifecycle.
```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(...)
    session = AgentSession(agent=agent, pipeline=pipeline, conversation_flow=conversation_flow)
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
- `start_session` initializes and starts the session, keeping it alive until manually stopped.
- `make_context` configures the meeting room and enables the playground for browser-based testing.
- The `__main__` block launches the agent.
If you'd like to experiment with your agent in a browser-based environment, the AI Agent Playground provides an interactive space for real-time testing and iteration.
Running and Testing the Agent
Step 5.1: Running the Python Script
- Ensure your `.env` file is set up with all required API keys.
- Run the agent script: `python main.py`
- The console will display a Playground URL.
Step 5.2: Interacting with the Agent in the Playground
- Open the Playground link in your browser.
- Join as a participant; you can now speak with your AI Voice Agent in real time.
- The agent will greet you and respond to your queries.
- To stop the agent, press Ctrl+C in your terminal. This gracefully shuts down the session and cleans up resources.
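The graceful shutdown works because the keep-alive wait sits inside a `try`/`finally` block: interrupting the process cancels the pending wait, but the `finally` clause still runs the cleanup. This tiny self-contained sketch (with a stand-in cleanup coroutine, not the VideoSDK session API) demonstrates the mechanism:

```python
import asyncio

async def run_forever(cleanup):
    """Wait indefinitely; run cleanup even if the wait is cancelled or interrupted."""
    try:
        await asyncio.Event().wait()  # never set -> waits until cancelled
    finally:
        await cleanup()  # still executes on Ctrl+C / cancellation

# Pressing Ctrl+C raises KeyboardInterrupt inside the event loop, which
# cancels the pending wait; the finally block ensures cleanup always runs.
```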
Advanced Features and Customizations
Extending Functionality with Custom Tools
- You can add custom function tools to the agent for handling specific BPO workflows, such as ticket creation or CRM integration.
- Implement a function and register it with your agent to enable advanced automation.
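As a sketch of the pattern, the tool below opens a support ticket and returns a structured result the LLM can relay to the caller. The function body, the ticket-ID format, and the schema are illustrative stand-ins, and the actual registration mechanism (e.g. a decorator on the agent class) is framework-specific; consult the VideoSDK function-tool documentation for the exact API.

```python
import asyncio

# Hypothetical ticket-creation tool. In production this body would call
# your ticketing system's API; here it fabricates a ticket ID locally.
async def create_support_ticket(customer_id: str, issue_summary: str) -> dict:
    ticket_id = f"TCK-{abs(hash((customer_id, issue_summary))) % 100000:05d}"
    return {"ticket_id": ticket_id, "status": "open", "customer_id": customer_id}

# A JSON-schema description like this is what lets the LLM decide
# when to call the tool and with which arguments:
CREATE_TICKET_SCHEMA = {
    "name": "create_support_ticket",
    "description": "Open a support ticket when the caller reports an unresolved issue.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "issue_summary": {"type": "string"},
        },
        "required": ["customer_id", "issue_summary"],
    },
}

print(asyncio.run(create_support_ticket("CUST-42", "Billing discrepancy")))
```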
Exploring Other Plugins
- STT: Try Cartesia for potentially higher accuracy, or Rime for lower cost.
- TTS: Deepgram offers a cost-effective alternative to ElevenLabs.
- LLM: Experiment with Google Gemini or other supported models.
Troubleshooting Common Issues
API Key and Authentication Errors
- Double-check your `.env` file and ensure all keys are correct and active.
- If you see authentication errors, regenerate your API keys from the dashboard.
Audio Input/Output Problems
- Ensure your microphone and speakers are working and permitted in your browser.
- Test in different browsers if you encounter issues.
Dependency and Version Conflicts
- Use a fresh virtual environment to avoid conflicts.
- Check package versions if you encounter import errors.
Conclusion
Congratulations! You've built a fully functional AI Voice Agent for BPO using the VideoSDK AI Agents framework. You learned how to set up the environment, implement the agent with best-in-class plugins, and test it live.
For next steps, explore advanced function tools, integrate with your BPO systems, and experiment with different plugins to optimize performance and cost. The VideoSDK framework is highly extensible, enabling you to build production-ready AI voice solutions for any BPO workflow.