Introduction to AI Voice Agents for BPO
What is an AI Voice Agent?
AI Voice Agents are intelligent software systems that can understand, process, and respond to human speech in real time. They leverage automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) technologies to interact with users over the phone or other voice channels.
Why are they important for the BPO industry?
Business Process Outsourcing (BPO) companies handle large volumes of customer interactions. AI Voice Agents help BPOs scale their operations, reduce costs, and provide 24/7 support. They can handle routine queries, process transactions, and escalate complex issues to human agents, all while maintaining a consistent and professional tone.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Natural Language Understanding (NLU/LLM): Interprets the meaning of the text.
- Text-to-Speech (TTS): Converts the agent's response back into natural-sounding speech.
- Voice Activity Detection (VAD) & Turn Detection: Determines when the user is speaking and when it's the agent's turn to respond.
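Conceptually, these components form a cascade in which each stage's output feeds the next. The sketch below illustrates that data flow with trivial stand-in functions; it is not the VideoSDK API (which this tutorial covers later), just the shape of the pipeline.

```python
# Illustrative cascade: each stage's output is the next stage's input.
# The stt/llm/tts arguments are stand-ins for real plugin calls.
def voice_agent_turn(audio_chunk, stt, llm, tts):
    transcript = stt(audio_chunk)   # Speech-to-Text: audio -> text
    reply_text = llm(transcript)    # LLM: text -> response text
    return tts(reply_text)          # Text-to-Speech: response text -> audio

# Trivial stand-ins to show the flow end to end:
fake_stt = lambda audio: "what are your opening hours"
fake_llm = lambda text: f"You asked: '{text}'. We are open 24/7."
fake_tts = lambda text: f"<audio:{text}>"

print(voice_agent_turn(b"...", fake_stt, fake_llm, fake_tts))
```

In the real framework, VAD and turn detection decide *when* this cascade fires; the cascade itself stays the same.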
If you're new to building these systems, the Voice Agent Quick Start Guide provides a step-by-step introduction to get you started quickly.
What You'll Build in This Tutorial
In this tutorial, you'll build a fully functional AI Voice Agent tailored for BPO use cases using the VideoSDK AI Agents framework. You'll learn how to set up your environment, implement the agent, and test it in a real-time voice playground.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent processes audio in a pipeline: user speech is captured, transcribed to text, interpreted by a language model, and then synthesized back to speech for the response. Each component is modular and can be swapped for different plugins. For a detailed explanation of these elements, see the AI Voice Agent core components overview.
Data Flow Sequence
```mermaid
sequenceDiagram
    participant User
    participant AgentSession
    participant CascadingPipeline
    participant DeepgramSTT
    participant OpenAILLM
    participant ElevenLabsTTS
    participant VAD
    participant TurnDetector
    User->>AgentSession: Speaks
    AgentSession->>VAD: Detects speech activity
    VAD->>TurnDetector: Detects turn end
    TurnDetector->>DeepgramSTT: Sends audio for transcription
    DeepgramSTT->>OpenAILLM: Sends transcript
    OpenAILLM->>ElevenLabsTTS: Gets response
    ElevenLabsTTS->>AgentSession: Plays audio response
    AgentSession->>User: Responds
```
The cascading pipeline in AI Voice Agents is central to this process, ensuring seamless integration and flow between each plugin and component.
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class that defines the agent's persona and behavior.
- CascadingPipeline: Manages the flow of audio through STT, LLM, TTS, VAD, and turn detection.
- VAD & TurnDetector: Ensure the agent listens and responds at the right moments, creating a natural conversation flow.
To learn more about managing real-time agent interactions, refer to the AI Voice Agent Sessions documentation.
Setting Up the Development Environment
Prerequisites
- Python 3.11+ (ensure compatibility with VideoSDK agents)
- A VideoSDK Account: Sign up and access your dashboard to obtain API credentials.
Step 1: Create a Virtual Environment
It's best practice to isolate your project dependencies.
```bash
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Step 2: Install Required Packages
Install the VideoSDK AI Agents framework and required plugins.
```bash
pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
```
Step 3: Configure API Keys in a .env File
Create a `.env` file in your project directory and add your API keys.

```env
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```
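The framework and plugins read these values as environment variables. A loader such as python-dotenv is commonly used to pull them in from the file; purely to illustrate what that loading involves, here is a minimal standard-library sketch (not part of the VideoSDK framework):

```python
import os

def parse_env(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def load_env(path: str = ".env") -> None:
    """Load variables from a .env file into os.environ without overwriting."""
    with open(path) as f:
        for key, value in parse_env(f.read()).items():
            os.environ.setdefault(key, value)
```

In practice, prefer `from dotenv import load_dotenv` from the python-dotenv package over rolling your own.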
Building the AI Voice Agent: A Step-by-Step Guide
Let's look at the complete, runnable code for the AI Voice Agent. We'll then break down each section to understand how it works.
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are an efficient and professional AI Voice Agent designed specifically for Business Process Outsourcing (BPO) environments. Your persona is that of a courteous, knowledgeable, and patient customer service representative. Your primary capabilities include: answering customer queries related to products or services, handling basic troubleshooting, processing simple transactions, escalating complex issues to human agents, and providing information about company policies and procedures. You must always maintain a polite and empathetic tone, ensure customer data privacy, and strictly adhere to provided scripts and compliance guidelines. You are not authorized to make decisions outside predefined protocols, provide personal opinions, or handle sensitive financial or legal matters. Always inform the customer when you are escalating their issue to a human agent. If you are unsure or unable to assist, politely suggest that a human representative will follow up. Never collect or store sensitive personal information beyond what is explicitly permitted by company policy."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Now, let's break down each part of the code and explain its function.
Step 4.1: Generating a VideoSDK Meeting ID
Before you can run your agent, you'll need a meeting room where the agent can interact with users. You can generate a meeting ID using the VideoSDK API.
```bash
curl -X POST \
  -H "Authorization: YOUR_VIDEOSDK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"region":"sg001"}' \
  https://api.videosdk.live/v2/rooms
```
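If you prefer to script this step, the same request can be made from Python with only the standard library. This is a sketch based on the curl call above; the `Authorization` header carries your VideoSDK API key, and the `region` value is the one used in the curl example.

```python
import json
import urllib.request

API_ENDPOINT = "https://api.videosdk.live/v2/rooms"

def build_room_request(api_key: str, region: str = "sg001") -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    return urllib.request.Request(
        API_ENDPOINT,
        data=json.dumps({"region": region}).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )

def create_room(api_key: str) -> str:
    """Send the request and return the new room's ID from the JSON response."""
    with urllib.request.urlopen(build_room_request(api_key)) as resp:
        return json.load(resp)["roomId"]
```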
The response will include a `roomId` you can use. For testing, you can let the agent auto-create the room by omitting `room_id` in `RoomOptions`.
Step 4.2: Creating the Custom Agent Class (MyVoiceAgent)
The agent's persona and behavior are defined in a custom class that inherits from `Agent`.

```python
agent_instructions = "You are an efficient and professional AI Voice Agent designed specifically for Business Process Outsourcing (BPO) environments. ..."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```
- The `agent_instructions` string guides the LLM on how the agent should behave.
- `on_enter` and `on_exit` provide greetings and farewells when the session starts and ends.
Step 4.3: Defining the Core Pipeline (CascadingPipeline and plugins)
The pipeline orchestrates the flow of audio and text through the agent.
```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```
- STT: Deepgram's "nova-2" model for English transcription. For more details on integrating this, see the Deepgram STT Plugin documentation.
- LLM: OpenAI's GPT-4o for natural language understanding and response generation. Learn more about configuration in the OpenAI LLM Plugin documentation.
- TTS: ElevenLabs for high-quality voice synthesis. See the ElevenLabs TTS Plugin documentation for setup instructions.
- VAD: SileroVAD detects when the user is speaking. For implementation details, see the Silero Voice Activity Detection documentation.
- TurnDetector: Determines when the user's turn ends, so the agent can respond. Read more in the Turn Detector documentation.
Step 4.4: Managing the Session and Startup Logic
The session brings together the agent, pipeline, and conversation flow, and manages their lifecycle.
```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(...)
    session = AgentSession(agent=agent, pipeline=pipeline, conversation_flow=conversation_flow)
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
- `start_session` initializes and starts the session, keeping it alive until manually stopped.
- `make_context` configures the meeting room and enables the playground for browser-based testing.
- The `__main__` block launches the agent.
If you'd like to experiment with your agent in a browser-based environment, the AI Agent Playground provides an interactive space for real-time testing and iteration.
Running and Testing the Agent
Step 5.1: Running the Python Script
- Ensure your `.env` file is set up with all required API keys.
- Run the agent script: `python main.py`
- The console will display a Playground URL.
Step 5.2: Interacting with the Agent in the Playground
- Open the Playground link in your browser.
- Join as a participant; you can now speak with your AI Voice Agent in real time.
- The agent will greet you and respond to your queries.
- To stop the agent, press Ctrl+C in your terminal. This gracefully shuts down the session and cleans up resources.
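The graceful shutdown works because the keep-alive wait sits inside a `try`/`finally` block: interrupting the process cancels the pending wait, but the `finally` clause still runs the cleanup. This tiny self-contained sketch (with a stand-in cleanup coroutine, not the VideoSDK session API) demonstrates the mechanism:

```python
import asyncio

async def run_forever(cleanup):
    """Wait indefinitely; run cleanup even if the wait is cancelled or interrupted."""
    try:
        await asyncio.Event().wait()  # never set -> waits until cancelled
    finally:
        await cleanup()  # still executes on Ctrl+C / cancellation

# Pressing Ctrl+C raises KeyboardInterrupt inside the event loop, which
# cancels the pending wait; the finally block ensures cleanup always runs.
```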
Advanced Features and Customizations
Extending Functionality with Custom Tools
- You can add custom function tools to the agent for handling specific BPO workflows, such as ticket creation or CRM integration.
- Implement a function and register it with your agent to enable advanced automation.
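As a sketch of the pattern, the tool below opens a support ticket and returns a structured result the LLM can relay to the caller. The function body, the ticket-ID format, and the schema are illustrative stand-ins, and the actual registration mechanism (e.g. a decorator on the agent class) is framework-specific; consult the VideoSDK function-tool documentation for the exact API.

```python
import asyncio

# Hypothetical ticket-creation tool. In production this body would call
# your ticketing system's API; here it fabricates a ticket ID locally.
async def create_support_ticket(customer_id: str, issue_summary: str) -> dict:
    ticket_id = f"TCK-{abs(hash((customer_id, issue_summary))) % 100000:05d}"
    return {"ticket_id": ticket_id, "status": "open", "customer_id": customer_id}

# A JSON-schema description like this is what lets the LLM decide
# when to call the tool and with which arguments:
CREATE_TICKET_SCHEMA = {
    "name": "create_support_ticket",
    "description": "Open a support ticket when the caller reports an unresolved issue.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "issue_summary": {"type": "string"},
        },
        "required": ["customer_id", "issue_summary"],
    },
}

print(asyncio.run(create_support_ticket("CUST-42", "Billing discrepancy")))
```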
Exploring Other Plugins
- STT: Try Cartesia for potentially higher accuracy, or Rime for lower cost.
- TTS: Deepgram offers a cost-effective alternative to ElevenLabs.
- LLM: Experiment with Google Gemini or other supported models.
Troubleshooting Common Issues
API Key and Authentication Errors
- Double-check your `.env` file and ensure all keys are correct and active.
- If you see authentication errors, regenerate your API keys from the dashboard.
Audio Input/Output Problems
- Ensure your microphone and speakers are working and permitted in your browser.
- Test in different browsers if you encounter issues.
Dependency and Version Conflicts
- Use a fresh virtual environment to avoid conflicts.
- Check package versions if you encounter import errors.
Conclusion
Congratulations! You've built a fully functional AI Voice Agent for BPO using the VideoSDK AI Agents framework. You learned how to set up the environment, implement the agent with best-in-class plugins, and test it live.
For next steps, explore advanced function tools, integrate with your BPO systems, and experiment with different plugins to optimize performance and cost. The VideoSDK framework is highly extensible, enabling you to build production-ready AI voice solutions for any BPO workflow.