Build an AI Voice Agent for BPO with VideoSDK

Comprehensive tutorial to build, test, and customize an AI Voice Agent for BPO using VideoSDK and Python. Includes full code, setup, and troubleshooting.

Introduction to AI Voice Agents for BPO

What is an AI Voice Agent?

AI Voice Agents are intelligent software systems that can understand, process, and respond to human speech in real time. They leverage automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) technologies to interact with users over the phone or other voice channels.

Why are they important for the BPO industry?

Business Process Outsourcing (BPO) companies handle large volumes of customer interactions. AI Voice Agents help BPOs scale their operations, reduce costs, and provide 24/7 support. They can handle routine queries, process transactions, and escalate complex issues to human agents, all while maintaining a consistent and professional tone.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Natural Language Understanding (NLU/LLM): Interprets the meaning of the text.
  • Text-to-Speech (TTS): Converts the agent's response back into natural-sounding speech.
  • Voice Activity Detection (VAD) & Turn Detection: Determines when the user is speaking and when it's the agent's turn to respond.
If you're new to building these systems, the Voice Agent Quick Start Guide provides a step-by-step introduction to get you started quickly.

What You'll Build in This Tutorial

In this tutorial, you'll build a fully functional AI Voice Agent tailored for BPO use cases using the VideoSDK AI Agents framework. You'll learn how to set up your environment, implement the agent, and test it in a real-time voice playground.

Architecture and Core Concepts

High-Level Architecture Overview

The AI Voice Agent processes audio in a pipeline: user speech is captured, transcribed to text, interpreted by a language model, and then synthesized back to speech for the response. Each component is modular and can be swapped for different plugins. For a detailed explanation of these elements, see the AI voice Agent core components overview.

Data Flow Sequence

sequenceDiagram
    participant User
    participant AgentSession
    participant CascadingPipeline
    participant DeepgramSTT
    participant OpenAILLM
    participant ElevenLabsTTS
    participant VAD
    participant TurnDetector
    User->>AgentSession: Speaks
    AgentSession->>VAD: Detects speech activity
    VAD->>TurnDetector: Detects turn end
    TurnDetector->>DeepgramSTT: Sends audio for transcription
    DeepgramSTT->>OpenAILLM: Sends transcript
    OpenAILLM->>ElevenLabsTTS: Sends response text
    ElevenLabsTTS->>AgentSession: Plays audio response
    AgentSession->>User: Responds
The Cascading pipeline in AI voice Agents is central to this process, ensuring seamless integration and flow between each plugin and component.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class that defines the agent's persona and behavior.
  • CascadingPipeline: Manages the flow of audio through STT, LLM, TTS, VAD, and turn detection.
  • VAD & TurnDetector: Ensure the agent listens and responds at the right moments, creating a natural conversation flow.
To learn more about managing real-time agent interactions, refer to the AI voice Agent Sessions documentation.

Setting Up the Development Environment

Prerequisites

  • Python 3.11+ (ensure compatibility with VideoSDK agents)
  • A VideoSDK Account: Sign up and access your dashboard to obtain API credentials.

Step 1: Create a Virtual Environment

It's best practice to isolate your project dependencies.

python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 2: Install Required Packages

Install the VideoSDK AI Agents framework and the required plugins.

pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
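To confirm that everything installed correctly, a quick import check from the Python interpreter is enough. This is a minimal sketch that simply imports the packages installed above:

# Sanity check: import the framework and each plugin installed above.
# An ImportError here means the corresponding package is missing from your virtual environment.
import videosdk.agents
import videosdk.plugins.silero
import videosdk.plugins.turn_detector
import videosdk.plugins.deepgram
import videosdk.plugins.openai
import videosdk.plugins.elevenlabs

print("VideoSDK agent framework and plugins imported successfully.")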

Step 3: Configure API Keys in a .env File

Create a .env file in your project directory and add your API keys.

VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
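The framework and plugins typically pick these values up from environment variables. If you would rather load the .env file explicitly from Python instead of exporting the variables in your shell, a minimal sketch using python-dotenv (an extra dependency: pip install python-dotenv) looks like this:

# Minimal sketch: load the .env file and fail fast if a key is missing.
# Assumes python-dotenv is installed; run this before starting the agent.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("VIDEOSDK_API_KEY", "DEEPGRAM_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"Missing environment variable: {key}")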

Building the AI Voice Agent: A Step-by-Step Guide

Let's look at the complete, runnable code for the AI Voice Agent. We'll then break down each section to understand how it works.
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session doesn't block on it
pre_download_model()

agent_instructions = "You are an efficient and professional AI Voice Agent designed specifically for Business Process Outsourcing (BPO) environments. Your persona is that of a courteous, knowledgeable, and patient customer service representative. Your primary capabilities include: answering customer queries related to products or services, handling basic troubleshooting, processing simple transactions, escalating complex issues to human agents, and providing information about company policies and procedures. You must always maintain a polite and empathetic tone, ensure customer data privacy, and strictly adhere to provided scripts and compliance guidelines. You are not authorized to make decisions outside predefined protocols, provide personal opinions, or handle sensitive financial or legal matters. Always inform the customer when you are escalating their issue to a human agent. If you are unsure or unable to assist, politely suggest that a human representative will follow up. Never collect or store sensitive personal information beyond what is explicitly permitted by company policy."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the cascading pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Now, let's break down each part of the code and explain its function.

Step 4.1: Generating a VideoSDK Meeting ID

Before you can run your agent, you'll need a meeting room where the agent can interact with users. You can generate a meeting ID using the VideoSDK API.
curl -X POST \
  -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"region":"sg001"}' \
  https://api.videosdk.live/v2/rooms

The Authorization header expects a VideoSDK auth token (a JWT generated from your API key and secret in the dashboard), not the raw API key. The response will include a roomId you can use. For testing, you can let the agent auto-create the room by omitting the room_id in RoomOptions.
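If you prefer to create the room from Python rather than curl, a minimal sketch using the requests library (an assumption; any HTTP client will do) is shown below. It expects your VideoSDK auth token in the VIDEOSDK_AUTH_TOKEN environment variable:

# Minimal sketch: create a VideoSDK room via the REST API and print its roomId.
# Assumes requests is installed and VIDEOSDK_AUTH_TOKEN holds a valid auth token.
import os
import requests

response = requests.post(
    "https://api.videosdk.live/v2/rooms",
    headers={
        "Authorization": os.environ["VIDEOSDK_AUTH_TOKEN"],
        "Content-Type": "application/json",
    },
    json={"region": "sg001"},
)
response.raise_for_status()
print("roomId:", response.json()["roomId"])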

Step 4.2: Creating the Custom Agent Class (MyVoiceAgent)

The agent's persona and behavior are defined in a custom class that inherits from Agent.
agent_instructions = "You are an efficient and professional AI Voice Agent designed specifically for Business Process Outsourcing (BPO) environments. ..."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
  • The agent_instructions string guides the LLM on how the agent should behave.
  • on_enter and on_exit provide greetings and farewells when the session starts and ends.

Step 4.3: Defining the Core Pipeline (CascadingPipeline and plugins)

The pipeline orchestrates the flow of audio and text through the agent.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),   # speech-to-text
    llm=OpenAILLM(model="gpt-4o"),                     # language model that generates responses
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),      # text-to-speech
    vad=SileroVAD(threshold=0.35),                     # speech-detection sensitivity
    turn_detector=TurnDetector(threshold=0.8)          # confidence required to treat a pause as end of turn
)

Step 4.4: Managing the Session and Startup Logic

The session brings together the agent, pipeline, and conversation flow, and manages their lifecycle.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(...)
    session = AgentSession(agent=agent, pipeline=pipeline, conversation_flow=conversation_flow)
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
  • start_session initializes and starts the session, keeping it alive until manually stopped.
  • make_context configures the meeting room and enables the playground for browser-based testing; a variation that joins a specific pre-created room is sketched after this list.
  • The __main__ block launches the agent.
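If you want the agent to join the room you created in Step 4.1 instead of auto-creating one, a minimal variation of make_context sets room_id explicitly (the ID below is a placeholder):

def make_context() -> JobContext:
    # Join a specific pre-created room instead of auto-creating one.
    room_options = RoomOptions(
        room_id="YOUR_MEETING_ID",        # roomId returned by the /v2/rooms API call
        name="VideoSDK Cascaded Agent",
        playground=True                   # prints a Playground URL for browser-based testing
    )
    return JobContext(room_options=room_options)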
If you'd like to experiment with your agent in a browser-based environment, the AI Agent playground provides an interactive space for real-time testing and iteration.

Running and Testing the Agent

Step 5.1: Running the Python Script

  1. Ensure your .env file is set up with all required API keys.
  2. Save the complete code above as main.py and run it: python main.py
  3. The console will display a Playground URL.

Step 5.2: Interacting with the Agent in the Playground

  • Open the Playground link in your browser.
  • Join as a participant; you can now speak with your AI Voice Agent in real time.
  • The agent will greet you and respond to your queries.
  • To stop the agent, press Ctrl+C in your terminal. This gracefully shuts down the session and cleans up resources.

Advanced Features and Customizations

Extending Functionality with Custom Tools

  • You can add custom function tools to the agent for handling specific BPO workflows, such as ticket creation or CRM integration.
  • Implement a function and register it with your agent to enable advanced automation, as sketched below.
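As a rough illustration only, the sketch below shows what such a tool might look like. The function_tool decorator, the tools parameter on Agent, and the create_support_ticket helper are assumptions here; consult the VideoSDK function-tools documentation for the exact API before relying on this pattern.

# Hypothetical sketch of a BPO ticket-creation tool. The function_tool decorator,
# the tools= parameter, and create_support_ticket are illustrative assumptions.
from videosdk.agents import Agent, function_tool

@function_tool
async def create_support_ticket(customer_id: str, issue_summary: str) -> dict:
    """Create a support ticket in the BPO ticketing system and return its ID."""
    # Replace this stub with a real call to your ticketing or CRM backend.
    return {"ticket_id": "TCK-0001", "customer_id": customer_id, "summary": issue_summary}

class MyVoiceAgentWithTools(Agent):
    def __init__(self):
        super().__init__(
            instructions=agent_instructions,   # reuse the BPO persona defined earlier
            tools=[create_support_ticket],     # register the tool with the agent
        )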

Exploring Other Plugins

  • STT: Providers such as Google or Cartesia can be swapped in for Deepgram depending on your accuracy and cost requirements.
  • TTS: Deepgram and Rime offer cost-effective alternatives to ElevenLabs.
  • LLM: Experiment with Google Gemini or other supported models (a hedged sketch of swapping plugins follows below).
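As a hedged sketch of what a plugin swap could look like: the import paths, class names, and model IDs below (videosdk.plugins.google, GoogleLLM, DeepgramTTS) are assumptions based on the naming pattern of the plugins used earlier, so check the VideoSDK plugin documentation for the exact packages and constructor arguments.

# Hypothetical plugin swap; package names, class names, and model IDs are assumptions.
from videosdk.agents import CascadingPipeline
from videosdk.plugins.deepgram import DeepgramSTT, DeepgramTTS   # DeepgramTTS import is assumed
from videosdk.plugins.google import GoogleLLM                    # import path is assumed
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=GoogleLLM(model="gemini-2.0-flash"),   # swap OpenAI for Gemini
    tts=DeepgramTTS(),                         # swap ElevenLabs for a lower-cost TTS
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8),
)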

Troubleshooting Common Issues

API Key and Authentication Errors

  • Double-check your .env file and ensure all keys are correct and active.
  • If you see authentication errors, regenerate your API keys from the dashboard.

Audio Input/Output Problems

  • Ensure your microphone and speakers are working and permitted in your browser.
  • Test in different browsers if you encounter issues.

Dependency and Version Conflicts

  • Use a fresh virtual environment to avoid conflicts.
  • Check package versions if you encounter import errors; the sketch below prints the installed versions of the tutorial's packages.
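A minimal way to check which versions are installed, using only the standard library (the package names match the pip install command from Step 2):

# Print the installed version of each package used in this tutorial.
from importlib.metadata import PackageNotFoundError, version

packages = [
    "videosdk-agents",
    "videosdk-plugins-silero",
    "videosdk-plugins-turn-detector",
    "videosdk-plugins-deepgram",
    "videosdk-plugins-openai",
    "videosdk-plugins-elevenlabs",
]

for name in packages:
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")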

Conclusion

Congratulations! You've built a fully functional AI Voice Agent for BPO using the VideoSDK AI Agents framework. You learned how to set up the environment, implement the agent with best-in-class plugins, and test it live.
For next steps, explore advanced function tools, integrate with your BPO systems, and experiment with different plugins to optimize performance and cost. The VideoSDK framework is highly extensible, enabling you to build production-ready AI voice solutions for any BPO workflow.
