What is an AI Voice Agent?

An AI Voice Agent is a system that interacts with users through speech, using technologies like STT, LLM, and TTS to process and respond to voice inputs.

Why are AI Voice Agents important for BPO companies?

AI Voice Agents enhance customer service by handling routine inquiries, providing quick responses, and assisting human agents, thereby improving efficiency.

How do I set up the development environment for building an AI Voice Agent?

Install Python 3.11+, create a virtual environment, install the required packages, and configure API keys in a `.env` file.

What plugins are used in the AI Voice Agent?

The agent uses Deepgram for STT, OpenAI for LLM, ElevenLabs for TTS, Silero for VAD, and TurnDetector for turn detection.

Build an AI Voice Assistant for BPO

Q: What are the core components of a Voice Agent?

The core components are Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS).

Step-by-step guide to building an AI voice assistant for BPO companies using VideoSDK.

Introduction to AI Voice Agents in BPO Companies

AI voice agents are automated systems designed to interact with humans using natural language. They play a crucial role in enhancing customer service experiences, especially in Business Process Outsourcing (BPO) companies. These agents can handle routine inquiries, provide quick responses, and assist human agents, thereby improving efficiency and customer satisfaction.

In this tutorial, we will build an AI voice assistant tailored for BPO companies using the VideoSDK framework. The assistant combines Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS) into a single conversational flow.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI voice agent involves several components that work together to process user input and generate a response. The flow typically starts with capturing user speech, converting it to text, processing it through a language model, and then converting the response back to speech.
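
The flow described above can be sketched with plain functions. This is a toy illustration only: `CascadeStub` and the three `fake_*` stages are hypothetical stand-ins invented for this example, not VideoSDK classes, and real stages would call the speech and language services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeStub:
    """Hypothetical stand-in for a cascading voice pipeline (not the VideoSDK class)."""

    stt: Callable[[bytes], str]   # speech-to-text stage
    llm: Callable[[str], str]     # language-model stage
    tts: Callable[[str], bytes]   # text-to-speech stage

    def handle_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)     # 1. user speech -> text
        reply_text = self.llm(transcript)   # 2. text -> response text
        return self.tts(reply_text)         # 3. response text -> speech


# Toy stages: here "audio" is just UTF-8 bytes so the example runs offline.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadeStub(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.handle_turn(b"where is my order")
print(reply_audio.decode("utf-8"))
```

Each stage only depends on the output type of the previous one, which is why a provider can be swapped without touching the rest of the chain.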

Understanding Key Concepts in the VideoSDK Framework

Agent: The core class representing your bot, responsible for managing interactions.
CascadingPipeline: This structure defines the flow of audio processing, integrating STT, LLM, and TTS.
VAD & TurnDetector: These components help the agent determine when to listen and respond.
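
To make the listening and responding decision concrete, here is a deliberately simple sketch. It uses an energy threshold and a count of trailing silent frames, which is a swapped-in heuristic chosen for readability; it is not how the Silero VAD or the TurnDetector plugin work internally. All function names and thresholds are assumptions for illustration.

```python
def frame_energy(frame: list[float]) -> float:
    """Mean squared amplitude of one audio frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def is_speech(frame: list[float], threshold: float = 0.01) -> bool:
    """Crude voice activity check: the frame is 'speech' if it is loud enough."""
    return frame_energy(frame) > threshold


def turn_ended(frames: list[list[float]],
               silence_frames_needed: int = 3,
               threshold: float = 0.01) -> bool:
    """True once speech has occurred and is followed by enough silent frames."""
    flags = [is_speech(f, threshold) for f in frames]
    if not any(flags):
        return False  # the user has not spoken yet, so there is no turn to end
    trailing_silence = 0
    for flag in reversed(flags):
        if flag:
            break
        trailing_silence += 1
    return trailing_silence >= silence_frames_needed


loud = [0.5] * 160   # synthetic "speech" frame
quiet = [0.0] * 160  # synthetic silent frame
print(turn_ended([loud, loud, quiet, quiet, quiet]))
```

The split mirrors the two roles in the list above: one function answers "is someone speaking right now?", the other answers "has the speaker finished?".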

Setting Up the Development Environment

Prerequisites

Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage dependencies:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
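
Before installing anything, it can help to confirm that the interpreter meets the 3.11+ prerequisite and that the virtual environment is active. The helper names below are made up for this sketch; only standard-library attributes are used.

```python
import sys

MIN_VERSION = (3, 11)  # minimum version stated in the prerequisites


def python_is_supported(version_info=sys.version_info) -> bool:
    """Compare the (major, minor) pair against the minimum version."""
    return tuple(version_info[:2]) >= MIN_VERSION


def in_virtualenv() -> bool:
    """Inside a venv, sys.prefix differs from the base interpreter's prefix."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Virtual environment active:", in_virtualenv())
```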

Step 2: Install Required Packages

Install the necessary packages using pip:

pip install videosdk-agents videosdk-plugins

Step 3: Configure API Keys

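Create a .env file in your project directory and add your VideoSDK API key. The Deepgram, OpenAI, and ElevenLabs plugins used later also need credentials from their respective providers; check each plugin's documentation for the variable name it expects.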

VIDEOSDK_API_KEY=your_api_key_here
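
A .env file is only a text file, so something has to read it into the process environment before the agent starts. The sketch below does this with the standard library alone; `parse_env_file` and `load_env_file` are hypothetical helpers written for this example and handle only simple `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def parse_env_file(path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env") -> dict:
    """Copy parsed values into os.environ without overwriting existing variables."""
    values = parse_env_file(path)
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


if __name__ == "__main__":
    if Path(".env").exists():
        loaded = load_env_file()
        print("Loaded keys:", sorted(loaded))
```

The python-dotenv package provides a `load_dotenv()` function for the same job if you prefer a maintained dependency over a hand-written parser.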

Building the AI Voice Agent: A Step-by-Step Guide

Here is the complete code for the AI voice agent:

import asyncio

from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session does not wait on it
pre_download_model()

agent_instructions = "You are an AI Voice Assistant designed specifically for BPO (Business Process Outsourcing) companies. Your primary role is to assist customer service representatives by providing quick and accurate information to enhance customer interactions. \n\n**Persona:**\n- You are a knowledgeable and efficient assistant, always ready to support BPO agents in delivering excellent customer service.\n\n**Capabilities:**\n- Provide real-time information on company policies, procedures, and product details.\n- Assist in handling common customer queries and issues.\n- Offer suggestions for upselling or cross-selling based on customer interactions.\n- Log customer interactions and feedback for quality assurance purposes.\n- Support agents in managing call queues and prioritizing tasks.\n\n**Constraints and Limitations:**\n- You are not authorized to make decisions on behalf of the company or handle sensitive customer data.\n- You must always defer to a human agent for complex issues or when unsure about the information.\n- You cannot process payments or handle financial transactions.\n- Ensure compliance with data protection regulations and company policies at all times.\n- Include a disclaimer that you are an AI assistant and not a human representative."


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Spoken when the agent joins the session
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Spoken when the agent leaves the session
        await self.session.say("Goodbye!")


async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()


def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To create a meeting ID, use the following curl command:

curl -X POST \
  https://api.videosdk.live/v1/rooms \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "Test Room"}'
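
If you would rather create the room from Python, the same request can be built with the standard library. This sketch mirrors the curl command exactly (same URL, headers, and body); the function names are invented for the example, and the structure of the JSON response is not covered in this guide, so the response is returned as a plain dictionary for you to inspect.

```python
import json
import urllib.request

ROOMS_URL = "https://api.videosdk.live/v1/rooms"  # endpoint taken from the curl command


def build_create_room_request(api_key: str, name: str = "Test Room") -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        ROOMS_URL,
        data=body,
        method="POST",
        headers={"Authorization": api_key, "Content-Type": "application/json"},
    )


def create_room(api_key: str, name: str = "Test Room") -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_create_room_request(api_key, name)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Usage (performs a real network call, so it is left commented out):
# room = create_room("YOUR_API_KEY")
# print(room)
```

Separating request construction from sending keeps the network call in one place and lets you check the headers and body without touching the API.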

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class, providing a custom implementation for entering and exiting interactions. This class is where you define how the agent greets and says goodbye to users.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial as it defines the flow of data through the system. It integrates the STT, LLM, and TTS components, allowing the agent to process and respond to user inputs seamlessly.

STT (DeepgramSTT): Converts user speech to text.
LLM (OpenAILLM): Processes the text to generate a response.
TTS (ElevenLabsTTS): Converts the response text back to speech.
VAD (SileroVAD): Detects when the user is speaking.
TurnDetector: Identifies when the agent should respond.

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent session and manages the lifecycle of the interaction. The make_context function sets up the room options, and the main block starts the job, ensuring the agent is ready to interact.
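
The lifecycle in start_session follows a common asyncio pattern: start, wait on an event, and clean up in a finally block so that cleanup runs however the wait ends. The sketch below isolates that pattern with a hypothetical `StubSession` class standing in for the real session, so it runs without any VideoSDK dependency.

```python
import asyncio


class StubSession:
    """Hypothetical stand-in for an agent session; it only records lifecycle calls."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")

    async def close(self):
        self.events.append("close")


async def run_until_stopped(session: StubSession, stop: asyncio.Event) -> None:
    try:
        await session.start()
        await stop.wait()        # blocks until something calls stop.set()
    finally:
        await session.close()    # runs on normal exit, error, or cancellation


async def demo() -> list:
    session = StubSession()
    stop = asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)       # give the task a chance to start
    stop.set()                   # simulate a shutdown signal
    await task
    return session.events


events = asyncio.run(demo())
print(events)
```

The tutorial code waits on a fresh `asyncio.Event()` that is never set, so the agent runs until the process is interrupted; passing in an event, as here, is one way to allow a programmatic shutdown.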

Running and Testing the Agent

Step 5.1: Running the Python Script

To start the agent, run the following command in your terminal:

python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the agent is running, you will receive a playground link in the console. Open this link in your browser to interact with the agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend the agent's capabilities by integrating custom tools. This allows for specialized processing or additional data handling.
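
As a rough idea of what a custom tool can look like, here is a framework-independent sketch: a decorator registers plain functions by name, and a dispatcher calls them with keyword arguments. Everything here (`tool`, `TOOLS`, `lookup_policy`, `call_tool`, and the sample policy text) is hypothetical. How tools are actually registered with a VideoSDK agent is not covered in this guide, so consult the framework documentation for that step.

```python
from typing import Callable

# Registry mapping tool names to callables (hypothetical, for illustration only).
TOOLS: dict[str, Callable[..., str]] = {}


def tool(func: Callable[..., str]) -> Callable[..., str]:
    """Register a function under its own name so it can be looked up later."""
    TOOLS[func.__name__] = func
    return func


# Made-up policy data standing in for a real knowledge base.
POLICIES = {"refund": "Refunds are accepted within 30 days of purchase."}


@tool
def lookup_policy(topic: str) -> str:
    """Return the policy text for a topic, or an escalation hint if unknown."""
    return POLICIES.get(topic.lower(), "No policy found; escalate to a human agent.")


def call_tool(name: str, **kwargs) -> str:
    """Dispatch a tool call by name, failing loudly on unknown tools."""
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**kwargs)


print(call_tool("lookup_policy", topic="Refund"))
```

The fallback message reflects the constraint in the agent instructions that unclear cases should be deferred to a human agent.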

Exploring Other Plugins

The VideoSDK framework supports various plugins for STT, LLM, and TTS. Explore alternatives to find the best fit for your needs.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API key is correctly set in the .env file and that it has the necessary permissions.
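
A quick way to rule out a missing key is to check the environment before starting the agent. In the sketch below, `VIDEOSDK_API_KEY` comes from this guide, while the three provider variable names are assumptions based on common naming; adjust them to whatever each plugin actually reads.

```python
import os

REQUIRED_KEYS = [
    "VIDEOSDK_API_KEY",     # from the .env example in this guide
    "DEEPGRAM_API_KEY",     # assumed name, verify against the plugin docs
    "OPENAI_API_KEY",       # assumed name, verify against the plugin docs
    "ELEVENLABS_API_KEY",   # assumed name, verify against the plugin docs
]


def missing_keys(required=REQUIRED_KEYS, env=None) -> list:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

This only confirms that a value is present; whether the key is valid or has the right permissions still has to be verified with the provider.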

Audio Input/Output Problems

Check your microphone and speaker settings. Ensure they are correctly configured and not muted.

Dependency and Version Conflicts

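Ensure all dependencies are up to date and compatible with your Python version (3.11 or later, per the prerequisites). Installing into a fresh virtual environment helps rule out conflicts with globally installed packages.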
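
When chasing a version conflict, it helps to see exactly what is installed in the active environment. The sketch below uses the standard library's `importlib.metadata`; the helper name is made up, and the package names are the ones from the install step in this guide.

```python
import sys
from importlib import metadata


def installed_versions(packages) -> dict:
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    for name, version in installed_versions(["videosdk-agents", "videosdk-plugins"]).items():
        print(f"{name}: {version or 'not installed'}")
```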

Conclusion

Summary of What You've Built

In this guide, you've built an AI voice assistant tailored for BPO companies using the VideoSDK framework. This agent can effectively assist customer service representatives by providing real-time information and handling routine inquiries.

Next Steps and Further Learning

Explore additional features and plugins to enhance the agent's capabilities. Consider diving deeper into the VideoSDK documentation for more advanced use cases and integrations.
