What are the core components of an AI Voice Agent?

The core components are Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS).

How do I generate a VideoSDK meeting ID?

Use the provided `curl` command with your API key to generate a meeting ID.

What is the purpose of the CascadingPipeline?

It manages the flow of data through STT, LLM, and TTS, ensuring smooth processing of user interactions.

How can I test the AI Voice Agent?

Run the script and use the playground link in the console to interact with the agent.

What should I do if I encounter API key errors?

Ensure your API keys are correctly set in the `.env` file.

Build AI Voice Assistants for Support

Step-by-step guide to building AI voice assistants for customer support using VideoSDK.

Introduction to AI Voice Agents in Customer Support

AI Voice Agents are software systems that interact with users through spoken conversation. They combine Speech-to-Text (STT), Large Language Models (LLMs), and Text-to-Speech (TTS) to understand and respond to human speech. These agents are particularly valuable in customer support, where they can handle routine inquiries, provide information, and guide users through troubleshooting processes.

In the customer support industry, AI Voice Agents enhance efficiency by reducing wait times and providing consistent, accurate information. They can manage a wide range of tasks from answering product queries to assisting with order tracking. By automating these interactions, businesses can focus human resources on more complex issues that require personal attention.

In this tutorial, you will learn how to build a functional AI Voice Agent using the VideoSDK framework. We will cover the core components such as STT, LLM, and TTS, and guide you through the process of setting up and testing your agent.

Architecture and Core Concepts

The architecture of an AI Voice Agent involves a series of steps that transform user speech into actionable responses. The process begins with capturing audio input, which is then converted into text using STT. This text is processed by a Language Model to generate a response, which is finally converted back into speech using TTS.
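The flow described above can be sketched in a few lines of plain Python. This is only an illustration of the cascade order, not the VideoSDK implementation: `CascadeSketch` and the three `fake_*` stages are hypothetical stand-ins that pass text through as bytes instead of calling real speech or language services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeSketch:
    """Chains three stages in the order described above: STT -> LLM -> TTS."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def handle_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # speech -> text
        reply = self.llm(transcript)     # text -> response text
        return self.tts(reply)           # response text -> speech


# Hypothetical stages: real ones would call speech and language services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


cascade = CascadeSketch(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = cascade.handle_turn(b"where is my order")
```

Because each stage is just a callable, any one of them can be swapped without touching the other two, which is the main appeal of a cascaded design.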

Understanding Key Concepts in the VideoSDK Framework

Agent: The core class representing your bot, responsible for handling interactions.
Cascading pipeline: Manages the flow of audio processing, integrating STT, LLM, and TTS.
VAD and turn detector: These components help the agent determine when to listen and when to respond, ensuring smooth interactions.
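To make the listen/respond decision concrete, here is a deliberately simplified sketch. It uses frame energy and a run of trailing silence, which is a swapped-in technique for illustration only: the plugins used in this guide rely on trained models, and every function name and the `silence_frames` parameter below are assumptions, not part of the VideoSDK API.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame: Sequence[float], threshold: float = 0.35) -> bool:
    """Energy-based voice activity check: loud enough counts as speech."""
    return rms(frame) >= threshold


def end_of_turn(frames: List[Sequence[float]],
                threshold: float = 0.35,
                silence_frames: int = 3) -> bool:
    """True once speech has occurred and the last `silence_frames` frames are silent."""
    flags = [is_speech(f, threshold) for f in frames]
    if not any(flags):
        return False  # the user has not spoken yet
    return not any(flags[-silence_frames:])


loud = [0.5] * 160    # synthetic "speech" frame
quiet = [0.01] * 160  # synthetic "silence" frame
turn_finished = end_of_turn([loud, loud, quiet, quiet, quiet])
```

Lowering the threshold makes the check more sensitive to quiet speech but also to background noise, which is the same trade-off you tune when adjusting a real VAD.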

Setting Up the Development Environment

Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account. Follow these steps to set up your environment:

Step 1: Create a Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip:

pip install videosdk-agents videosdk-plugins

Step 3: Configure API Keys

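Create a .env file in your project directory and add your VideoSDK API key. The Deepgram, OpenAI, and ElevenLabs plugins used later in this guide generally need their own provider keys as well; check each plugin's documentation for the exact variable names.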

VIDEOSDK_API_KEY=your_api_key_here

Building the AI Voice Agent: A Step-by-Step Guide

Let's start by presenting the complete code for our AI Voice Agent:

import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Fetch the turn-detector model before the agent starts.
pre_download_model()

agent_instructions = "You are a friendly and efficient AI Voice Assistant designed to enhance customer support experiences. Your primary role is to assist customers by answering their queries, providing information about products and services, and guiding them through troubleshooting processes. You can handle a wide range of customer support topics, including order status, product information, and basic troubleshooting steps. However, you must always maintain a polite and professional tone.\n\nCapabilities:\n1. Answer customer queries related to product details, order status, and service information.\n2. Provide step-by-step guidance for basic troubleshooting issues.\n3. Escalate complex issues to human support agents when necessary.\n4. Collect customer feedback to improve service quality.\n\nConstraints:\n1. You are not authorized to handle sensitive personal information such as credit card details or passwords.\n2. You must always inform customers that you are an AI assistant and not a human representative.\n3. You cannot make decisions on refunds or compensation; these must be escalated to a human agent.\n4. Ensure compliance with data protection regulations and maintain customer privacy at all times."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Spoken when the session starts.
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Spoken when the session ends.
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Wire STT, LLM, and TTS together with VAD and turn detection.
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session alive until the process is interrupted.
        await asyncio.Event().wait()
    finally:
        # Always release the session and the room connection.
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To interact with your agent, you'll need a meeting ID. Use the following curl command to generate one:

curl -X POST https://api.videosdk.live/v1/meetings \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
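If you would rather make the same request from Python, the sketch below mirrors the curl command using only the standard library. The URL and headers are copied from the command above; the helper names `build_meeting_request` and `create_meeting` are hypothetical, and because this guide does not show the response format, the sketch simply returns the parsed JSON without assuming any field names.

```python
import json
import urllib.request

# Endpoint taken from the curl command above.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"


def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def create_meeting(api_key: str) -> dict:
    """Send the request and return the decoded JSON body as-is."""
    request = build_meeting_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `create_meeting(os.environ["VIDEOSDK_API_KEY"])` would send the request. Splitting the builder from the network call keeps the request easy to inspect without contacting the API.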

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class inherits from the Agent class. It initializes with specific instructions detailing its capabilities and constraints. The on_enter and on_exit methods define what the agent says when a session starts or ends.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial because it defines the flow of data through the system:

STT: Converts speech to text using DeepgramSTT.
LLM: Processes text to generate responses with OpenAILLM.
TTS: Converts text back to speech via ElevenLabsTTS.
VAD & TurnDetector: Manage when the agent listens and responds.

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent and its conversation flow, then starts the session. The make_context function sets up the room options, and the if __name__ == "__main__": block runs the agent.
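The try/finally shape in start_session is what lets the agent clean up when it is interrupted. The following stand-alone sketch shows that pattern with plain asyncio; `FakeSession` and `run_until_stopped` are hypothetical names used only for illustration, and cancelling the task stands in for an interrupt such as Ctrl+C.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in that records its lifecycle calls."""

    def __init__(self) -> None:
        self.events = []

    async def start(self) -> None:
        self.events.append("start")

    async def close(self) -> None:
        self.events.append("close")


async def run_until_stopped(session: FakeSession, stop: asyncio.Event) -> None:
    try:
        await session.start()
        await stop.wait()      # block until stopped or cancelled
    finally:
        await session.close()  # runs on normal exit and on cancellation


async def demo() -> list:
    session = FakeSession()
    task = asyncio.create_task(run_until_stopped(session, asyncio.Event()))
    await asyncio.sleep(0)     # give the task a chance to start
    task.cancel()              # simulate an interrupt
    try:
        await task
    except asyncio.CancelledError:
        pass
    return session.events


events = asyncio.run(demo())
```

The session is closed even though the wait never completes, which is why the cleanup calls belong in the finally block rather than after the wait.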

Running and Testing the Agent

Step 5.1: Running the Python Script

Execute the script using:

python main.py

Step 5.2: Interacting with the Agent in the AI Agent Playground

After running the script, find the playground link in the console. Join the session and interact with your agent. Use Ctrl+C to gracefully shut down the session.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can enhance your agent by integrating custom tools that provide additional functionalities, such as advanced analytics or specialized data processing.
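As a rough idea of what a custom tool might look like for a support agent, here is a framework-agnostic sketch of an order-status lookup. Everything in it is an assumption made for illustration: the `tool` decorator, the `TOOLS` registry, and the in-memory `ORDERS` data are hypothetical, and how a function is actually registered with a VideoSDK agent is not covered in this guide.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical in-memory data; a real tool would query an order system.
ORDERS = {"A100": "shipped", "A101": "processing"}

# Hypothetical registry mapping tool names to async callables.
TOOLS: Dict[str, Callable[..., Awaitable[str]]] = {}


def tool(fn: Callable[..., Awaitable[str]]) -> Callable[..., Awaitable[str]]:
    """Register an async function under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
async def get_order_status(order_id: str) -> str:
    """Look up the status of an order by its ID."""
    status = ORDERS.get(order_id)
    if status is None:
        return f"No order found with ID {order_id}."
    return f"Order {order_id} is {status}."


async def call_tool(name: str, **kwargs: str) -> str:
    """Dispatch a call by tool name, as a language model's tool request might."""
    return await TOOLS[name](**kwargs)


result = asyncio.run(call_tool("get_order_status", order_id="A100"))
```

Returning a short, speakable sentence rather than raw data keeps the result easy for the agent to read back to the caller.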

Exploring Other Plugins

Consider experimenting with different STT, LLM, and TTS plugins to optimize performance and cost.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file.
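A quick way to catch a missing or placeholder key is to read the .env file yourself before starting the agent. The sketch below is a minimal parser written for this purpose, not a replacement for a full dotenv library; `parse_env` and `missing_keys` are hypothetical helper names, and the placeholder value is the one used earlier in this guide.

```python
from typing import Dict, Iterable, List

PLACEHOLDER = "your_api_key_here"  # sample value from the .env step


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return required keys that are absent, empty, or still the placeholder."""
    return [k for k in required if not env.get(k) or env[k] == PLACEHOLDER]


sample = "# keys\nVIDEOSDK_API_KEY=your_api_key_here\n"
problems = missing_keys(parse_env(sample), ["VIDEOSDK_API_KEY"])
```

In practice you would pass the contents of your own file, for example `parse_env(open(".env").read())`, and list every key your chosen plugins require.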

Audio Input/Output Problems

Check your microphone and speaker settings, and ensure they are correctly configured.

Dependency and Version Conflicts

Ensure all dependencies are up-to-date and compatible with your Python version.
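One way to check both points is a small diagnostic script built on the standard library. This is a sketch under the assumption that Python 3.11 is the minimum, as stated in the setup section; `python_ok` and `installed_version` are hypothetical helper names.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 11)  # minimum version named in the setup section


def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter meets the minimum major.minor version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Running `installed_version("videosdk-agents")` in the project's virtual environment shows which release pip actually installed, which is useful when comparing against a plugin's stated requirements.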

Conclusion

In this tutorial, you've built a functional AI Voice Agent for customer support using the VideoSDK framework. As next steps, consider exploring advanced features and customizations to further enhance your agent's capabilities.
