1. Introduction to AI Voice Agents for Customer Support
What is an AI Voice Agent?
An AI Voice Agent is an intelligent, automated system that interacts with humans via natural spoken language. It leverages speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) to understand, process, and respond to user queries in real time.
Why are they important for the customer support industry?
In customer support, AI Voice Agents provide 24/7 assistance, reduce wait times, and handle routine queries efficiently. They free up human agents to focus on complex issues, improve customer satisfaction, and scale support operations without proportional increases in cost.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts user speech to text.
- Large Language Model (LLM): Understands and generates human-like responses.
- Text-to-Speech (TTS): Converts text responses back to natural-sounding speech.
- Voice Activity Detection (VAD): Detects when the user is speaking.
- Turn Detection: Determines when it's the agent's turn to respond.
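To make the cascade concrete, here is a toy, framework-free sketch of how these components hand data to one another. The stub functions are placeholders for real STT/LLM/TTS engines (the canned transcript and reply are invented for illustration); a production pipeline streams audio asynchronously rather than passing whole buffers:

```python
def speech_to_text(audio: bytes) -> str:
    # Stub: a real STT engine (e.g. Deepgram) would transcribe the audio stream.
    return "where is my order?"

def generate_reply(transcript: str) -> str:
    # Stub: a real LLM would produce a grounded support answer.
    return f"Let me check that for you. You asked: {transcript}"

def text_to_speech(reply: str) -> bytes:
    # Stub: a real TTS engine (e.g. ElevenLabs) would synthesize audio.
    return reply.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    # VAD and turn detection decide *when* this chain runs;
    # the chain itself is just STT -> LLM -> TTS.
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    return text_to_speech(reply)
```

The VideoSDK framework wires these same stages together for you via its CascadingPipeline, as shown later in this tutorial.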
What You'll Build in This Tutorial
You'll build a production-ready AI Voice Agent for customer support using Python and the VideoSDK AI Agents framework. The agent will handle common support tasks, escalate complex issues, and can be tested in a web-based playground.
2. Architecture and Core Concepts
High-Level Architecture Overview
The AI voice agent system consists of several modular components working together to enable real-time, conversational support. Here's a high-level overview:
```mermaid
sequenceDiagram
    participant User
    participant Mic
    participant STT
    participant LLM
    participant TTS
    participant Agent
    participant Speaker
    User->>Mic: Speaks question
    Mic->>STT: Audio stream
    STT->>LLM: Transcribed text
    LLM->>Agent: Generates response
    Agent->>TTS: Response text
    TTS->>Speaker: Synthesized speech
    Speaker->>User: Plays response
```
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class that defines the agent's persona and logic.
- CascadingPipeline: Orchestrates the flow between the STT, LLM, TTS, VAD, and turn detection plugins. For a detailed breakdown of these elements, see the AI voice Agent core components overview.
- VAD & TurnDetector: VAD (Voice Activity Detection) identifies when the user is speaking, while the TurnDetector determines when the agent should respond, ensuring natural conversations.

If you're looking for a practical walkthrough to get started, check out the Voice Agent Quick Start Guide for step-by-step instructions. The Cascading pipeline in AI voice Agents is a crucial architectural pattern: it chains STT, LLM, TTS, and the other plugins so conversations flow efficiently and naturally.

3. Setting Up the Development Environment
Prerequisites
- Python 3.11+
- A VideoSDK account (for API keys and testing)
Step 1: Create a Virtual Environment
Open your terminal and run:
```shell
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Step 2: Install Required Packages
Install the VideoSDK AI Agents framework and plugin dependencies:
```shell
pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
```
Step 3: Configure API Keys in a .env file
Create a .env file in your project directory and add your VideoSDK, Deepgram, OpenAI, and ElevenLabs API keys:

```
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```
4. Building the AI Voice Agent: A Step-by-Step Guide
Let's dive into the code! Here's the complete, runnable implementation for our customer support AI Voice Agent:
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session starts quickly
pre_download_model()

agent_instructions = "You are an AI Voice Agent specializing in customer support. Your persona is that of a friendly, patient, and knowledgeable customer service representative. Your primary goal is to assist customers by answering their questions, resolving common issues, providing information about products or services, and guiding users through troubleshooting steps. You can handle inquiries related to order status, account information, product details, returns, and basic troubleshooting. Always communicate clearly, empathetically, and professionally. If a request is outside your scope or requires human intervention (such as handling sensitive personal data, processing payments, or making policy decisions), politely inform the customer and offer to escalate the issue to a human representative. Never provide legal, financial, or medical advice. Do not collect or store sensitive personal information. Always prioritize customer privacy and data security. If you are unsure about an answer, admit it and suggest connecting with a human agent for further assistance."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the cascading pipeline from the individual plugins
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Now, let's break down the code step by step.
Step 4.1: Generating a VideoSDK Meeting ID
To test your agent, you'll need a VideoSDK meeting room. You can generate a meeting ID using the VideoSDK API or Dashboard. Here's a quick way using curl:

```shell
curl -X POST \
  -H "Authorization: YOUR_VIDEOSDK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}' \
  https://api.videosdk.live/v2/rooms
```
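If you'd rather create the room from Python, the same request can be made with the standard library. This sketch mirrors the curl call above; it assumes your key is available in the VIDEOSDK_API_KEY environment variable and that the response JSON contains a roomId field, as returned by the rooms endpoint:

```python
import json
import os
import urllib.request

API_URL = "https://api.videosdk.live/v2/rooms"

def build_room_request(api_key: str) -> urllib.request.Request:
    """Build the POST request that asks VideoSDK to create a room."""
    return urllib.request.Request(
        API_URL,
        data=b"{}",  # empty JSON body, as in the curl example
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )

def create_room(api_key: str) -> str:
    """Create a room and return its roomId."""
    with urllib.request.urlopen(build_room_request(api_key)) as resp:
        return json.load(resp)["roomId"]

# Usage (needs a valid key and network access):
#   room_id = create_room(os.environ["VIDEOSDK_API_KEY"])
```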
The response will include a roomId you can use for testing. For playground testing, you can omit the room_id and let the agent create a room automatically.

Step 4.2: Creating the Custom Agent Class (MyVoiceAgent)
The MyVoiceAgent class defines your agent's persona and greeting/exit logic:

```python
agent_instructions = "You are an AI Voice Agent specializing in customer support. Your persona is that of a friendly, patient, and knowledgeable customer service representative. ..."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```
- The instructions string sets the agent's capabilities, tone, and boundaries.
- on_enter and on_exit handle greetings and farewells.
Step 4.3: Defining the Core Pipeline (CascadingPipeline and plugins)
The pipeline connects all the voice agent's core components:
```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```
- STT: Deepgram's Nova-2 model for fast, accurate transcription. To learn more about integrating this, see the Deepgram STT Plugin for voice agent.
- LLM: OpenAI's GPT-4o for advanced conversational intelligence. For setup details, refer to the OpenAI LLM Plugin for voice agent.
- TTS: ElevenLabs for natural, expressive speech. Explore the ElevenLabs TTS Plugin for voice agent for more options.
- VAD: SileroVAD detects when the user is speaking. Check out the Silero Voice Activity Detection documentation for configuration tips.
- TurnDetector: Ensures the agent responds at the right time. Learn more about the Turn detector for AI voice Agents.
Step 4.4: Managing the Session and Startup Logic
The session orchestrates the agent's lifecycle and connects all components.
```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
- start_session sets up the agent, pipeline, and conversation flow, then starts the session.
- make_context configures the meeting room (with playground=True for easy web testing).
- The main block runs the agent as a worker job.
For more on session management, see AI voice Agent Sessions.

5. Running and Testing the Agent
Step 5.1: Running the Python Script
- Make sure your .env file is set up with all required API keys.
- Run the agent script:

```shell
python main.py
```
- In the console output, you'll see a Playground URL. Copy this link.
Step 5.2: Interacting with the Agent in the Playground
- Open the Playground URL in your browser.
- Join the meeting as a participant.
- Speak into your microphone and interact with the AI Voice Agent.
- To stop the agent, press Ctrl+C in your terminal. The agent will gracefully shut down and release resources.
You can also experiment with your agent in the AI Agent playground, which provides a web-based environment for real-time testing and iteration.

6. Advanced Features and Customizations
Extending Functionality with Custom Tools
You can add custom function tools to the agent for actions like checking order status, fetching account info, or integrating with CRMs. Implement these as Python functions and register them with your agent.
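As a sketch of what such a tool's business logic might look like, here is a hypothetical async order-status lookup. The ORDERS dict and the function name are illustrative stand-ins for a real backend or CRM query; consult the VideoSDK function-tool documentation for the exact way to register the function with your agent, which is not shown here:

```python
import asyncio

# Hypothetical in-memory stand-in for a real order database / CRM call
ORDERS = {"A1001": "shipped", "A1002": "processing"}

async def check_order_status(order_id: str) -> str:
    """Return a customer-friendly status line for an order ID."""
    status = ORDERS.get(order_id.strip().upper())
    if status is None:
        return f"I couldn't find order {order_id}. Could you double-check the number?"
    return f"Order {order_id} is currently {status}."

# Usage:
#   print(asyncio.run(check_order_status("A1001")))  # Order A1001 is currently shipped.
```

Keeping tools as plain async functions like this makes them easy to unit-test independently of the voice pipeline.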
Exploring Other Plugins
VideoSDK supports a wide range of plugins for STT, TTS, and LLM. Try alternatives like Cartesia STT, Google Gemini LLM, or Deepgram TTS for different needs.
7. Troubleshooting Common Issues
API Key and Authentication Errors
- Double-check your .env file and ensure all API keys are valid and active.
- Make sure your VideoSDK account has sufficient credits and permissions.
Audio Input/Output Problems
- Ensure your microphone and speakers are working.
- Try a different browser or device if you encounter issues in the playground.
Dependency and Version Conflicts
- Use a fresh Python virtual environment.
- Run pip list to check for conflicting package versions.
8. Conclusion
You've built a fully functional AI Voice Agent for customer support, ready for real-world testing. This agent can handle common queries, escalate complex issues, and be extended with custom tools.
For further learning, explore VideoSDK's documentation, experiment with different plugins, and integrate your agent with real backend systems to unlock even more powerful customer support automation.