Introduction to AI Voice Agents
What is an AI Voice Agent?
An AI Voice Agent is a software application designed to interact with users through voice. It captures spoken language, interprets the user's intent, and responds appropriately, often approximating natural human conversation. These agents combine technologies such as Speech-to-Text (STT), Large Language Models (LLMs), and Text-to-Speech (TTS) to facilitate seamless communication.
Why are AI Voice Agents important?
AI Voice Agents are used across industries to provide customer support, automate routine tasks, and improve user experiences. They are also an excellent learning project: building one requires you to integrate several AI technologies (STT, LLM, TTS) into a single real-time system.
Core Components of a Voice Agent
- STT (Speech-to-Text): Converts spoken language into text.
- LLM (Large Language Model): Processes the transcribed text to understand the request and generate a response.
- TTS (Text-to-Speech): Converts text responses back into spoken language.
What You'll Build in This Tutorial
In this tutorial, you will build a fully functional AI Voice Agent using the VideoSDK framework. The agent will guide users on how to build AI voice agents, providing step-by-step instructions and answering common questions. For a comprehensive overview, refer to the Voice Agent Quick Start Guide.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent architecture involves a data flow that starts with user speech, which is converted to text using STT. The text is then processed by an LLM to generate a response, which is converted back to speech using TTS. This cycle repeats as the agent interacts with the user.
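To make this cycle concrete, here is a minimal, illustrative sketch of the cascaded loop in plain Python. All five callables are hypothetical placeholders; in the tutorial below, VideoSDK's CascadingPipeline wires the real STT, LLM, and TTS stages together for you:
```python
import asyncio

async def voice_agent_loop(capture_utterance, transcribe, generate_reply,
                           synthesize, play_audio):
    """Illustrative STT -> LLM -> TTS loop; all callables are placeholders."""
    while True:
        audio_in = await capture_utterance()      # wait until the user finishes speaking (VAD / turn detection)
        text_in = await transcribe(audio_in)      # STT: speech -> text
        text_out = await generate_reply(text_in)  # LLM: text -> response text
        audio_out = await synthesize(text_out)    # TTS: text -> speech
        await play_audio(audio_out)               # stream the reply back to the user
```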

Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, handling interactions and responses.
- CascadingPipeline: Manages the flow of audio processing, linking the STT, LLM, and TTS components. Learn more about the Cascading pipeline in AI voice Agents.
- VAD & TurnDetector: These components help the agent determine when to listen and when to speak, ensuring smooth interaction. Explore the Turn detector for AI voice Agents for more details.
Setting Up the Development Environment
Prerequisites
To get started, ensure you have Python 3.11+ installed and a VideoSDK account, which you can create at app.videosdk.live.
Step 1: Create a Virtual Environment
To avoid conflicts, create a virtual environment:
```bash
python -m venv voice-agent-env
source voice-agent-env/bin/activate  # On Windows use `voice-agent-env\Scripts\activate`
```
Step 2: Install Required Packages
Install the necessary packages using pip:
```bash
pip install videosdk
pip install python-dotenv
```
Depending on your SDK version, the agents framework and the plugins used below (Silero, turn detector, Deepgram, OpenAI, ElevenLabs) may ship as separate packages; check the Voice Agent Quick Start Guide for the exact install commands.
Step 3: Configure API Keys in a .env file
Create a .env file in your project directory and add your VideoSDK API key:
```
VIDEOSDK_API_KEY=your_api_key_here
```
The STT, LLM, and TTS plugins also require their own provider keys (for example, OPENAI_API_KEY for OpenAI); add those to the same file, consulting each plugin's documentation for the exact variable names.
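The python-dotenv package from Step 2 is what makes these values visible to your script: calling load_dotenv() copies them into the process environment, where the SDK and plugins typically read them. A minimal sketch:
```python
import os
from dotenv import load_dotenv

# Copy variables from the .env file into the process environment.
load_dotenv()

# The SDK and plugins read keys via the environment, e.g.:
api_key = os.getenv("VIDEOSDK_API_KEY")
```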
Building the AI Voice Agent: A Step-by-Step Guide
Here is the complete, runnable code for the AI Voice Agent:
```python
import asyncio
from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load the API keys configured in Step 3 from the .env file
load_dotenv()

# Pre-download the Turn Detector model so the first session starts quickly
pre_download_model()

agent_instructions = "You are an AI Voice Agent specialized in guiding users on 'how to build an AI voice agent'. Your persona is that of a knowledgeable and friendly tech mentor. Your primary capabilities include providing step-by-step instructions, offering tips on best practices, and suggesting tools and frameworks for building AI voice agents. You can also answer common questions related to AI voice agent development. However, you must clarify that you are not a substitute for professional software development training and recommend consulting with experienced developers for complex issues. Always encourage users to test their implementations thoroughly before deployment."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Step 4.1: Generating a VideoSDK Meeting ID
To interact with the AI Voice Agent, you need a meeting ID. You can generate one using the VideoSDK API:
```bash
curl -X POST "https://api.videosdk.live/v1/meetings" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
```
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class, defining the agent's behavior:
```python
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```
This class initializes the agent with specific instructions and defines actions upon entering and exiting a session.
Step 4.3: Defining the Core Pipeline
The CascadingPipeline orchestrates the flow of audio processing:
```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```
Each component in the pipeline has a specific role: STT converts speech to text, the LLM processes the text to produce a response, and TTS converts that response back to speech, while the VAD and turn detector decide when the user has finished speaking. For more information on the TTS component, check out the ElevenLabs TTS Plugin for voice agent guide.
Step 4.4: Managing the Session and Startup Logic
The start_session function manages the session lifecycle:
```python
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()
```
This function sets up the agent, pipeline, and conversation flow, and handles the connection and cleanup processes. For a deeper understanding of session management, refer to [AI voice Agent Sessions](https://docs.videosdk.live/ai_agents/core-components/agent-session).
Running and Testing the Agent
Step 5.1: Running the Python Script
To run the agent, execute the Python script:
```bash
python main.py
```
Step 5.2: Interacting with the Agent in the Playground
Once the script is running, a playground link will be displayed in the console. Use this link to join the session and interact with the agent. You can test the agent's responses and functionality.
Advanced Features and Customizations
Extending Functionality with Custom Tools
The VideoSDK framework allows you to extend the agent's capabilities by registering custom tools that the agent can call during a conversation; see the sketch below.
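As an illustration, here is a minimal sketch of a custom tool. It assumes the framework exposes a function_tool decorator as shown in the VideoSDK agents documentation, and the get_docs_link tool itself is a hypothetical example:
```python
from videosdk.agents import Agent, function_tool  # function_tool import assumed; see VideoSDK docs

class MyVoiceAgentWithTools(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    @function_tool
    async def get_docs_link(self, topic: str) -> str:
        """Hypothetical tool: return a documentation link for a topic.

        The LLM can decide to call this when a user asks where to read more.
        """
        links = {
            "sessions": "https://docs.videosdk.live/ai_agents/core-components/agent-session",
        }
        return links.get(topic.lower(), "https://docs.videosdk.live")
```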
Exploring Other Plugins
Beyond the default plugins, you can explore other STT, LLM, and TTS options to better suit your specific needs. For instance, consider the Deepgram STT Plugin for voice agent for advanced speech-to-text capabilities, or Silero Voice Activity Detection for improved voice activity detection. Swapping a component is typically a one-line change in the pipeline, as sketched below.
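For example, here is a variant of the Step 4.3 pipeline that swaps in a smaller OpenAI model. This is a sketch; check each plugin's documentation for the models and parameters it actually supports:
```python
# Variant of the Step 4.3 pipeline; the swapped-in model choice is illustrative.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o-mini"),           # smaller, cheaper OpenAI model
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```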
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly set in the .env file and that you have the necessary permissions.
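A quick, minimal sanity check (key names follow the .env file from Step 3) to confirm the keys are visible to your process:
```python
import os
from dotenv import load_dotenv

load_dotenv()

# VIDEOSDK_API_KEY comes from Step 3; add any provider keys you configured
# (e.g., OPENAI_API_KEY) to this tuple as well.
for key in ("VIDEOSDK_API_KEY", "OPENAI_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```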
Audio Input/Output Problems
Check your microphone and speaker settings, and ensure they are configured correctly.
Dependency and Version Conflicts
Ensure all dependencies are installed with compatible versions as specified in the documentation.
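To see which versions are actually installed in your virtual environment, you can query package metadata from the standard library (package names follow Step 2):
```python
from importlib.metadata import version, PackageNotFoundError

# Print installed versions of the packages from Step 2.
for pkg in ("videosdk", "python-dotenv"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed in this environment")
```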
Conclusion
Summary of What You've Built
You have successfully built an AI Voice Agent using the VideoSDK framework, capable of interacting with users and providing guidance on building AI voice agents. For a complete understanding of the components involved, review the AI voice Agent core components overview.
Next Steps and Further Learning
Explore additional plugins and customizations to enhance your agent's capabilities. Continue learning about AI technologies to build more sophisticated voice agents.