Build a Business AI Voice Agent

Step-by-step guide to building an AI Voice Agent for businesses using VideoSDK. Includes code, setup, and testing instructions.

Introduction to AI Voice Agents for Businesses

AI Voice Agents are systems that hold spoken conversations with users: they transcribe speech, interpret it with a language model, and reply in synthesized speech. Businesses use them to automate customer service, scheduling, and routine information requests. The core components of a voice agent are Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). In this tutorial, you will build a business-focused AI Voice Agent using the VideoSDK framework.

Architecture and Core Concepts

AI Voice Agents operate through a series of interconnected processes that convert user speech into actionable responses. For a comprehensive understanding of these processes, refer to the AI voice Agent core components overview. Here is a high-level overview of the architecture:
[Diagram: high-level architecture of the voice agent]
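
The speech-to-response flow described above can be sketched with stand-in functions. This is only an illustration of the STT, LLM, TTS ordering: every function name here is hypothetical, none of it is VideoSDK API, and the "audio" is plain UTF-8 bytes so the example runs without any external service.

```python
def speech_to_text(audio: bytes) -> str:
    """Stand-in STT stage: treat the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


def generate_reply(transcript: str, history: list[str]) -> str:
    """Stand-in LLM stage: record the transcript and echo it back."""
    history.append(transcript)
    return f"You said: {transcript}"


def text_to_speech(text: str) -> bytes:
    """Stand-in TTS stage: encode the reply text as bytes."""
    return text.encode("utf-8")


def handle_turn(audio: bytes, history: list[str]) -> bytes:
    """Run one user turn through the three stages in order."""
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript, history)
    return text_to_speech(reply)


if __name__ == "__main__":
    conversation: list[str] = []
    print(handle_turn(b"What are your opening hours?", conversation))
```

In the real agent each stage is a network-backed plugin, and voice activity detection and turn detection decide when a user turn starts and ends.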

Understanding Key Concepts in the VideoSDK Framework

Setting Up the Development Environment

Prerequisites

Ensure you have Python 3.11+ and a VideoSDK account at app.videosdk.live.
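
If you are unsure which interpreter is active, a short check like the following can confirm the Python 3.11+ requirement before you install anything. It is a minimal sketch; the function name is illustrative.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the prerequisites


def python_is_supported(version: tuple = tuple(sys.version_info[:2])) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old, need 3.11+"
    print(f"Python {current}: {status}")
```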

Step 1: Create a Virtual Environment

Create a virtual environment to manage your project dependencies:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip:
pip install videosdk-agents videosdk-plugins

Step 3: Configure API Keys

Create a .env file in your project directory and add your API keys:
VIDEOSDK_API_KEY=your_api_key_here
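
The agent built in this guide also calls Deepgram, OpenAI, and ElevenLabs, which normally need their own credentials. The sketch below checks that the expected variables are present before you start the agent. Only `VIDEOSDK_API_KEY` comes from this guide; the other three names are assumptions based on common provider conventions, so confirm the exact names in each plugin's documentation. The check reads the process environment only and does not parse the .env file itself.

```python
import os

# Only VIDEOSDK_API_KEY appears in this guide. The remaining names are
# assumptions; verify them against each plugin's documentation.
REQUIRED_KEYS = [
    "VIDEOSDK_API_KEY",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
]


def missing_keys(env=os.environ, required=REQUIRED_KEYS) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```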

Building the AI Voice Agent: A Step-by-Step Guide

Here is the complete code to build your AI Voice Agent:
import asyncio

from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Fetch the turn-detector model up front, before any session starts.
pre_download_model()

agent_instructions = "You are a professional and efficient voice agent designed specifically for businesses. Your primary role is to assist business clients by providing accurate information, scheduling meetings, and managing inquiries related to business operations. You can handle a wide range of business-related queries, including product information, service details, and customer support. However, you must always maintain a professional tone and ensure that all interactions are conducted with the utmost respect and confidentiality. You are not authorized to provide financial advice or make business decisions on behalf of the company. Always remind users to consult with a qualified professional for any financial or strategic business decisions. Your goal is to enhance business efficiency and customer satisfaction through seamless voice interactions."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Greeting spoken when the agent enters the interaction.
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Farewell spoken when the agent exits the interaction.
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Speech-to-text, LLM, and text-to-speech stages, plus voice activity
    # detection and turn detection.
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Block until the process is interrupted.
        await asyncio.Event().wait()
    finally:
        # Always release the session and the connection.
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To interact with your agent, you need a meeting ID. Use the following curl command to generate one:
curl -X POST https://api.videosdk.live/v1/meetings -H "Authorization: Bearer YOUR_API_KEY"
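
If you would rather create the meeting from Python, the same request can be sketched with the standard library. The endpoint and header are copied from the curl command above. The response format is not shown in this guide, so the helper simply returns whatever JSON the server sends back; the function names are illustrative.

```python
import json
import urllib.request

# Endpoint taken from the curl command in this guide.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"


def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build the POST request that the curl command sends."""
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def create_meeting(api_key: str):
    """Send the request and return the parsed JSON body as-is."""
    request = build_meeting_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Replace the placeholder with your real key before running.
    print(create_meeting("YOUR_API_KEY"))
```

Building the request separately from sending it makes it easy to inspect the URL and headers without touching the network.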

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class, providing custom behavior for entering and exiting interactions:
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline orchestrates the flow of data through various plugins, including the Deepgram STT Plugin for voice agent, OpenAI LLM Plugin for voice agent, and ElevenLabs TTS Plugin for voice agent:
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes and manages the agent session, as detailed in AI voice Agent Sessions:
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # `pipeline` is the CascadingPipeline from Step 4.3; in the complete
    # code it is created here, inside start_session.
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
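
The try/finally shape of start_session can be tried in isolation with plain asyncio. The sketch below uses no VideoSDK code and its function names are illustrative. It shows that an `asyncio.Event` that is never set keeps the coroutine alive, and that the `finally` block still runs when the task is cancelled, which is what makes the cleanup calls reliable.

```python
import asyncio


async def run_until_cancelled(log: list[str]) -> None:
    """Mimic the start, wait, cleanup shape of start_session."""
    try:
        log.append("started")
        # An Event that is never set blocks forever, keeping the task alive.
        await asyncio.Event().wait()
    finally:
        # Runs even when the task is cancelled.
        log.append("closed")


async def main() -> list[str]:
    log: list[str] = []
    task = asyncio.create_task(run_until_cancelled(log))
    await asyncio.sleep(0)  # give the task a chance to reach the wait
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # cancellation is expected here
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
```
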
The make_context function sets up the job context with room options:
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

Running and Testing the Agent

Step 5.1: Running the Python Script

Save the complete code above as main.py, then run it:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, find the playground link in the console output. Join the session and interact with your agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

Enhance your agent by integrating custom tools using the function_tool concept.
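
As a rough sketch of what such a tool might look like, the example below wraps a simple business-hours lookup. The import path `videosdk.agents.function_tool`, its use as a bare decorator, and the way the tool is later attached to the agent are assumptions to verify against the VideoSDK documentation. The fallback decorator exists only so the example runs without the package, and the opening hours are invented sample data.

```python
import asyncio

try:
    # Assumed import path; check the VideoSDK docs before relying on it.
    from videosdk.agents import function_tool
except ImportError:
    def function_tool(func):
        """No-op stand-in used when the package is not installed."""
        return func

# Invented sample data, for illustration only.
BUSINESS_HOURS = {
    "monday": "09:00-17:00",
    "tuesday": "09:00-17:00",
    "wednesday": "09:00-17:00",
    "thursday": "09:00-17:00",
    "friday": "09:00-17:00",
}


def lookup_business_hours(day: str) -> str:
    """Return the opening hours for a weekday, or 'closed' if unlisted."""
    return BUSINESS_HOURS.get(day.strip().lower(), "closed")


@function_tool
async def get_business_hours(day: str) -> str:
    """Tell the caller when the business is open on a given day."""
    return lookup_business_hours(day)


if __name__ == "__main__":
    print(lookup_business_hours("Monday"))
```

Keeping the lookup in a plain function separates the business logic from the framework wrapper, so it can be tested without a running agent.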

Exploring Other Plugins

Explore alternative STT, LLM, and TTS plugins to tailor the agent to your needs. For a quick setup, refer to the Voice Agent Quick Start Guide.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file.

Audio Input/Output Problems

Check your audio device settings and ensure they are correctly configured.

Dependency and Version Conflicts

Verify that all dependencies are installed with compatible versions.
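
One way to see what is actually installed is to query package metadata from Python. This minimal sketch uses the standard-library `importlib.metadata` module; the package names listed are the ones from the install step in this guide, so adjust them to match whatever you installed.

```python
from importlib import metadata

# Package names from the install step in this guide.
PACKAGES = ["videosdk-agents", "videosdk-plugins"]


def installed_versions(packages: list[str]) -> dict:
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```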

Conclusion

Summary of What You've Built

You have successfully built an AI Voice Agent for businesses using VideoSDK, capable of handling various business-related tasks.

Next Steps and Further Learning

Explore additional features and plugins to enhance your agent's capabilities. For more guidance, revisit the Voice Agent Quick Start Guide.
