Introduction to AI Voice Agents in Conversational AI for Sales
AI Voice Agents are intelligent systems designed to interact with users through voice commands. In the context of sales, these agents can significantly enhance customer engagement by providing instant responses to inquiries, guiding potential customers through product catalogs, and even scheduling follow-up meetings with sales representatives. This tutorial will guide you through building a Conversational AI for Sales using the VideoSDK framework.
What is an AI Voice Agent?
An AI Voice Agent is a software application that uses speech recognition, natural language processing, and text-to-speech technologies to interact with users vocally. It listens to user inputs, processes them to understand the intent, and responds appropriately.
Why are they important for the conversational AI for sales industry?
In sales, AI Voice Agents can automate repetitive tasks, provide 24/7 customer support, and offer personalized interactions that enhance the customer experience. They can handle initial inquiries, qualify leads, and assist in converting prospects into customers.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text using the Deepgram STT Plugin for voice agent.
- Large Language Model (LLM): Processes the text to understand and generate responses.
- Text-to-Speech (TTS): Converts text responses back into spoken language, which can be enhanced with the ElevenLabs TTS Plugin for voice agent.
What You'll Build in This Tutorial
You will build a Conversational AI for Sales that can engage with customers, answer questions, and schedule meetings using the VideoSDK framework.
Architecture and Core Concepts
High-Level Architecture Overview
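The architecture of a Conversational AI involves several components working together to process user inputs and generate responses. At a high level, the data flows as follows: the user's speech is segmented by VAD and turn detection, transcribed by STT, passed to the LLM to generate a reply, and converted back to audio by TTS.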
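As a rough illustration of that data flow, the sketch below chains three stand-in functions in the same order the pipeline uses: speech-to-text, language model, text-to-speech. The function names and the canned reply are hypothetical placeholders, not VideoSDK APIs; in the real agent these stages are handled by the Deepgram, OpenAI, and ElevenLabs plugins.

```python
from typing import Callable


# Hypothetical stand-ins for the three stages; none of these are VideoSDK APIs.
def fake_stt(audio: bytes) -> str:
    """Pretend to transcribe audio by decoding it as UTF-8 text."""
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    """Pretend to generate a reply from the transcript."""
    return f"You asked: {transcript!r}. A sales representative can follow up."


def fake_tts(reply: str) -> bytes:
    """Pretend to synthesize speech by encoding the reply as bytes."""
    return reply.encode("utf-8")


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Process one conversational turn: audio in, audio out."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


response_audio = run_turn(b"What plans do you offer?", fake_stt, fake_llm, fake_tts)
```

Passing the stages in as arguments mirrors the idea of a pipeline whose components can be swapped independently.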

Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, handling interactions.
- CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS, detailed in the Cascading pipeline in AI voice Agents.
- VAD & TurnDetector: Ensure the agent listens and responds at the right times, utilizing Silero Voice Activity Detection and the Turn detector for AI voice Agents.
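To give a feel for what a voice-activity threshold does, here is a simplified sketch that groups per-frame speech probabilities into speech segments. It is not Silero's algorithm: the function name, the `min_silence_frames` parameter, and the input list are assumptions made for illustration. Only the default threshold of 0.35 is taken from this tutorial's code.

```python
def segment_speech(probabilities, threshold=0.35, min_silence_frames=3):
    """Return (start, end) frame index pairs for detected speech segments.

    A frame counts as speech when its probability is >= threshold. A segment
    is closed once min_silence_frames consecutive non-speech frames follow it.
    """
    segments = []
    start = None      # index of the first frame in the current segment
    silence_run = 0   # consecutive non-speech frames seen inside a segment
    for i, p in enumerate(probabilities):
        if p >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The segment ends at the last frame that was speech.
                segments.append((start, i - silence_run))
                start = None
                silence_run = 0
    if start is not None:
        # Input ended while a segment was still open.
        segments.append((start, len(probabilities) - 1 - silence_run))
    return segments


frame_probs = [0.1, 0.6, 0.7, 0.2, 0.8, 0.1, 0.1, 0.1, 0.9]
segments = segment_speech(frame_probs)
```

In this toy model, a lower threshold makes the detector more willing to treat quiet frames as speech, while a higher one requires stronger evidence before a segment starts.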
Setting Up the Development Environment
Prerequisites
To get started, ensure you have Python 3.11+ installed and a VideoSDK account. Sign up at app.videosdk.live.
Step 1: Create a Virtual Environment
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
Step 2: Install Required Packages
pip install videosdk
Step 3: Configure API Keys in a .env file
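Create a .env file in your project directory and add your VideoSDK API keys, along with the API keys for the Deepgram, OpenAI, and ElevenLabs services used by the plugins in this tutorial.
Building the AI Voice Agent: A Step-by-Step Guide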
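Before running any agent code, it can help to confirm that the keys in the .env file are actually readable. The sketch below parses simple KEY=value lines with the standard library and reports which keys are missing. The variable names in `REQUIRED_KEYS` are assumptions, since this tutorial does not list them; check each provider's documentation for the exact names.

```python
import os
from pathlib import Path

# Assumed variable names; the tutorial does not specify them.
REQUIRED_KEYS = (
    "VIDEOSDK_AUTH_TOKEN",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def load_env_file(path=".env"):
    """Copy entries from the file into os.environ without overwriting existing ones."""
    file = Path(path)
    values = parse_env_text(file.read_text()) if file.exists() else {}
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

A typical use would be to call `missing_keys(load_env_file())` at startup and stop with a clear message if the list is not empty.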
First, let's present the complete code for building the AI Voice Agent:
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Pre-downloading the Turn Detector model
pre_download_model()

agent_instructions = "You are a Conversational AI for Sales, designed to assist sales teams in engaging with potential customers. Your primary role is to provide information about products, answer frequently asked questions, and guide customers through the sales process. You can also schedule follow-up calls and meetings with sales representatives. However, you are not authorized to finalize sales transactions or provide personalized financial advice. Always remind users to consult with a sales representative for detailed inquiries and final decisions. Your tone should be professional yet friendly, aiming to enhance customer experience and satisfaction."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Step 4.1: Generating a VideoSDK Meeting ID
To generate a meeting ID, you can use the following curl command:
curl -X POST -H "Authorization: Bearer YOUR_API_TOKEN" -H "Content-Type: application/json" -d '{}' https://api.videosdk.live/v1/rooms
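If you would rather make the same call from Python, the following sketch builds an equivalent request with the standard library. The URL, headers, and empty JSON body are copied from the curl command; the function names are hypothetical, and the shape of the response is not described in this tutorial, so inspect the returned dictionary to find the meeting ID.

```python
import json
import urllib.request

ROOMS_URL = "https://api.videosdk.live/v1/rooms"  # same endpoint as the curl command


def build_room_request(api_token: str, url: str = ROOMS_URL) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command, without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # the '{}' body passed with -d
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_room(api_token: str) -> dict:
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(build_room_request(api_token), timeout=10) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to check the headers and body without touching the network.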
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class is where you define the behavior of your voice agent. It extends the Agent class and uses the agent_instructions to guide its interactions.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is the backbone of the voice agent, connecting the STT, LLM, and TTS components.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
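One optional refinement is to keep the model names and thresholds in a single settings object instead of hard-coding them in the pipeline call. The sketch below is only one way to do that: the `PipelineSettings` class and the `AGENT_*` environment variable names are hypothetical, while the defaults are the values used in this tutorial.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineSettings:
    """Defaults match the values used in this tutorial's pipeline."""
    stt_model: str = "nova-2"
    stt_language: str = "en"
    llm_model: str = "gpt-4o"
    tts_model: str = "eleven_flash_v2_5"
    vad_threshold: float = 0.35
    turn_threshold: float = 0.8


def settings_from_env(env=None) -> PipelineSettings:
    """Build settings from a mapping (os.environ by default), falling back to defaults."""
    env = os.environ if env is None else env
    defaults = PipelineSettings()
    return PipelineSettings(
        stt_model=env.get("AGENT_STT_MODEL", defaults.stt_model),
        stt_language=env.get("AGENT_STT_LANGUAGE", defaults.stt_language),
        llm_model=env.get("AGENT_LLM_MODEL", defaults.llm_model),
        tts_model=env.get("AGENT_TTS_MODEL", defaults.tts_model),
        vad_threshold=float(env.get("AGENT_VAD_THRESHOLD", defaults.vad_threshold)),
        turn_threshold=float(env.get("AGENT_TURN_THRESHOLD", defaults.turn_threshold)),
    )
```

The resulting values could then be passed to the same constructor arguments shown in the pipeline, for example `DeepgramSTT(model=settings.stt_model, language=settings.stt_language)`.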
Step 4.4: Managing the Session and Startup Logic
The start_session function initializes the agent session and manages the connection lifecycle.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
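The try/finally structure matters because `asyncio.Event().wait()` never returns on its own: the event is never set, so the coroutine only ends when it is cancelled. The standalone sketch below uses a hypothetical `run_until_cancelled` function, with no VideoSDK objects involved, to show that the `finally` block still runs when that happens.

```python
import asyncio


async def run_until_cancelled(log: list) -> None:
    """Block forever, recording when cleanup runs."""
    log.append("started")
    try:
        await asyncio.Event().wait()  # never set, so this waits until cancelled
    finally:
        log.append("cleaned up")  # stands in for session.close() and context.shutdown()


async def demo() -> list:
    log = []
    task = asyncio.create_task(run_until_cancelled(log))
    await asyncio.sleep(0)  # yield once so the task starts and reaches the wait
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled on purpose
    return log


events = asyncio.run(demo())
```

The same mechanism is what lets the cleanup calls in start_session run when the surrounding task is cancelled.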
The make_context function sets up the room options for the agent.
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)
Finally, the script is started with the following block:
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Running and Testing the Agent
Step 5.1: Running the Python Script
To run your agent, execute the following command:
python main.py
Step 5.2: Interacting with the Agent in the Playground
After starting the script, you will see a playground link in the console. Open this link in a browser to interact with your AI Voice Agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can enhance your agent by integrating custom tools for specific tasks, such as data retrieval or advanced analytics.
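As an example of the kind of logic a custom tool might wrap, here is a framework-agnostic sketch of a follow-up scheduling helper. The function name, its parameters, and the returned dictionary are all hypothetical. It only validates and echoes the request rather than touching a real calendar, and how such a function is registered with the agent is not covered in this tutorial.

```python
from datetime import datetime


def schedule_follow_up(customer_name: str, when_iso: str, topic: str = "product questions") -> dict:
    """Validate a follow-up request and return a JSON-friendly summary.

    Hypothetical helper: a real tool would also write to a calendar or CRM.
    """
    name = customer_name.strip()
    if not name:
        return {"ok": False, "error": "customer_name is required"}
    try:
        when = datetime.fromisoformat(when_iso)
    except ValueError:
        return {"ok": False, "error": f"invalid ISO 8601 datetime: {when_iso!r}"}
    return {
        "ok": True,
        "customer_name": name,
        "when": when.isoformat(),
        "topic": topic,
    }


booking = schedule_follow_up(" Dana ", "2025-03-01T10:30")
```

Returning a plain dictionary with an explicit `ok` flag gives the language model a simple result to describe back to the customer, including the failure case.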
Exploring Other Plugins
The VideoSDK framework supports various plugins for STT, LLM, and TTS. Experiment with different models to find the best fit for your needs.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the .env file and that your account has the necessary permissions.
Audio Input/Output Problems
Check your microphone and speaker settings if you encounter audio issues. Ensure they are correctly configured and not muted.
Dependency and Version Conflicts
Verify that all dependencies are installed with compatible versions. Use a virtual environment to manage dependencies effectively.
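Part of this check can be scripted. The sketch below, using only the standard library, confirms the Python 3.11+ prerequisite and reports the installed version of each named distribution. The helper names are hypothetical, and the list of distributions to check is up to you.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 11)  # prerequisite stated earlier in this tutorial


def python_ok(version_info=None, minimum=MIN_PYTHON) -> bool:
    """Return True if the interpreter's major.minor version meets the minimum."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) >= minimum


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

For instance, `installed_versions(["videosdk"])` run inside the virtual environment shows whether the package from Step 2 is present and which version was resolved.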
Conclusion
Summary of What You've Built
In this tutorial, you have built a Conversational AI for Sales using the VideoSDK framework. Your agent can interact with customers, answer questions, and schedule meetings. For a comprehensive understanding of the components involved, refer to the AI voice Agent core components overview.
Next Steps and Further Learning
Explore additional features and plugins offered by VideoSDK to enhance your agent's capabilities. Consider integrating with CRM systems for a more comprehensive sales solution.
FAQ