Introduction to AI Voice Agents in Live Commerce
What is an AI Voice Agent?
An AI Voice Agent is a software application designed to interact with users through voice commands. These agents use technologies like Speech-to-Text (STT), Text-to-Speech (TTS), and Natural Language Processing (NLP) to understand and respond to human speech. They are becoming increasingly popular in various industries, including live commerce, where they enhance user experience by providing real-time assistance and information.
Why are they important for the live commerce industry?
In the live commerce industry, AI Voice Agents play a crucial role in enhancing customer engagement. They can provide instant product information, assist in navigating through live shopping platforms, and answer customer queries. This leads to a more interactive and personalized shopping experience, increasing customer satisfaction and potentially boosting sales.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Text-to-Speech (TTS): Converts text back into spoken language.
- Large Language Model (LLM): Processes the text to understand and generate human-like responses.
What You’ll Build in This Tutorial
In this tutorial, you will learn how to build an AI Voice Assistant tailored for live commerce using the VideoSDK framework. You’ll integrate various plugins for STT, TTS, and LLM, and set up a complete working environment ready for testing.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent architecture involves several key components working together to process user input and generate responses. The flow typically starts with capturing user speech, converting it to text, processing the text to understand the intent, generating a response, and finally converting the response text back to speech.
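The flow described above can be sketched with plain Python stand-ins. The three stage functions here are hypothetical placeholders, not VideoSDK APIs; they only illustrate how each stage's output becomes the next stage's input.

```python
from typing import Callable

# Hypothetical stand-ins for the three stages. Real plugins would call
# speech and language services instead of these toy functions.
def fake_stt(audio: bytes) -> str:
    """Pretend to transcribe audio by decoding it as UTF-8 text."""
    return audio.decode("utf-8")

def fake_llm(text: str) -> str:
    """Pretend to interpret the transcript and draft a reply."""
    return f"You asked: {text!r}. Let me check that for you."

def fake_tts(text: str) -> bytes:
    """Pretend to synthesize speech by encoding the reply as bytes."""
    return text.encode("utf-8")

def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str] = fake_stt,
    llm: Callable[[str], str] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> bytes:
    """Run one conversational turn: speech in, speech out."""
    transcript = stt(audio)   # speech -> text
    reply = llm(transcript)   # text -> response text
    return tts(reply)         # response text -> speech

print(run_turn(b"Is the blue jacket in stock?").decode("utf-8"))
```

Because each stage is passed in as a plain callable, swapping one provider for another only changes the argument, which is the same idea the pipeline's plugin slots express.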
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, responsible for managing interactions.
- Cascading Pipeline: Manages the flow of audio processing through STT, LLM, and TTS.
- VAD & TurnDetector: These components help the agent determine when to listen and when to speak, ensuring smooth interactions.
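To make the listen/speak decision more concrete, here is a minimal sketch of threshold-based voice activity detection and end-of-turn detection. It is an illustration under stated assumptions, not how SileroVAD or TurnDetector work internally: the per-frame speech probabilities, the `silence_frames` parameter, and both function names are hypothetical.

```python
from typing import List

def speech_flags(frame_probs: List[float], threshold: float = 0.35) -> List[bool]:
    """Mark each audio frame as speech when its probability reaches the threshold."""
    return [p >= threshold for p in frame_probs]

def turn_ended(flags: List[bool], silence_frames: int = 3) -> bool:
    """Return True once speech was heard and the last `silence_frames` frames are silent."""
    if len(flags) < silence_frames or not any(flags):
        return False
    return not any(flags[-silence_frames:])

# Hypothetical per-frame speech probabilities for one short utterance.
probs = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.05]
flags = speech_flags(probs)
print(flags)
print(turn_ended(flags))
```

A lower threshold makes the detector more sensitive to quiet speech but also to background noise, so values like these usually need tuning per environment.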
Setting Up the Development Environment
Prerequisites
Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account, which you can create at the VideoSDK website.
Step 1: Create a Virtual Environment
Creating a virtual environment helps manage dependencies and avoid conflicts. Run the following commands:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
Step 2: Install Required Packages
Install the necessary packages using pip:
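pip install videosdk-agents
pip install videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
pip install python-dotenv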
Step 3: Configure API Keys in a .env file
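Create a .env file in your project directory and add your VideoSDK API key. The Deepgram, OpenAI, and ElevenLabs plugins used later also need credentials from their own providers, so add those keys to the same file.
VIDEOSDK_API_KEY=your_api_key_here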
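A quick startup check can catch a missing or blank key before the agent tries to connect. The sketch below assumes the values from .env have already been loaded into the process environment (for example with python-dotenv); the helper name `missing_keys` is hypothetical.

```python
import os
from typing import Iterable, List, Mapping

def missing_keys(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the variable names that are unset or blank in the given environment."""
    return [name for name in names if not env.get(name, "").strip()]

# Demo with an explicit mapping so the example does not depend on your shell.
example_env = {"VIDEOSDK_API_KEY": "your_api_key_here", "OTHER_KEY": "  "}
print(missing_keys(["VIDEOSDK_API_KEY", "OTHER_KEY"], example_env))
```

In a real script you would call `missing_keys([...])` without the second argument and stop with a clear error message if the returned list is not empty.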
Building the AI Voice Agent: A Step-by-Step Guide
Complete Code Block
Here is the complete code for building the AI Voice Agent:
import asyncio

from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load the API keys from the .env file created in Step 3
load_dotenv()

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are an AI Voice Assistant specialized in live commerce. Your persona is that of a knowledgeable and engaging shopping guide. Your primary capabilities include assisting users in navigating live commerce platforms, providing detailed product information, answering questions about product availability, and guiding users through the purchasing process. You can also offer personalized product recommendations based on user preferences and past interactions. However, you must adhere to certain constraints: you cannot process payments or handle sensitive financial information, and you must always remind users to verify product details and prices on the official platform before making a purchase. Additionally, you should not provide medical, legal, or financial advice beyond general product information. Your goal is to enhance the live shopping experience by being informative, engaging, and helpful, while ensuring user privacy and security."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Greet the user when the session starts
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Say goodbye when the session ends
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Step 4.1: Generating a VideoSDK Meeting ID
To create a meeting ID, use the following curl command:
curl -X POST https://api.videosdk.live/v1/meetings -H "Authorization: YOUR_API_KEY" -H "Content-Type: application/json"
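If you prefer to stay in Python, the same request can be built with the standard library. This is a sketch that mirrors the curl command above (same URL, method, and headers); the function name is hypothetical, and the example only constructs the request without sending it, since the shape of the response is not described here.

```python
import urllib.request

# Same endpoint as the curl command above.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"

def build_create_meeting_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a meeting."""
    return urllib.request.Request(
        MEETINGS_URL,
        headers={
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_create_meeting_request("YOUR_API_KEY")
print(req.get_method(), req.full_url)
# To send it, pass `req` to urllib.request.urlopen() and parse the JSON body.
```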
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class. It initializes with specific instructions tailored for live commerce. The on_enter and on_exit methods define what the agent says when a session starts and ends.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
Step 4.3: Defining the Core Pipeline
The CascadingPipeline orchestrates the flow of audio data through various processing stages. Each plugin plays a specific role: for example, the Deepgram STT plugin handles speech recognition and the OpenAI LLM plugin handles language understanding.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
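The two thresholds are the values you are most likely to tune while testing. One way to adjust them without editing code is to read them from the environment, as in this sketch. The variable names `VAD_THRESHOLD` and `TURN_THRESHOLD`, the helper names, and the 0 to 1 validation range are all assumptions made for illustration; they are not part of the VideoSDK framework.

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class PipelineThresholds:
    vad: float = 0.35   # default taken from the SileroVAD call above
    turn: float = 0.8   # default taken from the TurnDetector call above

def _read_fraction(env: Mapping[str, str], name: str, default: float) -> float:
    """Read a float from the environment and check the assumed 0..1 range."""
    raw = env.get(name)
    value = default if raw is None else float(raw)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return value

def load_thresholds(env: Mapping[str, str] = os.environ) -> PipelineThresholds:
    """Build thresholds from hypothetical VAD_THRESHOLD / TURN_THRESHOLD variables."""
    return PipelineThresholds(
        vad=_read_fraction(env, "VAD_THRESHOLD", 0.35),
        turn=_read_fraction(env, "TURN_THRESHOLD", 0.8),
    )

print(load_thresholds({"VAD_THRESHOLD": "0.5"}))
```

The result could then be passed along as `SileroVAD(threshold=t.vad)` and `TurnDetector(threshold=t.turn)`.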
Step 4.4: Managing the Session and Startup Logic
The start_session function manages the session lifecycle. It sets up the agent, pipeline, and conversation flow, and ensures the session runs until manually terminated.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
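The try/finally structure is what guarantees cleanup. The following sketch isolates that pattern with a hypothetical `FakeSession` stand-in, so it runs without any VideoSDK dependency. Unlike the tutorial code, it waits on an event that the demo sets itself, so the example terminates.

```python
import asyncio
from typing import List

class FakeSession:
    """Hypothetical stand-in for AgentSession that records lifecycle calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    async def start(self) -> None:
        self.calls.append("start")

    async def close(self) -> None:
        self.calls.append("close")

async def run_until_stopped(session: FakeSession, stop: asyncio.Event) -> None:
    try:
        await session.start()
        await stop.wait()        # blocks until something calls stop.set()
    finally:
        await session.close()    # runs on normal exit, errors, and cancellation

async def demo() -> List[str]:
    session, stop = FakeSession(), asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)       # give the task a chance to start
    stop.set()                   # simulate a manual shutdown request
    await task
    return session.calls

print(asyncio.run(demo()))
```

In the tutorial code the event is never set, so the agent keeps running until the process is interrupted, at which point the finally block still closes the session.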
The make_context function sets up the room options, enabling the playground mode for testing.
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)
Running and Testing the Agent
Step 5.1: Running the Python Script
To run the agent, save the complete code as main.py and execute the script using:
python main.py
Step 5.2: Interacting with the Agent in the Playground
After running the script, you will see a playground link in the console. Use this link to join the session and interact with your AI Voice Agent. You can test various queries and see how the agent responds.
Advanced Features and Customizations
Extending Functionality with Custom Tools
The VideoSDK framework allows you to extend the agent's functionality using custom tools. This enables you to add new capabilities tailored to specific use cases.
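As an illustration of the kind of capability a custom tool might provide in live commerce, here is a plain Python product lookup. Everything in it is hypothetical: the catalog, the `Product` type, and the `get_product_info` function. How a function like this is registered with the agent is framework-specific and not shown here; consult the VideoSDK documentation for that step.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class Product:
    name: str
    price: float
    in_stock: int

# Hypothetical in-memory catalog; a real agent would query a product service.
CATALOG: Dict[str, Product] = {
    "blue-jacket": Product("Blue Jacket", 59.0, 4),
    "red-scarf": Product("Red Scarf", 15.5, 0),
}

def get_product_info(product_id: str) -> str:
    """Return a short, speakable summary of a product's price and availability."""
    product = CATALOG.get(product_id)
    if product is None:
        return f"I could not find a product with id '{product_id}'."
    availability = "in stock" if product.in_stock > 0 else "currently out of stock"
    return f"{product.name} costs ${product.price:.2f} and is {availability}."

print(get_product_info("blue-jacket"))
```

Returning a complete sentence rather than raw data keeps the result easy for the language model to relay through text-to-speech.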
Exploring Other Plugins
While this tutorial uses specific plugins for STT, TTS, and LLM, VideoSDK supports various other options. You can explore alternatives based on your requirements and preferences. For instance, the ElevenLabs TTS plugin offers advanced text-to-speech capabilities.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly set in the .env file. Double-check for any typos or missing values.
Audio Input/Output Problems
Verify that your microphone and speakers are correctly configured and accessible by the application.
Dependency and Version Conflicts
Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage these dependencies effectively.
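When debugging version conflicts, it helps to see exactly what is installed in the active environment. This sketch uses only the standard library; the helper name is hypothetical, and you should pass the package names you installed in Step 2.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(["python-dotenv"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Running it inside and outside the virtual environment is a quick way to confirm that the script is using the interpreter you expect.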
Conclusion
Summary of What You’ve Built
You have successfully built an AI Voice Assistant for live commerce using the VideoSDK framework. This agent can assist users in navigating live shopping platforms, providing product information, and enhancing the overall shopping experience.
Next Steps and Further Learning
Consider exploring additional plugins and customizing the agent further to suit specific business needs. Continue learning by experimenting with different configurations and extending the agent's capabilities. For a comprehensive understanding of the system's components, refer to the AI Voice Agent core components overview and the AI Voice Agent Sessions documentation.
FAQ