Introduction to AI Voice Agents in Multilingual Conversational AI
AI Voice Agents are sophisticated systems designed to interact with users through voice commands. They are capable of understanding spoken language, processing the information, and delivering a coherent response. These agents are crucial in the multilingual conversational AI industry, as they enable seamless communication across different languages, breaking down language barriers and enhancing user experience.
What is an AI Voice Agent?
An AI Voice Agent is a software application that uses artificial intelligence to process and respond to voice inputs. It leverages technologies like Speech-to-Text (STT), Large Language Models (LLMs), and Text-to-Speech (TTS) to convert spoken language into text, process the text to derive meaning, and then convert the response back into speech.
Why are they important for the Multilingual Conversational AI Industry?
In a globalized world, businesses often operate across multiple countries and languages. AI Voice Agents facilitate customer service, support, and interaction in the user's native language, thereby improving accessibility and satisfaction. They are used in various sectors, including e-commerce, healthcare, and customer support.
Core Components of a Voice Agent
- STT (Speech-to-Text): Converts spoken language into text.
- LLM (Large Language Model): Processes the text to generate a meaningful response.
- TTS (Text-to-Speech): Converts the text response back into spoken language.
For a comprehensive understanding of these elements, refer to the AI Voice Agent core components overview.
What You'll Build in This Tutorial
In this tutorial, you will build a multilingual AI Voice Agent using the VideoSDK framework. This agent will be capable of understanding and responding in multiple languages, making it an ideal solution for international customer service applications.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent architecture involves several components working together to process voice inputs and generate responses. The data flow begins with the user's speech, which is captured and processed through the following steps:
- Voice Activity Detection (VAD): Identifies when the user is speaking.
- Speech-to-Text (STT): Transcribes the spoken words into text.
- Large Language Model (LLM): Analyzes the text to generate a response.
- Text-to-Speech (TTS): Converts the response text back into speech.
- Turn Detector: Manages conversational turns between the user and the agent.

Understanding Key Concepts in the VideoSDK Framework
- Agent: Represents the core class of your bot, handling interactions and responses.
- CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS. For more details, explore the Cascading pipeline in AI Voice Agents.
- VAD & TurnDetector: Ensure the agent listens and responds at appropriate times.
Setting Up the Development Environment
Prerequisites
Before starting, ensure you have Python 3.11+ installed and a VideoSDK account, which you can create at app.videosdk.live.
Step 1: Create a Virtual Environment
To keep your project dependencies organized, create a virtual environment:
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`
Step 2: Install Required Packages
Install the necessary packages using pip:
pip install videosdk-agents videosdk-plugins
Step 3: Configure API Keys in a .env File
Create a .env file in your project directory and add your VideoSDK API keys:
VIDEOSDK_API_KEY=your_api_key_here
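The python-dotenv package is the usual way to load such a file; if you prefer to avoid the extra dependency, a few standard-library lines achieve the same for simple KEY=VALUE files. This is a minimal sketch, not what the VideoSDK plugins do internally:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    # Parse simple KEY=VALUE lines into os.environ.
    # Comments and blank lines are skipped; already-set variables win.
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Call load_env() once at the top of your script, before any plugin that reads the key is constructed.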
Building the AI Voice Agent: A Step-by-Step Guide
Below is the complete, runnable code for your AI Voice Agent. We will break it down in the following sections.
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Pre-downloading the Turn Detector model
pre_download_model()

agent_instructions = "You are a multilingual conversational AI designed to assist users in various languages. Your primary role is to act as a helpful customer service representative for an international e-commerce platform. You can answer questions about product details, shipping information, and return policies in multiple languages, including English, Spanish, French, and Mandarin. You are capable of understanding and responding to inquiries in the user's preferred language, ensuring a seamless and personalized experience.\n\nCapabilities:\n1. Provide detailed product information and specifications.\n2. Assist with order tracking and shipping inquiries.\n3. Explain return and refund policies clearly.\n4. Offer multilingual support, switching languages based on user preference.\n\nConstraints:\n1. You are not authorized to process payments or handle sensitive financial information.\n2. You must always include a disclaimer that users should refer to the official website for the most accurate and updated information.\n3. You cannot provide legal or financial advice and should direct users to consult with professionals for such inquiries."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Step 4.1: Generating a VideoSDK Meeting ID
To interact with your agent, you need a meeting ID. Use the following curl command to generate one:
curl -X POST "https://api.videosdk.live/v1/meetings" \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json"
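If you would rather stay in Python, the same request can be issued with the standard library. The URL and headers below mirror the curl command above; the response field name ("meetingId") is an assumption for illustration, so check the API reference for the exact response shape:

```python
import json
import os
import urllib.request

def build_meeting_request(api_key: str) -> urllib.request.Request:
    # Mirrors the curl call from the tutorial: POST with the API key
    # in the Authorization header.
    return urllib.request.Request(
        "https://api.videosdk.live/v1/meetings",
        method="POST",
        headers={"Authorization": api_key, "Content-Type": "application/json"},
    )

def create_meeting_id(api_key: str) -> str:
    # "meetingId" is an assumed field name; verify it against the API docs.
    with urllib.request.urlopen(build_meeting_request(api_key)) as resp:
        return json.loads(resp.read())["meetingId"]

if __name__ == "__main__":
    key = os.environ.get("VIDEOSDK_API_KEY")
    if key:
        print(create_meeting_id(key))
```

Keeping the request construction in its own function makes it easy to inspect or log the outgoing call without hitting the network.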
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class, defining the agent's behavior. It uses the agent_instructions to guide its interactions, ensuring it can handle multilingual queries effectively.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is central to processing audio inputs and generating responses. It integrates various plugins:
- STT: DeepgramSTT transcribes speech to text.
- LLM: OpenAILLM processes the text to generate a response.
- TTS: ElevenLabsTTS converts the response text back into speech.
- VAD: SileroVAD detects when the user is speaking.
- TurnDetector: Manages conversational turns.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
Step 4.4: Managing the Session and Startup Logic
The start_session function initializes the agent session and manages the lifecycle of the conversation. It connects to the VideoSDK service, starts the session, and handles cleanup upon termination.
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()
The make_context function sets up the room options for the agent, enabling it to operate in a test environment:
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)
Finally, the if __name__ == "__main__": block starts the job, ensuring the agent is ready to interact:
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Running and Testing the Agent
Step 5.1: Running the Python Script
To run your agent, execute the Python script:
python main.py
Step 5.2: Interacting with the Agent in the Playground
Once the script is running, check the console for a playground link. Open it in your browser to interact with your multilingual AI Voice Agent. You can speak to it in different languages and observe how it responds.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can extend the agent's capabilities by integrating custom tools. This involves creating additional functions that the agent can call to perform specific tasks, enhancing its utility.
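As an illustration, here is a hypothetical order-tracking tool written as a plain async function. The data source is stubbed, and the exact decorator or registration API for exposing such a function to the agent depends on your videosdk-agents version, so consult the framework documentation before wiring it in:

```python
import asyncio

async def track_order(order_id: str) -> dict:
    """Return the shipping status for the given order ID.

    Stubbed for illustration; a real tool would query your
    order-management API here.
    """
    fake_orders = {"A1001": "shipped", "A1002": "processing"}
    return {"order_id": order_id, "status": fake_orders.get(order_id, "unknown")}

if __name__ == "__main__":
    print(asyncio.run(track_order("A1001")))  # {'order_id': 'A1001', 'status': 'shipped'}
```

A tool like this pairs naturally with the multilingual instructions above: the LLM decides when to call it and phrases the result in the user's language.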
Exploring Other Plugins
The VideoSDK framework supports various plugins for STT, LLM, and TTS. Consider experimenting with different options to find the best fit for your use case.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the .env file. Double-check for typos or missing keys.
Audio Input/Output Problems
Verify your microphone and speaker settings. Ensure your system permissions allow audio input and output.
Dependency and Version Conflicts
Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage dependencies effectively.
Conclusion
Summary of What You've Built
In this tutorial, you have built a multilingual AI Voice Agent capable of interacting with users in multiple languages. This agent can handle customer inquiries, provide product information, and assist with order tracking.
Next Steps and Further Learning
Consider exploring additional plugins and tools to enhance your agent's capabilities. Continue learning about AI and voice technologies to stay ahead in the rapidly evolving field of conversational AI.
For more detailed insights into managing sessions, refer to the AI Voice Agent Sessions guide.