Build a Multilingual AI Voice Agent

Create a multilingual AI Voice Agent with VideoSDK. Follow this step-by-step guide with code examples to build and test your agent.

Introduction to AI Voice Agents in Multilingual Conversational AI

AI Voice Agents are sophisticated systems designed to interact with users through voice commands. They are capable of understanding spoken language, processing the information, and delivering a coherent response. These agents are crucial in the multilingual conversational AI industry, as they enable seamless communication across different languages, breaking down language barriers and enhancing user experience.

What is an AI Voice Agent?

An AI Voice Agent is a software application that uses artificial intelligence to process and respond to voice inputs. It leverages technologies like Speech-to-Text (STT), Large Language Models (LLMs), and Text-to-Speech (TTS) to convert spoken language into text, process the text to derive meaning, and then convert the response back into speech.

Why are they important for the Multilingual Conversational AI Industry?

In a globalized world, businesses often operate across multiple countries and languages. AI Voice Agents facilitate customer service, support, and interaction in the user's native language, thereby improving accessibility and satisfaction. They are used in various sectors, including e-commerce, healthcare, and customer support.

Core Components of a Voice Agent

  • STT (Speech-to-Text): Converts spoken language into text.
  • LLM (Large Language Model): Processes the text to generate a meaningful response.
  • TTS (Text-to-Speech): Converts the text response back into spoken language.
For a comprehensive understanding of these elements, refer to the AI voice Agent core components overview.

What You'll Build in This Tutorial

In this tutorial, you will build a multilingual AI Voice Agent using the VideoSDK framework. This agent will be capable of understanding and responding in multiple languages, making it an ideal solution for international customer service applications.

Architecture and Core Concepts

High-Level Architecture Overview

The AI Voice Agent architecture involves several components working together to process voice inputs and generate responses. The data flow begins with the user's speech, which is captured and processed through the following steps:
  1. Voice Activity Detection (VAD): Identifies when the user is speaking.
  2. Speech-to-Text (STT): Transcribes the spoken words into text.
  3. Large Language Model (LLM): Analyzes the text to generate a response.
  4. Text-to-Speech (TTS): Converts the response text back into speech.
  5. Turn Detector: Manages conversational turns between the user and the agent.
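Conceptually, one conversational turn through this cascade can be sketched as a simple loop. The stub functions below stand in for the real VAD, STT, LLM, and TTS engines; the actual VideoSDK pipeline runs these stages asynchronously on streaming audio:

```python
# Minimal sketch of the cascading flow; each stub stands in for a real engine.
def detect_speech(audio: bytes) -> bool:
    """VAD stub: a real detector scores audio frames; here, any audio counts."""
    return len(audio) > 0

def transcribe(audio: bytes) -> str:
    """STT stub: a real engine would return the user's transcribed words."""
    return "what is your return policy"

def generate_reply(text: str) -> str:
    """LLM stub: a real model would reason over the transcript."""
    return f"You asked: '{text}'. Please see our returns page."

def synthesize(text: str) -> bytes:
    """TTS stub: a real engine returns audio samples, not encoded text."""
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn: VAD -> STT -> LLM -> TTS."""
    if not detect_speech(audio):
        return b""  # no speech detected, nothing to respond to
    user_text = transcribe(audio)
    reply_text = generate_reply(user_text)
    return synthesize(reply_text)
```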

Understanding Key Concepts in the VideoSDK Framework

  • Agent: Represents the core class of your bot, handling interactions and responses.
  • CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS. For more details, explore the Cascading pipeline in AI voice Agents.
  • VAD & TurnDetector: Ensure the agent listens and responds at appropriate times.

Setting Up the Development Environment

Prerequisites

Before starting, ensure you have Python 3.11+ installed and a VideoSDK account, which you can create at app.videosdk.live.

Step 1: Create a Virtual Environment

To keep your project dependencies organized, create a virtual environment:
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip:
pip install videosdk-agents videosdk-plugins

Step 3: Configure API Keys in a .env File

Create a .env file in your project directory and add your VideoSDK API key. The pipeline built later in this tutorial also calls Deepgram, OpenAI, and ElevenLabs, so add keys for those providers as well (the variable names below follow each provider's convention; check each plugin's documentation for the exact name it reads):
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
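Because the agent depends on several external services, it can help to fail fast at startup when a key is missing. The `require_env` helper below is illustrative, not part of the SDK:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, raising if it is unset.

    Hypothetical helper: call it once per required key before starting the
    agent, so a missing credential surfaces immediately instead of mid-call.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```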

Building the AI Voice Agent: A Step-by-Step Guide

Below is the complete, runnable code for your AI Voice Agent. We will break it down in the following sections.
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session starts without delay
pre_download_model()

agent_instructions = "You are a multilingual conversational AI designed to assist users in various languages. Your primary role is to act as a helpful customer service representative for an international e-commerce platform. You can answer questions about product details, shipping information, and return policies in multiple languages, including English, Spanish, French, and Mandarin. You are capable of understanding and responding to inquiries in the user's preferred language, ensuring a seamless and personalized experience.\n\nCapabilities:\n1. Provide detailed product information and specifications.\n2. Assist with order tracking and shipping inquiries.\n3. Explain return and refund policies clearly.\n4. Offer multilingual support, switching languages based on user preference.\n\nConstraints:\n1. You are not authorized to process payments or handle sensitive financial information.\n2. You must always include a disclaimer that users should refer to the official website for the most accurate and updated information.\n3. You cannot provide legal or financial advice and should direct users to consult with professionals for such inquiries."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

If you want your agent to join a specific, pre-created room, generate a meeting ID with the following curl command (when room_id is omitted and playground=True, as in the code above, a room is created automatically):
curl -X POST "https://api.videosdk.live/v1/meetings" \
-H "Authorization: YOUR_API_KEY" \
-H "Content-Type: application/json"

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class, defining the agent's behavior. It uses the agent_instructions to guide its interactions, ensuring it can handle multilingual queries effectively.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is central to processing audio inputs and generating responses. It integrates various plugins:
  • STT: DeepgramSTT transcribes speech to text. Note that language="en" restricts transcription to English; to handle other languages, pass the appropriate Deepgram language code supported by your chosen model.
  • LLM: OpenAILLM processes the text to generate a response.
  • TTS: ElevenLabsTTS converts the response text back into speech.
  • VAD: SileroVAD detects when the user is speaking.
  • TurnDetector: Manages conversational turns.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent session and manages the lifecycle of the conversation. It connects to the VideoSDK service, starts the session, and handles cleanup upon termination.
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()
The make_context function sets up the room options for the agent, enabling it to operate in a test environment:
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)
Finally, the if __name__ == "__main__": block starts the job, ensuring the agent is ready to interact:
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

To run your agent, execute the Python script:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, check the console for a playground link. Open it in your browser to interact with your multilingual AI Voice Agent. You can speak to it in different languages and observe how it responds.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend the agent's capabilities by integrating custom tools. This involves creating additional functions that the agent can call to perform specific tasks, enhancing its utility.
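For example, a tool that returns shipping estimates can start life as a plain Python function. The function below is hypothetical (the country data is canned), and how it gets registered with the agent, whether via a decorator or a tools argument, depends on the videosdk-agents API, so consult the framework documentation for the exact wiring:

```python
def get_shipping_estimate(country_code: str) -> str:
    """Illustrative custom tool: return a canned shipping estimate.

    In a real deployment this would query a shipping service; the
    estimates here are hypothetical placeholder data.
    """
    estimates = {
        "US": "3-5 business days",
        "FR": "5-8 business days",
        "CN": "7-10 business days",
    }
    # Fall back to a generic estimate for destinations we don't know about
    return estimates.get(country_code.upper(), "10-15 business days")
```

Once registered, the LLM can decide to call the tool when a user asks about shipping, and its return value flows back into the generated response.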

Exploring Other Plugins

The VideoSDK framework supports various plugins for STT, LLM, and TTS. Consider experimenting with different options to find the best fit for your use case.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file. Double-check for typos or missing keys.

Audio Input/Output Problems

Verify your microphone and speaker settings. Ensure your system permissions allow audio input and output.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage dependencies effectively.

Conclusion

Summary of What You've Built

In this tutorial, you have built a multilingual AI Voice Agent capable of interacting with users in multiple languages. This agent can handle customer inquiries, provide product information, and assist with order tracking.

Next Steps and Further Learning

Consider exploring additional plugins and tools to enhance your agent's capabilities. Continue learning about AI and voice technologies to stay ahead in the rapidly evolving field of conversational AI.
For more detailed insights into managing sessions, refer to the AI voice Agent Sessions documentation.
