Build a Multilingual ASR Voice Agent

Step-by-step guide to building a multilingual ASR voice agent using VideoSDK. Includes complete code and testing instructions.

Introduction to AI Voice Agents in Multilingual ASR

What is an AI Voice Agent?

An AI Voice Agent is a sophisticated software application designed to interact with users through voice commands. These agents leverage advanced technologies like Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) to understand and respond to human speech. They are capable of performing tasks, answering questions, and providing information in a conversational manner.

Why are they important for the multilingual ASR industry?

In the multilingual ASR industry, AI Voice Agents play a crucial role by enabling seamless communication across different languages. They help break down language barriers, allowing businesses to reach a global audience. Use cases include customer support, virtual assistants, and real-time translation services.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Models (LLM): Processes and understands the text to generate appropriate responses.
  • Text-to-Speech (TTS): Converts the response text back into spoken language.
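To make the data flow concrete, here is a minimal Python sketch of the cascade, with stub functions standing in for the real STT, LLM, and TTS services used later in this tutorial:

```python
# Conceptual sketch of the cascaded flow. Each stage is a stub standing in
# for the real service (e.g. Deepgram for STT, GPT-4o for the LLM,
# ElevenLabs for TTS).
def speech_to_text(audio: bytes) -> str:
    # A real STT service would transcribe the audio frames.
    return "what time is it"

def generate_response(text: str) -> str:
    # A real LLM would produce a contextual reply.
    return f"You asked: '{text}'. Let me check."

def text_to_speech(text: str) -> bytes:
    # A real TTS service would synthesize audio from the reply.
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    transcript = speech_to_text(audio)
    reply = generate_response(transcript)
    return text_to_speech(reply)

print(handle_turn(b"\x00\x01"))
```

Each user turn passes through the three stages in order, which is exactly the structure the CascadingPipeline formalizes below.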

What You'll Build in This Tutorial

In this tutorial, you will build a multilingual AI Voice Agent using the VideoSDK framework. This agent will be capable of recognizing and transcribing speech in multiple languages and providing real-time responses to users.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of our AI Voice Agent involves several key components working together to process user speech and generate responses. The process begins with capturing audio input, which is then analyzed and transcribed by the STT component. The transcribed text is processed by the LLM to generate a response, which is finally converted into speech by the TTS component.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, handling interactions and managing the conversation flow.
  • CascadingPipeline: A structured flow of audio processing components, including STT, LLM, and TTS. You can explore more about the Cascading pipeline in AI voice Agents to understand its significance.
  • VAD & TurnDetector: These components help the agent determine when to listen and when to respond, ensuring smooth interactions. The Turn detector for AI voice Agents is crucial for managing conversation flow effectively.
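As a rough intuition for what these two components do: the real SileroVAD and TurnDetector use neural models, so the energy-threshold sketch below is only an illustration of the idea, not the actual algorithm.

```python
# Toy VAD and turn detection: classify a frame as speech by its average
# amplitude, and end the user's turn after N consecutive silent frames.
def is_speech(frame: list[float], threshold: float = 0.35) -> bool:
    # Mean absolute amplitude as a crude "energy" measure.
    energy = sum(abs(s) for s in frame) / len(frame)
    return energy > threshold

def detect_turn_end(frames: list[list[float]], silence_frames: int = 3) -> bool:
    # Declare the turn over once we see `silence_frames` non-speech frames in a row.
    run = 0
    for frame in frames:
        run = 0 if is_speech(frame) else run + 1
        if run >= silence_frames:
            return True
    return False

speech = [0.8, -0.7, 0.9]     # loud frame -> treated as speech
silence = [0.01, -0.02, 0.0]  # quiet frame -> treated as silence
print(detect_turn_end([speech, speech, silence, silence, silence]))
```

The `threshold` parameters in the real pipeline (SileroVAD's 0.35, TurnDetector's 0.8) play an analogous role: they trade off responsiveness against the risk of cutting the user off mid-sentence.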

Setting Up the Development Environment

Prerequisites

Before you start, ensure you have Python 3.11 or higher installed. You will also need a VideoSDK account, which you can create at app.videosdk.live.

Step 1: Create a Virtual Environment

To keep dependencies organized, create a virtual environment:
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Step 2: Install Required Packages

Install the VideoSDK agents SDK (with the plugin extras used in this tutorial) and python-dotenv using pip:
pip install "videosdk-agents[silero,turn_detector,deepgram,openai,elevenlabs]"
pip install python-dotenv

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory and add your VideoSDK API key, along with the keys for the Deepgram, OpenAI, and ElevenLabs services used by the pipeline:
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here

Building the AI Voice Agent: A Step-by-Step Guide

Here is the complete code for the AI Voice Agent:
import asyncio, os
from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load API keys from the .env file
load_dotenv()

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = (
    "You are a multilingual voice assistant specializing in automatic speech "
    "recognition (ASR) across multiple languages. Your primary role is to assist "
    "users by accurately transcribing spoken language into text in real time. "
    "You can handle a variety of languages, including but not limited to English, "
    "Spanish, French, Mandarin, and Hindi. Your capabilities include recognizing "
    "different accents and dialects within these languages, letting users switch "
    "languages seamlessly, and offering suggestions for improving speech clarity "
    "if needed. However, you are not capable of translating text between languages "
    "or providing language learning services. You must inform users that while you "
    "strive for high accuracy, occasional transcription errors may occur, and they "
    "should verify critical information independently. You are designed to respect "
    "user privacy and ensure that all transcriptions are handled securely."
)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),  # try language="multi" for multilingual input, if your Deepgram plan supports it
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To create a meeting ID, use the following curl command:
curl -X POST "https://api.videosdk.live/v1/meetings" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
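For reference, the same request can be constructed in Python with only the standard library. Actually sending it (via `urllib.request.urlopen`) requires a valid API key, so the sketch below only builds and inspects the request:

```python
# Build the same POST request the curl command above describes.
import urllib.request

req = urllib.request.Request(
    "https://api.videosdk.live/v1/meetings",
    method="POST",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
)

# urllib.request.urlopen(req) would send it and return the meeting details.
print(req.get_method(), req.full_url)
```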

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class from the VideoSDK framework. It initializes with specific instructions and defines actions on entering and exiting a session.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial for processing audio. It consists of several components:
  • STT: Uses DeepgramSTT to transcribe speech.
  • LLM: Utilizes OpenAILLM for generating responses.
  • TTS: Employs ElevenLabsTTS to convert text back to speech.
  • VAD: SileroVAD helps in detecting voice activity. Learn more about Silero Voice Activity Detection to enhance your understanding of its role.
  • TurnDetector: Manages conversation turns.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function manages the agent session lifecycle, ensuring the agent connects and operates within a session.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
The make_context function creates a job context with room options, enabling the agent to join or create a meeting room.
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)
Finally, the main block initializes and starts the worker job:
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

To run the agent, execute the following command:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, the console will display a playground link. Open this link in your browser to interact with the agent. Speak into your microphone, and the agent will respond in real-time.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend the agent's capabilities by integrating custom tools using the function_tool feature of the VideoSDK framework. This allows you to add new functionalities tailored to specific needs.
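The exact function_tool signature comes from the VideoSDK framework, so the sketch below uses a hypothetical stand-in decorator purely to illustrate the idea: a plain Python function is registered under its name so the agent (via the LLM) can invoke it as a tool.

```python
# Conceptual tool registry; the real function_tool decorator in the VideoSDK
# framework may differ in signature and behavior.
TOOLS = {}

def function_tool(fn):
    # Register the function so it can be looked up and called by name.
    TOOLS[fn.__name__] = fn
    return fn

@function_tool
def get_supported_languages() -> list[str]:
    """Return the languages this assistant claims to transcribe."""
    return ["English", "Spanish", "French", "Mandarin", "Hindi"]

print(TOOLS["get_supported_languages"]())
```

In the real framework, the function's name, docstring, and type hints are typically exposed to the LLM so it knows when and how to call the tool.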

Exploring Other Plugins

While this tutorial uses specific plugins for STT, LLM, and TTS, the VideoSDK framework supports various other options. Explore different plugins to find the best fit for your requirements.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly set in the .env file. Double-check for typos or missing information.
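A quick fail-fast check at startup makes a missing key obvious immediately. The Deepgram, OpenAI, and ElevenLabs key names below are the conventional ones for those services and are an assumption, not part of the SDK:

```python
# Fail fast if any required API key is absent from the environment.
import os

# Assumed key names; adjust to match your .env file.
REQUIRED_KEYS = ["VIDEOSDK_API_KEY", "DEEPGRAM_API_KEY",
                 "OPENAI_API_KEY", "ELEVENLABS_API_KEY"]

def missing_keys(env=os.environ) -> list[str]:
    # Treat unset and empty values the same way.
    return [k for k in REQUIRED_KEYS if not env.get(k)]

missing = missing_keys()
if missing:
    print(f"Missing keys: {', '.join(missing)}")
```

Run this before starting the agent so authentication failures surface as a clear message rather than a mid-session error.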

Audio Input/Output Problems

Verify that your microphone and speakers are properly configured and working. Check system settings and permissions.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage package versions effectively.

Conclusion

Summary of What You've Built

In this tutorial, you've built a multilingual AI Voice Agent capable of processing and responding to speech in multiple languages using the VideoSDK framework.

Next Steps and Further Learning

Explore additional features and plugins offered by VideoSDK to enhance your agent's capabilities. Continue learning by experimenting with different configurations and customizations. For a comprehensive understanding, refer to the AI voice Agent core components overview and AI voice Agent Sessions guides to deepen your knowledge of the framework's capabilities.
