Kaldi ASR AI Voice Agent Guide

Step-by-step guide to building a Kaldi ASR AI Voice Agent using VideoSDK. Includes code examples and testing instructions.

Introduction to AI Voice Agents in Kaldi ASR

What is an AI Voice Agent?

An AI Voice Agent is a sophisticated software entity designed to interact with users through voice commands. These agents leverage advanced technologies such as Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) to comprehend and respond to human speech. They are increasingly prevalent in various industries, offering hands-free assistance and enhancing user experiences.

Why are they important for the Kaldi ASR industry?

In the realm of Kaldi ASR, AI Voice Agents play a pivotal role by providing real-time speech recognition and processing capabilities. These agents are crucial for applications such as voice-controlled devices, customer support automation, and accessibility tools, where understanding and generating human-like responses are essential.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Model (LLM): Processes the text to generate meaningful responses.
  • Text-to-Speech (TTS): Converts the generated text back into speech for user interaction.
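The three components above form a loop: audio in, text out of STT, a reply out of the LLM, and audio back out of TTS. As a rough, framework-agnostic illustration (every function here is a placeholder, not a VideoSDK or Kaldi API), the cascade can be sketched as:

```python
# Conceptual sketch of an STT -> LLM -> TTS cascade.
# Each stage is a stand-in for a real service.

def speech_to_text(audio: bytes) -> str:
    # A real STT engine (e.g. Kaldi, Deepgram) would decode the audio here.
    return "what is kaldi"

def generate_reply(text: str) -> str:
    # A real LLM would produce a contextual answer here.
    return f"You asked: {text}. Kaldi is an open-source ASR toolkit."

def text_to_speech(text: str) -> bytes:
    # A real TTS engine would synthesize speech audio here.
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    # One full conversational turn: audio in, audio out.
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    return text_to_speech(reply)
```

In the real agent built later in this tutorial, each of these stages is a pluggable component in a pipeline rather than a hard-coded function.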

What You'll Build in This Tutorial

In this tutorial, we will guide you through building a Kaldi ASR AI Voice Agent using the VideoSDK framework. You will learn how to integrate various components to create a fully functional voice agent capable of understanding and responding to user queries.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of our AI Voice Agent involves a seamless flow of data from user speech to agent response. The process begins with capturing audio input, which is then processed through a series of stages: Speech-to-Text (STT), Natural Language Processing (NLP), and Text-to-Speech (TTS). Each stage plays a critical role in ensuring accurate and contextually relevant interactions. This flow is managed through a cascading pipeline, ensuring efficient data processing.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: Represents the core logic of your voice bot, handling interactions and managing the conversation flow.
  • CascadingPipeline: Defines the sequence of audio processing, including STT, LLM, and TTS, to ensure smooth data flow.
  • VAD & TurnDetector: These components determine when the agent should listen or speak, enhancing interaction efficiency. Silero Voice Activity Detection and a turn detector are crucial for managing these interactions.
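To make the listen/speak hand-off concrete, here is a simplified sketch of how a VAD speech score and a turn-detector score might gate the agent's response. The thresholds match the values used in the pipeline later in this tutorial, but the gating logic itself is illustrative, not VideoSDK's internal implementation:

```python
# Illustrative gating logic: the VAD flags speech frames, and the turn
# detector estimates whether the user has finished their turn.

VAD_THRESHOLD = 0.35   # speech-probability cutoff (matches SileroVAD below)
TURN_THRESHOLD = 0.8   # end-of-turn probability cutoff (matches TurnDetector below)

def is_speech(vad_score: float) -> bool:
    # True while the VAD thinks the user is currently speaking.
    return vad_score >= VAD_THRESHOLD

def should_agent_respond(vad_score: float, end_of_turn_score: float) -> bool:
    # Respond only when the user is silent AND the turn detector is
    # confident the utterance is complete. This avoids interrupting
    # mid-sentence pauses.
    return (not is_speech(vad_score)) and end_of_turn_score >= TURN_THRESHOLD
```

Raising either threshold makes the agent more patient; lowering them makes it more eager to jump in.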

Setting Up the Development Environment

Prerequisites

Before diving into the implementation, ensure you have the following:
  • Python 3.11+ installed on your machine.
  • A VideoSDK account, which you can create at app.videosdk.live.

Step 1: Create a Virtual Environment

To maintain a clean workspace, it is recommended to use a virtual environment. Run the following commands:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the core package using pip. Note that the plugins imported later (Silero, the turn detector, Deepgram, OpenAI, and ElevenLabs) may ship as separate packages; check the VideoSDK documentation for the exact package names if any import fails:
pip install videosdk

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory and add your VideoSDK API key:
VIDEOSDK_API_KEY=your_api_key_here
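Note that a .env file is not read automatically; your script has to load it into the environment. The python-dotenv package does this with a single `load_dotenv()` call; as a stdlib-only sketch of the same idea:

```python
import os

def load_env_file(path: str = ".env") -> None:
    # Minimal .env loader: reads KEY=VALUE lines into os.environ,
    # skipping blanks and comments. python-dotenv's load_dotenv()
    # provides the same behavior with more features.
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env file is not an error; the key may be set elsewhere

load_env_file()
api_key = os.environ.get("VIDEOSDK_API_KEY")
```

Whichever loader you use, verify `VIDEOSDK_API_KEY` is actually set before starting the agent, since authentication failures otherwise surface as confusing connection errors.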

Building the AI Voice Agent: A Step-by-Step Guide

First, let's present the complete code block that we'll be working with:
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are an AI Voice Agent specializing in Automatic Speech Recognition (ASR) using the Kaldi framework. Your persona is that of a knowledgeable and efficient technical assistant. Your primary capabilities include: \n\n1. Providing detailed explanations about the Kaldi ASR framework, including its features, benefits, and typical use cases.\n2. Assisting users in setting up and configuring Kaldi ASR for various applications.\n3. Offering troubleshooting tips and solutions for common issues encountered with Kaldi ASR.\n4. Guiding users through the process of integrating Kaldi ASR with other systems and platforms.\n\nConstraints and Limitations:\n- You are not a substitute for professional technical support or consulting services. Always recommend consulting with a professional for complex issues or custom implementations.\n- You must not provide any medical, legal, or financial advice.\n- Ensure that all technical guidance is based on the latest stable release of the Kaldi ASR framework.\n- Include a disclaimer that the information provided is for educational purposes and should be verified with official Kaldi documentation or a professional expert."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Now, let's break down this code to understand each component.

Step 4.1: Generating a VideoSDK Meeting ID

To interact with your agent, you need a meeting ID. Use the following curl command to generate one:
curl -X POST "https://api.videosdk.live/v2/rooms" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"region": "us-east"}'
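If you prefer to create the room from Python, the same request can be assembled with the standard library. This mirrors the curl command above (substitute your real API token; the exact shape of the JSON response is documented in the VideoSDK API reference):

```python
import json
import urllib.request

def build_room_request(token: str, region: str = "us-east") -> urllib.request.Request:
    # Mirrors the curl command: POST /v2/rooms with a JSON body.
    body = json.dumps({"region": region}).encode("utf-8")
    return urllib.request.Request(
        "https://api.videosdk.live/v2/rooms",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually send the request:
# with urllib.request.urlopen(build_room_request("YOUR_API_KEY")) as resp:
#     room = json.load(resp)  # the response contains the new room's ID
```

Separating request construction from sending also makes the call easy to unit-test without hitting the network.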

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where we define the behavior of our voice agent. It inherits from the Agent class and sets specific instructions for the agent's persona and capabilities.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial as it dictates the flow of data through the agent. Each plugin is responsible for a specific task:
  • STT: Converts speech to text using Deepgram.
  • LLM: Processes the text with OpenAI's GPT-4o.
  • TTS: Converts text to speech with ElevenLabs.
  • VAD: Detects when the user is speaking.
  • TurnDetector: Determines when the agent should respond.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function manages the lifecycle of the agent's session, ensuring it connects and starts correctly. The make_context function sets up the room options for the AI Agent playground.
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

To run your agent, execute the following command in your terminal:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the agent is running, you will see a playground link in the console. Open this link in your browser to interact with your agent. You can speak to the agent and receive responses in real-time.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows you to extend your agent's capabilities by integrating custom tools. This enables you to tailor the agent's functionality to specific use cases.
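The exact decorator and registration calls for VideoSDK tools are covered in the framework's documentation, but the general pattern of "registering named functions the LLM can invoke" looks like the following sketch. All names here are illustrative, not VideoSDK APIs:

```python
# Illustrative tool registry: the LLM selects a tool by name and the
# agent runtime dispatches the call with the LLM-supplied arguments.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    # Register a function as a callable tool under its own name.
    TOOLS[fn.__name__] = fn
    return fn

@tool
def lookup_kaldi_recipe(dataset: str) -> str:
    # In a real agent this might search Kaldi's egs/ recipe directory.
    return f"See egs/{dataset} in the Kaldi repository for a starter recipe."

def dispatch(name: str, **kwargs) -> str:
    # The runtime would call this when the LLM requests a tool invocation.
    return TOOLS[name](**kwargs)
```

A registry like this keeps tool definitions declarative: adding a capability to the agent is just decorating one more function.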

Exploring Other Plugins

While this tutorial uses specific plugins for STT, LLM, and TTS, the VideoSDK framework supports various other options. Explore these plugins to enhance your agent's performance and capabilities.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API key is correctly set in the .env file and that your account has the necessary permissions.

Audio Input/Output Problems

Check your microphone and speaker settings to ensure they are properly configured and recognized by your system.

Dependency and Version Conflicts

Ensure all packages are up-to-date and compatible with your Python version. Use a virtual environment to manage dependencies effectively.

Conclusion

Summary of What You've Built

In this tutorial, you've built a fully functional AI Voice Agent specializing in Kaldi ASR topics, using the VideoSDK framework. You learned how to integrate various plugins and manage the agent's lifecycle.

Next Steps and Further Learning

To further enhance your skills, explore advanced customization options and experiment with different plugins. Consider contributing to the VideoSDK community by sharing your projects and insights. For a comprehensive understanding, refer to the AI Voice Agent core components overview and the AI Voice Agent Sessions documentation to deepen your knowledge.
