Build a Rasa NLU Voice Agent with VideoSDK

Step-by-step guide to building a Rasa NLU AI Voice Agent with VideoSDK.

Introduction to AI Voice Agents in Rasa NLU

AI Voice Agents are software programs designed to interact with humans through voice commands. They are capable of understanding natural language, processing it, and responding in a human-like manner. In the context of Rasa NLU, these agents leverage natural language understanding to provide intelligent responses and facilitate seamless human-computer interaction.

Why are they important for the Rasa NLU industry?

In the Rasa NLU industry, AI Voice Agents play a crucial role in enhancing customer experience, automating support, and providing personalized interactions. They are used in various applications such as virtual assistants, customer service bots, and interactive voice response systems.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Model (LLM): Processes the text and generates a response.
  • Text-to-Speech (TTS): Converts the response text back into spoken language.
For a comprehensive understanding, refer to the AI voice Agent core components overview.
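The STT → LLM → TTS cascade above can be sketched as a simple function chain. The stub functions below are placeholders for illustration only, not real speech or language engines:

```python
# Minimal sketch of the STT -> LLM -> TTS cascade using stand-in functions.
# Each stage is a placeholder; a real agent would call actual engines here.

def speech_to_text(audio: bytes) -> str:
    # Placeholder: pretend the audio decodes to this utterance.
    return "what is rasa nlu"

def generate_response(text: str) -> str:
    # Placeholder: a real pipeline would call an LLM here.
    return f"You asked: {text}"

def text_to_speech(text: str) -> bytes:
    # Placeholder: a real pipeline would synthesize audio here.
    return text.encode("utf-8")

def run_cascade(audio: bytes) -> bytes:
    transcript = speech_to_text(audio)
    reply = generate_response(transcript)
    return text_to_speech(reply)

print(run_cascade(b"\x00\x01").decode("utf-8"))
# -> You asked: what is rasa nlu
```

The point of the cascade is that each stage has a narrow, swappable contract, which is exactly what lets frameworks mix and match STT, LLM, and TTS providers.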

What You'll Build in This Tutorial

In this tutorial, you will build a fully functional AI Voice Agent using Rasa NLU and VideoSDK. The agent will understand user queries, process them using a language model, and respond with synthesized speech.

Architecture and Core Concepts

High-Level Architecture Overview

The AI Voice Agent architecture involves several components working in harmony. The user speaks into a microphone; the captured audio passes through Speech-to-Text (STT), language model processing, and Text-to-Speech (TTS); the synthesized audio is then played back to the user.

[Diagram: microphone → STT → LLM → TTS → speaker]

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, responsible for managing interactions.
  • CascadingPipeline: Manages the flow of audio processing, integrating STT, LLM, and TTS. Learn more about the Cascading pipeline in AI voice Agents.
  • VAD & TurnDetector: Tools that help the agent determine when to listen and when to speak. Explore the Turn detector for AI voice Agents for more details.
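To make the VAD idea concrete, here is a toy energy-threshold detector. Production VADs such as Silero use trained neural models, so this is purely illustrative:

```python
# Toy voice-activity detector: a frame counts as "speech" if its average
# absolute amplitude exceeds a threshold. Real VADs (e.g. Silero) use
# trained models; this only illustrates the concept.

def is_speech(frame: list[float], threshold: float = 0.35) -> bool:
    if not frame:
        return False
    energy = sum(abs(s) for s in frame) / len(frame)
    return energy > threshold

print(is_speech([0.6, 0.7, 0.5]))    # loud frame -> True
print(is_speech([0.01, 0.02, 0.0]))  # near-silence -> False
```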

Setting Up the Development Environment

Prerequisites

To get started, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage dependencies:
```bash
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Step 2: Install Required Packages

Install the necessary packages using pip. The code in this guide also relies on VideoSDK's plugin packages for Silero, the turn detector, Deepgram, OpenAI, and ElevenLabs; the package names below follow the SDK's convention, but check the VideoSDK docs if an import fails:

```bash
pip install videosdk-agents
pip install videosdk-plugins-silero videosdk-plugins-turn-detector \
            videosdk-plugins-deepgram videosdk-plugins-openai \
            videosdk-plugins-elevenlabs
pip install python-dotenv
```

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory and add your API keys. The pipeline in this guide also uses Deepgram, OpenAI, and ElevenLabs; their plugins typically read keys from the environment, so add those as well (the variable names follow common convention — check each plugin's docs):

```bash
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
```
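Under the hood, python-dotenv reads KEY=value lines from the file into the process environment. Here is a stdlib-only sketch of that behavior (illustrative, not the library's actual implementation):

```python
import os
import tempfile

# Minimal sketch of what python-dotenv's load_dotenv does:
# read KEY=value lines from a file into os.environ.
def load_env_file(path: str) -> None:
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

# Demo with a temporary .env-style file and an example key name.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("EXAMPLE_API_KEY=demo_key\n")
    path = f.name

load_env_file(path)
print(os.environ["EXAMPLE_API_KEY"])  # -> demo_key
```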

Building the AI Voice Agent: A Step-by-Step Guide

Here is the complete code for the AI Voice Agent:
```python
import asyncio

from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load API keys from the .env file created earlier.
load_dotenv()

# Download the turn-detector model once, before the agent starts.
pre_download_model()

agent_instructions = "You are a knowledgeable AI Voice Agent specializing in natural language understanding using Rasa NLU. Your primary role is to assist users in understanding and implementing Rasa NLU for their projects. You can provide detailed explanations, answer questions about Rasa NLU features, and guide users through the setup and configuration process. However, you are not a substitute for professional technical support, and users should consult official Rasa documentation or support for complex issues. Always remind users to verify their implementations with official resources."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session alive until the process is stopped.
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Step 4.1: Generating a VideoSDK Meeting ID

To generate a meeting ID, use the following curl command:
```bash
curl -X POST "https://api.videosdk.live/v1/rooms" \
  -H "Authorization: Bearer your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{}'
```
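If you prefer Python over curl, the same request can be built with the standard library. This mirrors the endpoint and headers of the curl command above; actually sending it requires a valid API key and network access:

```python
import json
import urllib.request

# Build the same POST request as the curl command above.
def build_create_room_request(token: str) -> urllib.request.Request:
    return urllib.request.Request(
        url="https://api.videosdk.live/v1/rooms",
        data=json.dumps({}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_create_room_request("your_api_key_here")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```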

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is a custom implementation of the Agent class. It defines the agent's behavior upon entering and exiting a session. The on_enter and on_exit methods are used to greet and bid farewell to users.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is the heart of the agent's processing capabilities. It integrates the following plugins:
  • DeepgramSTT (model nova-2) transcribes the user's speech.
  • OpenAILLM (model gpt-4o) generates the response.
  • ElevenLabsTTS (model eleven_flash_v2_5) synthesizes the reply.
  • SileroVAD (threshold 0.35) detects when the user is speaking.
  • TurnDetector (threshold 0.8) decides when the user has finished a turn.

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent and its session. It sets up the conversation flow and the processing pipeline. The make_context function configures the room options for the VideoSDK session.
The if __name__ == "__main__": block ensures that the agent starts running when the script is executed.
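The connect → start → close → shutdown structure of start_session can be isolated with stub objects. This sketch demonstrates only the asyncio lifecycle pattern, not the real VideoSDK classes:

```python
import asyncio

# Stub context/session that record the order of lifecycle calls,
# mirroring start_session's try/finally structure without VideoSDK.
calls = []

class StubContext:
    async def connect(self):
        calls.append("connect")
    async def shutdown(self):
        calls.append("shutdown")

class StubSession:
    async def start(self):
        calls.append("start")
    async def close(self):
        calls.append("close")

async def run_once():
    context, session = StubContext(), StubSession()
    try:
        await context.connect()
        await session.start()
        # A real agent would block here (await asyncio.Event().wait()).
    finally:
        await session.close()
        await context.shutdown()

asyncio.run(run_once())
print(calls)  # -> ['connect', 'start', 'close', 'shutdown']
```

The try/finally block guarantees that the session and context are torn down even if startup raises, which is why the cleanup calls live there rather than after the wait.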

Running and Testing the Agent

Step 5.1: Running the Python Script

Run the script using:
```bash
python main.py
```

Step 5.2: Interacting with the Agent in the Playground

After running the script, find the AI Agent playground link in the console. Use this link to join the session and interact with your agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend the agent's functionality by integrating custom tools. This allows for more specialized interactions and processing capabilities.
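One generic way to wire custom tools is a name-to-function registry that the LLM layer can dispatch into. This is a framework-agnostic sketch with a hypothetical get_weather tool, not VideoSDK's actual tool API:

```python
# Framework-agnostic sketch of a tool registry: the LLM layer maps a
# recognized tool name to a Python callable.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("get_weather")
def get_weather(city: str) -> str:
    # Hypothetical tool: a real one would call a weather API.
    return f"The weather in {city} is sunny."

def dispatch(name: str, **kwargs) -> str:
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](**kwargs)

print(dispatch("get_weather", city="Berlin"))
# -> The weather in Berlin is sunny.
```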

Exploring Other Plugins

Explore other plugins for STT, LLM, and TTS to enhance your agent's performance and capabilities.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file.
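A quick startup check can catch missing keys before the agent connects. The key names below follow the .env example earlier; adjust the list to the plugins you actually use:

```python
import os

# Fail fast if an expected API key is missing from the environment.
REQUIRED_KEYS = ["VIDEOSDK_API_KEY"]

def missing_keys(keys=REQUIRED_KEYS):
    return [k for k in keys if not os.environ.get(k)]

os.environ["VIDEOSDK_API_KEY"] = "demo"  # simulate a configured key
print(missing_keys())  # -> []
```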

Audio Input/Output Problems

Check your audio device settings and ensure they are properly configured for input and output.

Dependency and Version Conflicts

Ensure all dependencies are compatible with your Python version and each other.

Conclusion

Summary of What You've Built

You have successfully built an AI Voice Agent using Rasa NLU and VideoSDK. This agent can understand and respond to user queries using advanced language models.

Next Steps and Further Learning

Explore additional features and plugins to enhance your agent. Consider integrating with other APIs and services to expand its capabilities.
