Introduction to AI Voice Agents in Rasa NLU
AI Voice Agents are software programs designed to interact with humans through voice commands. They are capable of understanding natural language, processing it, and responding in a human-like manner. In the context of Rasa NLU, these agents leverage natural language understanding to provide intelligent responses and facilitate seamless human-computer interaction.
Why are they important for the Rasa NLU industry?
In the Rasa NLU industry, AI Voice Agents play a crucial role in enhancing customer experience, automating support, and providing personalized interactions. They are used in various applications such as virtual assistants, customer service bots, and interactive voice response systems.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Large Language Model (LLM): Processes the text and generates a response.
- Text-to-Speech (TTS): Converts the response text back into spoken language.
For a comprehensive understanding, refer to the AI Voice Agent core components overview.
What You'll Build in This Tutorial
In this tutorial, you will build a fully functional AI Voice Agent using Rasa NLU and VideoSDK. The agent will understand user queries, process them using a language model, and respond with synthesized speech.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent architecture involves several components working in harmony. The user speaks into a microphone, and the captured audio is processed through a series of steps: Speech-to-Text (STT), language model processing, and Text-to-Speech (TTS). The synthesized audio is then played back to the user.
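Before diving into the framework, it helps to see this loop in plain code. The sketch below is conceptual only: the helper functions (capture_audio, transcribe, generate_reply, synthesize, play_audio) are hypothetical placeholders standing in for the STT, LLM, and TTS stages, not VideoSDK APIs; the CascadingPipeline introduced next handles all of this wiring for you.

# Conceptual sketch of one conversational turn (placeholder functions, not VideoSDK APIs).
async def handle_turn(capture_audio, transcribe, generate_reply, synthesize, play_audio):
    audio_in = await capture_audio()              # microphone input
    user_text = await transcribe(audio_in)        # Speech-to-Text (STT)
    reply_text = await generate_reply(user_text)  # Large Language Model (LLM)
    audio_out = await synthesize(reply_text)      # Text-to-Speech (TTS)
    await play_audio(audio_out)                   # playback to the user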
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, responsible for managing interactions.
- CascadingPipeline: Manages the flow of audio processing, integrating STT, LLM, and TTS. Learn more about the Cascading pipeline in AI Voice Agents.
- VAD & TurnDetector: Tools that help the agent determine when to listen and when to speak. Explore the Turn detector for AI Voice Agents for more details.
Setting Up the Development Environment
Prerequisites
To get started, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.
Step 1: Create a Virtual Environment
Create a virtual environment to manage dependencies:
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
Step 2: Install Required Packages
Install the necessary packages using pip:
pip install videosdk
pip install python-dotenv
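Note that the script later imports videosdk.agents and several videosdk.plugins modules (Silero, turn detector, Deepgram, OpenAI, ElevenLabs). Depending on your SDK version, these may ship as separate plugin packages rather than with the base videosdk package; the package names below are assumptions, so confirm the exact names in the VideoSDK documentation:

# Hypothetical package names -- verify against the VideoSDK docs before installing
pip install videosdk-agents
pip install videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs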
Step 3: Configure API Keys in a .env file
Create a .env file in your project directory and add your API keys:
VIDEOSDK_API_KEY=your_api_key_here
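The Deepgram, OpenAI, and ElevenLabs plugins used in the pipeline each require their own credentials as well. The variable names below follow each provider's usual convention, but whether the VideoSDK plugins read exactly these names is an assumption; confirm them in each plugin's documentation:

# Assumed variable names -- check each plugin's docs
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here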
Building the AI Voice Agent: A Step-by-Step Guide
Here is the complete code for the AI Voice Agent:
import asyncio, os
from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load the API keys from the .env file created in Step 3
load_dotenv()

# Download the turn-detection model once, before any session starts
pre_download_model()

agent_instructions = "You are a knowledgeable AI Voice Agent specializing in natural language understanding using Rasa NLU. Your primary role is to assist users in understanding and implementing Rasa NLU for their projects. You can provide detailed explanations, answer questions about Rasa NLU features, and guide users through the setup and configuration process. However, you are not a substitute for professional technical support, and users should consult official Rasa documentation or support for complex issues. Always remind users to verify their implementations with official resources."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # STT -> LLM -> TTS pipeline, with VAD and turn detection deciding when to listen and speak
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session alive until the process is interrupted
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    # playground=True prints a playground link in the console for quick testing
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Step 4.1: Generating a VideoSDK Meeting ID
To generate a meeting ID, use the following curl command:
curl -X POST "https://api.videosdk.live/v1/rooms" \
-H "Authorization: Bearer your_api_key_here" \
-H "Content-Type: application/json" \
-d '{}'
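If you prefer to script this step, the same request can be made from Python with the requests library (pip install requests). The endpoint and headers mirror the curl example above, and the roomId field assumed in the response should be double-checked against the current VideoSDK API reference:

import os
import requests

def create_meeting(api_key: str) -> str:
    # Mirrors the curl example above; verify the endpoint against the VideoSDK API reference.
    response = requests.post(
        "https://api.videosdk.live/v1/rooms",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={},
    )
    response.raise_for_status()
    # "roomId" is assumed to be the field name in the response; confirm in the docs.
    return response.json()["roomId"]

if __name__ == "__main__":
    print(create_meeting(os.environ["VIDEOSDK_API_KEY"]))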
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class is a custom implementation of the Agent class. It defines the agent's behavior upon entering and exiting a session: the on_enter and on_exit methods are used to greet and bid farewell to users.
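Because on_enter and on_exit are ordinary async methods, they are easy to customize. The sketch below is a variation on MyVoiceAgent with a time-of-day greeting; it reuses only pieces already shown in the full script (the Agent base class and session.say), and the greeting text itself is arbitrary:

import datetime

from videosdk.agents import Agent

class GreetingVoiceAgent(Agent):
    """Variation on MyVoiceAgent with a time-aware greeting."""

    def __init__(self):
        super().__init__(instructions="You are a helpful voice agent for Rasa NLU questions.")

    async def on_enter(self):
        # Pick a greeting based on the local time of day.
        hour = datetime.datetime.now().hour
        greeting = "Good morning" if hour < 12 else "Good evening" if hour >= 18 else "Good afternoon"
        await self.session.say(f"{greeting}! Ask me anything about Rasa NLU.")

    async def on_exit(self):
        await self.session.say("Thanks for stopping by. Goodbye!")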
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is the heart of the agent's processing capabilities. It integrates various plugins:
- Deepgram STT Plugin: Converts speech to text.
- OpenAI LLM Plugin: Processes the text using a language model.
- ElevenLabsTTS: Converts the processed text back to speech.
- SileroVAD: Voice activity detection to determine when to listen.
- TurnDetector: Manages conversational turn-taking.
Step 4.4: Managing the Session and Startup Logic
The start_session function initializes the agent and its session. It sets up the conversation flow and the processing pipeline. The make_context function configures the room options for the VideoSDK session. The if __name__ == "__main__": block ensures that the agent starts running when the script is executed.
Running and Testing the Agent
Step 5.1: Running the Python Script
Run the script using:
python main.py
Step 5.2: Interacting with the Agent in the Playground
After running the script, find the AI Agent playground link in the console. Use this link to join the session and interact with your agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can extend the agent's functionality by integrating custom tools. This allows for more specialized interactions and processing capabilities.
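As a concrete illustration, a custom tool is typically a decorated method on the agent that the LLM can call during a conversation. The sketch below assumes a function_tool decorator exported by videosdk.agents; the decorator name, its import path, and the way tools are discovered are assumptions about the framework, so verify them in the VideoSDK agents documentation before relying on this pattern:

# Sketch only: "function_tool" and the tool-registration mechanism are assumptions
# about the videosdk.agents API; confirm the exact names in the framework docs.
from videosdk.agents import Agent, function_tool

class ToolVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="Answer Rasa NLU questions and use tools when helpful.")

    @function_tool
    async def lookup_intent_docs(self, intent_name: str) -> str:
        """Return a short pointer for a Rasa NLU intent (placeholder logic)."""
        return f"See the Rasa NLU documentation for training examples of the '{intent_name}' intent."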
Exploring Other Plugins
Explore other plugins for STT, LLM, and TTS to enhance your agent's performance and capabilities.
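Because CascadingPipeline takes its plugins as plain constructor arguments, trying a different provider or model is usually a one-line change. The pipeline below keeps the same shape as the main script but swaps in a different LLM model; the alternative model identifier is an illustrative assumption, so use whatever models your accounts can access:

from videosdk.agents import CascadingPipeline
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

# Same pipeline shape as the main script, with a smaller LLM swapped in.
# The model identifier below is an example, not a recommendation.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o-mini"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8),
)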
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the .env file.
Audio Input/Output Problems
Check your audio device settings and ensure they are properly configured for input and output.
Dependency and Version Conflicts
Ensure all dependencies are compatible with your Python version and each other.
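Two quick commands can surface most installation problems from inside your virtual environment: pip's built-in dependency checker and a listing of the installed videosdk packages:

pip check                   # reports packages with unsatisfied or conflicting requirements
pip list | grep videosdk    # shows installed videosdk packages and versions (use findstr on Windows)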
Conclusion
Summary of What You've Built
You have successfully built an AI Voice Agent using Rasa NLU and VideoSDK. This agent can understand and respond to user queries using advanced language models.
Next Steps and Further Learning
Explore additional features and plugins to enhance your agent. Consider integrating with other APIs and services to expand its capabilities.