Introduction to AI Voice Agents for Fallback Responses
AI Voice Agents are automated systems designed to interact with users through voice. They combine Speech-to-Text (STT), Large Language Models (LLMs), and Text-to-Speech (TTS) to process and respond to user queries. In the context of fallback responses, these agents provide alternative guidance when the primary system cannot address a user's query directly.
Why are fallback responses important for voice agents?
In industries where customer service and user interaction are crucial, fallback responses ensure that users receive guidance even when their queries fall outside the system's primary capabilities. This enhances user satisfaction and ensures seamless interaction.
Core Components of a Voice Agent
- STT (Speech-to-Text): Converts spoken language into text.
- LLM (Large Language Model): Processes the text to understand the query and generate a response.
- TTS (Text-to-Speech): Converts the generated text back into speech.
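To make the data flow concrete, here is a minimal, framework-free sketch of the STT → LLM → TTS cascade. The three stub functions are illustrative stand-ins, not VideoSDK APIs; the real plugins are wired up later in the tutorial:

```python
# Illustrative stand-ins for real STT, LLM, and TTS plugins.
def speech_to_text(audio: bytes) -> str:
    # A real STT plugin (e.g. Deepgram) would transcribe audio here;
    # for the sketch we pretend the audio bytes are already text.
    return audio.decode("utf-8")

def generate_reply(transcript: str) -> str:
    # A real LLM would generate a grounded response; here we always
    # produce a fallback-style answer to show the shape of the step.
    return f"I may not be able to answer '{transcript}' directly, but I can point you to help."

def text_to_speech(text: str) -> bytes:
    # A real TTS plugin (e.g. ElevenLabs) would synthesize audio here.
    return text.encode("utf-8")

def cascade(audio: bytes) -> bytes:
    """STT -> LLM -> TTS, the same ordering a cascading pipeline uses."""
    return text_to_speech(generate_reply(speech_to_text(audio)))

print(cascade(b"where is my order?").decode("utf-8"))
```

The real pipeline streams audio and runs asynchronously, but the composition order is the same.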
What You'll Build in This Tutorial
In this tutorial, you'll build a voice agent capable of providing fallback responses using the VideoSDK framework. This agent will identify when it cannot answer a query directly and offer alternative solutions or guidance. For a comprehensive setup, refer to the Voice Agent Quick Start Guide.
Architecture and Core Concepts
High-Level Architecture Overview
The architecture involves capturing user speech, converting it to text, processing it with an LLM, and then converting the response back to speech. This flow ensures that the agent can interact naturally with users.

Understanding Key Concepts in the VideoSDK Framework
- Agent: Represents the core functionality of the bot.
- CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS. Learn more about the Cascading Pipeline in AI Voice Agents.
- VAD & TurnDetector: Determine when the agent should listen and respond.
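For intuition only, here is a toy energy-based VAD: it flags audio frames whose RMS energy clears a threshold, which is conceptually what the threshold parameter on a VAD plugin controls. Note that SileroVAD itself is a neural model, not an energy detector; this sketch just illustrates the thresholding idea:

```python
import math

def rms(frame):
    # Root-mean-square energy of a frame of float samples in [-1, 1].
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame, threshold=0.35):
    # A frame counts as speech when its energy clears the threshold.
    return rms(frame) >= threshold

silence = [0.01, -0.02, 0.015, -0.01]   # low-energy background noise
speech = [0.6, -0.7, 0.65, -0.55]       # loud, speech-like samples
print(is_speech(silence), is_speech(speech))
```

Lowering the threshold makes the detector more sensitive (it triggers on quieter input) at the cost of more false positives from background noise.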
Setting Up the Development Environment
Prerequisites
To build this agent, ensure you have Python 3.11+ and a VideoSDK account. Sign up at app.videosdk.live.
Step 1: Create a Virtual Environment
Create a virtual environment to manage dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
Step 2: Install Required Packages
Install the necessary Python packages:
```bash
pip install videosdk-agents
```
The Deepgram, OpenAI, ElevenLabs, Silero, and turn-detector plugins imported in the code below are typically published as separate videosdk-plugins-* packages (for example, videosdk-plugins-deepgram); install each one you plan to use.
Step 3: Configure API Keys in a .env File
Create a .env file in your project directory and add your VideoSDK API key:
```
VIDEOSDK_API_KEY=your_api_key_here
```
The STT, LLM, and TTS providers also expect their own keys in the environment (typically DEEPGRAM_API_KEY, OPENAI_API_KEY, and ELEVENLABS_API_KEY).
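Tools like python-dotenv are the usual way to load this file, but a minimal standard-library loader looks like the sketch below (the parsing rules here are a simplification of the real .env format):

```python
import os

def load_env(path=".env"):
    # Minimal .env parser: one KEY=VALUE per line, '#' starts a comment.
    # setdefault mirrors the usual dotenv behavior of not overriding
    # variables that are already set in the environment.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Example: write a sample file, load it, and read the key back.
with open(".env", "w") as f:
    f.write("VIDEOSDK_API_KEY=your_api_key_here\n")
load_env()
print(os.environ["VIDEOSDK_API_KEY"])
```

In practice you would keep real keys out of version control and simply call the loader before the SDK reads the environment.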
Building the AI Voice Agent: A Step-by-Step Guide
Let's start by examining the complete code block for our AI Voice Agent:
```python
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Download the turn-detector model weights ahead of time
pre_download_model()

agent_instructions = """You are a 'fallback responses voice agent' designed to assist users when their queries cannot be directly answered by the primary capabilities of the system. Your persona is that of a friendly and understanding guide, always ready to help users navigate their issues or find alternative solutions. Your primary capabilities include:

1. Identifying when a user's query cannot be answered directly by the system's main functions.
2. Providing polite and helpful fallback responses that guide users towards alternative resources or suggest rephrasing their questions.
3. Offering general information about the system's capabilities and limitations to manage user expectations.

Constraints and limitations:
- You are not equipped to provide specific answers outside the predefined scope of the system's main functions.
- You must always include a disclaimer that the information provided is general and users should consult specific resources or professionals for detailed assistance.
- You should avoid making assumptions about user intent and instead encourage users to provide more context or rephrase their queries for better assistance."""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()  # Keep the session alive until interrupted
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Step 4.1: Generating a VideoSDK Meeting ID
To create a meeting ID, use the following curl command:
```bash
curl -X POST https://api.videosdk.live/v1/meetings -H "Authorization: Bearer YOUR_API_KEY"
```
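If you prefer Python, the same request can be built with the standard library. The endpoint and auth header below simply mirror the curl command above; they are not a separate SDK API:

```python
import urllib.request

def build_meeting_request(api_key: str) -> urllib.request.Request:
    # Mirrors: curl -X POST https://api.videosdk.live/v1/meetings
    return urllib.request.Request(
        "https://api.videosdk.live/v1/meetings",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )

req = build_meeting_request("YOUR_API_KEY")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would actually send it (requires a valid key).
```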
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class, providing custom behavior for entering and exiting sessions. It uses predefined instructions to guide user interactions. For more detailed instructions, refer to the Voice Agent Quick Start Guide.
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is crucial for managing the flow of audio processing. It integrates:
- Deepgram STT Plugin: Converts speech to text.
- OpenAI LLM Plugin: Processes text to generate responses.
- ElevenLabs TTS Plugin: Converts text back to speech.
- Silero VAD & TurnDetector: Manage when the agent listens and responds.
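The pipeline handles the audio plumbing; the fallback behaviour itself comes from the LLM instructions. For intuition, the decision those instructions describe can be sketched in plain Python. The topic list and wording here are purely illustrative, not part of the SDK:

```python
SUPPORTED_TOPICS = {"billing", "orders", "shipping"}  # illustrative scope

def fallback_reply(query: str) -> str:
    """Return a direct-answer marker for in-scope queries, else a fallback."""
    in_scope = any(topic in query.lower() for topic in SUPPORTED_TOPICS)
    if in_scope:
        return "ANSWER: handled by the system's main functions."
    # Out of scope: be polite, invite rephrasing, and add the disclaimer
    # required by the agent instructions.
    return ("I may not be able to answer that directly. Could you rephrase, "
            "or would you like pointers to other resources? Note: this is "
            "general guidance; please consult a specialist for details.")

print(fallback_reply("Where is my shipping update?"))
print(fallback_reply("What's the weather on Mars?"))
```

In the real agent this classification is done by the LLM itself, steered by agent_instructions, rather than by keyword matching.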
Step 4.4: Managing the Session and Startup Logic
The start_session function initializes the agent and manages the session lifecycle. The make_context function sets up the room options, and the main block starts the job. Learn more about AI Voice Agent Sessions.
Running and Testing the Agent
Step 5.1: Running the Python Script
Run your script using:
```bash
python main.py
```
Step 5.2: Interacting with the Agent in the Playground
After starting the script, find the AI Agent playground link printed in the console to interact with the agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can extend the agent's capabilities by integrating custom tools using the
function_tool concept.
Exploring Other Plugins
Consider exploring other STT, LLM, and TTS plugins to enhance your agent's performance.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly set in the
.env file.
Audio Input/Output Problems
Check your microphone and speaker settings if you encounter issues.
Dependency and Version Conflicts
Ensure all dependencies are compatible with Python 3.11+.
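A quick guard near the top of main.py can surface version problems early. This check uses only the standard library:

```python
import sys

def meets_minimum(version_info, minimum=(3, 11)):
    # Compare only the (major, minor) pair against the required floor.
    return tuple(version_info[:2]) >= minimum

print("Python version OK?", meets_minimum(sys.version_info))
```

If the check fails, recreate the virtual environment with a newer interpreter before reinstalling dependencies.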
Conclusion
Summary of What You've Built
You've built a robust AI Voice Agent capable of providing fallback responses using VideoSDK.
Next Steps and Further Learning
Explore additional plugins and custom tools to further enhance your agent's capabilities.