Introduction to AI Voice Agents and Maintaining Context in Conversation
AI Voice Agents are sophisticated systems designed to interact with users through voice commands. They are capable of understanding spoken language, processing the information, and responding appropriately. In industries like customer service, healthcare, and home automation, maintaining context in conversation is crucial for providing coherent and relevant responses.
What is an AI Voice Agent?
An AI Voice Agent is a software system that uses natural language processing (NLP) to understand and respond to human speech. These agents can perform tasks ranging from answering queries to controlling smart devices.

Why are they important for maintaining context in conversation?
In applications like healthcare, AI Voice Agents need to understand the context of a conversation to provide accurate information or assistance. For instance, a healthcare assistant should remember previous interactions to offer personalized advice or schedule appointments effectively.
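Context maintenance is usually implemented by carrying the conversation history into each model call. Here is a minimal, framework-agnostic sketch of that idea; the `ConversationMemory` class and its methods are illustrative, not part of the VideoSDK API:

```python
class ConversationMemory:
    """Illustrative rolling message buffer for multi-turn context."""

    def __init__(self, max_turns: int = 20):
        self.max_turns = max_turns
        self.messages: list[dict] = []

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        # Keep only the most recent turns so the prompt stays bounded.
        self.messages = self.messages[-self.max_turns:]

    def as_prompt(self) -> list[dict]:
        # The accumulated history is sent with every LLM request,
        # which is how the model "remembers" earlier turns.
        return list(self.messages)


memory = ConversationMemory(max_turns=4)
memory.add("user", "I have a headache.")
memory.add("assistant", "How long have you had it? Please also consult a doctor.")
memory.add("user", "Since yesterday. Can you book me an appointment?")
print(len(memory.as_prompt()))  # 3 - still within the 4-turn window
```

Production frameworks add summarization or token-based trimming on top of this, but the core mechanism is the same: resend relevant history on every turn.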
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Large Language Model (LLM): Processes the text to understand and generate responses.
- Text-to-Speech (TTS): Converts text responses back into spoken language.
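Conceptually, the cascade is just these three stages composed in order. A toy sketch with stub stages makes the data flow explicit (real systems stream audio incrementally; the lambdas below merely stand in for the Deepgram, OpenAI, and ElevenLabs plugins):

```python
from typing import Callable

def make_pipeline(stt: Callable[[bytes], str],
                  llm: Callable[[str], str],
                  tts: Callable[[str], bytes]) -> Callable[[bytes], bytes]:
    """Compose STT -> LLM -> TTS into one audio-in / audio-out function."""
    def run(audio_in: bytes) -> bytes:
        text = stt(audio_in)   # 1. transcribe user speech
        reply = llm(text)      # 2. generate a text response
        return tts(reply)      # 3. synthesize the reply as audio
    return run

# Stub stages standing in for real STT / LLM / TTS providers:
pipeline = make_pipeline(
    stt=lambda audio: "hello",
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
)
print(pipeline(b"\x00\x01"))  # b'You said: hello'
```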
For a comprehensive understanding of these elements, refer to the AI Voice Agent core components overview.

What You'll Build in This Tutorial
In this tutorial, you'll learn to build an AI Voice Agent using the VideoSDK framework. The agent will maintain context in conversations, answer health-related questions, and schedule appointments.

Architecture and Core Concepts
High-Level Architecture Overview
The architecture involves capturing user speech, converting it to text, processing it through an LLM, and then converting the response back to speech. This flow ensures the agent maintains context throughout the interaction.

Understanding Key Concepts in the VideoSDK Framework
- Agent: Represents the core logic of your voice assistant.
- CascadingPipeline: Manages the flow of data from STT to LLM to TTS.
- VAD & TurnDetector: Determine when the agent should listen or speak.
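To make the VAD idea concrete, here is a toy energy-threshold detector. Silero uses a neural model internally; this simple RMS version only illustrates the kind of decision a `threshold` parameter gates:

```python
import math

def is_speech(samples: list[float], threshold: float = 0.35) -> bool:
    """Return True if the frame's RMS energy exceeds the threshold."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold

silence = [0.01] * 160            # near-zero amplitude frame
speech = [0.5, -0.6, 0.7, -0.4]   # louder frame
print(is_speech(silence))  # False
print(is_speech(speech))   # True
```

Lowering the threshold makes the agent more sensitive (it interrupts sooner); raising it makes the agent wait for clearly audible speech.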
Setting Up the Development Environment
Prerequisites
- Python 3.11+
- VideoSDK Account (sign up at app.videosdk.live)
Step 1: Create a Virtual Environment
Create a virtual environment to manage dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Step 2: Install Required Packages
Install the necessary packages using pip:
```bash
pip install videosdk
pip install python-dotenv
```

Step 3: Configure API Keys in a .env File
Create a .env file in your project directory and add your API keys:

```
VIDEOSDK_API_KEY=your_api_key_here
```
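At startup you can load this file and fail fast if the key is missing. The dependency-free sketch below mimics what python-dotenv's `load_dotenv()` does; the `load_env` helper is illustrative, and in practice you would just call python-dotenv directly:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv: parse KEY=VALUE lines into os.environ."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: real environment variables win over .env entries.
        os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.getenv("VIDEOSDK_API_KEY")
if not api_key:
    print("VIDEOSDK_API_KEY is not set - check your .env file")
```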
Building the AI Voice Agent: A Step-by-Step Guide
Here is the complete code for the AI Voice Agent:
```python
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Download the turn-detection model once, before any session starts.
pre_download_model()

agent_instructions = """{
  "persona": "helpful healthcare assistant",
  "capabilities": [
    "maintain context in conversation to provide coherent and relevant responses",
    "answer questions about common symptoms",
    "schedule appointments with healthcare providers",
    "provide general health tips and advice"
  ],
  "constraints": [
    "you are not a medical professional and must include a disclaimer to consult a doctor",
    "do not provide any diagnosis or treatment plans",
    "ensure user privacy and data protection at all times",
    "limit conversations to general health topics and appointment scheduling"
  ]
}"""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()  # run until interrupted
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Step 4.1: Generating a VideoSDK Meeting ID
To interact with the agent, you'll need a meeting ID. Use the following curl command to generate one:

```bash
curl -X POST \
  https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
```

Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class, defining custom behavior for entering and exiting sessions. It uses predefined instructions to maintain conversation context.

```python
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

Step 4.3: Defining the Core Pipeline
The CascadingPipeline manages the flow of data through the system. It uses various plugins for STT, LLM, TTS, VAD, and turn detection.

```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```

Step 4.4: Managing the Session and Startup Logic
The start_session function initializes the agent session and handles connection and cleanup.

```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
```
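The try/finally shape used in start_session guarantees that cleanup runs even if the session is cancelled or fails. The runnable sketch below reproduces that lifecycle with stub objects; StubSession and StubContext are illustrative stand-ins, not the VideoSDK classes:

```python
import asyncio

events: list[str] = []  # records the lifecycle order

class StubSession:
    async def start(self): events.append("session.start")
    async def close(self): events.append("session.close")

class StubContext:
    async def connect(self): events.append("context.connect")
    async def shutdown(self): events.append("context.shutdown")

async def run_agent(stop: asyncio.Event) -> None:
    session, context = StubSession(), StubContext()
    try:
        await context.connect()
        await session.start()
        await stop.wait()          # stand-in for asyncio.Event().wait()
    finally:
        # Cleanup runs on normal exit, cancellation, or error.
        await session.close()
        await context.shutdown()

async def main() -> None:
    stop = asyncio.Event()
    task = asyncio.create_task(run_agent(stop))
    await asyncio.sleep(0)         # let run_agent reach the wait
    stop.set()                     # signal shutdown
    await task

asyncio.run(main())
print(events)
# ['context.connect', 'session.start', 'session.close', 'context.shutdown']
```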
The make_context function sets up the JobContext with room options for testing.

```python
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)
```
The main block starts the job:

```python
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Running and Testing the Agent
Step 5.1: Running the Python Script
Run the script using the command:
```bash
python main.py
```

Step 5.2: Interacting with the Agent in the Playground
After starting the script, look for a playground link in the console. Use this link to join the session and interact with the agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can enhance the agent by adding custom tools using the function_tool concept, allowing for more specialized tasks.

Exploring Other Plugins
Experiment with different STT, LLM, and TTS plugins to optimize performance and cost.
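Because each pipeline stage exposes a role-specific interface, swapping providers is a configuration change rather than a rewrite. The sketch below shows that idea with a name-to-backend registry; the stub lambdas and the `build_stt` helper are illustrative, not the real plugin APIs:

```python
from typing import Callable

# Stub transcribers standing in for interchangeable STT plugins.
STT_PLUGINS: dict[str, Callable[[bytes], str]] = {
    "deepgram": lambda audio: "transcript via deepgram (stub)",
    "whisper": lambda audio: "transcript via whisper (stub)",
}

def build_stt(name: str) -> Callable[[bytes], str]:
    """Look up an STT backend by name so providers swap via config."""
    try:
        return STT_PLUGINS[name]
    except KeyError:
        raise ValueError(f"unknown STT plugin: {name!r}") from None

stt = build_stt("whisper")
print(stt(b""))  # transcript via whisper (stub)
```

The same pattern applies to the LLM and TTS stages, which makes A/B testing providers for latency and cost straightforward.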
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correct and stored securely in the .env file.

Audio Input/Output Problems
Check your microphone and speaker settings if you encounter audio issues.
Dependency and Version Conflicts
Use a virtual environment to manage dependencies and avoid conflicts.
Conclusion
Summary of What You've Built
You've created a functional AI Voice Agent capable of maintaining context in conversations, useful in healthcare and other industries.
Next Steps and Further Learning
Explore more advanced features and plugins in the VideoSDK framework to expand your agent's capabilities.