Introduction to Conversational Memory for AI Voice Agents
AI Voice Agents are sophisticated software applications designed to interact with users through voice commands. They combine technologies such as Speech-to-Text (STT), Large Language Models (LLMs), and Text-to-Speech (TTS) to understand and respond to user inquiries. With conversational memory, these agents can recall past interactions, enabling more personalized and engaging user experiences.
What is an AI Voice Agent?
An AI Voice Agent acts as a virtual assistant, capable of processing natural language to perform tasks or provide information. These agents are increasingly used in industries ranging from customer service to healthcare, where they assist users by providing quick and accurate responses.
Why Is Conversational Memory Important for Voice Agents?
Conversational memory enhances the functionality of voice agents by enabling them to recall previous interactions. This feature is crucial in creating a seamless user experience, as it allows the agent to tailor responses based on past conversations, thus improving user satisfaction and engagement.
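To make this concrete, here is a minimal sketch of what in-session conversational memory can look like: each turn is appended to a history list, and the whole history is replayed to the LLM so it can reference earlier exchanges. This is an illustrative stand-in, not the VideoSDK API; the class and method names are hypothetical.

```python
# Illustrative in-session memory: keep a rolling list of turns and replay
# them to the LLM on every request. Names here are hypothetical.

class SessionMemory:
    def __init__(self, max_turns: int = 20):
        self.max_turns = max_turns      # cap history so the prompt stays small
        self.turns: list[dict] = []

    def add(self, role: str, text: str) -> None:
        self.turns.append({"role": role, "content": text})
        # Drop the oldest turns once the cap is exceeded
        self.turns = self.turns[-self.max_turns:]

    def as_messages(self, system_prompt: str) -> list[dict]:
        # Prepend the system prompt, then replay the conversation so far
        return [{"role": "system", "content": system_prompt}] + self.turns

memory = SessionMemory(max_turns=4)
memory.add("user", "My name is Ada.")
memory.add("assistant", "Nice to meet you, Ada!")
memory.add("user", "What's my name?")
messages = memory.as_messages("You are a helpful voice assistant.")
```

Because the earlier turns travel with every request, the model can answer "What's my name?" correctly; capping `max_turns` keeps latency and token cost bounded, which matters in real-time voice.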
Core Components of a Voice Agent
- STT (Speech-to-Text): Converts spoken language into text.
- LLM (Large Language Model): Processes the transcribed text and generates an appropriate response.
- TTS (Text-to-Speech): Converts the generated text back into spoken language.
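The three components above chain into a single turn-handling loop. The sketch below shows that flow with stub functions standing in for the real plugins; everything here is illustrative, not the VideoSDK API.

```python
# Illustrative cascaded STT -> LLM -> TTS loop. The three stubs stand in
# for real speech-to-text, language-model, and text-to-speech plugins.

def stt(audio: bytes) -> str:
    # Stand-in transcriber: pretend the audio payload already holds text
    return audio.decode()

def llm(text: str) -> str:
    # Stand-in model: echo a canned response
    return f"You said: {text}"

def tts(text: str) -> bytes:
    # Stand-in synthesizer: return the reply as raw bytes
    return text.encode()

def handle_turn(audio_in: bytes) -> bytes:
    transcript = stt(audio_in)   # 1. speech -> text
    reply = llm(transcript)      # 2. text -> response
    return tts(reply)            # 3. response -> speech

audio_out = handle_turn(b"hello")
```

In the real pipeline each stage is asynchronous and streams partial results, but the data flow is the same: audio in, text through the model, audio out.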
What You'll Build in This Tutorial
In this tutorial, you will learn how to build an AI Voice Agent with conversational memory capabilities using the VideoSDK framework. We will guide you through setting up the environment, building the agent, and testing it in a real-world scenario.
Architecture and Core Concepts
High-Level Architecture Overview
The architecture of an AI Voice Agent with conversational memory involves several key components working together to process user input and generate responses. The process begins with capturing user speech, which is then converted to text using STT. This text is processed by an LLM to generate a response, which is subsequently converted back to speech using TTS.

Understanding Key Concepts in the VideoSDK Framework
Agent
The Agent class represents the core of your voice bot. It handles the interaction flow and manages the conversation state.
Cascading Pipeline in AI Voice Agents
The CascadingPipeline orchestrates the flow of audio processing, connecting plugins like STT, LLM, and TTS to create a seamless interaction.
VAD & Turn Detector for AI Voice Agents
Voice Activity Detection (VAD) and turn detection are crucial for determining when the agent should listen or speak, ensuring smooth and natural conversations.
Setting Up the Development Environment
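To give an intuition for what a VAD threshold does, here is a toy energy-based check: a frame counts as speech when its RMS energy exceeds a cutoff. The real SileroVAD is a neural model whose `threshold` is a probability cutoff, not raw energy; this sketch only illustrates the thresholding idea.

```python
import math

# Toy voice-activity check (NOT Silero): classify an audio frame as speech
# when its root-mean-square energy exceeds a threshold.

def rms(frame: list[float]) -> float:
    # Root-mean-square energy of one frame of normalized samples
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame: list[float], threshold: float = 0.35) -> bool:
    return rms(frame) > threshold

loud = [0.8, -0.7, 0.9, -0.6]     # energetic samples -> detected as speech
quiet = [0.01, -0.02, 0.01, 0.0]  # near-silence -> not speech
```

Lowering the threshold makes the agent more sensitive (it interrupts less speech but triggers on noise); raising it does the opposite. The same trade-off applies when tuning `SileroVAD(threshold=...)` later in this tutorial.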
Prerequisites
Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account, which you can create at app.videosdk.live.
Step 1: Create a Virtual Environment
To manage dependencies, create a virtual environment:
```bash
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`
```

Step 2: Install Required Packages
Install the necessary packages using pip:

```bash
pip install videosdk-agents videosdk-plugins
```

Step 3: Configure API Keys in a .env File
Create a .env file in your project directory and add your VideoSDK API key:

```
VIDEOSDK_API_KEY=your_api_key_here
```
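If you want to see what the .env file contributes at startup, here is a small stdlib-only loader (in practice you would typically use the python-dotenv package, and the SDK may read the variable for you; this helper is just illustrative).

```python
# Minimal stdlib-only .env loader: returns KEY=value pairs from the file,
# skipping blank lines and comments. Illustrative; python-dotenv is the
# usual choice in real projects.

def load_env(path: str = ".env") -> dict[str, str]:
    env: dict[str, str] = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    env[key.strip()] = value.strip()
    except FileNotFoundError:
        pass  # no .env file is not an error; env vars may be set elsewhere
    return env
```

You would then place `load_env()` values into `os.environ` (or export them in your shell) before starting the agent, so the SDK can find `VIDEOSDK_API_KEY`.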
Building the AI Voice Agent: A Step-by-Step Guide
To build your AI Voice Agent, we will start with the complete code and then break it down into smaller sections for a detailed explanation.
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = """You are a conversational AI Voice Agent with a focus on 'conversational memory for voice agents'. Your persona is that of a friendly and knowledgeable virtual assistant. Your primary capabilities include:

1. Remembering past interactions with users to provide a more personalized experience.
2. Answering questions related to general knowledge and providing helpful information based on previous conversations.
3. Assisting users with setting reminders and managing simple tasks based on their past preferences and interactions.

Constraints and limitations:

1. You are not capable of providing professional advice in areas such as medical, legal, or financial matters. Always include a disclaimer advising users to consult with a qualified professional for such inquiries.
2. Your memory is limited to the current session and cannot store information beyond the session's duration.
3. Ensure user privacy by not storing sensitive personal information and adhering to data protection regulations.

Your goal is to enhance user interaction by utilizing conversational memory to make interactions more relevant and engaging."""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Step 4.1: Generating a VideoSDK Meeting ID
To interact with your agent, you need a meeting ID. You can generate one using the following curl command:

```bash
curl -X POST https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
```

Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class, providing a custom implementation for handling user interactions. The on_enter and on_exit methods define actions when the agent starts and stops interacting with the user.
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is a crucial component that defines the flow of data through the system. It connects the STT, LLM, and TTS plugins, allowing the agent to process user input and generate responses seamlessly.
Step 4.4: Managing the Session and Startup Logic
The start_session function initializes the agent session, creating a ConversationFlow and CascadingPipeline. The make_context function sets up the room options, and the main block starts the job, keeping the agent running.
Running and Testing the Agent
Step 5.1: Running the Python Script
Execute the script using Python:

```bash
python main.py
```

Step 5.2: Interacting with the Agent in the Playground
Once the script is running, a link to the VideoSDK playground will be displayed in the console. Use this link to join the session and interact with your agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can enhance your agent by integrating custom tools, allowing it to perform specific tasks beyond its default capabilities.
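The general pattern behind custom tools is a registry the model can dispatch into: each tool is registered under a name, and the agent calls it when the LLM requests it. The sketch below shows that pattern in plain Python; it is not the VideoSDK tool API (VideoSDK provides its own decorator-based mechanism, documented in its AI Agents docs), and `get_weather` is a hypothetical example.

```python
# Generic tool-registry pattern (illustrative, not the VideoSDK API):
# register functions by name, then dispatch calls the LLM requests.

TOOLS: dict = {}

def tool(fn):
    """Register fn under its own name so the agent can dispatch to it."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> str:
    # Hypothetical tool body; a real one would call a weather API
    return f"It is sunny in {city}."

def dispatch(name: str, **kwargs) -> str:
    # Look up the requested tool and invoke it with the model's arguments
    return TOOLS[name](**kwargs)
```

In a real agent, the LLM emits a tool name plus JSON arguments, the framework validates them against the function signature, and the return value is fed back into the conversation before the final spoken reply is generated.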
Exploring Other Plugins
Explore other STT, LLM, and TTS plugins available in the VideoSDK framework to customize your agent further.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API key is correctly configured in the .env file and that you have the necessary permissions.
Audio Input/Output Problems
Check your audio device settings and ensure they are correctly configured for input and output.
Dependency and Version Conflicts
Verify that all dependencies are installed with compatible versions to avoid conflicts.
Conclusion
Summary of What You've Built
In this tutorial, you built an AI Voice Agent with conversational memory capabilities using the VideoSDK framework. You learned how to set up the environment, create a custom agent, and test it in a real-world scenario.
Next Steps and Further Learning
Continue exploring the AI Voice Agent core components within the VideoSDK framework to add more advanced features to your agent. Consider integrating additional plugins or custom tools to expand its capabilities.