Mastering AI Voice Agents: Maintain Context

Step-by-step guide to building AI Voice Agents that maintain context using VideoSDK. Includes code examples and testing.

Introduction: AI Voice Agents and Maintaining Conversational Context

AI Voice Agents are sophisticated systems designed to interact with users through voice commands. They are capable of understanding spoken language, processing the information, and responding appropriately. In industries like customer service, healthcare, and home automation, maintaining context in conversation is crucial for providing coherent and relevant responses.

What is an AI Voice Agent?

An AI Voice Agent is a software system that uses natural language processing (NLP) to understand and respond to human speech. These agents can perform tasks ranging from answering queries to controlling smart devices.

Why are they important for maintaining context in conversation?

In applications like healthcare, AI Voice Agents need to understand the context of a conversation to provide accurate information or assistance. For instance, a healthcare assistant should remember previous interactions to offer personalized advice or schedule appointments effectively.
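To make this concrete, here is a minimal, framework-agnostic sketch (plain Python, no VideoSDK APIs) of how an agent can carry context forward: every turn is stored in a history list, and the full history is passed to the model on each call. The `fake_llm` function stands in for a real LLM; all names here are illustrative.

```python
def fake_llm(messages):
    # A real LLM would read the whole history; here we just report how much
    # context it received, plus the latest user message.
    last = messages[-1]["content"]
    return f"(seen {len(messages)} messages) You said: {last}"

class ContextualAgent:
    def __init__(self, system_prompt):
        # The system prompt is the first entry in the running history.
        self.history = [{"role": "system", "content": system_prompt}]

    def respond(self, user_text):
        # Append the new user turn, call the model with the FULL history,
        # then record the assistant's reply so the next turn sees it too.
        self.history.append({"role": "user", "content": user_text})
        reply = fake_llm(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

agent = ContextualAgent("You are a healthcare assistant.")
agent.respond("I have a headache.")
reply = agent.respond("Can I book an appointment about it?")
print(reply)
```

Because the second call receives the first exchange as context, the model can resolve "it" to the headache from the earlier turn; this growing-history pattern is what the VideoSDK pipeline manages for you.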

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Model (LLM): Processes the text to understand and generate responses.
  • Text-to-Speech (TTS): Converts text responses back into spoken language.
For a comprehensive understanding of these components, refer to the AI voice agent core components overview.
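The three components can be pictured as a simple cascade, each stage's output feeding the next. The sketch below uses stub functions in place of real STT/LLM/TTS models, purely to illustrate the data flow:

```python
# Illustrative cascade: audio bytes -> STT -> LLM -> TTS -> audio bytes.
# The stage functions are stubs; a real pipeline would call model APIs.

def stt(audio_bytes):
    # Pretend the audio decodes to this transcript.
    return "what are my clinic hours"

def llm(text):
    # Pretend the model generated a grounded answer.
    return f"Here is an answer to: {text}"

def tts(text):
    # Pretend we synthesized speech; return fake audio bytes.
    return text.encode("utf-8")

def run_pipeline(audio_bytes):
    transcript = stt(audio_bytes)
    response_text = llm(transcript)
    return tts(response_text)

out = run_pipeline(b"\x00\x01")
print(out.decode("utf-8"))
```

In the VideoSDK framework, the CascadingPipeline (introduced below) wires real plugins into exactly this shape.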

What You'll Build in This Tutorial

In this tutorial, you'll learn to build an AI Voice Agent using the VideoSDK framework. The agent will maintain context in conversations, answer health-related questions, and schedule appointments.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture involves capturing user speech, converting it to text, processing it through an LLM, and then converting the response back to speech. This flow ensures the agent maintains context throughout the interaction.
(Diagram: user speech → STT → LLM → TTS → agent speech)

Understanding Key Concepts in the VideoSDK Framework

  • Agent: Represents the core logic and persona of your voice assistant.
  • CascadingPipeline: Manages the flow of data from STT to LLM to TTS.
  • VAD & TurnDetector: Determine when the user has finished speaking and when the agent should listen or respond.
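As a rough illustration of what a VAD does, the toy function below classifies an audio frame as speech when its RMS energy exceeds a threshold. Real VADs such as Silero use a trained neural model rather than raw energy; this is only a sketch of the thresholding idea behind parameters like threshold=0.35.

```python
import math

def is_speech(frame, threshold=0.35):
    # Compute RMS energy of a frame of samples in [-1.0, 1.0] and compare
    # it to the threshold -- a crude stand-in for a learned VAD score.
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

silence = [0.01, -0.02, 0.015, -0.005]   # low-energy frame
speech = [0.6, -0.7, 0.65, -0.55]        # high-energy frame
print(is_speech(silence), is_speech(speech))
```

Lowering the threshold makes the agent more sensitive (it starts listening on quieter input) at the cost of more false triggers from background noise.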

Setting Up the Development Environment

Prerequisites

  • Python 3.11+
  • VideoSDK Account (sign up at app.videosdk.live)

Step 1: Create a Virtual Environment

Create a virtual environment to manage dependencies:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip:
pip install videosdk
pip install python-dotenv

Step 3: Configure API Keys in a .env File

Create a .env file in your project directory and add your VideoSDK key, along with keys for the STT, LLM, and TTS providers used in this tutorial:
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
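In the application, python-dotenv's load_dotenv() reads this file into the process environment at startup. The sketch below reimplements that behavior in a few lines of stdlib Python to show what happens under the hood (simplified: no quoting or variable interpolation, and a hypothetical EXAMPLE_API_KEY is used for the demo):

```python
import os
import tempfile

def load_env_file(path):
    # Minimal stand-in for python-dotenv's load_dotenv(): parse KEY=VALUE
    # lines (skipping blanks and comments) into os.environ. Like dotenv's
    # default, setdefault does not override variables already set.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo: write a throwaway .env file and load it.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# demo file\nEXAMPLE_API_KEY=demo_key_123\n")
    env_path = f.name

load_env_file(env_path)
print(os.environ["EXAMPLE_API_KEY"])
os.unlink(env_path)
```

Keeping secrets in .env (and out of version control) lets the plugin constructors pick up their credentials from the environment without hard-coding them.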

Building the AI Voice Agent: A Step-by-Step Guide

Here is the complete code for the AI Voice Agent:
import asyncio
from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load API keys from the .env file
load_dotenv()

# Download the turn-detector model once, before any session starts
pre_download_model()

agent_instructions = """
{
  "persona": "helpful healthcare assistant",
  "capabilities": [
    "maintain context in conversation to provide coherent and relevant responses",
    "answer questions about common symptoms",
    "schedule appointments with healthcare providers",
    "provide general health tips and advice"
  ],
  "constraints": [
    "you are not a medical professional and must include a disclaimer to consult a doctor",
    "do not provide any diagnosis or treatment plans",
    "ensure user privacy and data protection at all times",
    "limit conversations to general health topics and appointment scheduling"
  ]
}
"""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()  # Keep the session alive until interrupted
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To interact with the agent outside the playground, you'll need a room (meeting) ID. Use the following curl command, passing a VideoSDK auth token generated from your API key and secret:
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \
  -H "Content-Type: application/json"

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class, defining custom behavior for entering and exiting sessions. It uses predefined instructions to maintain conversation context.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline manages the flow of data through the system. It uses various plugins for STT, LLM, TTS, VAD, and turn detection.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent session and handles connection and cleanup.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
The make_context function sets up the JobContext with room options for testing.
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)
The main block starts the job:
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

Run the script using the command:
python main.py

Step 5.2: Interacting with the Agent in the Playground

After starting the script, look for a playground link in the console. Use this link to join the session and interact with the agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can enhance the agent by registering custom tools with the function_tool decorator, allowing the LLM to call specialized functions such as appointment lookups during a conversation.
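The self-contained sketch below imitates that pattern with a toy registry. The decorator and dispatcher here are illustrative stand-ins, not the VideoSDK implementation; in a real agent you would import function_tool from the SDK and the LLM would choose which tool to invoke.

```python
# Toy tool registry imitating a function_tool-style decorator: decorated
# functions are recorded by name so a dispatcher (standing in for the LLM)
# can call them with keyword arguments.
TOOLS = {}

def function_tool(fn):
    # Register the function under its own name and return it unchanged.
    TOOLS[fn.__name__] = fn
    return fn

@function_tool
def schedule_appointment(patient: str, day: str) -> str:
    # In a real agent this would call a calendar or booking API.
    return f"Booked {patient} for {day}."

def dispatch(tool_name, **kwargs):
    # A real LLM emits a tool name plus arguments; we route the call.
    return TOOLS[tool_name](**kwargs)

result = dispatch("schedule_appointment", patient="Alex", day="Tuesday")
print(result)
```

Exposing tools this way keeps the LLM's role limited to deciding *when* to call a function and with what arguments, while your code controls *what* the function actually does.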

Exploring Other Plugins

Experiment with different STT, LLM, and TTS plugins to optimize performance and cost.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correct and stored securely in the .env file.

Audio Input/Output Problems

Check your microphone and speaker settings if you encounter audio issues.

Dependency and Version Conflicts

Use a virtual environment to manage dependencies and avoid conflicts.

Conclusion

Summary of What You've Built

You've created a functional AI Voice Agent capable of maintaining context in conversations, useful in healthcare and other industries.

Next Steps and Further Learning

Explore more advanced features and plugins in the VideoSDK framework to expand your agent's capabilities.
