Master Context Switching in AI Voice Agents

Build AI Voice Agents with context switching using VideoSDK. Follow this detailed guide with complete code examples.

Introduction to AI Voice Agents and Context Switching

What is an AI Voice Agent?

AI Voice Agents are sophisticated systems designed to interact with users through voice commands. They leverage technologies such as Speech-to-Text (STT), Text-to-Speech (TTS), and Large Language Models (LLMs) to understand and respond to human speech naturally. These agents are becoming increasingly prevalent across industries, providing hands-free, efficient, and personalized user experiences.

Why Are They Important for Context Switching?

In the context switching domain, AI Voice Agents are crucial as they allow seamless transitions between different topics or tasks during a conversation. This capability is essential in environments where users may need to switch contexts rapidly, such as customer service, virtual assistants, and smart home devices. By maintaining context, these agents provide a more coherent and intuitive interaction experience.
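The idea of "maintaining context" can be made concrete with a small sketch. The class below is purely illustrative (it is not part of the VideoSDK API): each topic keeps its own history, and the agent can suspend one topic and later resume it with that history intact.

```python
from dataclasses import dataclass, field

@dataclass
class ContextStack:
    """Illustrative context switcher: per-topic history plus an active-topic stack."""
    _stack: list = field(default_factory=list)
    _history: dict = field(default_factory=dict)

    def switch_to(self, topic: str) -> None:
        # Suspend the current topic (if any) and make `topic` active.
        if topic in self._stack:
            self._stack.remove(topic)
        self._stack.append(topic)

    def add_turn(self, utterance: str) -> None:
        # Record an utterance under whichever topic is currently active.
        self._history.setdefault(self.current, []).append(utterance)

    @property
    def current(self) -> str:
        return self._stack[-1] if self._stack else "general"

agent_context = ContextStack()
agent_context.switch_to("travel")
agent_context.add_turn("Find flights to Tokyo")
agent_context.switch_to("reminders")
agent_context.add_turn("Remind me to pack")
agent_context.switch_to("travel")  # resume the earlier topic; its history survives
```

A production agent delegates this bookkeeping to the LLM's conversation history, but the stack model is a useful way to reason about what "seamless switching" requires.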

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Model (LLM): Processes the text to understand the user's intent and generate responses.
  • Text-to-Speech (TTS): Converts the generated text back into spoken language.

What You'll Build in This Tutorial

In this tutorial, you will learn how to build an AI Voice Agent capable of context switching using the VideoSDK framework. We will guide you through setting up the environment, creating the agent, and testing it in a real-world scenario.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI Voice Agent involves several components working together to process user input and generate responses. Here's a high-level overview:
  1. User Speech: The user speaks into the microphone.
  2. Voice Activity Detection (VAD): Detects when the user is speaking.
  3. Speech-to-Text (STT): Converts the speech to text.
  4. Large Language Model (LLM): Processes the text to understand the intent and generate a response.
  5. Text-to-Speech (TTS): Converts the response text into speech.
  6. Agent Response: The agent speaks back to the user.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class that represents your bot. It handles interactions and manages the conversation flow.
  • Cascading Pipeline: A sequence of processing steps (STT -> LLM -> TTS) that transforms user input into agent responses.
  • VAD & Turn Detector: These components help the agent determine when to listen and when to respond, ensuring smooth interactions.
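SileroVAD is a neural model, but the role its `threshold` parameter plays (used in the pipeline below) can be illustrated with a naive energy-based detector; this sketch is for intuition only and is not how Silero works internally.

```python
def frame_energy(samples: list[float]) -> float:
    # Mean squared amplitude of one audio frame.
    return sum(s * s for s in samples) / len(samples)

def is_speech(samples: list[float], threshold: float = 0.35) -> bool:
    """Naive energy-based VAD: flag the frame as speech when its energy
    exceeds the threshold. Raising the threshold makes the detector less
    sensitive to background noise but more likely to clip quiet speech."""
    return frame_energy(samples) > threshold
```

The same trade-off applies when you tune `SileroVAD(threshold=0.35)` later in the tutorial.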

Setting Up the Development Environment

Prerequisites

  • Python 3.11+: Ensure you have Python 3.11 or later installed.
  • VideoSDK Account: Sign up at app.videosdk.live to access the necessary API keys.

Step 1: Create a Virtual Environment

Create a virtual environment to manage your project dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Step 2: Install Required Packages

Install the agent SDK and dotenv support using pip (package names can vary by SDK release; the VideoSDK docs list the current agent and plugin packages):

```bash
pip install videosdk-agents
pip install python-dotenv
```

Step 3: Configure API Keys in a .env File

Create a .env file in your project directory and add your VideoSDK API key. The pipeline below also calls Deepgram, OpenAI, and ElevenLabs, and those plugins conventionally read their keys from the environment as well:

```
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```
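In the agent script you would normally call python-dotenv's `load_dotenv()` right after the imports to pull these values into the process environment. The helper below is an illustrative stand-in showing what that call does under the hood; it is not the library itself.

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv(): parse KEY=VALUE
    lines into os.environ. Existing environment variables take precedence."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Once loaded, any plugin can read its key with `os.getenv("VIDEOSDK_API_KEY")`.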

Building the AI Voice Agent: A Step-by-Step Guide

Complete Code Block

Here is the complete code for the AI Voice Agent:
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session starts quickly
pre_download_model()

agent_instructions = "You are a 'context-aware virtual assistant' designed to handle 'context switching in voice agent' scenarios efficiently. Your primary role is to assist users by maintaining context across multiple topics and seamlessly switching between them as needed. \n\n**Persona:** You are a friendly and knowledgeable virtual assistant capable of understanding and managing multiple conversational threads. \n\n**Capabilities:**\n1. Maintain context across different topics and switch seamlessly between them.\n2. Answer questions related to various domains such as technology, travel, and general knowledge.\n3. Provide reminders and manage simple tasks like setting alarms or timers.\n4. Offer suggestions based on previous interactions and user preferences.\n\n**Constraints and Limitations:**\n1. You are not an expert in any specific field and should always encourage users to verify information from reliable sources.\n2. You must not store any personal user data beyond the session for privacy reasons.\n3. You should clearly inform users when you are switching contexts and ensure that the transition is smooth and understandable.\n4. You cannot perform any actions that require personal data or sensitive information without explicit user consent."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the STT -> LLM -> TTS pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Step 4.1: Generating a VideoSDK Meeting ID

To interact with your agent, you'll need a meeting ID. You can generate one using the following curl command:
```bash
curl -X POST https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
```

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the base Agent class. It initializes with specific instructions that define the agent's capabilities and constraints. The on_enter and on_exit methods handle greetings and farewells, creating a friendly user interaction.
```python
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is where the magic happens. It defines the flow of audio processing, starting with Speech-to-Text (STT) and ending with Text-to-Speech (TTS). Each plugin plays a crucial role:
  • DeepgramSTT: Converts speech to text.
  • OpenAILLM: Processes text and generates responses.
  • ElevenLabsTTS: Converts responses back to speech.
  • SileroVAD & TurnDetector: Manage audio input detection and conversational turns.
```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```

Step 4.4: Managing the Session and Startup Logic

The start_session function manages the lifecycle of the agent session. It connects to the VideoSDK context, starts the session, and ensures resources are cleaned up after use. The make_context function sets up the room options, and the main block starts the agent job.
```python
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the STT -> LLM -> TTS pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Running and Testing the Agent

Step 5.1: Running the Python Script

To start your agent, run the script using:
```bash
python main.py
```

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, a playground link will be displayed in the console. Open this link in your browser to interact with your agent. You can speak to the agent and observe how it handles context switching in real-time.

Advanced Features and Customizations

Extending Functionality with Custom Tools

VideoSDK allows you to extend your agent's capabilities by integrating custom tools. These tools can perform specific tasks or provide additional data processing capabilities, enhancing the agent's functionality.
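A custom tool is typically just a well-typed function the LLM can invoke. The sketch below is a hypothetical reminder tool; the function name, parameters, and return shape are all illustrative, and how a tool is registered with the agent (e.g. a decorator) depends on your SDK version, so consult the VideoSDK function-calling docs for the exact mechanism.

```python
def set_reminder(task: str, minutes_from_now: int) -> dict:
    """Hypothetical tool: schedule a reminder for the user.

    Returns a structured result the LLM can relay back in conversation.
    """
    if minutes_from_now <= 0:
        return {"ok": False, "error": "time must be in the future"}
    # A real implementation would persist the reminder or call a scheduler.
    return {"ok": True, "task": task, "in_minutes": minutes_from_now}
```

Structured dict results work well here because the LLM can quote specific fields ("Okay, I'll remind you to pack bags in 30 minutes") rather than parsing free text.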

Exploring Other Plugins

While this tutorial uses specific plugins for STT, LLM, and TTS, VideoSDK supports various other options. Explore different plugins to find the best fit for your use case.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API key is correctly configured in the .env file. Check for typos or missing keys.

Audio Input/Output Problems

Verify your microphone and speaker settings. Ensure the correct devices are selected and functioning.

Dependency and Version Conflicts

Check the versions of your installed packages. Conflicts may arise if dependencies are outdated or incompatible.
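A quick way to audit what's installed is the standard-library `importlib.metadata` module; the snippet below reports each package's version or flags it as missing (adjust the package names to whatever your SDK release actually installs).

```python
from importlib import metadata

def package_version(name: str) -> str:
    """Report the installed version of a package, or flag it as missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

# Package names here are examples; list the ones your project depends on.
for pkg in ("videosdk", "python-dotenv"):
    print(f"{pkg}: {package_version(pkg)}")
```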

Conclusion

Summary of What You've Built

In this tutorial, you've built an AI Voice Agent capable of context switching using the VideoSDK framework. You've learned about the architecture, core components, and how to set up and test your agent.

Next Steps and Further Learning

Explore additional features and plugins offered by VideoSDK. Consider customizing your agent further to handle more complex scenarios or integrate with other services. For a deeper understanding of the core components of an AI Voice Agent, see the detailed documentation provided by VideoSDK.
