How does the AI Voice Agent process user input?

The agent uses a cascading pipeline that involves converting speech to text (STT), processing the text to understand intent (LLM), and converting the processed text back to speech (TTS).

What are the prerequisites for building an AI Voice Agent?

You need Python 3.11+, a VideoSDK account, and the necessary API keys configured in a `.env` file.

How can I test the AI Voice Agent?

Run the Python script to start the agent and use the playground link provided in the console to interact with the agent in your browser.

What should I do if I encounter API key errors?

Ensure your API keys are correctly configured in the `.env` file and check their permissions and validity.

Build AI Voice Agent for Appointments

Step-by-step guide to build an AI voice agent for appointment booking using VideoSDK and Python.

Introduction to AI Voice Agents for Appointment Booking

Automating appointment booking reduces scheduling overhead and lets users book at any hour. AI Voice Agents provide that automation through a voice interface that accepts natural language.

What is an AI Voice Agent?

An AI Voice Agent is a software application that uses artificial intelligence to interact with users through voice commands. These agents can understand spoken language, process it, and respond in a conversational manner, making them ideal for tasks like appointment booking, customer service, and more.

Why are they important for the Appointment Booking Industry?

AI Voice Agents are crucial in the appointment booking industry as they help automate the scheduling process, reduce human error, and provide 24/7 availability. They can handle multiple requests simultaneously, offer personalized scheduling options, and improve the overall user experience.

Core Components of a Voice Agent

Speech-to-Text (STT): Converts spoken language into text. For enhanced functionality, consider using the Deepgram STT Plugin for voice agent.
Large Language Model (LLM): Processes the text to understand the user's intent. The OpenAI LLM Plugin for voice agent can be integrated for advanced language processing.
Text-to-Speech (TTS): Converts the processed text back into spoken language. Enhance this component with the ElevenLabs TTS Plugin for voice agent.

What You'll Build in This Tutorial

In this tutorial, you'll learn how to build a voice agent capable of booking appointments using the VideoSDK AI Agents framework. We'll guide you through the process of setting up the environment, coding the agent, and testing it in a real-world scenario. For a comprehensive setup, refer to the Voice Agent Quick Start Guide.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI Voice Agent involves several components working together to process user input and generate a response. The typical flow is as follows:

User Speech: The user speaks into the system.
Speech-to-Text (STT): Converts the speech into text.
Language Processing (LLM): Analyzes the text to determine the user's intent.
Text-to-Speech (TTS): Converts the response text back into speech.
Agent Response: The system speaks back to the user.
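
The flow above can be sketched as three functions chained in order. This is a minimal illustration with stand-in stages, not the VideoSDK implementation: `fake_stt`, `fake_llm`, `fake_tts`, and `run_cascade` are hypothetical names, and the real stages call external speech and language services.

```python
from typing import Callable


def fake_stt(audio: bytes) -> str:
    """Stand-in for speech-to-text: treat the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    """Stand-in for the language model: one hard-coded intent rule."""
    if "book" in text.lower():
        return "Sure, which day works for you?"
    return "Sorry, I can only help with appointments."


def fake_tts(text: str) -> bytes:
    """Stand-in for text-to-speech: encode the reply instead of synthesizing audio."""
    return text.encode("utf-8")


def run_cascade(
    audio: bytes,
    stt: Callable[[bytes], str] = fake_stt,
    llm: Callable[[str], str] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> bytes:
    """Pass one user turn through STT, then LLM, then TTS."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


reply_audio = run_cascade(b"I want to book an appointment")
print(reply_audio.decode("utf-8"))  # prints: Sure, which day works for you?
```

Because each stage only depends on the output type of the previous one, any stage can be swapped for a different provider without touching the others.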

Understanding Key Concepts in the VideoSDK Framework

Agent: The core class representing your bot, responsible for managing interactions.
CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS. Learn more about this in the Cascading pipeline in AI voice Agents.
VAD & TurnDetector: Voice Activity Detection (VAD) and Turn Detection help the agent know when to listen and when to speak.
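
To make the role of VAD and turn detection concrete, here is a toy sketch that uses a plain energy threshold. It is not how the framework's VAD or turn detector work internally (those are model-based, and their thresholds are model scores rather than energies); the function names and the numbers below are assumptions chosen only to show the idea of "speech present" and "user has stopped talking".

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_flags(frames: Sequence[Sequence[float]], threshold: float = 0.1) -> List[bool]:
    """Mark each frame as speech (True) when its energy reaches the threshold."""
    return [rms(frame) >= threshold for frame in frames]


def turn_ended(flags: Sequence[bool], silence_frames: int = 3) -> bool:
    """True once speech has occurred and the last `silence_frames` frames are silent."""
    if True not in flags or len(flags) < silence_frames:
        return False
    return not any(flags[-silence_frames:])


quiet = [0.01, 0.01, 0.01, 0.01]
loud = [0.5, -0.5, 0.5, -0.5]
frames = [quiet, loud, loud, quiet, quiet, quiet]

flags = speech_flags(frames)
print(flags)                  # [False, True, True, False, False, False]
print(turn_ended(flags))      # True: speech followed by three silent frames
print(turn_ended(flags[:4]))  # False: only one silent frame so far
```

Raising the threshold makes the detector less sensitive to background noise at the cost of missing quiet speech; the same trade-off applies, in spirit, to the thresholds passed to the real components.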

Setting Up the Development Environment

Prerequisites

To get started, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.

Step 1: Create a Virtual Environment

Creating a virtual environment helps manage dependencies and avoid conflicts.

python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

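Install the agents framework along with the plugins imported later in this guide. The package names below follow VideoSDK's published naming (`videosdk-agents` plus one `videosdk-plugins-*` package per provider); if pip cannot find one of them, check the current VideoSDK documentation.

pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs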

Step 3: Configure API Keys in a `.env` File

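Create a `.env` file in your project directory to store your API keys, and keep it out of version control. Besides the VideoSDK key, the pipeline in this guide calls Deepgram, OpenAI, and ElevenLabs, so each of those providers needs its own key. The provider variable names below follow the usual convention and are an assumption; confirm them in each plugin's documentation.

VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here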
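
A missing or empty key is easier to diagnose before the agent starts than from a provider error at runtime. The sketch below is one way to check a `.env` file using only the standard library; `parse_env_text` and `missing_keys` are hypothetical helpers, the parser handles only simple `KEY=value` lines, and `OPENAI_API_KEY` stands in for whichever provider keys your plugins expect.

```python
from typing import Dict, Iterable, List


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required names that are absent or empty."""
    return [name for name in required if not env.get(name)]


# VIDEOSDK_API_KEY comes from this guide; OPENAI_API_KEY is an assumed example name.
REQUIRED = ["VIDEOSDK_API_KEY", "OPENAI_API_KEY"]

sample = "VIDEOSDK_API_KEY=your_api_key_here\n# OPENAI_API_KEY is not set\n"
env = parse_env_text(sample)
print(missing_keys(env, REQUIRED))  # ['OPENAI_API_KEY']
```

In a real script you would read the file with `open(".env").read()` instead of the `sample` string and stop with a clear message if the returned list is not empty.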

Building the AI Voice Agent: A Step-by-Step Guide

Let's begin by presenting the complete code for the AI Voice Agent. This code sets up the agent, defines its behavior, and manages the interaction pipeline.

import asyncio

from videosdk.agents import (
    Agent,
    AgentSession,
    CascadingPipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model

# Pre-download the Turn Detector model so the first session does not wait for it
pre_download_model()

# System prompt that defines the agent's role and limits
agent_instructions = (
    "You are a helpful appointment booking assistant designed to assist users in scheduling appointments efficiently. "
    "Your primary role is to facilitate the booking process by understanding user requests, providing available time slots, and confirming appointments. "
    "You can also answer basic questions related to the appointment process, such as cancellation policies or rescheduling options. "
    "However, you are not a medical professional and cannot provide medical advice or diagnose conditions. "
    "Always include a disclaimer advising users to consult a qualified professional for medical-related inquiries. "
    "Your interactions should be polite, concise, and focused on resolving the user's request as efficiently as possible. "
    "You must ensure user data privacy and adhere to all relevant data protection regulations."
)


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Greeting spoken when the agent enters the session
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Farewell spoken when the agent leaves the session
        await self.session.say("Goodbye!")


async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()


def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

The agent auto-creates a room when `room_id` is omitted from `RoomOptions`. If you want it to join a specific, pre-created room instead, generate a meeting ID with the following curl command and pass it as `room_id`:

curl -X POST \
  https://api.videosdk.live/v1/meetings \
  -H "Authorization: YOUR_VIDEOSDK_API_KEY" \
  -H "Content-Type: application/json"
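
If you prefer to create the meeting from Python, the curl command above can be mirrored with the standard library. This is a sketch under assumptions: it reuses the endpoint and headers exactly as given in this guide, `build_meeting_request` and `create_meeting` are hypothetical helper names, and the response is returned as decoded JSON without assuming which field holds the meeting ID.

```python
import json
import urllib.request

# Endpoint taken from the curl command in this guide
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"


def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={"Authorization": api_key, "Content-Type": "application/json"},
    )


def create_meeting(api_key: str, timeout: float = 10.0):
    """Send the request and return the decoded JSON body as-is."""
    request = build_meeting_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; create_meeting() does.
request = build_meeting_request("YOUR_VIDEOSDK_API_KEY")
print(request.get_method(), request.full_url)
```

Call `create_meeting(...)` with a real key and inspect the returned JSON to find the ID to pass as `room_id`.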

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class inherits from the Agent class. Its constructor passes the instructions to the base class, `on_enter` speaks a greeting when the agent enters the session, and `on_exit` speaks a farewell when it leaves.

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial for processing audio input and generating responses. It integrates various plugins for STT, LLM, TTS, VAD, and Turn Detection. For a detailed understanding, refer to the AI voice Agent core components overview.

pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the session and starts the agent. The make_context function sets up the environment with room options. For more details, explore AI voice Agent Sessions.

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
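
The `try`/`finally` around `asyncio.Event().wait()` is what keeps the agent alive and still guarantees cleanup on shutdown. The sketch below reproduces only that pattern with a stand-in session so it runs without the framework; `StubSession` and `run_until_cancelled` are hypothetical names, not VideoSDK classes.

```python
import asyncio
from typing import List


class StubSession:
    """Stand-in for a real session; it only records which calls were made."""

    def __init__(self) -> None:
        self.events: List[str] = []

    async def start(self) -> None:
        self.events.append("start")

    async def close(self) -> None:
        self.events.append("close")


async def run_until_cancelled(session: StubSession) -> None:
    try:
        await session.start()
        # This event is never set, so the coroutine waits here until cancelled
        await asyncio.Event().wait()
    finally:
        # Runs on cancellation as well as on errors
        await session.close()


async def demo() -> List[str]:
    session = StubSession()
    task = asyncio.create_task(run_until_cancelled(session))
    await asyncio.sleep(0)  # let the task run up to the wait
    task.cancel()           # simulates stopping the worker
    try:
        await task
    except asyncio.CancelledError:
        pass
    return session.events


print(asyncio.run(demo()))  # ['start', 'close']
```

The cleanup calls sit in `finally` rather than after the wait because code placed after a cancelled `await` never runs.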

Running and Testing the Agent

Step 5.1: Running the Python Script

Save the complete code as `main.py`, then run it from the activated virtual environment:

python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you'll see a playground link in the console. Open this link in your browser to interact with your agent. You can speak to the agent and see how it responds to your appointment booking requests.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend your agent's capabilities by integrating custom tools. These tools can perform specific tasks, such as fetching data from an external API, to enhance the agent's functionality.
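
For an appointment agent, a natural first tool is a lookup of free time slots. The sketch below shows only the plain Python logic such a tool might wrap. The function name, the opening hours, and the in-memory `booked` set are assumptions for illustration, and how a function is registered as a tool with the framework is not covered in this guide, so check the framework documentation for that step.

```python
from datetime import date, datetime, time, timedelta
from typing import List, Set


def available_slots(
    day: date,
    booked: Set[datetime],
    open_at: time = time(9, 0),
    close_at: time = time(17, 0),
    slot_minutes: int = 60,
) -> List[datetime]:
    """Return slot start times on `day` that fit within opening hours and are not booked."""
    slots: List[datetime] = []
    current = datetime.combine(day, open_at)
    end = datetime.combine(day, close_at)
    step = timedelta(minutes=slot_minutes)
    while current + step <= end:
        if current not in booked:
            slots.append(current)
        current += step
    return slots


day = date(2025, 1, 6)
booked = {datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 10, 0)}

free = available_slots(day, booked, close_at=time(12, 0))
print([slot.strftime("%H:%M") for slot in free])  # ['11:00']
```

In a real deployment the `booked` set would come from a calendar or database query rather than a literal.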

Exploring Other Plugins

The VideoSDK framework supports various plugins. You can explore other STT, LLM, and TTS options to tailor the agent's performance to your needs.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file. Double-check the permissions and validity of your keys.

Audio Input/Output Problems

Verify that your microphone and speakers are working correctly. Check your system's audio settings and permissions.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage package versions and avoid conflicts.
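
When a conflict is suspected, it helps to see exactly which versions are installed in the active environment. A minimal sketch using the standard library follows; `installed_versions` is a hypothetical helper, and the package names in the example call are placeholders to replace with the packages you installed in Step 2.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Placeholder names: swap in the packages your agent depends on
report = installed_versions(["pip", "surely-not-installed-package-xyz"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Running this inside and outside the virtual environment quickly shows whether the script is picking up the interpreter you expect.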

Conclusion

Summary of What You've Built

In this tutorial, you've built an AI Voice Agent that holds an appointment booking conversation. You've learned how to set up the environment, code the agent, and test it in the playground. Connecting it to a real calendar or booking system is the next step toward a production deployment.

Next Steps and Further Learning

Explore additional features and plugins offered by the VideoSDK framework. Consider integrating your agent with other systems to expand its capabilities and improve its usability.
