Introduction to AI Voice Agents in Public Services
AI Voice Agents are transforming the way public services interact with citizens by providing efficient, 24/7 assistance. These agents can handle inquiries related to healthcare, transportation, and utilities, making them invaluable in public service sectors.
What is an AI Voice Agent?
An AI Voice Agent is a software application that uses artificial intelligence to interact with users through voice commands. It processes spoken language, understands the intent, and responds appropriately, simulating a human-like conversation.
Why are they important for Public Services?
In the public services industry, AI Voice Agents can streamline operations by handling routine inquiries, guiding users through complex processes, and providing instant access to information. This reduces the workload on human staff and enhances citizen satisfaction.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Large Language Model (LLM): Understands and generates human-like text responses.
- Text-to-Speech (TTS): Converts text responses back into spoken language.
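These three components form a cascade; as a minimal sketch of the loop (with stand-in functions where the real STT, LLM, and TTS services would plug in):

```python
# Minimal sketch of the STT -> LLM -> TTS cascade. The three stage
# functions are stand-ins: in a real agent each would call an external
# service (e.g. Deepgram, OpenAI, ElevenLabs).

def stt(audio: bytes) -> str:
    """Stand-in speech-to-text: pretend the audio decodes to a fixed query."""
    return "what are the office hours"

def llm(text: str) -> str:
    """Stand-in language model: produce a canned response to the query."""
    return f"You asked: '{text}'. Our offices are open 9am to 5pm."

def tts(text: str) -> bytes:
    """Stand-in text-to-speech: encode the response as audio bytes."""
    return text.encode("utf-8")

def voice_agent_turn(audio_in: bytes) -> bytes:
    """One conversational turn: speech in, speech out."""
    return tts(llm(stt(audio_in)))

print(voice_agent_turn(b"\x00\x01").decode("utf-8"))
```

The VideoSDK framework manages this same cascade for you, with streaming and real-time audio handling, as described in the sections that follow.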
What You'll Build in This Tutorial
In this tutorial, you will build an AI Voice Agent using the VideoSDK framework, capable of assisting with public service inquiries. For a detailed walkthrough, refer to the Voice Agent Quick Start Guide.

Architecture and Core Concepts
High-Level Architecture Overview
AI Voice Agents operate by converting user speech into text, processing the text to understand the user’s intent, and then generating a spoken response. This involves several components working in harmony.
```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant STT
    participant LLM
    participant TTS
    User->>Agent: Speak
    Agent->>STT: Convert Speech to Text
    STT-->>Agent: Text
    Agent->>LLM: Process Text
    LLM-->>Agent: Response Text
    Agent->>TTS: Convert Text to Speech
    TTS-->>Agent: Audio
    Agent->>User: Speak
```

Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, handling interactions and managing the conversation flow.
- CascadingPipeline: Manages the flow of audio processing, integrating STT, LLM, and TTS. Learn more about the Cascading pipeline in AI voice Agents.
- VAD & TurnDetector: These components help the agent determine when to listen and when to speak, ensuring smooth interactions. For more details, explore the Turn detector for AI voice Agents.
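To build intuition for what a VAD does, here is a toy energy-based detector. Real VADs such as Silero use neural models, but the underlying idea is the same: classify short audio frames as speech or silence. The frame size and threshold below are illustrative assumptions, not Silero's actual parameters.

```python
# Toy energy-based VAD: classifies audio frames as speech or silence
# by RMS energy. Frame size and threshold are illustrative only.
import math

def is_speech(frame, threshold=0.05):
    """Classify one audio frame (samples in [-1, 1]) by RMS energy."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

def detect_segments(samples, frame_size=160, threshold=0.05):
    """Return per-frame speech/silence flags for a raw sample stream."""
    flags = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        flags.append(is_speech(samples[i:i + frame_size], threshold))
    return flags

if __name__ == "__main__":
    silence = [0.0] * 160
    tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
    print(detect_segments(silence + tone))  # [False, True]
```

A turn detector builds on top of this: rather than flagging individual frames, it decides when a run of silence after speech means the user has finished their turn.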
Setting Up the Development Environment
Prerequisites
Before starting, ensure you have Python 3.11+ installed and a VideoSDK account. Sign up at app.videosdk.live to get your API keys.
Step 1: Create a Virtual Environment
Create a virtual environment to manage your project dependencies:
```shell
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Step 2: Install Required Packages
Install the necessary Python packages using pip:
```shell
pip install videosdk
```

Step 3: Configure API Keys in a .env file
Create a .env file in your project directory and add your VideoSDK API key. Because the pipeline below uses the Deepgram, OpenAI, and ElevenLabs plugins, those providers' keys are typically needed as well (exact variable names may vary by plugin version):

```shell
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
```
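If you prefer not to export these variables by hand, a small loader can read the .env file at startup. The python-dotenv package is the usual choice; as a dependency-free sketch (assuming simple KEY=value lines with no quoting or interpolation):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: put KEY=value lines into os.environ.
    Assumes simple lines; skips comments, blanks, and malformed lines.
    Existing environment variables are not overwritten."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Call `load_env()` once at the top of your script, before any plugin reads its key from the environment.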
Building the AI Voice Agent: A Step-by-Step Guide
Complete Code Overview
Below is the complete code for building your AI Voice Agent. We will break down each part in the following sections.
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session starts quickly
pre_download_model()

agent_instructions = "You are an AI Voice Agent designed to assist with public services inquiries. Your persona is that of a knowledgeable and courteous public service representative. Your primary capabilities include providing information about various public services such as healthcare, transportation, and utilities, assisting users in navigating public service websites, and answering frequently asked questions related to public services. You can also guide users on how to access specific services and provide contact information for further assistance. However, you must adhere to the following constraints: you are not a legal or medical professional, so you must include a disclaimer advising users to consult with qualified professionals for legal or medical advice. Additionally, you should not store any personal data or make any transactions on behalf of users. Your responses should be concise, accurate, and respectful, ensuring a positive user experience."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Step 4.1: Generating a VideoSDK Meeting ID
To create a meeting ID, use the following curl command:

```shell
curl -X POST https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
```

This will return a meeting ID that you can use to connect your agent.
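If you would rather issue this request from Python, the standard library is enough. The sketch below mirrors the curl call above; it only builds the request object (you would call `urllib.request.urlopen(req)` to actually send it, and `YOUR_API_KEY` is a placeholder for your real token):

```python
import urllib.request

def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build the POST request mirroring the curl command above.
    Send it with urllib.request.urlopen(req) when ready (not done here)."""
    return urllib.request.Request(
        "https://api.videosdk.live/v1/meetings",
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Example: inspect what will be sent
req = build_meeting_request("YOUR_API_KEY")
print(req.method, req.full_url)
```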
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class is where you define your agent's behavior. It inherits from the Agent class and defines how the agent should greet users upon entering or exiting a session.

```python
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

Step 4.3: Defining the Core Pipeline
The CascadingPipeline is crucial for processing audio. It integrates plugins for STT, LLM, TTS, VAD, and turn detection. For more information on these plugins, check out the Deepgram STT Plugin for voice agent, OpenAI LLM Plugin for voice agent, and ElevenLabs TTS Plugin for voice agent.

```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```

Step 4.4: Managing the Session and Startup Logic
Session management involves connecting to the VideoSDK platform and starting the agent. You can explore more about AI voice Agent Sessions.

```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
```

Running and Testing the Agent
Step 5.1: Running the Python Script
To start the agent, run the script using Python:
```shell
python main.py
```

Step 5.2: Interacting with the Agent in the Playground

Once the agent is running, use the AI Agent playground link provided in the console to interact with your AI Voice Agent.

Advanced Features and Customizations
Extending Functionality with Custom Tools
You can enhance your agent by integrating custom tools using the function_tool feature, allowing for more tailored interactions.

Exploring Other Plugins
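As an illustration, a tool is essentially an async function the LLM can choose to call. The sketch below defines a hypothetical office-hours lookup as plain Python so the logic is clear; in a real agent you would define it on your agent class and decorate it with `@function_tool` from videosdk.agents (exact decorator usage may vary by SDK version, and the department data here is made up).

```python
# Hypothetical custom tool: look up office hours for a public service
# department. In a real agent this would be a method on MyVoiceAgent
# decorated with @function_tool; shown standalone here. Data is made up.
import asyncio

OFFICE_HOURS = {
    "utilities": "Mon-Fri, 8am-5pm",
    "transportation": "Mon-Sat, 7am-7pm",
}

async def get_office_hours(department: str) -> str:
    """Return office hours for a department, or a fallback message."""
    hours = OFFICE_HOURS.get(department.lower())
    if hours is None:
        return f"Sorry, I don't have hours on file for '{department}'."
    return f"The {department} office is open {hours}."

if __name__ == "__main__":
    print(asyncio.run(get_office_hours("utilities")))
```

Returning a clear fallback string (rather than raising) matters here: whatever the tool returns goes back to the LLM, which then phrases the answer for the user.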
Consider experimenting with different STT, LLM, and TTS plugins to optimize the agent's performance and capabilities.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly set in the .env file and that they have the necessary permissions.

Audio Input/Output Problems
Check your microphone and speaker settings, and ensure the correct audio devices are selected.
Dependency and Version Conflicts
Ensure all dependencies are installed with compatible versions as specified in the documentation.
Conclusion
Summary of What You've Built
Congratulations! You've built an AI Voice Agent capable of assisting with public service inquiries using the VideoSDK framework. For a comprehensive understanding of the components, refer to the AI voice Agent core components overview.

Next Steps and Further Learning
Explore additional features and plugins to expand your agent's capabilities and improve user interactions.