Build AI Voice Agents for Public Services

Create AI Voice Agents for public services with VideoSDK. Follow our detailed guide with complete code examples.

Introduction to AI Voice Agents in Public Services

AI Voice Agents are transforming the way public services interact with citizens by providing efficient, 24/7 assistance. These agents can handle inquiries related to healthcare, transportation, and utilities, making them invaluable in public service sectors.

What is an AI Voice Agent?

An AI Voice Agent is a software application that uses artificial intelligence to interact with users through voice commands. It processes spoken language, understands the intent, and responds appropriately, simulating a human-like conversation.

Why are they important for Public Services?

In the public services industry, AI Voice Agents can streamline operations by handling routine inquiries, guiding users through complex processes, and providing instant access to information. This reduces the workload on human staff and enhances citizen satisfaction.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Model (LLM): Understands and generates human-like text responses.
  • Text-to-Speech (TTS): Converts text responses back into spoken language.
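
To make the cascade concrete, here is a minimal sketch of how the three stages compose into one conversational turn. The provider functions are hypothetical stand-ins for illustration, not real SDK calls:

```python
# Hypothetical stand-ins for real STT/LLM/TTS providers, used only to
# illustrate how the three stages compose into a voice pipeline.
def speech_to_text(audio: bytes) -> str:
    # A real STT engine (e.g. Deepgram) would transcribe audio here.
    return audio.decode("utf-8")  # pretend the "audio" is already text

def generate_reply(prompt: str) -> str:
    # A real LLM (e.g. GPT-4o) would produce a grounded answer here.
    return f"You asked about: {prompt}"

def text_to_speech(text: str) -> bytes:
    # A real TTS engine (e.g. ElevenLabs) would synthesize audio here.
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS."""
    transcript = speech_to_text(audio_in)
    reply = generate_reply(transcript)
    return text_to_speech(reply)

print(handle_turn(b"bus schedules"))  # b'You asked about: bus schedules'
```

The CascadingPipeline you will build later plays exactly this role, with real providers plugged into each stage.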

What You'll Build in This Tutorial

In this tutorial, you will build an AI Voice Agent using the VideoSDK framework, capable of assisting with public service inquiries. For a detailed walkthrough, refer to the Voice Agent Quick Start Guide.

Architecture and Core Concepts

High-Level Architecture Overview

AI Voice Agents operate by converting user speech into text, processing the text to understand the user’s intent, and then generating a spoken response. This involves several components working in harmony.
```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant STT
    participant LLM
    participant TTS
    User->>Agent: Speak
    Agent->>STT: Convert Speech to Text
    STT-->>Agent: Text
    Agent->>LLM: Process Text
    LLM-->>Agent: Response Text
    Agent->>TTS: Convert Text to Speech
    TTS-->>Agent: Audio
    Agent->>User: Speak
```

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, handling interactions and managing the conversation flow.
  • CascadingPipeline: Manages the flow of audio processing, integrating STT, LLM, and TTS. Learn more about the Cascading pipeline in AI voice Agents.
  • VAD & TurnDetector: These components help the agent determine when to listen and when to speak, ensuring smooth interactions. For more details, explore the Turn detector for AI voice Agents.
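
To build intuition for what a VAD does, here is a toy energy-threshold detector. Production VADs such as Silero use a trained neural model rather than raw energy, so this is illustration only:

```python
import math

def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame: list[float], threshold: float = 0.35) -> bool:
    """Toy VAD: treat frames whose energy exceeds a threshold as speech."""
    return rms(frame) > threshold

silence = [0.01] * 160     # near-zero samples
speech = [0.6, -0.7] * 80  # high-energy samples

print(is_speech(silence))  # False
print(is_speech(speech))   # True
```

The TurnDetector builds on this kind of signal to decide not just "is someone speaking" but "has the user finished their turn", which is what keeps the agent from interrupting mid-sentence.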

Setting Up the Development Environment

Prerequisites

Before starting, ensure you have Python 3.11+ installed and a VideoSDK account. Sign up at app.videosdk.live to get your API keys.

Step 1: Create a Virtual Environment

Create a virtual environment to manage your project dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Step 2: Install Required Packages

Install the VideoSDK Agents SDK along with the plugin packages this tutorial uses (check the VideoSDK documentation for the exact package names for your SDK version):

```bash
pip install videosdk-agents
pip install videosdk-plugins-silero videosdk-plugins-turn-detector \
  videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
```

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory and add your VideoSDK API key, along with the keys for the STT, LLM, and TTS providers used in the pipeline:

```
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
```
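
At runtime these values must end up in the process environment. Libraries like python-dotenv handle this for you; a minimal stdlib-only loader looks like this (simplified, with no quoting or multi-line support):

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader: one KEY=VALUE pair per line, '#' comments ignored.
    (The python-dotenv package does this more robustly.)"""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# After loading, the SDK and plugins read the keys from the environment:
# load_dotenv()
# api_key = os.environ["VIDEOSDK_API_KEY"]
```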

Building the AI Voice Agent: A Step-by-Step Guide

Complete Code Overview

Below is the complete code for building your AI Voice Agent. We will break down each part in the following sections.
```python
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are an AI Voice Agent designed to assist with public services inquiries. Your persona is that of a knowledgeable and courteous public service representative. Your primary capabilities include providing information about various public services such as healthcare, transportation, and utilities, assisting users in navigating public service websites, and answering frequently asked questions related to public services. You can also guide users on how to access specific services and provide contact information for further assistance. However, you must adhere to the following constraints: you are not a legal or medical professional, so you must include a disclaimer advising users to consult with qualified professionals for legal or medical advice. Additionally, you should not store any personal data or make any transactions on behalf of users. Your responses should be concise, accurate, and respectful, ensuring a positive user experience."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Step 4.1: Generating a VideoSDK Meeting ID

To create a meeting ID, use the following curl command:
```bash
curl -X POST https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
```
This will return a meeting ID that you can use to connect your agent.
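
If you prefer to create the meeting from Python, the same request can be built with the standard library. The endpoint and headers mirror the curl command above; response parsing is left out, since the exact field names depend on the API version:

```python
import urllib.request

def build_create_meeting_request(api_key: str) -> urllib.request.Request:
    """Build (but don't send) the same POST request as the curl command above."""
    return urllib.request.Request(
        "https://api.videosdk.live/v1/meetings",
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To actually create a meeting (requires a valid key and network access):
# with urllib.request.urlopen(build_create_meeting_request("YOUR_API_KEY")) as resp:
#     print(resp.read())  # JSON body containing the meeting ID
```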

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where you define your agent's behavior. It inherits from the Agent class and defines what the agent says when users enter or leave a session.
```python
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial for processing audio. It integrates various plugins for STT, LLM, TTS, VAD, and turn detection. For more information on these plugins, check out the Deepgram STT Plugin for voice agent, OpenAI LLM Plugin for voice agent, and ElevenLabs TTS Plugin for voice agent.
```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```

Step 4.4: Managing the Session and Startup Logic

Session management involves connecting to the VideoSDK platform and starting the agent. You can explore more about AI voice Agent Sessions.
```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
```

Running and Testing the Agent

Step 5.1: Running the Python Script

To start the agent, run the script using Python:
```bash
python main.py
```

Step 5.2: Interacting with the Agent in the Playground

Once the agent is running, use the AI Agent playground link provided in the console to interact with your AI Voice Agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can enhance your agent by integrating custom tools using the function_tool feature, allowing for more tailored interactions.
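
The sketch below shows the general tool-calling pattern using a plain decorator-based registry. VideoSDK's actual function_tool decorator may differ in signature, and the office_hours lookup here is purely hypothetical:

```python
from typing import Callable

# Registry of functions the agent is allowed to call by name.
TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function so the agent can invoke it during a conversation."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def office_hours(department: str) -> str:
    # Hypothetical lookup table for illustration only.
    hours = {"utilities": "Mon-Fri 9am-5pm", "transport": "Mon-Sat 8am-6pm"}
    return hours.get(department, "Please check the official website.")

# When the LLM decides to call a tool, the agent dispatches by name:
print(TOOLS["office_hours"]("utilities"))  # Mon-Fri 9am-5pm
```

The key idea is that the LLM never executes code itself; it emits a tool name and arguments, and the agent runs the matching registered function and feeds the result back into the conversation.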

Exploring Other Plugins

Consider experimenting with different STT, LLM, and TTS plugins to optimize the agent's performance and capabilities.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly set in the .env file and that they have the necessary permissions.
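
A quick fail-fast check at startup can save debugging time. This sketch only verifies that the expected variables are set; it cannot validate that the keys themselves are correct:

```python
import os

REQUIRED_KEYS = ["VIDEOSDK_API_KEY"]  # add the provider keys your pipeline uses

def check_env(keys: list[str]) -> list[str]:
    """Return the names of required environment variables that are unset or empty."""
    return [k for k in keys if not os.environ.get(k)]

missing = check_env(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```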

Audio Input/Output Problems

Check your microphone and speaker settings, and ensure the correct audio devices are selected.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions as specified in the documentation.

Conclusion

Summary of What You've Built

Congratulations! You've built an AI Voice Agent capable of assisting with public service inquiries using the VideoSDK framework. For a comprehensive understanding of the components, refer to the AI voice Agent core components overview.

Next Steps and Further Learning

Explore additional features and plugins to expand your agent's capabilities and improve user interactions.
