Build AI Voice Agent for Health Support

Step-by-step guide to building an AI Voice Agent for health support using VideoSDK.

Introduction to AI Voice Agents in the Health Support Industry

In the rapidly evolving world of technology, AI Voice Agents have emerged as powerful tools, revolutionizing how we interact with machines. These agents are designed to understand human speech, process the information, and respond in a natural, conversational manner. In this tutorial, we will focus on building an AI Voice Agent tailored for the health support industry, a sector where timely and accurate information is crucial.

What is an AI Voice Agent?

An AI Voice Agent is a software program that uses artificial intelligence to interact with users through voice commands. It listens to user inputs, processes the information using natural language understanding, and responds with synthesized speech. These agents are increasingly used in various industries to automate customer service, provide information, and perform tasks.

Why are they important for the health support industry?

In the health support industry, AI Voice Agents can play a vital role in providing immediate assistance to patients, answering common health-related queries, and helping schedule appointments with healthcare providers. They enhance the accessibility of healthcare services, ensuring that users receive timely support and information.

Core Components of a Voice Agent

To build a functional AI Voice Agent, several core components are essential:
  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Model (LLM): Processes the text to understand and generate responses.
  • Text-to-Speech (TTS): Converts the generated text back into speech.
For a detailed understanding of these components, refer to the AI voice Agent core components overview.
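
To make the flow concrete, here is a minimal sketch of one conversational turn. The transcribe, generate, and synthesize method names are hypothetical placeholders for illustration, not the VideoSDK API:

async def handle_turn(audio_chunk, stt, llm, tts):
    text = await stt.transcribe(audio_chunk)   # STT: audio in, text out
    reply = await llm.generate(text)           # LLM: text in, response text out
    return await tts.synthesize(reply)         # TTS: response text in, audio out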

What You'll Build in This Tutorial

In this guide, we will walk you through creating an AI Voice Agent using the VideoSDK framework, specifically designed for the health support industry. By the end of this tutorial, you will have a working agent capable of interacting with users and providing health-related assistance.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI Voice Agent involves several stages, from capturing user speech to delivering a response. The process begins with capturing the user's voice input, which is converted into text by the STT component. The text is processed by an LLM to generate a response, which is finally converted back into speech by the TTS component. For more details on the pipeline, explore the Cascading pipeline in AI voice Agents.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot. It handles user interactions and manages the conversation flow.
  • CascadingPipeline: This component manages the flow of audio processing, connecting STT, LLM, and TTS components to create a seamless interaction.
  • VAD & TurnDetector: These components help the agent determine when to listen and when to respond, ensuring smooth and natural conversations. Learn more about the Turn detector for AI voice Agents. A configuration example follows this list.
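
The snippet below shows how these two components are configured in this tutorial's pipeline. The thresholds match the values used later; the comments describe the usual trade-off, and the exact semantics may vary by plugin version:

from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector

vad = SileroVAD(threshold=0.35)              # lower threshold: quieter audio counts as speech sooner
turn_detector = TurnDetector(threshold=0.8)  # higher threshold: more confidence required before the agent replies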

Setting Up the Development Environment

Prerequisites

Before diving into the code, ensure you have the following:
  • Python 3.11+: Python 3.11 or later is required for compatibility with the VideoSDK framework.
  • VideoSDK Account: Sign up at app.videosdk.live to access API keys and other resources.

Step 1: Create a Virtual Environment

To manage dependencies, create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip. The agent framework and the plugins used in this tutorial are installed via extras on the videosdk-agents package (verify the extras names against the current VideoSDK documentation if installation fails):
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
pip install python-dotenv

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory to store your credentials securely. Besides the VideoSDK auth token, the Deepgram, OpenAI, and ElevenLabs plugins each need their own key; the variable names below are the conventional ones, but confirm them against each plugin's documentation:
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
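
Load the file at the top of your script with python-dotenv so the SDK and plugins can read these values from the environment:

from dotenv import load_dotenv

load_dotenv()  # reads .env and exports its entries into the process environment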

Building the AI Voice Agent: A Step-by-Step Guide

To build our AI Voice Agent, we will start by presenting the complete code, then break it down step-by-step.
import asyncio
from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load API keys from the .env file before any plugin is constructed
load_dotenv()

# Pre-download the Turn Detector model so the first session doesn't stall
pre_download_model()

agent_instructions = (
    "You are a helpful healthcare assistant AI Voice Agent designed to support users in the "
    "health support industry. Your primary capabilities include answering questions about common "
    "symptoms, providing general health tips, and assisting users in scheduling appointments with "
    "healthcare professionals. You can also offer information about healthcare services and direct "
    "users to appropriate resources. However, you are not a medical professional, and you must "
    "always include a disclaimer advising users to consult a qualified healthcare provider for "
    "medical advice, diagnosis, or treatment. You should maintain a friendly and professional tone, "
    "ensuring user privacy and data security at all times. You must not provide any medical "
    "diagnosis or treatment recommendations. Your responses should be concise, informative, and "
    "supportive, aiming to enhance the user's healthcare experience."
)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create the agent and its conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Wire STT, LLM, and TTS together with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

The code above auto-creates a room when room_id is omitted, so this step is optional. If you want the agent to join a pre-created room, generate a meeting ID using the VideoSDK rooms API. Note that the Authorization header expects a generated auth token (JWT), not your raw API key:
curl -X POST "https://api.videosdk.live/v2/rooms" \
  -H "Authorization: YOUR_VIDEOSDK_AUTH_TOKEN" \
  -H "Content-Type: application/json"
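
If you prefer to do this from Python, the same call looks roughly like the sketch below. We assume the response carries the new room ID in a roomId field; check the rooms API reference if the field name differs:

import requests

resp = requests.post(
    "https://api.videosdk.live/v2/rooms",
    headers={
        "Authorization": "YOUR_VIDEOSDK_AUTH_TOKEN",
        "Content-Type": "application/json",
    },
)
resp.raise_for_status()
print(resp.json()["roomId"])  # pass this value as room_id in RoomOptions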

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is a custom implementation of the Agent class. It defines the behavior of your voice agent, including how it greets users and ends sessions.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
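
Since this agent serves health-support users, you may want the greeting itself to surface the disclaimer required by the system instructions. Here is a small illustrative variant; the subclass name and wording are our own, not part of the framework:

class HealthSupportAgent(MyVoiceAgent):
    async def on_enter(self):
        # Greet and state the non-medical-advice disclaimer up front
        await self.session.say(
            "Hi! I can answer general health questions and help schedule appointments. "
            "I'm not a medical professional, so please consult a qualified provider for medical advice."
        )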

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is central to processing audio data. It connects the STT, LLM, and TTS plugins to create a seamless interaction. For more information on specific plugins, check out the Deepgram STT Plugin for voice agent, the OpenAI LLM Plugin for voice agent, and the ElevenLabs TTS Plugin for voice agent.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function manages the agent's lifecycle, while make_context sets up the environment in which the agent operates. For more on managing sessions, see AI voice Agent Sessions.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

To start your AI Voice Agent, run the Python script:
python main.py

Step 5.2: Interacting with the Agent in the Playground

After running the script, you will receive a playground link in the console. Use this link to join the session and interact with your agent. The agent will greet you and respond to your queries as programmed. For a quick setup, refer to the Voice Agent Quick Start Guide.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows you to extend your agent's functionality with custom tools. These tools can be integrated into the pipeline to add new capabilities or improve existing ones. Consider implementing the AI voice Agent Wake-Up Call Feature to enhance user interaction.

Exploring Other Plugins

While this tutorial uses specific plugins for STT, LLM, and TTS, the VideoSDK framework supports various other options. Explore different plugins to find the best fit for your needs.
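
As an example of how swappable the stages are, here is what changing two of them might look like. The alternative model identifiers are assumptions for illustration; check each plugin's documentation for the values it actually accepts:

pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en-US"),  # e.g., a locale-specific language setting
    llm=OpenAILLM(model="gpt-4o-mini"),                 # e.g., a smaller, cheaper model for simple Q&A
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)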

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly set in the .env file and that your VideoSDK auth token has not expired. Double-check your VideoSDK account settings if you encounter authentication issues.
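
A quick way to rule out missing keys is to check the environment directly; adjust the variable names to match your .env:

import os
from dotenv import load_dotenv

load_dotenv()
for key in ("VIDEOSDK_AUTH_TOKEN", "DEEPGRAM_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")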

Audio Input/Output Problems

Verify your microphone and speaker settings if the agent cannot hear or respond. Check the system settings and permissions.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage package versions effectively.

Conclusion

Summary of What You've Built

In this tutorial, you've built a fully functional AI Voice Agent for the health support industry using the VideoSDK framework. The agent can interact with users, provide health-related information, and assist with scheduling appointments.

Next Steps and Further Learning

To further enhance your agent, consider exploring additional plugins and custom tools. Stay updated with the latest developments in AI and voice technology to continue improving your agent's capabilities.
