Build a Healthcare AI Voice Agent with VideoSDK

Learn to build a HIPAA-compliant healthcare AI voice agent using VideoSDK. Full code, architecture, and testing included.

1. Introduction to AI Voice Agents in Healthcare

What is an AI Voice Agent?

AI voice agents are intelligent software systems capable of understanding and responding to human speech in real time. Powered by advances in speech recognition, natural language processing, and speech synthesis, these agents can interpret spoken language, process it through AI models, and reply in natural-sounding voices.
In practice, a user speaks into a microphone, the agent transcribes the speech into text, analyzes the meaning, generates a response, and then converts that response back into speech. This seamless interaction enables hands-free, conversational experiences.

Why are they important for the healthcare industry?

AI voice agents are transforming healthcare by streamlining communication and reducing administrative burdens. Key use cases include:
  • Appointment scheduling: Patients can book, reschedule, or cancel appointments via voice.
  • Patient triage: The agent can gather symptom information and direct patients to appropriate care.
  • Medication reminders: Automated reminders help patients adhere to treatment plans.
  • Telehealth support: Agents can answer general questions and guide patients through telemedicine workflows.
These capabilities improve accessibility, efficiency, and patient satisfaction, while freeing up staff for higher-value tasks.

Core Components of a Voice Agent

Every AI voice agent relies on three core technologies:
  • STT (Speech-to-Text): Converts spoken audio into text.
  • LLM (Large Language Model): Analyzes text and generates intelligent responses.
  • TTS (Text-to-Speech): Converts text responses back into natural-sounding audio.
For a detailed breakdown of these essential building blocks, see the AI voice Agent core components overview.
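As a toy sketch of that cascade (plain Python stubs, not the real VideoSDK plugins), each stage hands its output to the next:

```python
# Toy illustration of the STT -> LLM -> TTS cascade. The function names
# mirror the three components; real plugins stream audio and text instead.

def stt(audio: bytes) -> str:
    """Speech-to-Text stub: a real plugin would transcribe the audio."""
    return "I need to reschedule my appointment"

def llm(text: str) -> str:
    """LLM stub: a real model would generate a contextual reply."""
    return f"Sure, I can help you with: {text}"

def tts(text: str) -> bytes:
    """Text-to-Speech stub: a real plugin would synthesize audio."""
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    transcript = stt(audio)   # 1. audio -> text
    reply = llm(transcript)   # 2. text -> response text
    return tts(reply)         # 3. response text -> audio

print(handle_turn(b"<mic audio>"))
```

The framework's pipeline does exactly this hand-off, but asynchronously and on live audio streams.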

What You'll Build in This Tutorial

In this guide, you'll build a healthcare-focused AI voice agent using the VideoSDK AI Agents framework. Your agent will:
  • Answer general healthcare questions
  • Assist with appointment and medication reminders
  • Provide empathetic, privacy-conscious support
  • Adhere to healthcare privacy guidelines (HIPAA)
By the end, you'll have a fully functional, testable agent ready for further customization.

2. Architecture and Core Concepts

High-Level Architecture Overview

Let's visualize how your healthcare AI voice agent processes a conversation:
sequenceDiagram
  participant User
  participant Microphone
  participant Agent
  participant STT
  participant LLM
  participant TTS
  participant Speaker
  User->>Microphone: Speaks
  Microphone->>Agent: Audio Stream
  Agent->>STT: Audio for Transcription
  STT->>Agent: Text
  Agent->>LLM: Text for Response
  LLM->>Agent: Generated Reply
  Agent->>TTS: Text for Synthesis
  TTS->>Agent: Audio
  Agent->>Speaker: Plays Response
  Speaker->>User: Hears Reply
This sequence shows the data flow from user speech to the agent's spoken reply. Each component plays a crucial role in the pipeline. To get started quickly, you can follow the Voice Agent Quick Start Guide for step-by-step instructions.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your voice bot's persona and behavior. You customize its instructions and responses.
  • CascadingPipeline: Orchestrates the flow of audio and text through STT, LLM, and TTS plugins. For more on this, check out the Cascading pipeline in AI voice Agents.
  • VAD (Voice Activity Detection) & TurnDetector: These plugins help the agent determine when the user is speaking and when it's their turn to respond, enabling smooth, natural conversations.
With these building blocks, you can create robust, real-time conversational agents tailored for healthcare.
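As a rough intuition for what a VAD threshold does, here is an energy-based toy check (Silero actually uses a small neural model, so this is only an analogy):

```python
def is_speech(frame: list[float], threshold: float = 0.35) -> bool:
    """Toy VAD: flag a frame as speech when its mean absolute amplitude
    exceeds the threshold. Real VADs (like Silero) score frames with a
    neural model, but the thresholding idea is the same."""
    energy = sum(abs(s) for s in frame) / len(frame)
    return energy > threshold

print(is_speech([0.6, 0.7, 0.5]))    # loud frame -> speech
print(is_speech([0.01, 0.02, 0.0]))  # near-silence -> not speech
```

Lowering the threshold makes the agent more sensitive to quiet speakers, at the cost of more false triggers from background noise.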

3. Setting Up the Development Environment

Prerequisites

Before you begin, ensure you have:
  • Python 3.11+
  • VideoSDK account (sign up at app.videosdk.live)
  • API keys for Deepgram (STT), OpenAI (LLM), and ElevenLabs (TTS)
If you want to experiment and test your agent in a browser, the AI Agent playground provides an interactive environment for rapid prototyping.

Step 1: Create a Virtual Environment

It's best to isolate your dependencies. Run:
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

Step 2: Install Required Packages

Install the VideoSDK AI Agents framework and plugins:
pip install videosdk-agents videosdk-plugin-deepgram videosdk-plugin-openai videosdk-plugin-elevenlabs videosdk-plugin-silero videosdk-plugin-turn-detector

Step 3: Configure API Keys in a .env File

Create a .env file in your project root with the following content (replace placeholders with your actual keys):
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
VIDEOSDK_API_KEY=your_videosdk_api_key
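Before launching, it helps to confirm the keys are actually visible to Python. A minimal sketch using only the standard library (in practice the python-dotenv package does this job more robustly):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader (KEY=value, one per line); a stand-in for
    python-dotenv's load_dotenv()."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# The four keys the agent's plugins expect to find
REQUIRED = ["DEEPGRAM_API_KEY", "OPENAI_API_KEY",
            "ELEVENLABS_API_KEY", "VIDEOSDK_API_KEY"]

load_env()
missing = [k for k in REQUIRED if not os.getenv(k)]
print("Missing keys:", ", ".join(missing) if missing else "none")
```

A quick check like this surfaces a typo in the .env file before the agent fails with a less obvious authentication error at runtime.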
Now you're ready to build your agent!

4. Building the AI Voice Agent: A Step-by-Step Guide

Complete, Runnable Code Example

Below is the full Python script for your healthcare AI voice agent. We'll break down each section in detail afterward.
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first conversation isn't delayed
pre_download_model()

agent_instructions = (
    "You are a helpful healthcare AI voice agent designed to assist users with "
    "general healthcare inquiries and administrative tasks. Your persona is "
    "empathetic, patient, and professional, aiming to provide clear and "
    "supportive guidance. Your capabilities include: answering general "
    "questions about common symptoms, providing information on healthcare "
    "services, assisting with appointment scheduling, offering reminders for "
    "medications or check-ups, and helping users navigate healthcare "
    "resources. Constraints: You are not a licensed medical professional and "
    "cannot diagnose, prescribe, or provide medical advice. Always include a "
    "disclaimer that users should consult a qualified healthcare provider for "
    "medical concerns. Do not collect or store any sensitive personal health "
    "information. Maintain user privacy and adhere to HIPAA guidelines where "
    "applicable. If a user requests urgent medical help, instruct them to "
    "contact emergency services immediately."
)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
Let's break down how each part works.

Step 4.1: Generating a VideoSDK Meeting ID

Before running your agent, you'll need a meeting room for users to join. You can create one via the VideoSDK API:
curl -X POST -H "Authorization: YOUR_VIDEOSDK_API_KEY" -H "Content-Type: application/json" -d '{}' https://api.videosdk.live/v2/rooms
The response will include a roomId. You can use this in your agent by setting room_id in RoomOptions. For playground testing, you can omit room_id and let the agent auto-create a room.
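The same request can be made from Python with only the standard library. A sketch mirroring the curl call above (`create_room` is a hypothetical helper name; the endpoint, headers, and `roomId` response field follow the curl example):

```python
import json
import urllib.request

def create_room(api_key: str) -> str:
    """POST an empty JSON body to the VideoSDK rooms endpoint and
    return the new room's ID from the response."""
    req = urllib.request.Request(
        "https://api.videosdk.live/v2/rooms",
        data=b"{}",
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["roomId"]

# Example usage (requires a valid key and network access):
# room_id = create_room("YOUR_VIDEOSDK_API_KEY")
# print("Pass this as room_id in RoomOptions:", room_id)
```

You could call this once at startup and feed the result into RoomOptions instead of hard-coding a room ID.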

Step 4.2: Creating the Custom Agent Class

The heart of your agent is the MyVoiceAgent class. Let's examine how it's defined and customized for healthcare:
agent_instructions = "You are a helpful healthcare AI voice agent designed to assist users with general healthcare inquiries and administrative tasks. ..."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
  • The agent_instructions string defines your agent's persona, capabilities, and important constraints (such as not providing medical advice and maintaining privacy).
  • on_enter and on_exit provide polite greetings and farewells.
This ensures your agent is empathetic, professional, and HIPAA-aware.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline connects all the audio and language processing components:
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
You can swap these plugins for alternatives (for example, a different STT provider, or Google Gemini as the LLM) as needed.

Step 4.4: Managing the Session and Startup Logic

The session management code handles connecting to VideoSDK, starting the agent, and cleaning up resources:
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(...)
    session = AgentSession(agent=agent, pipeline=pipeline, conversation_flow=conversation_flow)
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
  • start_session creates the agent, pipeline, and session, then connects and starts everything.
  • make_context sets up the meeting room (with playground=True for easy testing).
  • The main block starts the agent worker.
For more on managing and maintaining agent sessions, visit the AI voice Agent Sessions documentation.

5. Running and Testing the Agent

Step 5.1: Running the Python Script

Activate your virtual environment and run:
python main.py
In the console, you'll see output with a "Playground" link. This link lets you join the agent's meeting room as a user for live testing.

Step 5.2: Interacting with the Agent in the Playground

  1. Open the provided Playground link in your browser.
  2. Join as a participant (with microphone enabled).
  3. Speak your healthcare question or request.
  4. The agent will respond with synthesized speech.
To gracefully shut down the agent, press Ctrl+C in your terminal. This ensures all resources are cleaned up.

6. Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework supports function_tool, allowing you to add custom actions (like appointment booking or medication reminders). You can define Python functions and expose them to the LLM for dynamic, healthcare-specific workflows.
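As a sketch of what such a tool could look like: the real decorator lives in the VideoSDK framework, so a no-op stand-in is used here, and set_medication_reminder is a hypothetical example function, not part of the SDK.

```python
import asyncio

def function_tool(fn):
    """Stand-in for the framework's function_tool decorator; the real one
    registers the function's name and signature so the LLM can call it."""
    return fn

@function_tool
async def set_medication_reminder(medication: str, time_of_day: str) -> str:
    """Hypothetical tool: confirm a daily medication reminder.
    A real implementation would write to a scheduler or calendar API."""
    return f"Reminder set: take {medication} every day at {time_of_day}."

print(asyncio.run(set_medication_reminder("metformin", "8:00 AM")))
```

When registered with the agent, the LLM decides from the conversation when to invoke the tool and with what arguments, and speaks the returned string back to the user.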

Exploring Other Plugins

You can experiment with different plugins for each pipeline stage:
  • STT: Deepgram (accurate and cost-effective)
  • TTS: ElevenLabs (best quality), Cartesia, Rime (low-cost), Deepgram (cost-effective)
  • LLM: OpenAI GPT-4o, Google Gemini
This flexibility lets you optimize for cost, quality, or compliance as needed.

7. Troubleshooting Common Issues

API Key and Authentication Errors

  • Double-check your .env file for typos or missing keys.
  • Ensure your VideoSDK, Deepgram, OpenAI, and ElevenLabs accounts are active.

Audio Input/Output Problems

  • Make sure your microphone and speakers are working.
  • Test in the VideoSDK Playground to isolate issues.

Dependency and Version Conflicts

  • Use a fresh virtual environment.
  • Run pip list to check for conflicting packages.

8. Conclusion

You've built a fully functional, HIPAA-aware healthcare AI voice agent using the VideoSDK AI Agents framework. This agent can answer general healthcare questions, assist with administrative tasks, and deliver a natural, conversational experience.
Next steps:
  • Add custom tools for appointment booking or reminders.
  • Explore other plugins for improved quality or cost savings.
  • Integrate with healthcare systems for real-world deployments.
With this foundation, you're ready to bring intelligent voice automation to healthcare workflows.
