AI Voice Assistants for Telemedicine

Build AI voice assistants for telemedicine with VideoSDK. Follow our step-by-step tutorial and code examples.

Introduction to AI Voice Agents in Telemedicine

In the rapidly evolving field of telemedicine, AI voice agents are becoming indispensable tools. These agents streamline interactions between healthcare providers and patients by providing immediate, conversational support. But what exactly is an AI voice agent?

What is an AI Voice Agent?

An AI voice agent is a software program that uses speech recognition, natural language processing, and speech synthesis to interact with users through voice. These agents can understand spoken language, process the information, and respond in a human-like manner.

Why Are They Important for Telemedicine?

In telemedicine, AI voice agents can assist patients by answering questions about symptoms, scheduling appointments, and providing general health information. They offer a scalable solution to manage patient inquiries efficiently, reducing the workload on healthcare professionals and improving patient access to information.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Language Model (LLM): Processes the text to understand and generate responses.
  • Text-to-Speech (TTS): Converts the response text back into spoken language.
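
To make the cascade concrete, here is a minimal, framework-free sketch of one conversational turn. The three stage functions are placeholders written for this tutorial (not part of any SDK); in the real agent they are backed by services such as Deepgram, OpenAI, and ElevenLabs.

```python
# Placeholder stages standing in for real STT, LLM, and TTS services.

def transcribe(audio: bytes) -> str:
    """STT stage (stub): real code would send audio to a speech-to-text API."""
    return "I have a headache"

def generate_reply(text: str) -> str:
    """LLM stage (stub): real code would call a language model."""
    return f"I'm sorry to hear that you said: '{text}'. Please consult a doctor."

def synthesize(text: str) -> bytes:
    """TTS stage (stub): real code would return synthesized audio."""
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn: audio in, audio out, via STT -> LLM -> TTS."""
    return synthesize(generate_reply(transcribe(audio)))
```

The pipeline you build later in this tutorial follows exactly this shape, with each stub replaced by a plugin.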

What You'll Build in This Tutorial

In this tutorial, we will guide you through building a basic AI voice assistant tailored for telemedicine using the VideoSDK framework. By the end, you'll have a functioning agent capable of handling simple healthcare-related queries.

Architecture and Core Concepts

High-Level Architecture Overview

The AI voice agent architecture involves several key components working together to process user input and generate responses. Here's a high-level overview of the data flow:
  1. User Speech: Captured via microphone.
  2. Voice Activity Detection (VAD): Identifies when the user is speaking.
  3. Speech-to-Text (STT): Converts speech to text.
  4. Language Model (LLM): Processes text and generates a response.
  5. Text-to-Speech (TTS): Converts the response text back to speech.
  6. Agent Response: Played back to the user.
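As a rough illustration of step 2, a toy energy-based detector is sketched below. Production VADs such as Silero are model-based neural detectors, so this stub exists only to make the data flow concrete; the function and threshold are illustrative, not part of any SDK.

```python
# Toy voice-activity detector: flags a frame as speech when its mean
# absolute amplitude exceeds a threshold. Real VADs are far more robust.

def is_speech(frame: list[float], threshold: float = 0.35) -> bool:
    if not frame:
        return False
    energy = sum(abs(sample) for sample in frame) / len(frame)
    return energy > threshold
```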

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class that represents your bot. It manages the interaction logic and state.
  • Cascading Pipeline: Defines the flow of audio processing, moving from STT to LLM to TTS.
  • VAD & Turn Detector: These components help the agent know when to listen and when to speak, ensuring smooth interaction.

Setting Up the Development Environment

Prerequisites

Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account. Sign up at the VideoSDK dashboard to access necessary API keys.

Step 1: Create a Virtual Environment

To keep your project dependencies organized, create a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip:
pip install videosdk-agents videosdk-plugins

Step 3: Configure API Keys in a .env File

Create a .env file in your project directory and add your VideoSDK API key. The pipeline built below also uses Deepgram, OpenAI, and ElevenLabs, so add keys for those services too (the variable names below are the common defaults; check each plugin's documentation for the exact names it expects):

VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
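Many projects load the .env file with the python-dotenv package (`pip install python-dotenv`, then `load_dotenv()`). If you prefer to stay stdlib-only, a minimal loader looks like this; `load_env` is a helper written for this tutorial, not part of VideoSDK:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines; '#' starts a comment line."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not overwrite variables already set in the real environment
            os.environ.setdefault(key.strip(), value.strip())
```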

Building the AI Voice Agent: A Step-by-Step Guide

Let's dive into building your AI voice agent. We'll start by presenting the complete code and then break it down step-by-step.
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are a helpful healthcare assistant specializing in telemedicine. Your primary role is to assist users by answering questions about symptoms, providing general health information, and helping to schedule appointments with healthcare professionals. You can also provide information about telemedicine services and how to access them. However, you are not a medical professional and must always include a disclaimer advising users to consult a doctor for medical advice. You should prioritize user privacy and data security, ensuring that any personal information shared is handled with the utmost confidentiality. You must not provide any diagnosis or treatment recommendations. Your responses should be clear, concise, and empathetic, ensuring users feel supported and informed."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
    #  room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To interact with your agent, you'll need a meeting ID. Use the following curl command to generate one:
curl -X POST "https://api.videosdk.live/v1/meetings" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
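If you would rather generate the meeting ID from Python, the same request can be built with the standard library. `build_meeting_request` is a helper written for this tutorial; the endpoint and headers are taken directly from the curl command above.

```python
import urllib.request

def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build the same POST request as the curl command above (stdlib only)."""
    return urllib.request.Request(
        "https://api.videosdk.live/v1/meetings",
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To actually send it (this makes a network call):
# with urllib.request.urlopen(build_meeting_request(api_key)) as resp:
#     print(resp.read())  # JSON payload containing the meeting ID
```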

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class from the VideoSDK framework. It defines the behavior of your voice agent. The on_enter and on_exit methods are used to greet users and say goodbye, respectively.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The Cascading Pipeline is the heart of your AI voice agent. It defines the sequence of processing steps from capturing audio to generating a response.

pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent session and connects it to the VideoSDK framework. It ensures that the session remains active until manually terminated.
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()
The make_context function sets up the room options for the agent, allowing it to run in a test environment.
def make_context() -> JobContext:
    room_options = RoomOptions(
    #  room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)
Finally, the if __name__ == "__main__": block starts the agent when the script is executed.
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

To run your voice agent, execute the Python script:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you'll see a playground link in the console. Open this link in your browser to interact with the agent. Speak into your microphone to test the agent's responses.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows for extending the agent's capabilities using custom tools. These tools can be integrated into the pipeline to add new functionalities.
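
As a rough illustration, a tool is typically just a well-documented (often async) function the LLM can invoke. The function below is a stub written for this tutorial, and the exact way a tool is registered with the agent is defined by the VideoSDK agents API, so check the official docs for the current decorator or registration call.

```python
import asyncio

async def check_appointment_slots(date: str) -> list[str]:
    """Return available telemedicine appointment slots for a date (stubbed)."""
    # Stand-in for a real scheduling backend query.
    stub_calendar = {"2024-07-01": ["09:00", "14:30"]}
    return stub_calendar.get(date, [])

# Example: asyncio.run(check_appointment_slots("2024-07-01"))
```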

Exploring Other Plugins

While this tutorial uses specific plugins for STT, LLM, and TTS, the VideoSDK framework supports various other options. Explore different plugins to find the best fit for your needs.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file. Double-check for typos or missing keys.

Audio Input/Output Problems

Verify that your microphone and speakers are properly connected and configured. Check your system settings if you encounter issues.

Dependency and Version Conflicts

Ensure all dependencies are installed using the correct versions. Use a virtual environment to avoid conflicts with other projects.

Conclusion

Summary of What You've Built

In this tutorial, you've built a basic AI voice assistant for telemedicine using the VideoSDK framework. Your agent can handle simple healthcare-related queries and provide information to users.

Next Steps and Further Learning

To enhance your agent, consider exploring additional plugins and customizing the agent's logic. Keep learning and experimenting to create more advanced AI voice solutions.
