Introduction to AI Voice Agents in Education
AI Voice Agents are intelligent systems designed to interact with users through voice commands. They leverage advanced technologies such as Speech-to-Text (STT), Text-to-Speech (TTS), and Large Language Models (LLM) to understand and respond to user queries. In the education industry, these agents can revolutionize the way students, teachers, and parents engage with educational content.
What is an AI Voice Agent?
An AI Voice Agent is a software application that processes spoken language to perform tasks or provide information. It listens to user input, interprets it using natural language understanding, and generates appropriate responses.
Why are they important for the Education Industry?
In education, AI Voice Agents can assist with answering student queries, providing explanations for complex topics, and managing administrative tasks like scheduling and reminders. They enhance the learning experience by offering personalized support and freeing educators to focus on more critical tasks.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Large Language Model (LLM): Processes the text to understand and generate responses.
- Text-to-Speech (TTS): Converts text responses back into spoken language.
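Before wiring in real providers, the cascade itself is easy to picture. The sketch below is framework-free: each stage is a stand-in stub (in the actual agent these calls are made by the Deepgram, OpenAI, and ElevenLabs plugins used later in this tutorial), but the data flow is the same one the pipeline manages.

```python
# A minimal, framework-free sketch of the STT -> LLM -> TTS cascade.
# The three stage functions are stubs standing in for real providers.

def speech_to_text(audio: bytes) -> str:
    # A real STT engine (e.g. Deepgram) would transcribe audio here;
    # this stub pretends the "audio" is already text.
    return audio.decode("utf-8")

def generate_response(prompt: str) -> str:
    # A real LLM (e.g. GPT-4o) would reason about the prompt here.
    return f"Here is an explanation of: {prompt}"

def text_to_speech(text: str) -> bytes:
    # A real TTS engine (e.g. ElevenLabs) would synthesize audio here.
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)   # 1. STT
    reply = generate_response(transcript)   # 2. LLM
    return text_to_speech(reply)            # 3. TTS

print(handle_turn(b"what is photosynthesis?").decode())
```

The real pipeline adds streaming, voice-activity detection, and turn-taking on top of this loop, but the component order is identical.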
What You'll Build in This Tutorial
In this tutorial, you will build a fully functional AI Voice Assistant tailored for the education sector using VideoSDK. The agent will be capable of understanding and responding to educational queries, scheduling tasks, and more.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent architecture involves several components working in tandem to process user input and generate responses. The flow starts with the user speaking into the system: the audio is captured and processed by the STT module, the transcribed text is sent to the LLM, which generates a response, and the TTS module converts that response back into speech.
Understanding Key Concepts in the VideoSDK Framework
- Agent: Represents the core logic of your voice assistant.
- Cascading Pipeline: Manages the flow of data through the STT, LLM, and TTS modules.
- VAD & TurnDetector: Ensure the agent listens and responds at appropriate times by detecting voice activity and conversational turns.
Setting Up the Development Environment
Prerequisites
Before starting, ensure you have Python 3.11+ installed and a VideoSDK account. Sign up at VideoSDK to access API keys and other resources.
Step 1: Create a Virtual Environment
Create a virtual environment to manage dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
Step 2: Install Required Packages
Install the necessary packages using pip:
```bash
pip install videosdk-agents videosdk-plugins-openai videosdk-plugins-deepgram videosdk-plugins-elevenlabs videosdk-plugins-silero
```
Step 3: Configure API Keys in a .env File
Create a .env file in your project root and add your API keys. Alongside the VideoSDK key, the Deepgram, OpenAI, and ElevenLabs plugins each require a key from their respective provider (check each plugin's documentation for the exact variable name it reads):

```bash
VIDEOSDK_API_KEY=your_videosdk_api_key_here
DEEPGRAM_API_KEY=your_deepgram_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
```
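At runtime the agent process needs these variables in its environment. A common approach is the python-dotenv package; purely as an illustration of what such a loader does, here is a minimal hand-rolled version (a sketch — prefer python-dotenv in a real project):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines, '#' comments ignored.
    (python-dotenv provides a far more robust implementation.)"""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            # setdefault: variables already exported in the shell win
            os.environ.setdefault(key.strip(), value.strip())

load_env()
print("VIDEOSDK_API_KEY set:", bool(os.getenv("VIDEOSDK_API_KEY")))
```

If a key prints as not set, double-check the .env file's location (it must sit in the directory you run the script from) and its spelling.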
Building the AI Voice Agent: A Step-by-Step Guide
Here is the complete code to build your AI Voice Assistant:
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

pre_download_model()

agent_instructions = "You are an AI Voice Assistant specialized in the education industry. Your primary role is to assist students, teachers, and parents by providing information and support related to educational content and processes. You can answer questions about various subjects, provide explanations of complex topics, and offer guidance on educational resources. Additionally, you can help schedule study sessions and remind users of important academic deadlines. However, you are not a certified educator, so you must always encourage users to consult with qualified teachers or educational professionals for personalized advice. You must also respect user privacy and ensure that any personal data shared during interactions is handled securely and confidentially."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Step 4.1: Generating a VideoSDK Meeting ID
To interact with your agent, you need a meeting ID. Use the following curl command to generate one:

```bash
curl -X POST \
  https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
```
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the base Agent class, providing custom behavior for entering and exiting sessions:

```python
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")
```

This class sets up the initial and closing messages for the agent, ensuring a friendly interaction.
Step 4.3: Defining the Core Pipeline
The CascadingPipeline defines the flow of data through the system, integrating the various plugins for processing:

```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```

Each component in the pipeline plays a crucial role in handling a different aspect of voice interaction.
Step 4.4: Managing the Session and Startup Logic
The start_session function manages the lifecycle of the agent session:

```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
```

The make_context function sets up the environment for the session:

```python
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)
```

Finally, the script entry point starts the agent:

```python
if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Running and Testing the Agent
Step 5.1: Running the Python Script
To start your AI Voice Assistant, run the following command:
```bash
python main.py
```
Step 5.2: Interacting with the Agent in the Playground
Once the script is running, a test URL will be displayed in the console. Visit this URL to interact with your agent. You can join the session and start speaking to test the agent's capabilities.
Advanced Features and Customizations
Extending Functionality with Custom Tools
The VideoSDK framework allows you to extend your agent's capabilities by integrating custom tools. This enables more specialized interactions tailored to your educational needs.
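For example, an education agent might expose a study-session scheduler that the LLM can invoke. The plain function below runs as-is; how you register it with the agent depends on the framework's tool API, so the registration step is deliberately left as a comment rather than guessed — check the VideoSDK documentation for the current decorator or registration call:

```python
from datetime import datetime, timedelta

# In the VideoSDK framework this function would be registered with the
# agent as a callable tool; the exact registration API is not shown here
# (consult the docs). Only the tool's own logic is sketched below.
def schedule_study_session(topic: str, minutes_from_now: int) -> str:
    """Return a confirmation message for a study-session reminder."""
    when = datetime.now() + timedelta(minutes=minutes_from_now)
    return f"Study session on '{topic}' scheduled for {when:%H:%M}."

print(schedule_study_session("algebra", 30))
```

Once registered, the LLM can call the tool in response to a spoken request like "remind me to study algebra in half an hour" and speak the returned confirmation back to the user.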
Exploring Other Plugins
While this tutorial uses specific plugins, the VideoSDK framework supports a variety of STT, LLM, and TTS options. Explore these to optimize your agent's performance and functionality.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the .env file and that you have the necessary permissions.
Audio Input/Output Problems
Check your audio device settings and ensure the correct input/output devices are selected.
Dependency and Version Conflicts
Ensure all dependencies are up to date and compatible with your Python version.
Conclusion
Summary of What You've Built
In this tutorial, you have built a comprehensive AI Voice Assistant tailored for the education industry. This agent can handle various educational queries and assist in managing tasks.
Next Steps and Further Learning
Continue exploring the AI Voice Agent core components to add more features and improve your agent's capabilities. Consider integrating additional plugins and customizing the agent's behavior to better suit your needs.