1. Introduction to AI Voice Agents for Voice Acting
What is an AI Voice Agent?
An AI Voice Agent is a software system that can interact with users through natural spoken language. It combines speech recognition (STT), a natural language model (LLM), and speech synthesis (TTS) to hold real-time conversations, answer questions, and provide guidance.
Why are they important for the voice acting industry?
In the voice acting industry, aspiring and professional voice actors often seek career advice, audition tips, and industry insights. A specialized AI Voice Acting Agent can provide immediate, friendly, and expert guidance—acting as a virtual talent agent available 24/7. This democratizes access to industry knowledge, helps users prepare for auditions, and connects them with valuable resources.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken words into text.
- Natural Language Model (LLM): Understands and generates human-like responses.
- Text-to-Speech (TTS): Converts responses back into natural-sounding speech.
- Voice Activity Detection (VAD): Detects when the user is speaking.
- Turn Detection: Determines when each party should take their turn in the conversation.
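To make the flow of these components concrete, here is a framework-free sketch of a single conversational turn. The stub functions below are placeholders, not VideoSDK APIs; each one stands in for a real plugin:

```python
# Framework-free sketch of one conversational turn in a voice agent.
# Each stub stands in for a real plugin (e.g. Deepgram STT, GPT-4o, ElevenLabs TTS).

def speech_to_text(audio: bytes) -> str:
    # Placeholder STT: a real plugin would transcribe the audio.
    return "How do I prepare for an animation audition?"

def generate_reply(transcript: str) -> str:
    # Placeholder LLM: a real model would generate a contextual answer.
    return f"Great question about: {transcript}"

def text_to_speech(reply: str) -> bytes:
    # Placeholder TTS: a real plugin would synthesize speech audio.
    return reply.encode("utf-8")

def one_turn(audio: bytes) -> bytes:
    """VAD and turn detection decide *when* this runs; this is the *what*."""
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    return text_to_speech(reply)

print(one_turn(b"\x00\x01").decode("utf-8"))
```

In the real framework, the CascadingPipeline you will build below wires these stages together for you.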
What You'll Build in This Tutorial
In this guide, you'll build a fully functional AI Voice Acting Agent using the VideoSDK AI Agents framework. The agent will provide career advice, audition tips, and industry information for voice actors. You'll learn how to set up the environment, implement the agent, and test it live.
2. Architecture and Core Concepts
High-Level Architecture Overview
Our Voice Acting Agent is built on a modular pipeline. User audio is processed by Voice Activity Detection and Turn Detection, transcribed by STT, interpreted by a Large Language Model, and spoken back through TTS. All components are orchestrated by the VideoSDK framework. For a deeper dive, refer to the AI voice agent core components overview in the official documentation to understand how each part fits together.
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core logic that defines how your AI interacts with users. You subclass the Agent class to define custom behaviors.
- CascadingPipeline: Orchestrates the flow of audio and text through the VAD, Turn Detection, STT, LLM, and TTS plugins. Learn more about the cascading pipeline in AI voice agents to see how it enables seamless communication between components.
- VAD & TurnDetector: SileroVAD detects when the user is speaking; TurnDetector determines conversational turns, ensuring natural back-and-forth.
3. Setting Up the Development Environment
Prerequisites
- Python 3.11 or newer
- A VideoSDK account (for API keys and dashboard access)
Step 1: Create a Virtual Environment
Open your terminal and run:
```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Step 2: Install Required Packages
Install the VideoSDK AI Agents framework and plugins:
```bash
pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
```
Step 3: Configure API Keys in a .env file
Create a `.env` file in your project directory and add your VideoSDK and plugin API keys:
```
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```
You can find your VideoSDK API key in the dashboard after signing up. Obtain plugin keys from their respective providers.
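A quick way to catch a missing or empty key before the agent starts is to validate the environment up front. This helper is a small sketch, not part of the VideoSDK framework; it assumes the four variable names shown above:

```python
import os

# The variable names from the .env file above.
REQUIRED_KEYS = [
    "VIDEOSDK_API_KEY",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
]

def missing_keys(env=os.environ) -> list:
    """Return the names of required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print(f"Missing API keys in .env: {', '.join(absent)}")
```

Running this check at startup turns a confusing mid-session authentication failure into an immediate, readable error message.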
4. Building the AI Voice Agent: A Step-by-Step Guide
Full Working Code Example
Below is the complete, runnable Python script for your Voice Acting Agent. We'll break down each part in the following sections.
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

# Triple-quoted so the multi-line instructions parse as a single string
agent_instructions = """You are a professional and knowledgeable Voice Acting Agent. Your persona is friendly, supportive, and resourceful, acting as a virtual talent agent specializing in voice acting careers.

Capabilities:
- Provide information about the voice acting industry, including career paths, audition tips, and required skills.
- Offer guidance on building a voice acting portfolio, finding auditions, and connecting with casting directors or agencies.
- Answer questions about voice acting techniques, training resources, and industry trends.
- Suggest reputable online platforms, workshops, and communities for aspiring and professional voice actors.
- Assist users in preparing for auditions by offering script reading tips and vocal warm-up exercises.

Constraints and Limitations:
- You do not represent any real-world agency or guarantee job placements.
- Do not provide legal, financial, or contractual advice; always recommend consulting a qualified professional for such matters.
- Avoid sharing personal opinions or endorsements of specific individuals or companies.
- Do not collect or store any personal information from users.
- Always encourage users to verify information independently and exercise caution when pursuing opportunities."""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Step 4.1: Generating a VideoSDK Meeting ID
To test your agent, you'll need a meeting ID. You can create one via the VideoSDK API.
Run this `curl` command (replace `YOUR_API_KEY` with your VideoSDK auth token):
```bash
curl -X POST \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}' \
  https://api.videosdk.live/v2/rooms
```
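If you prefer Python over `curl`, the same request can be sketched with the standard library. The endpoint and headers mirror the `curl` call above; no extra dependencies are needed:

```python
import json
import urllib.request

def build_room_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a room."""
    return urllib.request.Request(
        "https://api.videosdk.live/v2/rooms",
        data=json.dumps({}).encode("utf-8"),
        headers={
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_room_request("YOUR_API_KEY")
    # To actually create the room (requires a valid key), uncomment:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["roomId"])
    print(req.get_method(), req.full_url)
```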
The response will include a `roomId`. You can use this in your `RoomOptions` if you want to join a pre-created room. If omitted, a new room is auto-created.
Step 4.2: Creating the Custom Agent Class (MyVoiceAgent)
The heart of your agent is the custom class that defines its persona and behavior.
```python
# Triple-quoted so the multi-line instructions parse as a single string
agent_instructions = """You are a professional and knowledgeable Voice Acting Agent. Your persona is friendly, supportive, and resourceful, acting as a virtual talent agent specializing in voice acting careers.

Capabilities:
- Provide information about the voice acting industry, including career paths, audition tips, and required skills.
- Offer guidance on building a voice acting portfolio, finding auditions, and connecting with casting directors or agencies.
- Answer questions about voice acting techniques, training resources, and industry trends.
- Suggest reputable online platforms, workshops, and communities for aspiring and professional voice actors.
- Assist users in preparing for auditions by offering script reading tips and vocal warm-up exercises.

Constraints and Limitations:
- You do not represent any real-world agency or guarantee job placements.
- Do not provide legal, financial, or contractual advice; always recommend consulting a qualified professional for such matters.
- Avoid sharing personal opinions or endorsements of specific individuals or companies.
- Do not collect or store any personal information from users.
- Always encourage users to verify information independently and exercise caution when pursuing opportunities."""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```
This class sets the agent's persona and welcome/goodbye messages.
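To see how the lifecycle hooks fire in isolation, here is a framework-free stand-in. The real `Agent` base class and session come from `videosdk.agents`; the stubs below only mimic the hook pattern:

```python
import asyncio

class FakeSession:
    """Stand-in for the real AgentSession; just records what was 'said'."""
    def __init__(self):
        self.spoken = []

    async def say(self, text: str):
        self.spoken.append(text)

class DemoAgent:
    def __init__(self, session):
        self.session = session

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def run_lifecycle() -> list:
    session = FakeSession()
    agent = DemoAgent(session)
    await agent.on_enter()   # fired when the agent joins the room
    await agent.on_exit()    # fired when the session ends
    return session.spoken

print(asyncio.run(run_lifecycle()))
```

The framework calls `on_enter` and `on_exit` for you at the right moments; you only supply the behavior.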
Step 4.3: Defining the Core Pipeline (CascadingPipeline)
The pipeline connects all the plugins: STT, LLM, TTS, VAD, and Turn Detector.
```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```
- DeepgramSTT: High-quality, cost-effective speech recognition.
- OpenAILLM: Powerful GPT-4o language model for smart responses.
- ElevenLabsTTS: Natural and expressive voice synthesis. For more advanced voice synthesis, check out the ElevenLabs TTS plugin documentation to explore additional configuration options and voices.
- SileroVAD: Reliable voice activity detection.
- TurnDetector: Ensures smooth conversational turns. To better understand how turn-taking is managed, see the turn detector documentation for AI voice agents.
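As a rough intuition for what the `threshold=0.8` setting means, here is a toy illustration. This is not the actual TurnDetector internals (which use a trained model to score end-of-turn probability); it only shows how a threshold turns a score into a decision:

```python
def is_turn_complete(end_of_turn_score: float, threshold: float = 0.8) -> bool:
    """Toy stand-in: the real TurnDetector produces a model-based score."""
    return end_of_turn_score >= threshold

# A trailing-off "umm..." might score low; a clear, finished question scores high.
print(is_turn_complete(0.35))  # False -> keep listening
print(is_turn_complete(0.92))  # True  -> hand the turn to the agent
```

A higher threshold makes the agent wait longer before replying (fewer interruptions, more latency); a lower one makes it jump in faster.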
Step 4.4: Managing the Session and Startup Logic
Sessions manage the lifecycle of your agent and handle connections.
```python
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
- The `make_context` function creates a room with `playground=True` for easy browser testing. You can also experiment with your agent in the AI Agent playground for interactive testing and rapid prototyping.
- The main block starts the agent job.
5. Running and Testing the Agent
Step 5.1: Running the Python Script
- Make sure your `.env` file is set up with all required API keys.
- Run the script:
```bash
python main.py
```
- In the console output, look for a line that says `Playground URL:`. This link lets you join the agent session from your browser.
Step 5.2: Interacting with the Agent in the Playground
- Open the Playground URL in your browser.
- Join the meeting as a participant.
- Speak or type your questions about voice acting.
- The agent will respond in real time using natural-sounding speech.
To stop the agent, press `Ctrl+C` in your terminal. This triggers a graceful shutdown, ensuring all resources are released.
6. Advanced Features and Customizations
Extending Functionality with Custom Tools
You can add custom tools or actions to your agent by subclassing and extending the `Agent` class. For example, you could add a portfolio review tool, integrate with audition databases, or trigger notifications.
Exploring Other Plugins
VideoSDK supports a variety of plugins for STT, TTS, and LLM. You can experiment with alternatives like Cartesia for STT, Deepgram for TTS, or Google Gemini for LLM to optimize for cost, quality, or language support.
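As a concrete sketch of the custom-tool idea above: the tool logic itself is plain Python, which you would then register with the SDK's tool mechanism (check the VideoSDK agents docs for the exact decorator and signature). The warm-up suggester below is entirely hypothetical, invented for illustration:

```python
# Hypothetical tool body for a voice acting agent: a vocal warm-up suggester.
# In the real framework you would expose this via the SDK's tool registration
# (see the VideoSDK agents documentation); the logic itself is framework-free.

WARMUPS = {
    "short": ["lip trills", "humming scales"],
    "long": ["lip trills", "humming scales", "tongue twisters", "sirens"],
}

def suggest_warmup(minutes: int) -> list:
    """Return a warm-up routine sized to the time available before an audition."""
    return WARMUPS["short"] if minutes < 10 else WARMUPS["long"]

print(suggest_warmup(5))   # ['lip trills', 'humming scales']
```

Once registered, the LLM can call the tool mid-conversation (e.g. when a user says "I have five minutes before my audition") and speak the result back through TTS.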
7. Troubleshooting Common Issues
API Key and Authentication Errors
- Double-check all API keys in your `.env` file.
- Make sure your VideoSDK account is active.
Audio Input/Output Problems
- Ensure your microphone is enabled in the browser.
- Test with different browsers if you encounter issues.
Dependency and Version Conflicts
- Use a clean virtual environment.
- Run `pip list` to check for any conflicting package versions.
8. Conclusion
In this tutorial, you built a professional AI Voice Acting Agent using the VideoSDK AI Agents framework. You learned how to set up the environment, implement a custom agent, and test it in the browser.
Explore more advanced features, try different plugins, and consider deploying your agent to help aspiring voice actors worldwide. The possibilities for customization and integration are vast—happy building!