Build a Voice Acting Agent AI with VideoSDK: Step-by-Step Guide

Step-by-step Python tutorial to create a professional AI Voice Acting Agent with VideoSDK. Includes code, setup, and testing instructions.

1. Introduction to AI Voice Agents for Voice Acting

What is an AI Voice Agent?

An AI Voice Agent is a software system that can interact with users using natural spoken language. It leverages speech recognition (STT), natural language processing (LLM), and speech synthesis (TTS) to hold real-time conversations, answer questions, and provide guidance.

Why are they important for the voice acting industry?

In the voice acting industry, aspiring and professional voice actors often seek career advice, audition tips, and industry insights. A specialized AI Voice Acting Agent can provide immediate, friendly, and expert guidance—acting as a virtual talent agent available 24/7. This democratizes access to industry knowledge, helps users prepare for auditions, and connects them with valuable resources.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken words into text.
  • Natural Language Model (LLM): Understands and generates human-like responses.
  • Text-to-Speech (TTS): Converts responses back into natural-sounding speech.
  • Voice Activity Detection (VAD): Detects when the user is speaking.
  • Turn Detection: Determines when to take turns in conversation.
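The five components above form a simple chain. As a conceptual sketch (plain Python stubs, not the actual VideoSDK plugin classes used later in this guide), one conversational turn flows like this:

```python
# Conceptual sketch only -- each component is a stand-in callable,
# not a real VideoSDK plugin.
def run_turn(audio_chunk, vad, stt, llm, tts):
    """Process one chunk of user audio through the pipeline."""
    if not vad(audio_chunk):        # VAD: ignore silence
        return None
    text = stt(audio_chunk)         # STT: audio -> transcript
    reply = llm(text)               # LLM: transcript -> response text
    return tts(reply)               # TTS: response text -> audio
```

In the real framework, the pipeline additionally handles streaming, buffering, and turn-taking, but the data flow is the same.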

What You'll Build in This Tutorial

In this guide, you'll build a fully functional AI Voice Acting Agent using the VideoSDK AI Agents framework. The agent will provide career advice, audition tips, and industry information for voice actors. You'll learn how to set up the environment, implement the agent, and test it live.

2. Architecture and Core Concepts

High-Level Architecture Overview

Our Voice Acting Agent is built on a modular pipeline. User audio is processed by Voice Activity Detection and Turn Detection, transcribed by STT, interpreted by a Large Language Model, and spoken back using TTS. All components are orchestrated by the VideoSDK framework. For a deeper dive into the AI voice agent core components overview, refer to the official documentation to understand how each part fits together.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core logic that defines how your AI interacts with users. You subclass the Agent class to define custom behaviors.
  • CascadingPipeline: Orchestrates the flow of audio and text through VAD, Turn Detection, STT, LLM, and TTS plugins. Learn more about the cascading pipeline in AI voice agents to see how this enables seamless communication between components.
  • VAD & TurnDetector: SileroVAD detects when the user is speaking; TurnDetector determines conversational turns, ensuring natural back-and-forth.
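As a toy illustration of the turn-taking idea (not how SileroVAD or TurnDetector work internally -- they use trained models), end-of-turn can be approximated as a long enough run of trailing silent frames:

```python
# Toy model: a turn ends when the last `patience` frames were all silence.
# `vad_flags` is a list of booleans, True where the VAD detected speech.
def end_of_turn(vad_flags, patience=3):
    silent = 0
    for speaking in vad_flags:
        silent = 0 if speaking else silent + 1
    return silent >= patience
```

The real TurnDetector replaces this heuristic with a model whose sensitivity you tune via its threshold parameter, as shown in the pipeline code below.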

3. Setting Up the Development Environment

Prerequisites

  • Python 3.11 or newer
  • A VideoSDK account (for API keys and dashboard access)

Step 1: Create a Virtual Environment

Open your terminal and run:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 2: Install Required Packages

Install the VideoSDK AI Agents framework and plugins:
pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory and add your VideoSDK and plugin API keys:
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
You can find your VideoSDK API key in the dashboard after signing up. Obtain plugin keys from their respective providers.
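The plugins read these keys from environment variables. If you prefer not to add a dependency like python-dotenv, a minimal stdlib-only loader (an illustrative sketch, not part of the VideoSDK framework) can populate os.environ from the .env file before the agent starts:

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: one KEY=value per line; '#' lines are comments.
    Existing environment variables are never overwritten."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```

Call load_env_file() at the top of your script, before any plugin is constructed.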

4. Building the AI Voice Agent: A Step-by-Step Guide

Full Working Code Example

Below is the complete, runnable Python script for your Voice Acting Agent. We'll break down each part in the following sections.
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session starts quickly
pre_download_model()

agent_instructions = """You are a professional and knowledgeable Voice Acting Agent. Your persona is friendly, supportive, and resourceful, acting as a virtual talent agent specializing in voice acting careers.

Capabilities:
- Provide information about the voice acting industry, including career paths, audition tips, and required skills.
- Offer guidance on building a voice acting portfolio, finding auditions, and connecting with casting directors or agencies.
- Answer questions about voice acting techniques, training resources, and industry trends.
- Suggest reputable online platforms, workshops, and communities for aspiring and professional voice actors.
- Assist users in preparing for auditions by offering script reading tips and vocal warm-up exercises.

Constraints and Limitations:
- You do not represent any real-world agency or guarantee job placements.
- Do not provide legal, financial, or contractual advice; always recommend consulting a qualified professional for such matters.
- Avoid sharing personal opinions or endorsements of specific individuals or companies.
- Do not collect or store any personal information from users.
- Always encourage users to verify information independently and exercise caution when pursuing opportunities."""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To test your agent, you'll need a meeting ID. You can create one via the VideoSDK API.
Run this curl command (replace YOUR_API_KEY):
curl -X POST \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}' \
  https://api.videosdk.live/v2/rooms
The response will include a roomId. You can use this in your RoomOptions if you want to join a pre-created room. If omitted, a new room is auto-created.
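If you'd rather create the room from Python, the same request can be made with the standard library. This sketch mirrors the curl command above; the roomId field name is taken from the response described there:

```python
import json
import urllib.request

VIDEOSDK_ROOMS_URL = "https://api.videosdk.live/v2/rooms"

def create_room(api_key: str) -> str:
    """POST an empty JSON body to the rooms endpoint and return the roomId."""
    req = urllib.request.Request(
        VIDEOSDK_ROOMS_URL,
        data=b"{}",
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["roomId"]
```

Pass the returned ID as room_id in your RoomOptions if you want the agent to join that specific room.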

Step 4.2: Creating the Custom Agent Class (MyVoiceAgent)

The heart of your agent is the custom class that defines its persona and behavior.
agent_instructions = """You are a professional and knowledgeable Voice Acting Agent. Your persona is friendly, supportive, and resourceful, acting as a virtual talent agent specializing in voice acting careers.

Capabilities:
- Provide information about the voice acting industry, including career paths, audition tips, and required skills.
- Offer guidance on building a voice acting portfolio, finding auditions, and connecting with casting directors or agencies.
- Answer questions about voice acting techniques, training resources, and industry trends.
- Suggest reputable online platforms, workshops, and communities for aspiring and professional voice actors.
- Assist users in preparing for auditions by offering script reading tips and vocal warm-up exercises.

Constraints and Limitations:
- You do not represent any real-world agency or guarantee job placements.
- Do not provide legal, financial, or contractual advice; always recommend consulting a qualified professional for such matters.
- Avoid sharing personal opinions or endorsements of specific individuals or companies.
- Do not collect or store any personal information from users.
- Always encourage users to verify information independently and exercise caution when pursuing opportunities."""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
This class sets the agent's persona and welcome/goodbye messages.

Step 4.3: Defining the Core Pipeline (CascadingPipeline)

The pipeline connects all the plugins: STT, LLM, TTS, VAD, and Turn Detector.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
  • DeepgramSTT: High-quality, cost-effective speech recognition.
  • OpenAILLM: Powerful GPT-4o language model for smart responses.
  • ElevenLabsTTS: Natural and expressive voice synthesis. For more advanced voice synthesis, check out the ElevenLabs TTS plugin for voice agents to explore additional configuration options and voices.
  • SileroVAD: Reliable voice activity detection.
  • TurnDetector: Ensures smooth conversational turns. To better understand how turn-taking is managed, see the turn detector for AI voice agents documentation.

Step 4.4: Managing the Session and Startup Logic

Sessions manage the lifecycle of your agent and handle connections.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
  • The make_context function creates a room with playground=True for easy browser testing. You can also experiment with your agent in the AI Agent Playground for interactive testing and rapid prototyping.
  • The main block starts the agent job.

5. Running and Testing the Agent

Step 5.1: Running the Python Script

  1. Make sure your .env file is set up with all required API keys.
  2. Run the script:
python main.py

  3. In the console output, look for a line that says Playground URL:. This link lets you join the agent session from your browser.

Step 5.2: Interacting with the Agent in the Playground

  • Open the Playground URL in your browser.
  • Join the meeting as a participant.
  • Speak or type your questions about voice acting.
  • The agent will respond in real time using natural-sounding speech.
To stop the agent, press Ctrl+C in your terminal. This triggers a graceful shutdown, ensuring all resources are released.
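If you want the wait loop itself to react to Ctrl+C rather than relying on a KeyboardInterrupt unwinding through asyncio.Event().wait(), one approach (a sketch; loop.add_signal_handler is unavailable on some Windows event loops) is to set an explicit stop event from a SIGINT handler:

```python
import asyncio
import signal

async def run_until(stop: asyncio.Event) -> None:
    """Block until `stop` is set -- by Ctrl+C (SIGINT) or programmatically."""
    loop = asyncio.get_running_loop()
    try:
        loop.add_signal_handler(signal.SIGINT, stop.set)
    except NotImplementedError:
        pass  # some event loops (e.g. on Windows) lack add_signal_handler
    await stop.wait()
```

Inside start_session, the line await asyncio.Event().wait() could then become stop = asyncio.Event() followed by await run_until(stop), and the existing finally block still performs the cleanup.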

6. Advanced Features and Customizations

Extending Functionality with Custom Tools

You can add custom tools or actions to your agent by subclassing and extending the Agent class. For example, you could add a portfolio review tool, integrate with audition databases, or trigger notifications.
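The exact tool-registration API is framework-specific (check the VideoSDK documentation for the current mechanism), but the underlying pattern is simply named functions in a dispatch table that the LLM layer can call. A minimal, framework-free sketch with a hypothetical warm-up tool:

```python
# Illustrative pattern only: VideoSDK's real tool API may differ.
TOOLS = {}

def tool(fn):
    """Decorator that registers `fn` so the agent can invoke it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def vocal_warmup(minutes: int) -> str:
    """Hypothetical tool: suggest a warm-up routine of the given length."""
    return f"{minutes}-minute routine: lip trills, sirens, then tongue twisters."

def call_tool(name: str, **kwargs):
    """What the LLM layer would do when the model requests a tool call."""
    return TOOLS[name](**kwargs)
```

The same pattern extends naturally to tools that query audition databases or send notifications, with the LLM deciding when each tool is relevant.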

Exploring Other Plugins

VideoSDK supports a variety of plugins for STT, TTS, and LLM. You can experiment with alternatives like Cartesia for STT, Deepgram for TTS, or Google Gemini for LLM to optimize for cost, quality, or language support.

7. Troubleshooting Common Issues

API Key and Authentication Errors

  • Double-check all API keys in your .env file.
  • Make sure your VideoSDK account is active.

Audio Input/Output Problems

  • Ensure your microphone is enabled in the browser.
  • Test with different browsers if you encounter issues.

Dependency and Version Conflicts

  • Use a clean virtual environment.
  • Run pip list to check for any conflicting package versions.

8. Conclusion

In this tutorial, you built a professional AI Voice Acting Agent using the VideoSDK AI Agents framework. You learned how to set up the environment, implement a custom agent, and test it in the browser.
Explore more advanced features, try different plugins, and consider deploying your agent to help aspiring voice actors worldwide. The possibilities for customization and integration are vast—happy building!
