How do I get my VideoSDK API key?

Sign up for a free VideoSDK account and find your API key in the dashboard under the API Keys section.

Can I use a different speech-to-text or text-to-speech provider?

Yes, VideoSDK supports multiple STT (Cartesia, Deepgram, Rime) and TTS (ElevenLabs, Deepgram) plugins. Simply swap the plugin in the pipeline.

What happens if my agent stops responding?

Check your API keys and network connection, and make sure all services are running. Restart the script after resolving any issues.

How do I test the agent with real users?

Share the playground link generated in the console, or deploy the agent in a production room for live interactions.

Can I deploy this agent to production?

Yes, after testing, you can remove playground mode and use a persistent meeting room for production deployment.

Build an AI Voice Agent for Travel with Python & VideoSDK

Comprehensive tutorial to build and test a travel-focused AI voice agent using Python and VideoSDK. Includes setup, code, and testing.

1. Introduction to AI Voice Agents in Travel

What is an AI Voice Agent?

An AI voice agent is a software system that understands spoken language, processes it using artificial intelligence, and responds in natural, human-like speech. These agents combine speech-to-text (STT), large language models (LLM), and text-to-speech (TTS) to enable real-time, conversational interactions.

Why Are They Important for the Travel Industry?

Travelers often need instant, reliable information about destinations, bookings, and logistics. AI voice agents can provide 24/7 assistance, answer questions about flights, hotels, and local attractions, and even help plan itineraries. This enhances user experience, reduces support costs, and enables personalized service at scale.

Core Components of a Voice Agent

STT (Speech-to-Text): Converts user speech into text.
LLM (Large Language Model): Processes text and generates intelligent responses.
TTS (Text-to-Speech): Converts responses back to natural-sounding speech.
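
To make the hand-off between these three stages concrete, here is a minimal sketch of one conversational turn. The functions fake_stt, fake_llm, and fake_tts are hypothetical stand-ins written for this illustration; they are not part of VideoSDK or any provider SDK, and real services would stream audio rather than pass whole byte strings.

```python
# Hypothetical stand-ins for real STT, LLM, and TTS services.
# None of these names come from VideoSDK; they only show the data flow.

def fake_stt(audio: bytes) -> str:
    """Pretend to transcribe audio; here the 'audio' is just UTF-8 text."""
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    """Pretend to generate a reply using a canned rule."""
    if "paris" in prompt.lower():
        return "Here are a few tips for Paris."
    return "Could you tell me more about your trip?"


def fake_tts(text: str) -> bytes:
    """Pretend to synthesize speech; returns the reply text as bytes."""
    return text.encode("utf-8")


def run_turn(audio: bytes, stt=fake_stt, llm=fake_llm, tts=fake_tts) -> bytes:
    """One turn of a cascading agent: speech in, text, reply, speech out."""
    transcript = stt(audio)   # STT: speech -> text
    reply = llm(transcript)   # LLM: text -> response text
    return tts(reply)         # TTS: response text -> speech


if __name__ == "__main__":
    print(run_turn(b"What should I see in Paris?").decode("utf-8"))
```

Because each stage is passed in as a plain function, swapping one provider for another only changes the argument handed to run_turn, which is the same idea the pipeline in this tutorial relies on.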

What You'll Build in This Tutorial

You will create a fully functional travel AI voice agent using Python and the VideoSDK AI Agents framework. The agent will answer travel-related questions, suggest destinations, and provide helpful tips—all via voice in a browser playground.

2. Architecture and Core Concepts

High-Level Architecture Overview

The AI voice agent processes spoken input from the user, transcribes it, generates a response, and speaks it back. Here's a visual overview:

sequenceDiagram
    participant User
    participant Agent
    participant STT
    participant LLM
    participant TTS
    participant VAD
    participant TurnDetector

    User->>Agent: Speaks question
    Agent->>VAD: Detects speech activity
    Agent->>TurnDetector: Detects end of turn
    Agent->>STT: Transcribes speech
    STT-->>Agent: Returns text
    Agent->>LLM: Generates response
    LLM-->>Agent: Returns answer
    Agent->>TTS: Synthesizes speech
    TTS-->>Agent: Returns audio
    Agent-->>User: Plays audio response

Understanding Key Concepts in the VideoSDK Framework

Agent: The main class that defines your AI's persona and logic.
CascadingPipeline: Connects STT, LLM, TTS, VAD, and turn detection plugins. For a deeper understanding of how this works, see the Cascading pipeline in AI voice Agents.
VAD (Voice Activity Detection): Detects when a user starts/stops speaking.
TurnDetector: Helps determine when the user's turn is over, improving natural conversation flow. Learn more about its role in the Turn detector for AI voice Agents.
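
The roles of VAD and turn detection can be illustrated with a deliberately simple sketch. It uses an energy threshold and a run of silent frames. This is not how Silero VAD or the VideoSDK TurnDetector work internally (both are model-based); the function names, the frame format, and the silence_frames parameter are assumptions made for this example only.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in -1.0..1.0)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def detect_speech(frames: Sequence[Sequence[float]], threshold: float = 0.35) -> List[bool]:
    """Toy VAD: flag each frame whose energy reaches the threshold."""
    return [frame_rms(frame) >= threshold for frame in frames]


def end_of_turn(flags: Sequence[bool], silence_frames: int = 3) -> bool:
    """Toy turn detector: speech was heard, then `silence_frames` quiet frames in a row."""
    if not any(flags):
        return False  # nobody has spoken yet
    tail = list(flags[-silence_frames:])
    return len(tail) == silence_frames and not any(tail)


if __name__ == "__main__":
    quiet = [0.0, 0.0, 0.0, 0.0]
    loud = [0.5, -0.5, 0.5, -0.5]
    flags = detect_speech([quiet, loud, quiet, quiet, quiet])
    print(flags, end_of_turn(flags))
```

Raising the threshold makes the toy detector ignore quieter sounds, and raising silence_frames makes it wait longer before deciding the speaker has finished. The real plugins expose their own threshold parameters with their own semantics.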

If you're new to building voice agents, the Voice Agent Quick Start Guide provides a step-by-step introduction to get you started quickly.

For a comprehensive look at the essential building blocks, check out the AI voice Agent core components overview.

3. Setting Up the Development Environment

Prerequisites

Python 3.11+
A free VideoSDK account (for API keys and dashboard access)

Step 1: Create a Virtual Environment

Open your terminal and run:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 2: Install Required Packages

Install the VideoSDK AI Agents SDK and plugins:

pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-elevenlabs videosdk-plugins-openai

For speech-to-text, you'll be using the Deepgram STT Plugin for voice agent.
For text-to-speech, leverage the ElevenLabs TTS Plugin for voice agent.
For language understanding, integrate the OpenAI LLM Plugin for voice agent.
For voice activity detection, utilize the Silero Voice Activity Detection plugin.

Step 3: Configure API Keys in a .env File

Create a .env file in your project directory with the following (replace with your actual keys):

VIDEOSDK_API_KEY=your_videosdk_key
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=your_openai_key
ELEVENLABS_API_KEY=your_elevenlabs_key
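
The tutorial does not state whether the SDK reads the .env file on its own, so it can help to load and verify the keys explicitly before starting the agent. The sketch below is a small standard-library loader. The helper names parse_env, missing_keys, and load_env_file are made up for this example, and the third-party python-dotenv package is the more common choice for this job.

```python
import os
from typing import Dict, List, Mapping, Sequence

# Key names taken from the .env example above.
REQUIRED_KEYS = [
    "VIDEOSDK_API_KEY",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines; skip blanks, comments, and malformed lines."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Mapping[str, str], required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def load_env_file(path: str = ".env") -> List[str]:
    """Copy values from the file into os.environ (existing variables win)
    and return any required keys that are still missing."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return missing_keys(os.environ)
```

Calling load_env_file() at the top of the script and stopping early when it returns a non-empty list turns a confusing runtime failure into a clear message about which key is missing.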

4. Building the AI Voice Agent: A Step-by-Step Guide

Let's build the travel assistant agent! Here is the complete, runnable code. We'll break it down step by step in the sections below.

import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-downloading the Turn Detector model
pre_download_model()

# Triple quotes are required because the instructions span multiple lines
agent_instructions = """You are a friendly and knowledgeable AI Voice Agent specializing in travel assistance. Your persona is that of a helpful virtual travel concierge, always polite, patient, and eager to assist travelers with their needs.

Capabilities:
- Answer questions about travel destinations, including popular attractions, local customs, weather, and best times to visit.
- Provide up-to-date information on flights, hotels, and transportation options.
- Assist users in planning itineraries, suggesting activities, and recommending restaurants or accommodations based on user preferences.
- Offer guidance on travel documentation, visa requirements, and safety tips.
- Help with booking reminders, packing checklists, and travel alerts.

Constraints and Limitations:
- You do not have access to real-time booking systems or personal user data; you can only provide general information and suggestions.
- You are not a licensed travel agent and cannot make bookings or reservations on behalf of users.
- Always advise users to verify critical travel information (such as visa requirements or travel advisories) with official sources before making decisions.
- Do not provide medical, legal, or financial advice; for such matters, instruct users to consult qualified professionals.
- Maintain user privacy and do not request or store any sensitive personal information."""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Greet the user when the agent joins the session
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Say goodbye when the agent leaves the session
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

If you want a guided walkthrough of these steps, refer to the Voice Agent Quick Start Guide.

Step 4.1: Generating a VideoSDK Meeting ID

Before launching the agent, you can generate a meeting ID using the VideoSDK API. In most cases, the agent will auto-create a room, but here's how to do it manually:

curl -X POST "https://api.videosdk.live/v2/rooms" -H "Authorization: your_videosdk_key"

Copy the returned meeting ID and use it in your code if you want to join a specific room.
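
If you would rather create the room from Python than from curl, the same request can be sketched with the standard library. The URL and the Authorization header mirror the curl command above. The helper names build_room_request and create_room are invented for this example, and the shape of the JSON response is not shown in this tutorial, so the sketch simply returns the parsed body for you to inspect.

```python
import json
import urllib.request

# Endpoint taken from the curl command in this step.
ROOMS_URL = "https://api.videosdk.live/v2/rooms"


def build_room_request(token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that the curl command sends."""
    return urllib.request.Request(
        ROOMS_URL,
        method="POST",
        headers={"Authorization": token},
    )


def create_room(token: str, timeout: float = 10.0) -> dict:
    """Send the request and return the parsed JSON body.

    The response fields are not documented here, so inspect the
    returned dict to find the meeting ID.
    """
    with urllib.request.urlopen(build_room_request(token), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the network call in one place, so the request can be checked or logged without contacting the API.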

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class defines your travel assistant's persona and behaviors.

agent_instructions = "You are a friendly and knowledgeable AI Voice Agent specializing in travel assistance. ..."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self):
        await self.session.say("Hello! How can I help?")
    async def on_exit(self):
        await self.session.say("Goodbye!")

Persona: The instructions define the agent's travel expertise and helpful, polite tone.
on_enter/on_exit: These methods greet users when the agent joins the session and say goodbye when it leaves.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline connects all the plugins: STT, LLM, TTS, VAD, and TurnDetector.

pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

DeepgramSTT: High-accuracy speech-to-text. For more details, see the Deepgram STT Plugin for voice agent.
OpenAILLM: GPT-4o for intelligent, context-aware responses. Learn more in the OpenAI LLM Plugin for voice agent.
ElevenLabsTTS: Natural, expressive voice output. See the ElevenLabs TTS Plugin for voice agent.
SileroVAD & TurnDetector: For smooth, interruption-free conversations. Explore the Silero Voice Activity Detection and Turn detector for AI voice Agents plugins for more information.

Step 4.4: Managing the Session and Startup Logic

The session manages the conversation and connects everything.

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

JobContext: Handles room creation and playground mode.
WorkerJob: Launches the agent and manages its lifecycle.
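
The try/finally around asyncio.Event().wait() is what keeps the agent alive and still guarantees cleanup. The standalone sketch below reproduces only that pattern with the standard library: an event that is never set blocks forever, and cancelling the task still runs the finally block. The names run_until_cancelled and demo are made up for this illustration, and how WorkerJob actually delivers the shutdown signal is not covered here.

```python
import asyncio
from typing import List


async def run_until_cancelled(log: List[str]) -> None:
    """Wait forever, but always record cleanup when the task is stopped."""
    log.append("started")
    try:
        # This event is never set, so the coroutine waits until cancelled.
        await asyncio.Event().wait()
    finally:
        # Stands in for session.close() and context.shutdown().
        log.append("cleaned up")


async def demo() -> List[str]:
    log: List[str] = []
    task = asyncio.create_task(run_until_cancelled(log))
    await asyncio.sleep(0)  # yield once so the task reaches its wait
    task.cancel()           # simulate a shutdown request
    try:
        await task
    except asyncio.CancelledError:
        pass                # expected: the task was cancelled on purpose
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```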

5. Running and Testing the Agent

Step 5.1: Running the Python Script

Ensure your .env file is set up with all API keys.
Run the agent script:

python main.py

In the console, you'll see a 'Playground' link.

Step 5.2: Interacting with the Agent in the Playground

Open the playground link in your browser.
Join as a participant.
Speak or type questions about travel.
The agent will respond in real time with voice and text.

To experiment and test your agent in real time, use the AI Agent playground for hands-on interaction.

Graceful Shutdown: Press Ctrl+C in your terminal to stop the agent and clean up resources.

6. Advanced Features and Customizations

Extending Functionality with Custom Tools

Integrate additional APIs (e.g., weather, flight status) for richer answers.
Add custom plugins for language translation or sentiment analysis.
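
As a starting point for a custom tool, the sketch below shows the kind of small async function an agent could call to answer a flight-status question. Everything in it is hypothetical: the function name, the return shape, and the sample data are invented for illustration, and a real tool would query a flight-status API instead. How a function is registered with the agent is framework-specific, so check the VideoSDK documentation for that step.

```python
import asyncio
from typing import Dict

# Made-up sample data standing in for a real flight-status API.
_SAMPLE_STATUS: Dict[str, str] = {
    "XY123": "on time",
    "XY456": "delayed",
}


async def get_flight_status(flight_number: str) -> Dict[str, str]:
    """Hypothetical tool: look up the status of a flight by its number."""
    key = flight_number.strip().upper()  # normalize spoken or typed input
    status = _SAMPLE_STATUS.get(key, "unknown")
    return {"flight": key, "status": status}


if __name__ == "__main__":
    print(asyncio.run(get_flight_status(" xy123 ")))
```

Returning a small dictionary rather than a sentence leaves the wording of the spoken answer to the LLM, and the explicit "unknown" status gives it something honest to say when the lookup fails.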

Exploring Other Plugins

Try alternative STT/TTS providers (Cartesia, Rime, Deepgram).
Experiment with different LLMs (Google Gemini, Anthropic Claude).
Adjust VAD and TurnDetector thresholds for different environments.
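
One way to tune thresholds per environment without editing the script is to read them from environment variables. The sketch below is a minimal helper for that. The variable names VAD_THRESHOLD and TURN_THRESHOLD are hypothetical, and the check that values lie between 0 and 1 is an assumption based on the 0.35 and 0.8 values used in this tutorial.

```python
import os
from typing import Mapping, Optional


def read_threshold(name: str, default: float, env: Optional[Mapping[str, str]] = None) -> float:
    """Read a threshold from the environment, falling back to a default.

    Assumes thresholds are fractions between 0 and 1.
    """
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = float(raw)  # raises ValueError if the text is not a number
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return value


if __name__ == "__main__":
    # Defaults match the values used in this tutorial's pipeline.
    print(read_threshold("VAD_THRESHOLD", 0.35))
    print(read_threshold("TURN_THRESHOLD", 0.8))
```

In the pipeline, the results would be passed in place of the literals, for example SileroVAD(threshold=read_threshold("VAD_THRESHOLD", 0.35)).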

7. Troubleshooting Common Issues

API Key and Authentication Errors

Double-check your .env file for typos.
Ensure all required API keys are active and not expired.

Audio Input/Output Problems

Check your browser's microphone permissions.
Test with a different browser or device if audio isn't working.

Dependency and Version Conflicts

Use a fresh virtual environment.
Run pip install --upgrade for all packages if you see import errors.
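
Before upgrading anything, it helps to see which of the tutorial's packages are actually installed in the active environment. The sketch below uses the standard library's importlib.metadata for that. The helper name installed_versions is invented for this example, and the turn detector package name is an assumption inferred from the videosdk.plugins.turn_detector import in the script.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence

# Package names from the pip install step in this tutorial.
PACKAGES = [
    "videosdk-agents",
    "videosdk-plugins-silero",
    "videosdk-plugins-deepgram",
    "videosdk-plugins-elevenlabs",
    "videosdk-plugins-openai",
    # Assumption: inferred from the turn_detector import in the script.
    "videosdk-plugins-turn-detector",
]


def installed_versions(packages: Sequence[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A package reported as missing usually means the script is running outside the virtual environment where the packages were installed.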

8. Conclusion

You've built a fully functional AI voice agent tailored for travel assistance! This agent can answer questions, suggest destinations, and provide travel tips—all via natural voice.

Next Steps:

Deploy your agent for real-world users.
Integrate more APIs for live data.
Explore advanced customization with the VideoSDK framework.

Happy travels—and happy coding!
