Build AI Voice Agent for Transportation

Step-by-step guide to building an AI Voice Agent for transportation using VideoSDK. Includes code and testing.

Introduction to AI Voice Agents in the Transportation Industry

AI Voice Agents are sophisticated software systems capable of understanding and responding to human speech. They leverage technologies like Speech-to-Text (STT), Text-to-Speech (TTS), and Language Models (LLM) to interact with users in a natural, conversational manner. In the transportation industry, these agents can streamline operations by providing real-time traffic updates, suggesting optimal routes, and offering public transportation schedules, enhancing both efficiency and user experience.
In this tutorial, we will build a comprehensive AI Voice Agent tailored for the transportation sector. This agent will assist users with transportation-related inquiries, ensuring a seamless interaction experience.

Architecture and Core Concepts

Our AI Voice Agent's architecture involves several key components that work together to process user input and generate responses. The high-level data flow: user audio is captured, transcribed by STT, interpreted by the LLM, and the reply is synthesized by TTS and played back to the user.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, responsible for managing interactions.
  • CascadingPipeline: Manages the flow of audio processing through STT, LLM, and TTS components.
  • VAD & TurnDetector: These components help the agent determine when to listen and when to respond, ensuring smooth conversations. For more details, explore the Turn detector for AI voice Agents guide.
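The cascading data flow these components implement can be sketched with plain stub functions. The stage functions and canned strings below are illustrative stand-ins, not the real plugins; they only show the data path one user turn takes through the pipeline:

```python
# Minimal sketch of a cascading STT -> LLM -> TTS turn.
# Each stage is a stub standing in for the corresponding plugin.

def stt(audio: bytes) -> str:
    """Stub speech-to-text: pretend the audio decodes to a fixed utterance."""
    return "what is traffic like on route 9"

def llm(prompt: str) -> str:
    """Stub language model: return a canned transportation answer."""
    return f"Here is the latest update for: {prompt}"

def tts(text: str) -> bytes:
    """Stub text-to-speech: encode the reply as audio bytes."""
    return text.encode("utf-8")

def cascade(audio: bytes) -> bytes:
    """Run one turn through the cascaded stages, as CascadingPipeline does."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)

print(cascade(b"\x00\x01"))
```

In the real framework the VAD and TurnDetector decide *when* `cascade` fires; the stages themselves stay in this order.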

Setting Up the Development Environment

Before diving into the code, ensure you have the necessary tools and accounts:

Prerequisites

  • Python 3.11+
  • VideoSDK Account (sign up at app.videosdk.live)

Step 1: Create a Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the VideoSDK Agents SDK together with the plugins this tutorial uses:
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
pip install python-dotenv

Step 3: Configure API Keys in a .env File

Create a .env file in your project root and add your VideoSDK API key, along with credentials for the Deepgram, OpenAI, and ElevenLabs plugins used in the pipeline:
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=your_openai_key
ELEVENLABS_API_KEY=your_elevenlabs_key
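For intuition, python-dotenv essentially reads KEY=VALUE lines from the .env file into the process environment. A rough stdlib sketch of that behavior (simplified: no quoting, interpolation, or export support):

```python
import os

def load_dotenv_minimal(text: str) -> dict:
    """Parse KEY=VALUE lines (ignoring blanks and # comments) and
    export them into os.environ -- a simplified sketch of what
    python-dotenv's load_dotenv() does with this tutorial's .env file."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    os.environ.update(values)
    return values

env = load_dotenv_minimal("VIDEOSDK_API_KEY=your_api_key_here")
print(env["VIDEOSDK_API_KEY"])
```

In the actual project, simply call `load_dotenv()` from python-dotenv at the top of your script.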

Building the AI Voice Agent: A Step-by-Step Guide

Here's the complete code for our AI Voice Agent:
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

pre_download_model()

agent_instructions = "You are a knowledgeable and efficient AI Voice Agent designed specifically for the transportation industry. Your primary role is to assist users with transportation-related inquiries and tasks. You can provide real-time traffic updates, suggest optimal routes, offer public transportation schedules, and answer questions about transportation regulations and policies. However, you must always remind users to verify critical information from official sources, as you are not a certified transportation expert. Additionally, you cannot provide real-time emergency assistance or handle any financial transactions. Always maintain a professional and courteous tone, ensuring user privacy and data security at all times."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To interact with your AI Voice Agent, you'll need a meeting ID. You can generate one using the VideoSDK API:
curl -X POST "https://api.videosdk.live/v1/meetings" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
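The same call can be made from Python with the standard library. The endpoint mirrors the curl example above, and the `meetingId` field name in the response is an assumption for illustration; check the VideoSDK REST API docs for the current response shape:

```python
import json
import urllib.request

API_URL = "https://api.videosdk.live/v1/meetings"  # endpoint from the curl example

def build_request(api_key: str) -> urllib.request.Request:
    """Construct the POST request for creating a meeting (not sent here)."""
    return urllib.request.Request(
        API_URL,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def extract_meeting_id(response_body: str) -> str:
    """Pull the meeting ID out of the JSON response.
    The 'meetingId' field name is an assumption -- verify against the docs."""
    return json.loads(response_body)["meetingId"]

req = build_request("your_api_key_here")
print(req.get_method(), req.full_url)
# To actually create the meeting: urllib.request.urlopen(req)
print(extract_meeting_id('{"meetingId": "abcd-efgh-ijkl"}'))
```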

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where we define the behavior of our agent. It extends the Agent class and customizes the interaction flow:
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")
This class uses predefined instructions to guide its interactions, ensuring the agent remains focused on transportation-related tasks.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial for processing user input and generating responses. It integrates various plugins:
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
Each component plays a specific role: converting speech to text, processing the text, and converting the response back to speech. For more information on these components, refer to the AI voice Agent core components overview.
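To build intuition for the `threshold=0.35` setting on SileroVAD above: a voice activity detector compares a per-frame speech probability against a threshold to decide when the user is speaking. A toy stdlib version of that gating logic (the frame probabilities are made-up values):

```python
def detect_speech(frame_probs, threshold=0.35):
    """Toy VAD: mark each audio frame as speech when its speech
    probability meets the threshold, mirroring how SileroVAD's
    threshold parameter gates the pipeline's listening state."""
    return [p >= threshold for p in frame_probs]

# Lower thresholds make the agent more sensitive (more frames count as speech).
probs = [0.1, 0.2, 0.4, 0.9, 0.3]
print(detect_speech(probs))                 # [False, False, True, True, False]
print(detect_speech(probs, threshold=0.8))  # [False, False, False, True, False]
```

The TurnDetector's `threshold=0.8` plays an analogous role one level up: it gates the decision that the user's *turn* has ended and the agent may respond.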

Step 4.4: Managing the Session and Startup Logic

The start_session function manages the lifecycle of the agent's session:
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(...)
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
This function connects the agent, starts the session, and ensures resources are properly cleaned up. For detailed insights into managing sessions, explore AI voice Agent Sessions.
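The ordering guarantee that try/finally gives this lifecycle can be exercised with stub classes (these stubs are illustrative, not the real SDK objects): even when the long-running wait is cancelled, close() and shutdown() still run, in that order.

```python
import asyncio

events = []

class StubContext:
    async def connect(self): events.append("connect")
    async def shutdown(self): events.append("shutdown")

class StubSession:
    async def start(self): events.append("start")
    async def close(self): events.append("close")

async def run_session(context, session):
    """Mirror start_session's lifecycle: connect, start, wait,
    and guarantee cleanup in the finally block even on cancellation."""
    try:
        await context.connect()
        await session.start()
        raise asyncio.CancelledError  # stand-in for the agent being stopped
    finally:
        await session.close()
        await context.shutdown()

try:
    asyncio.run(run_session(StubContext(), StubSession()))
except asyncio.CancelledError:
    pass

print(events)  # ['connect', 'start', 'close', 'shutdown']
```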

Running and Testing the Agent

Step 5.1: Running the Python Script

Execute the script to start your agent:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Upon running the script, you'll receive a playground link in the console. Use this link to interact with your agent and test its capabilities. Visit the AI Agent playground for more interactive testing options.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend your agent's capabilities by integrating custom tools and plugins, allowing for more specialized interactions. Consider using the ElevenLabs TTS Plugin for voice agent and the Deepgram STT Plugin for voice agent for enhanced audio processing.

Exploring Other Plugins

The VideoSDK framework supports various STT, LLM, and TTS plugins, enabling you to tailor the agent to specific needs. For instance, the OpenAI LLM Plugin for voice agent can be used to enhance language understanding, while Silero Voice Activity Detection ensures accurate voice activity recognition.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API key is correctly set in the .env file and has the necessary permissions.
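A cheap safeguard is to validate the required credentials at startup so the agent fails fast with a clear message instead of an opaque authentication error mid-session. A stdlib sketch (the plugin key names match the .env step earlier in this guide; adjust to whichever plugins you actually use):

```python
import os

REQUIRED_KEYS = ["VIDEOSDK_API_KEY", "DEEPGRAM_API_KEY",
                 "OPENAI_API_KEY", "ELEVENLABS_API_KEY"]

def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the names of required credentials that are unset or empty."""
    return [key for key in required if not env.get(key)]

# Demo with a partial environment: three credentials are missing.
problems = missing_keys({"VIDEOSDK_API_KEY": "abc"})
if problems:
    print(f"Missing credentials: {', '.join(problems)}")
```

Call `missing_keys()` right after `load_dotenv()` and raise if the list is non-empty.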

Audio Input/Output Problems

Check your microphone and speaker settings to ensure they are correctly configured.

Dependency and Version Conflicts

Verify that all required packages are installed and compatible with your Python version.
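You can surface version mismatches up front with a small diagnostic. The package names below are examples; list whichever distributions your requirements file pins:

```python
import sys
from importlib import metadata

def check_environment(min_python=(3, 11), packages=("videosdk",)):
    """Report the Python version requirement and installed package
    versions, returning a list of problems found (empty means OK)."""
    issues = []
    if sys.version_info[:2] < min_python:
        issues.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    for name in packages:
        try:
            print(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            issues.append(f"{name} is not installed")
    return issues

print(check_environment())
```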

Conclusion

In this guide, we've built a fully functional AI Voice Agent for the transportation industry using the VideoSDK framework. This agent can handle various transportation-related inquiries, providing users with timely and accurate information. As next steps, consider exploring additional plugins and customizations to further enhance your agent's capabilities.
