Why are AI voice assistants important in the automotive industry?

They enhance the driving experience by allowing hands-free control over navigation, entertainment, and vehicle diagnostics, improving safety and convenience.

What are the core components of a voice agent?

The core components include Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS), which together enable the agent to process and respond to voice commands.

How do I generate a VideoSDK meeting ID?

You can generate a meeting ID using the VideoSDK API with a POST request to the rooms endpoint, including your API key in the headers.

What should I do if I encounter audio input/output problems?

Check your system's microphone and speaker settings to ensure the correct devices are selected and functioning properly.

Build an AI Voice Assistant for Automotive

Step-by-step guide to building an AI voice assistant for the automotive industry using VideoSDK.

Introduction to AI Voice Agents in the Automotive Industry

AI voice assistants are becoming an integral part of the driving experience. They give drivers hands-free control over vehicle functions such as navigation, entertainment, and diagnostics, which improves both safety and convenience.

What is an AI Voice Agent?

An AI Voice Agent is a software application that uses artificial intelligence to understand and respond to human speech. These agents can perform various tasks such as answering questions, controlling smart devices, and providing real-time information. In the automotive context, they help drivers interact with their vehicles more intuitively.

Why are they important for the Automotive Industry?

AI voice assistants in vehicles offer numerous benefits. They enhance the driving experience by allowing users to control navigation, manage entertainment systems, and access vehicle diagnostics without taking their eyes off the road. This hands-free interaction is crucial for safety and convenience.

Core Components of a Voice Agent

To build an effective AI voice assistant, several core components are necessary:

Speech-to-Text (STT): Converts spoken language into text.
Large Language Model (LLM): Processes the text and generates appropriate responses.
Text-to-Speech (TTS): Converts the generated text back into speech for the user.

What You'll Build in This Tutorial

In this tutorial, we'll guide you through the process of building a fully functional AI voice assistant tailored for the automotive industry using the VideoSDK framework. By the end, you'll have a working agent capable of assisting users with automotive-related inquiries.

Architecture and Core Concepts

High-Level Architecture Overview

The AI voice assistant operates through a series of interconnected components that process user input and generate responses. The user's speech is first captured and converted into text using STT. This text is then processed by an LLM, which formulates a response. Finally, TTS converts the response text back into speech.
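
The flow described above can be sketched in plain Python. This is only a toy illustration of the STT, LLM, TTS cascade, not the VideoSDK API: the class `CascadeSketch` and the three stub functions are hypothetical names, and real stages would call speech and language services rather than encode and decode strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeSketch:
    """Toy stand-in for an STT -> LLM -> TTS cascade (hypothetical, not the VideoSDK API)."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def handle_turn(self, audio_in: bytes) -> bytes:
        text = self.stt(audio_in)   # speech -> text
        reply = self.llm(text)      # text -> response text
        return self.tts(reply)      # response text -> speech


# Hypothetical stubs that treat "audio" as UTF-8 bytes so the sketch runs offline.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


cascade = CascadeSketch(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Because each stage is just a callable, any one of them can be swapped without touching the other two, which is the main appeal of a cascaded design.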

Understanding Key Concepts in the VideoSDK Framework

Agent: The core class representing your bot. It handles interaction logic and manages the conversation flow.
Cascading Pipeline: This defines the flow of audio processing through various stages, including STT, LLM, and TTS.
VAD & TurnDetector: These components help the agent determine when to listen and when to respond, ensuring smooth interactions.
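
To make the idea of "when to listen and when to respond" concrete, here is a deliberately simplified sketch. It uses a plain energy threshold and a silence counter, which is a swapped-in stand-in for illustration only: the actual VAD and turn-detector plugins are model-based, and their thresholds do not have the same meaning as the RMS threshold assumed here. All function names are hypothetical.

```python
import math
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples assumed in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame: Sequence[float], threshold: float = 0.35) -> bool:
    """Toy voice-activity check: a frame counts as speech if its energy reaches the threshold."""
    return rms(frame) >= threshold


def end_of_turn(frames: Sequence[Sequence[float]],
                threshold: float = 0.35,
                silence_frames: int = 3) -> bool:
    """Toy turn detection: speech was heard, and the last `silence_frames` frames are all silent."""
    flags = [is_speech(f, threshold) for f in frames]
    if not any(flags):
        return False  # nobody has spoken yet
    tail = flags[-silence_frames:]
    return len(tail) == silence_frames and not any(tail)
```

Raising `silence_frames` makes this toy detector wait longer before deciding the user has finished, trading responsiveness for fewer interruptions.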

Setting Up the Development Environment

Prerequisites

Before we begin, ensure you have the following:

Python 3.11+ installed on your system.
A VideoSDK account, which you can create at app.videosdk.live.

Step 1: Create a Virtual Environment

To keep dependencies organized, create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary Python packages using pip:

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"

Step 3: Configure API Keys in a `.env` File

Create a .env file in your project directory and add your VideoSDK API key:

VIDEOSDK_API_KEY=your_api_key_here
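
Before starting the agent, it can help to confirm that the file actually contains the key. The sketch below is a minimal standard-library check, assuming the simple `KEY=value` format shown above; `parse_env` and `missing_keys` are hypothetical helper names. The speech and language providers used by the agent will most likely need their own credentials as well, but their variable names are not given in this tutorial, so only `VIDEOSDK_API_KEY` is checked.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict, required=("VIDEOSDK_API_KEY",)) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def check_env_file(path: str = ".env") -> list:
    """Read the file at `path` (if it exists) and report missing keys."""
    env_path = Path(path)
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    return missing_keys(parse_env(text))
```

Calling `check_env_file()` from the project directory returns an empty list when the key is present, and `["VIDEOSDK_API_KEY"]` otherwise.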

Building the AI Voice Agent: A Step-by-Step Guide

Here is the complete code for our AI voice assistant. We'll break it down step-by-step in the following sections.

import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-downloading the Turn Detector model
pre_download_model()

agent_instructions = "You are an AI Voice Assistant specialized in the automotive industry. Your primary role is to assist users with automotive-related inquiries and tasks. You can provide information about vehicle specifications, maintenance tips, and troubleshooting common car issues. Additionally, you can help schedule service appointments and offer guidance on purchasing new vehicles. However, you are not a certified mechanic or automotive expert, so you must always recommend consulting a professional for detailed diagnostics or repairs. Your responses should be concise, informative, and user-friendly, ensuring a seamless interaction experience."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Greet the user when the agent joins the session
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Say goodbye when the agent leaves the session
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To interact with the agent, you need a meeting ID. You can generate one using the VideoSDK API:

curl -X POST \
  https://api.videosdk.live/v1/rooms \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"My Meeting"}'
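
If you would rather make this call from Python, the same request can be sketched with the standard library. The URL, headers, and body below simply mirror the curl command above; `build_room_request` and `create_room` are hypothetical helper names, and the shape of the JSON response is not specified in this tutorial, so the sketch returns it unparsed beyond `json.load`.

```python
import json
import urllib.request

# Endpoint exactly as given in the curl command above.
ROOMS_URL = "https://api.videosdk.live/v1/rooms"


def build_room_request(api_key: str, name: str = "My Meeting") -> urllib.request.Request:
    """Build (but do not send) the POST request for creating a room."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        ROOMS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def create_room(api_key: str) -> dict:
    """Send the request and return the decoded JSON response as-is."""
    with urllib.request.urlopen(build_room_request(api_key), timeout=10) as resp:
        return json.load(resp)
```

Splitting the request construction from the network call makes it easy to inspect the headers and payload without contacting the API.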

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where we define the behavior of our voice assistant. It inherits from the Agent class and uses the provided agent_instructions to guide its interactions.

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self):
        await self.session.say("Hello! How can I help?")
    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial as it defines the flow of data through various processing stages:

pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Each component in the pipeline has a specific role:

DeepgramSTT: Converts speech to text.
OpenAILLM: Processes text and generates responses.
ElevenLabsTTS: Converts text responses back to speech.
SileroVAD: Detects voice activity to manage when the agent should listen.
TurnDetector: Helps manage conversation flow by detecting when the user has finished speaking.

Step 4.4: Managing the Session and Startup Logic

The start_session function manages the lifecycle of the agent session, ensuring it starts and stops gracefully:

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(...)  # same arguments as in Step 4.3
    session = AgentSession(agent=agent, pipeline=pipeline, conversation_flow=conversation_flow)

    try:
        await context.connect()
        await session.start()
        # Block here until the process is interrupted
        await asyncio.Event().wait()
    finally:
        # Always release the session and the room connection
        await session.close()
        await context.shutdown()
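
The try/finally pattern above is what makes the shutdown graceful: when the waiting task is cancelled, the cleanup still runs. The following standalone sketch demonstrates that behavior with plain asyncio. `FakeSession`, `run_until_stopped`, and `demo` are hypothetical stand-ins for illustration and do not use any VideoSDK classes.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in that records lifecycle calls."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")

    async def close(self):
        self.events.append("close")


async def run_until_stopped(session: FakeSession, stop: asyncio.Event) -> None:
    try:
        await session.start()
        await stop.wait()        # blocks until set, or until the task is cancelled
    finally:
        await session.close()    # runs on normal exit and on cancellation


async def demo() -> list:
    session = FakeSession()
    stop = asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)       # let the task start and block on stop.wait()
    task.cancel()                # simulates the process being interrupted
    try:
        await task
    except asyncio.CancelledError:
        pass
    return session.events
```

Running `asyncio.run(demo())` shows that `close` is recorded even though the task never left `stop.wait()` on its own.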

The make_context function sets up the room options for the session:

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

Finally, the script's entry point ensures the agent starts correctly:

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

With everything set up, run your Python script to start the agent:

python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you'll see a playground link in the console. Open this link in your browser to join the session and interact with your AI voice assistant. You can test various automotive-related queries and observe how the agent responds.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows you to extend your agent's capabilities by integrating custom tools. These tools can perform specific tasks or enhance existing functionalities.
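
As an illustration of what such a tool might compute, here is a plain Python function for an automotive assistant. It is only the business logic: how a function is registered with the agent is framework-specific and not covered in this tutorial, so no registration call is shown. The function name and the interval values are hypothetical placeholders, not manufacturer data.

```python
# Illustrative service intervals in kilometers (placeholder values, not manufacturer data).
SERVICE_INTERVALS_KM = {
    "oil_change": 10_000,
    "tire_rotation": 8_000,
}


def km_until_service(item: str, km_since_last: int) -> int:
    """Return the kilometers remaining before `item` is due (0 if already overdue)."""
    if item not in SERVICE_INTERVALS_KM:
        raise ValueError(f"Unknown service item: {item}")
    if km_since_last < 0:
        raise ValueError("km_since_last must be non-negative")
    return max(0, SERVICE_INTERVALS_KM[item] - km_since_last)
```

Keeping the logic in an ordinary function like this makes it testable on its own before it is exposed to the language model.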

Exploring Other Plugins

While this tutorial uses specific plugins for STT, LLM, and TTS, the VideoSDK framework supports various other options. Explore different plugins to find the best fit for your application's needs.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API key is correctly set in the .env file and that your account has the necessary permissions.

Audio Input/Output Problems

Check your system's microphone and speaker settings. Ensure the correct devices are selected and functioning properly.

Dependency and Version Conflicts

If you encounter issues with package versions, consider using a virtual environment to isolate dependencies and prevent conflicts.
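
A quick way to narrow down version problems is to print what the active environment actually contains. The sketch below uses only the standard library; `python_ok` and `installed_version` are hypothetical helper names, and the 3.11 minimum comes from the prerequisites listed earlier in this guide.

```python
import sys
from importlib import metadata
from typing import Optional, Tuple


def python_ok(minimum: Tuple[int, int] = (3, 11),
              version: Optional[Tuple[int, int]] = None) -> bool:
    """Check the interpreter version (or an explicit one) against the minimum."""
    current = version if version is not None else tuple(sys.version_info[:2])
    return tuple(current) >= tuple(minimum)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python 3.11+:", python_ok())
    print("pip version:", installed_version("pip"))
```

Run this inside the activated virtual environment so that it reports on the packages the agent will actually import.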

Conclusion

Summary of What You've Built

Congratulations! You've built a fully functional AI voice assistant tailored for the automotive industry. This assistant can handle various automotive-related queries and provide valuable information to users.

Next Steps and Further Learning

To further enhance your AI voice assistant, consider exploring additional plugins, customizing the agent's behavior, and integrating more advanced features. The VideoSDK framework offers extensive documentation and resources to support your continued development journey.
