Build a Flutter AI Voice Agent API

Implement a Flutter AI Voice Agent API using VideoSDK. Follow our comprehensive guide with code examples and testing instructions.

Introduction to AI Voice Agents in Flutter

AI Voice Agents are transforming the way we interact with technology by enabling voice-based communication between users and applications. In the context of the Flutter AI Voice Agent API, these agents are particularly valuable for creating voice-enabled applications that can handle user queries, provide information, and perform tasks through natural language processing.

What is an AI Voice Agent?

An AI Voice Agent is a system that uses artificial intelligence to process and respond to human speech. It typically involves components like speech-to-text (STT), natural language processing (NLP), and text-to-speech (TTS) to convert spoken language into text, understand the intent, and generate a spoken response. For a detailed overview, refer to the AI voice Agent core components overview.

Why are they important for Flutter developers?

In the Flutter ecosystem, AI Voice Agents can enhance user experience by providing hands-free interaction, improving accessibility, and supporting multitasking. They are crucial in applications like virtual assistants, customer support bots, and smart home devices.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken words into text. Consider using the Deepgram STT Plugin for voice agent for efficient speech recognition.
  • Large Language Model (LLM): Understands and processes the text to determine the appropriate response.
  • Text-to-Speech (TTS): Converts the response text back into spoken words; the ElevenLabs TTS Plugin for voice agent is a great choice for this purpose.
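The three components above chain together into a single conversational turn. Here is a toy, framework-free sketch of that flow; the stage functions are hypothetical stand-ins for illustration, not VideoSDK APIs:

```python
def speech_to_text(audio: bytes) -> str:
    """Stub STT stage: a real plugin (e.g. Deepgram) would transcribe audio."""
    return "what is the weather"

def llm_respond(transcript: str) -> str:
    """Stub LLM stage: a real model would generate a contextual reply."""
    return f"You asked: {transcript!r}. Let me check that for you."

def text_to_speech(reply: str) -> bytes:
    """Stub TTS stage: a real plugin (e.g. ElevenLabs) would synthesize audio."""
    return reply.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    # One conversational turn: STT -> LLM -> TTS
    transcript = speech_to_text(audio)
    reply = llm_respond(transcript)
    return text_to_speech(reply)

print(handle_turn(b"\x00\x01").decode("utf-8"))
```

In the real framework this chaining is what the CascadingPipeline manages for you, including streaming and interruption handling.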

What You'll Build in This Tutorial

In this tutorial, you will build a Flutter AI Voice Agent using the VideoSDK framework. The agent will be capable of understanding and responding to user queries in real-time. To get started, you might want to check the Voice Agent Quick Start Guide.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI Voice Agent involves several stages: capturing user speech, converting it to text, processing the text to understand the user's intent, generating a response, and finally converting the response back to speech. The Cascading pipeline in AI voice Agents is essential for managing this flow efficiently.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: This is the core class representing your bot. It manages the interaction with users and processes their requests.
  • CascadingPipeline: This defines the flow of audio processing, moving through stages such as STT, LLM, and TTS.
  • VAD & TurnDetector: These components help the agent determine when to listen and when to speak, ensuring smooth interaction. The Turn detector for AI voice Agents is particularly useful for this.
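To make the VAD idea concrete, here is a toy energy-threshold classifier. A real VAD such as Silero uses a neural model rather than raw energy, but the per-frame speech/silence decision it produces has the same shape:

```python
def is_speech(frame_energy: float, threshold: float = 0.35) -> bool:
    """Toy VAD: treat a frame as speech when its energy clears the threshold."""
    return frame_energy >= threshold

# Simulated per-frame energy values for a short clip
frames = [0.02, 0.10, 0.61, 0.74, 0.55, 0.08]
flags = [is_speech(e) for e in frames]
print(flags)  # → [False, False, True, True, True, False]
```

The turn detector then looks at runs of speech/silence flags like these to decide when the user has finished talking and the agent may respond.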

Setting Up the Development Environment

Prerequisites

The agent itself runs as a Python service that your Flutter app joins through VideoSDK, so ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up for an account at app.videosdk.live.
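You can quickly verify that your interpreter meets the version requirement before setting anything else up:

```python
import sys

# The agents SDK requires Python 3.11 or newer
meets_requirement = sys.version_info >= (3, 11)
print(f"Python {sys.version_info.major}.{sys.version_info.minor} detected; "
      + ("OK" if meets_requirement else "please upgrade to 3.11+"))
```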

Step 1: Create a Virtual Environment

Create a virtual environment to manage your project dependencies:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the VideoSDK Agents SDK together with the plugins used in this guide, plus python-dotenv for loading API keys:
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
pip install python-dotenv

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory with your VideoSDK auth token and the API keys for each plugin provider used in the pipeline:
VIDEOSDK_AUTH_TOKEN=your_videosdk_token_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here

Building the AI Voice Agent: A Step-by-Step Guide

To build your AI Voice Agent, we'll start by presenting the complete, runnable code block, followed by a detailed breakdown.
import asyncio
from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load API keys from .env before any plugin reads them
load_dotenv()

# Pre-download the Turn Detector model so the first session starts without delay
pre_download_model()

agent_instructions = "You are a Flutter AI Voice Agent API specializing in providing technical support and guidance for developers using Flutter to build voice-enabled applications. Your persona is that of a knowledgeable and friendly tech assistant. Your capabilities include answering questions about Flutter integration, providing code snippets for common tasks, and guiding users through troubleshooting steps. You can also offer best practices for optimizing voice recognition and handling API requests efficiently. However, you are not a substitute for official documentation or professional developer support. Always encourage users to refer to the official Flutter documentation and community forums for comprehensive guidance. You must include a disclaimer that you are an AI and your responses are based on pre-existing data and algorithms."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the STT -> LLM -> TTS pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

With playground=True the agent auto-creates a room, so a meeting ID is only needed if you want to join the agent from your own client (for example a Flutter app). In that case, generate one with the VideoSDK REST API:
curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json"

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class, providing custom behavior for entering and exiting sessions. This class is where you define how your agent interacts with users.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial for processing audio input and generating responses. It integrates various plugins for STT, LLM, TTS, VAD, and Turn Detection. For more insights on managing sessions, see AI voice Agent Sessions.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function sets up the agent session and handles its lifecycle. The make_context function configures the room options for the session.
def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

Run your Python script using the command:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, find the playground link in the console output. Use this link to join the session and interact with your agent. The agent will respond to your voice inputs in real-time. For more advanced monitoring, explore AI voice Agent tracing and observability.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend your agent's functionality by integrating custom tools. This involves creating new plugins or modifying existing ones to meet specific needs.
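As a sketch of the general pattern (not the VideoSDK plugin API itself), a custom tool is essentially a named function the agent can dispatch to when the LLM requests it. The decorator and registry below are illustrative stand-ins:

```python
from typing import Callable, Dict

# Registry of functions the agent may invoke by name (illustrative)
TOOLS: Dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function so the agent can dispatch to it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> str:
    # A real tool would call an external weather API; this is a canned reply
    return f"It is sunny in {city}."

def dispatch(name: str, **kwargs) -> str:
    """Invoke a registered tool, as the agent would after an LLM tool call."""
    return TOOLS[name](**kwargs)

print(dispatch("get_weather", city="London"))  # → It is sunny in London.
```

Consult the VideoSDK agents documentation for the framework's actual tool-registration mechanism; the point here is only the register-then-dispatch shape.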

Exploring Other Plugins

The VideoSDK framework supports various STT, LLM, and TTS plugins. Explore these options to find the best fit for your application needs.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly set in the .env file, that your script actually loads them (for example with python-dotenv's load_dotenv()), and that your token has the necessary permissions.

Audio Input/Output Problems

Check your microphone and speaker settings to ensure they are configured correctly.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions as specified in the documentation.

Conclusion

Summary of What You've Built

You have successfully built an AI Voice Agent using the VideoSDK framework. It can process voice inputs and respond intelligently, and a Flutter client can join its room for a complete voice experience.

Next Steps and Further Learning

Explore more advanced features and plugins to enhance your agent's capabilities. Consider integrating with other APIs and services to expand its functionality.
