Introduction to AI Voice Agents in Conversational AI Platforms
In today's rapidly evolving technological landscape, AI voice agents have become pivotal in enhancing user interaction within conversational AI platforms. These agents are designed to understand and respond to human speech, making them indispensable in customer service, virtual assistants, and more.
What is an AI Voice Agent?
An AI Voice Agent is a software application that can interpret human speech and respond in a conversational manner. It leverages technologies like Speech-to-Text (STT), Text-to-Speech (TTS), and Natural Language Processing (NLP) to facilitate seamless communication between humans and machines.
Why are they important for the Conversational AI Platform Industry?
AI voice agents are crucial in the conversational AI platform industry as they enable automated customer interactions, provide 24/7 support, and improve user experience by offering quick and accurate responses. They are widely used in sectors like healthcare, finance, and retail to streamline operations and enhance customer satisfaction.
Core Components of a Voice Agent
A typical voice agent consists of several core components:
- Speech-to-Text (STT): Converts spoken language into text.
- Text-to-Speech (TTS): Converts text back into spoken language.
- Large Language Models (LLM): Processes and understands the text to generate appropriate responses.
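At a high level, the cascade of these three components can be sketched as plain functions handing output to one another (stub implementations for illustration only; a real system would call an STT, LLM, and TTS service at each stage):

```python
# Conceptual sketch of a cascading voice pipeline (stubs, not the VideoSDK API):
# each stage hands its output to the next, mirroring STT -> LLM -> TTS.

def speech_to_text(audio: bytes) -> str:
    # A real STT engine (e.g. Deepgram) would transcribe the audio here.
    return "what is a voice agent"

def generate_reply(transcript: str) -> str:
    # A real LLM would produce a contextual answer here.
    return f"You asked: '{transcript}'. A voice agent answers spoken questions."

def text_to_speech(reply: str) -> bytes:
    # A real TTS engine (e.g. ElevenLabs) would synthesize audio here.
    return reply.encode("utf-8")

def run_pipeline(audio: bytes) -> bytes:
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    return text_to_speech(reply)

audio_out = run_pipeline(b"\x00\x01")  # fake audio frame
print(audio_out.decode("utf-8"))
```

The framework's real pipeline follows the same shape, but each stage is streaming and asynchronous.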
For a comprehensive understanding, refer to the AI voice Agent core components overview.
What You'll Build in This Tutorial
In this tutorial, you will build a fully functional AI voice agent using the VideoSDK framework. This agent will be capable of understanding user queries about conversational AI platforms and providing insightful responses.
Architecture and Core Concepts
High-Level Architecture Overview
The architecture of an AI voice agent involves several stages: capturing user speech, processing it through various components, and generating a response. Here's a simplified flow:
```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant STT
    participant LLM
    participant TTS
    User->>Agent: Speak
    Agent->>STT: Convert Speech to Text
    STT->>LLM: Process Text
    LLM->>TTS: Generate Response
    TTS->>Agent: Convert Text to Speech
    Agent->>User: Respond
```
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, responsible for handling interactions.
- CascadingPipeline: Manages the flow of audio processing from STT to LLM and TTS. Learn more about the Cascading pipeline in AI voice Agents.
- VAD & TurnDetector: These components help the agent determine when to listen and when to respond, ensuring smooth interactions. For more details, see the Turn detector for AI voice Agents.
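To build intuition for what a VAD threshold controls, here is a toy, energy-based classifier (a deliberate simplification; Silero's VAD uses a neural model, not frame energy):

```python
# Illustrative VAD: mark an audio frame as speech when its mean squared
# amplitude exceeds a threshold. Silero's real VAD is a neural model;
# this only demonstrates the role of the threshold parameter.

def is_speech(frame: list, threshold: float = 0.35) -> bool:
    energy = sum(s * s for s in frame) / len(frame)  # mean squared amplitude
    return energy > threshold

silence = [0.01, -0.02, 0.01, 0.0]   # quiet frame, low energy
speech = [0.8, -0.9, 0.7, -0.85]     # loud frame, high energy

print(is_speech(silence))  # False
print(is_speech(speech))   # True
```

Lowering the threshold makes the agent more sensitive (it treats quieter audio as speech); raising it makes the agent wait for clearer input.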
Setting Up the Development Environment
Prerequisites
Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.
Step 1: Create a Virtual Environment
Create a virtual environment to manage dependencies:
```shell
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
Step 2: Install Required Packages
Install the necessary packages using pip. The code in this tutorial uses the VideoSDK agents framework plus its Silero, turn-detector, Deepgram, OpenAI, and ElevenLabs plugins; the extras syntax below follows the VideoSDK quick start, so verify the exact names against the current docs:
```shell
pip install "videosdk-agents[silero,turn_detector,deepgram,openai,elevenlabs]"
```
Step 3: Configure API Keys in a .env File
Create a .env file in your project's root directory and add your VideoSDK API key. The Deepgram, OpenAI, and ElevenLabs plugins used below also expect their own provider keys, conventionally read from the variables shown:
```shell
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
```
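The agent process needs these values in its environment at runtime. Libraries like python-dotenv handle this for you; purely as an illustration, a minimal loader for simple KEY=VALUE lines (no quoting or multi-line support) could look like:

```python
import os

def parse_env_line(line: str):
    """Return (key, value) for a simple KEY=VALUE line, or None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

def load_env_file(path: str = ".env") -> dict:
    """Load simple KEY=VALUE pairs from a .env file into os.environ."""
    loaded = {}
    with open(path) as f:
        for line in f:
            pair = parse_env_line(line)
            if pair:
                loaded[pair[0]] = pair[1]
    os.environ.update(loaded)
    return loaded
```

In practice, prefer python-dotenv, which covers quoting, interpolation, and other edge cases this sketch ignores.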
Building the AI Voice Agent: A Step-by-Step Guide
Below is the complete code for the AI Voice Agent. We will break it down to understand each part:
```python
import asyncio
from dotenv import load_dotenv  # pip install python-dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load API keys from the .env file created in Step 3
load_dotenv()

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = """{
  "persona": "Conversational AI Platform Specialist",
  "capabilities": [
    "Provide information about various conversational AI platforms",
    "Compare features and pricing of different platforms",
    "Guide users on how to integrate conversational AI into their existing systems",
    "Answer technical questions related to conversational AI deployment"
  ],
  "constraints": [
    "You are not a certified technical consultant and should advise users to consult with a professional for complex integrations",
    "Avoid making definitive statements about the superiority of one platform over another",
    "Ensure that all information provided is up-to-date and sourced from reliable references"
  ]
}"""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the cascading pipeline: STT -> LLM -> TTS, gated by VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Step 4.1: Generating a VideoSDK Meeting ID
To create a meeting ID, use the following curl command:
```shell
curl -X POST \
  https://api.videosdk.live/v1/meetings \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"region":"sg"}'
```
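The same request can be issued from Python with only the standard library. This sketch mirrors the curl call above (same endpoint, headers, and body); it builds the request without sending it, and the commented lines show how you would actually send it with a valid token:

```python
import json
import urllib.request

def build_meeting_request(api_token: str, region: str = "sg") -> urllib.request.Request:
    """Build (but don't send) the POST request that creates a VideoSDK meeting."""
    return urllib.request.Request(
        url="https://api.videosdk.live/v1/meetings",
        data=json.dumps({"region": region}).encode("utf-8"),
        headers={
            "Authorization": api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually create the meeting (requires a valid token and network access):
# with urllib.request.urlopen(build_meeting_request("YOUR_API_KEY")) as resp:
#     print(json.load(resp))
```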
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class, defining the agent's behavior on entering and exiting a session. It uses the agent_instructions to guide its interactions.
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is central to processing audio input and generating responses. It integrates several plugins:
- DeepgramSTT: Converts speech to text. Explore the Deepgram STT Plugin for voice agent.
- OpenAILLM: Processes text and generates responses using GPT-4o. Check out the OpenAI LLM Plugin for voice agent.
- ElevenLabsTTS: Converts text responses back to speech. Learn more about the ElevenLabs TTS Plugin for voice agent.
- SileroVAD & TurnDetector: Manage when the agent listens and responds.
Step 4.4: Managing the Session and Startup Logic
The start_session function initializes the agent, pipeline, and session, keeping the session alive until it is manually terminated. The make_context function sets up the room options, enabling a playground environment for testing. For hands-on experience, visit the AI Agent playground.
Running and Testing the Agent
Step 5.1: Running the Python Script
Execute the script with:
```shell
python main.py
```
Step 5.2: Interacting with the Agent in the Playground
After running the script, find the playground link in the console and join the session to interact with your AI voice agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
Enhance your agent by integrating custom tools for specific tasks, expanding its capabilities beyond the default plugins.
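As a sketch of the idea in plain Python (this is a hypothetical registry, not the framework's actual tool-registration API; consult the VideoSDK agents docs for the real mechanism), a tool is simply a named function the agent can be told to invoke:

```python
# Illustrative tool registry: maps a tool name to a callable the agent could
# invoke when the LLM decides a tool is needed. Hypothetical helper, not the
# VideoSDK API.

TOOLS = {}

def tool(name: str):
    """Register a function as an agent tool under the given name."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator

@tool("compare_platforms")
def compare_platforms(a: str, b: str) -> str:
    # A real tool might query a pricing database or an external API here.
    return f"Both {a} and {b} support voice agents; compare pricing on their sites."

print(TOOLS["compare_platforms"]("PlatformX", "PlatformY"))
```

Frameworks typically expose the registered tool's name and signature to the LLM, which then requests a call with concrete arguments; the pattern above is the core of that flow.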
Exploring Other Plugins
Consider exploring other STT, LLM, and TTS options to optimize your agent's performance based on your specific needs.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the .env file to avoid authentication issues.
Audio Input/Output Problems
Check your microphone and speaker settings if you encounter audio issues during interactions.
Dependency and Version Conflicts
Use a virtual environment to manage dependencies and avoid version conflicts.
Conclusion
Summary of What You've Built
You have successfully built a conversational AI voice agent using the VideoSDK framework, capable of engaging users in meaningful dialogue about AI platforms.
Next Steps and Further Learning
Explore additional features and plugins to enhance your agent's capabilities, and consider deploying it in real-world applications for further learning and development. For a quick start, refer to the Voice Agent Quick Start Guide.