Build an AI Call Answering Service

Step-by-step guide to building an AI call answering service using VideoSDK, complete with code examples and testing instructions.

Introduction to AI Voice Agents in AI Call Answering Service

In today's fast-paced world, businesses are increasingly turning to AI voice agents to handle customer interactions efficiently. These agents are designed to manage incoming calls, provide information, and route calls to the appropriate departments, all while maintaining a professional tone. In this tutorial, we'll walk you through building an AI call answering service using the VideoSDK framework.

What is an AI Voice Agent?

An AI voice agent is a software application that can understand and respond to human speech. It leverages technologies like speech-to-text (STT), language models (LLM), and text-to-speech (TTS) to interact with users in a natural way. These agents are commonly used in customer service to automate call handling, answer frequently asked questions, and provide support.

Why are They Important for the AI Call Answering Service Industry?

AI voice agents are crucial in the call answering service industry because they offer 24/7 availability, reduce wait times, and improve customer satisfaction by providing quick and accurate responses. They also free up human agents to handle more complex queries, enhancing overall efficiency.

Core Components of a Voice Agent

Speech-to-Text (STT): Converts spoken language into text.
Language Model (LLM): Processes the text to understand and generate responses.
Text-to-Speech (TTS): Converts text responses back into spoken language.

What You'll Build in This Tutorial

In this guide, you'll learn to build a fully functional AI call answering service using VideoSDK. We’ll cover everything from setting up your development environment to deploying and testing your agent.

Architecture and Core Concepts

High-Level Architecture Overview

The AI call answering service follows a structured data flow: user speech is captured and converted to text using STT, processed by an LLM to generate a response, and then converted back to speech using TTS. This flow ensures seamless interaction between the user and the agent.
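The data flow described above can be sketched with three stand-in functions. This is only an illustration under stated assumptions: `fake_stt`, `fake_llm`, `fake_tts`, and `run_turn` are hypothetical placeholders, not VideoSDK APIs, and real plugins stream audio asynchronously instead of passing whole strings around.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One caller turn as it moves through the cascade."""
    audio_in: bytes
    transcript: str = ""
    reply_text: str = ""
    audio_out: bytes = b""


def fake_stt(audio: bytes) -> str:
    # Placeholder: a real STT engine would transcribe the audio.
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    # Placeholder: a real LLM would generate a contextual answer.
    return f"You said: {transcript}"


def fake_tts(text: str) -> bytes:
    # Placeholder: a real TTS engine would synthesize speech.
    return text.encode("utf-8")


def run_turn(audio: bytes) -> Turn:
    """Run one turn through STT -> LLM -> TTS, in that order."""
    turn = Turn(audio_in=audio)
    turn.transcript = fake_stt(turn.audio_in)
    turn.reply_text = fake_llm(turn.transcript)
    turn.audio_out = fake_tts(turn.reply_text)
    return turn
```

Each stage only sees the output of the stage before it, which is why any one of them can be swapped for a different provider without touching the others.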

Understanding Key Concepts in the VideoSDK Framework

Agent: The core class representing your bot. It handles interactions and manages the conversation flow.
CascadingPipeline: This defines the flow of audio processing, integrating STT, LLM, and TTS.
VAD & TurnDetector: These components help the agent determine when to listen and respond, ensuring smooth interactions.
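To make the role of a VAD threshold more concrete, here is a rough sketch of threshold-based segmentation. It assumes that something upstream has already produced one speech probability per audio frame. The function name, the `min_silence_frames` parameter, and the logic are illustrative only and do not describe how SileroVAD or TurnDetector work internally.

```python
from typing import Iterable, List, Tuple


def find_speech_segments(
    speech_probs: Iterable[float],
    threshold: float = 0.35,
    min_silence_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech segments; end is exclusive.

    A segment opens on the first frame at or above `threshold` and closes
    once `min_silence_frames` consecutive frames fall below it.
    """
    segments = []
    start = None
    silence_run = 0
    last_index = -1
    for i, p in enumerate(speech_probs):
        last_index = i
        if p >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The segment ends at the first frame of the silence run.
                segments.append((start, i - silence_run + 1))
                start = None
                silence_run = 0
    if start is not None:
        # The stream ended while a segment was still open.
        segments.append((start, last_index + 1 - silence_run))
    return segments
```

A lower threshold makes the detector more willing to treat quiet audio as speech, while a longer silence window makes it slower to decide that the caller has finished.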

Setting Up the Development Environment

Prerequisites

Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.
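If you are unsure which interpreter your shell picks up, a small check like the following can confirm the Python 3.11+ requirement. The helper name `python_is_supported` is made up for this sketch.

```python
import sys

# Minimum version taken from the prerequisites above.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print("Python 3.11 or newer is required; found", sys.version.split()[0])
```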

Step 1: Create a Virtual Environment

To keep your project dependencies organized, create a virtual environment:

python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Step 2: Install Required Packages

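Install the agents SDK together with the plugins used in this tutorial. The import paths used later (videosdk.agents and videosdk.plugins.*) come from the videosdk-agents package and its plugin extras:

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"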

Step 3: Configure API Keys in a `.env` File

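Create a .env file in your project's root directory to store your API keys securely. The Deepgram, OpenAI, and ElevenLabs plugins used later also need credentials from their own providers; check each plugin's documentation for the variable names it expects.

VIDEOSDK_API_KEY=your_api_key_here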
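The script in this tutorial does not show how the .env file gets loaded. One common option is the python-dotenv package. As a dependency-free alternative, here is a minimal sketch that uses only the standard library. `parse_env_text` and `load_env_file` are hypothetical helper names, and the parser handles only simple KEY=value lines.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict:
    """Read `path` if it exists and copy its values into os.environ.

    Variables already present in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Calling `load_env_file()` at the top of your script, before any plugin is constructed, makes the keys available through `os.environ`.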

Building the AI Voice Agent: A Step-by-Step Guide

We'll start by presenting the complete, runnable code for your AI voice agent. Then, we'll break it down into smaller parts to explain each component.

import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Pre-downloading the Turn Detector model
pre_download_model()

agent_instructions = "You are an AI Call Answering Service Agent designed to efficiently manage incoming calls for businesses. Your primary role is to assist callers by providing information, routing calls to the appropriate department, and taking messages when necessary. You should maintain a professional and courteous tone at all times.\n\nCapabilities:\n1. Greet callers and provide a brief introduction of the service.\n2. Answer frequently asked questions about the business, such as hours of operation, location, and services offered.\n3. Route calls to the appropriate department or individual based on the caller's needs.\n4. Take detailed messages when the requested party is unavailable and ensure they are delivered promptly.\n5. Provide basic troubleshooting assistance for common issues related to the business's products or services.\n\nConstraints and Limitations:\n1. You are not authorized to provide personal opinions or advice beyond the scope of the business's services.\n2. You must not handle sensitive information such as credit card details or personal identification numbers.\n3. Always include a disclaimer that complex issues should be addressed by speaking directly with a human representative.\n4. You are not a substitute for emergency services and should direct callers to contact emergency services if needed.\n5. Ensure compliance with privacy laws and regulations, and inform callers that their conversation may be recorded for quality assurance purposes."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To create a meeting ID, use the following curl command:

curl -X POST "https://api.videosdk.live/v1/meetings" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"region\":\"sg001\"}"
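If you would rather stay in Python, the same request can be sketched with the standard library. The URL, headers, and body mirror the curl command above; the function names are hypothetical, and the sketch makes no assumption about which field of the JSON response holds the meeting ID.

```python
import json
import urllib.request

# Endpoint copied from the curl command above.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"


def build_create_meeting_request(api_key: str, region: str = "sg001") -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a meeting."""
    body = json.dumps({"region": region}).encode("utf-8")
    return urllib.request.Request(
        MEETINGS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def create_meeting(api_key: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_create_meeting_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting the request construction from the network call keeps the first half easy to inspect and test without an API key.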

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class from the VideoSDK framework and passes the instruction prompt to it. Its on_enter and on_exit methods call self.session.say to greet the caller when a session starts and to say goodbye when it ends.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is the central component that processes audio data. It integrates:

DeepgramSTT: Converts speech to text using the "nova-2" model.
OpenAILLM: Processes text and generates responses through the OpenAI LLM plugin, using the "gpt-4o" model.
ElevenLabsTTS: Converts text responses back to speech through the ElevenLabs TTS plugin, using the "eleven_flash_v2_5" model.
SileroVAD: Detects voice activity with a threshold of 0.35, using Silero Voice Activity Detection.
TurnDetector: Manages conversation turns with a threshold of 0.8.

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the AI voice agent session and manages the conversation flow. It connects to the context and starts the session, keeping it running until manually terminated. The make_context function sets up the room options, enabling the playground mode for testing.
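The try/finally shape of start_session is what guarantees cleanup when the process is stopped. The sketch below reproduces that pattern with a stand-in class so it runs without any VideoSDK dependency. `StubSession`, `run_until_stopped`, and `demo` are hypothetical names, and an explicit event replaces the manual termination.

```python
import asyncio


class StubSession:
    """Hypothetical stand-in that records lifecycle calls."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")

    async def close(self):
        self.events.append("close")


async def run_until_stopped(session: StubSession, stop: asyncio.Event) -> None:
    """Start the session, wait for `stop`, and always close on the way out."""
    try:
        await session.start()
        await stop.wait()
    finally:
        await session.close()


async def demo() -> list:
    session = StubSession()
    stop = asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)  # give the task a chance to start
    stop.set()              # stands in for a manual shutdown
    await task
    return session.events
```

Because the close call sits in the finally block, it also runs when the task is cancelled or an exception escapes from the session.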

Running and Testing the Agent

Step 5.1: Running the Python Script

Save the complete script as main.py and execute it with:

python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you'll find a playground link in the console. Use this link to join the session and interact with your AI voice agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

Enhance your agent's capabilities by integrating custom tools. The function_tool feature allows you to add specific functions tailored to your needs.
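As an illustration of the kind of function you might expose as a tool, here is a plain async function that looks up opening hours. Everything in it is hypothetical, including the hours table. How the function is registered with function_tool is not shown in this tutorial, so follow the VideoSDK documentation for the exact wiring.

```python
import asyncio

# Hypothetical opening hours; replace with your business's real data.
BUSINESS_HOURS = {
    "monday": "09:00-17:00",
    "tuesday": "09:00-17:00",
    "wednesday": "09:00-17:00",
    "thursday": "09:00-17:00",
    "friday": "09:00-16:00",
}


async def get_business_hours(day: str) -> dict:
    """Return the opening hours for a given weekday.

    Args:
        day: Weekday name such as "Monday" (case-insensitive).
    """
    key = day.strip().lower()
    if key not in BUSINESS_HOURS:
        return {"error": f"Unknown day: {day}"}
    return {"day": key, "hours": BUSINESS_HOURS[key]}
```

A clear docstring and typed arguments matter here, since tool-calling frameworks commonly use them to describe the function to the language model.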

Exploring Other Plugins

VideoSDK supports various plugins for STT, LLM, and TTS. Experiment with different models to find the best fit for your application.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file. Double-check the authorization header in your requests.
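A quick pre-flight check can catch a missing or blank key before the agent starts. The helper below is a sketch with a made-up name. VIDEOSDK_API_KEY comes from the .env step earlier, and any provider keys you add to the list are your own choice.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


# Example: report anything missing without raising.
missing = missing_keys(["VIDEOSDK_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```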

Audio Input/Output Problems

Verify your microphone and speaker settings. Ensure the correct devices are selected in your system settings.

Dependency and Version Conflicts

Use a virtual environment to manage dependencies. Check for version conflicts and update packages as needed.
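When you suspect a version conflict, it helps to see exactly what is installed in the active environment. This sketch uses the standard library's importlib.metadata. The function name is hypothetical, and the names you pass in should be the distribution names you installed with pip.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Running it inside and outside the virtual environment shows quickly whether the script is using the interpreter you expect.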

Conclusion

Summary of What You've Built

Congratulations! You've built a fully functional AI call answering service using VideoSDK. Your agent can handle calls, provide information, and route queries efficiently.

Next Steps and Further Learning

Explore additional features and plugins to enhance your agent's capabilities. Consider integrating with other APIs to expand functionality and provide even more value to users. For a comprehensive understanding of the AI voice agent core components, delve deeper into the documentation.
