Build an AI Phone Call Agent with VideoSDK

Learn to build an AI voice agent for phone calls using VideoSDK. This guide provides step-by-step instructions and code examples.

Introduction to AI Voice Agents for Phone Calls

AI Voice Agents are revolutionizing the way we interact with technology, offering a seamless interface for communication via voice. These agents can perform tasks ranging from simple information retrieval to complex customer service interactions. In the context of AI phone calls, voice agents are particularly valuable as they can handle calls autonomously, providing information, scheduling appointments, and even troubleshooting basic issues.

What is an AI Voice Agent?

An AI Voice Agent is a software program designed to interact with users through voice commands. It utilizes technologies such as Speech-to-Text (STT), Text-to-Speech (TTS), and Natural Language Processing (NLP) to understand and respond to user queries effectively.

Why are they important for the AI Phone Call industry?

In the AI phone call industry, voice agents are crucial for automating customer service, reducing wait times, and providing 24/7 support. They can handle a high volume of calls, ensuring that users receive prompt and accurate responses without human intervention.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Text-to-Speech (TTS): Converts text back into spoken language.
  • Natural Language Processing (NLP): Understands and processes the text to generate meaningful responses.

For a comprehensive understanding, refer to the AI voice Agent core components overview.

What You'll Build in This Tutorial

In this tutorial, you will learn how to build a fully functional AI voice agent capable of managing phone calls using the VideoSDK framework. This agent will be able to initiate conversations, provide information, and handle basic troubleshooting.

Architecture and Core Concepts

Understanding the architecture and core concepts of an AI voice agent is essential for building an effective system. In this section, we will explore the high-level architecture and key components used in the VideoSDK framework.

High-Level Architecture Overview

The architecture of an AI voice agent involves several components that work together to process user input and generate responses. The process begins with capturing the user's speech, which is then converted to text using STT. The text is analyzed using an LLM (Large Language Model) to generate a response, which is finally converted back to speech using TTS.
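The cascaded flow described above can be sketched with stubbed components. The three stub functions below are placeholders standing in for real services (Deepgram, OpenAI, ElevenLabs), not VideoSDK APIs — the point is only to show how one conversational turn moves through the pipeline:

```python
# Minimal sketch of one turn through a cascaded STT -> LLM -> TTS pipeline.
# Each stub stands in for a real service; none of these are VideoSDK APIs.

def speech_to_text(audio: bytes) -> str:
    # A real STT engine would transcribe the audio; here we pretend.
    return "what are your opening hours"

def generate_response(transcript: str) -> str:
    # A real LLM would reason over the transcript plus conversation history.
    return f"You asked: '{transcript}'. We are open 9am to 5pm."

def text_to_speech(text: str) -> bytes:
    # A real TTS engine would synthesize audio; here we return the raw bytes.
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    """Run one user turn through the cascade: audio in, audio out."""
    transcript = speech_to_text(audio)
    reply = generate_response(transcript)
    return text_to_speech(reply)

print(handle_turn(b"\x00\x01").decode("utf-8"))
```

In the real agent, the CascadingPipeline class wires these stages together and streams audio between them, so you configure plugins rather than writing this loop yourself.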

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, responsible for managing interactions.
  • CascadingPipeline: A series of processing steps that handle audio input and output, including STT, LLM, and TTS. Learn more about the Cascading pipeline in AI voice Agents.
  • VAD & TurnDetector: Tools that help the agent determine when to listen and when to speak, ensuring smooth interactions. Explore the Turn detector for AI voice Agents.

Setting Up the Development Environment

Before building your AI voice agent, you'll need to set up your development environment. This includes installing necessary software and configuring your system.

Prerequisites

To get started, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage your project's dependencies:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip:
pip install videosdk-agents
pip install videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory and add your VideoSDK API key. The pipeline's plugins typically read their own provider keys from the environment as well (the names below follow each provider's common convention):

VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
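In practice python-dotenv (or the SDK itself) loads this file for you, but it can be useful to see what that involves. The sketch below is a minimal .env parser plus a check for missing keys; the key name mirrors this tutorial's .env file:

```python
# Minimal sketch of parsing a .env file and checking for required keys.
# python-dotenv handles this in real projects; this shows the mechanics.

REQUIRED_KEYS = ["VIDEOSDK_API_KEY"]

def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(env: dict[str, str]) -> list[str]:
    """Return required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

sample = "VIDEOSDK_API_KEY=abc123\n# a comment\n"
print(missing_keys(parse_env(sample)))  # []
```

Failing fast on a missing key at startup gives a much clearer error than a cryptic authentication failure mid-call.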

Building the AI Voice Agent: A Step-by-Step Guide

In this section, we will walk through the process of building your AI voice agent. Below is the complete code for the agent, which we will break down and explain in subsequent sections.
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first call isn't delayed
pre_download_model()

agent_instructions = """
{
  "persona": "Friendly and efficient virtual assistant",
  "capabilities": [
    "Initiate and manage phone calls with users",
    "Provide information on various topics as requested",
    "Assist with scheduling and reminders",
    "Offer basic troubleshooting for common issues"
  ],
  "constraints": [
    "You are not a human and should always identify as a virtual assistant",
    "You cannot provide personal opinions or advice",
    "You must include a disclaimer that users should verify critical information independently",
    "You are not authorized to handle sensitive personal data"
  ]
}
"""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the cascading pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To join your agent in a specific room, you need a meeting (room) ID. You can generate one using the VideoSDK REST API (the current room-creation endpoint is v2/rooms, authenticated with your JWT token). Here's a curl example:

curl -X POST "https://api.videosdk.live/v2/rooms" \
  -H "Content-Type: application/json" \
  -H "Authorization: YOUR_JWT_TOKEN" \
  -d '{}'

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where you define your agent's behavior. It inherits from the Agent class and uses the agent_instructions to guide its interactions. The on_enter and on_exit methods define what the agent says when a session starts and ends.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial for handling the audio processing. It consists of several plugins:
  • DeepgramSTT: Converts speech to text.
  • OpenAILLM: Processes the text to generate responses using a language model.
  • ElevenLabsTTS: Converts the generated text back to speech.
  • SileroVAD: Detects when the user is speaking. Learn more about Silero Voice Activity Detection.
  • TurnDetector: Manages turn-taking in conversations.

Step 4.4: Managing the Session and Startup Logic

The start_session function is responsible for initiating the agent session. It creates the agent, sets up the conversation flow, and starts the session. The make_context function configures the room options, and the if __name__ == "__main__": block starts the job.

Running and Testing the Agent

Once your agent is built, it's time to test it in action.

Step 5.1: Running the Python Script

Run your script using the following command:
python main.py

Step 5.2: Interacting with the Agent in the Playground

After running the script, you'll see a playground link in the console. Open this link in a browser to interact with your agent. You can speak to the agent and receive responses in real time. Explore the AI Agent playground for more interactive testing.

Advanced Features and Customizations

Enhance your agent by exploring additional features and plugins.

Extending Functionality with Custom Tools

You can extend your agent's capabilities by integrating custom tools. This allows your agent to perform specific tasks beyond basic interactions.
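As a sketch of what such a tool might look like: the function below is plain Python returning canned data. Hooking it into the agent — e.g. via a function-tool decorator from videosdk.agents — follows the framework's documentation; the exact registration API is assumed here, not shown:

```python
# Sketch of a custom tool the agent could call during a conversation.
# The function itself is plain Python; registering it with the agent
# (e.g. via VideoSDK's function-tool mechanism) is per the SDK docs.

import asyncio

async def get_support_hours(department: str) -> str:
    """Return support hours for a department (canned data for the demo)."""
    hours = {
        "billing": "Mon-Fri, 9am-5pm",
        "technical": "24/7",
    }
    return hours.get(department.lower(), "Department not found")

print(asyncio.run(get_support_hours("technical")))  # 24/7
```

When the LLM decides the caller is asking about support hours, it can invoke the tool and speak the returned string instead of guessing, which keeps answers grounded in your own data.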

Exploring Other Plugins

VideoSDK supports a variety of plugins for STT, LLM, and TTS. Experiment with different options to find the best fit for your use case.

Troubleshooting Common Issues

Here are some common issues you might encounter and how to resolve them.

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file and that you have the necessary permissions.
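A quick startup check catches this class of error early. The sketch below fails fast with a clear message when expected keys are unset; the key name mirrors this tutorial's .env file, so adjust the list to whichever plugins you actually use:

```python
# Fail fast if expected API keys are missing from the environment,
# instead of hitting a cryptic authentication error mid-call.

import os

def check_keys(keys: list[str], env=os.environ) -> list[str]:
    """Return the names of keys that are unset or empty."""
    return [k for k in keys if not env.get(k)]

missing = check_keys(["VIDEOSDK_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Run this before starting the worker job so misconfiguration surfaces in the console rather than during a live call.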

Audio Input/Output Problems

Check your microphone and speaker settings, and ensure your audio devices are properly connected.

Dependency and Version Conflicts

Use a virtual environment to manage dependencies and avoid version conflicts. Ensure all required packages are installed.

Conclusion

Congratulations! You've built a functional AI voice agent capable of handling phone calls. Continue exploring the VideoSDK framework to enhance your agent's capabilities and learn more about AI development.
