AI-powered telephony is changing how teams handle customer service, sales, and communication workflows. This guide shows how to build an AI telephony agent with VideoSDK's voice agent framework and deploy it on Cerebrium's cloud platform.
What We're Building
We'll create a complete AI telephony system that can:
- Handle both inbound and outbound voice calls
- Integrate with SIP providers like Twilio
- Leverage Google's Gemini AI for intelligent conversations
- Deploy automatically on Cerebrium's scalable infrastructure
- Provide real-time voice processing with minimal latency
Architecture Overview
Our AI telephony agent combines several powerful technologies:
- VideoSDK Agents: The core voice agent framework
- SIP Integration: For telephony connectivity via Twilio
- Gemini AI: Real-time conversational intelligence
- Cerebrium: Cloud deployment and scaling platform
Prerequisites
Before we start, you'll need accounts and credentials for the following services:
- VideoSDK Auth Token
- Twilio SIP trunking setup
- Google API key for Gemini
- Cerebrium account (see the Cerebrium documentation for setup)
Project Structure
Our project follows a clean, modular structure:
├── cerebrium.toml
├── main.py
├── requirements.txt
└── README.md
Initialize Your Project
Let's begin by setting up our project directory and basic configuration using the Cerebrium Command Line Interface (CLI).
pip install cerebrium
cerebrium login
cerebrium init videosdk-telephony-agent
Configure Cerebrium Deployment
First, let's set up the cerebrium.toml configuration file for the deployment:
[cerebrium.deployment]
name = "sip-ai-agent"
python_version = "3.12"
include = ["./*", "main.py", "cerebrium.toml"]
exclude = [".venv"]
disable_auth = true
[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "CPU"
cpu = 2
memory = 4.0
gpu_count = 0
[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/"
[cerebrium.scaling]
min_replicas = 1
max_replicas = 2
cooldown = 30
replica_concurrency = 4
scaling_metric = "concurrency_utilization"
scaling_target = 80
[cerebrium.dependencies.paths]
pip = "requirements.txt"
This configuration ensures:
- Scalability: Auto-scaling between 1-2 replicas based on concurrency, ensuring the app can handle fluctuating call volumes.
- Performance: Each replica gets 2 CPU cores and 4 GB of memory with no GPU, since the Gemini model runs remotely and the app only relays real-time audio.
- Dependency Management: It clearly points to our requirements.txt file, keeping our dependencies separate and organized.
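To make the scaling numbers concrete, here is a small sketch that derives the request ceiling from the [cerebrium.scaling] values above. It assumes that replica_concurrency is the number of simultaneous requests one replica accepts and that scaling_target is a percentage of that figure. The helper names are hypothetical and not part of any Cerebrium SDK.

```python
# Values copied from the [cerebrium.scaling] table above.
SCALING = {
    "min_replicas": 1,
    "max_replicas": 2,
    "replica_concurrency": 4,
    "scaling_target": 80,  # percent, assumed to apply to replica_concurrency
}


def max_concurrent_requests(scaling: dict) -> int:
    """Upper bound on simultaneous requests when every replica is running."""
    return scaling["max_replicas"] * scaling["replica_concurrency"]


def scale_out_threshold(scaling: dict) -> float:
    """Average in-flight requests per replica at which scaling is assumed to start."""
    return scaling["replica_concurrency"] * scaling["scaling_target"] / 100


if __name__ == "__main__":
    print("ceiling:", max_concurrent_requests(SCALING))
    print("scale-out threshold per replica:", scale_out_threshold(SCALING))
```

Under these assumptions the configuration tops out at 8 simultaneous requests, so raise max_replicas or replica_concurrency if you expect more concurrent calls.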
Define Project Dependencies
Create a requirements.txt file in your project directory. You can also view or download the complete file from the project's GitHub repository.
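Because the file contents are not reproduced here, a quick way to confirm your environment matches main.py is to check that every module it imports can be found. The sketch below does only that: it says nothing about PyPI package names or versions, and the function name is hypothetical. Note also that form parsing in FastAPI relies on the python-multipart package, which this check does not cover.

```python
import importlib.util

# Modules imported by main.py in this guide.
REQUIRED_MODULES = [
    "dotenv",
    "fastapi",
    "uvicorn",
    "pyngrok",
    "videosdk.agents",
    "videosdk.plugins.google",
    "videosdk.plugins.sip",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (for example "videosdk") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

Run it inside the virtual environment you installed requirements.txt into; an empty result means all imports in main.py should resolve.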
Build the Core AI Agent
Now, let's create our main application in main.py. Here's the complete implementation:
import asyncio
import os
import logging
from contextlib import asynccontextmanager
from typing import Optional
from dotenv import load_dotenv
from fastapi import FastAPI, Request, Response
import uvicorn
from pyngrok import ngrok
from videosdk.plugins.sip import create_sip_manager
from videosdk.agents import Agent, JobContext, function_tool, RealTimePipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
load_dotenv()
logging.basicConfig(level=os.getenv("LOG_LEVEL", "INFO"))
logger = logging.getLogger(__name__)

def create_agent_pipeline():
    """Create the real-time pipeline for our agent."""
    model = GeminiRealtime(
        api_key=os.getenv("GOOGLE_API_KEY"),
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Leda",  # type: ignore
            response_modalities=["AUDIO"],  # type: ignore
        ),
    )
    return RealTimePipeline(model=model)

class SIPAIAgent(Agent):
    """An AI agent for handling voice calls."""

    def __init__(self, ctx: Optional[JobContext] = None):
        super().__init__(
            instructions=(
                # The trailing space keeps the two sentences from running together.
                "You are a helpful voice assistant that can answer questions and help with tasks. Be friendly and concise. "
                "Talk to the user as if you are a human and not a robot."
            ),
            tools=[self.end_call],  # type: ignore
        )
        self.ctx = ctx
        self.greeting_message = "Hello! Thank you for calling. How can I assist you today?"
        logger.info("SIPAIAgent created")

    async def on_enter(self) -> None:
        pass

    async def greet_user(self) -> None:
        await self.session.say(self.greeting_message)  # type: ignore

    async def on_exit(self) -> None:
        pass

    @function_tool
    async def end_call(self) -> str:
        """End the current call gracefully"""
        await self.session.say("Thank you for calling. Have a great day!")  # type: ignore
        # Give the farewell a moment to play before leaving the session.
        await asyncio.sleep(1)
        await self.session.leave()  # type: ignore
        return "Call ended gracefully"

sip_manager = create_sip_manager(
    provider=os.getenv("SIP_PROVIDER", "twilio"),
    videosdk_token=os.getenv("VIDEOSDK_AUTH_TOKEN"),
    provider_config={
        # Twilio config
        "account_sid": os.getenv("TWILIO_ACCOUNT_SID"),
        "auth_token": os.getenv("TWILIO_AUTH_TOKEN"),
        "phone_number": os.getenv("TWILIO_PHONE_NUMBER"),
    }
)

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Lifespan manager for FastAPI app startup and shutdown."""
    port = int(os.getenv("PORT", 8000))
    try:
        # Close any stale tunnels, then expose the local port publicly.
        ngrok.kill()
        ngrok_auth_token = os.getenv("NGROK_AUTHTOKEN")
        if ngrok_auth_token:
            ngrok.set_auth_token(ngrok_auth_token)
        tunnel = ngrok.connect(str(port), "http")
        sip_manager.set_base_url(tunnel.public_url)  # type: ignore
        logger.info(f"Ngrok tunnel created: {tunnel.public_url}")
    except Exception as e:
        logger.error(f"Failed to start ngrok tunnel: {e}")
    yield
    try:
        ngrok.kill()
        logger.info("Ngrok tunnel closed")
    except Exception as e:
        logger.error(f"Error closing ngrok tunnel: {e}")

app = FastAPI(title="SIP AI Agent", lifespan=lifespan)


@app.post("/call/make")
async def make_call(to_number: str):
    # Outbound calls need a public base URL for the provider's callbacks.
    if not sip_manager.base_url:
        return {"status": "error", "message": "Service not ready (no base URL)."}
    agent_config = {"room_name": "Call", "enable_pubsub": True}
    details = await sip_manager.make_call(
        to_number=to_number,
        agent_class=SIPAIAgent,
        pipeline=create_agent_pipeline,
        agent_config=agent_config
    )
    return {"status": "success", "details": details}


@app.post("/sip/answer/{room_id}")
async def answer_webhook(room_id: str):
    logger.info(f"Answering call for room: {room_id}")
    body, status_code, headers = sip_manager.get_sip_response_for_room(room_id)
    return Response(content=body, status_code=status_code, media_type=headers.get("Content-Type"))


@app.post("/webhook/incoming")
async def incoming_webhook(request: Request):
    try:
        # Accept both form-encoded and JSON webhook payloads.
        content_type = request.headers.get("Content-Type", "")
        if "x-www-form-urlencoded" in content_type:
            webhook_data = dict(await request.form())
        else:
            webhook_data = await request.json()
        logger.info(f"Received incoming webhook: {webhook_data}")
        agent_config = {"room_name": "Incoming Call", "enable_pubsub": True}
        body, status_code, headers = await sip_manager.handle_incoming_call(
            webhook_data=webhook_data,
            agent_class=SIPAIAgent,
            pipeline=create_agent_pipeline,
            agent_config=agent_config
        )
        return Response(content=body, status_code=status_code, media_type=headers.get("Content-Type"))
    except Exception as e:
        logger.error(f"Error in incoming webhook: {e}", exc_info=True)
        return Response(content="Error processing request", status_code=500)


@app.get("/sessions")
async def get_sessions():
    return {"sessions": sip_manager.get_active_sessions()}


@app.get("/")
async def root():
    # Also serves as the health check endpoint configured in cerebrium.toml.
    return {"message": "SIP AI Agent"}

if __name__ == "__main__":
    port = int(os.getenv("PORT", 8000))
    logger.info(f"Starting SIP AI Agent on port {port}")
    uvicorn.run(app, host="0.0.0.0", port=port)
Key Components Explained
1. Agent Pipeline Creation
The create_agent_pipeline() function sets up our Gemini AI model with specific configurations:
- Model: gemini-2.0-flash-live-001 for real-time processing
- Voice: "Leda" for natural-sounding responses
- Modalities: Audio-only responses for telephony
2. SIPAIAgent Class
Our custom agent class inherits from VideoSDK's Agent base class and includes:
- Instructions: Clear behavioral guidelines for the AI
- Tools: Function tools such as end_call(), which the model can invoke to say goodbye and leave the session
- Lifecycle methods: on_enter(), greet_user(), on_exit()
3. SIP Manager Integration
The SIP manager handles all telephony operations:
- Provider Configuration: Twilio credentials and settings
- Call Management: Both inbound and outbound call handling
- Session Tracking: Active session monitoring
API Endpoints
Our API provides several key endpoints:
- POST /call/make: Initiate outbound calls
- POST /webhook/incoming: Handle incoming call webhooks
- POST /sip/answer/{room_id}: Process SIP responses
- GET /sessions: Monitor active sessions
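The /webhook/incoming handler accepts either form-encoded or JSON payloads. The following sketch mirrors that content-type branch as a standalone function using only the standard library, so the behavior can be tried without FastAPI. The function name is hypothetical, and the CallSid and From fields are illustrative examples of what a Twilio voice webhook typically carries.

```python
import json
from urllib.parse import parse_qsl


def parse_webhook_body(content_type: str, raw_body: bytes) -> dict:
    """Decode a webhook payload the same way incoming_webhook chooses a parser."""
    if "x-www-form-urlencoded" in content_type:
        # Form payloads arrive as key=value pairs with percent-encoding.
        return dict(parse_qsl(raw_body.decode("utf-8")))
    # Anything else is treated as JSON, matching the handler's else branch.
    return json.loads(raw_body)


if __name__ == "__main__":
    form = parse_webhook_body(
        "application/x-www-form-urlencoded",
        b"CallSid=CA123&From=%2B15551234567",
    )
    print(form)
    print(parse_webhook_body("application/json", b'{"CallSid": "CA123"}'))
```

Percent-encoded values such as %2B are decoded during parsing, so phone numbers reach the SIP manager with their leading plus sign intact.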
Environment Configuration
Set up these environment variables for your deployment by adding them to a .env file:
# VideoSDK Configuration
VIDEOSDK_AUTH_TOKEN=your_videosdk_token
# AI Configuration
GOOGLE_API_KEY=your_google_api_key
# Twilio Configuration
TWILIO_ACCOUNT_SID=your_account_sid
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=your_phone_number
# Optional
NGROK_AUTHTOKEN=your_ngrok_token
SIP_PROVIDER=twilio
To make these available to your deployment, add them to the Secrets view on your Cerebrium dashboard.
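A missing credential usually surfaces only when the first call fails, so it can help to check the variables at startup. Here is a minimal sketch of such a check, assuming the five non-optional variables listed above are all required. The function name is hypothetical and not part of VideoSDK or Cerebrium.

```python
import os

# Variables from the list above that are not marked optional.
REQUIRED_ENV_VARS = [
    "VIDEOSDK_AUTH_TOKEN",
    "GOOGLE_API_KEY",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "TWILIO_PHONE_NUMBER",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Calling missing_env_vars() near the top of main.py, after load_dotenv(), is one way to log a clear message before the SIP manager is created.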
Deploying on Cerebrium
Deploying to Cerebrium takes two steps:
- Install Cerebrium CLI:
pip install cerebrium
- Deploy your application:
cerebrium deploy
After the deployment finishes, you will receive a public URL to interact with your application. It will have a format similar to this:
https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxxxx/sip-ai-agent/
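With the deployment URL in hand, an outbound call can be triggered through POST /call/make. The sketch below builds that request with the standard library. It assumes to_number is sent as a query parameter (which is how FastAPI treats a plain str argument) and that the deployment URL forwards paths directly to the app; the function name and the phone number are placeholders.

```python
import urllib.parse
import urllib.request


def build_make_call_request(base_url: str, to_number: str) -> urllib.request.Request:
    """Build a POST request for /call/make with to_number as a query parameter."""
    # urlencode escapes the leading "+" of an E.164 number as %2B.
    query = urllib.parse.urlencode({"to_number": to_number})
    url = f"{base_url.rstrip('/')}/call/make?{query}"
    return urllib.request.Request(url, method="POST")


if __name__ == "__main__":
    request = build_make_call_request(
        "https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxxxx/sip-ai-agent/",
        "+15551234567",  # placeholder number
    )
    print(request.get_method(), request.full_url)
    # To actually place the call, send it with urllib.request.urlopen(request).
```

The example only prints the request so that it can be run without placing a call; replace the placeholder project ID and number with your own before sending it.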
VideoSDK Documentation Resources
For deeper integration and customization, explore these VideoSDK resources:
- AI Agents Introduction
- Quick Start Guide
- Core Components Overview
- Agent Implementation
- Real-time Pipeline
- SIP Integration
- Multiple Agents
Conclusion
Building an AI telephony agent with VideoSDK and deploying on Cerebrium creates a powerful, scalable communication solution. This architecture provides:
- Rapid Development: Get started in minutes with pre-built components
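- Scale: The configuration in this guide allows up to 2 replicas with 4 concurrent requests each; raise max_replicas and replica_concurrency to handle higher call volumes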
- Cost Efficiency: Pay-per-use pricing model
- Global Reach: Deploy worldwide with minimal latency
The combination of VideoSDK's robust voice agent framework and Cerebrium's intelligent cloud platform creates an ideal environment for modern AI-powered telephony solutions.
Ready to get started? Check out the VideoSDK AI Agents documentation and deploy your first agent on Cerebrium today!