AI-powered telephony solutions are revolutionizing customer service, sales, and communication workflows. This comprehensive guide shows you how to build a sophisticated AI telephony agent using VideoSDK's powerful voice agent capabilities, deployed seamlessly on Cerebrium's cloud platform.

What We're Building

We'll create a complete AI telephony system that can:

  • Handle both inbound and outbound voice calls
  • Integrate with SIP providers like Twilio
  • Leverage Google's Gemini AI for intelligent conversations
  • Deploy automatically on Cerebrium's scalable infrastructure
  • Provide real-time voice processing with minimal latency

Architecture Overview

Our AI telephony agent combines several powerful technologies:

  • VideoSDK Agents: The core voice agent framework
  • SIP Integration: For telephony connectivity via Twilio
  • Gemini AI: Real-time conversational intelligence
  • Cerebrium: Cloud deployment and scaling platform

Prerequisites

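Before we start, you'll need accounts and credentials for the services this guide uses: VideoSDK (auth token), Google AI (API key for Gemini), Twilio (account SID, auth token, and a phone number), and Cerebrium. An ngrok auth token is optional.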

Project Structure

Our project follows a clean, modular structure:

├── cerebrium.toml
├── main.py
├── requirements.txt
└── README.md

Initialize Your Project

Let's begin by setting up our project directory and basic configuration using the Cerebrium Command Line Interface (CLI).

pip install cerebrium
cerebrium login
cerebrium init videosdk-telephony-agent

Configure Cerebrium Deployment

Next, define the deployment, hardware, runtime, and scaling settings in cerebrium.toml:

[cerebrium.deployment]
name = "sip-ai-agent"
python_version = "3.12"
include = ["./*", "main.py", "cerebrium.toml"]
exclude = [".venv"]
disable_auth = true

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "CPU"
cpu = 2
memory = 4.0
gpu_count = 0

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/"

[cerebrium.scaling]
min_replicas = 1
max_replicas = 2
cooldown = 30
replica_concurrency = 4
scaling_metric = "concurrency_utilization"
scaling_target = 80

[cerebrium.dependencies.paths]
pip = "requirements.txt"

This configuration ensures:

  • Scalability: Auto-scaling between 1 and 2 replicas based on concurrency utilization, so the app can absorb fluctuating call volumes.
  • Performance: 2 CPU cores and 4 GB of memory per replica, with no GPU, since the speech model runs remotely through the Gemini API.
  • Dependency Management: Dependencies are installed from requirements.txt, keeping them separate from the deployment config.

Define Project Dependencies

Create a requirements.txt file in your project directory. You can also view or download the complete file directly from the project's official GitHub repository.
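Once the dependencies are installed, a quick check like the one below can confirm that the top-level modules imported by main.py resolve in your environment. This is only a sketch: the module list is taken from the import statements in main.py, and the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Top-level modules that main.py imports (derived from its import statements).
REQUIRED_MODULES = ["fastapi", "uvicorn", "dotenv", "pyngrok", "videosdk"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Note that the check only covers top-level packages; it does not verify that the videosdk SIP and Google plugins are installed, nor that versions match requirements.txt.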

Build the Core AI Agent

Now, let's create our main application in main.py. Here's the complete implementation:

import asyncio
import os
import logging
from contextlib import asynccontextmanager
from typing import Optional
from dotenv import load_dotenv
from fastapi import FastAPI, Request, Response
import uvicorn
from pyngrok import ngrok
from videosdk.plugins.sip import create_sip_manager
from videosdk.agents import Agent, JobContext, function_tool, RealTimePipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

load_dotenv()

logging.basicConfig(level=os.getenv("LOG_LEVEL", "INFO"))
logger = logging.getLogger(__name__)

def create_agent_pipeline():
    """Function to create the specific pipeline for our agent."""
    model = GeminiRealtime(
        api_key=os.getenv("GOOGLE_API_KEY"),
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Leda", # type: ignore
            response_modalities=["AUDIO"], # type: ignore
        ),
    )
    return RealTimePipeline(model=model)

class SIPAIAgent(Agent):
    """An AI agent for handling voice calls."""

    def __init__(self, ctx: Optional[JobContext] = None):
        super().__init__(
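            instructions=(
                # Adjacent string literals are joined with no separator, so the
                # first fragment ends with a space.
                "You are a helpful voice assistant that can answer questions and help with tasks. Be friendly and concise. "
                "Talk to the user as if you are a human and not a robot."
            ),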
            tools=[self.end_call], # type: ignore
        )
        self.ctx = ctx
        self.greeting_message = "Hello! Thank you for calling. How can I assist you today?"
        logger.info("SIPAIAgent created")

    async def on_enter(self) -> None:
        pass

    async def greet_user(self) -> None:
        await self.session.say(self.greeting_message) # type: ignore

    async def on_exit(self) -> None:
        pass

    @function_tool
    async def end_call(self) -> str:
        """End the current call gracefully"""
        await self.session.say("Thank you for calling. Have a great day!") # type: ignore
        await asyncio.sleep(1)
        await self.session.leave() # type: ignore
        return "Call ended gracefully"

sip_manager = create_sip_manager(
    provider=os.getenv("SIP_PROVIDER", "twilio"),
    videosdk_token=os.getenv("VIDEOSDK_AUTH_TOKEN"),
    provider_config={
        # Twilio config
        "account_sid": os.getenv("TWILIO_ACCOUNT_SID"),
        "auth_token": os.getenv("TWILIO_AUTH_TOKEN"),
        "phone_number": os.getenv("TWILIO_PHONE_NUMBER"),
    }
)

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Lifespan manager for FastAPI app startup and shutdown."""
    port = int(os.getenv("PORT", 8000))
    try:
        ngrok.kill()
        ngrok_auth_token = os.getenv("NGROK_AUTHTOKEN")
        if ngrok_auth_token:
            ngrok.set_auth_token(ngrok_auth_token)
        tunnel = ngrok.connect(str(port), "http")
        sip_manager.set_base_url(tunnel.public_url) # type: ignore
        logger.info(f"Ngrok tunnel created: {tunnel.public_url}")
    except Exception as e:
        logger.error(f"Failed to start ngrok tunnel: {e}")
    yield
    try:
        ngrok.kill()
        logger.info("Ngrok tunnel closed")
    except Exception as e:
        logger.error(f"Error closing ngrok tunnel: {e}")

app = FastAPI(title="SIP AI Agent", lifespan=lifespan)

@app.post("/call/make")
async def make_call(to_number: str):
    if not sip_manager.base_url:
        return {"status": "error", "message": "Service not ready (no base URL)."}
    agent_config = {"room_name": "Call", "enable_pubsub": True}
    details = await sip_manager.make_call(
        to_number=to_number,
        agent_class=SIPAIAgent,
        pipeline=create_agent_pipeline,
        agent_config=agent_config
    )
    return {"status": "success", "details": details}

@app.post("/sip/answer/{room_id}")
async def answer_webhook(room_id: str):
    logger.info(f"Answering call for room: {room_id}")
    body, status_code, headers = sip_manager.get_sip_response_for_room(room_id)
    return Response(content=body, status_code=status_code, media_type=headers.get("Content-Type"))

@app.post("/webhook/incoming")
async def incoming_webhook(request: Request):
    try:
        content_type = request.headers.get("Content-Type", "")
        if "x-www-form-urlencoded" in content_type:
            webhook_data = dict(await request.form())
        else:
            webhook_data = await request.json()
        logger.info(f"Received incoming webhook: {webhook_data}")

        agent_config = {"room_name": "Incoming Call", "enable_pubsub": True}
        body, status_code, headers = await sip_manager.handle_incoming_call(
            webhook_data=webhook_data,
            agent_class=SIPAIAgent,
            pipeline=create_agent_pipeline,
            agent_config=agent_config
        )
        return Response(content=body, status_code=status_code, media_type=headers.get("Content-Type"))
    except Exception as e:
        logger.error(f"Error in incoming webhook: {e}", exc_info=True)
        return Response(content="Error processing request", status_code=500)

@app.get("/sessions")
async def get_sessions():
    return {"sessions": sip_manager.get_active_sessions()}

@app.get("/")
async def root():
    return {"message": "SIP AI Agent"}

if __name__ == "__main__":
    port = int(os.getenv("PORT", 8000))
    logger.info(f"Starting SIP AI Agent on port {port}")
    uvicorn.run(app, host="0.0.0.0", port=port)

Key Components Explained

1. Agent Pipeline Creation

The create_agent_pipeline() function sets up our Gemini AI model with specific configurations:

  • Model: gemini-2.0-flash-live-001 for real-time processing
  • Voice: "Leda" for natural-sounding responses
  • Modalities: Audio-only responses for telephony

2. SIPAIAgent Class

Our custom agent class inherits from VideoSDK's Agent base class and includes:

  • Instructions: Clear behavioral guidelines for the AI
  • Tools: Function tools the model can invoke, such as the end_call() method defined on the class
  • Lifecycle methods: on_enter(), greet_user(), on_exit()

3. SIP Manager Integration

The SIP manager handles all telephony operations:

  • Provider Configuration: Twilio credentials and settings
  • Call Management: Both inbound and outbound call handling
  • Session Tracking: Active session monitoring

API Endpoints

Our API provides several key endpoints:

  • POST /call/make: Initiate outbound calls
  • POST /webhook/incoming: Handle incoming call webhooks
  • POST /sip/answer/{room_id}: Return the SIP response for a given room
  • GET /sessions: Monitor active sessions

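The incoming webhook handler accepts either a form-encoded or a JSON body and turns both into a plain dict before handing it to the SIP manager. The standalone sketch below mirrors that branching with the standard library only, so you can see what the handler ends up logging. The function name parse_webhook_body is hypothetical, and the CallSid and From fields are illustrative sample values rather than a complete payload.

```python
import json
from urllib.parse import parse_qsl


def parse_webhook_body(content_type: str, raw_body: bytes) -> dict:
    """Normalize a webhook body to a dict, mirroring the handler's branching."""
    if "x-www-form-urlencoded" in content_type:
        # Form bodies arrive as key=value pairs with percent-encoding.
        return dict(parse_qsl(raw_body.decode("utf-8")))
    # Anything else is treated as JSON, as in the handler above.
    return json.loads(raw_body)


form_payload = parse_webhook_body(
    "application/x-www-form-urlencoded",
    b"CallSid=CA123&From=%2B15550001111",
)
json_payload = parse_webhook_body("application/json", b'{"CallSid": "CA123"}')
print(form_payload)
print(json_payload)
```

Both branches yield the same shape, which is why the rest of the handler does not need to know which format the provider sent.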
Environment Configuration

Set up these environment variables for local development by adding them to a .env file:

# VideoSDK Configuration
VIDEOSDK_AUTH_TOKEN=your_videosdk_token

# AI Configuration
GOOGLE_API_KEY=your_google_api_key

# Twilio Configuration
TWILIO_ACCOUNT_SID=your_account_sid
TWILIO_AUTH_TOKEN=your_auth_token
TWILIO_PHONE_NUMBER=your_phone_number

# Optional
NGROK_AUTHTOKEN=your_ngrok_token
SIP_PROVIDER=twilio

To make these available to your deployment, add them to the Secrets view on your Cerebrium dashboard.

Deploying on Cerebrium

Cerebrium makes deployment incredibly straightforward:

  1. Install the Cerebrium CLI:
pip install cerebrium
  2. Deploy your application:
cerebrium deploy

After the deployment finishes, you will receive a public URL to interact with your application. It will have a format similar to this:

https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxxxx/sip-ai-agent/
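With the deployment URL in hand, an outbound call can be triggered by sending a POST to /call/make. The sketch below builds that request with the standard library. It assumes to_number travels as a query parameter, which is how FastAPI treats a plain str argument like the one in make_call, and that no auth header is needed because disable_auth is set to true. The helper name build_make_call_request and the phone number are hypothetical, and the project ID placeholder is left as shown above.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_make_call_request(base_url: str, to_number: str) -> Request:
    """Build a POST request for the /call/make endpoint of a deployment."""
    # urljoin only appends to the path when the base ends with a slash.
    base = base_url if base_url.endswith("/") else base_url + "/"
    # urlencode percent-encodes the leading "+" of an E.164 number.
    url = urljoin(base, "call/make") + "?" + urlencode({"to_number": to_number})
    return Request(url, method="POST")


request = build_make_call_request(
    "https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxxxxx/sip-ai-agent/",
    "+15551234567",  # hypothetical destination number
)
print(request.get_method(), request.full_url)
```

To actually send it, pass the request to urllib.request.urlopen once the placeholder project ID has been replaced with your own.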

VideoSDK Documentation Resources

For deeper integration and customization, see the VideoSDK AI Agents documentation.

Conclusion

Building an AI telephony agent with VideoSDK and deploying on Cerebrium creates a powerful, scalable communication solution. This architecture provides:

  • Rapid Development: Get started in minutes with pre-built components
  • Scalability: Raise max_replicas and replica_concurrency in cerebrium.toml as call volume grows
  • Cost Efficiency: Pay-per-use pricing model
  • Low Latency: Deploy in the region closest to your callers

The combination of VideoSDK's robust voice agent framework and Cerebrium's intelligent cloud platform creates an ideal environment for modern AI-powered telephony solutions.

Ready to get started? Check out the VideoSDK AI Agents documentation and deploy your first agent on Cerebrium today!