What if you could build your own AI-powered voice agent—one that can answer and place calls, handle appointment scheduling, route customers, collect feedback, and even run automated surveys, all in real time? In this blog, I'll show you how to build a robust, production-ready AI telephony agent with SIP and VoIP integration using Python, VideoSDK, and the latest AI models—all open source and totally extensible.
We'll go step by step from project setup to a working inbound and outbound AI voice agent. You’ll get code you can copy, clear explanations, and links to deeper docs for each component. By the end, you’ll have the foundation for scalable, enterprise-grade customer service automation or custom telephony workflows.
Why AI Telephony—And Why Now?
Traditional telephony systems are rigid, expensive, and hard to adapt to new business needs. But with AI voice agents and SIP (Session Initiation Protocol), you can build next-generation solutions: think appointment bots, emergency notification systems, automated feedback collectors, and more. The magic lies in combining real-time VoIP telephony (using SIP trunks from providers like Twilio) with advanced AI—like Google Gemini or OpenAI—for natural conversations and smart call handling.
Architecture Overview: Modular, Extensible, Real-Time
Our architecture separates concerns for maximum flexibility:
- SIP Integration (VoIP telephony, call control, DTMF, call transfer, call recording)
- AI Voice Agent (Powered by VideoSDK’s agent framework, integrates LLMs, STT, TTS, sentiment analysis)
- Session Management (Inbound/outbound call routing, session lifecycle)
- Provider Abstraction (Easily switch SIP providers—Twilio, Plivo, etc.)
- Pluggable AI Capabilities (Swap in Google, OpenAI, or custom models)
You can add features like runtime configuration, call transcription, web dashboards, and more—all with Python.
Project Structure
Let’s start by laying out the recommended project structure, just like the demo repo:
ai-telephony-demo/
├── ai/ # AI and LLM plugins (optional, for custom logic)
├── providers/ # Telephony/SIP provider integrations
├── services/ # Business logic, utilities, and workflow services
├── voice_agent.py # Core AI voice agent
├── server.py # FastAPI application and entrypoint
├── models.py # Request/response models for the FastAPI endpoints
├── config.py # Environment-driven config
└── requirements.txt # Python dependencies
Dependencies
Install the dependencies listed in requirements.txt:
pip install -r requirements.txt
Key dependencies include:
- fastapi & uvicorn for the server
- videosdk, videosdk-agents, and plugins for agent logic
- twilio, google-cloud-speech, google-cloud-texttospeech for SIP & AI
- python-dotenv for config
Configuration
Create a .env file in your project root with all the required keys:
VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
VIDEOSDK_SIP_USERNAME=your_sip_username
VIDEOSDK_SIP_PASSWORD=your_sip_password
GOOGLE_API_KEY=your_google_api_key
TWILIO_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_auth_token
TWILIO_NUMBER=your_twilio_phone_number
Your config.py loads and validates these:
import os
import logging
from dotenv import load_dotenv

# Read key/value pairs from .env into the process environment.
load_dotenv()

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


class Config:
    VIDEOSDK_AUTH_TOKEN = os.getenv("VIDEOSDK_AUTH_TOKEN")
    VIDEOSDK_SIP_USERNAME = os.getenv("VIDEOSDK_SIP_USERNAME")
    VIDEOSDK_SIP_PASSWORD = os.getenv("VIDEOSDK_SIP_PASSWORD")
    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
    TWILIO_ACCOUNT_SID = os.getenv("TWILIO_SID")
    TWILIO_AUTH_TOKEN = os.getenv("TWILIO_AUTH_TOKEN")
    TWILIO_NUMBER = os.getenv("TWILIO_NUMBER")

    @classmethod
    def validate(cls):
        # Keys are the environment variable names, so error messages match .env.
        required_vars = {
            "VIDEOSDK_AUTH_TOKEN": cls.VIDEOSDK_AUTH_TOKEN,
            "VIDEOSDK_SIP_USERNAME": cls.VIDEOSDK_SIP_USERNAME,
            "VIDEOSDK_SIP_PASSWORD": cls.VIDEOSDK_SIP_PASSWORD,
            "GOOGLE_API_KEY": cls.GOOGLE_API_KEY,
            "TWILIO_SID": cls.TWILIO_ACCOUNT_SID,
            "TWILIO_AUTH_TOKEN": cls.TWILIO_AUTH_TOKEN,
            "TWILIO_NUMBER": cls.TWILIO_NUMBER,
        }
        missing = [v for v, val in required_vars.items() if not val]
        if missing:
            for v in missing:
                logger.error(f"Missing environment variable: {v}")
            raise ValueError(f"Missing required environment variables: {', '.join(missing)}")
        logger.info("All required environment variables are set.")


# Fail fast: validation runs as soon as this module is imported.
Config.validate()
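Because Config.validate() runs at import time, anything that imports config.py without a complete environment will raise. If you want to unit-test the check on its own, one option is to pull the "which variables are missing" logic into a pure function. The sketch below is an assumption, not code from the repo: find_missing and REQUIRED_VARS are hypothetical names, and unlike the original check it also treats whitespace-only values as missing.

```python
from typing import Iterable, List, Mapping

# Same variable names as the .env file above.
REQUIRED_VARS = (
    "VIDEOSDK_AUTH_TOKEN",
    "VIDEOSDK_SIP_USERNAME",
    "VIDEOSDK_SIP_PASSWORD",
    "GOOGLE_API_KEY",
    "TWILIO_SID",
    "TWILIO_AUTH_TOKEN",
    "TWILIO_NUMBER",
)


def find_missing(
    env: Mapping[str, str],
    required: Iterable[str] = REQUIRED_VARS,
) -> List[str]:
    """Return the required names that are absent or blank in env (hypothetical helper)."""
    return [name for name in required if not env.get(name, "").strip()]
```

In the app you would pass os.environ; in a test you can pass a plain dict.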
The Voice Agent: AI-Powered Call Automation
Your agent logic lives in voice_agent.py. Here’s the real implementation from the repo:
import logging
from typing import Optional, List, Any
from videosdk.agents import Agent

logger = logging.getLogger(__name__)


class VoiceAgent(Agent):
    """An outbound call agent specialized for medical appointment scheduling."""

    def __init__(
        self,
        instructions: str = "You are a medical appointment scheduling assistant. Your goal is to confirm upcoming appointments (5th June 2025 at 11:00 AM) and reschedule if needed.",
        tools: Optional[List[Any]] = None,
        context: Optional[dict] = None,
    ) -> None:
        super().__init__(
            instructions=instructions,
            tools=tools or []
        )
        # Per-call data, e.g. a custom "initial_greeting".
        self.context = context or {}
        self.logger = logging.getLogger(__name__)

    async def on_enter(self) -> None:
        self.logger.info("Agent entered the session.")
        # Fall back to a default greeting when the caller did not supply one.
        initial_greeting = self.context.get(
            "initial_greeting",
            "Hello, this is Neha, calling from City Medical Center regarding your upcoming appointment. Is this a good time to speak?"
        )
        await self.session.say(initial_greeting)

    async def on_exit(self) -> None:
        self.logger.info("Call ended")
You can customize instructions, context, and plug in different tools/plugins for STT, TTS, or LLMs.
The Server: Handling Calls, Routing, and Agent Sessions
The server.py file uses FastAPI to handle incoming SIP webhooks, manage sessions, and glue everything together:
import logging
from fastapi import FastAPI, Request, Form, BackgroundTasks, HTTPException
from fastapi.responses import PlainTextResponse
from config import Config
from models import OutboundCallRequest, CallResponse, SessionInfo
from providers import get_provider
from services import VideoSDKService, SessionManager

logger = logging.getLogger(__name__)

app = FastAPI(
    title="VideoSDK AI Agent Call Server (Modular)",
    description="Modular FastAPI server for inbound/outbound calls with VideoSDK AI Agent using different providers.",
    version="2.0.0"
)

videosdk_service = VideoSDKService()
session_manager = SessionManager()
sip_provider = get_provider("twilio")  # Use your SIP provider


@app.get("/health", response_class=PlainTextResponse)
async def health_check():
    active_sessions = session_manager.get_active_sessions_count()
    return f"Server is healthy. Active sessions: {active_sessions}"


@app.post("/inbound-call", response_class=PlainTextResponse)
async def inbound_call(
    request: Request,
    background_tasks: BackgroundTasks,
    CallSid: str = Form(...),
    From: str = Form(...),
    To: str = Form(...),
):
    logger.info(f"Inbound call received from {From} to {To}. CallSid: {CallSid}")
    try:
        # Create a room, start the agent in the background, then bridge the call to it.
        room_id = await videosdk_service.create_room()
        session = await session_manager.create_session(room_id, "inbound")
        background_tasks.add_task(session_manager.run_session, session, room_id)
        sip_endpoint = videosdk_service.get_sip_endpoint(room_id)
        twiml = sip_provider.generate_twiml(sip_endpoint)
        logger.info(f"Responding to {sip_provider.get_provider_name()} inbound call {CallSid} with TwiML to dial SIP: {sip_endpoint}")
        return twiml
    except HTTPException as e:
        logger.error(f"Failed to handle inbound call {CallSid}: {e.detail}")
        return PlainTextResponse(f"<Response><Say>An error occurred: {e.detail}</Say></Response>", status_code=500)
    except Exception as e:
        logger.error(f"Unhandled error in inbound call {CallSid}: {e}", exc_info=True)
        return PlainTextResponse("<Response><Say>An unexpected error occurred. Please try again later.</Say></Response>", status_code=500)


@app.post("/outbound-call")
async def outbound_call(request_body: OutboundCallRequest, background_tasks: BackgroundTasks):
    to_number = request_body.to_number
    initial_greeting = request_body.initial_greeting
    logger.info(f"Request to initiate outbound call to: {to_number}")
    if not to_number:
        raise HTTPException(status_code=400, detail="'to_number' is required.")
    try:
        # Same flow as inbound, except the provider places the call for us.
        room_id = await videosdk_service.create_room()
        session = await session_manager.create_session(
            room_id,
            "outbound",
            initial_greeting
        )
        background_tasks.add_task(session_manager.run_session, session, room_id)
        sip_endpoint = videosdk_service.get_sip_endpoint(room_id)
        twiml = sip_provider.generate_twiml(sip_endpoint)
        call_result = sip_provider.initiate_outbound_call(to_number, twiml)
        logger.info(f"Outbound call initiated via {sip_provider.get_provider_name()} to {to_number}. "
                    f"Call SID: {call_result['call_sid']}. VideoSDK Room: {room_id}")
        return CallResponse(
            message="Outbound call initiated successfully",
            twilio_call_sid=call_result['call_sid'],
            videosdk_room_id=room_id
        )
    except HTTPException as e:
        logger.error(f"Failed to initiate outbound call to {to_number}: {e.detail}")
        raise
    except Exception as e:
        logger.error(f"Unhandled error initiating outbound call to {to_number}: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail=f"Failed to initiate outbound call: {e}")


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
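Both endpoints rely on a SessionManager whose code isn't shown in this post. To make the session lifecycle concrete, here is a minimal in-memory sketch that matches the three calls the server makes: create_session, run_session, and get_active_sessions_count. Everything else is an assumption for illustration. The Session dataclass, the InMemorySessionManager name, and the injected run_agent coroutine (a stand-in for whatever actually joins the VideoSDK room) are hypothetical, and the repo's real class may look quite different.

```python
import asyncio
import uuid
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, Optional


@dataclass
class Session:
    """Hypothetical per-call record; the repo's session object may differ."""
    room_id: str
    direction: str  # "inbound" or "outbound"
    initial_greeting: Optional[str] = None
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class InMemorySessionManager:
    """Tracks active sessions by room ID and forgets them when the agent finishes."""

    def __init__(self, run_agent: Callable[[Session], Awaitable[None]]) -> None:
        # run_agent is a placeholder for the code that runs the agent in the room.
        self._run_agent = run_agent
        self._active: Dict[str, Session] = {}

    async def create_session(
        self, room_id: str, direction: str, initial_greeting: Optional[str] = None
    ) -> Session:
        session = Session(room_id, direction, initial_greeting)
        self._active[room_id] = session
        return session

    async def run_session(self, session: Session, room_id: str) -> None:
        try:
            await self._run_agent(session)
        finally:
            # Remove the session even if the agent raised, so the count stays accurate.
            self._active.pop(room_id, None)

    def get_active_sessions_count(self) -> int:
        return len(self._active)
```

The try/finally in run_session is the important part: since FastAPI runs the session as a background task, cleanup has to happen inside the task itself or /health will report stale sessions.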
Modular Providers, Services, and Models
The demo repo is designed to be modular and extensible:
- providers/ contains code for handling different SIP providers (Twilio, Vonage, etc.).
- services/ manages VideoSDK integration, room/session management, and business logic.
- models.py defines request/response data for FastAPI endpoints.
You can easily add or swap providers, business rules, and AI models.
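To give a feel for what swapping providers could look like, here's a rough sketch of a provider interface built around the three methods server.py calls: get_provider_name, generate_twiml, and initiate_outbound_call. Treat it as an assumption, not the repo's code. SIPProvider, EchoProvider, and the registry are hypothetical names, and EchoProvider is a fake that never touches the network. The TwiML it emits uses Twilio's Dial and Sip elements to bridge the call to a SIP URI.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict
from xml.sax.saxutils import escape


class SIPProvider(ABC):
    """Hypothetical base class exposing the methods server.py relies on."""

    @abstractmethod
    def get_provider_name(self) -> str: ...

    @abstractmethod
    def generate_twiml(self, sip_endpoint: str) -> str: ...

    @abstractmethod
    def initiate_outbound_call(self, to_number: str, twiml: str) -> Dict[str, str]: ...


class EchoProvider(SIPProvider):
    """Fake provider for local testing; it places no real calls."""

    def get_provider_name(self) -> str:
        return "echo"

    def generate_twiml(self, sip_endpoint: str) -> str:
        # Escape the URI so characters like '&' keep the XML well-formed.
        return f"<Response><Dial><Sip>{escape(sip_endpoint)}</Sip></Dial></Response>"

    def initiate_outbound_call(self, to_number: str, twiml: str) -> Dict[str, str]:
        # A real provider would call its REST API here and return the call ID.
        return {"call_sid": f"FAKE-{to_number}"}


# Map provider names to factories; a real project would register Twilio etc. here.
_REGISTRY: Dict[str, Callable[[], SIPProvider]] = {"echo": EchoProvider}


def get_provider(name: str) -> SIPProvider:
    try:
        factory = _REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown SIP provider: {name!r}") from None
    return factory()
```

With a shape like this, adding a provider means writing one subclass and one registry entry, while server.py stays untouched.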
Extending with MCP & Agent2Agent Protocol
To enable advanced features like agent-to-agent transfer, call control, and real-time management:
- Integrate the MCP protocol for call control, muting, and participant management.
- Use the Agent2Agent protocol to automate handoffs and workflows between agents.
You can build on the provided classes and add hooks in your VoiceAgent or session logic to coordinate with these protocols.
Running and Testing
Start your FastAPI server:
uvicorn server:app --reload
- Use tools like ngrok to expose your server to the public internet for SIP webhooks.
- Configure your SIP provider (e.g., Twilio) to point to your /inbound-call endpoint.
- Trigger inbound or outbound calls and watch your AI agent handle real conversations!
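Before wiring up a real phone number, you may want to poke the endpoints locally. The sketch below builds the two requests using only the standard library: a JSON POST for /outbound-call and a form-encoded POST that mimics the CallSid, From, and To fields the /inbound-call handler expects. The helper names are hypothetical, and treating initial_greeting as optional is an assumption about the request model, which isn't shown in this post.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_outbound_request(
    base_url: str, to_number: str, initial_greeting: Optional[str] = None
) -> Request:
    """Build (but do not send) a JSON POST for the /outbound-call endpoint."""
    payload = {"to_number": to_number}
    if initial_greeting is not None:
        payload["initial_greeting"] = initial_greeting
    return Request(
        f"{base_url.rstrip('/')}/outbound-call",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_inbound_webhook(
    base_url: str, call_sid: str, from_number: str, to_number: str
) -> Request:
    """Build (but do not send) a form POST shaped like the provider's webhook."""
    form = urlencode({"CallSid": call_sid, "From": from_number, "To": to_number})
    return Request(
        f"{base_url.rstrip('/')}/inbound-call",
        data=form.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

With the server running on port 8000, pass either request to urllib.request.urlopen to send it, for example urlopen(build_inbound_webhook("http://localhost:8000", "CA123", "+15550100", "+15550199")). Note that the outbound request will try to place a real call through your provider.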
Key Takeaways
- This open-source project provides a real, modular foundation for AI-powered telephony using SIP, VoIP, and cloud AI.
- The code is production-grade and extensible—just add your workflows, providers, or AI plugins.
- You can enable advanced call control, routing, A2A communication, and more with VideoSDK protocols.
Resources & Next Steps
- Explore the ai-telephony-demo repo for the full codebase and more docs.
- Learn more about VideoSDK AI Agents, A2A, and MCP.
- Build your own use case: appointment scheduling, customer service automation, or scalable feedback collection!