What if you could build your own AI-powered voice agent, one that can answer and place calls, handle appointment scheduling, route customers, collect feedback, and run automated surveys, all in real time? In this post, I'll show you how to build a robust, production-ready AI telephony agent with SIP and VoIP integration using Python, VideoSDK, and current AI models. The project itself is open source and extensible.

We'll go step by step from project setup to a working inbound and outbound AI voice agent. You’ll get code you can copy, clear explanations, and links to deeper docs for each component. By the end, you’ll have the foundation for scalable, enterprise-grade customer service automation or custom telephony workflows.

Why AI Telephony—And Why Now?

Traditional telephony systems are rigid, expensive, and hard to adapt to new business needs. But with AI voice agents and SIP (Session Initiation Protocol), you can build next-generation solutions: think appointment bots, emergency notification systems, automated feedback collectors, and more. The magic lies in combining real-time VoIP telephony (using SIP trunks from providers like Twilio) with advanced AI—like Google Gemini or OpenAI—for natural conversations and smart call handling.

Architecture Overview: Modular, Extensible, Real-Time

Our architecture separates concerns for maximum flexibility:

  • SIP Integration (VoIP telephony, call control, DTMF, call transfer, call recording)
  • AI Voice Agent (Powered by VideoSDK’s agent framework, integrates LLMs, STT, TTS, sentiment analysis)
  • Session Management (Inbound/outbound call routing, session lifecycle)
  • Provider Abstraction (Easily switch SIP providers—Twilio, Plivo, etc.)
  • Pluggable AI Capabilities (Swap in Google, OpenAI, or custom models)

You can add features like runtime configuration, call transcription, web dashboards, and more—all with Python.
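
To make the provider abstraction concrete, here is a minimal sketch of how a provider interface and registry could look. It is not the repo's actual code: the class names SIPProvider and FakeProvider and the register_provider decorator are assumptions for illustration. Only the method names (generate_twiml, initiate_outbound_call, get_provider_name) and get_provider mirror the calls made in server.py later in this post.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class SIPProvider(ABC):
    """Hypothetical base class; method names mirror the calls in server.py."""

    @abstractmethod
    def get_provider_name(self) -> str: ...

    @abstractmethod
    def generate_twiml(self, sip_endpoint: str) -> str: ...

    @abstractmethod
    def initiate_outbound_call(self, to_number: str, twiml: str) -> dict: ...


# Maps a lowercase provider name to the class that implements it.
_REGISTRY: Dict[str, Type[SIPProvider]] = {}


def register_provider(name: str):
    """Class decorator that adds a provider to the registry (assumed helper)."""
    def decorator(cls: Type[SIPProvider]) -> Type[SIPProvider]:
        _REGISTRY[name.lower()] = cls
        return cls
    return decorator


def get_provider(name: str) -> SIPProvider:
    """Instantiate a registered provider by name, case-insensitively."""
    try:
        return _REGISTRY[name.lower()]()
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "none"
        raise ValueError(f"Unknown SIP provider {name!r}. Registered: {known}") from None


@register_provider("fake")
class FakeProvider(SIPProvider):
    """Stand-in provider for local testing; it places no real calls."""

    def get_provider_name(self) -> str:
        return "fake"

    def generate_twiml(self, sip_endpoint: str) -> str:
        return f"<Response><Dial><Sip>{sip_endpoint}</Sip></Dial></Response>"

    def initiate_outbound_call(self, to_number: str, twiml: str) -> dict:
        return {"call_sid": f"FAKE-{to_number}"}
```

With a registry like this, switching providers is a one-word change at the call site, and a fake provider lets you exercise the server without spending telephony credits.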

Project Structure

Let’s start by laying out the recommended project structure, just like the demo repo:

ai-telephony-demo/
├── ai/                  # AI and LLM plugins (optional, for custom logic)
├── providers/           # Telephony/SIP provider integrations
├── services/            # Business logic, utilities, and workflow services
├── voice_agent.py       # Core AI voice agent
├── server.py            # FastAPI application and entrypoint
├── models.py            # Request/response models for the FastAPI endpoints
├── config.py            # Environment-driven config
└── requirements.txt     # Python dependencies

Dependencies

Install the dependencies listed in requirements.txt:

pip install -r requirements.txt

Key dependencies include:

  • fastapi & uvicorn for the server
  • videosdk, videosdk-agents, and plugins for agent logic
  • twilio, google-cloud-speech, google-cloud-texttospeech for SIP & AI
  • python-dotenv for config

Configuration

Create a .env file in your project root with all the required keys:

VIDEOSDK_AUTH_TOKEN=your_videosdk_auth_token
VIDEOSDK_SIP_USERNAME=your_sip_username
VIDEOSDK_SIP_PASSWORD=your_sip_password
GOOGLE_API_KEY=your_google_api_key
TWILIO_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_auth_token
TWILIO_NUMBER=your_twilio_phone_number

Your config.py loads and validates these:

import os
import logging
from dotenv import load_dotenv

load_dotenv()
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class Config:
    VIDEOSDK_AUTH_TOKEN = os.getenv("VIDEOSDK_AUTH_TOKEN")
    VIDEOSDK_SIP_USERNAME = os.getenv("VIDEOSDK_SIP_USERNAME")
    VIDEOSDK_SIP_PASSWORD = os.getenv("VIDEOSDK_SIP_PASSWORD")
    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
    TWILIO_ACCOUNT_SID = os.getenv("TWILIO_SID")
    TWILIO_AUTH_TOKEN = os.getenv("TWILIO_AUTH_TOKEN")
    TWILIO_NUMBER = os.getenv("TWILIO_NUMBER")

    @classmethod
    def validate(cls):
        required_vars = {
            "VIDEOSDK_AUTH_TOKEN": cls.VIDEOSDK_AUTH_TOKEN,
            "VIDEOSDK_SIP_USERNAME": cls.VIDEOSDK_SIP_USERNAME,
            "VIDEOSDK_SIP_PASSWORD": cls.VIDEOSDK_SIP_PASSWORD,
            "GOOGLE_API_KEY": cls.GOOGLE_API_KEY,
            "TWILIO_SID": cls.TWILIO_ACCOUNT_SID,
            "TWILIO_AUTH_TOKEN": cls.TWILIO_AUTH_TOKEN,
            "TWILIO_NUMBER": cls.TWILIO_NUMBER,
        }
        missing = [v for v, val in required_vars.items() if not val]
        if missing:
            for v in missing:
                logger.error(f"Missing environment variable: {v}")
            raise ValueError(f"Missing required environment variables: {', '.join(missing)}")
        logger.info("All required environment variables are set.")

Config.validate()

The Voice Agent: AI-Powered Call Automation

Your agent logic lives in voice_agent.py. Here’s the real implementation from the repo:

import logging
from typing import Optional, List, Any
from videosdk.agents import Agent

logger = logging.getLogger(__name__)

class VoiceAgent(Agent):
    """An outbound call agent specialized for medical appointment scheduling."""

    def __init__(
        self,
        instructions: str = "You are a medical appointment scheduling assistant. Your goal is to confirm upcoming appointments (5th June 2025 at 11:00 AM) and reschedule if needed.",
        tools: Optional[List[Any]] = None,
        context: Optional[dict] = None,
    ) -> None:
        super().__init__(
            instructions=instructions,
            tools=tools or []
        )
        self.context = context or {}
        self.logger = logging.getLogger(__name__)
        
    async def on_enter(self) -> None:
        self.logger.info("Agent entered the session.")
        initial_greeting = self.context.get(
            "initial_greeting",
            "Hello, this is Neha, calling from City Medical Center regarding your upcoming appointment. Is this a good time to speak?"
        )
        await self.session.say(initial_greeting)

    async def on_exit(self) -> None:
        self.logger.info("Call ended")

You can customize instructions, context, and plug in different tools/plugins for STT, TTS, or LLMs.
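
As an illustration of that customization, the sketch below builds the instructions and context arguments for a specific appointment instead of hard-coding them. The helper name build_agent_kwargs and its parameters are hypothetical; the only things taken from the class above are the instructions and context keyword arguments and the "initial_greeting" context key.

```python
from datetime import datetime


def build_agent_kwargs(clinic: str, caller_name: str, appointment: datetime) -> dict:
    """Return keyword arguments suitable for VoiceAgent(**kwargs).

    Hypothetical helper: it only assembles strings and does not touch VideoSDK.
    """
    when = appointment.strftime("%d %B %Y at %I:%M %p")
    instructions = (
        "You are a medical appointment scheduling assistant. "
        f"Your goal is to confirm the upcoming appointment ({when}) "
        "and reschedule if needed."
    )
    context = {
        "initial_greeting": (
            f"Hello, this is {caller_name}, calling from {clinic} regarding "
            "your upcoming appointment. Is this a good time to speak?"
        )
    }
    return {"instructions": instructions, "context": context}


# Usage (requires the VoiceAgent class defined above):
# agent = VoiceAgent(**build_agent_kwargs(
#     "City Medical Center", "Neha", datetime(2025, 6, 5, 11, 0)))
```

Keeping the per-call details in a small builder like this means the agent class stays generic while each outbound call gets its own greeting and goal.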

The Server: Handling Calls, Routing, and Agent Sessions

The server.py file uses FastAPI to handle incoming SIP webhooks, manage sessions, and glue everything together:

import logging
from fastapi import FastAPI, Request, Form, BackgroundTasks, HTTPException
from fastapi.responses import PlainTextResponse
from config import Config
from models import OutboundCallRequest, CallResponse, SessionInfo
from providers import get_provider
from services import VideoSDKService, SessionManager

logger = logging.getLogger(__name__)

app = FastAPI(
    title="VideoSDK AI Agent Call Server (Modular)",
    description="Modular FastAPI server for inbound/outbound calls with VideoSDK AI Agent using different providers.",
    version="2.0.0"
)

videosdk_service = VideoSDKService()
session_manager = SessionManager()
sip_provider = get_provider("twilio")  # Use your SIP provider

@app.get("/health", response_class=PlainTextResponse)
async def health_check():
    active_sessions = session_manager.get_active_sessions_count()
    return f"Server is healthy. Active sessions: {active_sessions}"

@app.post("/inbound-call", response_class=PlainTextResponse)
async def inbound_call(
    request: Request,
    background_tasks: BackgroundTasks,
    CallSid: str = Form(...),
    From: str = Form(...),
    To: str = Form(...),
):
    logger.info(f"Inbound call received from {From} to {To}. CallSid: {CallSid}")
    try:
        # Create a room, start the agent session in the background, then tell
        # the provider to bridge the caller into the room's SIP endpoint.
        room_id = await videosdk_service.create_room()
        session = await session_manager.create_session(room_id, "inbound")
        background_tasks.add_task(session_manager.run_session, session, room_id)
        sip_endpoint = videosdk_service.get_sip_endpoint(room_id)
        twiml = sip_provider.generate_twiml(sip_endpoint)
        logger.info(f"Responding to {sip_provider.get_provider_name()} inbound call {CallSid} with TwiML to dial SIP: {sip_endpoint}")
        # Twilio only executes TwiML served as XML; a text/plain body is read aloud to the caller.
        return PlainTextResponse(twiml, media_type="application/xml")
    except HTTPException as e:
        logger.error(f"Failed to handle inbound call {CallSid}: {e.detail}")
        return PlainTextResponse(f"<Response><Say>An error occurred: {e.detail}</Say></Response>", status_code=500, media_type="application/xml")
    except Exception as e:
        logger.error(f"Unhandled error in inbound call {CallSid}: {e}", exc_info=True)
        return PlainTextResponse("<Response><Say>An unexpected error occurred. Please try again later.</Say></Response>", status_code=500, media_type="application/xml")

@app.post("/outbound-call")
async def outbound_call(request_body: OutboundCallRequest, background_tasks: BackgroundTasks):
    to_number = request_body.to_number
    initial_greeting = request_body.initial_greeting
    logger.info(f"Request to initiate outbound call to: {to_number}")

    if not to_number:
        raise HTTPException(status_code=400, detail="'to_number' is required.")

    try:
        room_id = await videosdk_service.create_room()
        session = await session_manager.create_session(
            room_id, 
            "outbound", 
            initial_greeting
        )
        background_tasks.add_task(session_manager.run_session, session, room_id)
        sip_endpoint = videosdk_service.get_sip_endpoint(room_id)
        twiml = sip_provider.generate_twiml(sip_endpoint)
        call_result = sip_provider.initiate_outbound_call(to_number, twiml)
        logger.info(f"Outbound call initiated via {sip_provider.get_provider_name()} to {to_number}. "
                   f"Call SID: {call_result['call_sid']}. VideoSDK Room: {room_id}")
        return CallResponse(
            message="Outbound call initiated successfully",
            twilio_call_sid=call_result['call_sid'],
            videosdk_room_id=room_id
        )
    except HTTPException as e:
        logger.error(f"Failed to initiate outbound call to {to_number}: {e.detail}")
        raise e
    except Exception as e:
        logger.error(f"Unhandled error initiating outbound call to {to_number}: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail=f"Failed to initiate outbound call: {e}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000) 
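
Both endpoints call sip_provider.generate_twiml(sip_endpoint), which is not shown in this post. As a rough sketch of what such a function could produce for Twilio, the example below builds a Dial/Sip TwiML document with the standard library. The function signature, the optional credentials, and the example SIP URI are assumptions, not the repo's implementation.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def generate_twiml(
    sip_endpoint: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> str:
    """Build TwiML that bridges the current call to a SIP URI.

    Assumed shape: <Response><Dial><Sip ...>sip:...</Sip></Dial></Response>.
    ElementTree escapes special characters in the URI and attributes.
    """
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    sip = ET.SubElement(dial, "Sip")
    if username:
        sip.set("username", username)
    if password:
        sip.set("password", password)
    sip.text = sip_endpoint
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


# Example with a placeholder endpoint (not a real VideoSDK address):
# generate_twiml("sip:room-id@sip.example.invalid", "user", "secret")
```

Building the XML with a library rather than string formatting avoids malformed TwiML when a credential or URI contains characters such as & or <.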

Modular Providers, Services, and Models

The demo repo is designed to be modular and extensible:

  • providers/ contains code for handling different SIP providers (Twilio, Vonage, etc).
  • services/ manages VideoSDK integration, room/session management, and business logic.
  • models.py defines request/response data for FastAPI endpoints.

You can easily add or swap providers, business rules, and AI models.
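
The server relies on a SessionManager from services/ whose code is not shown here. The sketch below is one way such a class could track sessions, assuming only the three methods that server.py calls (create_session, run_session, get_active_sessions_count). The Session dataclass and the _run_agent placeholder are hypothetical; a real implementation would start the VideoSDK agent pipeline there.

```python
import asyncio
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Session:
    """Hypothetical record of one call session."""
    room_id: str
    direction: str  # "inbound" or "outbound"
    initial_greeting: Optional[str] = None
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class SessionManager:
    """Tracks active sessions keyed by room ID (illustrative only)."""

    def __init__(self) -> None:
        self._active: Dict[str, Session] = {}

    async def create_session(
        self, room_id: str, direction: str, initial_greeting: Optional[str] = None
    ) -> Session:
        session = Session(room_id, direction, initial_greeting)
        self._active[room_id] = session
        return session

    async def run_session(self, session: Session, room_id: str) -> None:
        # Always drop the session from the registry, even if the agent fails.
        try:
            await self._run_agent(session)
        finally:
            self._active.pop(room_id, None)

    async def _run_agent(self, session: Session) -> None:
        # Placeholder: a real version would join the room and run the agent.
        await asyncio.sleep(0)

    def get_active_sessions_count(self) -> int:
        return len(self._active)
```

The try/finally in run_session is the important design choice: the /health count stays accurate even when a call ends with an exception.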

Extending with MCP & Agent2Agent Protocol

To enable advanced features like agent-to-agent transfer, call control, and real-time management, you can build on the provided classes and add hooks in your VoiceAgent or session logic to coordinate with MCP and the Agent2Agent protocol.

Running and Testing

Start your FastAPI server:

uvicorn server:app --reload

Then:

  1. Use a tool like ngrok to expose your server to the public internet so your provider can reach its webhooks.
  2. Configure your SIP provider (e.g., Twilio) to point to your /inbound-call endpoint.
  3. Trigger inbound or outbound calls and watch your AI agent handle real conversations!
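
With the server running, an outbound call is triggered by POSTing JSON to /outbound-call. Here is a small sketch using only the standard library. The field names to_number and initial_greeting come from the handler in server.py; the helper name, the assumption that initial_greeting is optional, and the example phone number are mine.

```python
import json
import urllib.request
from typing import Optional


def build_outbound_call_request(
    base_url: str, to_number: str, initial_greeting: Optional[str] = None
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the /outbound-call endpoint."""
    payload = {"to_number": to_number}
    if initial_greeting is not None:
        payload["initial_greeting"] = initial_greeting
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/outbound-call",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage against a locally running server (this places a real call):
# req = build_outbound_call_request(
#     "http://localhost:8000", "+15555550100", "Hello, this is a test call.")
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(json.load(resp))
```

Separating request construction from sending makes it easy to inspect the payload before any call is actually placed.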

Key Takeaways

  • This open-source project provides a real, modular foundation for AI-powered telephony using SIP, VoIP, and cloud AI.
  • The code is production-grade and extensible—just add your workflows, providers, or AI plugins.
  • You can enable advanced call control, routing, A2A communication, and more with VideoSDK protocols.

Resources & Next Steps

  • Explore the ai-telephony-demo repo for the full codebase and more docs.
  • Learn more about VideoSDK AI Agents, A2A, and MCP.
  • Build your own use case: appointment scheduling, customer service automation, or scalable feedback collection!