In this blog, you'll learn, step by step, how to build a self-contained VideoSDK agent with a custom conversational flow, voice activity detection (VAD), and turn detection. By the end, you'll have a complete, working playground example: run the script, join the playground, and hold a natural, back-and-forth conversation with an AI agent that uses Retrieval-Augmented Generation (RAG) to make recommendations.
What We're Building
We’re transforming a complex, API-driven service into a simple, playground-ready AI voice agent that runs as a single Python script. This agent will:
- Join a VideoSDK meeting room directly from your terminal
- Support natural conversation with voice activity and turn detection
- Use RAG (Retrieval-Augmented Generation) to give smart, context-aware answers (e.g. travel advice)
- Be easy to run and extend for your own use cases
Prerequisites
You’ll need accounts and API keys for:
- VideoSDK (VIDEOSDK_AUTH_TOKEN)
- Google AI Studio (GOOGLE_API_KEY)
- OpenAI (OPENAI_API_KEY)
- Pinecone (PINECONE_API_KEY, PINECONE_INDEX_NAME)
Project Architecture
├── .env.example
├── agent.py # Agent's personality, VAD, RAG, and dialogue logic
├── build_pinecone_store.py # Utility to build the vector store
├── main.py # Entrypoint: runs the playground agent
├── rag_handler.py # Handles the RAG logic with Pinecone
├── requirements.txt # Python dependencies
├── travel_destinations.csv # Data for the RAG system
└── README.md # Project documentation
Project Setup
Create and Activate a Virtual Environment
python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .ven
Install the Required Dependencies
Create a requirements.txt file with:
videosdk-agents
videosdk-plugins-google
videosdk-plugins-simli
python-dotenv
fastmcp
langchain
langchain-openai
pinecone-client
openai
requests
Then install:
pip install -r requirements.txt
Configure Environment Variables
Copy .env.example to .env and fill in your API keys:
VIDEOSDK_AUTH_TOKEN=...
GOOGLE_API_KEY=...
OPENAI_API_KEY=...
PINECONE_API_KEY=...
PINECONE_INDEX_NAME=...
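A missing key tends to surface late, as a confusing error in the middle of a call. The sketch below is one way to check the five variables up front. The helper name missing_env_vars is hypothetical and not part of the project files; it only uses the Python standard library.

```python
import os
from typing import List, Mapping

# The five variable names listed in the .env file above.
REQUIRED_VARS = [
    "VIDEOSDK_AUTH_TOKEN",
    "GOOGLE_API_KEY",
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
]


def missing_env_vars(env: Mapping[str, str], required: List[str] = REQUIRED_VARS) -> List[str]:
    """Return the required names that are absent or blank in `env` (hypothetical helper)."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    # Assumed usage: run this after load_dotenv() and before starting the agent.
    missing = missing_env_vars(os.environ)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Taking the environment as a parameter, rather than reading os.environ inside the function, keeps the check easy to test with a plain dictionary.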
(Core Concept) What is RAG?
Retrieval-Augmented Generation (RAG) combines a language model with an external knowledge source. When the user asks a question, the agent first searches a knowledge base (here, a Pinecone index) for relevant facts, then passes those facts to the model so it can ground its answer in them. This keeps answers tied to your own data, which is a good fit for something like a travel advisor!
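To make the retrieve-then-answer idea concrete, here is a toy sketch that runs without any API keys. It is not how the agent does it: word counts stand in for OpenAI embeddings, a Python list stands in for Pinecone, and the three documents are made up for illustration.

```python
import math
from collections import Counter
from typing import List

# Hypothetical mini knowledge base; in the real project this lives in Pinecone.
DOCS = [
    "Bali offers beaches, surfing, and street food.",
    "Zermatt is known for skiing and alpine hiking.",
    "Kyoto has temples, gardens, and traditional cuisine.",
]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], top_k: int = 1) -> List[str]:
    """Step 1: rank documents by similarity to the query and keep the best top_k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, docs: List[str], top_k: int = 1) -> str:
    """Step 2: place the retrieved facts in front of the question for the model."""
    context = "\n".join(retrieve(query, docs, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("Where can I go skiing?", DOCS))
```

The real pipeline follows the same two steps, only with learned embeddings and a vector database doing the ranking.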
Build the Knowledge Base
You must first populate your Pinecone vector store with travel destination data:
python build_pinecone_store.py
This reads travel_destinations.csv, generates embeddings with OpenAI, and uploads them to Pinecone.
You must do this before you start the agent!
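The build script itself is not listed in this post, so the following is only a rough sketch of the CSV-to-record step, under stated assumptions: the column names destination and description are hypothetical, fake_embed is a placeholder for the OpenAI embedding call, and the id/values/metadata record shape is assumed to be what the vector store upload expects. The one detail taken from the project is the "text" metadata key, which rag_handler.py reads back at query time.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical CSV layout; the real travel_destinations.csv may use other columns.
SAMPLE_CSV = """destination,description
Bali,"Beaches, surfing, and street food"
Zermatt,Skiing and alpine hiking
"""


def rows_to_records(csv_file, embed: Callable[[str], List[float]]) -> List[Dict]:
    """Turn CSV rows into id/values/metadata records for a vector-store upload."""
    records = []
    for i, row in enumerate(csv.DictReader(csv_file)):
        text = f"{row['destination']}: {row['description']}"
        records.append({
            "id": f"dest-{i}",
            "values": embed(text),
            # rag_handler.py reads metadata["text"], so the text is stored under that key.
            "metadata": {"text": text},
        })
    return records


def fake_embed(text: str) -> List[float]:
    """Placeholder embedding; a real build step would call an embedding model here."""
    return [float(len(text)), float(text.count(" "))]


records = rows_to_records(io.StringIO(SAMPLE_CSV), fake_embed)
print(records[0]["metadata"]["text"])
```

Passing the embedding function in as an argument means the row-handling logic can be tried out locally before spending any API credits.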
Code Walkthrough
main.py: Entrypoint and Session Setup
import asyncio
import sys
import os
import requests
from pathlib import Path
from dotenv import load_dotenv
from videosdk.agents import (
    AgentSession,
    JobContext,
    RoomOptions,
    WorkerJob,
    RealTimePipeline,
)
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from agent import MyVoiceAgent, MyConversationFlow

# Load API keys from .env, overriding values already set in the shell.
load_dotenv(override=True)


def get_room_id(auth_token: str) -> str:
    """Create a new VideoSDK room and return its ID."""
    url = "https://api.videosdk.live/v2/rooms"
    headers = {"Authorization": auth_token}
    response = requests.post(url, headers=headers)
    response.raise_for_status()
    return response.json()["roomId"]


async def start_session(context: JobContext):
    # Real-time speech model that listens and answers with audio.
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Nova",
            response_modalities=["AUDIO"],
        ),
    )
    pipeline = RealTimePipeline(model=model)

    system_prompt = (
        "You are a knowledgeable and friendly travel advisor AI assistant. "
        "Your goal is to help users find perfect travel destinations based on their interests. "
        "Use the context provided from the knowledge base to give helpful, personalized recommendations. "
        "Be conversational and friendly - travel planning should be exciting!"
    )
    agent = MyVoiceAgent(system_prompt=system_prompt, personality="travel_advisor")
    conversation_flow = MyConversationFlow(agent)

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    await context.connect()
    await session.start()
    # Keep the agent running until the process is interrupted.
    await asyncio.Event().wait()


def make_context() -> JobContext:
    """Create a room and build the job context with playground mode enabled."""
    auth_token = os.getenv("VIDEOSDK_AUTH_TOKEN")
    if not auth_token:
        raise ValueError("VIDEOSDK_AUTH_TOKEN environment variable not set!")

    room_id = get_room_id(auth_token)
    room_options = RoomOptions(
        room_id=room_id,
        auth_token=auth_token,
        name="RAG Travel Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    print("🚀 Starting AI Agent for VideoSDK Playground...")
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
agent.py: Personality, Conversational Flow, VAD, and RAG
import asyncio
from typing import AsyncIterator
from videosdk.agents import Agent, ConversationFlow, function_tool, ChatRole
from rag_handler import search_knowledge_base


class MyVoiceAgent(Agent):
    def __init__(self, system_prompt: str, personality: str):
        super().__init__(instructions=system_prompt)
        self.personality = personality

    async def on_enter(self) -> None:
        # Greeting spoken when the agent joins the room.
        await self.session.say("Hey, I'm your friendly travel advisor! Where are you dreaming of going?")

    async def on_exit(self) -> None:
        await self.session.say("Happy travels! Goodbye!")

    @function_tool
    async def end_call(self) -> None:
        # Tool the model can call to say goodbye and leave the room.
        await self.session.say("It was great planning with you. Goodbye!")
        await asyncio.sleep(1)
        await self.session.leave()


class MyConversationFlow(ConversationFlow):
    async def run(self, transcript: str) -> AsyncIterator[str]:
        # Record what the user said.
        self.agent.chat_context.add_message(role=ChatRole.USER, content=transcript)

        # Look up relevant facts; a retrieval failure should not end the turn.
        retrieved_context = None
        try:
            retrieved_context = await search_knowledge_base(transcript)
        except Exception as e:
            print(f"Error during RAG retrieval: {e}")

        # Hand the retrieved facts to the model as a system message.
        if retrieved_context:
            self.agent.chat_context.add_message(
                role=ChatRole.SYSTEM,
                content=f"Here is some relevant context from the travel database: {retrieved_context}"
            )

        # Stream the model's reply back chunk by chunk.
        async for response_chunk in self.process_with_llm():
            yield response_chunk
rag_handler.py: Pinecone RAG Handler
import os
from typing import List, Dict
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone
import logging

load_dotenv()
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class RAGHandler:
    def __init__(self):
        self.embeddings_model = None
        self.pinecone_index = None
        self.initialized = False

    async def initialize(self):
        """Create the embedding model and Pinecone index handle once."""
        if self.initialized:
            return
        try:
            openai_api_key = os.getenv("OPENAI_API_KEY")
            pinecone_api_key = os.getenv("PINECONE_API_KEY")
            pinecone_index_name = os.getenv("PINECONE_INDEX_NAME")
            if not all([openai_api_key, pinecone_api_key, pinecone_index_name]):
                raise ValueError("Missing RAG environment variables!")

            self.embeddings_model = OpenAIEmbeddings(openai_api_key=openai_api_key)
            pc = Pinecone(api_key=pinecone_api_key)
            self.pinecone_index = pc.Index(pinecone_index_name)
            self.initialized = True
            logger.info("RAG Handler initialized successfully.")
        except Exception as e:
            logger.error(f"Failed to initialize RAG Handler: {e}")
            raise

    async def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Embed the query and return the top_k closest entries with their scores."""
        if not self.initialized:
            await self.initialize()

        query_embedding = self.embeddings_model.embed_query(query)
        search_results = self.pinecone_index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True
        )
        return [
            {"content": match.metadata.get("text", ""), "score": match.score}
            for match in search_results.matches
        ]

    def format_context_for_llm(self, search_results: List[Dict]) -> str:
        """Join the results into numbered references the model can read."""
        if not search_results:
            return "No relevant information found in the knowledge base."
        context_parts = []
        for i, result in enumerate(search_results, 1):
            context_parts.append(f"Reference {i}:\n{result['content']}\n")
        return "\n".join(context_parts)


# Shared module-level instance, initialized lazily on the first search.
rag_handler = RAGHandler()


async def search_knowledge_base(query: str, max_results: int = 3) -> str:
    try:
        search_results = await rag_handler.search(query, top_k=max_results)
        return rag_handler.format_context_for_llm(search_results)
    except Exception as e:
        logger.error(f"Error in search_knowledge_base: {e}")
        return "I apologize, but I'm having trouble accessing the travel database right now."
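One thing worth noticing in the handler: search is declared async, yet the embed_query and index query calls inside it are not awaited, so they appear to be ordinary blocking calls. In a voice agent, a blocked event loop can delay audio handling. The sketch below shows one possible way around that, using asyncio.to_thread from the standard library (Python 3.9+). blocking_search is a hypothetical stand-in for the embedding and Pinecone calls, not code from the project.

```python
import asyncio
import time
from typing import List


def blocking_search(query: str) -> List[str]:
    """Hypothetical stand-in for a synchronous embedding + vector-store lookup."""
    time.sleep(0.2)  # simulate network latency
    return [f"result for {query}"]


async def heartbeat(ticks: List[float]) -> None:
    """Record timestamps while the search runs, to show the loop stays responsive."""
    for _ in range(3):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def search_async(query: str) -> List[str]:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so other coroutines keep running in the meantime.
    return await asyncio.to_thread(blocking_search, query)


async def main():
    ticks: List[float] = []
    results, _ = await asyncio.gather(search_async("beaches"), heartbeat(ticks))
    return results, ticks


results, ticks = asyncio.run(main())
print(results, len(ticks))
```

Whether this is worth adding depends on how long your lookups take; for a small index, the simpler version in rag_handler.py may be good enough.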
build_pinecone_store.py: Build the Knowledge Base
This script is unchanged from the original. It reads your travel_destinations.csv, generates OpenAI embeddings, and uploads them to Pinecone. Run it before starting the agent!
travel_destinations.csv: Your Data
This is your knowledge base: a CSV file of travel destinations to recommend. You can expand it as you wish!
Running the Agent and Seeing the Result
Build your knowledge base (if you haven't already):
python build_pinecone_store.py
Run the agent:
python main.py
Open the VideoSDK Playground URL printed in your terminal.
Example:
https://playground.videosdk.live?token=...&meetingId=...
Talk to your AI agent!
- Ask travel questions, like "Where should I go for great beaches and food?"
- The agent will search your knowledge base and respond in real time, with voice activity and turn detection for smooth conversation!
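The playground link in the example above has a simple shape: the base URL plus token and meetingId query parameters. If you ever need to rebuild that link yourself, for instance to share it with a teammate, a small helper like the one below should do. The function name playground_url is hypothetical, and the parameter names are taken from the example URL only, so treat this as a sketch rather than a documented API.

```python
from urllib.parse import urlencode


def playground_url(token: str, meeting_id: str) -> str:
    """Build a playground link in the same shape as the example URL (hypothetical helper)."""
    # urlencode escapes any characters that are not safe inside a query string.
    query = urlencode({"token": token, "meetingId": meeting_id})
    return f"https://playground.videosdk.live?{query}"


# Placeholder values for illustration; use your own token and room ID.
print(playground_url("YOUR_TOKEN", "YOUR_ROOM_ID"))
```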
Conclusion & Next Steps
You now have a fully working, playground-ready conversational AI with:
- Real-time voice and turn-taking
- RAG-powered smart answers
- Easy extensibility (swap out data, add tools, change the agent’s personality)
Next steps:
- Add new tools (weather, booking, facts)
- Enhance your knowledge base (more CSV data)
- Try different agent personalities or pipelines
For more, see the VideoSDK AI Playground documentation.