Build a Conversational Flow AI Agent with Voice Activity & Turn Detection

In this post, you'll learn step by step how to build a self-contained VideoSDK agent with conversational flow, voice activity detection (VAD), and turn detection. By the end, you'll have a complete, working playground example: just run the script, join the playground, and hold a natural, back-and-forth conversation with an AI agent that uses Retrieval-Augmented Generation (RAG) for smart recommendations!

What We're Building

We're turning a complex, API-driven service into a simple, playground-ready AI voice agent that you launch with one Python script. This agent will:

Join a VideoSDK meeting room directly from your terminal
Support natural conversation with voice activity and turn detection
Use RAG (Retrieval-Augmented Generation) to give smart, context-aware answers (e.g. travel advice)
Be easy to run and extend for your own use cases

Prerequisites

You’ll need accounts and API keys for:

VideoSDK (VIDEOSDK_AUTH_TOKEN)
Google AI Studio (GOOGLE_API_KEY)
OpenAI (OPENAI_API_KEY)
Pinecone (PINECONE_API_KEY, PINECONE_INDEX_NAME)

Project Architecture

├── .env.example
├── agent.py               # Agent's personality, VAD, RAG, and dialogue logic
├── build_pinecone_store.py  # Utility to build the vector store
├── main.py                # Entrypoint: runs the playground agent
├── rag_handler.py         # Handles the RAG logic with Pinecone
├── requirements.txt       # Python dependencies
├── travel_destinations.csv # Data for the RAG system
└── README.md              # Project documentation

Project Setup

Create and Activate a Virtual Environment

python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

Install the Required Dependencies

Create a requirements.txt file with:

videosdk-agents
videosdk-plugins-google
videosdk-plugins-simli
python-dotenv
fastmcp
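langchain
langchain-openai  # provides langchain_openai.OpenAIEmbeddings used in rag_handler.py
pinecone-client
openai
requests  # used by main.py to create the room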

Then install:

pip install -r requirements.txt

Configure Environment Variables

Copy .env.example to .env and fill in your API keys:

VIDEOSDK_AUTH_TOKEN=...
GOOGLE_API_KEY=...
OPENAI_API_KEY=...
PINECONE_API_KEY=...
PINECONE_INDEX_NAME=...
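
A missing key tends to surface late, as a confusing error deep inside a session. As a minimal sketch (the helper name missing_env_vars is made up for this post, not part of any SDK), you could check the variables listed above right after load_dotenv() runs:

```python
import os

# Variable names taken from the .env listing above.
REQUIRED_VARS = (
    "VIDEOSDK_AUTH_TOKEN",
    "GOOGLE_API_KEY",
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper: `env` defaults to os.environ, but any mapping
    can be passed in, which makes the check easy to try out.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]
```

Calling missing_env_vars() at the top of main.py and exiting with a clear message when the list is non-empty is one way to fail fast.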

(Core Concept) What is RAG?

Retrieval-Augmented Generation (RAG) combines a language model with external knowledge. When the user asks a question, the agent first searches a knowledge base (here, a Pinecone vector index) for relevant facts, then passes those facts to the model as extra context so the answer is grounded in them. This makes your agent more context-aware, which is perfect for things like a travel advisor!
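
To make the retrieve-then-answer idea concrete, here is a toy sketch that runs without Pinecone or OpenAI. The three-number "embeddings" and the destination texts are made up purely for illustration; in the real project the vectors come from an embedding model and the search is done by Pinecone.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, store, top_k=2):
    """Return the top_k texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]


def build_prompt(question, facts):
    """Put the retrieved facts in front of the question for the model."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Made-up toy vectors with dimensions (beach, food, mountains).
STORE = [
    {"text": "Bali: beaches and street food", "vector": [0.9, 0.8, 0.1]},
    {"text": "Zermatt: alpine skiing", "vector": [0.0, 0.2, 0.9]},
    {"text": "Lisbon: seafood and coast", "vector": [0.6, 0.9, 0.1]},
]

facts = retrieve([1.0, 1.0, 0.0], STORE)  # a "beaches and food" query
prompt = build_prompt("Where should I go for great beaches and food?", facts)
```

The prompt that reaches the model now contains the two closest destinations, which is the same pattern the agent follows with real embeddings.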

Build the Knowledge Base

You must first populate your Pinecone vector store with travel destination data:

python build_pinecone_store.py

This reads travel_destinations.csv, generates embeddings with OpenAI, and uploads them to Pinecone.
You must do this before you start the agent!

Code Walkthrough

`main.py` — Entrypoint and Session Setup

import asyncio
import os
import requests
from dotenv import load_dotenv

from videosdk.agents import (
    AgentSession,
    JobContext,
    RoomOptions,
    WorkerJob,
    RealTimePipeline
)
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from agent import MyVoiceAgent, MyConversationFlow

load_dotenv(override=True)

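def get_room_id(auth_token: str) -> str:
    """Create a new VideoSDK room and return its ID."""
    url = "https://api.videosdk.live/v2/rooms"
    headers = {"Authorization": auth_token}
    # A timeout keeps startup from hanging if the API is unreachable.
    response = requests.post(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()["roomId"]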

async def start_session(context: JobContext):
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Nova",
            response_modalities=["AUDIO"]
        )
    )
    pipeline = RealTimePipeline(model=model)

    system_prompt = (
        "You are a knowledgeable and friendly travel advisor AI assistant. "
        "Your goal is to help users find perfect travel destinations based on their interests. "
        "Use the context provided from the knowledge base to give helpful, personalized recommendations. "
        "Be conversational and friendly - travel planning should be exciting!"
    )
    agent = MyVoiceAgent(system_prompt=system_prompt, personality="travel_advisor")
    conversation_flow = MyConversationFlow(agent)

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    await context.connect()
    await session.start()
    await asyncio.Event().wait()

def make_context() -> JobContext:
    auth_token = os.getenv("VIDEOSDK_AUTH_TOKEN")
    if not auth_token:
        raise ValueError("VIDEOSDK_AUTH_TOKEN environment variable not set!")
    room_id = get_room_id(auth_token)
    room_options = RoomOptions(
        room_id=room_id,
        auth_token=auth_token,
        name="RAG Travel Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    print("🚀 Starting AI Agent for VideoSDK Playground...")
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
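
With playground=True, the playground link is printed for you when the agent starts. If you ever want to build the link yourself (for example, to log it or share it), the sketch below follows the URL shape shown in the example later in this post. The helper name playground_url is an assumption for illustration, and whether the playground accepts other query parameters is not covered here.

```python
from urllib.parse import urlencode

# Base URL and parameter names follow the example link shown in this post.
PLAYGROUND_BASE = "https://playground.videosdk.live"


def playground_url(auth_token: str, room_id: str) -> str:
    """Build a playground link from a token and room ID (hypothetical helper)."""
    query = urlencode({"token": auth_token, "meetingId": room_id})
    return f"{PLAYGROUND_BASE}?{query}"
```

Using urlencode keeps the link valid even if the token contains characters that need escaping.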

`agent.py` — Personality, Conversational Flow, VAD, and RAG

import asyncio
from typing import AsyncIterator
from videosdk.agents import Agent, ConversationFlow, function_tool, ChatRole
from rag_handler import search_knowledge_base

class MyVoiceAgent(Agent):
    def __init__(self, system_prompt: str, personality: str):
        super().__init__(instructions=system_prompt)
        self.personality = personality

    async def on_enter(self) -> None:
        await self.session.say("Hey, I'm your friendly travel advisor! Where are you dreaming of going?")

    async def on_exit(self) -> None:
        await self.session.say("Happy travels! Goodbye!")

    @function_tool
    async def end_call(self) -> None:
        await self.session.say("It was great planning with you. Goodbye!")
        await asyncio.sleep(1)
        await self.session.leave()

class MyConversationFlow(ConversationFlow):
    async def run(self, transcript: str) -> AsyncIterator[str]:
        # Record what the user just said.
        self.agent.chat_context.add_message(role=ChatRole.USER, content=transcript)
        # Look up supporting facts; a retrieval failure should not end the turn.
        retrieved_context = None
        try:
            retrieved_context = await search_knowledge_base(transcript)
        except Exception as e:
            print(f"Error during RAG retrieval: {e}")
        # Pass any retrieved facts to the model as a system message.
        if retrieved_context:
            self.agent.chat_context.add_message(
                role=ChatRole.SYSTEM,
                content=f"Here is some relevant context from the travel database: {retrieved_context}"
            )
        # Stream the model's reply back chunk by chunk.
        async for response_chunk in self.process_with_llm():
            yield response_chunk
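
Notice that agent.py contains no explicit VAD or turn-detection code; in this setup that work is presumably handled by the realtime pipeline and model. To show what the two ideas mean, here is a deliberately simple, self-contained sketch: an energy threshold decides whether a frame contains speech, and a run of quiet frames after speech marks the end of the user's turn. The class name, threshold, and frame counts are illustrative assumptions, not how VideoSDK or Gemini implement it.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


class TurnDetector:
    """Toy detector: end of turn = enough quiet frames right after speech."""

    def __init__(self, energy_threshold=500.0, max_silent_frames=3):
        self.energy_threshold = energy_threshold
        self.max_silent_frames = max_silent_frames
        self.heard_speech = False
        self.silent_run = 0

    def push(self, frame):
        """Feed one frame; return True when the user's turn appears to be over."""
        if rms(frame) >= self.energy_threshold:  # voice activity detected
            self.heard_speech = True
            self.silent_run = 0
            return False
        if not self.heard_speech:
            return False  # leading silence: nobody has spoken yet
        self.silent_run += 1
        if self.silent_run >= self.max_silent_frames:
            # Reset so the next utterance starts a fresh turn.
            self.heard_speech = False
            self.silent_run = 0
            return True
        return False
```

Production systems usually replace the energy threshold with a trained model, but the control flow (detect speech, wait for a pause, hand the turn to the agent) has the same shape.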

`rag_handler.py` — Pinecone RAG Handler

import os
from typing import List, Dict
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone
import logging

load_dotenv()

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class RAGHandler:
    def __init__(self):
        self.embeddings_model = None
        self.pinecone_index = None
        self.initialized = False

    async def initialize(self):
        if self.initialized:
            return
        try:
            openai_api_key = os.getenv("OPENAI_API_KEY")
            pinecone_api_key = os.getenv("PINECONE_API_KEY")
            pinecone_index_name = os.getenv("PINECONE_INDEX_NAME")
            if not all([openai_api_key, pinecone_api_key, pinecone_index_name]):
                raise ValueError("Missing RAG environment variables!")
            self.embeddings_model = OpenAIEmbeddings(openai_api_key=openai_api_key)
            pc = Pinecone(api_key=pinecone_api_key)
            self.pinecone_index = pc.Index(pinecone_index_name)
            self.initialized = True
            logger.info("RAG Handler initialized successfully.")
        except Exception as e:
            logger.error(f"Failed to initialize RAG Handler: {e}")
            raise

    async def search(self, query: str, top_k: int = 3) -> List[Dict]:
        if not self.initialized:
            await self.initialize()
        query_embedding = self.embeddings_model.embed_query(query)
        search_results = self.pinecone_index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True
        )
        return [
            {"content": match.metadata.get("text", ""), "score": match.score}
            for match in search_results.matches
        ]

    def format_context_for_llm(self, search_results: List[Dict]) -> str:
        if not search_results:
            return "No relevant information found in the knowledge base."
        context_parts = []
        for i, result in enumerate(search_results, 1):
            context_parts.append(f"Reference {i}:\n{result['content']}\n")
        return "\n".join(context_parts)

rag_handler = RAGHandler()

async def search_knowledge_base(query: str, max_results: int = 3) -> str:
    try:
        search_results = await rag_handler.search(query, top_k=max_results)
        return rag_handler.format_context_for_llm(search_results)
    except Exception as e:
        logger.error(f"Error in search_knowledge_base: {e}")
        return "I apologize, but I'm having trouble accessing the travel database right now."
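
One thing to be aware of: embed_query and the Pinecone query call are synchronous, so while they run inside the async search method nothing else on the event loop can make progress. In a voice agent that can show up as audio hiccups. A possible remedy, sketched below with a stand-in function instead of the real network calls, is to push the blocking work onto a worker thread with asyncio.to_thread (Python 3.9+). The names blocking_lookup, lookup_async, and demo are hypothetical.

```python
import asyncio
import time


def blocking_lookup(query: str) -> str:
    """Stand-in for a synchronous network call (hypothetical)."""
    time.sleep(0.05)
    return f"results for {query}"


async def lookup_async(query: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_lookup, query)


async def demo():
    """Show that other tasks keep running while the lookup is in flight."""
    ticks = 0

    async def ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    result = await lookup_async("beaches")
    task.cancel()
    return result, ticks
```

In rag_handler.py the same idea would mean wrapping the embedding call and the index query in asyncio.to_thread inside search.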

`build_pinecone_store.py` — Build the Knowledge Base

This script is carried over unchanged from the original API-driven project, so it is not listed here. It reads your travel_destinations.csv, generates OpenAI embeddings, and uploads them to Pinecone. Run it before starting the agent!
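
If you need to write a script like this yourself, the following is a rough sketch of what it might look like, not the original file. It assumes the Pinecone index already exists with a dimension that matches the embedding model, treats every CSV column as text to embed (the real column names are not shown in this post), and stores the text under the "text" metadata key because that is the key rag_handler.py reads. The function names and ID scheme are made up.

```python
import csv
import os


def row_to_text(row: dict) -> str:
    """Flatten one CSV row into a single string to embed."""
    return "; ".join(f"{key}: {value}" for key, value in row.items() if value)


def batched(items: list, size: int):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_store(csv_path: str = "travel_destinations.csv", batch_size: int = 50) -> int:
    """Embed every CSV row and upsert it into Pinecone; return the row count."""
    # Imported here so the helpers above work without these packages installed.
    from langchain_openai import OpenAIEmbeddings
    from pinecone import Pinecone

    with open(csv_path, newline="", encoding="utf-8") as handle:
        texts = [row_to_text(row) for row in csv.DictReader(handle)]

    embedder = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
    index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(
        os.environ["PINECONE_INDEX_NAME"]
    )

    for batch_no, batch in enumerate(batched(texts, batch_size)):
        vectors = embedder.embed_documents(batch)
        index.upsert(vectors=[
            {
                "id": f"dest-{batch_no * batch_size + i}",  # made-up ID scheme
                "values": vector,
                "metadata": {"text": text},  # key read by rag_handler.py
            }
            for i, (text, vector) in enumerate(zip(batch, vectors))
        ])
    return len(texts)
```

Calling build_store() once, after load_dotenv(), would populate the index; batching keeps each embedding and upsert request reasonably small.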

`travel_destinations.csv` — Your Data

This is your knowledge base: a CSV file of travel destinations to recommend. You can expand it as you wish!

Running the Agent and Seeing the Result

Build your knowledge base (if you haven't already):

python build_pinecone_store.py

Run the agent:

python main.py

Open the VideoSDK Playground URL printed in your terminal.
Example:

https://playground.videosdk.live?token=...&meetingId=...

Talk to your AI agent!
- Ask travel questions, like "Where should I go for great beaches and food?"
- The agent will search your knowledge base and respond in real time, with voice activity and turn detection for smooth conversation!

Conclusion & Next Steps

You now have a fully working, playground-ready conversational AI with:

Real-time voice and turn-taking
RAG-powered smart answers
Easy extensibility (swap out data, add tools, change the agent’s personality)

Next steps:

Add new tools (weather, booking, facts)
Enhance your knowledge base (more CSV data)
Try different agent personalities or pipelines

For more, see the VideoSDK AI Playground documentation.