When it comes to building conversational AI, the real challenge isn’t what your AI says; it’s when it says it. Timing determines whether your voice agent feels natural and human-like or robotic and awkward.

In real conversations, we overlap, pause, and jump in mid-sentence. Yet humans almost never talk over each other by accident. How? Because we’re constantly predicting intent. Traditional voice agents don’t have that instinct. They wait for silence - literally. That’s why most bots either interrupt you too soon or sit there awkwardly, unsure if you’re done talking.

To fix this, VideoSDK built Namo-v1 - an open-source, high-performance turn detection model that understands meaning, not just silence.

From Silence Detection to Speech Understanding

Most voice agents rely on a silence timer to decide when you’ve finished speaking. For instance, after 800 ms of quiet, the bot assumes you’re done and replies. But what if you were just thinking, or hesitating before your next word?
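
To see why that breaks down, here is a minimal sketch of the silence-timer heuristic in illustrative Python (not VideoSDK code): once a fixed quiet window elapses, the bot replies, whether or not your sentence was actually finished.

import time

SILENCE_TIMEOUT = 0.8  # seconds of quiet after which the bot assumes the turn is over

def naive_turn_is_over(last_speech_time: float) -> bool:
    """Silence-only heuristic: no semantics, just a timer since the last detected speech."""
    return time.monotonic() - last_speech_time >= SILENCE_TIMEOUT

# A thoughtful pause ("Let me think...") and a finished sentence look identical to this check.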

That’s where Namo changes the game. Instead of reacting to quiet moments, it uses semantic understanding to detect intent and conversation flow.

[Figure: VideoSDK Namo-V1 Turn Detection architecture]

The diagram above illustrates the VideoSDK Namo-V1 Turn Detection architecture: a user’s speech (captured in the VideoSDK Room) passes through Speech-to-Text (STT) and the ChatContext, which then interfaces with the Turn Detector.

Say, Reply and Interrupt

Every conversation can be broken down into three key speech events:

  1. Say - The agent speaks.
  2. Reply - The agent listens and responds when the user is done.
  3. Interrupt - The agent gracefully stops talking if the user jumps in mid-sentence.

The real test for these voice agents is interruption handling, or what we call barge-in: that moment when the user says, “Wait, no, that’s not what I meant…” right in the middle of the AI’s sentence. Most systems panic there. They either keep talking awkwardly or stop too late.

But with VAD + Namo, your agent can detect the user’s intent mid-response, immediately pause its speech output, and switch to listening mode, just like in a real human conversation.
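
Conceptually, the barge-in flow looks like the sketch below. This is illustrative Python only: the object and method names (detects_speech, looks_like_interruption, pause) are hypothetical stand-ins, not the VideoSDK Agents API, which wires this logic up for you in the pipeline shown later.

# Hypothetical sketch of the barge-in decision flow (not the actual VideoSDK API)
async def on_user_audio_while_agent_speaks(frame, vad, namo, partial_transcript, tts_player):
    if vad.detects_speech(frame):                             # layer 1: real voice, not background noise
        if namo.looks_like_interruption(partial_transcript):  # layer 2: semantic confirmation of intent
            await tts_player.pause()                          # stop the agent's own speech immediately
            # ...then hand control back to the listen -> STT -> LLM loop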

Watch this YouTube video: https://youtu.be/IL0OSOD38bo

Implementation: Combining VAD and Namo

1. Voice Activity Detection (VAD)

VAD is your first layer. It detects when speech is happening, separating human voice from background noise.

from videosdk.plugins.silero import SileroVAD

# Configure VAD to detect speech activity
vad = SileroVAD(
    threshold=0.5,                    # Sensitivity to speech (0.3-0.8)
    min_speech_duration=0.1,          # Ignore very brief sounds
    min_silence_duration=0.75         # Wait time before considering speech ended
)

This helps your agent stay reactive without misfiring on every background sound.
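
If your agent runs in a noisier environment, you can trade some reactivity for robustness by tightening these same parameters. The values below are illustrative starting points, not recommendations from the VideoSDK docs:

from videosdk.plugins.silero import SileroVAD

# More conservative settings for noisy rooms (same constructor parameters as above)
noisy_room_vad = SileroVAD(
    threshold=0.7,              # require stronger evidence before treating audio as speech
    min_speech_duration=0.2,    # ignore short bursts such as keyboard clicks
    min_silence_duration=1.0    # tolerate longer pauses before ending the speech segment
)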

2. Namo Turn Detection

Once the audio is cleaned up, the Namo Turn Detector adds intelligence. It understands the meaning behind the speech and predicts whether the user has truly finished talking or is just pausing.

from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download the multilingual model to avoid runtime delays
pre_download_namo_turn_v1_model()

# Initialize the multilingual Turn Detector
turn_detector = NamoTurnDetectorV1(
    threshold=0.7  # Confidence level for triggering a response
)

  • Multilingual Support - Works across 20+ languages (see the sketch after this list)
  • Context-Aware - Recognizes thinking pauses
  • Interruption Handling - Responds instantly to barge-ins
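
The setup above loads the multilingual model. If your agent only ever handles one language, the same helpers accept a language argument (the pipeline example below uses it too); a minimal sketch:

from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Language-specific variant; the same parameters appear again in the pipeline example below
pre_download_namo_turn_v1_model(language="en")
turn_detector_en = NamoTurnDetectorV1(
    language="en",   # load the English model instead of the multilingual one
    threshold=0.7    # same confidence level as before
)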

3. Pipeline Integration

You can plug both detectors into a unified Cascading Pipeline to give your agent real conversational timing.

from videosdk.agents import CascadingPipeline
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download the model you intend to use
pre_download_namo_turn_v1_model(language="en")

pipeline = CascadingPipeline(
    stt=your_stt_provider,
    llm=your_llm_provider,
    tts=your_tts_provider,
    vad=SileroVAD(threshold=0.5),
    turn_detector=NamoTurnDetectorV1(language="en", threshold=0.7)
)

Now, when your agent is mid-reply and detects incoming speech via VAD, Namo helps it semantically confirm if it’s an interruption and instantly pause its own output. That’s real-time, real-human responsiveness.

Complete Example

import os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-downloading the Namo Turn Detector model
pre_download_namo_turn_v1_model()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are VideoSDK's Voice Agent, a helpful voice assistant that can answer questions about the weather. IMPORTANT: do not generate responses longer than 2 lines.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello, how can I help you today?")
    
    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")
        

async def start_session(context: JobContext):

    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(),
        llm=OpenAILLM(),
        tts=ElevenLabsTTS(),
        vad=SileroVAD(),                      # layer 1: voice activity detection
        turn_detector=NamoTurnDetectorV1()    # layer 2: semantic turn detection
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    await context.run_until_shutdown(session=session, wait_for_participant=True)

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="<room_id>",
        name="Namo Turn Detector Agent",
        auth_token=os.getenv("VIDEOSDK_AUTH_TOKEN"),
        playground=True,
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start() 
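
Before starting the agent, make sure your credentials are exported as environment variables. VIDEOSDK_AUTH_TOKEN is read directly in make_context(); the provider key names below are assumptions based on each plugin's usual convention, so double-check them against the plugin docs.

export VIDEOSDK_AUTH_TOKEN="<your_videosdk_token>"
export DEEPGRAM_API_KEY="<your_deepgram_key>"        # assumed variable name for DeepgramSTT
export OPENAI_API_KEY="<your_openai_key>"            # assumed variable name for OpenAILLM
export ELEVENLABS_API_KEY="<your_elevenlabs_key>"    # assumed variable name for ElevenLabsTTS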

Run your agent:

python main.py

Open the VideoSDK playground URL printed in your terminal.
This will look like:

https://playground.videosdk.live?token=...&meetingId=...

Your agent can now speak, wait, listen, and even yield mid-sentence when a human jumps in. Good conversation isn’t about perfect grammar; it’s about timing, empathy, and flow. By combining VAD and Namo, you give your AI agent the ability to truly listen like a human: to speak when it should, wait when it must, and stop when someone else has something to say.

Looking Ahead: Future Directions

  • Multi-party turn-taking detection: deciding when one speaker yields to another.
  • Hybrid signals: combine semantics with prosody, pitch, silence, etc.
  • Adaptive thresholds & confidence models: dynamic sensitivity based on conversation flow.
  • Distilled / edge versions for latency-constrained devices.
  • Continuous learning / feedback loop: let models adapt to usage patterns over time.

Integrate Namo-Turn-Detection-Model on Any Device

Resources and Next Steps

Citation

@software{namo2025,
  title={Namo Turn Detector v1: Semantic Turn Detection for Conversational AI},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/collections/videosdk-live/namo-turn-detector-v1-68d52c0564d2164e9d17ca97}
}