Turn-taking — the ability to know exactly when a user has finished speaking — is the invisible force behind natural human conversation. Yet most voice agents today rely on Voice Activity Detection (VAD) or fixed silence timers, leading to premature cut-offs or long, robotic pauses.

We introduce NAMO Turn Detector v1 (NAMO-v1), an open-source, ONNX-optimized semantic turn detector that predicts conversational boundaries by understanding meaning, not just silence. NAMO achieves <19 ms inference for specialized single-language models, <29 ms for multilingual, and up to 97.3 % accuracy, making it a practical drop-in replacement for VAD-based endpointing in real-time voice systems.

1. Why Existing Approaches Break Down

Most deployed voice agents use one of two approaches:

  • Silence-based VAD: very fast and lightweight but either interrupts users mid-sentence or waits too long to be sure they’re done.
  • ASR endpointing (pause + punctuation): better than raw energy detection, but still a proxy; hesitations and lists often look “finished” when they’re not, and behavior varies wildly across languages.

Both approaches force product teams into a painful latency vs. interruption trade-off: either set a long buffer (safe but robotic) or a short one (fast but rude).

2. NAMO’s Semantic Advantage

NAMO replaces “silence as a proxy” with semantic understanding. The model looks at the text stream from your ASR and predicts whether the thought is complete. This single change brings:

  • Lower floor-transfer time (the gap between the user finishing and the agent replying) without more false cut-offs.
  • Multilingual robustness: one model works across 23+ languages, no per-language tuning.
  • Production latency: ONNX-quantized models run in <30 ms on CPU/GPU with almost no accuracy loss.
  • Observability & tuning: you can get calibrated probabilities and adjust thresholds for “fast vs. safe.”

Namo uses Natural Language Understanding to analyze the semantic meaning and context of speech, distinguishing between:

  • Complete utterances (user is done speaking)
  • Incomplete utterances (user will continue speaking)
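
To make this concrete, here is a minimal standalone sketch of scoring an ASR transcript with the multilingual ONNX checkpoint. It is not the official inference script: the repository id, ONNX file name, and label order are assumptions to verify against the model card.

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Assumed repository id and file name; check the model card for the real ones.
REPO = "videosdk-live/Namo-Turn-Detector-v1-Multilingual"

tokenizer = AutoTokenizer.from_pretrained(REPO)
onnx_path = hf_hub_download(REPO, filename="model_quant.onnx")
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

def p_turn_complete(transcript: str) -> float:
    """Return P(user has finished their thought) for the latest ASR text."""
    enc = tokenizer(transcript, return_tensors="np", truncation=True)
    feed = {i.name: enc[i.name] for i in session.get_inputs() if i.name in enc}
    logits = session.run(None, feed)[0]
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return float(probs[0, 1])  # index 1 assumed to be the "complete" label

# The threshold is the "fast vs. safe" knob: lower it for snappier replies,
# raise it to tolerate longer pauses before the agent jumps in.
for text in ["I was thinking we could", "That works for me, thanks."]:
    action = "respond" if p_turn_complete(text) > 0.5 else "keep listening"
    print(f"{text!r} -> {action}")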

Key Features

  • Semantic Understanding: Analyzes meaning and context, not just silence
  • Ultra-Fast Inference: <19ms for specialized models, <29ms for multilingual
  • Lightweight: ~135MB (specialized) / ~295MB (multilingual)
  • High Accuracy: Up to 97.3% for specialized models, 90.25% average for multilingual
  • Production-Ready: ONNX-optimized for real-time, enterprise-grade applications
  • Easy Integration: Standalone usage or plug-and-play with VideoSDK Agents SDK

3. Performance Benchmarks

Latency & Throughput

With ONNX quantization, inference time drops from 61 ms to 28 ms for the multilingual model and from 38 ms to 14.9 ms for the specialized models.

  • Relative speedup: up to 2.56×
  • Throughput: nearly doubled (35.6 → 66.8 tokens/sec)
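
If you want to reproduce this kind of measurement on your own hardware, the sketch below uses ONNX Runtime's post-training dynamic quantization and a simple timing loop. The file names are placeholders, and the exact quantization recipe behind the released checkpoints may differ.

import time
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Convert FP32 weights to INT8 (post-training, no calibration set required).
quantize_dynamic("namo_fp32.onnx", "namo_int8.onnx", weight_type=QuantType.QInt8)

def mean_latency_ms(path: str, runs: int = 200) -> float:
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    # Dummy int64 batch shaped like a short tokenized utterance (seq length 32 assumed).
    feed = {inp.name: np.ones((1, 32), dtype=np.int64) for inp in sess.get_inputs()}
    sess.run(None, feed)  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs * 1e3

print("fp32:", round(mean_latency_ms("namo_fp32.onnx"), 1), "ms")
print("int8:", round(mean_latency_ms("namo_int8.onnx"), 1), "ms")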

Accuracy Impact

Quantization preserves accuracy:


Confusion matrices show virtually unchanged true- and false-positive rates before and after quantization.
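
You can run the same check on your own labeled utterances by comparing the two graphs' predictions directly, for example with scikit-learn. A sketch, reusing the placeholder ONNX files from above and assuming label 1 means "turn complete":

import onnxruntime as ort
from sklearn.metrics import confusion_matrix
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("videosdk-live/Namo-Turn-Detector-v1-Multilingual")

# Replace with your own evaluation data: (transcript, 1 if complete else 0).
eval_set = [("I was thinking we could", 0), ("That works for me, thanks.", 1)]
labels = [y for _, y in eval_set]

def predict(onnx_path: str) -> list:
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    preds = []
    for text, _ in eval_set:
        enc = tokenizer(text, return_tensors="np", truncation=True)
        feed = {i.name: enc[i.name] for i in sess.get_inputs() if i.name in enc}
        preds.append(int(sess.run(None, feed)[0][0].argmax()))
    return preds

for path in ("namo_fp32.onnx", "namo_int8.onnx"):
    print(path, confusion_matrix(labels, predict(path)))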

Language Coverage

Average multilingual accuracy: 90.25 %
Specialized single-language models: up to 97.3 % (Korean) and 96.8 % (Turkish); Japanese and Hindi exceed 93 %, German reaches 91.9 %.

4. Impact on Real-Time Voice AI

With NAMO you get:

  • Snappier responses without the “one Mississippi” delay.
  • Fewer interruptions when users pause mid-thought.
  • Consistent UX across markets without tuning for each language.
  • Cost-effective scaling — works with any STT and runs efficiently on commodity servers.

5. Model Variants

Namo offers both specialized single-language models and a unified multilingual model.

| Variant | Languages / Focus | Model Size | Latency* | Typical Accuracy |
|---|---|---|---|---|
| Multilingual | 23 languages | ~295 MB | < 29 ms | ~90.25 % (average) |
| Language-Specialized | One language per model | ~135 MB | < 19 ms | Up to 97.3 % |

* Latency measured after quantization on target inference hardware.

Multilingual Model

  • Model: Namo-Turn-Detector-v1-Multilingual
  • Base: mmBERT
  • Languages: All 23 supported languages
  • Inference: <29ms
  • Size: ~295MB
  • Average Accuracy: 90.25%
  • Model Link: Namo Turn Detector v1 - MultiLingual

Performance Benchmarks for Multilingual Model

Evaluated on 25,000+ diverse utterances across all supported languages.

| Language | Accuracy | Precision | Recall | F1 Score | Samples |
|---|---|---|---|---|---|
| 🇹🇷 Turkish | 97.31% | 0.9611 | 0.9853 | 0.9730 | 966 |
| 🇰🇷 Korean | 96.85% | 0.9541 | 0.9842 | 0.9690 | 890 |
| 🇯🇵 Japanese | 94.36% | 0.9099 | 0.9857 | 0.9463 | 834 |
| 🇩🇪 German | 94.25% | 0.9135 | 0.9772 | 0.9443 | 1,322 |
| 🇮🇳 Hindi | 93.98% | 0.9276 | 0.9603 | 0.9436 | 1,295 |
| 🇳🇱 Dutch | 92.79% | 0.8959 | 0.9738 | 0.9332 | 1,401 |
| 🇳🇴 Norwegian | 91.65% | 0.8717 | 0.9801 | 0.9227 | 1,976 |
| 🇨🇳 Chinese | 91.64% | 0.8859 | 0.9608 | 0.9219 | 945 |
| 🇫🇮 Finnish | 91.58% | 0.8746 | 0.9702 | 0.9199 | 1,010 |
| 🇬🇧 English | 90.86% | 0.8507 | 0.9801 | 0.9108 | 2,845 |
| 🇵🇱 Polish | 90.68% | 0.8619 | 0.9568 | 0.9069 | 976 |
| 🇮🇩 Indonesian | 90.22% | 0.8514 | 0.9707 | 0.9071 | 971 |
| 🇮🇹 Italian | 90.15% | 0.8562 | 0.9640 | 0.9069 | 782 |
| 🇩🇰 Danish | 89.73% | 0.8517 | 0.9644 | 0.9045 | 779 |
| 🇵🇹 Portuguese | 89.56% | 0.8410 | 0.9676 | 0.8999 | 1,398 |
| 🇪🇸 Spanish | 88.88% | 0.8304 | 0.9681 | 0.8940 | 1,295 |
| 🇮🇳 Marathi | 88.50% | 0.8762 | 0.9008 | 0.8883 | 774 |
| 🇺🇦 Ukrainian | 87.94% | 0.8164 | 0.9587 | 0.8819 | 929 |
| 🇷🇺 Russian | 87.48% | 0.8318 | 0.9547 | 0.8890 | 1,470 |
| 🇻🇳 Vietnamese | 86.45% | 0.8135 | 0.9439 | 0.8738 | 1,004 |
| 🇸🇦 Arabic | 84.90% | 0.7965 | 0.9439 | 0.8639 | 947 |
| 🇧🇩 Bengali | 79.40% | 0.7874 | 0.7939 | 0.7907 | 1,000 |

Average Accuracy: 90.25% across all languages

Specialized Single-Language Models

  • Architecture: DistilBERT-based
  • Inference: <19ms
  • Size: ~135MB each
| Language | Model Link | Accuracy |
|---|---|---|
| 🇰🇷 Korean | Namo-v1-Korean | 97.3% |
| 🇹🇷 Turkish | Namo-v1-Turkish | 96.8% |
| 🇯🇵 Japanese | Namo-v1-Japanese | 93.5% |
| 🇮🇳 Hindi | Namo-v1-Hindi | 93.1% |
| 🇩🇪 German | Namo-v1-German | 91.9% |
| 🇬🇧 English | Namo-v1-English | 91.5% |
| 🇳🇱 Dutch | Namo-v1-Dutch | 90.0% |
| 🇮🇳 Marathi | Namo-v1-Marathi | 89.7% |
| 🇨🇳 Chinese | Namo-v1-Chinese | 88.8% |
| 🇵🇱 Polish | Namo-v1-Polish | 87.8% |
| 🇳🇴 Norwegian | Namo-v1-Norwegian | 87.3% |
| 🇮🇩 Indonesian | Namo-v1-Indonesian | 87.1% |
| 🇵🇹 Portuguese | Namo-v1-Portuguese | 86.9% |
| 🇮🇹 Italian | Namo-v1-Italian | 86.8% |
| 🇪🇸 Spanish | Namo-v1-Spanish | 86.7% |
| 🇩🇰 Danish | Namo-v1-Danish | 86.5% |
| 🇻🇳 Vietnamese | Namo-v1-Vietnamese | 86.2% |
| 🇫🇷 French | Namo-v1-French | 85.0% |
| 🇫🇮 Finnish | Namo-v1-Finnish | 84.8% |
| 🇷🇺 Russian | Namo-v1-Russian | 84.1% |
| 🇺🇦 Ukrainian | Namo-v1-Ukrainian | 82.4% |
| 🇸🇦 Arabic | Namo-v1-Arabic | 79.7% |
| 🇧🇩 Bengali | Namo-v1-Bengali | 79.2% |

Try It Yourself!

We’ve provided an inference script to help you quickly test these models. Just plug it in and start experimenting!

Integration with VideoSDK Agents

For seamless integration into your voice agent pipeline:

from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download model files (one-time setup)
# For multilingual (default):
pre_download_namo_turn_v1_model()

# For specific language:
# pre_download_namo_turn_v1_model(language="en")

# Initialize turn detector
turn_detector = NamoTurnDetectorV1()  # Multilingual
# turn_detector = NamoTurnDetectorV1(language="en")  # English-specific

# Add to your agent pipeline
from videosdk_agents import CascadingPipeline

pipeline = CascadingPipeline(
    stt=your_stt_service,
    llm=your_llm_service,
    tts=your_tts_service,
    turn_detector=turn_detector  # Namo integration
)

6. Training & Testing

Each model includes Colab notebooks for training and testing:

  • Training Notebooks: Fine-tune models on your own datasets
  • Testing Notebooks: Evaluate model performance on custom data

Visit the individual model pages for notebook links.
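
For a feel of what the training notebooks do, here is a minimal fine-tuning sketch for a two-label (incomplete vs. complete) text classifier. The base checkpoint, column names, and hyperparameters are placeholders, not the recipe used for the released models.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Your own utterances with a 0/1 label (1 = turn complete).
data = Dataset.from_dict({
    "text": ["I was thinking we could", "That works for me, thanks."],
    "label": [0, 1],
})

BASE = "distilbert-base-multilingual-cased"  # placeholder; specialized models are DistilBERT-based
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="namo-finetune", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=data.map(tokenize, batched=True),
)
trainer.train()
trainer.save_model("namo-finetune/final")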

Looking Ahead: Future Directions

  • Multi-party turn-taking detection: deciding when one speaker yields to another.
  • Hybrid signals: combine semantics with prosody, pitch, silence, etc.
  • Adaptive thresholds & confidence models: dynamic sensitivity based on conversation flow.
  • Distilled / edge versions for latency-constrained devices.
  • Continuous learning / feedback loop: let models adapt to usage patterns over time.

Conclusion

NAMO-v1 turns a long-standing bottleneck — turn-taking — into a solved engineering problem. By combining semantic intelligence with real-time speed, it finally allows voice AI systems to feel human, fast, and globally scalable.

Citation

@software{namo2025,
  title={Namo Turn Detector v1: Semantic Turn Detection for Conversational AI},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/collections/videosdk-live/namo-turn-detector-v1-68d52c0564d2164e9d17ca97}
}