Turn-taking — the ability to know exactly when a user has finished speaking — is the invisible force behind natural human conversation. Yet most voice agents today rely on Voice Activity Detection (VAD) or fixed silence timers, leading to premature cut-offs or long, robotic pauses.
We introduce NAMO Turn Detector v1 (NAMO-v1), an open-source, ONNX-optimized semantic turn detector that predicts conversational boundaries by understanding meaning, not just silence. NAMO achieves <19 ms inference for specialized single-language models, <29 ms for multilingual, and up to 97.3% accuracy — making it the first practical drop-in replacement for VAD in real-time voice systems.
1. Why Existing Approaches Break Down
Most deployed voice agents use one of two approaches:
- Silence-based VAD: very fast and lightweight but either interrupts users mid-sentence or waits too long to be sure they’re done.
- ASR endpointing (pause + punctuation): better than raw energy detection, but still a proxy; hesitations and lists often look “finished” when they’re not, and behavior varies wildly across languages.
Both approaches force product teams into a painful latency vs. interruption trade-off: either set a long buffer (safe but robotic) or a short one (fast but rude).
2. NAMO’s Semantic Advantage
NAMO replaces “silence as a proxy” with semantic understanding. The model looks at the text stream from your ASR and predicts whether the thought is complete. This single change brings:
- Lower floor-transfer time (the gap before the agent starts replying): snappier responses without more false cut-offs.
- Multilingual robustness: one model works across 23+ languages, no per-language tuning.
- Production latency: ONNX-quantized models run in <30 ms on CPU/GPU with almost no accuracy loss.
- Observability & tuning: you can get calibrated probabilities and adjust thresholds for “fast vs. safe.”
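For example, the "fast vs. safe" knob can be a thin policy layer over the model's probability. The sketch below is our own construction rather than a documented SDK feature, and the thresholds are illustrative:

```python
# Hypothetical "fast vs. safe" endpointing policy on top of a semantic
# completeness probability. Thresholds are illustrative, not tuned values.
FAST_THRESHOLD = 0.90   # reply immediately above this confidence
SAFE_TIMEOUT_S = 0.8    # short fallback silence window for ambiguous cases

def should_respond(p_complete: float, silence_s: float) -> bool:
    if p_complete >= FAST_THRESHOLD:
        return True   # semantically done: reply now, no timer needed
    if p_complete >= 0.5 and silence_s >= SAFE_TIMEOUT_S:
        return True   # probably done, and the user has gone quiet
    return False      # keep listening
```

Raising FAST_THRESHOLD trades a little latency for fewer interruptions; shortening SAFE_TIMEOUT_S does the opposite.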
NAMO uses natural language understanding (NLU) to analyze the semantic meaning and context of speech, distinguishing between:
- Complete utterances (user is done speaking)
- Incomplete utterances (user will continue speaking)
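Concretely, the detector behaves like a text classifier: feed it the running ASR transcript and read back a completeness probability. Here is a minimal scoring sketch with ONNX Runtime; the repo ID combines the org and model name linked in this post, while the model.onnx filename and the [incomplete, complete] label order are assumptions to verify against the model card:

```python
# Minimal end-of-turn scoring sketch. The model.onnx filename and the
# [incomplete, complete] label order are assumptions; check the model card.
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

REPO = "videosdk-live/Namo-Turn-Detector-v1-Multilingual"
tokenizer = AutoTokenizer.from_pretrained(REPO)
session = ort.InferenceSession(hf_hub_download(REPO, "model.onnx"))

def p_turn_complete(transcript: str) -> float:
    enc = tokenizer(transcript, return_tensors="np")
    # Feed only the inputs this ONNX graph actually declares.
    feeds = {i.name: enc[i.name] for i in session.get_inputs() if i.name in enc}
    logits = session.run(None, feeds)[0]
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return float((exp / exp.sum(axis=-1, keepdims=True))[0, 1])

print(p_turn_complete("So what I wanted to ask was"))   # low -> keep listening
print(p_turn_complete("That's everything, thanks!"))    # high -> respond
```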
Key Features
- Semantic Understanding: Analyzes meaning and context, not just silence
- Ultra-Fast Inference: <19ms for specialized models, <29ms for multilingual
- Lightweight: ~135MB (specialized) / ~295MB (multilingual)
- High Accuracy: Up to 97.3% for specialized models, 90.25% average for multilingual
- Production-Ready: ONNX-optimized for real-time, enterprise-grade applications
- Easy Integration: Standalone usage or plug-and-play with VideoSDK Agents SDK
3. Performance Benchmarks
Latency & Throughput
Using ONNX quantization, NAMO cuts inference latency from 61 ms to 28 ms (multilingual) and from 38 ms to 14.9 ms (specialized).
- Relative speedup: up to 2.56×
- Throughput: nearly doubled (35.6 → 66.8 tokens/sec)
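The post doesn't publish the exact quantization recipe; dynamic INT8 weight quantization in ONNX Runtime is a standard way to get this class of CPU speedup, sketched here with hypothetical file names:

```python
# Dynamic INT8 quantization with ONNX Runtime (file names hypothetical).
# Weights are stored as 8-bit integers; activations stay in float.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="namo_turn_detector.onnx",        # FP32 export
    model_output="namo_turn_detector.int8.onnx",  # quantized model
    weight_type=QuantType.QInt8,
)
```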
Accuracy Impact
Quantization preserves accuracy: confusion matrices show virtually unchanged true- and false-positive rates before and after quantization.
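You can run the same sanity check on your own traffic; the labels below are toy values, not the benchmark data:

```python
# Toy check that the quantized model makes the same accept/reject
# decisions as the FP32 model. All labels here are made up.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]   # 1 = utterance complete
fp32   = [1, 0, 1, 1, 0, 1]   # FP32 predictions (toy)
int8   = [1, 0, 1, 1, 0, 1]   # INT8 predictions (toy)

print(confusion_matrix(y_true, fp32))  # rows: actual, cols: predicted
print(confusion_matrix(y_true, int8))  # identical -> no accuracy loss
```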
Language Coverage
Average multilingual accuracy: 90.25%
Best specialized single-language models: 97.3% (Korean) and 96.8% (Turkish), with Japanese and Hindi above 93%.
4. Impact on Real-Time Voice AI
With NAMO you get:
- Snappier responses without the “one Mississippi” delay.
- Fewer interruptions when users pause mid-thought.
- Consistent UX across markets without tuning for each language.
- Cost-effective scaling — works with any STT and runs efficiently on commodity servers.
5. Model Variants
NAMO offers both specialized single-language models and a unified multilingual model:
Variant | Languages / Focus | Model Size | Latency* | Typical Accuracy |
---|---|---|---|---|
Multilingual | 23 languages | ~295 MB | < 29 ms | ~90.25% (average) |
Language-Specialized | One language per model | ~135 MB | < 19 ms | Up to 97.3% |
* Latency measured after quantization on target inference hardware.
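If you serve several markets, a small routing helper can encode this trade-off. The mapping below is illustrative; the model IDs follow the naming used in this post but should be checked against the Hub:

```python
# Illustrative model routing (not SDK API): prefer the smaller, faster
# specialized checkpoint when the caller's language is known and covered.
SPECIALIZED_LANGS = {"en", "tr", "ko", "ja", "de", "hi"}  # partial list

def choose_model(lang: str | None) -> str:
    if lang in SPECIALIZED_LANGS:
        return f"Namo-Turn-Detector-v1-{lang}"    # ~135 MB, <19 ms
    return "Namo-Turn-Detector-v1-Multilingual"   # ~295 MB, <29 ms, 23 languages
```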
Multilingual Model (Recommended):
- Model: Namo-Turn-Detector-v1-Multilingual
- Base: mmBERT
- Languages: All 23 supported languages
- Inference: <29ms
- Size: ~295MB
- Average Accuracy: 90.25%
- Model Link: Namo Turn Detector v1 - Multilingual
Performance Benchmarks for Multilingual Model
Evaluated on 25,000+ diverse utterances across all supported languages.
Language | Accuracy | Precision | Recall | F1 Score | Samples |
---|---|---|---|---|---|
🇹🇷 Turkish | 97.31% | 0.9611 | 0.9853 | 0.9730 | 966 |
🇰🇷 Korean | 96.85% | 0.9541 | 0.9842 | 0.9690 | 890 |
🇯🇵 Japanese | 94.36% | 0.9099 | 0.9857 | 0.9463 | 834 |
🇩🇪 German | 94.25% | 0.9135 | 0.9772 | 0.9443 | 1,322 |
🇮🇳 Hindi | 93.98% | 0.9276 | 0.9603 | 0.9436 | 1,295 |
🇳🇱 Dutch | 92.79% | 0.8959 | 0.9738 | 0.9332 | 1,401 |
🇳🇴 Norwegian | 91.65% | 0.8717 | 0.9801 | 0.9227 | 1,976 |
🇨🇳 Chinese | 91.64% | 0.8859 | 0.9608 | 0.9219 | 945 |
🇫🇮 Finnish | 91.58% | 0.8746 | 0.9702 | 0.9199 | 1,010 |
🇬🇧 English | 90.86% | 0.8507 | 0.9801 | 0.9108 | 2,845 |
🇵🇱 Polish | 90.68% | 0.8619 | 0.9568 | 0.9069 | 976 |
🇮🇩 Indonesian | 90.22% | 0.8514 | 0.9707 | 0.9071 | 971 |
🇮🇹 Italian | 90.15% | 0.8562 | 0.9640 | 0.9069 | 782 |
🇩🇰 Danish | 89.73% | 0.8517 | 0.9644 | 0.9045 | 779 |
🇵🇹 Portuguese | 89.56% | 0.8410 | 0.9676 | 0.8999 | 1,398 |
🇪🇸 Spanish | 88.88% | 0.8304 | 0.9681 | 0.8940 | 1,295 |
🇮🇳 Marathi | 88.50% | 0.8762 | 0.9008 | 0.8883 | 774 |
🇺🇦 Ukrainian | 87.94% | 0.8164 | 0.9587 | 0.8819 | 929 |
🇷🇺 Russian | 87.48% | 0.8318 | 0.9547 | 0.8890 | 1,470 |
🇻🇳 Vietnamese | 86.45% | 0.8135 | 0.9439 | 0.8738 | 1,004 |
🇸🇦 Arabic | 84.90% | 0.7965 | 0.9439 | 0.8639 | 947 |
🇧🇩 Bengali | 79.40% | 0.7874 | 0.7939 | 0.7907 | 1,000 |
Average Accuracy: 90.25% across all languages
Specialized Single-Language Models
- Architecture: DistilBERT-based
- Inference: <19ms
- Size: ~135MB each
Language | Model Link | Accuracy |
---|---|---|
🇰🇷 Korean | Namo-v1-Korean | 97.3% |
🇹🇷 Turkish | Namo-v1-Turkish | 96.8% |
🇯🇵 Japanese | Namo-v1-Japanese | 93.5% |
🇮🇳 Hindi | Namo-v1-Hindi | 93.1% |
🇩🇪 German | Namo-v1-German | 91.9% |
🇬🇧 English | Namo-v1-English | 91.5% |
🇳🇱 Dutch | Namo-v1-Dutch | 90.0% |
🇮🇳 Marathi | Namo-v1-Marathi | 89.7% |
🇨🇳 Chinese | Namo-v1-Chinese | 88.8% |
🇵🇱 Polish | Namo-v1-Polish | 87.8% |
🇳🇴 Norwegian | Namo-v1-Norwegian | 87.3% |
🇮🇩 Indonesian | Namo-v1-Indonesian | 87.1% |
🇵🇹 Portuguese | Namo-v1-Portuguese | 86.9% |
🇮🇹 Italian | Namo-v1-Italian | 86.8% |
🇪🇸 Spanish | Namo-v1-Spanish | 86.7% |
🇩🇰 Danish | Namo-v1-Danish | 86.5% |
🇻🇳 Vietnamese | Namo-v1-Vietnamese | 86.2% |
🇫🇷 French | Namo-v1-French | 85.0% |
🇫🇮 Finnish | Namo-v1-Finnish | 84.8% |
🇷🇺 Russian | Namo-v1-Russian | 84.1% |
🇺🇦 Ukrainian | Namo-v1-Ukrainian | 82.4% |
🇸🇦 Arabic | Namo-v1-Arabic | 79.7% |
🇧🇩 Bengali | Namo-v1-Bengali | 79.2% |
Try It Yourself!
We’ve provided an inference script to help you quickly test these models. Just plug it in and start experimenting!
- Hugging Face Models: https://huggingface.co/videosdk-live/models
- Github Repo Link: https://github.com/videosdk-live/NAMO-Turn-Detector-v1/tree/main
Integration with VideoSDK Agents
For seamless integration into your voice agent pipeline:
```python
from videosdk_agents import (CascadingPipeline, NamoTurnDetectorV1,
                             pre_download_namo_turn_v1_model)

# Download model files (one-time setup)
pre_download_namo_turn_v1_model()                 # multilingual (default)
# pre_download_namo_turn_v1_model(language="en")  # or a specific language

# Initialize the turn detector
turn_detector = NamoTurnDetectorV1()                 # multilingual
# turn_detector = NamoTurnDetectorV1(language="en")  # English-specific

# Add it to your agent pipeline
pipeline = CascadingPipeline(
    stt=your_stt_service,
    llm=your_llm_service,
    tts=your_tts_service,
    turn_detector=turn_detector,  # Namo integration
)
```
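As we read the cascading design, the turn detector sits between STT and the LLM: each transcript is scored for completeness before the agent commits to a reply. Passing language="en" selects the smaller specialized checkpoint, trading language coverage for the lower latency quoted above.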
6. Training & Testing
Each model includes Colab notebooks for training and testing:
- Training Notebooks: Fine-tune models on your own datasets
- Testing Notebooks: Evaluate model performance on custom data
Visit the individual model pages for notebook links.
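For orientation, here is a minimal fine-tuning sketch. It is not the official notebook, just a standard sequence-classification setup on a DistilBERT backbone (the architecture the specialized models are described as using), with a two-example toy dataset:

```python
# Minimal fine-tuning sketch (not the official notebook): binary
# end-of-turn classification on a DistilBERT backbone. Toy data only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

data = Dataset.from_dict({
    "text": ["Could you book the", "Book a table for two tonight."],
    "label": [0, 1],  # 0 = incomplete, 1 = complete
})

name = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=64)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="namo-ft", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data.map(tokenize, batched=True),
).train()
```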
Looking Ahead: Future Directions
- Multi-party turn-taking detection: deciding when one speaker yields to another.
- Hybrid signals: combine semantics with prosody, pitch, silence, etc.
- Adaptive thresholds & confidence models: dynamic sensitivity based on conversation flow.
- Distilled / edge versions for latency-constrained devices.
- Continuous learning / feedback loop: let models adapt to usage patterns over time.
Conclusion
NAMO-v1 turns a long-standing bottleneck, turn-taking, into a tractable engineering problem. By combining semantic intelligence with real-time speed, it lets voice AI systems feel human, fast, and globally scalable.
Citation
```bibtex
@software{namo2025,
  title     = {Namo Turn Detector v1: Semantic Turn Detection for Conversational AI},
  author    = {{VideoSDK Team}},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/collections/videosdk-live/namo-turn-detector-v1-68d52c0564d2164e9d17ca97}
}
```
}