Turn-taking — the ability to know exactly when a user has finished speaking — is the invisible force behind natural human conversation. Yet most voice agents today rely on Voice Activity Detection (VAD) or fixed silence timers, leading to premature cut-offs or long, robotic pauses.

We introduce NAMO Turn Detector v1 (NAMO-v1), an open-source, ONNX-optimized semantic turn detector that predicts conversational boundaries by understanding meaning, not just silence. NAMO achieves <19 ms inference for specialized single-language models, <29 ms for multilingual, and up to 97.3 % accuracy, making it a practical drop-in replacement for VAD-based endpointing in real-time voice systems.

1. Why Existing Approaches Break Down

Most deployed voice agents use one of two approaches:

  • Silence-based VAD: very fast and lightweight but either interrupts users mid-sentence or waits too long to be sure they’re done.
  • ASR endpointing (pause + punctuation): better than raw energy detection, but still a proxy; hesitations and lists often look “finished” when they’re not, and behavior varies wildly across languages.

Both approaches force product teams into a painful latency vs. interruption trade-off: either set a long buffer (safe but robotic) or a short one (fast but rude).

2. NAMO’s Semantic Advantage

NAMO replaces “silence as a proxy” with semantic understanding. The model looks at the text stream from your ASR and predicts whether the thought is complete. This single change brings:

  • Lower floor-transfer time (the gap between the user finishing and the agent replying) without more false cut-offs.
  • Multilingual robustness: one model works across 23+ languages, no per-language tuning.
  • Production latency: ONNX-quantized models run in <30 ms on CPU/GPU with almost no accuracy loss.
  • Observability & tuning: you can get calibrated probabilities and adjust thresholds for “fast vs. safe.”

Namo uses Natural Language Understanding to analyze the semantic meaning and context of speech, distinguishing between:

  • Complete utterances (user is done speaking)
  • Incomplete utterances (user will continue speaking)
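
To make this concrete, here is a minimal standalone sketch of scoring an ASR transcript with the multilingual ONNX checkpoint. It is not the official inference script: the repository id, ONNX file name, and label order are assumptions to verify against the model card.

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Assumed repository id and file name; check the model card for the real ones.
REPO = "videosdk-live/Namo-Turn-Detector-v1-Multilingual"

tokenizer = AutoTokenizer.from_pretrained(REPO)
onnx_path = hf_hub_download(REPO, filename="model_quant.onnx")
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

def p_turn_complete(transcript: str) -> float:
    """Return P(user has finished their thought) for the latest ASR text."""
    enc = tokenizer(transcript, return_tensors="np", truncation=True)
    feed = {i.name: enc[i.name] for i in session.get_inputs() if i.name in enc}
    logits = session.run(None, feed)[0]
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return float(probs[0, 1])  # index 1 assumed to be the "complete" label

# The threshold is the "fast vs. safe" knob: lower it for snappier replies,
# raise it to tolerate longer pauses before the agent jumps in.
for text in ["I was thinking we could", "That works for me, thanks."]:
    action = "respond" if p_turn_complete(text) > 0.5 else "keep listening"
    print(f"{text!r} -> {action}")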

Key Features

  • Semantic Understanding: Analyzes meaning and context, not just silence
  • Ultra-Fast Inference: <19ms for specialized models, <29ms for multilingual
  • Lightweight: ~135MB (specialized) / ~295MB (multilingual)
  • High Accuracy: Up to 97.3% for specialized models, 90.25% average for multilingual
  • Production-Ready: ONNX-optimized for real-time, enterprise-grade applications
  • Easy Integration: Standalone usage or plug-and-play with VideoSDK Agents SDK

3. Performance Benchmarks

Latency & Throughput

With ONNX quantization, inference time drops from 61 ms to 28 ms for the multilingual model and from 38 ms to 14.9 ms for the specialized models.

  • Relative speedup: up to 2.56×
  • Throughput: nearly doubled (35.6 → 66.8 tokens/sec)
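
If you want to reproduce this kind of measurement on your own hardware, the sketch below uses ONNX Runtime's post-training dynamic quantization and a simple timing loop. The file names are placeholders, and the exact quantization recipe behind the released checkpoints may differ.

import time
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Convert FP32 weights to INT8 (post-training, no calibration set required).
quantize_dynamic("namo_fp32.onnx", "namo_int8.onnx", weight_type=QuantType.QInt8)

def mean_latency_ms(path: str, runs: int = 200) -> float:
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    # Dummy int64 batch shaped like a short tokenized utterance (seq length 32 assumed).
    feed = {inp.name: np.ones((1, 32), dtype=np.int64) for inp in sess.get_inputs()}
    sess.run(None, feed)  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs * 1e3

print("fp32:", round(mean_latency_ms("namo_fp32.onnx"), 1), "ms")
print("int8:", round(mean_latency_ms("namo_int8.onnx"), 1), "ms")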

Accuracy Impact

Quantization preserves accuracy:


Confusion matrices show virtually unchanged true- and false-positive rates before and after quantization.
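
You can run the same check on your own labeled utterances by comparing the two graphs' predictions directly, for example with scikit-learn. A sketch, reusing the placeholder ONNX files from above and assuming label 1 means "turn complete":

import onnxruntime as ort
from sklearn.metrics import confusion_matrix
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("videosdk-live/Namo-Turn-Detector-v1-Multilingual")

# Replace with your own evaluation data: (transcript, 1 if complete else 0).
eval_set = [("I was thinking we could", 0), ("That works for me, thanks.", 1)]
labels = [y for _, y in eval_set]

def predict(onnx_path: str) -> list:
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    preds = []
    for text, _ in eval_set:
        enc = tokenizer(text, return_tensors="np", truncation=True)
        feed = {i.name: enc[i.name] for i in sess.get_inputs() if i.name in enc}
        preds.append(int(sess.run(None, feed)[0][0].argmax()))
    return preds

for path in ("namo_fp32.onnx", "namo_int8.onnx"):
    print(path, confusion_matrix(labels, predict(path)))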

Language Coverage

Average multilingual accuracy: 90.25 %
Specialized single-language models: up to 97.3 % (Korean) and 96.8 % (Turkish); Japanese and Hindi exceed 93 %, German reaches 91.9 %.

4. Impact on Real-Time Voice AI

With NAMO you get:

  • Snappier responses without the “one Mississippi” delay.
  • Fewer interruptions when users pause mid-thought.
  • Consistent UX across markets without tuning for each language.
  • Cost-effective scaling — works with any STT and runs efficiently on commodity servers.

5. Model Variants

Namo offers both specialized single-language models and a unified multilingual model.

| Variant | Languages / Focus | Model Size | Latency* | Typical Accuracy |
|---|---|---|---|---|
| Multilingual | 23 languages | ~295 MB | < 29 ms | ~90.25 % (average) |
| Language-Specialized | One language per model | ~135 MB | < 19 ms | Up to 97.3 % |

* Latency measured after quantization on target inference hardware.

Multilingual Model

  • Model: Namo-Turn-Detector-v1-Multilingual
  • Base: mmBERT
  • Languages: All 23 supported languages
  • Inference: <29ms
  • Size: ~295MB
  • Average Accuracy: 90.25%
  • Model Link: Namo Turn Detector v1 - MultiLingual

Performance Benchmarks for Multilingual Model

Evaluated on 25,000+ diverse utterances across all supported languages.

| Language | Accuracy | Precision | Recall | F1 Score | Samples |
|---|---|---|---|---|---|
| 🇹🇷 Turkish | 97.31% | 0.9611 | 0.9853 | 0.9730 | 966 |
| 🇰🇷 Korean | 96.85% | 0.9541 | 0.9842 | 0.9690 | 890 |
| 🇯🇵 Japanese | 94.36% | 0.9099 | 0.9857 | 0.9463 | 834 |
| 🇩🇪 German | 94.25% | 0.9135 | 0.9772 | 0.9443 | 1,322 |
| 🇮🇳 Hindi | 93.98% | 0.9276 | 0.9603 | 0.9436 | 1,295 |
| 🇳🇱 Dutch | 92.79% | 0.8959 | 0.9738 | 0.9332 | 1,401 |
| 🇳🇴 Norwegian | 91.65% | 0.8717 | 0.9801 | 0.9227 | 1,976 |
| 🇨🇳 Chinese | 91.64% | 0.8859 | 0.9608 | 0.9219 | 945 |
| 🇫🇮 Finnish | 91.58% | 0.8746 | 0.9702 | 0.9199 | 1,010 |
| 🇬🇧 English | 90.86% | 0.8507 | 0.9801 | 0.9108 | 2,845 |
| 🇵🇱 Polish | 90.68% | 0.8619 | 0.9568 | 0.9069 | 976 |
| 🇮🇩 Indonesian | 90.22% | 0.8514 | 0.9707 | 0.9071 | 971 |
| 🇮🇹 Italian | 90.15% | 0.8562 | 0.9640 | 0.9069 | 782 |
| 🇩🇰 Danish | 89.73% | 0.8517 | 0.9644 | 0.9045 | 779 |
| 🇵🇹 Portuguese | 89.56% | 0.8410 | 0.9676 | 0.8999 | 1,398 |
| 🇪🇸 Spanish | 88.88% | 0.8304 | 0.9681 | 0.8940 | 1,295 |
| 🇮🇳 Marathi | 88.50% | 0.8762 | 0.9008 | 0.8883 | 774 |
| 🇺🇦 Ukrainian | 87.94% | 0.8164 | 0.9587 | 0.8819 | 929 |
| 🇷🇺 Russian | 87.48% | 0.8318 | 0.9547 | 0.8890 | 1,470 |
| 🇻🇳 Vietnamese | 86.45% | 0.8135 | 0.9439 | 0.8738 | 1,004 |
| 🇸🇦 Arabic | 84.90% | 0.7965 | 0.9439 | 0.8639 | 947 |
| 🇧🇩 Bengali | 79.40% | 0.7874 | 0.7939 | 0.7907 | 1,000 |

Average Accuracy: 90.25% across all languages

Specialized Single-Language Models

  • Architecture: DistilBERT-based
  • Inference: <19ms
  • Size: ~135MB each
| Language | Model Link | Accuracy |
|---|---|---|
| 🇰🇷 Korean | Namo-v1-Korean | 97.3% |
| 🇹🇷 Turkish | Namo-v1-Turkish | 96.8% |
| 🇯🇵 Japanese | Namo-v1-Japanese | 93.5% |
| 🇮🇳 Hindi | Namo-v1-Hindi | 93.1% |
| 🇩🇪 German | Namo-v1-German | 91.9% |
| 🇬🇧 English | Namo-v1-English | 91.5% |
| 🇳🇱 Dutch | Namo-v1-Dutch | 90.0% |
| 🇮🇳 Marathi | Namo-v1-Marathi | 89.7% |
| 🇨🇳 Chinese | Namo-v1-Chinese | 88.8% |
| 🇵🇱 Polish | Namo-v1-Polish | 87.8% |
| 🇳🇴 Norwegian | Namo-v1-Norwegian | 87.3% |
| 🇮🇩 Indonesian | Namo-v1-Indonesian | 87.1% |
| 🇵🇹 Portuguese | Namo-v1-Portuguese | 86.9% |
| 🇮🇹 Italian | Namo-v1-Italian | 86.8% |
| 🇪🇸 Spanish | Namo-v1-Spanish | 86.7% |
| 🇩🇰 Danish | Namo-v1-Danish | 86.5% |
| 🇻🇳 Vietnamese | Namo-v1-Vietnamese | 86.2% |
| 🇫🇷 French | Namo-v1-French | 85.0% |
| 🇫🇮 Finnish | Namo-v1-Finnish | 84.8% |
| 🇷🇺 Russian | Namo-v1-Russian | 84.1% |
| 🇺🇦 Ukrainian | Namo-v1-Ukrainian | 82.4% |
| 🇸🇦 Arabic | Namo-v1-Arabic | 79.7% |
| 🇧🇩 Bengali | Namo-v1-Bengali | 79.2% |

Try It Yourself!

We’ve provided an inference script to help you quickly test these models. Just plug it in and start experimenting!

Integration with VideoSDK Agents

For seamless integration into your voice agent pipeline:

from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download model files (one-time setup)
# For multilingual (default):
pre_download_namo_turn_v1_model()

# For specific language:
# pre_download_namo_turn_v1_model(language="en")

# Initialize turn detector
turn_detector = NamoTurnDetectorV1()  # Multilingual
# turn_detector = NamoTurnDetectorV1(language="en")  # English-specific

# Add to your agent pipeline
from videosdk_agents import CascadingPipeline

pipeline = CascadingPipeline(
    stt=your_stt_service,
    llm=your_llm_service,
    tts=your_tts_service,
    turn_detector=turn_detector  # Namo integration
)

6. Training & Testing

Each model includes Colab notebooks for training and testing:

  • Training Notebooks: Fine-tune models on your own datasets
  • Testing Notebooks: Evaluate model performance on custom data

Visit the individual model pages for notebook links.
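
For a feel of what the training notebooks do, here is a minimal fine-tuning sketch for a two-label (incomplete vs. complete) text classifier. The base checkpoint, column names, and hyperparameters are placeholders, not the recipe used for the released models.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Your own utterances with a 0/1 label (1 = turn complete).
data = Dataset.from_dict({
    "text": ["I was thinking we could", "That works for me, thanks."],
    "label": [0, 1],
})

BASE = "distilbert-base-multilingual-cased"  # placeholder; specialized models are DistilBERT-based
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="namo-finetune", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=data.map(tokenize, batched=True),
)
trainer.train()
trainer.save_model("namo-finetune/final")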

Looking Ahead: Future Directions

  • Multi-party turn-taking detection: deciding when one speaker yields to another.
  • Hybrid signals: combine semantics with prosody, pitch, silence, etc.
  • Adaptive thresholds & confidence models: dynamic sensitivity based on conversation flow.
  • Distilled / edge versions for latency-constrained devices.
  • Continuous learning / feedback loop: let models adapt to usage patterns over time.

Conclusion

NAMO-v1 turns a long-standing bottleneck — turn-taking — into a solved engineering problem. By combining semantic intelligence with real-time speed, it finally allows voice AI systems to feel human, fast, and globally scalable.

Citation

@software{namo2025,
  title={Namo Turn Detector v1: Semantic Turn Detection for Conversational AI},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/collections/videosdk-live/namo-turn-detector-v1-68d52c0564d2164e9d17ca97}
}