Open Source AI Voice Agent SDK
Integrate voice into your apps with VideoSDK's AI Agents. Connect your chosen LLMs & TTS. Build once, deploy across all platforms.
Star us on GitHub
Overview
AssemblyAI delivers industry-leading Speech AI models for transcribing and understanding speech, trusted by startups and enterprises worldwide. The platform provides robust APIs for Speech-to-Text and advanced Speech Understanding (Audio Intelligence), powering innovative voice-enabled products. Processing over 600 million inference calls and 3.5 million audio files daily, AssemblyAI demonstrates ultra-high accuracy, scalability, and support for 99+ languages, with ultra-low streaming latency. Their mission is to democratize superhuman Speech AI models, enabling entirely new voice data applications for businesses.
How It Works
- Integrate the API: Connect your application using robust SDKs and thorough documentation.
- Submit Audio Data: Send pre-recorded or live audio streams for transcription.
- Select AI Models: Choose models such as Slam-1 for English or Universal for multilingual tasks; Universal-Streaming supports real-time use cases.
- Enable Advanced Features: Add parameters to utilize features like Speaker Diarization, PII Redaction, or Sentiment Analysis.
- Receive Processed Data: Obtain accurate transcripts and extracted insights to drive your products.
- Test and Iterate: Utilize the no-code playground to prototype and refine before full deployment.
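The steps above can be sketched as a request against the REST API. This is a minimal sketch rather than official client code. The endpoint `https://api.assemblyai.com/v2/transcript`, the `authorization` header, and the field names (`speech_model`, `speaker_labels`, `sentiment_analysis`, `redact_pii`, `redact_pii_policies`) are assumptions based on AssemblyAI's public API and should be checked against the current documentation. The helper names `build_transcript_request` and `submit` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the current AssemblyAI API reference.
TRANSCRIPT_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, speech_model="universal",
                             diarize=False, sentiment=False, pii_policies=None):
    """Build the JSON body for a pre-recorded transcription job (hypothetical helper)."""
    body = {"audio_url": audio_url, "speech_model": speech_model}
    if diarize:
        body["speaker_labels"] = True          # Speaker Diarization
    if sentiment:
        body["sentiment_analysis"] = True      # Sentiment Analysis
    if pii_policies:
        body["redact_pii"] = True              # PII Redaction
        body["redact_pii_policies"] = list(pii_policies)
    return body


def submit(body, api_key):
    """POST the job and return the decoded JSON response (makes a network call)."""
    request = urllib.request.Request(
        TRANSCRIPT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the payload; call submit(payload, api_key) to send it.
    payload = build_transcript_request(
        "https://example.com/meeting.mp3",     # placeholder audio URL
        speech_model="slam-1",
        diarize=True,
        pii_policies=["person_name"],          # assumed policy name
    )
    print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes the feature flags easy to inspect and test before any audio is submitted.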
Use Cases
Voice Agents
Build intuitive, human-like voice agents with ultra-low latency Speech-to-Text for real-time, responsive conversations.
Conversational Intelligence
Power best-in-class platforms by extracting insights from customer interactions and accelerating product workflows.
Content Creation & Accessibility
Automatically produce accurate transcripts and subtitles to improve accessibility and discoverability for audio and video content.
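To make the subtitle use case concrete, here is a small sketch of what SRT captions look like when built from timestamped text. The platform lists SRT/VTT export as a built-in feature, so this is only an illustration of the format. The `(start_ms, end_ms, text)` segment shape and the function names are hypothetical, not AssemblyAI's response schema.

```python
def format_timestamp(ms):
    """Render milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(segments):
    """Turn (start_ms, end_ms, text) tuples into SRT caption text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: sequence number, time range, caption text.
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0, 1500, "Hello and welcome."), (1500, 4000, "Let's get started.")]))
```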
Features & Benefits
- Industry-leading accuracy (>93.3%)
- Slam-1 Model for English with domain customization
- Universal Model for 99+ languages and low latency
- Speaker Diarization
- Automatic Language Detection
- Profanity Filtering
- Custom Vocabulary & Keyterm Prompting
- Multichannel Transcription
- Filler Word Filtering
- Custom Spelling
- Word Timestamps
- Auto Punctuation and Casing
- ITN/Formatting
- Confidence Scores
- Word Search in transcripts
- Export SRT/VTT Captions
- Export Paragraphs/Sentences
- Universal-Streaming for real-time transcription
- End of Turn Detection
- Unlimited Concurrency
- LeMUR (LLMs for Speech) integrations
- Entity Detection
- Topic Detection
- Key Phrase Extraction
- PII Redaction (Text & Audio)
- Sentiment Analysis
- Content Moderation
- Auto Chapters
- Summarization
- Developer-preferred SDKs & docs
- Scalable, enterprise-grade infrastructure
- Security: GDPR, PCI-DSS, SOC2, BAA for HIPAA, EU residency
- Weekly new features, production-ready
- Flexible pricing and volume discounts
Target Audience
- Developers: Building prototypes or integrating Speech AI via SDKs and documentation.
- Startups & Enterprises: From emerging companies to Fortune 500s needing reliable Speech-to-Text and Speech Understanding.
- Product Teams: Building AI-first features or platforms reliant on audio/video data insights.
- Organizations Scaling AI: Needing secure and flexible Speech AI at high volumes.
- Research & Academia: Scientists and engineers advancing AI for voice data.
- Industries: Healthcare, customer service, content creation, media, and market research leveraging voice analysis.
Pricing
- Free
  - $50 in free credits
  - Ideal for prototyping; access industry-leading models
  - Up to 185 hours pre-recorded or 333 hours streaming audio
  - Developer docs and community support
- Pay as you go (Most Popular)
  - Streaming Speech-to-Text from $0.15/hr; pre-recorded Slam-1/Universal $0.27/hr
  - LeMUR models and audio intelligence features priced per 1k tokens or per hour
  - Unlimited access, technical support via live chat/email
  - Concurrency starting at 100 streams, scaling up automatically
- Custom
  - Tailored rates and unlimited scale for high-volume users
  - Dedicated technical support, BAA, EU residency, custom SLAs
  - Early access to new models; self-hosted deployments coming soon
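The free-tier hour figures follow from dividing the $50 credit by the listed hourly rates. The sketch below reproduces that arithmetic. It assumes the free credits are consumed at the pay-as-you-go rates quoted above, which the listing implies but does not state outright.

```python
FREE_CREDIT_USD = 50.00
PRERECORDED_RATE = 0.27  # USD per hour, Slam-1/Universal (from the listing)
STREAMING_RATE = 0.15    # USD per hour, streaming (from the listing)


def hours_covered(credit_usd, rate_per_hour):
    """Hours of audio a credit balance buys at a flat hourly rate."""
    return credit_usd / rate_per_hour


if __name__ == "__main__":
    print(f"Pre-recorded: {hours_covered(FREE_CREDIT_USD, PRERECORDED_RATE):.1f} h")
    print(f"Streaming:    {hours_covered(FREE_CREDIT_USD, STREAMING_RATE):.1f} h")
```

Rounded down, the results match the 185 and 333 hours quoted for the free tier.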
FAQs
What are the differences between Speech-to-Text models?
Universal is a high-accuracy multilingual model for general use, supporting features like diarization and streaming. Slam-1 is the most advanced model for English, optimized for contextual understanding and domain-specific customization. Universal-Streaming is built for ultra-fast, real-time streaming in voice agents.
Can I sign up for free?
Yes, signing up grants $50 in free credits for API usage. Add a credit card to obtain more credits.
Do you offer volume discounts?
Yes. For high volumes, contact the sales team to discuss volume-based discounts.
How does Universal-Streaming concurrency work?
Unlimited simultaneous streams are supported. Free accounts start at 5 new streams per minute; pay-as-you-go accounts start at 100 new streams per minute, and that limit scales up with demand. An unlimited ceiling is available.
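A client can stay under a new-streams-per-minute limit with a simple sliding-window guard. The sketch below assumes the limit counts session starts within a rolling 60-second window. How the service actually enforces it is not described here, and the class name `NewStreamThrottle` is hypothetical.

```python
import time
from collections import deque


class NewStreamThrottle:
    """Client-side guard: allow at most `max_per_minute` session starts per 60 s."""

    def __init__(self, max_per_minute, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock          # injectable so the guard can be tested
        self.starts = deque()       # timestamps of recent session starts

    def try_start(self):
        """Return True and record the start if a new stream may be opened now."""
        now = self.clock()
        # Forget starts that have left the 60-second window.
        while self.starts and now - self.starts[0] >= 60.0:
            self.starts.popleft()
        if len(self.starts) < self.max_per_minute:
            self.starts.append(now)
            return True
        return False


if __name__ == "__main__":
    throttle = NewStreamThrottle(max_per_minute=5)  # free-tier starting limit
    print([throttle.try_start() for _ in range(6)])
```

When `try_start()` returns False, the caller can queue the request and retry after a short delay instead of opening the connection.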
How does Universal-Streaming session-based pricing work?
You are billed on total session duration (the time the connection is open), regardless of audio activity, so costs are transparent and depend directly on how long you keep sessions open.
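As a rough illustration of session-based billing, the cost depends only on how long the connection stays open. This sketch assumes the $0.15/hr streaming rate from the pricing section and simple pro-rata billing. Any rounding or minimum-charge rules are not covered by the source and are ignored here.

```python
STREAMING_RATE_PER_HOUR = 0.15  # USD, "from" rate quoted in the pricing section


def session_cost(open_seconds, rate_per_hour=STREAMING_RATE_PER_HOUR):
    """Cost of one streaming session, billed on connection-open time."""
    return open_seconds / 3600 * rate_per_hour


if __name__ == "__main__":
    # A 10-minute session costs the same whether it carried 2 or 10 minutes of speech.
    print(f"${session_cost(10 * 60):.4f}")
```

Closing idle connections promptly is therefore the main lever for controlling streaming spend.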
How long does it take to process audio and video files?
Most files are processed in under 60 seconds. For example, a 30-minute file may process in 23 seconds with the Universal model.