Build with VideoSDK’s AI Agents and Get 10,000 Free Minutes!
Integrate voice into your apps with VideoSDK’s AI Agents. Connect your chosen LLMs & TTS. Build once, deploy across all platforms.
Overview
AssemblyAI delivers industry-leading Speech AI models for transcribing and understanding speech, trusted by startups and enterprises worldwide. The platform provides APIs for Speech-to-Text and advanced Speech Understanding (Audio Intelligence) that power voice-enabled products. It processes over 600 million inference calls and 3.5 million audio files daily, supports 99+ languages, and offers low-latency streaming. Its mission is to democratize superhuman Speech AI models and enable new voice data applications for businesses.
How It Works
- Integrate the API: Connect your application using robust SDKs and thorough documentation.
- Submit Audio Data: Send pre-recorded or live audio streams for transcription.
- Select AI Models: Choose models such as Slam-1 for English or Universal for multilingual tasks; Universal-Streaming supports real-time use cases.
- Enable Advanced Features: Add parameters to utilize features like Speaker Diarization, PII Redaction, or Sentiment Analysis.
- Receive Processed Data: Obtain accurate transcripts and extracted insights to drive your products.
- Test and Iterate: Utilize the no-code playground to prototype and refine before full deployment.
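The steps above can be sketched as a small request builder. This is a minimal sketch, not AssemblyAI's official client: it only assembles the URL, headers, and JSON body, and sends nothing over the network. The endpoint and the field names (`audio_url`, `speech_model`, `speaker_labels`, `sentiment_analysis`) are assumptions based on the public REST API and should be checked against the current API reference. `build_transcript_request` is a hypothetical helper name.

```python
import json

# Assumed endpoint for pre-recorded transcription; verify in the API reference.
TRANSCRIPT_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(api_key, audio_url, speech_model=None,
                             speaker_labels=False, sentiment_analysis=False):
    """Assemble (url, headers, body) for a transcription job without sending it."""
    payload = {"audio_url": audio_url}
    if speech_model is not None:
        # Assumed parameter for choosing a model such as Slam-1 or Universal.
        payload["speech_model"] = speech_model
    # Advanced features are switched on by adding boolean parameters.
    for name, enabled in (("speaker_labels", speaker_labels),
                          ("sentiment_analysis", sentiment_analysis)):
        if enabled:
            payload[name] = True
    headers = {"authorization": api_key, "content-type": "application/json"}
    return TRANSCRIPT_URL, headers, json.dumps(payload, sort_keys=True)


url, headers, body = build_transcript_request(
    api_key="YOUR_API_KEY",
    audio_url="https://example.com/call.mp3",
    speaker_labels=True,
    sentiment_analysis=True,
)
print(url)
print(body)
```

Keeping the builder separate from the HTTP call makes it easy to inspect the request while prototyping, before wiring in any HTTP client or SDK.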
Use Cases
Voice Agents
Build intuitive, human-like voice agents with ultra-low latency Speech-to-Text for real-time, responsive conversations.
Conversational Intelligence
Power best-in-class platforms by extracting insights from customer interactions and accelerating product workflows.
Content Creation & Accessibility
Automatically produce accurate transcripts and subtitles to improve accessibility and discoverability for audio and video content.
Features & Benefits
- Industry-leading accuracy (>93.3%)
- Slam-1 Model for English with domain customization
- Universal Model for 99+ languages and low latency
- Speaker Diarization
- Automatic Language Detection
- Profanity Filtering
- Custom Vocabulary & Keyterm Prompting
- Multichannel Transcription
- Filler Word Filtering
- Custom Spelling
- Word Timestamps
- Auto Punctuation and Casing
- Inverse Text Normalization (ITN)/Formatting
- Confidence Scores
- Word Search in transcripts
- Export SRT/VTT Captions
- Export Paragraphs/Sentences
- Universal-Streaming for real-time transcription
- End of Turn Detection
- Unlimited Concurrency
- LeMUR (LLMs for Speech) integrations
- Entity Detection
- Topic Detection
- Key Phrase Extraction
- PII Redaction (Text & Audio)
- Sentiment Analysis
- Content Moderation
- Auto Chapters
- Summarization
- Developer-preferred SDKs & docs
- Scalable, enterprise-grade infrastructure
- Security: GDPR, PCI-DSS, SOC2, BAA for HIPAA, EU residency
- New production-ready features released weekly
- Flexible pricing and volume discounts
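To illustrate how the Word Timestamps and SRT caption export features listed above relate, here is a minimal sketch that turns word-level timestamps into SRT cues. The word shape (`text`, with `start` and `end` in milliseconds) is an assumption for illustration, and `words_to_srt` is a hypothetical helper, not the platform's own export endpoint.

```python
def srt_timestamp(ms):
    """Format milliseconds as the SRT timestamp HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group word-level timestamps into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["text"] for w in chunk)
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)


sample_words = [
    {"text": "Hello", "start": 0, "end": 400},
    {"text": "world", "start": 450, "end": 900},
]
print(words_to_srt(sample_words))
```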
Target Audience
- Developers: Building prototypes or integrating Speech AI via SDKs and documentation.
- Startups & Enterprises: From emerging companies to Fortune 500s needing reliable Speech-to-Text and Speech Understanding.
- Product Teams: Building AI-first features or platforms reliant on audio/video data insights.
- Organizations Scaling AI: Needing secure and flexible Speech AI at high volumes.
- Research & Academia: Scientists and engineers advancing AI for voice data.
- Industries: Healthcare, customer service, content creation, media, and market research leveraging voice analysis.
Pricing
- Free
  - $50 in free credits
  - Ideal for prototyping; access industry-leading models
  - Up to 185 hours pre-recorded or 333 hours streaming audio
  - Developer docs and community support
- Pay as you go (Most Popular)
  - Streaming Speech-to-Text from $0.15/hr; pre-recorded Slam-1/Universal $0.27/hr
  - LeMUR models and audio intelligence features priced per 1k tokens or per hour
  - Unlimited access, technical support via live chat/email
  - Concurrency starting at 100 streams, scaling up automatically
- Custom
  - Tailored rates and unlimited scale for high-volume users
  - Dedicated technical support, BAA, EU residency, custom SLAs
  - Early access to new models; self-hosted deployments coming soon
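The free-tier hour figures follow from dividing the $50 credit by the listed hourly rates. The sketch below reproduces that arithmetic; `hours_for_credit` is a hypothetical helper, and rounding down to whole hours is an assumption about how the listed figures were derived.

```python
def hours_for_credit(credit_usd, rate_usd_per_hour):
    """Whole hours of audio a credit covers, computed in integer cents."""
    credit_cents = round(credit_usd * 100)
    rate_cents = round(rate_usd_per_hour * 100)
    return credit_cents // rate_cents


print(hours_for_credit(50, 0.27))  # pre-recorded at $0.27/hr -> 185
print(hours_for_credit(50, 0.15))  # streaming at $0.15/hr -> 333
```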
FAQs
What are the differences between Speech-to-Text models?
Universal is a high-accuracy multilingual model for general use, supporting features like diarization and streaming. Slam-1 is the most advanced model for English, optimized for contextual understanding and domain-specific customization. Universal-Streaming excels at ultra-fast, real-time streaming for voice agents.
Can I sign up for free?
Yes, signing up grants $50 in free credits for API usage. Add a credit card to obtain more credits.
Do you offer volume discounts?
Yes. For high volumes, contact the sales team to discuss volume-based discounts.
How does Universal-Streaming concurrency work?
Unlimited simultaneous streams are supported. Free accounts start at 5 new streams per minute, and pay-as-you-go accounts start at 100 new streams per minute. The limit scales up with demand, and an unlimited ceiling is available.
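A client can stay under a new-streams-per-minute limit by throttling on its own side. The sketch below is a generic sliding-window limiter and an assumption for illustration only: the source does not describe how the platform enforces or scales this limit, and `NewStreamLimiter` is a hypothetical class.

```python
from collections import deque


class NewStreamLimiter:
    """Client-side sliding-window limit on new streaming sessions per minute."""

    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        self._opened = deque()  # timestamps (seconds) of recent session opens

    def try_open(self, now):
        """Return True and record the open if under the limit at time `now`."""
        # Drop opens that are 60 seconds old or older.
        while self._opened and now - self._opened[0] >= 60:
            self._opened.popleft()
        if len(self._opened) >= self.max_per_minute:
            return False
        self._opened.append(now)
        return True


limiter = NewStreamLimiter(max_per_minute=5)  # free-tier starting limit
results = [limiter.try_open(now=t) for t in (0, 1, 2, 3, 4, 5, 61)]
print(results)  # [True, True, True, True, True, False, True]
```

Passing the clock value in as `now` keeps the limiter deterministic and easy to test; production code would typically pass `time.monotonic()`.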
How does Universal-Streaming session-based pricing work?
You are billed for total session duration (the time the connection is open), regardless of audio activity. This keeps costs usage-based and transparent and gives you control over them.
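As a worked example of session-based billing, the sketch below estimates the cost of a session from its connection-open time. It assumes the $0.15/hr streaming rate from the pricing list and simple proportional billing with no rounding or minimum charge, which the source does not confirm. `session_cost_usd` is a hypothetical helper.

```python
def session_cost_usd(open_seconds, rate_usd_per_hour=0.15):
    """Estimated cost of one streaming session billed on connection-open time."""
    return open_seconds / 3600 * rate_usd_per_hour


# A connection held open for 10 minutes is billed for 10 minutes,
# even if speech was present for only 2 of them.
print(round(session_cost_usd(10 * 60), 4))
```

Because silence is billed like speech under this model, closing the connection when a conversation ends is the main lever for controlling cost.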
How long does it take to process audio and video files?
Most files are processed in under 60 seconds. For example, a 30-minute file may process in 23 seconds with the Universal model.
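The example figure can be expressed as a speed-up over real time. The sketch below only restates the quoted numbers as a ratio; `speedup` is a hypothetical helper.

```python
def speedup(audio_seconds, processing_seconds):
    """How many times faster than real time a file was processed."""
    return audio_seconds / processing_seconds


# 30 minutes of audio processed in 23 seconds.
print(round(speedup(30 * 60, 23), 1))  # 78.3
```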