AssemblyAI

Speech-to-Text to Powerful Outcomes: AI Models to Transcribe and Understand Speech

Overview

AssemblyAI provides Speech AI models for transcribing and understanding speech, used by startups and enterprises worldwide. The platform exposes APIs for Speech-to-Text and for Speech Understanding (Audio Intelligence), which power voice-enabled products. It processes over 600 million inference calls and 3.5 million audio files daily, supports 99+ languages, and offers high accuracy with low streaming latency. Its stated mission is to democratize superhuman Speech AI models and enable new voice data applications for businesses.

How It Works

  • Integrate the API: Connect your application using robust SDKs and thorough documentation.
  • Submit Audio Data: Send pre-recorded or live audio streams for transcription.
  • Select AI Models: Choose models such as Slam-1 for English or Universal for multilingual tasks; Universal-Streaming supports real-time use cases.
  • Enable Advanced Features: Add parameters to utilize features like Speaker Diarization, PII Redaction, or Sentiment Analysis.
  • Receive Processed Data: Obtain accurate transcripts and extracted insights to drive your products.
  • Test and Iterate: Utilize the no-code playground to prototype and refine before full deployment.
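
The steps above can be sketched as a request body for a pre-recorded transcription job. This is a minimal sketch, not official client code: the endpoint URL and the field names (`audio_url`, `speech_model`, `speaker_labels`, `redact_pii`, `redact_pii_policies`, `sentiment_analysis`) are assumptions that should be checked against the current API documentation, and the helper `build_transcript_request` is a hypothetical name used only for illustration.

```python
import json

# Assumed endpoint for submitting pre-recorded audio; verify against the docs.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, speech_model="universal",
                             diarize=False, redact_pii=False, sentiment=False):
    """Build the JSON body for a transcription request (hypothetical helper)."""
    # Steps 2 and 3: point at the audio and pick a model.
    body = {"audio_url": audio_url, "speech_model": speech_model}
    # Step 4: each advanced feature is switched on by an extra parameter.
    if diarize:
        body["speaker_labels"] = True  # assumed field name for Speaker Diarization
    if redact_pii:
        body["redact_pii"] = True
        # Assumed policy names; the available set is defined by the API.
        body["redact_pii_policies"] = ["person_name", "phone_number"]
    if sentiment:
        body["sentiment_analysis"] = True
    return body


if __name__ == "__main__":
    example = build_transcript_request(
        "https://example.com/call.mp3", diarize=True, sentiment=True
    )
    print(json.dumps(example, indent=2))
```

In practice the body would be sent as an authenticated POST to the endpoint and the result retrieved once processing finishes; the sketch stops at building the payload so that it runs without an API key.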

Use Cases

Voice Agents
Build intuitive, human-like voice agents with ultra-low latency Speech-to-Text for real-time, responsive conversations.
Conversational Intelligence
Power best-in-class platforms by extracting insights from customer interactions and accelerating product workflows.
Content Creation & Accessibility
Automatically produce accurate transcripts and subtitles to improve accessibility and discoverability for audio and video content.

Features & Benefits

  • Industry-leading accuracy (>93.3%)
  • Slam-1 Model for English with domain customization
  • Universal Model for 99+ languages and low latency
  • Speaker Diarization
  • Automatic Language Detection
  • Profanity Filtering
  • Custom Vocabulary & Keyterm Prompting
  • Multichannel Transcription
  • Filler Word Filtering
  • Custom Spelling
  • Word Timestamps
  • Auto Punctuation and Casing
  • Inverse Text Normalization (ITN) and Formatting
  • Confidence Scores
  • Word Search in transcripts
  • Export SRT/VTT Captions
  • Export Paragraphs/Sentences
  • Universal-Streaming for real-time transcription
  • End of Turn Detection
  • Unlimited Concurrency
  • LeMUR (LLMs for Speech) integrations
  • Entity Detection
  • Topic Detection
  • Key Phrase Extraction
  • PII Redaction (Text & Audio)
  • Sentiment Analysis
  • Content Moderation
  • Auto Chapters
  • Summarization
  • Developer-preferred SDKs & docs
  • Scalable, enterprise-grade infrastructure
  • Security: GDPR, PCI-DSS, SOC2, BAA for HIPAA, EU residency
  • Weekly new features, production-ready
  • Flexible pricing and volume discounts

Target Audience

  • Developers: Building prototypes or integrating Speech AI via SDKs and documentation.
  • Startups & Enterprises: From emerging companies to Fortune 500s needing reliable Speech-to-Text and Speech Understanding.
  • Product Teams: Building AI-first features or platforms reliant on audio/video data insights.
  • Organizations Scaling AI: Needing secure and flexible Speech AI at high volumes.
  • Research & Academia: Scientists and engineers advancing AI for voice data.
  • Industries: Healthcare, customer service, content creation, media, and market research leveraging voice analysis.

Pricing

  • Free
    • $50 in free credits
    • Ideal for prototyping; access industry-leading models
    • Up to 185 hours pre-recorded or 333 hours streaming audio
    • Developer docs and community support
  • Pay as you go (Most Popular)
    • Streaming Speech-to-Text from $0.15/hr; pre-recorded Slam-1/Universal $0.27/hr
    • LeMUR models and audio intelligence features priced per 1k tokens or per hour
    • Unlimited access, technical support via live chat/email
    • Concurrency starting at 100 streams, scaling up automatically
  • Custom
    • Tailored rates and unlimited scale for high-volume users
    • Dedicated technical support, BAA, EU residency, custom SLAs
    • Early access to new models; self-hosted deployments coming soon
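
The free-tier hour figures follow from dividing the $50 credit by the pay-as-you-go rates listed above. A short sketch of that arithmetic, assuming the credit is spent entirely on one product at the listed starting rate (the function name is illustrative only):

```python
FREE_CREDITS_USD = 50.00
PRE_RECORDED_RATE_USD_PER_HR = 0.27  # Slam-1/Universal, pre-recorded
STREAMING_RATE_USD_PER_HR = 0.15     # streaming, "from" price


def hours_from_credits(credits_usd, rate_usd_per_hr):
    """Return how many hours of audio a credit balance covers at a flat rate."""
    return credits_usd / rate_usd_per_hr


pre_recorded_hours = hours_from_credits(FREE_CREDITS_USD, PRE_RECORDED_RATE_USD_PER_HR)
streaming_hours = hours_from_credits(FREE_CREDITS_USD, STREAMING_RATE_USD_PER_HR)

print(int(pre_recorded_hours))  # 185
print(int(streaming_hours))     # 333
```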

FAQs

What are the differences between Speech-to-Text models?

Universal is a high-accuracy multilingual model for general use, supporting features like diarization. Slam-1 is the most advanced English model, optimized for contextual understanding and domain-specific customization. Universal-Streaming is built for ultra-fast, real-time transcription in voice agents.

Can I sign up for free?

Yes, signing up grants $50 in free credits for API usage. Add a credit card to obtain more credits.

Do you offer volume discounts?

Yes. For high volumes, contact the sales team to discuss volume-based discounts.

How does Universal-Streaming concurrency work?

There is no cap on the number of simultaneous streams. Free accounts start at 5 new streams per minute, and pay-as-you-go accounts start at 100 new streams per minute. The rate scales up with demand and has no upper ceiling.

How does Universal-Streaming session-based pricing work?

You are billed for total session duration (the time the connection stays open), regardless of how much audio is sent during that time. This keeps costs transparent and tied directly to usage.
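
As a rough illustration of session-based billing, the sketch below charges for connection-open time only. It assumes the $0.15/hr starting rate from the pricing section and simple proportional billing with no rounding or minimums; actual invoicing rules may differ, and `session_cost` is a hypothetical helper.

```python
STREAMING_RATE_USD_PER_HR = 0.15  # assumed: the "from" rate in the pricing section


def session_cost(open_seconds, rate_usd_per_hr=STREAMING_RATE_USD_PER_HR):
    """Cost of one streaming session, based only on how long it stayed open."""
    return open_seconds / 3600 * rate_usd_per_hr


# A session held open for 45 minutes costs the same whether the caller
# spoke for 10 minutes or for the full 45, because silence is still billed.
cost = session_cost(45 * 60)
print(f"${cost:.4f}")
```

Under these assumptions, closing the connection when a conversation ends is the main lever for controlling streaming spend.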

How long does it take to process audio and video files?

Most files are processed in under 60 seconds. For example, a 30-minute file may process in 23 seconds with the Universal model, which is roughly 78 times faster than real time.
