Azure TTS API: A Developer's Guide to Text-to-Speech

A comprehensive guide for developers on utilizing the Azure Text to Speech API, covering setup, features, integration, and comparisons with other TTS services.

Introduction to Azure TTS API

The Azure TTS API, a part of Azure Cognitive Services, provides a powerful and versatile solution for converting text into natural-sounding speech. Leveraging advanced neural networks, Azure TTS creates realistic and expressive voices, enabling a wide range of applications from virtual assistants to accessibility tools. It's a robust tool within the Azure Speech Service offering, allowing developers to bring their text to life.

What is Azure TTS API?

Azure TTS API (Azure Text to Speech API) is a cloud-based service that converts written text into spoken audio using advanced machine learning techniques. It employs neural text-to-speech to generate high-quality, natural-sounding voices, enabling applications to audibly communicate information to users. The service is also referred to as Microsoft Azure TTS.

Key Features and Benefits

The Azure TTS API offers numerous benefits:
  • High-Quality Voices: Utilizing neural networks, it produces remarkably natural and human-like speech.
  • Customization: Create custom neural voices tailored to your brand or specific application needs.
  • Multi-Language Support: Supports a wide range of languages and dialects.
  • SSML Support: Offers Speech Synthesis Markup Language (SSML) for fine-grained control over pronunciation, intonation, and other aspects of speech.
  • Scalability: As part of Azure, it provides excellent scalability to handle varying workloads.
  • Integration: SDKs for several programming languages and platforms keep integration straightforward.
  • Azure Cognitive Services Speech: As part of the Azure Speech Service, it can be combined with other Azure Cognitive Services for broader AI capabilities.
These features make the Azure TTS API ideal for applications requiring high-quality and customizable speech output.

Target Audience

This guide is intended for developers who want to integrate text-to-speech functionality into their applications. Whether you're building a virtual assistant, an accessibility tool, or any application requiring synthesized speech, this guide provides the necessary information to get started with the Azure TTS API. It applies to developers working in Python, JavaScript, Node.js, C#, and Java.

Getting Started with Azure TTS API

This section outlines the steps to get up and running with the Azure TTS API, covering everything from setting up your Azure account to making your first API call. Keep the official Azure TTS documentation at hand as you work through it.

Setting up an Azure Account and Speech Service

Before using the Azure TTS API, you need an active Azure subscription. If you don't have one, you can sign up for a free Azure account. Once you have an account, create a Speech Service resource in the Azure portal.
  1. Log in to the Azure portal.
  2. Click on "Create a resource" and search for "Speech".
  3. Select "Speech" and click "Create".
  4. Provide the necessary information, such as your subscription, resource group, region, and resource name.
  5. Choose a pricing tier that suits your needs, keeping in mind Azure TTS pricing.
  6. Click "Review + create" and then "Create".

Obtaining API Keys and Endpoint

After creating the Speech Service resource, you need its key and region (or endpoint) to authenticate your application.

Azure Portal

  1. Go to your Speech Service resource in the Azure portal.
  2. In the "Resource Management" section, click on "Keys and Endpoint".
  3. You will find two keys (KEY 1 and KEY 2) and the Endpoint URL. Keep these values safe, as they are required to access the Azure TTS API.
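One way to keep these values safe is to read them from environment variables instead of writing them into source files. The following is a minimal sketch under that assumption; the variable names SPEECH_KEY and SPEECH_REGION and the helper load_speech_settings are illustrative choices, not names the service requires.

```python
import os


def load_speech_settings(env=None):
    """Return (key, region) read from environment-style variables.

    SPEECH_KEY and SPEECH_REGION are assumed names chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get("SPEECH_KEY", "").strip()
    region = env.get("SPEECH_REGION", "").strip()
    # Report every missing setting at once so the fix is obvious.
    missing = [name for name, value in (("SPEECH_KEY", key), ("SPEECH_REGION", region)) if not value]
    if missing:
        raise RuntimeError("Missing setting(s): " + ", ".join(missing))
    return key, region
```

The returned pair can then be passed wherever the examples in this guide use YOUR_SPEECH_KEY and YOUR_SPEECH_REGION.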

Choosing a Voice and Language

The Azure TTS API offers a variety of voices and languages to choose from. You can select a voice that matches your application's needs. Refer to the Azure TTS documentation for a complete list of supported voices and languages.
When choosing a voice, consider factors such as gender, accent, and speaking style. For example, you might prefer a more formal voice for business applications and a more casual voice for entertainment applications. You can also explore Azure's custom neural voice feature to build a unique brand voice.

Making your First API Call

Here's a simple example of making an API call using Python. Install the Speech SDK first with pip install azure-cognitiveservices-speech.

python

import azure.cognitiveservices.speech as speechsdk

def synthesize_speech(text):
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_SPEECH_REGION")

    # Set the voice name.
    speech_config.speech_synthesis_voice_name = 'en-US-JennyNeural'  # Example voice

    # Send the audio to the default speaker.
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)

    # Create a speech synthesizer from the speech configuration and audio configuration.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

    # Synthesize the given text and wait for the result.
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

    if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized for text [{}]".format(text))
    elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_synthesis_result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))

if __name__ == '__main__':
    text_to_speak = "Hello, world! This is Azure Text-to-Speech."
    synthesize_speech(text_to_speak)

Replace YOUR_SPEECH_KEY and YOUR_SPEECH_REGION with your actual key and region from the Azure portal. This script synthesizes the text "Hello, world! This is Azure Text-to-Speech." and plays it through your default speaker.

Understanding Azure TTS API Parameters

To effectively use the Azure TTS API, it's crucial to understand the various request and response parameters. This section provides a detailed overview of these parameters.

Request Parameters

The request parameters let you control the voice, language, speaking style, and audio format of the synthesized speech. You can set them through the Speech SDK or send them directly to the Azure TTS REST API.
Key request parameters include:
  • text: The text to be synthesized. (Required)
  • voiceName: The name of the voice to use. Example: en-US-JennyNeural
  • outputFormat: The format of the audio output. Examples: audio-16khz-128kbitrate-mono-mp3, riff-24khz-16bit-mono-pcm
  • style: (Applicable for certain voices) Specifies the speaking style. Example: chat, newscast
  • pitch: Adjust the pitch of the voice.
  • rate: Adjust the speaking rate.
  • volume: Adjust the volume of the speech.
In the SDK, the voice and output format are properties of the speech configuration. In a REST call, the request body is an SSML document rather than JSON: the voice, style, pitch, rate, and volume are expressed as SSML elements and attributes, and the output format is sent in the X-Microsoft-OutputFormat header. The following SDK example sets the voice and the output format:

python

import azure.cognitiveservices.speech as speechsdk

def synthesize_speech_with_params(text):
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="YOUR_SPEECH_REGION")

    # Set the voice name.
    speech_config.speech_synthesis_voice_name = 'en-US-JennyNeural'

    # Configure speech synthesis output format.
    speech_config.speech_synthesis_output_format = speechsdk.SpeechSynthesisOutputFormat.Audio16Khz128KBitRateMonoMp3

    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)

    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

    if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized for text [{}]".format(text))
    elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_synthesis_result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))


if __name__ == '__main__':
    text_to_speak = "Hello, world! This is Azure Text-to-Speech with custom parameters."
    synthesize_speech_with_params(text_to_speak)
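
The same voice and output format can also be sent over the REST API without the SDK. The sketch below uses only the Python standard library and assumes the regional endpoint pattern and header names of the Speech service REST API as I understand them; check them against the current documentation before relying on them. The function names and the User-Agent value are hypothetical.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_tts_request(text, key, region,
                      voice="en-US-JennyNeural",
                      output_format="audio-16khz-128kbitrate-mono-mp3"):
    """Return (url, headers, body) for a text-to-speech REST call."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "tts-guide-sketch",  # hypothetical application name
    }
    # The request body is SSML; the text is escaped so it cannot break the XML.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )
    return url, headers, ssml.encode("utf-8")


def synthesize_to_file(text, key, region, path):
    """Send the request and write the returned audio bytes to path."""
    url, headers, body = build_tts_request(text, key, region)
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

Separating request construction from sending makes the request easy to inspect or log (without the key) before any network call is made.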

Response Parameters

The API response typically includes the synthesized audio data in the specified output format. Other response parameters may include:
  • statusCode: The HTTP status code indicating the success or failure of the request.
  • contentType: The content type of the audio data (e.g., audio/mpeg for MP3).
  • contentLength: The size of the audio data in bytes.
  • x-requestid: A unique identifier for the request.
The audio data can be streamed directly to the client or saved to a file for later use.

Error Handling and Troubleshooting

When using the Azure TTS API, you may encounter errors such as invalid API keys, incorrect request parameters, or service outages. Refer to the Azure TTS documentation for a list of common error codes and troubleshooting tips.
Proper error handling is essential for ensuring the reliability of your application. Implement robust error handling mechanisms to gracefully handle errors and provide informative feedback to the user.
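For transient failures such as throttling, one common approach is to retry with exponential backoff. The following is a generic sketch, not part of the Speech SDK: ThrottledError is a hypothetical exception that your own synthesis function would raise when the service answers with HTTP 429 (Too Many Requests), and call_with_backoff is an illustrative helper.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error raised by the caller's synthesis function on HTTP 429."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(); on ThrottledError wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # give up and let the caller report the failure
            # The delay doubles on each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Errors that will not resolve on their own, such as an invalid key or malformed request, should not be retried; surface them to the user or your logs instead.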

Advanced Azure TTS API Features

The Azure TTS API offers several advanced features for customizing and enhancing the speech synthesis process. These features include custom neural voices, SSML support, and batch synthesis.

Custom Neural Voices

One of the most powerful features of the Azure TTS API is the ability to create custom neural voices. This allows you to train a unique voice model based on your own voice data, enabling you to create a brand-specific voice for your applications. This is particularly useful for building consistent brand experiences and creating a unique identity. Creating a custom voice can be complex, so consult the Azure Speech Service documentation for the training requirements and the Azure TTS pricing page for the cost of custom models.

SSML Support (Speech Synthesis Markup Language)

SSML (Speech Synthesis Markup Language) is an XML-based markup language that allows you to control various aspects of the synthesized speech, such as pronunciation, intonation, pitch, rate, and volume. SSML provides fine-grained control over the speech output, enabling you to create more natural and expressive voices. Azure's neural voices benefit most from this control, since speaking styles are selected through SSML.

SSML Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        Here are <emphasis level="strong">three</emphasis> things to know about <break time="300ms"/> Azure TTS.
        <mstts:express-as style="cheerful">
            It's really amazing!
        </mstts:express-as>
    </voice>
</speak>

This example demonstrates how to use SSML to emphasize a word, add a pause, and express a cheerful style.
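When the text comes from users or a database, it helps to build the SSML in code so that special characters are escaped. The sketch below is one way to do this with the Python standard library; build_ssml and its parameters are hypothetical names, and the namespace URIs are copied from the example above.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", style=None, rate=None, pitch=None, lang="en-US"):
    """Return an SSML document wrapping text with optional style and prosody."""
    content = escape(text)  # escape &, < and > so the text cannot break the XML
    prosody = {name: value for name, value in (("rate", rate), ("pitch", pitch)) if value}
    if prosody:
        attrs = " ".join(f"{name}={quoteattr(value)}" for name, value in prosody.items())
        content = f"<prosody {attrs}>{content}</prosody>"
    if style:
        content = f"<mstts:express-as style={quoteattr(style)}>{content}</mstts:express-as>"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{content}</voice>"
        "</speak>"
    )
```

With the Python Speech SDK, the resulting string can be passed to speak_ssml_async in place of speak_text_async. Whether a given style is available depends on the selected voice.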

Batch Synthesis

The Batch Synthesis API lets you submit large amounts of text for asynchronous synthesis: you create a synthesis job and retrieve the audio once processing has finished. This is useful for processing large documents or generating audio for multiple files when a real-time response is not needed, and it avoids making an individual API call for each text segment.

Integrating Azure TTS API into Your Applications

Integrating the Azure TTS API into your applications is relatively straightforward, thanks to the availability of SDKs for various programming languages. This section provides guidance on integrating the API into web, mobile, and desktop applications.
The Azure TTS API provides SDKs for several programming languages, including Python, JavaScript, and C#. Here's a brief overview of integration with each language:

Python

As shown in previous examples, the azure-cognitiveservices-speech package simplifies integration with Python. You can easily synthesize text to speech with just a few lines of code.

JavaScript

For JavaScript, you can use the microsoft-cognitiveservices-speech-sdk package. This allows you to integrate the Azure TTS API into web applications and Node.js applications.

javascript

const sdk = require("microsoft-cognitiveservices-speech-sdk");

function synthesizeSpeech(text) {
    const outputFile = "output.wav";
    const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION");
    speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural";

    // Write the synthesized audio to a WAV file instead of the speaker.
    const audioConfig = sdk.AudioConfig.fromAudioFileOutput(outputFile);

    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

    synthesizer.speakTextAsync(text,
        function (result) {
            if (result.reason === sdk.ResultReason.SynthesizingAudioCompleted) {
                console.log("Speech synthesized to [" + outputFile + "]");
            } else if (result.reason === sdk.ResultReason.Canceled) {
                console.log("Speech synthesis canceled: " + result.errorDetails);
            }
            // Release the synthesizer once the request has finished.
            synthesizer.close();
        },
        function (error) {
            console.log(error);
            synthesizer.close();
        });
}

synthesizeSpeech("Hello, world! This is Azure Text-to-Speech in JavaScript.");

C#

The C# SDK also provides a straightforward way to integrate the Azure TTS API into your applications.

C#

using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

// Place this method inside a class of your application.
public async Task SynthesizeSpeechAsync(string text)
{
    var config = SpeechConfig.FromSubscription("YOUR_SPEECH_KEY", "YOUR_SPEECH_REGION");
    config.SpeechSynthesisVoiceName = "en-US-JennyNeural";

    // Without an audio configuration, the synthesizer plays to the default speaker.
    using (var synthesizer = new SpeechSynthesizer(config))
    {
        var result = await synthesizer.SpeakTextAsync(text);

        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            Console.WriteLine($"Speech synthesized to speaker for text [{text}]");
        }
        else if (result.Reason == ResultReason.Canceled)
        {
            var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
            Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

            if (cancellation.Reason == CancellationReason.Error)
            {
                Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
                Console.WriteLine($"CANCELED: ErrorDetails={cancellation.ErrorDetails}");
                Console.WriteLine("CANCELED: Did you set the speech resource key and region values?");
            }
        }
    }
}

Web Application Integration

For web applications, you can use JavaScript to make API calls to the Azure TTS API directly from the client-side or use a server-side language like Python, Node.js or C# to handle the API calls and stream the audio to the client. Server-side integration is generally preferred for security reasons, as it prevents exposing your API keys to the client.
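If the browser does need to talk to the service directly, a common pattern is for the server to exchange the resource key for a short-lived authorization token and hand only that token to the client. The sketch below builds such a token request with the Python standard library. It assumes the regional issueToken endpoint of the Speech service as I understand it, so verify the URL and the token lifetime in the documentation; the function names are hypothetical.

```python
import urllib.request


def build_token_request(key, region):
    """Return a POST request for the token endpoint; nothing is sent here."""
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the key travels in the header
        headers={"Ocp-Apim-Subscription-Key": key},
        method="POST",
    )


def fetch_token(key, region):
    """Exchange the resource key for a short-lived token (server side only)."""
    with urllib.request.urlopen(build_token_request(key, region), timeout=10) as response:
        return response.read().decode("utf-8")
```

The server would expose the result of fetch_token through its own authenticated endpoint, so the resource key itself never leaves the server.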

Mobile Application Integration

For mobile applications, you can use the native SDKs for Android and iOS or use a cross-platform framework like Xamarin or React Native. The integration process is similar to web applications, but you need to consider factors such as battery life and network connectivity.

Considerations for Scalability and Performance

When integrating the Azure TTS API into your applications, it's essential to consider scalability and performance. Optimize your API calls by using batch synthesis where possible, caching synthesized audio, and choosing appropriate output formats. Also, monitor your usage and adjust your pricing tier as needed.

Azure TTS API Pricing and Limits

Understanding the pricing model and usage limits is crucial for managing the cost of using the Azure TTS API.

Pricing Models and Tiers

The Azure TTS API offers a pay-as-you-go pricing model. The cost is based on the number of characters synthesized. There are different pricing tiers available, depending on your usage volume. Review the Azure TTS pricing page to estimate the cost for your expected usage.
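Because billing is character-based, a rough estimate only needs a character count and a rate. The sketch below is a back-of-the-envelope helper; estimate_cost is a hypothetical name, the rate must be taken from the current pricing page, and the service may count characters (for example SSML markup) differently from Python's len().

```python
def estimate_cost(texts, price_per_million_chars):
    """Return (total_characters, estimated_cost) for an iterable of strings."""
    total_chars = sum(len(text) for text in texts)
    return total_chars, total_chars / 1_000_000 * price_per_million_chars


# Example with a placeholder rate of 16.0 per million characters (not a real price).
chars, cost = estimate_cost(["a" * 500_000, "b" * 250_000], 16.0)
print(chars, cost)
```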

Usage Limits and Throttling

The Azure TTS API has usage limits to prevent abuse and ensure fair usage. These limits may include the number of requests per minute, the number of characters synthesized per month, and the size of the text input. If you exceed these limits, your requests may be throttled. Check the Azure documentation for specific throttling details.
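One way to stay under a per-request text size limit is to split long input at sentence boundaries before synthesizing. The following sketch assumes a simple character limit passed in as max_chars; the actual limit is not stated here and should come from the Azure documentation. split_text is a hypothetical helper.

```python
import re


def split_text(text, max_chars):
    """Split text into chunks of at most max_chars, breaking at sentence ends where possible."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut into fixed-size pieces.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized as a separate request, and the resulting audio segments played or concatenated in order.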

Cost Optimization Strategies

To optimize the cost of using the Azure TTS API, consider the following strategies:
  • Cache synthesized audio to avoid resynthesizing the same text repeatedly.
  • Use batch synthesis to synthesize large amounts of text in a single request.
  • Choose a lower-bitrate output format if high-quality audio is not required, which reduces bandwidth and storage.
  • Monitor your usage and adjust your pricing tier as needed.
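The caching strategy from the list above can be sketched as a small file-based cache. This is an illustration under stated assumptions: cached_synthesize is a hypothetical helper, and synthesize stands for any function of your own that takes (text, voice, output_format) and returns audio bytes, for example a wrapper around the SDK.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_synthesize(text, voice, output_format, synthesize, cache_dir):
    """Return audio bytes, calling synthesize() only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key on everything that changes the audio, not just the text.
    key = "\x00".join((voice, output_format, text)).encode("utf-8")
    path = cache_dir / (hashlib.sha256(key).hexdigest() + ".audio")
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text, voice, output_format)
    path.write_bytes(audio)
    return audio


if __name__ == "__main__":
    # Demonstration with a stand-in synthesizer that returns fake audio bytes.
    def fake_synthesize(text, voice, output_format):
        return b"AUDIO:" + text.encode("utf-8")

    with tempfile.TemporaryDirectory() as directory:
        for _ in range(2):  # the second call is served from the cache
            cached_synthesize("Hello", "en-US-JennyNeural", "mp3", fake_synthesize, directory)
```

Including the voice and output format in the cache key prevents audio generated with one voice from being returned for a request that asks for another.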

Comparison with Other Text-to-Speech APIs

While Azure TTS is a powerful option, it's worth comparing it with other popular Text-to-Speech APIs to determine the best fit for your needs.

Azure TTS vs. Google Cloud Text-to-Speech

Both Azure TTS and Google Cloud Text-to-Speech offer high-quality voices and extensive customization options. Azure TTS is known for its custom neural voice capabilities, while Google Cloud Text-to-Speech is known for its wide range of voices and languages. Evaluate your specific requirements, such as custom voice needs, language support, and the pricing of each service, to make the right choice.

Azure TTS vs. Amazon Polly

Amazon Polly is another popular Text-to-Speech API. Azure TTS generally offers more advanced customization features and natural-sounding voices compared to Amazon Polly. Polly, however, can be more cost-effective for certain use cases. Factors such as voice quality expectations, customization needs, and overall budget should be considered.

Choosing the Right API for Your Needs

When choosing a Text-to-Speech API, consider factors such as voice quality, customization options, language support, pricing, and ease of integration. Evaluate your specific requirements and compare the features and benefits of each API before making a decision. For many, Azure TTS provides a strong balance of cost, customization and voice quality.

Conclusion

The Azure TTS API is a powerful and versatile tool for converting text into natural-sounding speech. With its high-quality voices, customization options, and extensive language support, it's an excellent choice for a wide range of applications. By following the steps in this guide, you can quickly get started with the Azure TTS API and integrate it into your applications. Be sure to check the Azure Speech Service documentation and the Azure TTS examples for more information.
