Building AI Voice Agents with WebSockets

Step-by-step guide to building AI Voice Agents with WebSockets for voice streaming using VideoSDK.

Introduction to AI Voice Agents in WebSockets for Voice Streaming

What is an AI Voice Agent?

AI Voice Agents are sophisticated software entities capable of understanding and responding to human speech. They leverage advanced technologies like speech-to-text (STT), natural language processing (NLP), and text-to-speech (TTS) to interact with users in a conversational manner.

Why are They Important for the WebSockets for Voice Streaming Industry?

In the realm of voice streaming, AI Voice Agents play a crucial role. They enable real-time interaction and data processing, allowing for seamless communication over WebSockets. This is particularly beneficial in applications like customer support, virtual assistants, and interactive voice response systems.

Core Components of a Voice Agent

  • STT (Speech-to-Text): Converts spoken language into text.
  • LLM (Large Language Model): Processes and understands the text to generate appropriate responses.
  • TTS (Text-to-Speech): Converts text back into spoken language.
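To make the data flow between these components concrete, here is a minimal, self-contained Python sketch of the cascade using stub functions in place of real STT, LLM, and TTS services (the stubs and their return values are purely illustrative, not part of the VideoSDK API):

```python
def stub_stt(audio: bytes) -> str:
    # A real STT service would transcribe an audio stream;
    # here we pretend the "audio" bytes are already text.
    return audio.decode("utf-8")

def stub_llm(text: str) -> str:
    # A real LLM would generate a contextual reply.
    return f"You said: {text}"

def stub_tts(text: str) -> bytes:
    # A real TTS service would synthesize speech audio.
    return text.encode("utf-8")

def cascade(audio_in: bytes) -> bytes:
    """Run one conversational turn through the STT -> LLM -> TTS cascade."""
    transcript = stub_stt(audio_in)
    reply = stub_llm(transcript)
    return stub_tts(reply)

print(cascade(b"hello agent").decode("utf-8"))  # -> You said: hello agent
```

Each stage consumes the previous stage's output, which is exactly the ordering the pipeline below enforces.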

What You'll Build in This Tutorial

In this guide, you will learn how to build an AI Voice Agent using WebSockets for voice streaming. We'll use the VideoSDK framework to implement a fully functional agent capable of real-time interaction. For a detailed walkthrough, refer to the Voice Agent Quick Start Guide.

Architecture and Core Concepts

High-Level Architecture Overview

The AI Voice Agent processes audio input from users, converts it to text using STT, generates a response with an LLM, and finally converts the response back to speech using TTS. All these components are orchestrated in a cascading pipeline, allowing for seamless data flow:
sequenceDiagram
    participant User
    participant Agent
    participant STT
    participant LLM
    participant TTS
    User->>Agent: Speak
    Agent->>STT: Convert Speech to Text
    STT->>Agent: Text
    Agent->>LLM: Process Text
    LLM->>Agent: Response
    Agent->>TTS: Convert Text to Speech
    TTS->>Agent: Speech
    Agent->>User: Respond

Understanding Key Concepts in the VideoSDK Framework

  • Agent: Represents the core entity that handles user interaction.
  • CascadingPipeline: Manages the flow of data through STT, LLM, and TTS.
  • VAD & TurnDetector: These components help the agent determine when to listen and when to speak. For more information, see the AI Voice Agent core components overview.

Setting Up the Development Environment

Prerequisites

Before starting, ensure you have Python 3.11+ installed and a VideoSDK account at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage dependencies:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the agents SDK together with the plugins used in this guide (Deepgram, OpenAI, ElevenLabs, Silero, and the turn detector):
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"

Step 3: Configure API Keys in a .env File

Create a .env file in your project directory and add your VideoSDK key along with the API keys for the STT, LLM, and TTS providers used in this guide:
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
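The SDK and plugins typically read these values from environment variables at startup. If you'd rather not add a dependency such as python-dotenv, a stdlib-only loader can populate the environment; the sketch below uses simplified parsing rules and a hypothetical helper name:

```python
import os

def load_env(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a .env file into os.environ (simplified parser)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # setdefault keeps values already exported in the shell
            os.environ.setdefault(key.strip(), value.strip())

# Example usage (assumes a .env file exists in the working directory):
# load_env()
# print(os.environ.get("VIDEOSDK_API_KEY"))
```

Call it once at the top of your entry script, before any plugin reads its key.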

Building the AI Voice Agent: A Step-by-Step Guide

Let's start by presenting the complete code block that we'll break down in the following sections:
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the turn detector model so the first session doesn't block
pre_download_model()

agent_instructions = """{
  "persona": "WebSockets Streaming Specialist",
  "capabilities": [
    "Explain the concept of WebSockets and how they are used for voice streaming.",
    "Guide users through setting up a WebSocket connection for real-time voice data transmission.",
    "Provide troubleshooting tips for common WebSocket connection issues.",
    "Offer best practices for optimizing WebSocket performance in voice streaming applications."
  ],
  "constraints": [
    "You are not a network engineer and should advise users to consult professional network specialists for complex issues.",
    "Ensure users understand that WebSockets require a stable internet connection for optimal performance.",
    "You cannot provide legal advice on data privacy and should recommend consulting legal experts for compliance matters."
  ]
}"""

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the STT -> LLM -> TTS pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To join a pre-created room, you need a meeting ID; if you omit room_id (as in the code above), a room is created automatically. To generate one yourself, use the following curl command with your VideoSDK auth token:
curl -X POST https://api.videosdk.live/v1/meetings -H "Authorization: YOUR_API_KEY"
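If you prefer Python, the same request can be assembled with the standard library. The sketch below only builds the request object rather than sending it (sending requires a valid API key); the endpoint mirrors the curl command above, and the helper name is illustrative:

```python
import urllib.request

def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a meeting."""
    return urllib.request.Request(
        url="https://api.videosdk.live/v1/meetings",
        method="POST",
        headers={"Authorization": api_key},
    )

req = build_meeting_request("YOUR_API_KEY")
print(req.method, req.full_url)  # -> POST https://api.videosdk.live/v1/meetings
# To actually send it: urllib.request.urlopen(req)
```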

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class defines the behavior of your voice agent. It inherits from the Agent class and specifies actions on entering and exiting a session.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline orchestrates the flow of data through the configured plugins: the Deepgram STT plugin, the OpenAI LLM plugin, and the ElevenLabs TTS plugin.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
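The two threshold parameters control sensitivity: the VAD scores each incoming audio frame and treats scores above 0.35 as speech, while the turn detector waits for at least 0.8 confidence before concluding the user has finished their turn. A toy illustration of threshold-based frame classification (the function and scores are illustrative, not the SileroVAD API):

```python
def classify_frames(speech_scores: list[float], threshold: float = 0.35) -> list[bool]:
    """Mark each audio frame as speech (True) or silence (False)."""
    return [score >= threshold for score in speech_scores]

# Per-frame speech probabilities, as a VAD might produce them
scores = [0.05, 0.10, 0.62, 0.91, 0.88, 0.20, 0.03]
print(classify_frames(scores))  # -> [False, False, True, True, True, False, False]
```

Lowering the threshold makes the agent more eager to treat faint audio as speech; raising it reduces false triggers from background noise.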

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes and manages the agent session. The make_context function sets up the job context, and the main block starts the job.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(...)  # as defined in Step 4.3
    session = AgentSession(agent=agent, pipeline=pipeline, conversation_flow=conversation_flow)
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(name="VideoSDK Cascaded Agent", playground=True)
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
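The `await asyncio.Event().wait()` line blocks forever because nothing ever sets the event, which keeps the session alive until the process is terminated; the `finally` block then guarantees cleanup runs. The same pattern is shown self-contained below with a stub session and a background timer that sets the event so the example actually finishes (all names here are illustrative):

```python
import asyncio

class StubSession:
    """Stand-in for AgentSession, tracking whether close() ran."""
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

async def run(stop: asyncio.Event, session: StubSession):
    try:
        # In the real agent this wait never returns; here a timer
        # sets the event so the example terminates.
        await stop.wait()
    finally:
        await session.close()  # cleanup always runs

async def main() -> bool:
    stop = asyncio.Event()
    session = StubSession()
    # Simulate an external shutdown signal after 10 ms
    asyncio.get_running_loop().call_later(0.01, stop.set)
    await run(stop, session)
    return session.closed

print(asyncio.run(main()))  # -> True
```

The try/finally shape matters: even if `session.start()` or the wait raises, `close()` and `shutdown()` still execute.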

Running and Testing the Agent

Step 5.1: Running the Python Script

To run your agent, execute the script using Python:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, find the playground link in the console output. Use this link to join the session and interact with your agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can enhance your agent's capabilities by integrating custom tools using the function_tool concept, allowing for more tailored interactions.
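VideoSDK's decorator for this is `function_tool` from `videosdk.agents`. The sketch below mimics the idea with a plain stdlib decorator that tags a function with the metadata an LLM needs for tool-calling; the registry, metadata shape, and tool function are illustrative, not the real API:

```python
import inspect

TOOL_REGISTRY = {}

def function_tool(fn):
    """Register a function as a tool the agent's LLM can call (illustrative)."""
    TOOL_REGISTRY[fn.__name__] = {
        "description": inspect.getdoc(fn),
        "parameters": list(inspect.signature(fn).parameters),
    }
    return fn

@function_tool
def check_connection_latency(host: str, samples: int = 3) -> str:
    """Report average round-trip latency to a host (stubbed result)."""
    return f"Average latency to {host} over {samples} samples: 42 ms"

print(TOOL_REGISTRY["check_connection_latency"]["parameters"])  # -> ['host', 'samples']
```

The docstring and parameter names become the schema the LLM sees, so write them as if explaining the tool to the model.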

Exploring Other Plugins

Experiment with different STT, LLM, and TTS plugins available in the VideoSDK framework to optimize your agent's performance. Consider tuning the Silero Voice Activity Detection and the Turn Detector to further refine your agent's interaction capabilities.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file and that you have the necessary permissions.

Audio Input/Output Problems

Verify your audio devices are properly connected and configured. Check the system settings if you encounter issues.

Dependency and Version Conflicts

Use a virtual environment to manage dependencies and avoid conflicts, and keep all packages up to date.

Conclusion

Summary of What You've Built

You've successfully created an AI Voice Agent capable of real-time interaction using WebSockets for voice streaming.

Next Steps and Further Learning

Explore additional features and plugins to enhance your agent further. Consider diving deeper into the VideoSDK documentation for more advanced capabilities.
