Build an AI Voice Assistant for Telecom

Step-by-step guide to building an AI voice assistant for the telecom industry using VideoSDK.

Introduction to AI Voice Agents in the Telecom Industry

In today's fast-paced world, AI voice agents have become integral to various industries, including telecom. These agents can handle customer inquiries, provide information, and even troubleshoot common issues, making them invaluable tools for enhancing customer service and operational efficiency.

What is an AI Voice Agent?

An AI Voice Agent is a software application that uses artificial intelligence to interact with users through voice commands. It processes spoken language, understands the intent, and responds appropriately, often mimicking human conversation.

Why are they important for the Telecom Industry?

In the telecom industry, AI voice agents can streamline customer support by handling routine inquiries, assisting with technical support, and offering personalized recommendations. They can operate 24/7, reducing wait times and improving customer satisfaction.

Core Components of a Voice Agent

A typical voice agent relies on several core components:
  • Speech-to-Text (STT): Converts spoken words into text.
  • Large Language Model (LLM): Understands and processes the text to determine the intent.
  • Text-to-Speech (TTS): Converts the response text back into spoken words.
For a comprehensive understanding of these components, refer to the AI voice Agent core components overview.
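To make the cascade concrete, here is a minimal conceptual sketch of one conversational turn. The three functions below are hypothetical stand-ins, not the VideoSDK API; each stage simply feeds its output to the next, which is exactly what the real pipeline automates.

```python
# Conceptual sketch of one cascaded turn (stand-in functions, not VideoSDK APIs)
def speech_to_text(audio: bytes) -> str:
    return "what plans do you offer?"  # stand-in for a real STT engine

def llm_respond(text: str) -> str:
    return f"Here is what I found about: {text}"  # stand-in for an LLM call

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in for a real TTS engine

def handle_turn(audio: bytes) -> bytes:
    """One full turn: user audio in, agent audio out."""
    transcript = speech_to_text(audio)
    reply = llm_respond(transcript)
    return text_to_speech(reply)
```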

What You'll Build in This Tutorial

In this tutorial, we'll guide you through building a fully functional AI voice assistant tailored for the telecom industry using the VideoSDK framework. You can start by following the Voice Agent Quick Start Guide.

Architecture and Core Concepts

High-Level Architecture Overview

Let's explore the architecture of our AI voice agent. The process begins with the user speaking into the system. The speech is converted to text by the STT component, processed by the LLM to generate a response, and then converted back to speech by the TTS component.
```mermaid
sequenceDiagram
    participant User
    participant STT
    participant LLM
    participant TTS
    User->>STT: Speak
    STT->>LLM: Text
    LLM->>TTS: Response
    TTS->>User: Speak
```

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, responsible for managing interactions.
  • CascadingPipeline: Orchestrates the flow of audio processing from STT to LLM to TTS. Learn more about the Cascading pipeline in AI voice Agents.
  • VAD & TurnDetector: These components help the agent determine when to listen and when to speak. Explore the Turn detector for AI voice Agents.

Setting Up the Development Environment

Prerequisites

Before we start, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage your project dependencies:
```sh
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Step 2: Install Required Packages

Install the necessary packages using pip:
```sh
pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs
```
These are the VideoSDK plugin packages that provide the `videosdk.plugins.*` modules imported later in this tutorial.

Step 3: Configure API Keys in a .env File

Create a .env file in your project directory and add your API keys:
```sh
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```
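The agent reads these keys from the environment at startup. If you are not using a loader such as python-dotenv, a minimal hand-rolled loader (a hypothetical helper, not part of the VideoSDK API) can populate `os.environ` from the file:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=value lines; blank lines and '#' comments ignored."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                # setdefault: variables already exported in the shell win
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # fall back to whatever is already in the environment

# Usage: call load_env() before constructing the pipeline
```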

Building the AI Voice Agent: A Step-by-Step Guide

Below is the complete code for our AI voice agent. We'll break it down to understand each part.
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session starts quickly
pre_download_model()

agent_instructions = "You are a knowledgeable AI Voice Assistant specialized in the telecom industry. Your primary role is to assist users with telecom-related inquiries and tasks. You can provide information about telecom services, help troubleshoot common issues, guide users through setting up their devices, and offer insights into telecom plans and offers. However, you are not a certified telecom technician, so you must advise users to contact their service provider for complex technical issues or account-specific inquiries. Always ensure that your responses are clear, concise, and helpful, maintaining a professional and friendly tone."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Step 4.1: Generating a VideoSDK Meeting ID

To interact with your agent, you'll need a meeting ID. Use the following curl command to generate one:
```sh
curl -X POST https://api.videosdk.live/v1/meetings -H "Authorization: Bearer YOUR_VIDEOSDK_API_KEY"
```
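The same request can be built with Python's standard library if you prefer to script meeting creation. The endpoint and auth header below mirror the curl command in this guide; the token is a placeholder, and the shape of the JSON response depends on the API version you are using.

```python
import urllib.request

API_KEY = "YOUR_VIDEOSDK_API_KEY"  # placeholder - substitute your real key

# Build the POST request with the same endpoint and header as the curl command
req = urllib.request.Request(
    "https://api.videosdk.live/v1/meetings",
    method="POST",
    headers={"Authorization": f"Bearer {API_KEY}"},
)

# with urllib.request.urlopen(req) as resp:   # uncomment to actually send
#     print(resp.read().decode())             # inspect the returned meeting ID
```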

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where we define the agent's behavior. It inherits from the Agent class and specifies what the agent should say when a session starts and ends.
```python
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")
```

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial for processing audio. It connects the STT, LLM, and TTS plugins, ensuring smooth data flow. For more details, check out the Deepgram STT Plugin for voice agent, OpenAI LLM Plugin for voice agent, and ElevenLabs TTS Plugin for voice agent.
```python
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
```

Step 4.4: Managing the Session and Startup Logic

The start_session function manages the lifecycle of the agent's session, while make_context sets up the environment for the agent to operate in. For more information on sessions, refer to AI voice Agent Sessions.
```python
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

Running and Testing the Agent

Step 5.1: Running the Python Script

To start your AI voice agent, run the script using Python:
```sh
python main.py
```

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you'll receive a link to the VideoSDK playground in the console. Open it in your browser to interact with your agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows you to extend your agent's functionality with custom tools, enabling more complex interactions and integrations.
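The exact registration API is documented by VideoSDK; as a framework-agnostic sketch of the underlying idea (the `tool` registry and `dispatch` helper below are hypothetical, not the VideoSDK API), a tool is just a named Python function the LLM can ask the agent to run with structured arguments:

```python
import json
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function so the agent can invoke it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def check_data_balance(phone_number: str) -> str:
    # Hypothetical lookup; a real agent would query a billing system here
    return json.dumps({"phone_number": phone_number, "remaining_gb": 4.2})

def dispatch(tool_call: dict) -> str:
    """Run the tool the LLM asked for with the arguments it supplied."""
    return TOOLS[tool_call["name"]](**tool_call["arguments"])
```

In a real deployment the LLM emits the `tool_call` payload, the framework dispatches it, and the string result is fed back into the conversation before the agent speaks.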

Exploring Other Plugins

Beyond the plugins used in this tutorial, VideoSDK supports various STT, LLM, and TTS options, allowing you to tailor your agent's capabilities to your specific needs. See the Silero Voice Activity Detection plugin for tuning how the agent detects when a user is speaking.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly set in the .env file and that they are valid.
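A quick sanity check at startup can catch this class of error before any provider call fails. The helper below is a hypothetical pre-flight check, not part of the VideoSDK API; it simply reports which of this tutorial's required keys are unset or empty:

```python
import os

REQUIRED_KEYS = ["VIDEOSDK_API_KEY", "DEEPGRAM_API_KEY",
                 "OPENAI_API_KEY", "ELEVENLABS_API_KEY"]

def missing_keys(env=os.environ) -> list[str]:
    """Return the names of required keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Usage: call before starting the agent and abort with a clear message
# if missing_keys() returns anything.
```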

Audio Input/Output Problems

Check your microphone and speaker settings if you encounter issues with audio input or output.

Dependency and Version Conflicts

Ensure all dependencies are installed and compatible with your Python version.

Conclusion

Summary of What You've Built

Congratulations! You've built a fully functional AI voice assistant tailored for the telecom industry.

Next Steps and Further Learning

Explore additional plugins and features in the VideoSDK framework to enhance your agent's capabilities further.
