Build an AI Voice Assistant for Utilities

Step-by-step guide to building an AI voice assistant for the utilities industry using VideoSDK.

Introduction to AI Voice Agents in the Utilities Industry

What is an AI Voice Agent?

An AI Voice Agent is a sophisticated software application designed to interact with users through voice commands, providing responses and performing tasks based on the input. These agents leverage technologies like Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS) to process and respond to user queries effectively.

Why are they important for the Utilities Industry?

In the utilities industry, AI Voice Agents can revolutionize customer service by providing instant responses to queries about billing, service outages, and energy-saving tips. They can guide users through troubleshooting common issues and even schedule service appointments. This automation not only enhances customer satisfaction but also reduces the operational burden on human agents.

Core Components of a Voice Agent

  • STT (Speech-to-Text): Converts spoken language into text for processing.
  • LLM (Large Language Model): Analyzes the input text and generates an appropriate response.
  • TTS (Text-to-Speech): Converts the generated text response back into speech.

What You'll Build in This Tutorial

In this tutorial, we will guide you through building a fully functional AI Voice Assistant tailored for the utilities industry using the VideoSDK framework. You'll learn how to set up the environment, implement the core components, and test your agent.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI Voice Agent involves several key components working in tandem. The user's speech is first captured and converted to text via STT. This text is then processed by an LLM to generate a response, which is finally converted back to speech using TTS. The agent uses VAD (Voice Activity Detection) and Turn Detectors to manage when to listen and respond.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: This is the core class representing your AI Voice Assistant, responsible for handling interactions.
  • CascadingPipeline: A sequence of processes that handle audio input and output, including STT, LLM, and TTS.
  • VAD & TurnDetector: These components ensure the agent knows when to listen and when to speak, improving interaction flow.

Setting Up the Development Environment

Prerequisites

Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account available at app.videosdk.live. This setup is crucial for accessing the necessary APIs and tools.
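To confirm your interpreter meets the requirement before continuing, a quick check from any Python script or REPL can be:

```python
import sys

# The tutorial targets Python 3.11+; fail fast if the interpreter is older.
ok = sys.version_info >= (3, 11)
print("Python version OK" if ok else "Please upgrade to Python 3.11 or newer")
```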

Step 1: Create a Virtual Environment

Creating a virtual environment helps manage dependencies and keep your project organized. Run the following commands in your terminal:
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Step 2: Install Required Packages

Install the VideoSDK package and other necessary plugins using pip:
pip install videosdk
pip install videosdk-plugins-silero
pip install videosdk-plugins-turn_detector
pip install videosdk-plugins-deepgram
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
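As a quick sanity check that the installs succeeded before writing any agent code, you can probe for the packages with the standard library (a stdlib-only sketch; the module names follow the pip packages above):

```python
import importlib.util

def is_installed(module: str) -> bool:
    # find_spec raises ModuleNotFoundError when a parent package is absent,
    # so treat that the same as "not installed".
    try:
        return importlib.util.find_spec(module) is not None
    except ModuleNotFoundError:
        return False

for name in ("videosdk", "videosdk.plugins.silero", "videosdk.plugins.deepgram"):
    print(f"{name}: {'installed' if is_installed(name) else 'missing'}")
```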

Step 3: Configure API Keys in a .env file

Create a .env file in your project directory to securely store your API keys:
VIDEOSDK_API_KEY=your_videosdk_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
OPENAI_API_KEY=your_openai_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
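The agent reads these keys from the environment, so they must be loaded before the script starts. The `python-dotenv` package does this with a single `load_dotenv()` call; as a stdlib-only sketch of the same idea:

```python
import os

def load_env(path: str = ".env") -> None:
    # Minimal .env parser: KEY=value lines, '#' comments and blanks skipped,
    # existing environment variables are not overwritten.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
```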

Building the AI Voice Agent: A Step-by-Step Guide

Let's begin by presenting the complete, runnable code for our AI Voice Assistant:
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model so the first session starts quickly
pre_download_model()

agent_instructions = "You are a knowledgeable AI Voice Assistant specialized in the utilities industry. Your primary role is to assist users with inquiries related to utility services such as electricity, water, and gas. You can provide information on billing, service outages, energy-saving tips, and guide users through troubleshooting common issues. You are capable of scheduling service appointments and directing users to the appropriate department for further assistance. However, you are not a certified technician and must advise users to contact professional services for technical repairs or emergencies. Always ensure to maintain user privacy and comply with data protection regulations."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To create a meeting ID, use the following curl command:
curl -X POST \
  https://api.videosdk.live/v1/rooms \
  -H "Authorization: Bearer YOUR_VIDEOSDK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"Utilities Assistant Meeting"}'
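If you prefer to stay in Python, the same request can be assembled with the standard library (a sketch mirroring the curl command above; the endpoint, header, and body come from that command, and the request is built here but not sent):

```python
import json
import urllib.request

def build_room_request(api_token: str) -> urllib.request.Request:
    # Mirrors the curl command: POST /v1/rooms with a JSON name in the body.
    body = json.dumps({"name": "Utilities Assistant Meeting"}).encode()
    return urllib.request.Request(
        "https://api.videosdk.live/v1/rooms",
        data=body,
        headers={
            "Authorization": "Bearer " + api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Send later with urllib.request.urlopen(build_room_request(token))
# once your real token is in place.
req = build_room_request("YOUR_VIDEOSDK_API_KEY")
print(req.get_method(), req.full_url)
```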

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where we define the behavior of our AI Voice Assistant. It inherits from the Agent class and provides custom responses on session enter and exit.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial as it defines the flow of data through various plugins:
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)
This pipeline integrates STT, LLM, TTS, VAD, and a Turn Detector to process and respond to user inputs effectively.
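Conceptually, the cascade is just each stage feeding its output to the next: audio in, text, response text, audio out. A stand-in sketch with plain callables (for illustration only; the real plugins replace these fakes):

```python
# Each stage of the cascade feeds its output into the next stage.
def run_cascade(audio: bytes, stt, llm, tts) -> bytes:
    text = stt(audio)      # speech -> text
    reply = llm(text)      # text -> response text
    return tts(reply)      # response text -> synthesized speech

# Stand-in stages for illustration only.
fake_stt = lambda audio: "when is my bill due"
fake_llm = lambda text: f"You asked: {text}"
fake_tts = lambda text: text.encode()

print(run_cascade(b"<audio frames>", fake_stt, fake_llm, fake_tts))
```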

Step 4.4: Managing the Session and Startup Logic

The session management and startup logic are handled in the start_session function and the main block:
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

To run the agent, execute the following command in your terminal:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you will find a playground link in the console. Use this link to join the session and interact with your AI Voice Assistant. The agent will respond to your queries as configured.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows you to extend your agent's capabilities by integrating custom tools. This can include additional plugins or custom logic to handle specific tasks.
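As a plain-Python illustration of the pattern (the names here are hypothetical stand-ins, not VideoSDK APIs): a tool is a named function the agent can invoke, such as an outage lookup for a utilities assistant.

```python
# Hypothetical tool registry illustrating how custom logic can plug into an agent.
TOOLS = {}

def tool(fn):
    """Register a function as a callable tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def check_outage(zip_code: str) -> str:
    # Placeholder logic: a real tool would query the utility's outage API.
    return f"No outages currently reported for {zip_code}."

print(TOOLS["check_outage"]("94105"))
```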

Exploring Other Plugins

Beyond the default plugins, you can explore other STT, LLM, and TTS options available in the VideoSDK framework to tailor the agent's performance to your needs.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly set in the .env file and that your account has the necessary permissions.

Audio Input/Output Problems

Check your microphone and speaker settings, and ensure the correct devices are selected in your system preferences.

Dependency and Version Conflicts

Use a virtual environment to manage dependencies and avoid conflicts between different package versions.

Conclusion

Summary of What You've Built

In this tutorial, you have successfully built an AI Voice Assistant tailored for the utilities industry. You learned how to set up the environment, implement core components, and test your agent.

Next Steps and Further Learning

Consider exploring additional features and plugins to enhance your agent's capabilities. Stay updated with the latest advancements in AI and voice technology to continually improve your applications. For more details on AI voice agent deployment, you can explore further resources.
