Build an AI Voice Assistant for Hospitality

Step-by-step guide to building an AI voice assistant for hospitality using VideoSDK.

Introduction to AI Voice Agents in the Hospitality Industry

What is an AI Voice Agent?

An AI Voice Agent is a software application that uses artificial intelligence to interact with users through voice commands. These agents can understand spoken language, process the information, and respond in a human-like manner. They are designed to perform tasks such as answering questions, providing information, and executing commands, making them highly versatile tools in various industries.

Why are they important for the hospitality industry?

In the hospitality industry, AI Voice Agents enhance guest experiences by providing quick and efficient service. They can assist with booking services, answering questions about hotel amenities, offering local area information, and handling basic customer service inquiries. This not only improves customer satisfaction but also allows staff to focus on more complex tasks.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Model (LLM): Processes the text to understand the intent and generate responses.
  • Text-to-Speech (TTS): Converts the text response back into spoken language.
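
The three components above can be pictured as a chain of functions. The following is only a minimal sketch with stand-in implementations: `fake_stt`, `fake_llm`, `fake_tts`, and `run_turn` are hypothetical names invented for illustration, and none of them call a real speech or language service.

```python
from typing import Callable

# Hypothetical stand-ins for real STT, LLM, and TTS services.
def fake_stt(audio: bytes) -> str:
    """Pretend to transcribe audio; here the 'audio' is just UTF-8 text."""
    return audio.decode("utf-8")

def fake_llm(text: str) -> str:
    """Pretend to reason about the request using a canned rule."""
    if "checkout" in text.lower():
        return "Checkout is at 11 AM."
    return "Let me connect you with the front desk."

def fake_tts(text: str) -> bytes:
    """Pretend to synthesize speech by encoding the reply as bytes."""
    return text.encode("utf-8")

def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str] = fake_stt,
    llm: Callable[[str], str] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> bytes:
    """Run one conversational turn: speech -> text -> reply -> speech."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)
```

Calling `run_turn(b"What time is checkout?")` passes the input through all three stages in order. Because each stage is just a callable, any one of them can be swapped without touching the others, which is the idea behind a cascaded design.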

What You'll Build in This Tutorial

In this tutorial, you will build an AI Voice Assistant tailored for the hospitality industry using the VideoSDK framework. By the end, you will have a functional voice agent capable of interacting with users and providing valuable services.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI Voice Agent involves several components working together to process user input and generate responses. The process begins with capturing the user's speech, which is then converted to text using STT. The text is processed by an LLM to determine the appropriate response, which is then converted back to speech using TTS.

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, responsible for handling interactions.
  • CascadingPipeline: The flow of audio processing from STT to LLM to TTS, ensuring seamless communication. For a detailed explanation, refer to "Cascading pipeline in AI voice Agents".
  • VAD & TurnDetector: Tools that help the agent determine when to listen and when to respond, enhancing interaction efficiency. Learn more in "Turn detector for AI voice Agents".
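
To make the idea of voice activity and turn detection more concrete, here is a deliberately simple sketch. It is not how the VideoSDK plugins work: those rely on trained models, whereas this example swaps in a plain energy threshold and a count of trailing silent frames. The function names and default values are assumptions chosen for illustration.

```python
from typing import Optional, Sequence

def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (samples in [-1, 1])."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0

def is_speech(frame: Sequence[float], threshold: float = 0.01) -> bool:
    """Crude voice-activity check: energy above a fixed threshold."""
    return frame_energy(frame) > threshold

def end_of_turn_index(
    frames: Sequence[Sequence[float]],
    silence_frames: int = 3,
    threshold: float = 0.01,
) -> Optional[int]:
    """Return the index of the frame at which the turn is considered over.

    The turn ends once speech has been heard and is then followed by
    `silence_frames` consecutive non-speech frames. Returns None if that
    never happens.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if is_speech(frame, threshold):
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames:
                return i
    return None
```

Raising `threshold` makes the detector less sensitive to background noise, and raising `silence_frames` makes the agent wait longer before replying. A production agent faces the same trade-offs when tuning its VAD and turn detector settings.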

Setting Up the Development Environment

Prerequisites

To get started, ensure you have Python 3.11+ installed and a VideoSDK account. Sign up at app.videosdk.live to access the necessary API keys.

Step 1: Create a Virtual Environment

Create a virtual environment to manage dependencies:
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Step 2: Install Required Packages

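Install the agents framework together with the plugins used in this tutorial (the code imports from videosdk.agents and videosdk.plugins):
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"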

Step 3: Configure API Keys in a .env file

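Create a .env file in your project directory and add your VideoSDK API key. The pipeline in this tutorial also calls Deepgram, OpenAI, and ElevenLabs, each of which needs its own key. The variable names shown for those providers follow their usual conventions, so confirm them against each plugin's documentation:
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here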
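
Nothing in the script shown in this tutorial explicitly loads the .env file, so depending on how the framework reads its configuration you may need to load it yourself. The sketch below is one minimal way to do that with only the standard library. `load_env_file` is a hypothetical helper, not part of VideoSDK, and it handles only simple KEY=VALUE lines.

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> dict:
    """Read KEY=VALUE lines from a file into os.environ.

    Blank lines and lines starting with '#' are skipped. Variables that
    are already set in the environment are left unchanged.
    Returns the variables that were newly set.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Call `load_env_file()` at the top of your script, before any plugin is constructed. The python-dotenv package offers a more complete `load_dotenv()` function if you prefer a maintained library.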

Building the AI Voice Agent: A Step-by-Step Guide

Here's the complete code for building your AI Voice Agent:
import asyncio

from videosdk.agents import (
    Agent,
    AgentSession,
    CascadingPipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model

# Pre-download the turn detector model before any session starts
pre_download_model()

# System prompt that defines the assistant's role and its limits
agent_instructions = (
    "You are a friendly and efficient AI Voice Assistant designed specifically "
    "for the hospitality industry. Your primary role is to enhance guest "
    "experiences by providing quick and accurate information. You can assist "
    "guests with booking services, answering questions about hotel amenities, "
    "providing local area information, and handling basic customer service "
    "inquiries. However, you must always maintain a polite and professional "
    "tone. You are not authorized to handle financial transactions or provide "
    "personal opinions. Always remind guests to contact the front desk for any "
    "issues that require human intervention or for detailed inquiries beyond "
    "your capabilities. Your goal is to make the guest's stay as pleasant and "
    "seamless as possible while respecting their privacy and security."
)


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Spoken when the session starts
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Spoken when the session ends
        await self.session.say("Goodbye!")


async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8),
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()


def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To generate a meeting ID, use the following curl command:
curl -X POST \
  https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY"

This returns a meeting ID, which you can pass as room_id in RoomOptions so that the agent joins that room.
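
If you would rather create the meeting from Python, the same request can be sketched with the standard library. The URL and header simply mirror the curl command above. The function names are hypothetical, and the example returns the raw JSON because this article does not specify which response field holds the meeting ID.

```python
import json
import urllib.request

# Endpoint copied from the curl command in this step.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"

def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a meeting."""
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )

def create_meeting(api_key: str) -> dict:
    """Send the request and return the decoded JSON response as a dict."""
    request = build_meeting_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `create_meeting(os.environ["VIDEOSDK_API_KEY"])` performs the network call. Print the result once to see where the meeting ID appears in the response.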

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where you define the behavior of your voice agent. It inherits from the Agent class and uses the agent_instructions to guide its interactions. The on_enter and on_exit methods define what the agent says when a session starts and ends.
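
Because the instructions are an ordinary string, one way to adapt the agent to a specific property is to assemble that string from structured data. The sketch below illustrates the idea. `HotelProfile`, `BASE_INSTRUCTIONS`, and `build_instructions` are hypothetical names, and the base prompt here is a shortened paraphrase rather than the tutorial's full text.

```python
from dataclasses import dataclass, field

@dataclass
class HotelProfile:
    """Hypothetical container for property-specific facts."""
    name: str
    checkout_time: str
    amenities: list = field(default_factory=list)

# Shortened paraphrase of the tutorial's prompt, for illustration only.
BASE_INSTRUCTIONS = (
    "You are a friendly and efficient AI Voice Assistant for the hospitality "
    "industry. Maintain a polite and professional tone, do not handle "
    "financial transactions, and refer guests to the front desk for anything "
    "beyond your capabilities."
)

def build_instructions(hotel: HotelProfile) -> str:
    """Append property facts to the base prompt so answers stay specific."""
    amenities = ", ".join(hotel.amenities) if hotel.amenities else "none listed"
    return (
        f"{BASE_INSTRUCTIONS} You work at {hotel.name}. "
        f"Checkout time is {hotel.checkout_time}. "
        f"Available amenities: {amenities}."
    )
```

The resulting string can be passed as the instructions argument in `super().__init__(instructions=...)`, in place of the fixed `agent_instructions` value.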

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is the backbone of the voice agent, connecting the STT, LLM, TTS, VAD, and TurnDetector plugins. Each plugin plays a crucial role:
  • DeepgramSTT: Converts speech to text.
  • OpenAILLM: Processes the text to generate a response.
  • ElevenLabsTTS: Converts the response text back to speech.
  • SileroVAD: Detects voice activity to manage when the agent should listen.
  • TurnDetector: Determines when the agent should respond.

Step 4.4: Managing the Session and Startup Logic

The start_session function creates the agent, conversation flow, and pipeline, then wraps them in an AgentSession, which maintains the flow of interaction for the duration of the call (see "AI voice Agent Sessions"). It connects to the room, starts the session, and waits until the process is terminated; the finally block then closes the session and shuts down the context. The make_context function builds the JobContext, including the room options. The main block creates a WorkerJob with start_session as its entrypoint and starts it.
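
The wait-then-clean-up pattern in start_session can be tried in isolation. The sketch below uses a hypothetical `FakeSession` in place of the real AgentSession, so it shows only the asyncio behavior: the event is never set, the task therefore ends by cancellation, and the finally block still runs.

```python
import asyncio

class FakeSession:
    """Hypothetical stand-in that records lifecycle calls."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")

    async def close(self):
        self.events.append("close")

async def run_until_stopped(session: FakeSession, stop: asyncio.Event) -> None:
    """Start the session, wait for a stop signal, and always clean up."""
    try:
        await session.start()
        await stop.wait()
    finally:
        await session.close()

async def demo() -> list:
    session = FakeSession()
    stop = asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)  # let the task start the session
    task.cancel()           # simulate the process being terminated
    try:
        await task
    except asyncio.CancelledError:
        pass
    return session.events
```

Running `asyncio.run(demo())` returns the recorded calls in order. This shows that cleanup placed in a finally block still executes when the task is cancelled rather than finishing normally.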

Running and Testing the Agent

Step 5.1: Running the Python Script

Save the code above as main.py, then run it:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, the console displays a playground link. Open this link in a browser to interact with your agent. You can speak to the agent and receive responses in real time.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows for extending functionality using custom tools, enabling you to add unique features tailored to specific needs.
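
As an illustration of what a custom tool might do, the sketch below defines a plain async function that looks up amenity opening hours. The data and the function name are hypothetical. How such a function is registered with the agent depends on the framework's tool API, which this article does not show, so consult the VideoSDK documentation for that step.

```python
import asyncio

# Hypothetical data; a real deployment would query the property's systems.
AMENITY_HOURS = {
    "pool": "7 AM to 10 PM",
    "gym": "open 24 hours",
    "spa": "9 AM to 8 PM",
}

async def get_amenity_hours(amenity: str) -> dict:
    """Return opening hours for a hotel amenity.

    A clear docstring and typed parameters are worth keeping, since
    LLM tool-calling schemes commonly build the tool description from them.
    """
    key = amenity.strip().lower()
    if key not in AMENITY_HOURS:
        return {"found": False, "message": f"No hours on file for '{amenity}'."}
    return {"found": True, "amenity": key, "hours": AMENITY_HOURS[key]}
```

Returning a structured result for the not-found case, instead of raising an exception, gives the language model something it can turn into a polite spoken reply.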

Exploring Other Plugins

Explore other plugins for STT, LLM, and TTS to customize your agent further. Options like Cartesia for STT or Google Gemini for LLM can offer different capabilities.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file and that they have the necessary permissions.
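
A quick way to rule out missing keys is to check for them before the agent starts. The helpers below are a hypothetical sketch, not part of VideoSDK. Pass in whichever variable names your setup actually uses.

```python
import os

def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]

def require_env_vars(required, environ=None) -> None:
    """Raise a readable error before any network call is attempted."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

For example, calling `require_env_vars(["VIDEOSDK_API_KEY"])` at the top of the script turns a vague authentication failure later on into an immediate, specific error message.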

Audio Input/Output Problems

Check your microphone and speaker settings to ensure they are correctly configured and functioning.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions, and consider using a virtual environment to manage them.
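
When diagnosing version problems, it helps to print what is actually installed in the active environment. The following sketch uses only the standard library. The function names are hypothetical, and the default minimum version reflects the Python 3.11+ prerequisite stated earlier.

```python
import sys
from importlib import metadata

def python_is_supported(minimum=(3, 11), version_info=None) -> bool:
    """True if the interpreter meets the given minimum (major, minor)."""
    current = sys.version_info if version_info is None else version_info
    return tuple(current[:2]) >= tuple(minimum)

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Pass `installed_versions` the distribution names you installed in Step 2. A value of None usually means the package was installed into a different environment than the one currently active.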

Conclusion

Summary of What You've Built

You have successfully built an AI Voice Assistant for the hospitality industry using the VideoSDK framework. This agent can interact with users, providing valuable services and enhancing guest experiences.

Next Steps and Further Learning

Consider exploring additional plugins and custom tools to expand your agent's capabilities. Stay updated with the latest developments in AI and voice technology to continuously improve your solutions. Additionally, understanding AI voice Agent tracing and observability can help in monitoring and improving the performance of your voice agent.
