Introduction to AI Voice Agents in Conversational AI for Banking
What is an AI Voice Agent?
An AI Voice Agent is a software program that interacts with users by voice: it interprets spoken language and responds conversationally. These agents use Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS) to process speech and generate human-like replies.
Why Are They Important for Conversational AI in Banking?
In the banking industry, AI Voice Agents play a crucial role by providing customers with 24/7 access to banking services, reducing wait times, and enhancing user experience. They can assist with tasks such as checking account balances, answering queries about transactions, and providing information on banking products.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Large Language Model (LLM): Processes the text to understand the request and generate a response.
- Text-to-Speech (TTS): Converts the text response back into spoken language.
What You'll Build in This Tutorial
In this tutorial, you will create a conversational AI voice agent for banking using the VideoSDK framework. This agent will be able to assist users with common banking inquiries and tasks.
Architecture and Core Concepts
High-Level Architecture Overview
The architecture of an AI Voice Agent involves capturing user speech, processing it through several stages, and generating a spoken response. The data flow typically follows these steps:
- User Speech: Captured via microphone.
- Speech-to-Text (STT): Converts speech to text.
- Language Model (LLM): Understands and processes the text.
- Text-to-Speech (TTS): Converts the response text back to speech.
- Response: Delivered back to the user.
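
The data flow above can be sketched in a few lines of Python. This is only an illustration of the chaining, not VideoSDK code: the VoicePipeline class and the fake_stt, fake_llm, and fake_tts functions are hypothetical stand-ins for real services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three stages: audio -> text -> reply text -> audio."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def handle_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # Speech-to-Text
        reply = self.llm(transcript)     # Language model produces the reply
        return self.tts(reply)           # Text-to-Speech


# Hypothetical stand-in stages; real ones would call STT/LLM/TTS services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.handle_turn(b"what is my balance")
print(audio_out)
```

Because each stage is just a callable, any one of them can be swapped without touching the other two, which is the same idea the framework's pipeline builds on.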

Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, handling interactions.
- CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS.
- VAD & TurnDetector: These components help the agent know when to listen and when to speak.
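
To make the listen/speak decision concrete, here is a deliberately simplified sketch. It is not how the VideoSDK plugins work internally (they rely on trained models); the find_end_of_turn function, the per-frame probabilities, and the "three silent frames" rule are all assumptions chosen for illustration. The 0.35 default mirrors the VAD threshold used later in this tutorial.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    speech_probs: Iterable[float],
    vad_threshold: float = 0.35,
    silence_frames_needed: int = 3,
) -> Optional[int]:
    """Return the frame index at which the user's turn is judged complete,
    or None if the user is still speaking or never spoke."""
    heard_speech = False
    silent_run = 0
    for i, prob in enumerate(speech_probs):
        if prob >= vad_threshold:
            # Frame counts as speech: the user is (still) talking.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Silence after speech: count it toward the end-of-turn rule.
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
    return None


# One short pause (index 3) followed by a longer silence.
frames = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1, 0.1, 0.1]
end = find_end_of_turn(frames)
print(end)
```

The short pause at index 3 does not end the turn because speech resumes before three silent frames accumulate. Telling a mid-sentence pause from a finished thought is the problem a turn detector exists to solve.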
Setting Up the Development Environment
Prerequisites
Before starting, ensure you have Python 3.11+ installed and a VideoSDK account at app.videosdk.live.
Step 1: Create a Virtual Environment
Create a virtual environment to manage your project dependencies:

python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Step 2: Install Required Packages
Install the necessary packages using pip:
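
pip install videosdk-agents
pip install videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs

Step 3: Configure API Keys in a .env File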
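Create a .env file in your project directory and add your VideoSDK API key:

VIDEOSDK_API_KEY=your_api_key_here

The Deepgram, OpenAI, and ElevenLabs plugins used in this tutorial call external services, so add each provider's API key to the same file as well.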
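
A small pre-flight check can catch a missing key before the agent starts. The sketch below uses only the standard library. VIDEOSDK_API_KEY comes from this tutorial, while the other three variable names are assumptions about what the provider plugins expect, so check each plugin's documentation and adjust the list.

```python
import os
from pathlib import Path
from typing import Dict, Iterable, List, Mapping


def parse_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the required names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]


# Assumption: only VIDEOSDK_API_KEY is taken from the tutorial; the provider
# variable names below are guesses and may differ from what the plugins read.
REQUIRED = [
    "VIDEOSDK_API_KEY",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
]

# Values already exported in the shell take precedence over the .env file.
combined = {**parse_env_file(".env"), **os.environ}
missing = missing_keys(REQUIRED, combined)
print("Missing keys:", ", ".join(missing) if missing else "none")
```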
Building the AI Voice Agent: A Step-by-Step Guide
To begin, here is the complete code for the AI Voice Agent:

import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Fetch the turn-detector model before any session starts.
pre_download_model()

agent_instructions = "You are a knowledgeable and friendly banking assistant AI designed to help customers with their banking needs. Your primary role is to provide information and assistance related to banking services, such as account balances, recent transactions, loan inquiries, and branch locations. You can also guide users through basic banking procedures and answer frequently asked questions about banking products.\n\nCapabilities:\n1. Provide real-time account information, including balances and recent transactions.\n2. Assist with loan inquiries and provide information on different types of loans available.\n3. Guide users on how to perform basic banking tasks, such as transferring money or setting up direct deposits.\n4. Offer information about branch locations and operating hours.\n5. Answer general questions about banking products and services.\n\nConstraints and Limitations:\n1. You do not have access to personal data beyond what the user provides during the interaction.\n2. You cannot perform transactions or access sensitive account details without explicit user consent and verification.\n3. You must remind users to verify any critical information through official banking channels.\n4. You are not a financial advisor and should not provide investment advice.\n5. Always include a disclaimer that users should contact their bank directly for any urgent or complex issues."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    # Greet the user when the agent joins and say goodbye when it leaves.
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Speech in -> text -> LLM reply -> speech out, gated by VAD and turn detection.
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Block here so the agent keeps running until the process is stopped.
        await asyncio.Event().wait()
    finally:
        # Always release the session and the connection, even on errors.
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID
To create a meeting ID, use the following curl command:

curl -X POST https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
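
If you prefer to stay in Python, the same request can be sketched with the standard library. The URL and headers are copied from the curl command above. The build_meeting_request and create_meeting helpers are hypothetical names, and the sketch assumes the response body is JSON without assuming which fields it contains.

```python
import json
import urllib.request

# Endpoint taken from the curl command in this tutorial.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"


def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request the curl command issues."""
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def create_meeting(api_key: str) -> dict:
    """Send the request and return the decoded body.

    Assumption: the service answers with a JSON object. Inspect the result
    to find the meeting ID rather than relying on a particular field name.
    """
    request = build_meeting_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Building the request is safe to do offline; sending it needs a valid key.
request = build_meeting_request("YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Splitting "build" from "send" keeps the network call in one place, which makes the request easy to inspect when authentication fails.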
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class extends the Agent class, defining the behavior of the voice agent. It passes agent_instructions to the base class to guide interactions, and its on_enter and on_exit methods define what the agent says when a session starts and ends.
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is the backbone of the voice agent, integrating various plugins:
- DeepgramSTT: Converts speech to text using the "nova-2" model.
- OpenAILLM: Processes text with the "gpt-4o" model for understanding and response generation.
- ElevenLabsTTS: Converts text responses back to speech using the "eleven_flash_v2_5" model.
- SileroVAD: Detects voice activity to manage when the agent should listen.
- TurnDetector: Helps determine when the user has finished speaking, so the agent knows when to respond.
Step 4.4: Managing the Session and Startup Logic
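The start_session function initializes the agent, pipeline, and session. It connects to the VideoSDK service, starts the session, and then waits on asyncio.Event().wait() so the agent keeps running until the process is stopped. The finally block closes the session and shuts down the context on the way out. The make_context function builds the JobContext, including the room options (room name and playground mode).
Running and Testing the Agent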
Step 5.1: Running the Python Script
Save the complete script as main.py and run it:

python main.py

Step 5.2: Interacting with the Agent in the Playground
After running the script, find the playground link in the console output. Use this link to join the session and interact with your voice agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can extend the agent's functionality by integrating custom tools using the function_tool feature of the VideoSDK framework.
Exploring Other Plugins
Consider exploring other STT, LLM, and TTS plugins available in the VideoSDK framework to enhance your agent's capabilities.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the .env file and that they have the necessary permissions.
Audio Input/Output Problems
Check your microphone and speaker settings to ensure they are correctly configured and functioning.
Dependency and Version Conflicts
Ensure all dependencies are installed with compatible versions, especially when using a virtual environment.
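
When chasing a version conflict, it helps to see exactly what is installed in the active virtual environment. The sketch below uses the standard library's importlib.metadata. The installed_versions helper is a hypothetical name, and the distribution names in PACKAGES_TO_CHECK are assumptions, so edit the list to match what you actually installed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    report: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Assumption: these names are placeholders; replace them with the packages
# you installed for this tutorial.
PACKAGES_TO_CHECK = ["videosdk-agents", "videosdk"]

for name, version in installed_versions(PACKAGES_TO_CHECK).items():
    print(f"{name}: {version or 'not installed'}")
```

Run it with the virtual environment activated. A package reported as "not installed" usually means the script is running under a different interpreter than the one pip installed into.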
Conclusion
Summary of What You've Built
In this guide, you've built a conversational AI voice agent for banking using the VideoSDK framework, capable of handling common banking inquiries and tasks.
Next Steps and Further Learning
Explore additional features and plugins in the VideoSDK framework to further enhance your AI voice agent's capabilities.
FAQ