Conversational AI for Banking: Build a Voice Agent

Step-by-step guide to building a conversational AI voice agent for banking using VideoSDK.

Introduction to AI Voice Agents in Conversational AI for Banking

What is an AI Voice Agent?

An AI Voice Agent is a software program that interacts with users by voice: it interprets spoken language and responds conversationally. These agents combine Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS) to process speech and generate human-like replies.

Why are they important for the banking industry?

In the banking industry, AI Voice Agents play a crucial role by providing customers with 24/7 access to banking services, reducing wait times, and enhancing user experience. They can assist with tasks such as checking account balances, answering queries about transactions, and providing information on banking products.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Language Models (LLM): Processes the text to understand and generate responses.
  • Text-to-Speech (TTS): Converts the text response back into spoken language.

What You'll Build in This Tutorial

In this tutorial, you will create a conversational AI voice agent for banking using the VideoSDK framework. This agent will be able to assist users with common banking inquiries and tasks.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI Voice Agent involves capturing user speech, processing it through several stages, and generating a response. The data flow typically follows these steps:
  1. User Speech: Captured via microphone.
  2. Speech-to-Text (STT): Converts speech to text.
  3. Language Model (LLM): Understands and processes the text.
  4. Text-to-Speech (TTS): Converts the response text back to speech.
  5. Response: Delivered back to the user.
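
The five steps above can be sketched as a chain of three functions. This is a minimal illustration, not the VideoSDK implementation: `CascadeStub`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical names, and the fake stages only pass text through so the data flow is visible without any external service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeStub:
    """Toy cascade: each stage is a plain function standing in for a real service."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def handle_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)    # step 2: speech -> text
        reply_text = self.llm(transcript)  # step 3: text -> response text
        return self.tts(reply_text)        # step 4: response text -> speech


# Hypothetical placeholder stages; real ones would call STT, LLM, and TTS providers.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    if "balance" in text.lower():
        return "I can help you check your balance."
    return "Could you tell me more about what you need?"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadeStub(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.handle_turn(b"What is my balance?")
print(reply_audio)
```

Keeping each stage behind a plain callable is one way to make the stages swappable, which is the same idea the plugin-based pipeline later in this guide relies on.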

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, handling interactions.
  • Cascading Pipeline: Manages the flow of audio processing from STT to LLM to TTS.
  • VAD & TurnDetector: These components help the agent know when to listen and when to speak.
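
To make the listen/speak decision concrete, here is a deliberately simple stand-in. It is not how SileroVAD or TurnDetector work (both are model-based). It uses a plain loudness threshold, and the names `frame_rms`, `is_speech`, and `end_of_turn` are hypothetical. The `threshold` here is an amplitude level, whereas the thresholds passed to the plugins later in this guide are settings of those plugins.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float]) -> float:
    """Root-mean-square loudness of one audio frame (samples in the range -1..1)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples: Sequence[float], threshold: float = 0.35) -> bool:
    """Crude voice activity check: loud enough counts as speech."""
    return frame_rms(samples) >= threshold


def end_of_turn(frames: List[Sequence[float]], silence_frames: int = 3,
                threshold: float = 0.35) -> bool:
    """True once speech was heard and the last `silence_frames` frames are all quiet."""
    if len(frames) <= silence_frames:
        return False
    heard_speech = any(is_speech(f, threshold) for f in frames[:-silence_frames])
    trailing_silence = all(not is_speech(f, threshold) for f in frames[-silence_frames:])
    return heard_speech and trailing_silence


loud = [0.8, -0.8, 0.8, -0.8]
quiet = [0.01, -0.01, 0.01, -0.01]
print(end_of_turn([loud, loud, quiet, quiet, quiet]))
```

The split mirrors the two roles in the list above: one check decides whether a frame contains speech, and a second decides whether the speaker has finished.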

Setting Up the Development Environment

Prerequisites

Before starting, ensure you have Python 3.11+ installed and a VideoSDK account at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage your project dependencies:
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip:
pip install videosdk

Step 3: Configure API Keys in a .env File

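Create a .env file in your project directory and add your VideoSDK API key. The Deepgram, OpenAI, and ElevenLabs plugins used later in this guide also need credentials for those services; check each plugin's documentation for the variable name it expects.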
VIDEOSDK_API_KEY=your_api_key_here
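
This guide does not show how the .env file is read at runtime. One possible approach, sketched below with only the standard library, is a small loader that copies `KEY=VALUE` lines into the process environment. `load_env_file` is a hypothetical helper, not part of VideoSDK, and it handles only simple lines (no multi-line values or variable expansion).

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines and copy them into os.environ.

    Variables already set in the environment are left untouched.
    Returns the pairs found in the file.
    """
    loaded: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


if __name__ == "__main__":
    found = load_env_file()
    print("Loaded keys:", sorted(found))
```

Calling `load_env_file()` at the top of the agent script, before any plugin is constructed, would make the key available through `os.environ`.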

Building the AI Voice Agent: A Step-by-Step Guide

To begin, here is the complete code for the AI Voice Agent:
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from typing import AsyncIterator

# Download the turn detector model before the agent starts.
pre_download_model()

agent_instructions = "You are a knowledgeable and friendly banking assistant AI designed to help customers with their banking needs. Your primary role is to provide information and assistance related to banking services, such as account balances, recent transactions, loan inquiries, and branch locations. You can also guide users through basic banking procedures and answer frequently asked questions about banking products.\n\nCapabilities:\n1. Provide real-time account information, including balances and recent transactions.\n2. Assist with loan inquiries and provide information on different types of loans available.\n3. Guide users on how to perform basic banking tasks, such as transferring money or setting up direct deposits.\n4. Offer information about branch locations and operating hours.\n5. Answer general questions about banking products and services.\n\nConstraints and Limitations:\n1. You do not have access to personal data beyond what the user provides during the interaction.\n2. You cannot perform transactions or access sensitive account details without explicit user consent and verification.\n3. You must remind users to verify any critical information through official banking channels.\n4. You are not a financial advisor and should not provide investment advice.\n5. Always include a disclaimer that users should contact their bank directly for any urgent or complex issues."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Greet the user when the session starts.
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Say goodbye when the session ends.
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # STT -> LLM -> TTS, with VAD and turn detection controlling when to listen and speak.
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the agent running until the process is stopped.
        await asyncio.Event().wait()
    finally:
        # Always release the session and the connection.
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To create a meeting ID, use the following curl command:
curl -X POST https://api.videosdk.live/v1/meetings \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"
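
If you prefer to make the same call from Python, the request can be built with the standard library. This is a sketch that mirrors the curl command above: the URL and headers are copied from it, `build_meeting_request` is a hypothetical helper name, and the shape of the response is not covered here. The code only constructs the request; sending it is left as a commented line.

```python
import urllib.request

# Endpoint exactly as given in the curl command above.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"


def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a meeting."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_meeting_request("YOUR_API_KEY")
print(request.get_method(), request.full_url)

# To actually send it (requires network access and a valid key):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```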

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class, defining the behavior of the voice agent. It uses the agent_instructions to guide interactions and defines actions on entering and exiting a session.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is the backbone of the voice agent, integrating various plugins:
  • DeepgramSTT: Converts speech to text using the "nova-2" model.
  • OpenAILLM: Processes text with the "gpt-4o" model for understanding and response generation.
  • ElevenLabsTTS: Converts text responses back to speech.
  • SileroVAD: Detects voice activity to manage when the agent should listen.
  • TurnDetector: Helps determine when the agent should speak.

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent, pipeline, and session. It connects to the VideoSDK service and starts the session, running indefinitely until manually stopped. The make_context function sets up the session's context, including room options.
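
The try/finally shape of start_session is what ensures cleanup runs when the agent stops. The sketch below isolates that pattern with a stand-in session so it runs without VideoSDK. `StubSession`, `run_until_stopped`, and `demo` are hypothetical names, and a stoppable `asyncio.Event` replaces the never-set event used in the real script.

```python
import asyncio
from typing import List


class StubSession:
    """Hypothetical stand-in that records lifecycle calls instead of doing real work."""

    def __init__(self) -> None:
        self.events: List[str] = []

    async def start(self) -> None:
        self.events.append("start")

    async def close(self) -> None:
        self.events.append("close")


async def run_until_stopped(session: StubSession, stop: asyncio.Event) -> None:
    try:
        await session.start()
        await stop.wait()      # blocks here, like asyncio.Event().wait() in the script
    finally:
        await session.close()  # runs on normal exit, on error, and on cancellation


async def demo() -> List[str]:
    session = StubSession()
    stop = asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)     # give the task a chance to start and block
    stop.set()                 # simulate a manual stop
    await task
    return session.events


events = asyncio.run(demo())
print(events)
```

In the real script the event is never set, so the wait ends only when the process is interrupted; the finally block is what closes the session and shuts down the context in that case.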

Running and Testing the Agent

Step 5.1: Running the Python Script

Execute the script using:
python main.py

Step 5.2: Interacting with the Agent in the Playground

After running the script, find the playground link in the console output. Use this link to join the session and interact with your voice agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend the agent's functionality by integrating custom tools using the function_tool feature of the VideoSDK framework.
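
As an illustration of what such a tool might contain, here is a plain async function that looks up branch opening hours. Everything in it is hypothetical: `get_branch_hours`, the `BRANCH_HOURS` table, and the branch names are placeholders. How the function is registered through function_tool is not shown here and should be taken from the VideoSDK documentation.

```python
import asyncio
from typing import Dict

# Hypothetical placeholder data, for illustration only.
BRANCH_HOURS: Dict[str, str] = {
    "downtown": "Mon-Fri 09:00-17:00",
    "airport": "Mon-Sat 08:00-20:00",
}


async def get_branch_hours(branch: str) -> dict:
    """Look up opening hours for a branch by name (case-insensitive)."""
    key = branch.strip().lower()
    if key not in BRANCH_HOURS:
        return {"found": False, "message": f"No branch named '{branch}' is on file."}
    return {"found": True, "branch": key, "hours": BRANCH_HOURS[key]}


result = asyncio.run(get_branch_hours("Downtown"))
print(result)
```

Returning a small dictionary with an explicit `found` flag gives the language model an unambiguous result to phrase, including the case where nothing matched.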

Exploring Other Plugins

Consider exploring other STT, LLM, and TTS plugins available in the VideoSDK framework to enhance your agent's capabilities.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file and that they have the necessary permissions.
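
A quick way to rule out a missing or placeholder key is to check the environment before starting the agent. The helper below is a sketch: `missing_keys` is a hypothetical name, and the only variable it checks by default is the `VIDEOSDK_API_KEY` from the .env example. Add the names your STT, LLM, and TTS plugins expect.

```python
import os
from typing import Iterable, List, Mapping, Optional

# The placeholder value from the .env example; leaving it in place is a common mistake.
PLACEHOLDER = "your_api_key_here"


def missing_keys(required: Iterable[str],
                 env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset, blank, or still set to the placeholder."""
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = (env.get(name) or "").strip()
        if not value or value == PLACEHOLDER:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_keys(["VIDEOSDK_API_KEY"])
    if problems:
        print("Missing or placeholder keys:", ", ".join(problems))
    else:
        print("All required keys are set.")
```

This only confirms that a value is present; whether the key has the necessary permissions still has to be checked in the provider's dashboard.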

Audio Input/Output Problems

Check your microphone and speaker settings to ensure they are correctly configured and functioning.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions, especially when using a virtual environment.
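
A short diagnostic script can confirm the interpreter version and report what is installed in the active environment. This is a sketch using only the standard library. The function names are hypothetical, the 3.11 minimum comes from the prerequisites above, and the package name comes from the install command in Step 2.

```python
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple


def python_is_supported(minimum: Tuple[int, int] = (3, 11),
                        current: Optional[Tuple[int, int]] = None) -> bool:
    """Check the running (or a given) Python version against the minimum."""
    current = tuple(sys.version_info[:2]) if current is None else tuple(current)
    return current >= minimum


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is not installed."""
    report: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    for name, version in installed_versions(["videosdk"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the activated virtual environment shows whether the package is installed where the script actually runs, which is a frequent cause of import errors.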

Conclusion

Summary of What You've Built

In this guide, you've built a conversational AI voice agent for banking using the VideoSDK framework, capable of handling common banking inquiries and tasks.

Next Steps and Further Learning

Explore additional features and plugins in the VideoSDK framework to further enhance your AI voice agent's capabilities.
