Introduction to AI Voice Agents in Conversational AI for Finance
What Is an AI Voice Agent?
An AI voice agent is a software application capable of understanding and responding to human speech. These agents combine technologies such as Speech-to-Text (STT), Natural Language Processing (NLP), and Text-to-Speech (TTS) to interact with users conversationally. They are designed to automate customer service, provide information, and perform tasks through voice commands.
Why Are They Important for Conversational AI in Finance?
In the finance industry, AI Voice Agents can revolutionize customer interactions by providing instant support and personalized financial advice. They can handle inquiries about account balances, transaction histories, investment options, and more, all while reducing the need for human intervention.
Core Components of a Voice Agent
- Speech-to-Text (STT): Converts spoken language into text.
- Large Language Model (LLM): Processes and understands the text to generate a response.
- Text-to-Speech (TTS): Converts the text response back into spoken language.
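The three components above form a simple cascade. The sketch below illustrates that flow; each stage is a stub callable standing in for a real provider, and the stub outputs are invented purely for illustration:

```python
# Minimal sketch of a cascading voice pipeline. Each stage is a black-box
# callable; in the real agent these are provider plugins (STT, LLM, TTS).

def run_turn(audio_in, stt, llm, tts):
    """Run one conversational turn through the STT -> LLM -> TTS cascade."""
    text = stt(audio_in)      # speech -> text
    reply = llm(text)         # text -> response text
    audio_out = tts(reply)    # response text -> speech
    return audio_out

# Stub stages standing in for real providers:
stt = lambda audio: "what is my balance"
llm = lambda text: f"You asked: {text!r}. Please consult a professional advisor."
tts = lambda text: b"<synthesized audio for: " + text.encode() + b">"

print(run_turn(b"<mic audio>", stt, llm, tts))
```

Swapping any stage for another provider only requires replacing that one callable, which is exactly the modularity the cascading pipeline gives you.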
What You'll Build in This Tutorial
In this tutorial, you will build a conversational AI voice agent tailored for finance-related queries using the VideoSDK framework. The agent will understand and respond to user inquiries about financial topics, providing a seamless conversational experience.
Architecture and Core Concepts
High-Level Architecture Overview
The architecture of our AI voice agent involves several key components that work together to process user input and generate responses. The flow starts with the user's speech being captured and converted into text using STT. This text is then processed by an LLM to generate a response, which is finally converted back into speech using TTS.
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, responsible for managing interactions.
- CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS.
- VAD & TurnDetector: These components help the agent determine when to listen and when to speak, ensuring smooth conversational flow.
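To make the VAD and turn-detection idea concrete, here is a toy energy-based end-of-turn detector. This is a deliberate simplification: the real SileroVAD and TurnDetector are learned models, and the thresholds and frame counts here are arbitrary illustration values:

```python
# Toy voice activity detection: a frame counts as speech when its mean
# absolute amplitude crosses a threshold, and the user's turn is considered
# finished after enough consecutive silent frames.

def detect_end_of_turn(frames, threshold=0.35, silence_frames_needed=3):
    """Return the index of the frame where the turn ends, or None if still speaking."""
    silent_run = 0
    for i, frame in enumerate(frames):
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy < threshold:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i  # enough trailing silence: hand the turn to the agent
        else:
            silent_run = 0  # speech resumed, reset the counter
    return None  # user is still speaking

speech = [0.8, -0.7, 0.9]     # high-energy frame (speech)
silence = [0.01, -0.02, 0.0]  # low-energy frame (silence)
print(detect_end_of_turn([speech, speech, silence, silence, silence]))  # -> 4
```

The production components solve the same problem (when to listen, when to respond) far more robustly, but the contract is the same: audio frames in, a turn boundary out.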
Setting Up the Development Environment
Prerequisites
To get started, ensure you have Python 3.11+ installed. You will also need a VideoSDK account, which you can create at app.videosdk.live.
Step 1: Create a Virtual Environment
Create a virtual environment to manage your project dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
Step 2: Install Required Packages
Install the necessary packages using pip:
```bash
pip install videosdk
```
The code below also relies on the Silero, Turn Detector, Deepgram, OpenAI, and ElevenLabs plugins; depending on your SDK version these may ship as separate packages, so check the VideoSDK documentation for the exact package names.
Step 3: Configure API Keys in a .env file
Create a `.env` file in your project directory and add your VideoSDK API key:
```bash
VIDEOSDK_API_KEY=your_api_key_here
```
The Deepgram, OpenAI, and ElevenLabs plugins each require their own provider API key as well; add those to the same file.
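Before starting the agent, a small stdlib-only sanity check can catch missing keys early. The three provider key names below are assumptions based on common plugin conventions, so verify them against each plugin's documentation:

```python
# Check that the environment variables the pipeline's providers rely on are
# set. VIDEOSDK_API_KEY comes from this tutorial; the other names are assumed
# conventions for the Deepgram, OpenAI, and ElevenLabs plugins.
import os

REQUIRED_KEYS = [
    "VIDEOSDK_API_KEY",
    "DEEPGRAM_API_KEY",    # assumed name for the Deepgram STT plugin
    "OPENAI_API_KEY",      # assumed name for the OpenAI LLM plugin
    "ELEVENLABS_API_KEY",  # assumed name for the ElevenLabs TTS plugin
]

def missing_keys(env=os.environ):
    """Return the required keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing keys:", ", ".join(absent))
```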
2Building the AI Voice Agent: A Step-by-Step Guide
Here is the complete code for your AI voice agent:
```python
import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are a knowledgeable financial assistant specializing in providing conversational AI support for finance-related inquiries. Your primary role is to assist users with understanding financial concepts, providing insights into financial products, and offering guidance on personal finance management. You can answer questions about budgeting, investment options, savings plans, and financial terminology. However, you are not a certified financial advisor, and you must include a disclaimer advising users to consult with a professional for personalized financial advice. You should maintain a professional and informative tone, ensuring that all information provided is accurate and up-to-date. You are also capable of directing users to reputable financial resources and tools for further assistance."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create the agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the cascading pipeline: STT -> LLM -> TTS, gated by VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Step 4.1: Generating a VideoSDK Meeting ID
To interact with your agent, you need a meeting ID. You can generate one using the VideoSDK API:
```bash
curl -X POST 'https://api.videosdk.live/v1/meetings' \
  -H 'Authorization: YOUR_API_KEY' \
  -H 'Content-Type: application/json'
```
Step 4.2: Creating the Custom Agent Class
The `MyVoiceAgent` class is where we define the behavior of our agent. It extends the `Agent` class from the VideoSDK framework, and its `on_enter` and `on_exit` methods handle the initial greeting and farewell messages.
Step 4.3: Defining the Core Pipeline
The `CascadingPipeline` is crucial for processing audio data. It defines the flow from STT to LLM to TTS, using a plugin for each stage:
- STT (`DeepgramSTT`): Converts speech to text.
- LLM (`OpenAILLM`): Processes the text to generate a response.
- TTS (`ElevenLabsTTS`): Converts the response text back to speech.
- VAD (`SileroVAD`) & `TurnDetector`: Manage when the agent listens and responds.
Step 4.4: Managing the Session and Startup Logic
The `start_session` function sets up the agent session, connecting the agent, conversation flow, and pipeline. The `make_context` function configures the session environment, including room options, and the main block initializes and starts the agent job.
Running and Testing the Agent
Step 5.1: Running the Python Script
Run your Python script to start the agent:
1python main.py
2Step 5.2: Interacting with the Agent in the Playground
After starting the agent, find the AI Agent playground link in the console output. Join the session and interact with your agent to test its functionality.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can extend your agent's capabilities by integrating custom tools. This involves defining new functions and incorporating them into the agent's logic.
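Conceptually, a custom tool is a named function the LLM can request by name with arguments. The SDK-independent sketch below shows that registration-and-dispatch pattern; the decorator, tool name, and account data are all hypothetical placeholders, and the actual VideoSDK tool mechanism should be taken from its documentation:

```python
# Conceptual sketch of custom tools: the agent registers named functions,
# and the LLM's tool-call requests are dispatched to them by name.
# The ledger data and tool names here are hypothetical placeholders.

TOOLS = {}

def tool(fn):
    """Register a function so the agent can expose it to the LLM."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_account_balance(account_id: str) -> str:
    fake_ledger = {"acct-123": 2500.00}  # stand-in for a real banking backend
    balance = fake_ledger.get(account_id)
    return f"Balance: ${balance:,.2f}" if balance is not None else "Account not found"

def dispatch(tool_name: str, **kwargs) -> str:
    """Route a tool call (as the LLM would request it) to the right function."""
    if tool_name not in TOOLS:
        return f"Unknown tool: {tool_name}"
    return TOOLS[tool_name](**kwargs)

print(dispatch("get_account_balance", account_id="acct-123"))  # Balance: $2,500.00
```

In a real deployment the dispatched function would call your banking backend, and its return value would be fed back to the LLM so it can phrase the answer conversationally.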
Exploring Other Plugins
The VideoSDK framework supports various plugins for STT, LLM, and TTS. Explore other options to find the best fit for your needs.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API keys are correctly configured in the `.env` file. Check for typos and verify your account status.
Audio Input/Output Problems
Verify that your microphone and speakers are correctly set up and functioning. Check your system's audio settings.
Dependency and Version Conflicts
Ensure all dependencies are installed and compatible with your Python version. Use a virtual environment to manage packages.
Conclusion
Summary of What You've Built
You've successfully created a conversational AI voice agent for finance, capable of handling user inquiries and providing financial insights.
Next Steps and Further Learning
Consider exploring additional features and customizations to enhance your agent. Continue learning about AI and voice technologies to expand your skills.