Building Conversational AI in Finance

Implement a conversational AI voice agent for finance using VideoSDK. Follow our step-by-step guide with code examples.

Introduction to AI Voice Agents in Finance

What is an AI Voice Agent?

An AI Voice Agent is a software application that interacts with users through spoken conversation. These agents combine speech-to-text (STT), natural language processing (NLP), and text-to-speech (TTS) to understand and respond to user queries. They are increasingly common across industries because they offer a hands-free user experience.

Why are they important for the finance industry?

In the finance sector, conversational AI voice agents can improve customer service by answering questions about account balances, transaction histories, and investment products without a wait. They can operate 24/7, which reduces routine work for human staff. They can also support fraud detection and compliance by monitoring transactions and alerting users to suspicious activity.

Core Components of a Voice Agent

  • Speech-to-Text (STT): Converts spoken language into text.
  • Large Language Model (LLM): Processes the text to understand and generate responses.
  • Text-to-Speech (TTS): Converts the generated text back into spoken language.

What You'll Build in This Tutorial

In this tutorial, we'll guide you through building a conversational AI voice agent tailored for the finance industry using the VideoSDK framework. You'll learn how to set up the environment, create a custom agent, and deploy it for real-world applications.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of a conversational AI voice agent involves several components working together. When a user speaks, the agent captures the audio input, processes it through a series of transformations, and responds with synthesized speech. The process involves:
  1. Voice Activity Detection (VAD): Detects when the user is speaking and when they have stopped.
  2. Speech-to-Text (STT): Converts the captured audio into text.
  3. Large Language Model (LLM): Analyzes the text and generates a response.
  4. Text-to-Speech (TTS): Converts the response text back into audio.
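
The four stages above can be sketched as a chain of plain functions. This is a toy illustration with stand-in stages, not VideoSDK's implementation: `AudioFrame`, `detect_speech`, `transcribe`, `generate_reply`, and `synthesize` are hypothetical names, and each "frame" carries a pretend loudness value and a word so the example runs without audio hardware or API keys.

```python
from dataclasses import dataclass


@dataclass
class AudioFrame:
    """Stand-in for a chunk of microphone audio (hypothetical type)."""
    energy: float  # pretend loudness between 0.0 and 1.0
    word: str      # the word "spoken" in this frame; empty for silence


def detect_speech(frames, threshold=0.35):
    """VAD stand-in: keep only frames loud enough to count as speech."""
    return [f for f in frames if f.energy >= threshold]


def transcribe(frames):
    """STT stand-in: join the words carried by the speech frames."""
    return " ".join(f.word for f in frames if f.word)


def generate_reply(text):
    """LLM stand-in: a canned rule instead of a real model call."""
    if "balance" in text.lower():
        return "Your balance is available in the accounts tab."
    return "Could you rephrase that?"


def synthesize(text):
    """TTS stand-in: return bytes where a real engine would return audio."""
    return text.encode("utf-8")


def handle_turn(frames):
    """Run one user turn through VAD, STT, LLM, and TTS in order."""
    speech = detect_speech(frames)
    text = transcribe(speech)
    reply = generate_reply(text)
    return synthesize(reply)


frames = [
    AudioFrame(0.10, ""),
    AudioFrame(0.80, "check"),
    AudioFrame(0.70, "my"),
    AudioFrame(0.90, "balance"),
    AudioFrame(0.05, ""),
]
audio_out = handle_turn(frames)
```

Each stage only depends on the output of the previous one, which is why a cascading design lets you swap a single provider without touching the rest.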

Understanding Key Concepts in the VideoSDK Framework

Setting Up the Development Environment

Prerequisites

Before we begin, ensure you have Python 3.11+ installed. You'll also need a VideoSDK account, which you can create at app.videosdk.live.
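
If you are unsure which interpreter your shell picks up, a small check like the following can confirm the version requirement. It is a minimal sketch; `meets_minimum` and `MINIMUM_PYTHON` are illustrative names, not part of any SDK.

```python
import sys

MINIMUM_PYTHON = (3, 11)  # requirement stated in the prerequisites


def meets_minimum(version=None, minimum=MINIMUM_PYTHON):
    """Return True if `version` (default: the running interpreter) is new enough."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if not meets_minimum():
    found = sys.version.split()[0]
    print(f"Python {MINIMUM_PYTHON[0]}.{MINIMUM_PYTHON[1]}+ is required, found {found}")
```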

Step 1: Create a Virtual Environment

To keep dependencies organized, create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

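Install the agents framework together with the plugins used in this guide, plus python-dotenv for loading the .env file:
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"
pip install python-dotenv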

Step 3: Configure API Keys in a .env file

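Create a .env file in your project directory and add your VideoSDK API key. The Deepgram, OpenAI, and ElevenLabs plugins used later also need credentials for their own services; check each plugin's documentation for the variable names it expects.
VIDEOSDK_API_KEY=your_api_key_here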
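
A missing key usually surfaces as an authentication error deep inside a plugin, so it can help to fail early. The sketch below checks the environment before the agent starts. Only `VIDEOSDK_API_KEY` comes from this guide; `REQUIRED_KEYS`, `missing_keys`, and `check_environment` are hypothetical helpers, and you would extend the tuple with whatever variables your plugins expect.

```python
import os

# Only VIDEOSDK_API_KEY appears in this guide; add the keys your plugins need.
REQUIRED_KEYS = ("VIDEOSDK_API_KEY",)


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are absent or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


def check_environment(env=None):
    """Raise a clear error if any required variable is missing."""
    env = os.environ if env is None else env
    missing = missing_keys(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

In a real script you would call `load_dotenv()` from python-dotenv first so the values in .env are copied into `os.environ`, then call `check_environment()`.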

Building the AI Voice Agent: A Step-by-Step Guide

Here is the complete code for the AI Voice Agent. Save it as main.py:
import asyncio
from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Load the API keys from the .env file into the environment
load_dotenv()

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are a knowledgeable financial assistant specializing in conversational AI for the finance sector. Your primary role is to assist users with financial inquiries, provide insights into financial products, and offer guidance on financial planning. You can answer questions about banking services, investment options, and financial regulations. However, you are not a certified financial advisor, and you must include a disclaimer advising users to consult with a professional for personalized financial advice. You should maintain a professional and courteous tone, ensuring that all information provided is accurate and up-to-date. You are also capable of integrating with financial APIs to fetch real-time data, but you must ensure user data privacy and comply with relevant data protection regulations."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Spoken when the session starts
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Spoken when the session ends
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To interact with your AI agent, you need a meeting ID. You can generate it using the following curl command:
curl -X POST \
  https://api.videosdk.live/v1/meetings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json"
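
If you would rather create the meeting from Python, the same request can be built with the standard library. This is a sketch that mirrors the curl command: the URL and headers are copied from it, while `build_create_meeting_request` is a hypothetical helper name. The response format is not described in this guide, so the usage comment simply prints the raw body.

```python
import urllib.request

MEETINGS_URL = "https://api.videosdk.live/v1/meetings"  # endpoint from the curl command


def build_create_meeting_request(api_key):
    """Build (but do not send) the POST request that the curl command issues."""
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# To send it (requires network access and a valid key):
#   with urllib.request.urlopen(build_create_meeting_request(key)) as resp:
#       print(resp.read().decode())
```

Keeping request construction separate from sending makes it easy to inspect the headers before any network call is made.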

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class from the VideoSDK framework. It initializes with specific instructions tailored for financial queries. The on_enter and on_exit methods define what the agent says when a session starts and ends.

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is central to processing audio inputs and generating responses. It consists of:
  • DeepgramSTT: Converts speech to text using the "nova-2" model.
  • OpenAILLM: Processes the text and generates a response using the "gpt-4o" model.
  • ElevenLabsTTS: Converts the response text back into speech.
  • SileroVAD: Detects when the user has finished speaking.
  • TurnDetector: Manages conversation flow by detecting speaker turns.
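
The `threshold` arguments passed to `SileroVAD` and `TurnDetector` are cut-off values applied to model scores. The sketch below shows the general idea with plain thresholding on made-up per-frame speech probabilities. It is not Silero's or VideoSDK's algorithm, and `speech_segments` and `turn_is_over` are hypothetical helpers.

```python
def speech_segments(probabilities, threshold=0.35):
    """Return (start, end) frame index pairs where probability >= threshold.

    `end` is exclusive. Plain thresholding only; real VADs also smooth
    their decisions over time.
    """
    segments = []
    start = None
    for i, p in enumerate(probabilities):
        if p >= threshold and start is None:
            start = i                      # speech begins
        elif p < threshold and start is not None:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:
        segments.append((start, len(probabilities)))
    return segments


def turn_is_over(end_of_turn_probability, threshold=0.8):
    """Treat the user's turn as finished once the score reaches the threshold."""
    return end_of_turn_probability >= threshold


probs = [0.1, 0.5, 0.9, 0.2, 0.4, 0.3]  # illustrative values only
```

With these values, a threshold of 0.35 treats the frame scored 0.4 as speech, while 0.45 does not. Lower thresholds pick up quieter speech at the cost of more false triggers from background noise.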

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent, conversation flow, and pipeline. It connects to the VideoSDK service and maintains the session until manually terminated. The make_context function sets up the room options, enabling the playground mode for testing.
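
The `try`/`finally` around `asyncio.Event().wait()` is what guarantees cleanup: waiting on an event that nothing sets blocks until the task is cancelled, and the `finally` block runs either way. The sketch below demonstrates the pattern with a stand-in session. `FakeSession` and `run_session` are hypothetical, and the event is set explicitly here only so the demo terminates.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a session object; records lifecycle calls."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")

    async def close(self):
        self.events.append("close")


async def run_session(session, stop_event):
    try:
        await session.start()
        await stop_event.wait()  # blocks until the event is set or the task is cancelled
    finally:
        await session.close()    # runs on normal exit and on cancellation


async def demo():
    session = FakeSession()
    stop_event = asyncio.Event()
    task = asyncio.create_task(run_session(session, stop_event))
    await asyncio.sleep(0)       # give the task a chance to start
    stop_event.set()             # simulate "manually terminated"
    await task
    return session.events


events = asyncio.run(demo())
```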

Running and Testing the Agent

Step 5.1: Running the Python Script

Run the script with:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you'll see a link to the VideoSDK playground in the console. Open it in your browser to interact with your AI voice agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend your agent's capabilities by integrating custom tools and APIs, allowing it to fetch real-time financial data or perform specific tasks.
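
As a rough idea of what such a tool might look like, here is a plain async function that converts currency from an in-memory table. Everything in it is hypothetical: the rates are made-up numbers, `convert_currency` is an illustrative name, and a real tool would call a market-data or banking API. How a function is registered with the agent depends on the framework's tool API, which this guide does not cover.

```python
import asyncio

# Made-up sample rates for illustration only; not real market data.
SAMPLE_RATES = {("USD", "EUR"): 0.9, ("EUR", "USD"): 1.1}


async def convert_currency(amount: float, source: str, target: str) -> dict:
    """Convert `amount` using the sample table and return a JSON-friendly result."""
    key = (source.upper(), target.upper())
    if key not in SAMPLE_RATES:
        # Return a structured error so the LLM can explain the failure to the user.
        return {"error": f"No rate available for {key[0]} to {key[1]}"}
    return {"amount": round(amount * SAMPLE_RATES[key], 2), "currency": key[1]}


result = asyncio.run(convert_currency(100, "usd", "eur"))
```

Returning a small dictionary rather than a formatted sentence leaves the wording of the spoken answer to the language model.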

Exploring Other Plugins

The VideoSDK framework supports various plugins for STT, LLM, and TTS. Experiment with different models to optimize performance and accuracy.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API key is correctly set in the .env file and that your VideoSDK account is active.

Audio Input/Output Problems

Check your microphone and speaker settings. Ensure they are properly configured and accessible by the application.

Dependency and Version Conflicts

Verify that all dependencies are installed with compatible versions. Use a virtual environment to manage packages effectively.
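
One quick way to see what is actually installed in the active environment is to query package metadata from Python. This sketch uses the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper, and the `PACKAGES` tuple is a placeholder to extend with the VideoSDK packages you installed.

```python
from importlib import metadata

# Extend with the VideoSDK packages you installed.
PACKAGES = ("python-dotenv",)


def installed_version(package):
    """Return the installed version string for `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in PACKAGES:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```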

Conclusion

Summary of What You've Built

You've successfully built a conversational AI voice agent for the finance industry using VideoSDK. This agent can handle financial queries and provide insights, enhancing user experience.

Next Steps and Further Learning

Explore additional features and plugins to expand your agent's capabilities. Consider integrating with financial APIs for real-time data access and further refining your agent's responses.
Additionally, explore AI voice Agent deployment strategies to ensure your agent is accessible and performs well in various environments.


FAQ