What are the key components of a Voice Agent?

The key components include Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS), which work together to process and respond to user input.

How do I set up the development environment for building an AI Voice Agent?

You need Python 3.11+, a VideoSDK account, and to install necessary packages in a virtual environment. Configure API keys in a `.env` file.

What are some common issues when building AI Voice Agents?

Common issues include API key errors, audio input/output problems, and dependency conflicts. Ensure correct configuration and use a virtual environment.

AI Voice Agent for Entity Extraction

Q: What is an AI Voice Agent?

An AI Voice Agent is a software system that can understand and respond to human speech, often using technologies like speech-to-text, natural language processing, and text-to-speech.

Q: How does entity extraction work with AI Voice Agents?

Entity extraction involves identifying and extracting key pieces of information from spoken language, such as names, dates, and locations, using AI Voice Agents.

Build an AI Voice Agent for entity extraction with VideoSDK. Follow our step-by-step tutorial with code examples.

Introduction to AI Voice Agents in Entity Extraction

What is an AI Voice Agent?

AI Voice Agents are software systems that can understand and respond to human speech. They leverage technologies like speech-to-text (STT), natural language processing (NLP), and text-to-speech (TTS) to provide interactive voice-based interfaces. These agents are capable of performing tasks such as answering questions, controlling smart devices, and more.

Why are they important for the entity extraction industry?

In the field of entity extraction, AI Voice Agents can streamline processes by automatically identifying and extracting key pieces of information from spoken language. This is particularly useful in industries like customer service, healthcare, and finance, where quick access to relevant data is crucial.
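To make "key pieces of information" concrete, here is a minimal sketch of entity extraction applied to a transcript. It uses regular expressions purely for illustration; the agent built in this tutorial delegates extraction to the LLM instead. The function name `extract_entities` and both patterns are hypothetical and not part of VideoSDK.

```python
import re

# Hypothetical patterns for illustration only. The tutorial's agent relies on
# the LLM for extraction rather than on regular expressions.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2}(?:,\s*\d{4})?\b"
)
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_entities(transcript: str) -> dict[str, list[str]]:
    """Return the dates and email addresses found in a transcript."""
    return {
        "dates": DATE_PATTERN.findall(transcript),
        "emails": EMAIL_PATTERN.findall(transcript),
    }
```

Calling `extract_entities("Book a call on March 3, 2025")` returns the date string under the `"dates"` key. A pattern-based approach like this misses anything it was not written for, which is one reason to hand the task to a language model.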

Core Components of a Voice Agent

STT (Speech-to-Text): Converts spoken language into text.
LLM (Large Language Model): Processes the text to understand and generate responses.
TTS (Text-to-Speech): Converts text responses back into speech.

What You'll Build in This Tutorial

In this tutorial, you will build an AI Voice Agent using the VideoSDK framework, capable of extracting entities from user input and providing informative responses.

Architecture and Core Concepts

High-Level Architecture Overview

The AI Voice Agent processes user input through a series of steps: speech is converted to text, analyzed for entity extraction, and then a response is generated and spoken back to the user. This process is managed by a cascading pipeline, which passes the data from one stage to the next.
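The flow described above can be sketched with plain Python callables. This is a toy stand-in, not the VideoSDK `CascadingPipeline`: the class `CascadeSketch` and the three stub stages are hypothetical and exist only to show how each stage's output becomes the next stage's input.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeSketch:
    """Toy stand-in for a cascading pipeline: each stage feeds the next."""

    stt: Callable[[bytes], str]  # audio -> transcript
    llm: Callable[[str], str]    # transcript -> reply text
    tts: Callable[[str], bytes]  # reply text -> audio

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    return f"I heard: {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline_sketch = CascadeSketch(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Because each stage is just a callable, swapping one provider for another only changes the argument passed for that stage, which mirrors the plugin-based design used later in the tutorial.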

Understanding Key Concepts in the VideoSDK Framework

Agent: The core class representing your bot, responsible for managing interactions.
CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS.
VAD & TurnDetector: These components help the agent know when to listen and when to respond, using Silero Voice Activity Detection and a turn detector.
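As a rough illustration of how a threshold turns per-frame speech probabilities into turns, consider the sketch below. It is a deliberate simplification: it ends a turn after a fixed run of silent frames, whereas the actual turn detector is a model with its own logic. The function `detect_turns` and the parameter `end_silence_frames` are hypothetical; only the 0.35 default mirrors the VAD threshold used later in this tutorial.

```python
def detect_turns(
    speech_probs: list[float],
    vad_threshold: float = 0.35,
    end_silence_frames: int = 3,
) -> list[tuple[int, int]]:
    """Return (start, end) frame indices of detected user turns.

    A frame counts as speech when its probability reaches vad_threshold.
    A turn ends after end_silence_frames consecutive non-speech frames.
    """
    turns: list[tuple[int, int]] = []
    start = None
    silence = 0
    for i, prob in enumerate(speech_probs):
        if prob >= vad_threshold:
            if start is None:
                start = i  # first speech frame of a new turn
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= end_silence_frames:
                turns.append((start, i - silence))  # last speech frame
                start, silence = None, 0
    if start is not None:
        # Audio ended while a turn was still open.
        turns.append((start, len(speech_probs) - 1 - silence))
    return turns
```

A lower `vad_threshold` makes the sketch more sensitive to quiet speech but also to background noise, which is the same trade-off you tune when adjusting the real VAD threshold.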

Setting Up the Development Environment

Prerequisites

To get started, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage dependencies:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary Python packages:

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector]"

Step 3: Configure API Keys in a `.env` file

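The Deepgram, OpenAI, and ElevenLabs plugins used later in this tutorial need credentials for their own services as well. Create a .env file in your project directory and add your VideoSDK API key:

VIDEOSDK_API_KEY=your_api_key_here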
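One way to make the values in `.env` visible to the script is to read the file at startup. The sketch below uses only the standard library; the helper `load_env_file` is hypothetical and handles just simple `KEY=VALUE` lines. The `python-dotenv` package offers a more complete `load_dotenv()` if you prefer a dependency.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file and export them to os.environ."""
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)  # keep values already set in the shell
    return loaded
```

Using `setdefault` means a variable exported in the shell takes precedence over the file, which is convenient when the same code runs locally and in deployment.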

Building the AI Voice Agent: A Step-by-Step Guide

Below is the complete code for your AI Voice Agent implementation:

import asyncio

from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are an AI Voice Agent specialized in entity extraction. Your persona is that of a knowledgeable data analyst assistant. Your primary capability is to extract and identify key entities from user-provided text, such as names, dates, locations, and other relevant information. You can also provide brief explanations of the extracted entities if requested. However, you are not capable of making subjective judgments or providing opinions. You must clearly state that your responses are based on the data provided and that users should verify the information independently. You are not a substitute for professional data analysis services and should include a disclaimer advising users to consult a professional for complex data analysis needs."


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        # Greet the user when the agent joins the session
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        # Say goodbye when the agent leaves the session
        await self.session.say("Goodbye!")


async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()


def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To generate a meeting ID, you can use the VideoSDK API. Here is an example using curl:

curl -X POST https://api.videosdk.live/v1/meetings \
-H "Authorization: YOUR_API_KEY"
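The same request can be issued from Python with the standard library. This sketch reuses the URL and header from the curl example; the helper names are hypothetical, and because the shape of the JSON response is not described here, `create_meeting` simply returns the decoded body for you to inspect.

```python
import json
import urllib.request

# Endpoint copied from the curl example.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"


def build_meeting_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a meeting."""
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={"Authorization": api_key},
    )


def create_meeting(api_key: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_meeting_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the networking step easy to mock, so the header and method can be checked without calling the API.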

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is where you define the behavior of your agent. It inherits from Agent and uses the agent_instructions to guide its responses. This class handles the initial greeting and farewell messages.

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is crucial as it defines how audio is processed. Each plugin has a specific role:

DeepgramSTT: Converts speech to text.
OpenAILLM: Processes text for entity extraction.
ElevenLabsTTS: Converts text back to speech.
SileroVAD & TurnDetector: Manage when the agent listens and responds.

pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function sets up and manages the agent session. It creates the agent, the conversation flow, and the pipeline, connects to the room, starts the session, and closes everything down when the process is terminated.

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )
    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()
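The `try`/`finally` around `asyncio.Event().wait()` is what guarantees cleanup when the process is stopped. The sketch below isolates that pattern with a hypothetical `StubSession` in place of the real session object, so it runs without VideoSDK installed and shows that the `finally` block executes on cancellation.

```python
import asyncio


class StubSession:
    """Hypothetical stand-in for an agent session; records lifecycle calls."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def start(self) -> None:
        self.events.append("start")

    async def close(self) -> None:
        self.events.append("close")


async def run_until_stopped(session: StubSession, stop: asyncio.Event) -> None:
    try:
        await session.start()
        await stop.wait()      # blocks until stop.set() or cancellation
    finally:
        await session.close()  # runs on normal exit and on cancellation


async def demo() -> list[str]:
    session, stop = StubSession(), asyncio.Event()
    task = asyncio.create_task(run_until_stopped(session, stop))
    await asyncio.sleep(0)     # let the task reach stop.wait()
    task.cancel()              # simulate the worker being shut down
    try:
        await task
    except asyncio.CancelledError:
        pass
    return session.events
```

Running `asyncio.run(demo())` yields the recorded lifecycle calls in order, confirming that `close` follows `start` even though the wait was interrupted.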

The make_context function creates a JobContext with room options, and the main block starts the agent job.

def make_context() -> JobContext:
    room_options = RoomOptions(
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

To start your agent, run the script using:

python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you will see a link to the VideoSDK playground in the console. Use this link to join the session and interact with your AI Voice Agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

You can extend your agent's functionality by adding custom tools. This involves defining a function_tool that the agent can use to perform specific tasks.

Exploring Other Plugins

VideoSDK supports various plugins for STT, LLM, and TTS. You can experiment with different options to suit your needs, such as the OpenAI LLM plugin.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API key is correctly set in the .env file and that you have access to the VideoSDK services.
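A quick startup check can surface a missing key before any service call fails with a less obvious error. In this sketch, `REQUIRED_KEYS` and `missing_keys` are hypothetical names; only `VIDEOSDK_API_KEY` comes from this tutorial, so add whatever variable names your STT, LLM, and TTS providers expect.

```python
import os

# Only VIDEOSDK_API_KEY appears in this tutorial; extend the tuple with the
# variable names your other providers require.
REQUIRED_KEYS = ("VIDEOSDK_API_KEY",)


def missing_keys(required=REQUIRED_KEYS, env=None) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

For example, calling `missing_keys()` at the top of `main.py` and exiting with a clear message when the list is non-empty points directly at the variable that needs fixing.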

Audio Input/Output Problems

Check your microphone and speaker settings to ensure they are properly configured.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage these dependencies effectively.
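When diagnosing a conflict, it helps to see exactly which interpreter and package versions are active inside the virtual environment. The sketch below uses only the standard library; the helper names are hypothetical, and the package names you pass in should match whatever you installed.

```python
import sys
from importlib import metadata


def environment_report(packages: list[str]) -> dict[str, str]:
    """Map each package name to its installed version, or 'not installed'."""
    report = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


def python_is_supported(minimum=(3, 11)) -> bool:
    """True when the running interpreter meets the tutorial's minimum version."""
    return sys.version_info[:2] >= minimum
```

Printing the report from inside the activated environment also confirms that the script is running with the interpreter you expect, which is a frequent source of "module not found" errors.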

Conclusion

Summary of What You've Built

You have successfully built an AI Voice Agent capable of extracting entities from spoken language using the VideoSDK framework, combining the core voice agent components (STT, LLM, and TTS) and managing interactions through agent sessions.

Next Steps and Further Learning

Explore additional features and plugins offered by VideoSDK to enhance your agent's capabilities and learn more about AI and voice technologies.
