Introduction to AI Voice Agents in Entity Extraction
What is an AI Voice Agent?
AI Voice Agents are software systems that can understand and respond to human speech. They leverage technologies like speech-to-text (STT), natural language processing (NLP), and text-to-speech (TTS) to provide interactive voice-based interfaces. These agents are capable of performing tasks such as answering questions, controlling smart devices, and more.
Why are they important for the entity extraction industry?
In the field of entity extraction, AI Voice Agents can streamline processes by automatically identifying and extracting key pieces of information from spoken language. This is particularly useful in industries like customer service, healthcare, and finance, where quick access to relevant data is crucial.
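To make "extracting key pieces of information" concrete, here is a minimal sketch that pulls a few entity types out of a transcript with regular expressions. It is a rule-based stand-in for illustration only; the agent built in this tutorial delegates extraction to an LLM. The function name `extract_entities`, the entity labels, and the patterns are all assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration; real transcripts need far more robust handling.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"\$\d+(?:\.\d{2})?"),
}


def extract_entities(transcript: str) -> dict[str, list[str]]:
    """Return every match for each entity label found in the transcript."""
    return {label: pattern.findall(transcript) for label, pattern in PATTERNS.items()}


transcript = "Please refund $42.50 to jane@example.com by 2024-03-15."
print(extract_entities(transcript))
```

Patterns like these only catch rigidly formatted values, which is one reason spoken input is usually handed to a language model instead.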
Core Components of a Voice Agent
- STT (Speech-to-Text): Converts spoken language into text.
- LLM (Large Language Model): Processes the text to understand and generate responses.
- TTS (Text-to-Speech): Converts text responses back into speech.
What You'll Build in This Tutorial
In this tutorial, you will build an AI Voice Agent using the VideoSDK framework, capable of extracting entities from user input and providing informative responses.
Architecture and Core Concepts
High-Level Architecture Overview
The AI Voice Agent processes user input in a series of steps: speech is converted to text, the text is analyzed to extract entities, and a response is generated and spoken back to the user. A cascading pipeline manages this process, passing data from one stage to the next.
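The flow described here can be sketched with three stand-in stages chained in order. This is not the VideoSDK implementation: `fake_stt`, `fake_llm`, `fake_tts`, and `run_cascade` are hypothetical names, and each stage is a trivial placeholder so the example runs without any external service.

```python
import asyncio


async def fake_stt(audio: bytes) -> str:
    """Stand-in for speech-to-text: pretend the audio bytes are UTF-8 text."""
    return audio.decode("utf-8")


async def fake_llm(text: str) -> str:
    """Stand-in for the LLM: report capitalized words as 'entities'."""
    entities = [word.strip(".,") for word in text.split() if word[:1].isupper()]
    return "Entities: " + ", ".join(entities)


async def fake_tts(text: str) -> bytes:
    """Stand-in for text-to-speech: encode the reply back to bytes."""
    return text.encode("utf-8")


async def run_cascade(audio: bytes) -> bytes:
    """Pass one user turn through the three stages in order."""
    transcript = await fake_stt(audio)
    reply = await fake_llm(transcript)
    return await fake_tts(reply)


result = asyncio.run(run_cascade(b"Alice flew to Paris on Monday."))
print(result.decode("utf-8"))  # Entities: Alice, Paris, Monday
```

Each stage only depends on the output of the previous one, which is what makes it possible to swap one provider for another without touching the rest of the chain.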
Understanding Key Concepts in the VideoSDK Framework
- Agent: The core class representing your bot, responsible for managing interactions.
- CascadingPipeline: Manages the flow of audio processing from STT to LLM to TTS.
- VAD & TurnDetector: These components help the agent know when to listen and when to respond, using Silero Voice Activity Detection and a turn detector.
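To illustrate the division of labor between voice activity detection and turn detection, the sketch below uses a deliberately simple technique: an energy threshold for speech detection and a count of consecutive silent frames to decide that the user has finished. Silero VAD and the VideoSDK turn detector are model-based and work differently; `is_speech` and `SilenceTurnDetector` are hypothetical names used only here.

```python
from dataclasses import dataclass


def is_speech(frame: list[float], threshold: float = 0.1) -> bool:
    """Toy VAD: a frame counts as speech if its mean absolute amplitude exceeds the threshold."""
    if not frame:
        return False
    return sum(abs(sample) for sample in frame) / len(frame) > threshold


@dataclass
class SilenceTurnDetector:
    """Toy turn detector: the turn ends after `max_silent_frames` silent frames following speech."""

    max_silent_frames: int = 3
    heard_speech: bool = False
    silent_run: int = 0

    def update(self, frame: list[float]) -> bool:
        """Feed one frame; return True when the user's turn is judged complete."""
        if is_speech(frame):
            self.heard_speech = True
            self.silent_run = 0
            return False
        if self.heard_speech:
            self.silent_run += 1
        return self.heard_speech and self.silent_run >= self.max_silent_frames


loud, quiet = [0.5, -0.4, 0.6], [0.01, -0.02, 0.0]
detector = SilenceTurnDetector()
decisions = [detector.update(f) for f in [quiet, loud, loud, quiet, quiet, quiet]]
print(decisions)  # [False, False, False, False, False, True]
```

The leading quiet frame does not end the turn because no speech has been heard yet; only silence that follows speech counts toward the end of a turn.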
Setting Up the Development Environment
Prerequisites
To get started, ensure you have Python 3.11+ installed and a VideoSDK account. You can sign up at app.videosdk.live.
Step 1: Create a Virtual Environment
Create a virtual environment to manage dependencies:
    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
Step 2: Install Required Packages
Install the necessary Python packages:
    pip install videosdk
Step 3: Configure API Keys in a .env file
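Create a .env file in your project directory and add your VideoSDK API key:
    VIDEOSDK_API_KEY=your_api_key_here
The Deepgram, OpenAI, and ElevenLabs plugins used later in this tutorial also need API keys for their respective services.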
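The tutorial's script does not show how values from the .env file reach the running process. One possible approach, sketched below with only the standard library, is to parse simple KEY=VALUE lines at startup and fail early if a required key is missing. `load_env_file` and `require_env` are hypothetical helpers written for this example; many projects use a dedicated package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines and copy them into os.environ without overwriting existing values."""
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def require_env(name: str) -> str:
    """Return the variable's value, or raise a clear error if it is missing or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in your shell.")
    return value
```

Calling `load_env_file()` followed by `require_env("VIDEOSDK_API_KEY")` at the top of the script would surface a missing key immediately rather than as an authentication error later.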
Building the AI Voice Agent: A Step-by-Step Guide
Below is the complete code for your AI Voice Agent implementation:
    import asyncio

    from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
    from videosdk.plugins.silero import SileroVAD
    from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
    from videosdk.plugins.deepgram import DeepgramSTT
    from videosdk.plugins.openai import OpenAILLM
    from videosdk.plugins.elevenlabs import ElevenLabsTTS

    # Pre-downloading the Turn Detector model
    pre_download_model()

    agent_instructions = "You are an AI Voice Agent specialized in entity extraction. Your persona is that of a knowledgeable data analyst assistant. Your primary capability is to extract and identify key entities from user-provided text, such as names, dates, locations, and other relevant information. You can also provide brief explanations of the extracted entities if requested. However, you are not capable of making subjective judgments or providing opinions. You must clearly state that your responses are based on the data provided and that users should verify the information independently. You are not a substitute for professional data analysis services and should include a disclaimer advising users to consult a professional for complex data analysis needs."


    class MyVoiceAgent(Agent):
        def __init__(self):
            super().__init__(instructions=agent_instructions)

        async def on_enter(self):
            # Greet the user when the agent joins the session
            await self.session.say("Hello! How can I help?")

        async def on_exit(self):
            # Say goodbye when the agent leaves the session
            await self.session.say("Goodbye!")


    async def start_session(context: JobContext):
        # Create agent and conversation flow
        agent = MyVoiceAgent()
        conversation_flow = ConversationFlow(agent)

        # Create pipeline
        pipeline = CascadingPipeline(
            stt=DeepgramSTT(model="nova-2", language="en"),
            llm=OpenAILLM(model="gpt-4o"),
            tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
            vad=SileroVAD(threshold=0.35),
            turn_detector=TurnDetector(threshold=0.8)
        )

        session = AgentSession(
            agent=agent,
            pipeline=pipeline,
            conversation_flow=conversation_flow
        )

        try:
            await context.connect()
            await session.start()
            # Keep the session running until manually terminated
            await asyncio.Event().wait()
        finally:
            # Clean up resources when done
            await session.close()
            await context.shutdown()


    def make_context() -> JobContext:
        room_options = RoomOptions(
            # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
            name="VideoSDK Cascaded Agent",
            playground=True
        )

        return JobContext(room_options=room_options)


    if __name__ == "__main__":
        job = WorkerJob(entrypoint=start_session, jobctx=make_context)
        job.start()
Step 4.1: Generating a VideoSDK Meeting ID
To generate a meeting ID, you can use the VideoSDK API. Here is an example using curl:
    curl -X POST https://api.videosdk.live/v1/meetings \
      -H "Authorization: YOUR_API_KEY"
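If you would rather make the same call from Python, the sketch below mirrors the curl command using only the standard library. The endpoint and the Authorization header are copied from the curl example; the helper names `build_create_meeting_request` and `create_meeting` are hypothetical. Because the response format is not shown in this tutorial, the sketch returns the raw response body rather than guessing at field names.

```python
import urllib.request

# Endpoint and header are taken from the curl example in this step.
MEETINGS_URL = "https://api.videosdk.live/v1/meetings"


def build_create_meeting_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that the curl command issues."""
    return urllib.request.Request(
        MEETINGS_URL,
        method="POST",
        headers={"Authorization": api_key},
    )


def create_meeting(api_key: str) -> str:
    """Send the request and return the raw response body as text."""
    request = build_create_meeting_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Separating request construction from sending keeps the part that needs no network access easy to inspect and test.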
Step 4.2: Creating the Custom Agent Class
The MyVoiceAgent class is where you define the behavior of your agent. It inherits from Agent and passes agent_instructions to the base class to guide its responses. Its on_enter and on_exit methods deliver the initial greeting and the farewell message.
    class MyVoiceAgent(Agent):
        def __init__(self):
            super().__init__(instructions=agent_instructions)

        async def on_enter(self):
            await self.session.say("Hello! How can I help?")

        async def on_exit(self):
            await self.session.say("Goodbye!")
Step 4.3: Defining the Core Pipeline
The CascadingPipeline is crucial because it defines how audio is processed. Each plugin has a specific role:
- DeepgramSTT: Converts speech to text.
- OpenAILLM: Processes text for entity extraction.
- ElevenLabsTTS: Converts text back to speech.
- SileroVAD & TurnDetector: Manage when the agent listens and responds.
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )
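The LLM stage is where entity extraction happens, and its output is free-form text unless you ask for something stricter. One common approach, sketched below, is to request a JSON array and validate it before using it downstream. The reply format, the `Entity` class, and `parse_entities` are assumptions made for this example; they are not part of VideoSDK or of the instructions used in this tutorial.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str
    text: str


def parse_entities(llm_reply: str) -> list[Entity]:
    """Parse a JSON array of {"label": ..., "text": ...} objects, skipping malformed items."""
    try:
        items = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []  # The model did not return valid JSON.
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        if (
            isinstance(item, dict)
            and isinstance(item.get("label"), str)
            and isinstance(item.get("text"), str)
        ):
            entities.append(Entity(label=item["label"], text=item["text"]))
    return entities


# Hypothetical model output for the utterance "Book Dr. Lee in Boston on May 3rd".
sample_reply = (
    '[{"label": "person", "text": "Dr. Lee"}, '
    '{"label": "location", "text": "Boston"}, '
    '{"label": "date", "text": "May 3rd"}]'
)
for entity in parse_entities(sample_reply):
    print(f"{entity.label}: {entity.text}")
```

Returning an empty list on malformed output lets the caller decide whether to retry, ask the user to repeat, or fall back to a plain spoken answer.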
Step 4.4: Managing the Session and Startup Logic
The start_session function sets up and manages the agent session. It creates the agent, the conversation flow, and the pipeline, then starts the session and keeps it running until it is terminated. The finally block closes the session and shuts down the context on exit.
    async def start_session(context: JobContext):
        agent = MyVoiceAgent()
        conversation_flow = ConversationFlow(agent)
        pipeline = CascadingPipeline(
            stt=DeepgramSTT(model="nova-2", language="en"),
            llm=OpenAILLM(model="gpt-4o"),
            tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
            vad=SileroVAD(threshold=0.35),
            turn_detector=TurnDetector(threshold=0.8)
        )
        session = AgentSession(
            agent=agent,
            pipeline=pipeline,
            conversation_flow=conversation_flow
        )
        try:
            await context.connect()
            await session.start()
            await asyncio.Event().wait()
        finally:
            await session.close()
            await context.shutdown()
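The try/finally block in start_session relies on a general asyncio pattern: awaiting an Event that is never set blocks the coroutine until it is cancelled, and the finally clause then runs the cleanup. The sketch below demonstrates that behavior in isolation. `FakeSession`, `run_until_cancelled`, and `demo` are hypothetical names that only record lifecycle calls; they do not use VideoSDK.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a session object; it only records lifecycle calls."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def start(self) -> None:
        self.events.append("started")

    async def close(self) -> None:
        self.events.append("closed")


async def run_until_cancelled(session: FakeSession) -> None:
    try:
        await session.start()
        # This event is never set, so the coroutine waits here until it is cancelled.
        await asyncio.Event().wait()
    finally:
        # Runs on cancellation as well as on errors, so cleanup is not skipped.
        await session.close()


async def demo() -> list[str]:
    session = FakeSession()
    task = asyncio.create_task(run_until_cancelled(session))
    await asyncio.sleep(0.01)  # Give the task time to start and block.
    task.cancel()  # Simulates a manual shutdown.
    try:
        await task
    except asyncio.CancelledError:
        pass
    return session.events


print(asyncio.run(demo()))  # ['started', 'closed']
```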
The make_context function creates a JobContext with room options, and the main block starts the agent job.
    def make_context() -> JobContext:
        room_options = RoomOptions(
            name="VideoSDK Cascaded Agent",
            playground=True
        )
        return JobContext(room_options=room_options)

    if __name__ == "__main__":
        job = WorkerJob(entrypoint=start_session, jobctx=make_context)
        job.start()
Running and Testing the Agent
Step 5.1: Running the Python Script
To start your agent, run the script using:
    python main.py
Step 5.2: Interacting with the Agent in the Playground
Once the script is running, you will see a link to the VideoSDK playground in the console. Use this link to join the session and interact with your AI Voice Agent.
Advanced Features and Customizations
Extending Functionality with Custom Tools
You can extend your agent's functionality by adding custom tools. This involves defining a function_tool that the agent can use to perform specific tasks.
Exploring Other Plugins
VideoSDK supports various plugins for STT, LLM, and TTS. You can experiment with different options to suit your needs, such as the OpenAI LLM plugin.
Troubleshooting Common Issues
API Key and Authentication Errors
Ensure your API key is correctly set in the .env file and that you have access to the VideoSDK services.
Audio Input/Output Problems
Check your microphone and speaker settings to ensure they are properly configured.
Dependency and Version Conflicts
Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage these dependencies effectively.
Conclusion
Summary of What You've Built
You have successfully built an AI Voice Agent capable of extracting entities from spoken language using the VideoSDK framework, combining the core voice agent components and managing interactions through agent sessions.
Next Steps and Further Learning
Explore additional features and plugins offered by VideoSDK to enhance your agent's capabilities and learn more about AI and voice technologies.
FAQ