Build a Sentiment Analysis Voice Agent

Step-by-step guide to building a sentiment analysis voice agent with VideoSDK, including complete code examples and testing instructions.

Introduction to AI Voice Agents for Sentiment Analysis

AI Voice Agents are intelligent systems designed to interact with users through voice commands. These agents leverage advanced technologies such as Speech-to-Text (STT), Large Language Models (LLMs), and Text-to-Speech (TTS) to understand and respond to user inputs. In the sentiment analysis voice industry, these agents play a crucial role in interpreting the emotional tone of spoken content, providing insights into whether the sentiment expressed is positive, negative, or neutral.
The importance of AI Voice Agents in sentiment analysis lies in their ability to offer real-time emotional insights, which can be invaluable in customer service, mental health applications, and user experience enhancement. For instance, businesses can use these insights to tailor their responses to customers, improving satisfaction and engagement.
In this tutorial, you will build a sentiment analysis voice agent using the VideoSDK framework. This agent will analyze voice inputs, determine the sentiment expressed, and provide feedback to users. You will learn about the core components of a voice agent, including STT, LLM, and TTS, and how they work together to process and respond to user inputs.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of our AI Voice Agent involves a seamless flow of data from user speech to agent response. When a user speaks, the audio input is processed by the Deepgram STT plugin, which converts it into text. This text is then analyzed by the OpenAI LLM plugin to determine the sentiment. Finally, the TTS component converts the response back into speech, which is played back to the user.
(Diagram: user speech → STT → LLM → TTS → audio response)
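To make the cascading flow concrete, here is a plain-Python sketch of the three stages. The function bodies are hypothetical stand-ins for the Deepgram STT, OpenAI LLM, and ElevenLabs TTS plugins, not the actual VideoSDK API:

```python
# Conceptual sketch of the cascading flow. Each stage is a toy placeholder
# for the real plugin; only the shape of the data flow matters here.

def speech_to_text(audio: bytes) -> str:
    # A real STT plugin would transcribe the audio; here we pretend.
    return "i really love this product"

def analyze_sentiment(text: str) -> str:
    # A real LLM would reason about tone; this toy version keyword-matches.
    positive = {"love", "great", "happy"}
    negative = {"hate", "awful", "sad"}
    words = set(text.lower().split())
    if words & positive:
        return "The sentiment sounds positive."
    if words & negative:
        return "The sentiment sounds negative."
    return "The sentiment sounds neutral."

def text_to_speech(text: str) -> bytes:
    # A real TTS plugin would synthesize audio; here we just encode the text.
    return text.encode("utf-8")

def run_pipeline(audio: bytes) -> bytes:
    # STT -> LLM -> TTS, mirroring the CascadingPipeline stages.
    return text_to_speech(analyze_sentiment(speech_to_text(audio)))
```

In the real agent, each stage is streamed and asynchronous; this sketch only shows why the stages must run in this order.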

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot. It handles the interaction logic and defines how the agent responds to user inputs.
  • CascadingPipeline: The flow of audio processing, in which audio data passes through successive stages such as STT, LLM, and TTS. The cascading pipeline ensures that each component works in harmony to deliver accurate sentiment analysis.
  • VAD & TurnDetector: Voice Activity Detection (VAD) and the turn detector are crucial for determining when the agent should listen and respond, ensuring smooth interaction.
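To see what VAD decides at each step, here is a toy energy-threshold detector. SileroVAD is a trained neural model, so this simplified sketch (with an invented threshold) only illustrates the per-frame speech/no-speech decision:

```python
# Toy voice-activity detector: flags a frame as "speech" when its average
# absolute amplitude crosses a threshold. Real VADs like Silero use a
# trained model, but the interface idea is the same: frame in, bool out.

def frame_energy(samples: list) -> float:
    # Mean absolute amplitude of one audio frame.
    return sum(abs(s) for s in samples) / len(samples)

def is_speech(samples: list, threshold: float = 0.35) -> bool:
    return frame_energy(samples) >= threshold

silence_frame = [0.01, -0.02, 0.00, 0.01]
speech_frame = [0.5, -0.6, 0.7, -0.4]
```

The turn detector builds on top of decisions like these, deciding when a run of non-speech frames means the user has finished their turn.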

Setting Up the Development Environment

Prerequisites

To get started, ensure you have Python 3.11 or higher installed. You will also need a VideoSDK account, which you can create at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage your project dependencies:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Step 2: Install Required Packages

Install the agents SDK and the plugin packages used in this tutorial with pip:
pip install videosdk-agents
pip install videosdk-plugins-silero videosdk-plugins-turn-detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs

Step 3: Configure API Keys

Create a .env file in your project root and add your VideoSDK key along with the keys for the Deepgram, OpenAI, and ElevenLabs plugins used in this tutorial:
VIDEOSDK_API_KEY=your_api_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
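The plugins read these keys from the environment. If you prefer not to add an extra dependency such as python-dotenv, a minimal stdlib loader for the .env file could look like this sketch:

```python
import os

def load_env_file(path: str = ".env") -> None:
    # Minimal .env loader: reads KEY=VALUE lines, skips blanks, comments,
    # and malformed lines. Existing environment variables are not overwritten.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Call load_env_file() once at the top of your script, before any plugin is constructed, so the keys are visible when the plugins initialize.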

Building the AI Voice Agent: A Step-by-Step Guide

Here is the complete code for the AI Voice Agent:
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are a sentiment analysis voice agent designed to assist users in understanding the emotional tone of spoken content. Your primary role is to analyze voice inputs and provide insights into the sentiment expressed, such as positive, negative, or neutral emotions. You are a friendly and informative assistant, always aiming to help users gain a deeper understanding of the emotional context of their conversations.\n\nCapabilities:\n1. Analyze voice inputs to determine the sentiment expressed.\n2. Provide a summary of the emotional tone, including specific emotions detected.\n3. Offer suggestions on how to improve communication based on sentiment analysis.\n4. Answer general questions about sentiment analysis and its applications.\n\nConstraints and Limitations:\n1. You are not a licensed therapist or counselor, and your insights should not be considered professional mental health advice.\n2. Always include a disclaimer advising users to consult with a qualified professional for serious emotional or psychological concerns.\n3. You can only analyze voice inputs in English and may not accurately interpret sentiment in other languages or dialects.\n4. Your analysis is based on the data provided and may not capture the full context or nuances of the conversation."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To join a pre-created room, you need a meeting (room) ID. Use the following curl command to generate one, passing your VideoSDK auth token in the Authorization header:
curl -X POST \
  https://api.videosdk.live/v2/rooms \
  -H "Authorization: your_videosdk_auth_token" \
  -H "Content-Type: application/json"

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class extends the Agent class, providing custom behavior for entering and leaving conversations. This is where you define how the agent greets users and says goodbye.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is central to processing audio data. It defines the flow from STT to LLM to TTS, using plugins like DeepgramSTT, OpenAILLM, and ElevenLabsTTS to handle each step.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function initializes the agent session, connecting it to the VideoSDK platform and starting the conversation flow. The make_context function sets up the room options, and the main block starts the job.
async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
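The keep-alive-then-clean-up shape used here (wait on an event, then release resources in a finally block) can be seen in isolation with stubs in place of the real session and context objects:

```python
import asyncio

# Stand-alone illustration of the pattern start_session uses. The Stub class
# is an invented placeholder for the real session/context; try/finally
# guarantees cleanup runs even if the wait is cancelled.

class Stub:
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

async def run_until(stop: asyncio.Event, resource: Stub) -> None:
    try:
        await stop.wait()  # in the tutorial this is asyncio.Event().wait()
    finally:
        await resource.close()  # mirrors session.close()/context.shutdown()

async def demo() -> bool:
    stop = asyncio.Event()
    resource = Stub()
    task = asyncio.create_task(run_until(stop, resource))
    stop.set()  # signal shutdown; in the agent the fresh Event never fires,
                # so the session runs until the process is interrupted
    await task
    return resource.closed
```

Because the tutorial creates a fresh Event that is never set, the agent effectively runs forever; the finally block is what ensures a clean shutdown when the process is interrupted.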

Running and Testing the Agent

Step 5.1: Running the Python Script

To run your agent, execute the following command in your terminal:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you will receive a playground link in the console. Visit this link to interact with your agent. Speak into your microphone, and the agent will analyze the sentiment of your speech and respond accordingly.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows you to extend your agent's functionality using custom tools. These tools can be integrated into the pipeline to add new capabilities, such as additional sentiment analysis features or language support.
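As an illustration of the kind of helper you might expose as a custom tool, here is a hypothetical sentiment-summary function in plain Python. The exact tool-registration decorator and signature depend on the VideoSDK agents API, so this shows only a possible tool body with an invented keyword lexicon:

```python
# Hypothetical tool body: given a transcript, return a compact sentiment
# summary the agent could speak back. The keyword sets are illustrative,
# not a real sentiment model.

POSITIVE = {"love", "great", "happy", "excellent", "glad"}
NEGATIVE = {"hate", "awful", "sad", "angry", "terrible"}

def summarize_sentiment(transcript: str) -> dict:
    words = transcript.lower().split()
    pos = sum(w.strip(".,!?") in POSITIVE for w in words)
    neg = sum(w.strip(".,!?") in NEGATIVE for w in words)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return {"label": label, "positive_hits": pos, "negative_hits": neg}
```

In a real deployment, the LLM would decide when to call a tool like this and fold its structured result back into the spoken response.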

Exploring Other Plugins

While this tutorial uses specific plugins for STT, LLM, and TTS, VideoSDK supports a variety of options. You can explore other plugins to find the best fit for your needs, such as Cartesia for STT or Google Gemini for LLM.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file. Double-check the key values and the environment variable names.
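A quick way to catch a missing key before the agent starts is a startup check. The variable names below are the conventional ones for these providers; adjust them to whatever your configured plugins actually read:

```python
import os

# Assumed variable names -- match these to your own .env and plugin setup.
REQUIRED_KEYS = ["VIDEOSDK_API_KEY", "DEEPGRAM_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY"]

def missing_keys(env=None) -> list:
    # Returns the names of required keys that are unset or empty.
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```

Calling missing_keys() at the top of main.py and exiting with a clear message beats a cryptic authentication error deep inside a plugin.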

Audio Input/Output Problems

If you encounter issues with audio, verify that your microphone and speakers are properly connected and configured. Check the system settings to ensure they are selected as the default devices.

Dependency and Version Conflicts

If you experience dependency issues, ensure all packages are up-to-date. Use pip list to check installed versions and update them as needed.

Conclusion

Summary of What You've Built

In this tutorial, you have built a sentiment analysis voice agent capable of interpreting the emotional tone of spoken content. You learned how to set up the development environment, create a custom agent, and define a processing pipeline.

Next Steps and Further Learning

To further enhance your agent, consider exploring additional plugins and custom tools. Experiment with different models and configurations to improve sentiment analysis accuracy and responsiveness.
