Function Calling with LLMs: AI Voice Agent Guide

Step-by-step guide to building AI Voice Agents for function calling with LLMs using VideoSDK.

Introduction to AI Voice Agents in Function Calling with LLMs

AI Voice Agents are intelligent systems designed to interact with users through voice commands. They leverage technologies such as Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS) to understand and respond to user queries. In the context of function calling with LLMs, these agents can automate complex interactions, making them invaluable in industries where voice-driven operations are essential.

What is an AI Voice Agent?

An AI Voice Agent is a software application that uses voice recognition, natural language processing, and speech synthesis to perform tasks or provide information in response to user voice commands. These agents are capable of understanding context, maintaining conversations, and executing specific functions as instructed.

Why are they important for the function calling with LLMs industry?

Incorporating AI Voice Agents in the function calling with LLMs industry allows for seamless automation of tasks such as scheduling, information retrieval, and executing commands without manual intervention. This enhances productivity and user experience by providing a hands-free, efficient way to interact with complex systems.

Core Components of a Voice Agent

  • STT (Speech-to-Text): Converts spoken language into text.
  • LLM (Large Language Model): Processes the text to understand and generate responses.
  • TTS (Text-to-Speech): Converts text responses back into audible speech.
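The three stages above chain together, each feeding the next. As a conceptual sketch only (the functions below are stand-in stubs, not VideoSDK APIs), the cascading flow looks like this:

```python
# Conceptual sketch of a cascading voice pipeline.
# Each stage is a stub standing in for a real STT/LLM/TTS service.

def stt(audio: bytes) -> str:
    """Stub: pretend to transcribe audio into text."""
    return "what is the weather"

def llm(text: str) -> str:
    """Stub: pretend to generate a reply with a language model."""
    return f"You asked: '{text}'. Here is my answer."

def tts(text: str) -> bytes:
    """Stub: pretend to synthesize speech from text."""
    return text.encode("utf-8")

def run_pipeline(audio: bytes) -> bytes:
    # The cascade: audio -> text (STT) -> reply (LLM) -> audio (TTS)
    return tts(llm(stt(audio)))

reply_audio = run_pipeline(b"<user speech>")
```

In a real agent each stub is replaced by a streaming service client, but the data flow is the same.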

What You'll Build in This Tutorial

In this tutorial, you will build an AI Voice Agent capable of function calling with LLMs using the VideoSDK framework. You will learn how to set up the environment, create a custom agent, and test its capabilities.

Architecture and Core Concepts

High-Level Architecture Overview

The architecture of an AI Voice Agent involves a seamless flow of data from user speech to agent response. The process begins with capturing the user's voice, converting it to text, processing it with an LLM, and finally generating a spoken response.

(Diagram: user speech → STT → LLM → TTS → spoken response)

Understanding Key Concepts in the VideoSDK Framework

  • Agent: The core class representing your bot, responsible for managing interactions.
  • CascadingPipeline: Defines the flow of audio processing through stages like STT, LLM, and TTS.
  • VAD & TurnDetector: These components help the agent determine when to listen and when to respond, ensuring smooth interactions.
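To make the VAD and turn-detection idea concrete: a detector classifies short audio frames as speech or silence against a threshold, and a run of silent frames signals the end of the user's turn. Production detectors like Silero use trained neural models; the toy energy-based version below only illustrates the thresholding concept:

```python
# Toy energy-based voice activity detection (illustrative only --
# real VADs such as Silero use trained neural models).

def is_speech(frame: list[float], threshold: float = 0.35) -> bool:
    # Mean squared amplitude above the threshold counts as speech.
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

def end_of_turn(frames: list[list[float]], silence_frames: int = 3) -> bool:
    # The turn is considered over after N consecutive non-speech frames.
    tail = frames[-silence_frames:]
    return len(tail) == silence_frames and not any(is_speech(f) for f in tail)

loud = [0.9, -0.8, 0.9, -0.9]
quiet = [0.01, -0.02, 0.01, 0.0]
```

The `threshold=0.35` default mirrors the value passed to SileroVAD later in this tutorial, though the two parameters are not directly comparable.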

Setting Up the Development Environment

Prerequisites

Before you begin, ensure you have Python 3.11+ installed and a VideoSDK account available at app.videosdk.live.

Step 1: Create a Virtual Environment

Create a virtual environment to manage dependencies:
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`

Step 2: Install Required Packages

Install the necessary packages using pip:
pip install videosdk-agents videosdk-plugins-silero videosdk-plugins-turn_detector videosdk-plugins-deepgram videosdk-plugins-openai videosdk-plugins-elevenlabs

Step 3: Configure API Keys in a .env File

Create a .env file to securely store your API keys. Ensure it contains the keys for VideoSDK and any other services you are using.
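The exact variable names depend on the plugins you use and should be checked against each plugin's documentation; the names below are assumptions following the common `<PROVIDER>_API_KEY` convention:

```shell
VIDEOSDK_AUTH_TOKEN=your_videosdk_token
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=your_openai_key
ELEVENLABS_API_KEY=your_elevenlabs_key
```

Keep this file out of version control (add it to .gitignore) so your keys are never committed.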

Building the AI Voice Agent: A Step-by-Step Guide

To build your AI Voice Agent, you will start by reviewing the complete code and then break it down into manageable parts.
import asyncio
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS

# Pre-download the Turn Detector model
pre_download_model()

agent_instructions = "You are a knowledgeable AI Voice Agent specializing in 'function calling with LLMs' (Large Language Models). Your primary role is to assist developers and tech enthusiasts in understanding and implementing function calling capabilities using LLMs. \n\nCapabilities:\n1. Explain the concept of function calling with LLMs and its applications.\n2. Provide step-by-step guidance on setting up function calls within LLM frameworks.\n3. Offer examples of code snippets and best practices for efficient function calling.\n4. Answer frequently asked questions related to function calling with LLMs.\n\nConstraints:\n1. You are not a substitute for professional software engineering advice and should encourage users to consult documentation or experts for complex issues.\n2. You must not provide any proprietary or confidential information.\n3. Ensure that all examples and explanations are clear, concise, and suitable for a general audience with basic programming knowledge."

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Step 4.1: Generating a VideoSDK Meeting ID

To start, generate a meeting ID using the VideoSDK API. This ID is crucial for connecting your agent to a session.
curl -X POST \
  https://api.videosdk.live/v1/meetings \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'
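If you prefer to stay in Python, the same call can be sketched with the standard library. The endpoint and headers match the curl command above; the shape of the JSON response (for example, which field holds the meeting ID) is not shown in this guide, so check the VideoSDK API reference before relying on it:

```python
import json
import urllib.request

def build_meeting_request(api_key: str) -> urllib.request.Request:
    # Mirrors the curl command: POST with an empty JSON body.
    return urllib.request.Request(
        "https://api.videosdk.live/v1/meetings",
        data=b"{}",
        headers={
            "Authorization": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

def create_meeting(api_key: str) -> dict:
    # Sends the request; requires a valid API key and network access.
    with urllib.request.urlopen(build_meeting_request(api_key)) as resp:
        return json.loads(resp.read())
```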

Step 4.2: Creating the Custom Agent Class

The MyVoiceAgent class is a custom implementation of the Agent class. It defines the agent's behavior when entering and exiting a session.
class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions=agent_instructions)

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

Step 4.3: Defining the Core Pipeline

The CascadingPipeline is the heart of the agent, defining how audio is processed through the STT, LLM, and TTS stages. It incorporates Silero voice activity detection and a turn detector to ensure an efficient interaction flow.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o"),
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=TurnDetector(threshold=0.8)
)

Step 4.4: Managing the Session and Startup Logic

The start_session function manages the session lifecycle, while make_context sets up the environment for the agent. This setup is crucial for initiating AI Voice Agent sessions effectively.
def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )
    return JobContext(room_options=room_options)

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Running and Testing the Agent

Step 5.1: Running the Python Script

Run the script using Python to start the agent:
python main.py

Step 5.2: Interacting with the Agent in the Playground

Once the script is running, you can interact with the agent through the AI Agent Playground. Use the test URL provided in the console to join the session and start communicating with your AI Voice Agent.

Advanced Features and Customizations

Extending Functionality with Custom Tools

The VideoSDK framework allows you to extend your agent's functionality by integrating custom tools. These can be used to perform specific tasks or enhance the agent's capabilities.
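Independent of the framework's own tool-registration API (consult the VideoSDK docs for the exact decorators), the underlying function-calling pattern is: describe a function to the LLM with a JSON schema, then dispatch the model's tool call to real Python code. The `get_weather` function and its schema below are illustrative assumptions, not part of any SDK:

```python
import json

# A made-up example tool the LLM can call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# OpenAI-style JSON-schema description of the tool, as accepted by most
# function-calling LLM APIs.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Map tool names back to the Python callables that implement them.
REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    # The LLM returns a function name plus JSON-encoded arguments;
    # look the function up and invoke it with those arguments.
    fn = REGISTRY[tool_call["name"]]
    return fn(**json.loads(tool_call["arguments"]))
```

In a voice agent, the dispatch result is fed back to the LLM so it can phrase a spoken answer, which the TTS stage then synthesizes.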

Exploring Other Plugins

While this tutorial focuses on specific plugins, the VideoSDK framework supports various STT, LLM, and TTS options. Explore these to customize your agent further.

Troubleshooting Common Issues

API Key and Authentication Errors

Ensure your API keys are correctly configured in the .env file. Double-check for any typos or incorrect values.

Audio Input/Output Problems

Verify that your microphone and speaker settings are correctly configured and that the necessary permissions are granted.

Dependency and Version Conflicts

Ensure all dependencies are installed with compatible versions. Use a virtual environment to manage these dependencies effectively.

Conclusion

Summary of What You've Built

You have successfully built an AI Voice Agent capable of function calling with LLMs using the VideoSDK framework. This agent can understand voice commands, process them, and respond intelligently.

Next Steps and Further Learning

To enhance your agent, explore additional plugins and custom tools. Consider diving deeper into the VideoSDK documentation to unlock more advanced features and capabilities.
