GetStream / Vision-Agents
Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.
Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.
Built to work natively with OpenAI (create response), Gemini (generate), and Claude (create message), so you always have access to the latest LLM capabilities.

This example shows you how to build a golf coaching AI with YOLO and Gemini Live. Combining a fast object-detection model (like YOLO) with a full realtime AI is useful for many video AI use cases, for example: drone fire detection, sports or video-game coaching, physical therapy, workout coaching, and Just Dance-style games.
```python
# Partial example; full example: examples/02_golf_coach_example/golf_coach_example.py
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=gemini.Realtime(fps=10),
    # llm=openai.Realtime(fps=1),  # careful: higher FPS can get expensive
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")],
)
```

This example shows a security camera system that detects faces, tracks packages, and detects when a package is stolen. It automatically generates "WANTED" posters and posts them to X in real time.
It combines face recognition, YOLOv11 object detection, Nano Banana and Gemini for a complete security workflow with voice interaction.
```python
# Partial example; full example: examples/04_security_camera_example/security_camera_example.py
security_processor = SecurityCameraProcessor(
    fps=5,
    model_path="weights_custom.pt",  # YOLOv11 for package detection
    package_conf_threshold=0.7,
)
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Security AI", id="agent"),
    instructions="Read @instructions.md",
    processors=[security_processor],
    llm=gemini.LLM("gemini-2.5-flash-lite"),
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(),
)
```

Apps like Cluely offer realtime coaching via an invisible overlay. This example shows how to build your own invisible assistant. It combines Gemini Realtime (to watch your screen and audio) and responds with text only, never broadcasting audio. This approach is versatile and can be used for sales coaching, job interview cheating, or physical-world, on-the-job coaching with glasses.
Demo video
```python
agent = Agent(
    edge=StreamEdge(),  # low-latency edge; clients for React, iOS, Android, RN, Flutter, etc.
    agent_user=agent_user,  # the user object for the agent (name, image, etc.)
    instructions="You are silently helping the user pass this interview. See @interview_coach.md",
    # Gemini Realtime: no need to set tts or stt (though that's also supported)
    llm=gemini.Realtime(),
)
```

Step 1: Install via uv

```shell
uv add vision-agents
```

Step 2: (Optional) Install with extra integrations

```shell
uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"
```
Step 3: Obtain your Stream API credentials
Get a free API key from Stream. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.
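With a key in hand, one minimal way to wire credentials into your process is via environment variables. Note this is a sketch: the variable names `STREAM_API_KEY` and `STREAM_API_SECRET` are assumptions here, so confirm the exact names the SDK reads in the Vision Agents docs.

```python
import os


def load_stream_credentials(env=os.environ):
    """Read Stream credentials from the environment.

    STREAM_API_KEY / STREAM_API_SECRET are assumed variable names;
    check the Vision Agents docs for the ones the SDK actually reads.
    """
    key = env.get("STREAM_API_KEY")
    secret = env.get("STREAM_API_SECRET")
    if not key or not secret:
        raise RuntimeError("Missing credentials: set STREAM_API_KEY and STREAM_API_SECRET")
    return key, secret
```

Failing fast like this at startup is usually nicer than getting an opaque authentication error mid-call.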
| Feature | Description |
|---|---|
| True real-time via WebRTC | Stream directly to model providers that support it for instant visual understanding. |
| Interval/processor pipeline | For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls. |
| Turn detection & diarization | Keep conversations natural; know when the agent should speak or stay quiet and who's talking. |
| Voice activity detection (VAD) | Trigger actions intelligently and use resources efficiently. |
| Speech↔Text↔Speech | Enable low-latency loops for smooth, conversational voice UX. |
| Tool/function calling | Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services. |
| Built-in memory via Stream Chat | Agents recall context naturally across turns and sessions. |
| Text back-channel | Message the agent silently during a call. |
| Phone and RAG | Interact with the agent via inbound or outbound phone calls (Twilio) and give it specialized knowledge via retrieval (TurboPuffer). |
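The tool/function-calling row above can be sketched generically. This is not the actual Vision Agents tool API (which may differ); it only illustrates the pattern: the model emits a JSON tool call, and the runtime routes it to a registered Python function whose result flows back into the conversation.

```python
import json

# Hypothetical registry for illustration only; the real framework's
# tool-registration API may look different.
TOOLS = {}


def tool(fn):
    """Register a plain Python function so the model can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_weather(city: str) -> str:
    # A real tool would hit a weather API mid-conversation.
    return f"Sunny in {city}"


def dispatch(tool_call_json: str) -> str:
    """Route a model-emitted call like {"name": ..., "arguments": {...}} to Python."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])
```

The same dispatch shape works for Linear issues, telephony triggers, or internal services: each is just another registered function.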
| Plugin Name | Description | Docs Link |
|---|---|---|
| AWS Bedrock | Realtime speech-to-speech plugin using Amazon Nova models with automatic reconnection | AWS |
| AWS Polly | TTS plugin using Amazon's cloud-based service with natural-sounding voices and neural engine support | AWS Polly |
| Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | Cartesia |
| Decart | Real-time AI video transformation service for applying artistic styles and effects to video streams | Decart |
| Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | Deepgram |
| ElevenLabs | TTS plugin with highly realistic and expressive voices for conversational agents | ElevenLabs |
| Fast-Whisper | High-performance STT plugin using OpenAI's Whisper model with CTranslate2 for fast inference | Fast-Whisper |
| Fish Audio | STT and TTS plugin with automatic language detection and voice cloning capabilities | Fish Audio |
| Gemini | Realtime API for building conversational agents with support for both voice and video | Gemini |
| HeyGen | Realtime interactive avatars powered by HeyGen | HeyGen |
| Hugging Face | LLM plugin providing access to many open-source language models hosted on the Hugging Face Hub and powered by external providers (Cerebras, Together, Groq, etc.) | Hugging Face |
| Inworld | TTS plugin with high-quality streaming voices for real-time conversational AI agents | Inworld |
| Kokoro | Local TTS engine for offline voice synthesis with low latency | Kokoro |
| Moondream | Moondream provides realtime detection and VLM capabilities. Developers can choose from using the hosted API or running locally on their CUDA devices. Vision Agents supports Moondream's Detect, Caption and VQA skills out-of-the-box. | Moondream |
| NVIDIA Cosmos 2 | VLM plugin using NVIDIA's Cosmos 2 models for video understanding with automatic frame buffering and streaming responses | NVIDIA |
| OpenAI | Realtime API for building conversational agents, with out-of-the-box support for real-time video directly over WebRTC, plus OpenAI LLMs and TTS | OpenAI |
| OpenRouter | LLM plugin providing access to multiple providers (Anthropic, Google, OpenAI) through a unified API | OpenRouter |
| Qwen | Realtime audio plugin using Alibaba's Qwen3 with native audio output and built-in speech recognition | Qwen |
| Roboflow | Object detection processor using Roboflow's hosted API or local RF-DETR models | Roboflow |
| Smart Turn | Advanced turn detection system combining Silero VAD, Whisper, and neural models for natural conversation flow | Smart Turn |
| TurboPuffer | RAG plugin using TurboPuffer for hybrid search (vector + BM25) with Gemini embeddings for retrieval augmented generation | TurboPuffer |
| Twilio | Voice call integration plugin enabling bidirectional audio streaming via Twilio Media Streams with call registry and audio conversion | Twilio |
| Ultralytics | Real-time pose detection processor using YOLO models with skeleton overlays | Ultralytics |
| Vogent | Neural turn detection system for intelligent turn-taking in voice conversations | Vogent |
| Wizper | STT plugin with real-time translation capabilities powered by Whisper v3 | Wizper |
| xAI | LLM plugin using xAI's Grok models with advanced reasoning and real-time knowledge | xAI |
Processors let your agent manage state and handle audio/video in real time. They take care of the hard stuff so you can focus on your agent logic.
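As a rough illustration of the processor idea (the real Vision Agents base class and method names will differ), a processor is per-frame work that can keep state across frames:

```python
from dataclasses import dataclass


# Toy frame type for the sketch; the framework supplies its own frame objects.
@dataclass
class Frame:
    index: int
    data: bytes


class MotionCounter:
    """Toy processor: keeps state (a frame counter) across frames."""

    def __init__(self):
        self.frames_seen = 0

    def process(self, frame: Frame) -> Frame:
        self.frames_seen += 1
        # A real processor would run detection here (YOLO, Roboflow, a custom
        # PyTorch/ONNX model) and could annotate or transform the frame.
        return frame
```

The stateful object per stream is the key design point: detection results from frame N can inform what the agent says about frame N+1.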
Check out our getting started guide at VisionAgents.ai.
| 🔮 Demo Applications |
|---|
| Using Cartesia's Sonic 3 model to visually look at what's in the frame and tell a story with emotion. • Real-time visual understanding • Emotional storytelling • Frame-by-frame analysis • Source code and tutorial |
| Realtime stable diffusion using Vision Agents and Decart's Mirage 2 model to create interactive scenes and stories. • Real-time video restyling • Interactive scene generation • Stable diffusion integration • Source code and tutorial |
| Using Gemini Live together with Vision Agents and Ultralytics YOLO, we track the user's pose and provide realtime, actionable feedback on their golf game. • Real-time pose tracking • Actionable coaching feedback • YOLO pose detection • Gemini Live integration • Source code and tutorial |
| Together with OpenAI Realtime and Vision Agents, we take GeoGuessr to the next level by asking it to identify places in our real-world surroundings. • Real-world location identification • OpenAI Realtime integration • Visual scene understanding • Source code and tutorial |
| Interact with your agent over the phone using Twilio. This example demonstrates how to use TurboPuffer for Retrieval Augmented Generation (RAG) to give your agent specialized knowledge. • Inbound/outbound telephony • Twilio Media Streams integration • Vector search with TurboPuffer • Retrieval Augmented Generation • Source code and tutorial |
| A security camera with face recognition, package detection, and automated theft response. Generates WANTED posters with Nano Banana and posts them to X when packages disappear. • Face detection & named recognition • YOLOv11 package detection • Automated WANTED poster generation • Real-time X posting • Source code and tutorial |
See DEVELOPMENT.md
Want to add your platform or provider? Reach out to nash@getstream.io.
Our favorite people & projects to follow for vision AI
- @demishassabis: CEO @ Google DeepMind; won a Nobel Prize
- @OfficialLoganK: Product Lead @ Gemini; posts about robotics vision
- @ultralytics: various fast vision AI models (pose, detect, segment, classify)
- @skalskip92: Open Source Lead @ Roboflow; building tools for vision AI
- @moondreamai: the tiny vision model that could; lightweight, fast, efficient
- @kwindla: Pipecat / Daily; sharing AI and vision insights
- @juberti: Head of Realtime AI @ OpenAI; realtime AI systems
- @romainhuet: Head of DX @ OpenAI; developer tooling & APIs
- @thorwebdev: ElevenLabs; voice and AI experiments
- @mervenoyann: Hugging Face; posts extensively about video AI
- @stash_pomichter: spatial memory for robots; robotics & AI navigation
- @Mentraglass: open-source, hackable AR smart glasses with AI capabilities built in
- @vikhyatk: AI engineer; open-source AI projects; creator of Moondream
Run your agent with `uv run <agent.py> serve`.

Video AI is the frontier of AI, and the state of the art in live-video understanding changes daily. While building these integrations, here are the limitations we've noticed (Dec 2025).
Join the team behind this project: we're hiring a Staff Python Engineer to architect, build, and maintain a powerful toolkit for developers integrating voice and video AI into their products.