Real-Time Webcam Captioning with LLaVA-Phi-3 and Ollama

A Python application that provides real-time, on-screen captions for live webcam video by leveraging a local multimodal AI model (LLaVA-Phi-3) via Ollama.

The Idea

The primary goal of this project was to explore the capabilities of locally run multimodal AI models for real-time video understanding: capture a live webcam feed, analyze the visual content frame by frame, generate descriptive captions with the LLaVA-Phi-3 model running locally via Ollama, and display those captions directly on the video feed. This serves as a demonstration of accessible, privacy-preserving AI applications.

Development

This Python script (webcam_captioner.py) was developed to implement the following real-time captioning workflow:

  • Configuration: Requires Ollama to be installed and running with the llava-phi3 (or a similar LLaVA variant) model downloaded and available. The specific model name might need to be configured within the script.
  • Webcam Initialization: Utilizes OpenCV (cv2) to access and capture video frames from the system's default webcam.
  • Frame Processing & Sampling:
    • Continuously reads frames from the webcam.
    • To manage processing load and API call frequency, frames are sampled periodically (e.g., every few seconds or a set number of frames) rather than sending every single frame for analysis (a minimal capture-and-encode sketch follows this list).
  • Image Encoding: Selected frames are converted into a format suitable for the Ollama API, typically by encoding the image data into Base64.
  • Ollama API Interaction (Local):
    • Constructs a request payload containing the Base64 encoded image and a prompt for caption generation.
    • Sends this payload to the local Ollama API endpoint (e.g., http://localhost:11434/api/generate) using the requests library, targeting the LLaVA model.
    • Handles the streaming response from Ollama to progressively build the generated caption (see the API sketch after this list).
  • Caption Display:
    • Parses the JSON response from Ollama to extract the descriptive caption.
    • Uses OpenCV's text rendering functions (cv2.putText) to overlay the generated caption directly onto the live webcam video window. The caption updates as new descriptions are generated for subsequent frames.
  • Real-Time Loop & Termination: The application runs in a loop, continuously capturing, processing, and displaying captions. It can be terminated by pressing the 'q' key in the OpenCV display window (see the main-loop sketch after this list).
  • Error Handling: Basic error handling is implemented for issues such as webcam access failure or inability to connect to the Ollama API.
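
A minimal sketch of the capture and encoding stages, assuming the default webcam at index 0. The helper names (open_webcam, encode_frame) and the two-second SAMPLE_INTERVAL are illustrative, not taken from the actual script:

```python
import base64

import cv2

SAMPLE_INTERVAL = 2.0  # seconds between frames sent for captioning (illustrative value)


def open_webcam(index: int = 0) -> cv2.VideoCapture:
    """Open a webcam via OpenCV, raising early if it cannot be accessed."""
    cap = cv2.VideoCapture(index)
    if not cap.isOpened():
        raise RuntimeError("Could not open webcam")
    return cap


def encode_frame(frame) -> str:
    """JPEG-encode a BGR frame and return it as a Base64 string for the Ollama API."""
    ok, buffer = cv2.imencode(".jpg", frame)
    if not ok:
        raise RuntimeError("Frame encoding failed")
    return base64.b64encode(buffer).decode("utf-8")
```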
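The API interaction might look like the following, continuing the sketch above. It relies on Ollama's documented /api/generate endpoint and its streaming newline-delimited JSON format; the prompt text and caption_image helper are placeholders of my own:

```python
import json

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llava-phi3"  # must already be pulled, e.g. via `ollama pull llava-phi3`


def caption_image(image_b64: str, prompt: str = "Describe this scene in one sentence.") -> str:
    """Send one Base64-encoded frame to the local Ollama API and assemble the streamed caption."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "images": [image_b64],
        "stream": True,
    }
    caption_parts = []
    with requests.post(OLLAMA_URL, json=payload, stream=True, timeout=120) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # each streamed line is one JSON object
            caption_parts.append(chunk.get("response", ""))
            if chunk.get("done"):
                break
    return "".join(caption_parts).strip()
```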
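Finally, a simplified main loop tying the pieces together, reusing open_webcam, encode_frame, caption_image, SAMPLE_INTERVAL, and the cv2/requests imports from the sketches above. Note that this single-threaded version blocks on each API call, so the displayed feed pauses while a caption is generated; the actual script may handle sampling and threading differently:

```python
import time


def run() -> None:
    cap = open_webcam()
    caption = "Starting..."
    last_sample = 0.0
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break  # webcam read failure
            now = time.time()
            if now - last_sample >= SAMPLE_INTERVAL:
                last_sample = now
                try:
                    caption = caption_image(encode_frame(frame))
                except requests.RequestException:
                    caption = "Ollama API unavailable"
            # Overlay the most recent caption on the live frame.
            cv2.putText(frame, caption, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                        0.7, (0, 255, 0), 2, cv2.LINE_AA)
            cv2.imshow("Webcam Captioning", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):  # 'q' terminates the loop
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()


if __name__ == "__main__":
    run()
```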

Reflection

The script successfully demonstrates a functional real-time webcam captioning system using locally hosted AI models, showcasing the potential for on-device multimodal understanding. It effectively integrates webcam capture, local AI inference via Ollama, and visual feedback. The project highlights the advancements in making sophisticated AI models accessible without relying on cloud services, thereby enhancing privacy and reducing latency.

What Worked

  • Successful real-time capture and display of webcam feed using OpenCV.
  • Effective integration with a locally running Ollama instance to leverage the LLaVA-Phi-3 model.
  • Generation of relevant (though model-dependent) captions for live scenes.
  • Dynamic overlay of captions onto the video feed, providing immediate visual feedback.
  • Demonstrated the feasibility of low-latency, on-device multimodal AI applications.
  • Inherently enhanced privacy, as all video data is processed locally.

What Did Not Work / Limitations

  • Performance & Latency: The "real-time" aspect is subject to the processing power of the local machine and the complexity of the LLaVA model. There can be noticeable latency between scene changes and caption updates.
  • Caption Accuracy & Detail: The quality and relevance of captions are entirely dependent on the LLaVA-Phi-3 model's capabilities and the clarity/content of the webcam input. Complex or ambiguous scenes might yield generic or inaccurate descriptions.
  • Resource Intensive: Running large multimodal models locally can be demanding on CPU, GPU (if utilized by Ollama), and RAM.
  • Setup Dependency: Requires users to have Ollama installed and configured correctly with the appropriate LLaVA model pulled, which can be a barrier for non-technical users.
  • Frame Sampling Rate: The balance between caption update frequency and system load needs careful tuning. Too frequent updates might overload the system; too infrequent might miss dynamic events.
  • No Temporal Context: Each caption is based on a single frame; the system doesn't inherently understand motion or sequences of events across multiple frames without further development.
  • Error Handling Robustness: Error handling is functional for basic cases but could be more robust against varied Ollama API responses or unexpected model behaviors.

GitHub

calluxpore/Real-Time-Webcam-Captioning-with-LLaVA-Phi-3-and-Ollama
