Yolo Diffusion

Combining Automated Object Detection with Creative Generative Modification

Context

Technical Prototype

Role

Developer / Researcher

Year

2025

Industry

Generative AI

The Idea

This project set out to build an integrated pipeline that combines the precision of object detection with the creative range of diffusion models. A custom-trained YOLOv8 model first detects eyes automatically in images, videos, and live webcam feeds. The locations of the detected eyes are then passed to a diffusion framework (such as ComfyUI) for targeted creative modifications, such as inpainting or stylization, applied only to the eye regions. The goal was to connect automated analysis with generative artistry and open up new possibilities for image and video editing.

Development

The pipeline was developed in two main stages:

Eye Detection Module:
- A custom object detection model was trained using the Ultralytics YOLOv8 framework.
- Training utilized a small, curated dataset of approximately 50-60 images featuring human faces/eyes, downloaded from the internet.
- Images were manually annotated using LabelImg, focusing on a single class: "eye".
- The trained model detects eyes in real time in images, videos, and live webcam feeds, and outputs bounding boxes with confidence scores.
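
The detection stage described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes the LabelImg annotations were saved in YOLO text format (one `class cx cy w h` line per box, normalized to the image size). The file names `eyes.yaml` and `best.pt`, the `yolov8n.pt` starting checkpoint, and all function names are hypothetical.

```python
def parse_yolo_label(line, img_w, img_h):
    """Convert one YOLO-format label line into (class_id, (x1, y1, x2, y2)) in pixels."""
    cls, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    return int(cls), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def train_eye_detector(data_yaml="eyes.yaml", epochs=100):
    """Fine-tune a pretrained YOLOv8 checkpoint on the single-class "eye" dataset."""
    # Imported lazily so the label helper above has no third-party dependencies.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")  # assumed starting checkpoint (nano variant)
    model.train(data=data_yaml, epochs=epochs, imgsz=640)
    return model


def iter_eye_detections(source, weights="best.pt", min_conf=0.25):
    """Yield, for each frame, a list of dicts with box corners and confidence.

    `source` may be an image path, a video path, or 0 for the default webcam.
    A generator is used because a webcam stream has no natural end.
    """
    from ultralytics import YOLO

    model = YOLO(weights)
    for result in model.predict(source=source, conf=min_conf, stream=True):
        boxes = result.boxes.xyxy.tolist()   # pixel corners per detection
        scores = result.boxes.conf.tolist()  # confidence per detection
        yield [{"box": tuple(b), "confidence": c} for b, c in zip(boxes, scores)]
```

For a 640x480 image, the label line `0 0.5 0.5 0.2 0.1` maps to pixel corners of about (256, 216, 384, 264), which is the same corner format the detector's `xyxy` output uses.
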
Creative Diffusion Module:
- The bounding box coordinates identified by the YOLO model serve as input for the diffusion stage.
- Integration was achieved using ComfyUI workflows, allowing the detected eye regions to be specifically targeted for modification.
- Diffusion models (such as SDXL and FLUX) handle tasks like inpainting and style application, guided by text prompts and optional components such as LoRAs and ControlNets.
- The system supports operation via both Command Line Interface (CLI) and direct integration within ComfyUI nodes.
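
To show how bounding boxes can drive a targeted edit, the sketch below turns detections into a binary inpainting mask, where white pixels mark the area a diffusion model is allowed to repaint. This is an assumption about one plausible hand-off, not a description of the ComfyUI nodes used in this project. The function name and the `padding` parameter are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
def boxes_to_mask(boxes, width, height, padding=0):
    """Return a height x width grid: 255 inside any (padded) box, 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Expand the box by `padding`, then clamp it to the image bounds.
        left = max(0, int(x1) - padding)
        top = max(0, int(y1) - padding)
        right = min(width, int(x2) + padding)
        bottom = min(height, int(y2) + padding)
        for y in range(top, bottom):
            for x in range(left, right):
                mask[y][x] = 255
    return mask


# Example: one detected eye in a tiny 6x4 image, padded by one pixel.
example_mask = boxes_to_mask([(2, 1, 4, 3)], width=6, height=4, padding=1)
```

A little padding is usually worth experimenting with, since a mask that stops exactly at the box edge can leave a visible seam around the repainted region.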

Results

ComfyUI Workflow - Integrating a Custom YOLO Model with the Diffusion Process

Integrating the custom YOLO model in a ComfyUI workflow for eye modification


Reflection

This project demonstrated that object detection and diffusion models complement each other well. YOLOv8 provided efficient, automated eye detection, streamlining the initial analysis phase. Diffusion models then offered fine-grained creative control over the visual output, enabling modifications limited to the detected regions. A key insight was the impact of dataset quality: the small dataset used for YOLO training (roughly 50-60 images) limited the detection model's robustness, underscoring the importance of data curation. The pipeline balanced automation with manual artistic intervention, highlighting the potential for integrated AI workflows in creative fields. The choice of specific tools, such as YOLOv8 and LabelImg, also shaped the workflow's efficiency.

What Worked

Training a functional custom YOLOv8 model for eye detection using a limited dataset.
Achieving real-time eye detection on images, video streams, and live webcam feeds.
Successfully integrating the YOLO detection output into a ComfyUI diffusion pipeline.
Enabling targeted generative modifications (inpainting, stylization) specifically on the detected eye regions.
Demonstrating a practical workflow combining automated detection with controllable creative generation.

What Did Not Work / Limitations

The robustness and accuracy of the eye detection were constrained by the small size (~50-60 images) of the training dataset.
The overall quality of the final output depends significantly on both the detection accuracy and the effectiveness of the chosen diffusion model and its settings.
Setting up the complete end-to-end pipeline involving both YOLO and a diffusion environment like ComfyUI can be complex.
The pipeline relies on external frameworks (Ultralytics YOLO, ComfyUI) and models (SDXL, FLUX).
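
One way to put a number on the detection accuracy limitation above is to compare predicted boxes against hand-labeled ones on held-out images using intersection over union (IoU). The sketch below is a generic illustration of that metric. It is not taken from the project, and the function name is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height are zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes score 1.0 and disjoint boxes score 0.0, so tracking this value as the dataset grows would give a rough sense of whether more annotated images improve localization.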

Github

calluxpore/YOLO-Diffusion