Image to Captions

A CLI tool that auto-captions local images via Fal.ai's Florence-2 API.

Context

Technical Prototype

Role

Developer

Year

2025

Industry

AI Automation

The Idea

The goal of this project was to create a practical command-line tool that generates descriptive captions for a batch of local image files. The script sends each image to the Fal.ai API (specifically the Florence-2 model) and saves the returned caption, streamlining the creation of image metadata.

Development

This Python script (caption_images.py) was developed to perform the following workflow:

Configuration: Requires setting the target image folder name and a Fal.ai API key.
Image Discovery: Scans a specified local folder (named "images" by default) for image files with common extensions (.jpg, .jpeg, .png, .webp).
Preprocessing: Reads image files, checks their dimensions and file size, and automatically resizes them using the Pillow library if they exceed predefined limits, ensuring compatibility with API requirements.
API Interaction:
- Encodes image data into Base64 format and constructs a data URI.
- Sends the image data to the Fal.ai Florence-2 captioning API endpoint using the requests library.
- Handles authentication using the provided API key.
Caption Handling: Parses the JSON response from the API to extract the generated caption. Implements checks for different possible response structures.
Output: Saves the extracted caption to a text file (.txt) with the same base name as the original image file, within the same folder.
Efficiency: Skips processing for images where a corresponding caption file already exists. Includes a basic time.sleep delay between API calls to mitigate potential rate limiting.
Error Handling: Includes try-except blocks for common issues like file access, API request errors (HTTP errors with details), and response parsing failures, providing feedback via print statements.
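
The discovery, skip, encoding, and caption-handling steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names are hypothetical, and the response keys checked in extract_caption ("results", "caption", "text") are assumptions, since the write-up only says that several possible response structures are checked. The HTTP request itself is left out.

```python
import base64
from pathlib import Path

# Extensions from the workflow above, mapped to their MIME types.
MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
}


def find_uncaptioned_images(folder):
    """Return image paths in `folder` that have no matching .txt file yet."""
    pending = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in MIME_TYPES:
            continue
        if path.with_suffix(".txt").exists():
            continue  # already captioned: skip to save time and API calls
        pending.append(path)
    return pending


def to_data_uri(path):
    """Base64-encode an image file and wrap it in a data URI."""
    path = Path(path)
    mime = MIME_TYPES[path.suffix.lower()]
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def extract_caption(response_json):
    """Look for a caption under a few candidate keys (hypothetical names)."""
    for key in ("results", "caption", "text"):
        value = response_json.get(key)
        if isinstance(value, dict):
            # Some APIs nest the text one level down.
            value = value.get("caption") or value.get("text")
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def save_caption(image_path, caption):
    """Write the caption next to the image, using the same base name."""
    caption_path = Path(image_path).with_suffix(".txt")
    caption_path.write_text(caption, encoding="utf-8")
    return caption_path
```

In a loop over find_uncaptioned_images("images"), each path would be encoded, sent to the API, and the parsed caption passed to save_caption, with a short time.sleep between calls.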

Reflection

The script provides an automated solution for batch-captioning images and demonstrates a practical application of cloud-based AI models through API integration. It handles image preprocessing (resizing) and the specifics of API communication. While functional for its purpose, it depends on the availability of an external service (Fal.ai) and on manual configuration. Potential improvements include adding a graphical user interface, implementing more robust error handling and retry mechanisms, exploring parallel processing for speed, and supporting other AI models or APIs.
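
As one way to picture the retry mechanism mentioned above, the following sketch wraps any call in exponential backoff. It illustrates a possible improvement and is not part of the script: with_retries and its parameters are hypothetical, and retrying on every exception is a simplification. A real version would likely retry only on network errors and rate-limit or server-side HTTP responses.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n seconds and retry.

    Hypothetical helper: retries on any Exception for simplicity.
    The `sleep` parameter exists so the waiting can be replaced in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a call that keeps failing waits 1 second, then 2 seconds, and re-raises the error after the third attempt.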

What Worked

Efficient automation of the captioning process for multiple images.
Successful integration and communication with the external Fal.ai Florence-2 API.
Automatic image resizing to handle potentially large input files.
Clear organization of generated captions into corresponding text files.
Logic to prevent re-processing already captioned images, saving time and API calls.
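
The automatic resizing listed above could be implemented along these lines with Pillow. The 2048-pixel limit and both function names are assumptions for illustration; the write-up does not state the script's actual thresholds, and this sketch covers only the dimension check, not the file-size check.

```python
from io import BytesIO

MAX_SIDE = 2048  # hypothetical limit; the script's real threshold is not stated


def fit_within(width, height, max_side=MAX_SIDE):
    """Return (width, height) scaled down so the longer side is <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Keep the aspect ratio; never let a side collapse to zero pixels.
    return max(1, round(width * scale)), max(1, round(height * scale))


def load_resized_bytes(path, max_side=MAX_SIDE):
    """Open an image, shrink it if needed, and return the encoded bytes."""
    from PIL import Image  # Pillow; imported here so fit_within works without it

    with Image.open(path) as img:
        fmt = img.format or "PNG"  # remember the format: resize() drops it
        target = fit_within(img.width, img.height, max_side)
        if target != img.size:
            img = img.resize(target, Image.LANCZOS)
        buffer = BytesIO()
        img.save(buffer, format=fmt)
    return buffer.getvalue()
```

Returning bytes rather than overwriting the file keeps the original image untouched, and the result can be Base64-encoded directly for the API request.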

What Did Not Work / Limitations

Requires manual setup of API key and folder path within the script.
Complete dependency on the specific Fal.ai API endpoint and its availability/pricing.
Error handling is functional but basic; lacks sophisticated retry logic or network resilience.
Processing is sequential, which could be slow for very large image datasets.
Lacks a user-friendly graphical interface (command-line only).
Caption quality is entirely dependent on the performance of the external Florence-2 model.
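
One way to ease the manual-setup limitation noted above is to read the folder from the command line and the API key from the environment. This is a sketch of a possible improvement, not something the script currently does; the flag names, the default delay, and the FAL_KEY variable name are assumptions.

```python
import argparse
import os


def parse_config(argv=None, env=None):
    """Return (folder, delay, api_key) from CLI arguments and the environment."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Caption local images.")
    parser.add_argument("--folder", default="images",
                        help="folder containing the images to caption")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="seconds to wait between API calls (hypothetical default)")
    args = parser.parse_args(argv)

    api_key = env.get("FAL_KEY")  # assumed variable name
    if not api_key:
        parser.error("set the FAL_KEY environment variable")  # exits with status 2
    return args.folder, args.delay, api_key
```

Keeping the key out of the source file also makes it harder to commit it to a public repository by accident.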

GitHub

calluxpore/Image-to-Captions-Florence
