Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, a general image editing framework that leverages the generalization capability of a Multi-modal Large Language Model (MLLM) to orchestrate a suite of model-level editing tools. It incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data, together with several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning on open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning.
Our system comprises: 1) the Builder, an MLLM reasoning agent that generates workflows; 2) the Executor, which parses and executes workflows; and 3) the Bricks, an external model-level tool library whose functions encapsulate models or logical processes. Given an input pair comprising a target image and an editing prompt, the Builder observes the input state and generates a reasoning trace along with a JSON-formatted workflow according to its strategy. This workflow is a tool invocation graph: the vertex set represents the selected tool instances, and the edge set defines their sequential dependencies. The Executor then parses the workflow, invokes the tools, and produces the edited image.
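To make the workflow format concrete, below is a minimal sketch of how an Executor could parse and run such a tool invocation graph. The JSON schema, tool argument fields, and the `execute` helper are illustrative assumptions, not the exact interface used in Lego-Edit.

```python
# Minimal sketch of an Executor; the schema and tool interfaces are assumptions.
import json
from graphlib import TopologicalSorter

# A hypothetical Builder output: nodes are tool instances, edges are
# sequential dependencies between them.
workflow_json = """
{
  "nodes": [
    {"id": "n1", "tool": "RES",     "args": {"query": "the red mug"}},
    {"id": "n2", "tool": "INPAINT", "args": {"mask_from": "n1"}}
  ],
  "edges": [["n1", "n2"]]
}
"""

# Toy tool library standing in for the Bricks; real bricks would wrap
# segmentation, inpainting, and other specialized models.
BRICKS = {
    "RES":     lambda image, query: f"mask({query})",
    "INPAINT": lambda image, mask_from: f"inpainted({image})",
}

def execute(workflow: dict, image: str) -> dict:
    """Run the tools in dependency order and collect each node's output."""
    deps = {node["id"]: set() for node in workflow["nodes"]}
    for src, dst in workflow["edges"]:
        deps[dst].add(src)
    nodes = {node["id"]: node for node in workflow["nodes"]}
    outputs = {}
    for node_id in TopologicalSorter(deps).static_order():
        node = nodes[node_id]
        outputs[node_id] = BRICKS[node["tool"]](image, **node["args"])
    return outputs

print(execute(json.loads(workflow_json), image="input.png"))
```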
We present a visual comparison of editing results on flexible instructions, alongside the Builder's tool composition process. For the "swap" instruction, although the Builder was never explicitly trained on this task, it decomposes the instruction into atomic operations: it first removes object A using RES and INPAINT, then inserts object B via ADD-PRED and FILL (a sketch of this composition follows below). This example illustrates the Builder's ability to compose specialized tools for flexible editing instructions.
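Under the same illustrative schema as the Executor sketch above, the Builder's composition for a swap instruction might look like the following; the prompt text, tool arguments, and field names are hypothetical.

```python
# Hypothetical workflow for "swap the cat with a dog"; arguments are illustrative.
swap_workflow = {
    "nodes": [
        {"id": "n1", "tool": "RES",      "args": {"query": "the cat"}},   # locate object A
        {"id": "n2", "tool": "INPAINT",  "args": {"mask_from": "n1"}},    # erase object A
        {"id": "n3", "tool": "ADD-PRED", "args": {"object": "a dog"}},    # predict where to insert object B
        {"id": "n4", "tool": "FILL",     "args": {"region_from": "n3",
                                                  "object": "a dog"}},    # render object B into the region
    ],
    "edges": [["n1", "n2"], ["n2", "n3"], ["n3", "n4"]],
}
```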
We demonstrate the Builder's adaptability to user feedback and new tools without retraining. For the reflection removal task, the Builder's initial workflow (RES followed by INPAINT) failed because RES could not segment reflections effectively. Users can provide direct feedback, such as "don't use RES before INPAINT", to rule out the failing composition. Guided by this feedback, the Builder revises its workflow: it uses SOS for foreground segmentation, INVERSE to infer the background, and then INPAINT to remove part of the reflection. Additionally, users can introduce a dedicated reflection-removal tool (RRF), which the Builder readily adopts to solve the task effectively.
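The sketch below shows one way a new brick such as RRF could be exposed at inference time; the registration code, the `TOOL_DESCRIPTIONS` entry, and the assumption that the Builder learns about new tools through textual descriptions in its prompt are illustrative, not the official API.

```python
# Hedged sketch: registering a new brick (RRF) without retraining the Builder.
def remove_reflection(image):
    """Hypothetical dedicated reflection-removal model (RRF)."""
    return f"reflection_free({image})"

# Toy tool registry standing in for the Bricks library.
BRICKS = {}
BRICKS["RRF"] = remove_reflection

# The Builder uses new tools without additional fine-tuning; here we assume
# it only needs a textual description of the tool added to its tool list.
TOOL_DESCRIPTIONS = {
    "RRF": "Removes reflections (e.g., on glass or water) from the input image.",
}
```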
We open-source this project for academic research. The vast majority of images used in this project are either generated or licensed. If you have any concerns, please contact us, and we will promptly remove any inappropriate content.
Any models related to the FLUX.1-dev base model must adhere to the original licensing terms.
This research aims to advance the field of generative AI. Users are free to create images using this tool, provided they comply with local laws and exercise responsible usage. The developers are not liable for any misuse of the tool by users.
@misc{jia2025legoeditgeneralimageediting,
  title={Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder},
  author={Qifei Jia and Yu Liu and Yajie Chai and Xintong Yao and Qiming Lu and Yasen Zhang and Runyu Shi and Ying Huang and Guoquan Zhang},
  year={2025},
  eprint={2509.12883},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.12883},
}
Thanks to
for the page template.