InstructX

Towards Unified Visual Editing With MLLM Guidance

Corresponding author, Project lead

Intelligent Creation Lab, Bytedance

Overview
InstructX is a unified framework for image and video editing. By integrating multimodal large language models (MLLMs) with diffusion models, it enables flexible and precise instruction-guided editing of both images and videos.
Method
Figure: The InstructX framework.
Core questions
  • Q1: Do MLLMs actually help with visual editing tasks, and by how much specifically?
  • Q2: Among the many ways to combine MLLMs and diffusion models, which is better suited for editing tasks?
  • Q3: Can a large amount of high-quality image-editing data be used to improve video editing tasks?
Our answers
  • A1: Yes. As shown in Figure (a) below, the model that uses an MLLM outperforms the diffusion-only model across all tasks.
  • A2: As shown in Figure (b), among the four structures compared, the combination of metaquery + LoRA fine-tuned MLLM + small connector achieves the best performance. The key insights are that the editing should be accomplished within the MLLM itself, that LoRA helps adapt the MLLM to the editing task, and that a large connector is unnecessary.
  • A3: Yes. As shown in Figure (c) below, training on image data transfers editing capability to video, which unlocks zero-shot abilities on video tasks.
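The structure named in A2 can be illustrated with a toy data-flow sketch. This is not the InstructX implementation: the sizes, the names `LoRALinear`, `meta_queries`, and `connector`, and the use of a single linear layer as a stand-in for the MLLM are all assumptions made for illustration. It only shows where learnable queries are appended, where a low-rank (LoRA) update sits next to a frozen weight, and how a small projection could map the query outputs to a conditioning signal for a diffusion model.

```python
import random

random.seed(0)

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols, scale=0.02):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update A @ B.

    The usual alpha / rank scaling factor is omitted for brevity.
    """

    def __init__(self, d_in, d_out, rank):
        self.weight = rand_matrix(d_in, d_out)  # frozen pretrained weight (stand-in)
        self.lora_a = rand_matrix(d_in, rank)   # trainable
        self.lora_b = zeros(rank, d_out)        # trainable, zero-init: update starts at 0

    def __call__(self, x):
        low_rank = matmul(matmul(x, self.lora_a), self.lora_b)
        return add(matmul(x, self.weight), low_rank)

    def trainable_params(self):
        return (len(self.lora_a) * len(self.lora_a[0])
                + len(self.lora_b) * len(self.lora_b[0]))

# Hypothetical sizes, chosen only to keep the example small.
HIDDEN, COND, RANK = 16, 8, 2
NUM_INPUT_TOKENS, NUM_QUERIES = 5, 4

# Stand-in for the embedded instruction and visual tokens.
input_tokens = rand_matrix(NUM_INPUT_TOKENS, HIDDEN)
# Learnable query embeddings appended to the input sequence.
meta_queries = rand_matrix(NUM_QUERIES, HIDDEN)

# Stand-in for the MLLM. A real MLLM mixes tokens through attention;
# this single layer only shows where the LoRA update is applied.
mllm_layer = LoRALinear(HIDDEN, HIDDEN, RANK)
hidden = mllm_layer(input_tokens + meta_queries)  # list concat = append queries

# Read out only the positions that correspond to the meta-queries.
query_states = hidden[-NUM_QUERIES:]

# "Small connector": here a single linear projection to the conditioning width.
connector = rand_matrix(HIDDEN, COND)
condition = matmul(query_states, connector)

full_params = HIDDEN * HIDDEN
print("condition shape:", len(condition), "x", len(condition[0]))
print("LoRA trainable params:", mllm_layer.trainable_params(), "of", full_params)
```

In this toy setting the low-rank factors hold `RANK * (HIDDEN + HIDDEN)` values instead of `HIDDEN * HIDDEN`, which is the reason LoRA is a lightweight way to adapt a large frozen model.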
Replace task
Add&Remove task
Hybrid editing task
Reference-based editing task
Style Transfer task
Image Editing task

Ethical Considerations

The reference images and videos used in these demos are sourced from the public domain or generated by models, and are intended solely to demonstrate the capabilities of this research. If you have any concerns, please contact us and we will remove the content promptly.

BibTeX

@misc{mou2025instructx,
    title={InstructX: Towards Unified Visual Editing with MLLM Guidance},
    author={Chong Mou and Qichao Sun and Yanze Wu and Pengze Zhang and Xinghui Li and Fulong Ye and Songtao Zhao and Qian He},
    year={2025},
    eprint={2510.08485},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2510.08485},
}