InstructX

Towards Unified Visual Editing With MLLM Guidance

^† Corresponding author, ^‡ Project lead

Intelligent Creation Lab, Bytedance

Core questions

Q1: Do MLLMs actually help with visual editing tasks, and by how much specifically?
Q2: Among the many ways to combine MLLMs and diffusion models, which is better suited for editing tasks?
Q3: Can a large amount of high-quality image-editing data be used to improve video editing tasks?

Our answers

A1: Yes. As shown in Figure (a), the model that uses an MLLM outperforms the diffusion-only model on every task.
A2: As shown in Figure (b), among the four structures compared, the combination of metaquery + LoRA fine-tuned MLLM + small connector performs best. The key insights are that the editing should be accomplished within the MLLM itself, that LoRA helps adapt the MLLM to the editing task, and that a large connector is unnecessary.
A3: Yes. As shown in Figure (c), training on image data transfers editing capability to video and unlocks zero-shot abilities on video tasks.
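
To make the structure named in A2 more concrete, here is a minimal toy sketch in pure Python. It is not the InstructX implementation. All names and sizes (`LoRALinear`, `mix_tokens`, `edit_condition`, `D_MLLM`, `D_COND`, and so on) are hypothetical, a single LoRA-adapted linear layer stands in for the whole MLLM, a mean-mixing step stands in for attention, and the "small connector" is assumed to be one linear projection. The sketch only shows how learnable query tokens, a LoRA update on frozen weights, and a small connector could fit together.

```python
import random

random.seed(0)

def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def matmul(a, b):
    # (n x k) @ (k x m) -> (n x m)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * A @ B."""

    def __init__(self, d_in, d_out, r=2, alpha=4.0):
        self.W = rand_matrix(d_in, d_out)  # frozen pretrained weight (toy values)
        self.A = rand_matrix(d_in, r)      # trainable
        self.B = zeros(r, d_out)           # trainable, zero-init: update starts at 0
        self.scale = alpha / r

    def __call__(self, x):
        base = matmul(x, self.W)
        update = matmul(matmul(x, self.A), self.B)
        return [[b + self.scale * u for b, u in zip(rb, ru)]
                for rb, ru in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

def mix_tokens(sequence):
    # Crude stand-in for attention: add the sequence mean to every token,
    # so the query positions can see the instruction tokens.
    n = len(sequence)
    mean = [sum(col) / n for col in zip(*sequence)]
    return [[x + m for x, m in zip(tok, mean)] for tok in sequence]

# Toy sizes, chosen only for illustration (assumptions).
D_MLLM, D_COND, N_TEXT, N_QUERY = 8, 4, 5, 3

instruction_tokens = rand_matrix(N_TEXT, D_MLLM)  # embedded instruction + source visual
meta_queries = rand_matrix(N_QUERY, D_MLLM)       # learnable query tokens
mllm_layer = LoRALinear(D_MLLM, D_MLLM)           # stand-in for the LoRA-tuned MLLM
connector = rand_matrix(D_MLLM, D_COND)           # small connector: one projection

def edit_condition(tokens, queries):
    sequence = tokens + queries               # append queries after the instruction
    hidden = mllm_layer(mix_tokens(sequence))
    query_hidden = hidden[-len(queries):]     # keep only the query positions
    return matmul(query_hidden, connector)    # conditioning passed to the diffusion model

cond = edit_condition(instruction_tokens, meta_queries)
print(len(cond), len(cond[0]))                            # 3 4
print(mllm_layer.trainable_params(), D_MLLM * D_MLLM)     # 32 64
```

In this toy setting the LoRA factors hold 32 trainable values against 64 in the frozen weight, and because `B` starts at zero the adapted layer initially reproduces the frozen layer exactly. The connector is deliberately a single projection, reflecting the claim in A2 that the editing work happens inside the MLLM rather than in a large connector.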

BibTeX

@misc{mou2025instructx,
  title={InstructX: Towards Unified Visual Editing with MLLM Guidance},
  author={Chong Mou and Qichao Sun and Yanze Wu and Pengze Zhang and Xinghui Li and Fulong Ye and Songtao Zhao and Qian He},
  year={2025},
  eprint={2510.08485},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.08485},
}
