DreamO: A Unified Framework for Image Customization

¹Intelligent Creation Team, ByteDance · ²School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University

The image customization capability of the proposed DreamO.

Abstract

Recent research on image customization (e.g., identity, subject, style, and background) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, which limits their ability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating the seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. During training, we construct a large-scale training dataset covering various customization tasks, and we introduce a feature routing constraint to enable precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance customization capabilities, and a final quality-alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.
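To make the placeholder strategy concrete, here is a minimal hypothetical sketch of how placeholders could bind conditions to positions in a prompt; the `[ref#N]` tag syntax and the `Condition` structure are invented for illustration and are not DreamO's actual interface.

```python
# Hypothetical illustration of the placeholder strategy; the [ref#N] tag
# syntax and the Condition structure are invented for this sketch and are
# not DreamO's actual interface.
from dataclasses import dataclass

@dataclass
class Condition:
    image: str  # path to the reference image supplying this condition
    task: str   # condition type, e.g. "identity", "subject", "style"

# Each placeholder in the prompt is bound to one condition, so the model
# is told *where* in the scene each reference should appear.
prompt = "[ref#1] and [ref#2] sit on a sofa, with [ref#1] on the left"
conditions = {
    "[ref#1]": Condition(image="dog.png", task="subject"),
    "[ref#2]": Condition(image="cat.png", task="subject"),
}
```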

Capabilities

Methods

Overview of the proposed DreamO, which uniformly handles commonly used consistency-aware generation controls within a single DiT framework.

Features

  • Trained on only 8 GPUs for ~100k iterations, using LoRA (adapter weights <500 MB)
  • Fast inference: ~7s for a 1024px image with FLUX-turbo + CFG distillation (a rough inference sketch follows this list)
  • SOTA performance across all tasks, with particularly strong fidelity
  • Robust multi-subject handling: the feature routing constraint greatly reduces attribute entanglement, even in challenging same-category cases such as two dogs or two people
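As a rough illustration of the setup (not the official DreamO inference code), below is a Diffusers-style sketch: the LoRA path is a placeholder, the few-step setting assumes a turbo/distilled variant, and the plain `FluxPipeline` shown here does not take the reference images that DreamO's actual pipeline consumes.

```python
# Rough illustration only; paths and step counts are placeholders.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumes a FLUX base; a turbo/distilled
    torch_dtype=torch.bfloat16,      # variant is what enables few-step inference
).to("cuda")
pipe.load_lora_weights("path/to/dreamo_lora.safetensors")  # <500 MB adapter

image = pipe(
    prompt="a corgi wearing a red scarf, studio lighting",
    height=1024,
    width=1024,
    num_inference_steps=8,  # few steps, assuming turbo/CFG distillation
    guidance_scale=3.5,
).images[0]
image.save("result.png")
```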

Technical details

  • Unified sequence conditioning format for handling diverse inputs in a consistent way (first sketch after this list)
  • Feature routing constraint that encourages content fidelity and disentangles control conditions (second sketch below)
  • Progressive training in 3 stages (schedule sketched at the end of this section):
    1. Warm-up with simple tasks to establish consistency
    2. Full-task training to enhance customization capability
    3. Quality alignment to mitigate biases from low-quality data
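A rough sketch of what unified sequence conditioning can look like: condition tokens are concatenated with the noisy latent tokens into one DiT input sequence, with a learned type embedding marking each segment. Function and variable names here are illustrative assumptions, not the paper's code.

```python
# Illustrative sketch (not the paper's code) of unified sequence conditioning:
# reference/condition tokens are concatenated with the noisy latent tokens
# into one DiT input sequence, tagged by a learned condition-type embedding.
import torch
import torch.nn as nn

def build_dit_sequence(noisy_tokens, cond_token_list, type_embed: nn.Embedding):
    """noisy_tokens: (B, N, D); cond_token_list: list of (B, M_i, D) tensors."""
    device = noisy_tokens.device
    # Type id 0 marks the tokens being generated.
    segments = [noisy_tokens + type_embed(torch.tensor(0, device=device))]
    for i, cond in enumerate(cond_token_list, start=1):
        # Each condition segment carries its own type embedding so the DiT
        # can distinguish reference tokens from generated ones.
        segments.append(cond + type_embed(torch.tensor(i, device=device)))
    return torch.cat(segments, dim=1)  # (B, N + sum(M_i), D)
```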
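Next, a hedged sketch of a routing-style constraint in the spirit described above: attention from generated-image tokens toward each reference's tokens is pushed to concentrate inside that subject's region. The mask supervision and the BCE formulation are assumptions for illustration.

```python
# Hedged sketch of a routing-style constraint: attention from generated-image
# tokens toward each reference's tokens is encouraged to concentrate inside
# that subject's region. The mask supervision and BCE form are assumptions.
import torch
import torch.nn.functional as F

def routing_loss(attn, ref_slices, subject_masks):
    """attn: (B, H, N_img, N_ref) attention from image queries to reference
    keys; ref_slices[k]: token slice of reference k along the key axis;
    subject_masks: (B, K, N_img) binary masks of where subject k appears."""
    loss = 0.0
    for k, sl in enumerate(ref_slices):
        # Total attention each image token pays to reference k's tokens,
        # averaged over heads; softmax weights keep this in [0, 1].
        routed = attn[..., sl].sum(dim=-1).mean(dim=1)  # (B, N_img)
        loss = loss + F.binary_cross_entropy(
            routed.clamp(1e-6, 1 - 1e-6), subject_masks[:, k].float()
        )
    return loss / len(ref_slices)
```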
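Finally, the three-stage schedule could be expressed as a simple config; the task lists and data mixes below are placeholders, not the paper's exact hyperparameters.

```python
# Placeholder schedule; task lists and data mixes are illustrative, not the
# paper's exact hyperparameters.
STAGES = [
    {"stage": "warmup",             # stage 1: simple tasks, limited data
     "tasks": ["subject"],
     "data": "small curated subset"},
    {"stage": "full_training",      # stage 2: all customization tasks
     "tasks": ["identity", "subject", "style", "background"],
     "data": "full mixed-quality dataset"},
    {"stage": "quality_alignment",  # stage 3: fix biases from low-quality data
     "tasks": ["identity", "subject", "style", "background"],
     "data": "small high-quality subset"},
]
```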

More Results