✨ Think-Then-Generate: Evolutionary Reasoning for Image Generation

🚀 The Paradigm Shift

Traditional text-to-image models often treat the text encoder as a frozen dictionary, mapping words to fixed embeddings without truly understanding the intent or implicit semantics behind a prompt.

🧠 How it Works

To break this limitation, we introduce the Think-Then-Generate paradigm. Before a single pixel is drawn, the model "thinks" through the instruction:

  1. Chain-of-Thought (CoT) Reasoning: By fine-tuning Qwen2.5-VL, we activate its latent world knowledge. The model reasons about the scene and objects before generating an "optimized prompt."
  2. Dual-GRPO Reinforcement Learning: A collaborative RL strategy where the LLM encoder and DiT generator evolve together. The LLM learns to produce better instructions, while the DiT enhances its rendering capability based on visual feedback.
  3. Bridging Logic and Vision: The optimized prompt serves as a semantic bridge, ensuring the final generation is a deep realization of user intent.
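The two-stage flow in steps 1–3 can be sketched as a minimal pipeline. The function names and the trivial prompt rewrite below are illustrative placeholders, not the actual project API: in the real system, `think` would be a fine-tuned Qwen2.5-VL producing chain-of-thought plus an optimized prompt, and `generate` a DiT conditioned on that prompt.

```python
# Minimal sketch of the Think-Then-Generate inference flow.
# All names here are hypothetical stand-ins, not the real model interface.

def think(user_prompt: str) -> str:
    """Step 1: the LLM encoder reasons about the scene and emits an
    optimized prompt. A trivial string rewrite stands in for the LLM."""
    return f"{user_prompt}, detailed scene, coherent object layout"

def generate(optimized_prompt: str) -> str:
    """Step 2: the DiT generator renders pixels conditioned on the
    optimized prompt (stubbed here as a string tag)."""
    return f"<image conditioned on: {optimized_prompt}>"

def think_then_generate(user_prompt: str) -> str:
    """The optimized prompt acts as the semantic bridge between
    the reasoning stage and the rendering stage."""
    return generate(think(user_prompt))

print(think_then_generate("a cat reading a newspaper"))
```

The key design point is that the user's raw prompt never conditions the generator directly; only the reasoned, optimized prompt does.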
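For step 2, the "GRPO" in Dual-GRPO refers to group-relative policy optimization, where each sampled output's reward is normalized against the mean and standard deviation of its sampling group. A minimal sketch of that advantage computation, with made-up reward values standing in for real visual-feedback scores:

```python
# Group-relative advantage as used in GRPO-style RL: normalize each
# reward against its group's statistics. Reward values are illustrative.

def grpo_advantages(rewards: list[float]) -> list[float]:
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 for constant-reward groups
    return [(r - mean) / std for r in rewards]

# Example: visual-feedback rewards for a group of 4 sampled generations.
adv = grpo_advantages([0.2, 0.8, 0.5, 0.5])
print(adv)
```

Generations scoring above the group mean get positive advantage and are reinforced; in the dual setup, this signal updates both the LLM's prompt policy and the DiT's rendering policy.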
Try these examples 👇

Note: This is a research preview. The model first reasons about your prompt to optimize the visual description.