✨ Think-Then-Generate: Evolutionary Reasoning for Image Generation
🚀 The Paradigm Shift
Traditional text-to-image models often treat the text encoder as a frozen dictionary, mapping words to pixels without truly understanding the intent or implicit semantics behind a prompt.
🧠 How it Works
To break this limitation, we introduce the Think-Then-Generate paradigm. Before a single pixel is drawn, the model "thinks" through the instruction:
- Chain-of-Thought (CoT) Reasoning: By fine-tuning Qwen2.5-VL, we activate its latent world knowledge. The model reasons about the scene and objects before generating an "optimized prompt."
- Dual-GRPO Reinforcement Learning: A collaborative RL strategy in which the LLM encoder and the DiT generator evolve together. The LLM learns to produce better instructions, while the DiT improves its rendering based on visual feedback from the generated images.
- Bridging Logic and Vision: The optimized prompt serves as a semantic bridge, ensuring the final generation is a deep realization of user intent.
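The loop described above, reason first, then generate from the optimized prompt, with GRPO-style group-relative rewards driving both sides, can be sketched as follows. This is a minimal illustrative sketch: the function names, the placeholder reasoning and rendering steps, and the reward values are assumptions for demonstration, not the actual model code.

```python
import statistics


def think(prompt: str) -> str:
    """Stage 1 ("think"): stand-in for the CoT reasoning step.

    In the real system a fine-tuned Qwen2.5-VL reasons about the scene
    and objects and emits an optimized prompt; here we simply append
    illustrative scene details so the pipeline is runnable.
    """
    return f"{prompt}, with coherent lighting and a consistent object layout"


def generate(optimized_prompt: str) -> str:
    """Stage 2 ("generate"): stand-in for the DiT image generator."""
    return f"<image rendered from: {optimized_prompt}>"


def think_then_generate(prompt: str) -> str:
    # The optimized prompt is the semantic bridge between the two stages.
    return generate(think(prompt))


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled output's reward by
    the mean and standard deviation of its group. In Dual-GRPO, both the
    LLM's candidate prompts and the DiT's candidate images would be
    scored this way before the policy update.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

By construction the group-relative advantages sum to zero, so samples are rewarded only for beating their own group's average, which is what lets the two policies improve jointly without an absolute reward scale.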
Try these examples 👇
Note: This is a research preview. The model first reasons about your prompt to optimize the visual description.