Can image models understand what we’re asking for?

High-quality graphics vs high-quality understanding — which one matters more?

“Imagen 3 is our best diffusion model for text-to-image generation, capable of following descriptive prompts”