When generating videos from text prompts, directly mapping language to high-resolution video tends to produce inconsistent, blurry results: the output space is enormous and the text prompt alone is a weak conditioning signal.
Researchers at Meta took a different approach - first generate a high-quality image from the text, then generate a video conditioned on both image and text.
The image acts like a "starting point" that the model can imagine moving over time based on the text prompt. This stronger conditioning signal produces way better videos.
They built a model called Emu Video using diffusion models. It sets a new SOTA for text-to-video generation:
- "In human evaluations, our generated videos are strongly preferred in quality compared to all prior work– 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video."
- "Our factorizing approach naturally lends itself to animating images based on a user’s text prompt, where our generations are preferred 96% over prior work."
The key was "factorizing" into image and then video generation.
Being able to condition on both text AND a generated image makes the video task much easier. The model just has to imagine how to move the image, instead of hallucinating everything.
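To make the idea concrete, here's a minimal sketch of the factorized pipeline using publicly available stand-ins rather than Emu Video itself (assumed: Hugging Face diffusers, with SDXL as the text-to-image stage and Stable Video Diffusion as the image-to-video stage; note that SVD conditions only on the image, whereas Emu Video's second stage also conditions on the text):

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

prompt = "a red panda surfing a small wave at sunset"

# Stage 1: text -> image (stand-in for Emu Video's text-to-image stage)
t2i = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
image = t2i(prompt=prompt).images[0]

# Stage 2: image -> video. SVD conditions on the image only; Emu Video's
# second stage keeps the text prompt as additional conditioning.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
image = image.resize((1024, 576))  # resolution SVD expects
frames = i2v(image, decode_chunk_size=8).frames[0]

export_to_video(frames, "panda_surfing.mp4", fps=7)
```

The model names and output paths are illustrative; the point is the structure: a text-to-image step whose output becomes the conditioning for the video step.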
They can also animate user-uploaded images by swapping the uploaded picture in as the conditioning image; again, generations are reported to be preferred 96% of the time over prior work.
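Continuing the sketch above (reusing the illustrative `i2v` pipeline and `export_to_video` helper), animating a user image just means skipping stage 1:

```python
from diffusers.utils import load_image

# Skip stage 1: condition the video stage directly on a user-supplied image
# ("my_photo.png" is a placeholder path).
user_image = load_image("my_photo.png").resize((1024, 576))
frames = i2v(user_image, decode_chunk_size=8).frames[0]
export_to_video(frames, "my_photo_animated.mp4", fps=7)
```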
It's cool to see research pushing text-to-video generation forward. Emu Video shows how stronger conditioning through images sets a new quality bar. This is a nice complement to the Emu Edit model they released as well.
TLDR: By first generating an image conditioned on text, then generating video conditioned on both image and text, you can get better video generation.
Full summary is here. Paper site is here.