Afaik seedance only has like three modes of generation:
- Text to video
- First frame + text to video
- First frame + last frame + text to video
From what I've seen, you can specify time ranges in the text prompt for certain things, like "1-3s: slow pan in", etc.
People will use something like Google's nano banana to generate still frames from a storyboard-like prompt, then have seedance generate the video for each ~12-second portion.
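The storyboard workflow above can be sketched roughly like this. None of these function names are real APIs; they're stand-ins for whatever image and video generation calls you'd actually wire up:

```python
# Hedged sketch of the storyboard -> keyframes -> clips workflow.
# generate_keyframe / generate_clip are hypothetical stand-ins, not real APIs.

def generate_keyframe(prompt):
    """Stand-in for a still-image generator (e.g. a nano banana call)."""
    return f"frame:{prompt}"

def generate_clip(first_frame, last_frame, prompt):
    """Stand-in for a first-frame + last-frame video generation call."""
    return f"clip({first_frame} -> {last_frame}, '{prompt}')"

storyboard = [
    "wide shot of a harbor at dawn",
    "1-3s: slow pan in on a fishing boat",
    "close-up of the boat's deck",
]

# One keyframe per scene, then one ~12s clip per adjacent pair of keyframes,
# using the pair as the first and last frame of that segment.
keyframes = [generate_keyframe(p) for p in storyboard]
clips = [
    generate_clip(keyframes[i], keyframes[i + 1], storyboard[i + 1])
    for i in range(len(keyframes) - 1)
]
print(len(clips))  # one fewer clip than keyframes
```

The point of the pattern is that each segment's last frame is the next segment's first frame, which keeps the stitched video visually continuous.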

Works the same way as any other software optimization: lower-quality computation as a shortcut.
The predicted frames don't use the same full stack of data that a true frame uses to render; they just use the previous frame's data and the motion vectors. The rest is a very efficient neural-network guessing algorithm based on those two inputs instead of the full shader stack.
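The core idea (previous frame + per-pixel motion vectors, no shader work) can be shown with a toy warp. This is a minimal sketch of motion-vector extrapolation, not any real engine's pipeline; real frame generation replaces the crude hole-filling here with the neural network's guess:

```python
# Toy motion-vector frame extrapolation: predict the next frame by shifting
# each pixel of the previous frame along its motion vector, with no shading.

def extrapolate_frame(prev_frame, motion_vectors):
    """prev_frame: 2D list of pixel values.
    motion_vectors: 2D list of (dx, dy) per pixel, in pixels per frame.
    Returns a predicted next frame built only from the previous frame's data."""
    h, w = len(prev_frame), len(prev_frame[0])
    predicted = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = motion_vectors[y][x]
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                predicted[ny][nx] = prev_frame[y][x]
    # Holes (disocclusions) are where the real thing has to guess; the
    # neural network fills them in, but here we just reuse the old pixel.
    for y in range(h):
        for x in range(w):
            if predicted[y][x] is None:
                predicted[y][x] = prev_frame[y][x]
    return predicted

# Example: a 1x3 "frame" where everything drifts one pixel to the right.
frame = [[255, 0, 0]]
vecs = [[(1, 0), (1, 0), (1, 0)]]
print(extrapolate_frame(frame, vecs))  # prints [[255, 255, 0]]
```

Note the output: the bright pixel moved right as the vectors dictate, and the vacated spot got filled with stale data. That kind of artifact is exactly what the guessing network exists to paper over.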