There are plenty of datasets, Just take the ones meant for stable diff training, rip out the prompt text, profit
Heres some high quality captions used for dalle3, etc:
https://huggingface.co/datasets/laion/dalle-3-dataset https://huggingface.co/datasets/laion/gpt4v-dataset https://huggingface.co/datasets/laion/wuerstchen-dataset https://huggingface.co/datasets/laion/220k-GPT4Vision-captions-from-LIVIS https://huggingface.co/datasets/laion/gpt4v-emotion-dataset