I'd like to fine-tune a text-to-image model to know certain faces of people I'm working with. I've been experimenting a bit, and I can get images that are reminiscent of a person but don't really look like them. I'm also having to provide more in the prompt than I would expect.
For example, one of the people is a big guy with a mustache and glasses. I fine-tuned on a few images of him, with the caption in the training dataset being just his actual name.
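To make that concrete, here's roughly what my caption setup looks like. This is a simplified sketch, assuming a diffusers-style `metadata.jsonl`; the file names and paths are placeholders, not my real data:

```python
import json
import os

# Sketch of the current captioning approach: every image of the person
# is captioned with just his actual name. (Placeholder file names.)
examples = [
    {"file_name": "mark_01.jpg", "text": "Mark Smith"},
    {"file_name": "mark_02.jpg", "text": "Mark Smith"},
    {"file_name": "mark_03.jpg", "text": "Mark Smith"},
]

os.makedirs("train", exist_ok=True)
with open("train/metadata.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```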
When I generate images with his name as the subject, none of the faces have a mustache or glasses. If I prompt with "Mark Smith with mustache and glasses doing xyz" it does look slightly more reminiscent of him, but still not quite right.
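For reference, generation on my end is just a plain prompt through the fine-tuned checkpoint, something like this (a minimal sketch assuming a diffusers pipeline; the model path is a placeholder):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned checkpoint (hypothetical local path).
pipe = StableDiffusionPipeline.from_pretrained(
    "my-finetuned-model",
    torch_dtype=torch.float16,
).to("cuda")

# Without the extra descriptors the faces come out generic; spelling out
# "mustache and glasses" helps a little but still isn't quite him.
image = pipe("Mark Smith with mustache and glasses riding a bicycle").images[0]
image.save("mark_test.png")
```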
What should my strategy be to improve this? Do I need more images of him? Should I hash his name (or something similar) into a unique token used consistently in the captions, to make sure other weights in the model aren't interfering? Other ideas?
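To clarify what I mean by "hashing" his name: replace the real name in every caption with a rare identifier token, so it doesn't collide with whatever the base model already associates with "Mark Smith". A rough sketch of the idea; the token format is arbitrary and just for illustration:

```python
import hashlib

def rare_token_for(name: str) -> str:
    # Derive a short, stable pseudo-word from the person's name so every
    # caption for that person uses the same unlikely-to-collide identifier.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:6]
    return f"sks{digest}"

# The caption "Mark Smith" would become something like "a photo of sks<hash> person".
token = rare_token_for("Mark Smith")
caption = f"a photo of {token} person"
print(token, "->", caption)
```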
I realize I could just experiment, but fine-tuning runs are expensive and I don't want to go in the wrong direction too many times.