I spun up a simple project (home surveillance system) to play around with ShareGPT4V-7B and made quite a bit of progress over the last few days. However, I'm having a really hard time figuring out how to send a simple prompt along with the image-to-text request. Here is the relevant code:
document.getElementById('send-chat').addEventListener('click', async () => {
  const message = document.getElementById('chat-input').value;
  appendUserMessage(message);
  document.getElementById('chat-input').value = '';

  // The current frame is set as a CSS background-image, i.e. url("...");
  // slice(5, -2) strips the leading url(" and trailing ") to get the bare URL.
  const imageElement = document.getElementById('frame-display');
  const imageUrl = imageElement.style.backgroundImage.slice(5, -2);

  try {
    // Fetch the frame and base64-encode it for the JSON payload
    const imageBlob = await fetch(imageUrl).then(res => res.blob());
    const reader = new FileReader();
    reader.onloadend = async () => {
      // reader.result is a data: URL; keep only the base64 part after the comma
      const base64data = reader.result.split(',')[1];
      const imageData = {
        data: base64data,
        id: 1
      };
      const payload = {
        prompt: message,
        image_data: [imageData],
        n_predict: 256,
        top_p: 0.5,
        temp: 0.2
      };
      const response = await fetch("http://localhost:8080/completion", {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload)
      });
      const data = await response.json();
      console.log(data);
      appendAiResponse(data.content);
    };
    reader.readAsDataURL(imageBlob);
  } catch (error) {
    console.error('Error encoding image or sending request:', error);
  }
});
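(One side note on that snippet: since the request fires inside reader.onloadend, a failed fetch in there won't actually be caught by the surrounding try/catch. A minimal sketch of how the read could be promisified instead; blobToBase64 is just a helper name I made up:)

  // Minimal sketch: wrap FileReader in a Promise so errors inside the
  // callback propagate to the caller's try/catch via await.
  function blobToBase64(blob) {
    return new Promise((resolve, reject) => {
      const reader = new FileReader();
      // reader.result is a data: URL; keep only the base64 payload
      reader.onload = () => resolve(reader.result.split(',')[1]);
      reader.onerror = () => reject(reader.error);
      reader.readAsDataURL(blob);
    });
  }

  // Usage inside the click handler:
  // const base64data = await blobToBase64(imageBlob);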
The only thing that works is sending a single space (or sometimes a question mark), which gets me a generic interpretation of the image, but what I really want is to instruct the model so it knows what to look for. Is that currently possible? Basically, system prompting the vision model.
I ended up scrutinizing the server code to understand it better and found that the prompt needs to follow a very specific format, or it won't work well:
prompt: `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:[img-12]${message}\nASSISTANT:`
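Putting it together, the payload construction from my snippet above ends up looking roughly like this. One caveat I'm inferring from the server code rather than any docs: the number in the [img-N] tag has to match the id in image_data, so the id: 1 from my original code becomes 12 here:

  // Assumption from reading the server code: the [img-12] placeholder is
  // replaced with the embedding of the image_data entry whose id is 12,
  // so the two numbers must match (my original code used id: 1).
  const systemPrompt =
    "A chat between a curious human and an artificial intelligence assistant. " +
    "The assistant gives helpful, detailed, and polite answers to the human's questions.";

  const payload = {
    prompt: `${systemPrompt}\nUSER:[img-12]${message}\nASSISTANT:`,
    image_data: [{ data: base64data, id: 12 }],
    n_predict: 256,
    top_p: 0.5,
    temp: 0.2
  };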