this post was submitted on 26 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

I spun up a simple project (home surveillance system) to play around with ShareGPT4V-7B and made quite a bit of progress over the last few days. However, I'm having a really hard time figuring out how to send a simple prompt along with the image-to-text request. Here is the relevant code:

document.getElementById('send-chat').addEventListener('click', async () => {
  const message = document.getElementById('chat-input').value;
  appendUserMessage(message);
  document.getElementById('chat-input').value = '';

  // The current frame is shown as a CSS background image;
  // slice(5, -2) strips the surrounding url("...") wrapper to get the URL.
  const imageElement = document.getElementById('frame-display');
  const imageUrl = imageElement.style.backgroundImage.slice(5, -2);

  try {
    // Fetch the frame and base64-encode it for the llama.cpp server.
    const imageBlob = await fetch(imageUrl).then(res => res.blob());
    const reader = new FileReader();

    reader.onloadend = async () => {
      // readAsDataURL yields a data: URL, so split off the header to get the raw base64.
      const base64data = reader.result.split(',')[1];

      const imageData = {
        data: base64data,
        id: 1
      };

      const payload = {
        prompt: message,
        image_data: [imageData],
        n_predict: 256,
        top_p: 0.5,
        temp: 0.2
      };

      const response = await fetch("http://localhost:8080/completion", {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload)
      });

      const data = await response.json();
      console.log(data);
      appendAiResponse(data.content);
    };

    reader.readAsDataURL(imageBlob);
  } catch (error) {
    console.error('Error encoding image or sending request:', error);
  }
});

The only thing that works is sending an empty space or sometimes a question mark, and I'll get a general interpretation of the image. What I really want is to be able to instruct the model so it knows what to look for. Is that something that's currently possible? Basically, system prompting the vision model.

[–] paryska99@alien.top 1 points 11 months ago (1 children)

Doesn't the LlamaCpp server host a GUI for multimodal? You could potentially visit it, open the developer panel in your browser, and observe the HTTP requests being sent.

[–] LyPreto@alien.top 1 points 11 months ago

I ended up just scrutinizing the server code to understand it better and found that the prompt needs to follow a very specific format or else it won't work well:

prompt: `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:[img-12]${message}\nASSISTANT:`
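
For reference, here's roughly how that looks when plugged back into the original snippet. This is just a minimal sketch, assuming the [img-N] tag in the prompt has to match the id you pass in image_data (12 in this example), and reusing the message and base64data variables from the code above:

// Sketch only: wrap the user's message in the llava-style chat template.
// Assumption: the [img-12] tag refers to the id given in image_data below.
const prompt = `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:[img-12]${message}\nASSISTANT:`;

const payload = {
  prompt: prompt,
  image_data: [{ data: base64data, id: 12 }],
  n_predict: 256,
  top_p: 0.5,
  temp: 0.2
};

With the template in place, the instruction you type ends up inside the USER: turn, which is effectively the "system prompting" the original question was asking about.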