LyPreto

joined 1 year ago
[–] LyPreto@alien.top 1 points 11 months ago (4 children)

I saw their 7B model closing in on GPT-4 scores in some benchmarks, which is absolutely wild but also sus

[–] LyPreto@alien.top 1 points 11 months ago

I ended up just scrutinizing the server code to understand it better and found that the prompt needs to follow a very specific format or else it won't work well:

prompt: `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:[img-12]${message}\nASSISTANT:`
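
So the request body ends up looking roughly like this (a quick sketch; buildLlavaPrompt is just a helper name I'm making up here, and the sampling params are whatever I happened to have in my handler):

const buildLlavaPrompt = (userMessage) =>
  `A chat between a curious human and an artificial intelligence assistant. ` +
  `The assistant gives helpful, detailed, and polite answers to the human's questions.` +
  `\nUSER:[img-12]${userMessage}\nASSISTANT:`;

// the [img-12] tag in the prompt has to match the id in image_data,
// otherwise the server has nothing to splice the image embedding into
const payload = {
  prompt: buildLlavaPrompt(message),
  image_data: [{ data: base64data, id: 12 }],
  n_predict: 256,
  top_p: 0.5,
  temp: 0.2
};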

[–] LyPreto@alien.top 1 points 11 months ago

you have all the APIs, what's stopping you from putting something like this together? Personally, the only challenge for me is finding projects compatible with M1 that offer Metal offloading, but on Linux it should be relatively straightforward to implement

 

I spun up a simple project (home surveillance system) to play around with ShareGPT4V-7B and made quite a bit of progress over the last few days. However, I'm having a really hard time figuring out how to send a simple prompt along with the image-to-text request. Here is the relevant code:

document.getElementById('send-chat').addEventListener('click', async () => {
  const message = document.getElementById('chat-input').value;
  appendUserMessage(message);
  document.getElementById('chat-input').value = '';

  // the current frame is set as a CSS background-image: url("...")
  const imageElement = document.getElementById('frame-display');
  const imageUrl = imageElement.style.backgroundImage.slice(5, -2);

  try {
    const imageBlob = await fetch(imageUrl).then(res => res.blob());
    const reader = new FileReader();

    reader.onloadend = async () => {
      // strip the "data:image/...;base64," prefix
      const base64data = reader.result.split(',')[1];

      const imageData = {
        data: base64data,
        id: 1
      };

      const payload = {
        prompt: message,
        image_data: [imageData],
        n_predict: 256,
        top_p: 0.5,
        temp: 0.2
      };

      // llama.cpp server's completion endpoint
      const response = await fetch("http://localhost:8080/completion", {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload)
      });

      const data = await response.json();
      console.log(data);
      appendAiResponse(data.content);
    };

    reader.readAsDataURL(imageBlob);
  } catch (error) {
    console.error('Error encoding image or sending request:', error);
  }
});

The only thing that works is sending an empty space or sometimes a question mark, and I'll get a general interpretation of the image, but what I really want is to be able to instruct the model so it knows what to look for. Is that something that's currently possible? Basically system prompting the vision model.

[–] LyPreto@alien.top 1 points 11 months ago (1 children)

not prefer it, but recognize its user base. Metal plus the unified memory have a lot to offer and the compute is there; there's just really no adoption other than a few select projects like llama.cpp and some of the other text-inference engines.

[–] LyPreto@alien.top 1 points 11 months ago (5 children)

I really wish MPS were more widely adopted by now… I hate seeing just CUDA or CPU in all these new libraries

[–] LyPreto@alien.top 1 points 11 months ago (1 children)

Tried Coqui and had issues with performance; from what I read online it doesn't seem to fully support MPS.

For now I'm using edge-tts, which is doing the trick and is pretty decent/free (rough sketch of how I call it below).

Is XTTS supported on Macs?
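
This is roughly how I'm calling it, just shelling out to the edge-tts CLI from Node; the voice name, output path, and player are placeholders from my setup, so adjust as needed:

// rough sketch: generate speech with the edge-tts CLI, then play it
const { execFile } = require('child_process');

function speak(text, outFile = '/tmp/tts.mp3') {
  execFile('edge-tts', [
    '--voice', 'en-US-GuyNeural',
    '--text', text,
    '--write-media', outFile
  ], (err) => {
    if (err) return console.error('edge-tts failed:', err);
    execFile('afplay', [outFile]); // macOS audio player
  });
}

speak('All set, watching the camera feed now.');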

[–] LyPreto@alien.top 1 points 11 months ago (1 children)

Update: a quick Reddit search (which I should've done prior to posting, tbh) led me to this post: ai_voicechat_script

 
  • So, I've been doing all my LLM tinkering on an M1, using llama.cpp/whisper.cpp to run a basic voice-powered assistant, nothing new at this point (rough sketch of that loop after this list).
  • Currently adding a visual component to it: ShareGPT4V-7B, assuming I manage to convert it to GGUF. Once that's done I should be able to integrate it with llama.cpp and wire it to a live camera feed, giving it eyes.
  • Might even get crazy and throw in a low-level component to handle basic object detection, letting the model know when something is being "shown" to it; other than that it will activate when prompted to do so (text or voice).
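
The core loop mentioned above looks roughly like this; a sketch only, assuming a recorded WAV on disk, a whisper.cpp build at ./whisper.cpp/main, and the llama.cpp server on port 8080 (all paths, model files, and flags are just from my setup):

// transcribe a recorded clip with whisper.cpp, then send the text to llama.cpp
const { execFileSync } = require('child_process');

async function handleVoiceTurn(wavPath) {
  // whisper.cpp prints the transcription to stdout; -nt drops timestamps
  const transcript = execFileSync('./whisper.cpp/main', [
    '-m', './models/ggml-base.en.bin',
    '-f', wavPath,
    '-nt'
  ]).toString().trim();

  const res = await fetch('http://localhost:8080/completion', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: transcript, n_predict: 256 })
  });
  const data = await res.json();
  return data.content; // hand this off to TTS
}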

The one thing I'm not sure about is how to run a TTS engine like StyleTTS2-LJSpeech locally. Are there libraries that support TTS models?

[–] LyPreto@alien.top 1 points 11 months ago

The licensing on this blows, but they have a very unique model IMO: StyleTTS

It picks up the appropriate voice/intonation according to the text, which I personally haven't seen being done yet!

[–] LyPreto@alien.top 1 points 11 months ago

claude is dogshit for code generation in my experience

[–] LyPreto@alien.top 1 points 11 months ago (1 children)

damn llama.cpp has a monopoly indirectly 😂

 

Have been thinking about this for a while; does anyone know how feasible this is? Basically just applying some sort of "LoRA" on top of models to give them vision capabilities, making them multimodal.
