CodeGriot

[–] CodeGriot@alien.top 1 points 9 months ago

I don't think you should be surprised that a 34B model is mostly failing, considering that a 200B-class model (GPT-3.5) only gets to 40%. What you're asking the LLM to do is very hard for it without further training/tuning.

 

Hi all, I admit I didn't pay much attention to OpenAI's dev day, so I got tripped up this evening when I did a virtual env refresh and all my local LLM access code broke. Turns out they massively revved their API. This is mostly news for folks like me who either maintain an LLM-related project or just prefer to write their own API access clients. I think there are enough of us here to share some useful notes; I see y'all posting Python code now and then.

Anyway, the best news is that they deprecated the nasty old "hack this imported global resource" approach in favor of something more encapsulated.
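
For contrast, the old pre-1.0 pattern looked roughly like this (reconstructing from memory, so treat it as a sketch rather than gospel):

import openai
openai.api_key = "dummy"
openai.api_base = "http://127.0.0.1:8000/v1"
resp = openai.ChatCompletion.create(
    model="dummy",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(resp["choices"][0]["message"]["content"])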

Here's a quick example of using the updated API with a local LLM (at localhost). Works with my llama-cpp-python hosted LLM. Their new API docs are a bit on the sparse side, so I had to do some spelunking in the upstream code to straighten it all out.

from openai import OpenAI
client = OpenAI(api_key='dummy', base_url='http://127.0.0.1:8000/v1/')
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    # Use whatever model name llama-cpp-python (or your server of choice) mounts
    model="dummy",
)
print(chat_completion.choices[0].message.content)

The final line prints just the text of the first choice's response message, which you may recognize from the underlying JSON structure.
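
Streaming looks like it follows the same pattern with the new client. Here's a sketch based on the new API shape (I haven't exercised it much against llama-cpp-python yet, so consider it unverified):

stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say this is a test"}],
    model="dummy",
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; content can be None on some chunks
    text = chunk.choices[0].delta.content
    if text:
        print(text, end="", flush=True)
print()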

I'll continue to maintain notes in this ticket as I update OgbujiPT (open source client-side LLM toolkit), but I'll also update this thread with any other really interesting bits I come across.

[–] CodeGriot@alien.top 1 points 10 months ago

You probably need to wait for the Mac Studio refresh announcements for something more clearly relevant to LLM devs. Hopefully those will come with 256GB or more of unified memory, but that's likely a 2024 thing.

That said, it's handy to be able to run inference on a q8 70B model on your local dev box, so the 96GB and 128GB configs are interesting for that.
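
Rough math on why those configs are the interesting ones (ballpark only; q8 is roughly one byte per parameter, and I'm hand-waving the KV cache and OS overhead):

# Ballpark memory estimate for a 70B model quantized to q8 (~1 byte per parameter)
weights_gb = 70e9 * 1.0 / 1e9    # ~70 GB just for the weights
kv_cache_gb = 8                  # rough allowance for KV cache at a few thousand tokens of context
print(f"~{weights_gb + kv_cache_gb:.0f} GB")  # ~78 GB: fits in 96GB/128GB, doesn't in 64GB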