this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


https://preview.redd.it/3krgd1sg2z2c1.png?width=800&format=png&auto=webp&s=b76c5fb9fa22938c74ec3095f63adaec8ff2219d

I came across this new finetuned model based on OpenChat 3.5, which was apparently trained using Reinforcement Learning from AI Feedback (RLAIF).

https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha

Check out this tweet: https://twitter.com/bindureddy/status/1729253715549602071

[–] metalman123@alien.top 1 points 11 months ago

Was wondering how long this would take to show up.

[–] allinasecond@alien.top 1 points 11 months ago (1 children)

That gap in coding is what makes me stay with GPT-4 until I don't.

[–] RelevantFoundation14@alien.top 1 points 11 months ago (1 children)

Have you tried DeepSeek? It's pretty good at most things I've asked it to do with Python.

[–] geepytee@alien.top 1 points 11 months ago

It is pretty good, just not as good.

[–] bot-333@alien.top 1 points 11 months ago (1 children)

"New RLAIF Finetuned 7b Model" Interesting. "beats Openchat 3.5" Nice! "and comes close to GPT-4" Bruh.

[–] Evening_Ad6637@alien.top 1 points 11 months ago (1 children)

Heheh, I can't read that anymore... to be honest, I've become very prejudiced when it comes to any comparison with GPT-4.

People really have to understand that even GPT-4 has been aligned, lobotomized, and massively downgraded in terms of its performance, due to security reasons (which is understandable to me), but this thing is still an absolute beast. If we consider all the restrictions GPT-4 has to undergo, all the smartness at OpenAI, all the resources at Microsoft and so on, we have to realize that currently nothing is really comparable to GPT-4. Especially not 7B models.

[–] noeda@alien.top 1 points 11 months ago (3 children)

I've seen "... beats GPT-4" enough times that now, whenever I see a title suggesting a tiny model can compete with GPT-4, I take it as a negative signal: that the authors are bullshitting through some benchmarks or some other shenanigans.

It's annoying because the models might be legitimately good for open models within their weight class, but now you've put my brain in BS-detecting mode and I can't trust that you've done good-faith measurement anymore.

[–] Evening_Ad6637@alien.top 1 points 11 months ago (1 children)

Yeah, I don't think the authors are intentionally bullshitting or intentionally doing "benchmark cosmetics"; maybe it's more a lack of knowledge about what's going on with (most of) the benchmarks and how their image has been ruined in the meantime.

[–] Competitive_Ad_5515@alien.top 1 points 11 months ago

Sure, but name-dropping the biggest name in the game and comparing yourself favourably to it is a big swing. It's either a naive marketing claim at best, or it's untrue.

[–] bot-333@alien.top 1 points 11 months ago

There are SO many models "bullshitting through some benchmarks or some other shenanigans" that I'm cooking my own benchmark system LOL.

[–] Kep0a@alien.top 1 points 11 months ago

Yeah I just roll my eyes and continue onwards

[–] jeffwadsworth@alien.top 1 points 11 months ago

Hard to believe but can’t wait to try.

[–] PrometheusZer0@alien.top 1 points 11 months ago (1 children)

Does somebody have a prompt template for this? Trying to run it in Ollama.

[–] PrometheusZer0@alien.top 1 points 11 months ago (1 children)

Here's what I'm using:

FROM starling-lm-7b-alpha.Q5_K_M.gguf
PARAMETER stop "<|end_of_turn|>"
PARAMETER stop "<|im_sep|>"
TEMPLATE """
GPT4 User: {{.Prompt}}<|end_of_turn|>GPT4 Assistant:
"""

[–] visarga@alien.top 1 points 11 months ago (2 children)

How do you add your own GGUF into Ollama? It seems to store models as cryptic binary blobs in a folder.

[–] dododragon@alien.top 1 points 11 months ago

1. Generate the SHA-256 hash using: sha256sum your_model.gguf

2. Rename your_model.gguf to "sha256:_hash_" (replace _hash_ with the actual hash).

3. Move it into the /usr/share/ollama/.ollama/models/blobs folder.

4. Copy a manifest from a similar model in /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library and update the hash and file size to match your model in the "image.model" entry.

5. Repeat the last step for the params entry.

You can call the manifest folder/file whatever you like.
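
A minimal Python sketch of steps 1-3, in case you want to script them (the blob directory and "sha256:" naming follow the steps above and may differ across Ollama versions; install_gguf_as_blob is just a made-up helper name):

import hashlib
import shutil
from pathlib import Path

def install_gguf_as_blob(gguf_path, blob_dir="/usr/share/ollama/.ollama/models/blobs"):
    """Hash a GGUF file and move it into Ollama's blob store under its digest name."""
    src = Path(gguf_path)
    h = hashlib.sha256()
    with src.open("rb") as f:
        # Hash in 1 MiB chunks so large models don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = f"sha256:{h.hexdigest()}"
    size = src.stat().st_size
    shutil.move(str(src), str(Path(blob_dir) / digest))
    # The manifest still has to be copied and edited by hand (steps 4-5 above);
    # print the two values you need to paste into it.
    print(f"digest: {digest}  size: {size}")

install_gguf_as_blob("your_model.gguf")

Note that the simpler supported route is a Modelfile whose FROM line points at your GGUF (like the one posted above) plus "ollama create mymodel -f Modelfile", which builds the blob and manifest for you.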

[–] noeda@alien.top 1 points 11 months ago (1 children)

From the first image posted, it looks like it's not even close to GPT-4?

[–] Real-Elk-6109@alien.top 1 points 11 months ago

Considering “close” as a relative word, it came closer than other open-source models. But you have a point too.

[–] alexthai7@alien.top 1 points 11 months ago (1 children)

Does anyone know why it writes the line-feed code <0x0A> in its answers all the time?

Besides this, I find the model amazing.

[–] HenkPoley@alien.top 1 points 11 months ago (1 children)

It has been fixed in the unquantized model; they forgot to upload the tokenizer files. https://twitter.com/banghuaz/status/1729375878612922724?s=12

https://huggingface.co/TheBloke/Starling-LM-7B-alpha-GGUF/discussions/1#65657dc79bf6665f10ebd941

Looks like TheBloke hasn’t picked it up. But then it has only been an hour 😂

[–] -Shasho-@alien.top 1 points 11 months ago

It's fixed now.

[–] thereisonlythedance@alien.top 1 points 11 months ago

I was sceptical, but darn, it's good. Mistral is a fantastic base, and with this technique these guys have pushed it another step closer. A lot of the answers I'm getting are on par with old GPT-4 (pre-turbo; turbo in the API is a step up on old GPT-4, IMO).

[–] Dankmemexplorer@alien.top 1 points 11 months ago (1 children)

the model can have a little of the test data as a treat

[–] Sweet_Protection_163@alien.top 1 points 11 months ago (1 children)

I can't wait for trustworthy closed-source benchmarks. Can't believe I'm saying that... but it's honestly what we need.

[–] liqui_date_me@alien.top 1 points 11 months ago

Wonder if that's a good startup idea? Something that benchmarks language models and charges a fee for doing so.

[–] pseudonerv@alien.top 1 points 11 months ago (2 children)

From the Hugging Face model card:

Starling-RM-7B-alpha is a reward model trained from Llama2-7B-Chat.

From their webpage, https://starling.cs.berkeley.edu

Our reward model is fine-tuned from Llama2-7B-Chat

Yet the model's config.json says:

"max_position_embeddings": 8192,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
"sliding_window": 4096,

SO? Whoever is doing the PR has no f***ing idea what their student laborers are actually doing.

[–] visarga@alien.top 1 points 11 months ago

Yeah, I was put off by the lack of mention of the base model.

[–] Warm_Shelter1866@alien.top 1 points 11 months ago

What does it mean for an LLM to be a reward model? I always thought of rewards only in the RL field. And how would the reward model be used during finetuning?
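
Roughly: a reward model takes a (prompt, response) pair and returns a scalar score for how preferred the response is, and the finetuning step then optimizes against that score, either with PPO-style RL or simply with best-of-n sampling. A minimal sketch of the idea (the checkpoint name is hypothetical, and this is not Starling's actual code):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "some-org/reward-model"  # hypothetical reward-model checkpoint

tok = AutoTokenizer.from_pretrained(RM_NAME)
rm = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)

def reward(prompt: str, response: str) -> float:
    # The RM reads the (prompt, response) pair and emits one scalar logit.
    inputs = tok(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0, 0].item()

# Best-of-n is the simplest way to use a reward model: generate n candidate
# responses with the policy model, keep the one the RM scores highest.
candidates = ["Paris is the capital of France.", "I think it might be Lyon."]
best = max(candidates, key=lambda r: reward("What is the capital of France?", r))
print(best)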

[–] georgejrjrjr@alien.top 1 points 11 months ago

If there is something somehow inherently superior about having a separate reward model, that should be teased out.

It would be nice to see stronger baselines / ablations for this reason. I realize it’s nigh impossible to keep up with the unrelenting pace of advances, so I don’t fault the authors here. That said, if there isn’t a compelling reason to keep the separate preference model, community people-hours will probably be best spent sticking with DPO/IPO to avoid the hyper-parameter tuning rabbit hole.

My guess: the way things are going, we'll soon see a rough consensus emerge around a sane default DPO or Identity-PO recipe for finetunes (the same way we've seen gradual convergence around decoder-only transformer + rotary positional embeddings + grouped-query attention + FlashAttention 2), to be applied absent a compelling reason to use a different reward signal.
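
For reference, the DPO objective mentioned above skips the separate reward model entirely and trains the policy directly on preference pairs, with preferred completion y_w and rejected completion y_l; in the standard notation of the DPO paper:

\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Here β is essentially the only knob, which is exactly the hyper-parameter simplicity being contrasted with a separate reward model.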

No matter what, preference datasets like this are helpful. Pity about the license being claimed here; it's hard to imagine it would hold up, but the specter is a bit of a hindrance.

[–] Thistleknot@alien.top 1 points 11 months ago

The RM is the reward model, not the same as the LM. I tried the LM and wasn't impressed; GPT-3.5 did better at summarizing quotes. It was good, but I honestly think OpenHermes and/or Synthia 1.3b do better.

[–] OC2608@alien.top 1 points 11 months ago (1 children)

How to earn VC money 101: "Beats GPT-4!"

And voilà! You're rich now.

[–] _Lee_B_@alien.top 1 points 11 months ago

And voilà! You work for investors now.

[–] OccasionallyImmortal@alien.top 1 points 11 months ago (1 children)

It repeats itself and seems incapable of giving a response shorter than 200 words.

[–] -Shasho-@alien.top 1 points 11 months ago

They forgot to include the tokenizer files from OpenChat 3.5, which caused some weirdness for me with newline characters, among other things, in the GGUF I got from TheBloke. The original repo has been fixed, but I have yet to see a new GGUF.

[–] sahil1572@alien.top 1 points 11 months ago (2 children)

Every other model nowadays claims to match GPT-4, and they turn out to be < GPT-3. I don't know what kind of tests they use to score.

[–] Sakura9095@alien.top 1 points 11 months ago

Can it do NSFW? Anyone have examples? Roleplay?

[–] Wonderful_Ad_5134@alien.top 1 points 11 months ago (1 children)

"Close to GPT4" is as true as "Me, Close to Usain bolt in the 100m dash" lol

[–] Maleficent_Employ693@alien.top 1 points 11 months ago

Nope, the research and proof are right here. It's not the parameters but the quality of the data that's the way, my brotha.

[–] SocketByte@alien.top 1 points 11 months ago

A 7B model "close" to the 1.7T MoE GPT-4? What's next, 3B AGI? Tired of those meaningless claims, honestly.

[–] Silly-Cup1391@alien.top 1 points 11 months ago

11B is quite nice too

[–] wispiANt@alien.top 1 points 11 months ago (1 children)

Prompt:

Write a brief request (one paragraph) to a college professor asking for updates on funding for a project. Do not use the phrase "I hope this email finds you well" or similar.

Starling:

Subject: Request for Funding Updates on [Project Name]

Dear Professor [Last Name],

I hope you're doing well. I am writing to inquire about the current status of funding for our project, [Project Name]. As you are aware, we have been working diligently on this project and it is crucial for us to receive updates on the financial aspect, particularly regarding any changes or new opportunities that may have arisen.

We understand that securing funding can be a complex process, and we appreciate your continuous efforts in this regard. Your support and guidance have been invaluable, and we eagerly await your response.

Thank you in advance for your attention to this matter, and we look forward to your prompt response.

Best regards,

[Your Name]


Hmm. Even llama-2-7b-chat doesn't struggle with this one.

[–] tamereen@alien.top 1 points 11 months ago

Subject: Request for Funding Update on [Project Name]

Dear Professor [Last Name],

I am writing to inquire about the progress of our research project, "[Project Name]", which we began working on together earlier this year. As we continue to make significant advancements in our research, securing adequate funding remains crucial to ensure its successful completion. It would be greatly appreciated if you could update us on the current status of funding acquisition efforts and share any relevant news or developments. Additionally, if there are any upcoming deadlines or milestones that we should be aware of, kindly inform us so that we can plan accordingly. Your guidance and support throughout this process have been invaluable, and we look forward to receiving your response at your earliest convenience. Thank you very much for your attention to this matter.

Sincerely,

[Your Name]

[–] tamereen@alien.top 1 points 11 months ago

I find this model really good even at coding.

Got better results than the 34B Yi.

[–] ex-arman68@alien.top 1 points 11 months ago

Here is some info I posted for the 11B version of this model, but it is probably useful for the original 7B version as well.

I think I found the key to avoiding the repetitions and long rambling answers this model has a tendency to produce. Hopefully a further finetune will reduce it. The key is to turn creativity all the way down and make the model deterministic. How do you do that, you may ask? Easy: it is controlled by the following three inference parameters: temp, top_p, and top_k.

With the following default settings I often get repetitions or additional rambling information:

    "top_k": 40,
    "top_p": 0.95,
    "temp": 0.8,

If I use the following values instead, to make the model deterministic, the problem seems to be gone:

    "top_k": 1,
    "top_p": 0.1,
    "temp": 0.1,

Please note that if you want to use the model for story writing, you may get better results by dialing the creativity back up.
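
The same near-greedy settings carry over outside LM Studio; here is a minimal llama-cpp-python sketch, assuming a local copy of the GGUF (the file name and prompt text are just examples):

from llama_cpp import Llama

llm = Llama(model_path="starling-lm-7b-alpha.Q5_K_M.gguf")

out = llm(
    "GPT4 Correct User: Name the planets of the solar system.<|end_of_turn|>GPT4 Correct Assistant:",
    max_tokens=256,
    temperature=0.1,
    top_p=0.1,
    top_k=1,   # top_k=1 on its own already makes decoding effectively greedy
    stop=["<|end_of_turn|>"],
)
print(out["choices"][0]["text"])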

Here is my complete config file for LM Studio:

{
  "name": "OpenChat",
  "inference_params": {
    "top_k": 1,
    "top_p": 0.1,
    "temp": 0.1,
    "input_prefix": "GPT4 Correct User: ",
    "input_suffix": "&lt;|end_of_turn|>GPT4 Correct Assistant:",
    "antiprompt": [
      "GPT4",
      "&lt;|end_of_turn|>",
      "[End of Turn]",
      "[]"
    ],
    "pre_prompt": "Below is an instruction that describes a task. Write a concise response that appropriately completes the request. Ensure all essential details are provided. Each of your statements must be unique.",
    "pre_prompt_suffix": "&lt;|end_of_turn|>",
    "pre_prompt_prefix": "GPT4 System: "
  }
}

A few words about the above:

  • I only include necessary options to avoid overwriting user settings when loading the model or switching prompt format. If you export a config file, please make sure you then edit it manually to clean it up.
  • GPT4 Correct User/Assistant: the "Correct" keyword is important. It refers to the training data where the answers were verified as correct. If you do not use it (e.g. GPT4 User or Human User), it will still work, but it will give more weight to training data that was unverified.
  • GPT4 System or just System are the two officially recommended ways to prefix system messages. Either works.
  • In my system message (pre_prompt), I avoid any negatives (e.g. I do not instruct: "Do not repeat yourself"). Remember this is just a language model: if it sees the word "repeat", it will have a tendency to treat it as an instruction to create repetitions! Instead, I turned it around into a positive statement based on the word "unique".

As a bonus, here is my config for generating code, which, according to my limited testing, this model seems to be surprisingly good at:

{
  "name": "OpenChat Code",
  "inference_params": {
    "top_k": 1,
    "top_p": 0.1,
    "temp": 0.1,
    "input_prefix": "Code User: ",
    "input_suffix": "&lt;|end_of_turn|>Code Assistant:",
    "antiprompt": [
      "GPT4",
      "&lt;|end_of_turn|>",
      "[End of Turn]",
      "[]"
    ],
    "pre_prompt": "You are a helpful coding assistant. Respond concisely, but ensure all essential details are provided. Each of your statements must be unique.",
    "pre_prompt_suffix": "&lt;|end_of_turn|>",
    "pre_prompt_prefix": "GPT4 System: "
  }
}