I think they changed it to say it's still an experiment and that they are finishing evaluations to better understand the model.
bot-333
There are SO many models "bullshitting through some benchmarks or some other shenanigans" that I'm cooking my own benchmark system LOL.
"New RLAIF Finetuned 7b Model" Interesting. "beats Openchat 3.5" Nice! "and comes close to GPT-4" Bruh.
Another day, another person trying to spin a whole story out of two tokens.
Not sure if self-promotion is allowed here. I found my own IS-LM 3B to be the most coherent, verbose, and factually correct 3B I've tried. IMO it's better than Rocket 3B, but it scores worse in benchmarks. I suspect contamination in Rocket 3B.
Can you try my new IS-LM? GGUF: https://huggingface.co/UmbrellaCorp/IS-LM-3B_GGUF. I found it really good. Thanks.
I suggest you try IS-LM 3B.
Are you using the correct prompt template?
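This matters more than people think: a minimal sketch of what I mean, assuming a ChatML-style template (the format below is an assumption, not something documented for this particular model — check the model card for the real one):

```python
# Hypothetical illustration: many instruction-tuned models expect a specific
# chat template, and feeding them raw text instead can wreck their output
# (and their benchmark scores). This ChatML-style format is an assumption.
def format_chatml(system: str, user: str) -> str:
    """Wrap a system message and user turn in ChatML-style delimiters."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = format_chatml("You are a helpful assistant.", "Explain RLAIF in one sentence.")
print(prompt)
```

The same question with the wrong delimiters (or none at all) often produces noticeably worse completions, which is why I always ask this first.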
I think I need to remind people of the benchmarks used: MT-Bench and AlpacaEval are terrible benchmarks.
I see that their distilled model is much worse than StableLM 3E1T, so the finetuning improved it a lot. Unfortunately they didn't release the datasets (would that still be considered open source?). Also, I'm pretty sure my StableLM finetunes score better on the Open LLM benchmarks; they just don't allow StableLM models to be submitted.
I guess they might open source the 600B one? They have different names, so maybe different training approaches.