this post was submitted on 26 Nov 2023

Is the Open LLM Leaderboard a reliable source? yi:34B is at the top but I get better results with the neural-chat:7B model (alien.top)

submitted 2 years ago by grigio@alien.top to c/localllama@poweruser.forum

I use q4_K_M in both cases.

[–] ThisGonBHard@alien.top 1 points 2 years ago

While the benchmarks tend to be gamed, especially by small models, I honestly think something is wrong with how you run it.

Yi-34B trades blows with Llama 2 70B in my personal tests, where I make it do novel tasks invented by me, not the gamed benchmarks.

Comparing ANY 7B model to a 34B or 70B is like putting a 7-year-old up against a renowned professor.

[–] meetrais@alien.top 1 points 2 years ago (1 children)

Same experience here. I got excellent results from quantized versions of Intel-Neural-7B and Mistral-7B but bad results with a quantized version of Yi-34B.

[–] Inevitable_Host_1446@alien.top 1 points 2 years ago (2 children)

I'm not sure what the point of Neural-7B is, given that it's a super censored corporate safety bot. If that's what people want they might as well just use ChatGPT, which is faster and better otherwise.

[–] Nixellion@alien.top 1 points 2 years ago

Privacy and cost. Also no, 7B is as fast as or faster than ChatGPT, depending on ChatGPT load.

[–] grigio@alien.top 1 points 2 years ago

neural-chat from Intel is not censored! Just use a good system prompt

[–] FullOf_Bad_Ideas@alien.top 1 points 2 years ago

Are you talking about base yi-34B or a fine-tuned one? The base model will be hard to use but will score pretty high. Benchmarks are generally written with completion in mind, so they work really well on base models. Instruct tuning may make a model much easier to work with, but it won't necessarily make it score higher on benchmarks.

[–] out_of_touch@alien.top 1 points 2 years ago (3 children)

I'm curious what results you're seeing from the Yi models. I've been playing around with LoneStriker_Nous-Capybara-34B-5.0bpw-h6-exl2 and more recently LoneStriker_Capybara-Tess-Yi-34B-200K-DARE-Ties-5.0bpw-h6-exl2 and I'm finding them fairly good with the right settings. I found the Yi 34B models almost unusable due to repetition issues until I tried settings recommended in this discussion:

https://www.reddit.com/r/LocalLLaMA/comments/182iuj4/yi34b_models_repetition_issues/

I've found it much better since.

I tried out one of the neural models and found it couldn't keep track of details at all. I wonder if my settings weren't very good or something. I would have been using an EXL2 or GPTQ version though.

[–] TeamPupNSudz@alien.top 1 points 2 years ago

"I found the Yi 34B models almost unusable due to repetition issues until I tried settings recommended in this discussion:"

I have the same issue with LoneStriker_Nous-Capybara-34B-5.0bpw-h6-exl2. Whole previous messages will often get shoved into the response. I basically gave up and went back to Mistral-OpenHermes.

[–] bacocololo@alien.top 1 points 2 years ago (1 children)

To stop any repetition, you could try adding a stop token such as ‘### Human’ to the model. It works well for me.
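For anyone unfamiliar with the idea: a stop sequence simply cuts the generation at the first occurrence of a marker string. A minimal sketch in plain Python, assuming the backend hands back raw text. `truncate_at_stop` is a hypothetical helper written for illustration, not part of any library, and most frontends expose an equivalent setting so you would not normally write this yourself.

```python
def truncate_at_stop(text, stop_sequences):
    """Return `text` cut at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    # Drop trailing whitespace left in front of the marker.
    return text[:cut].rstrip()


# Illustrative raw output where the model starts speaking for the user.
raw = "Sure, here is the answer.\n### Human: and another question"
print(truncate_at_stop(raw, ["### Human"]))
```

This only helps when the unwanted text starts with a predictable marker, which depends on the prompt format the model was trained on.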

[–] TeamPupNSudz@alien.top 1 points 2 years ago

Capybara doesn't use Alpaca format, so that wouldn't do anything. Regardless, it's not that type of repetition. It's not speaking for the user, it's literally just copy/pasting part of the conversation into the answer.
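One rough way to confirm this kind of copy/paste repetition is to look for the longest run of words that the response shares verbatim with the earlier conversation. The sketch below is a plain word-level longest-common-substring check. The function name and the `min_words` threshold are assumptions chosen for illustration, not settings from any tool mentioned in this thread.

```python
def longest_copied_run(history, response, min_words=8):
    """Longest run of consecutive words in `response` that also appears
    verbatim in `history`; returns "" if it is shorter than `min_words`."""
    h = history.split()
    r = response.split()
    best_len, best_end = 0, 0
    prev = [0] * (len(r) + 1)
    for i in range(1, len(h) + 1):
        curr = [0] * (len(r) + 1)
        for j in range(1, len(r) + 1):
            if h[i - 1] == r[j - 1]:
                # Extend the matching run that ended at the previous words.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], j
        prev = curr
    if best_len < min_words:
        return ""
    return " ".join(r[best_end - best_len:best_end])


history = "the quick brown fox jumps over the lazy dog near the river bank today"
response = "Well, the quick brown fox jumps over the lazy dog again"
print(longest_copied_run(history, response))
```

A long copied run on most turns points at sampling settings or prompt format rather than at a missing stop token.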

[–] USM-Valor@alien.top 1 points 2 years ago

I've had the same experiences with the Yi finetunes. I tried them on single-turn generations and they were very promising. However, starting with one from scratch I was having a ton of repetition and looping. Some models need a very tight set of parameters to get them to perform well, whereas other ones will function well under almost any sane set of guidelines. I'm thinking Yi leans more towards the former, which will have users thinking they are inferior to simpler, but more flexible models.

[–] phree_radical@alien.top 1 points 2 years ago (1 children)

Most of the benchmarks seem to measure regurgitation of factual knowledge, i.e. in-weights learning, which IMO everyone should accept as a misguided choice of task. They should be testing in-context learning instead, which I would argue was the goal of LLM training. I'd say they are probably harmful to the cause of improving future LLMs.

[–] andrewlapp@alien.top 1 points 2 years ago

I agree, and the Leaderboard's newly added DROP metric is a step in the right direction.

[–] Yes_but_I_think@alien.top 1 points 2 years ago

You are hallucinating?

[–] Sunija_Dev@alien.top 1 points 2 years ago

90% of the time a bigger model is "worse" because...

A) I messed up the prompt format

B) (For roleplaying) Smaller models seem more creative, because they're less consistent. But after some messages, the missing consistency makes them really bad.
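To make point A concrete, here is a small sketch of how the same user message looks under two different prompt formats. Both templates are simplified illustrations (the Alpaca one omits its usual preamble), and the template names are made up for this example. The model card of the specific finetune is the authority on which format it expects.

```python
# Simplified, illustrative templates; check the model card for the real one.
TEMPLATES = {
    "alpaca": "### Instruction:\n{prompt}\n\n### Response:\n",
    "user_assistant": "USER: {prompt} ASSISTANT:",
}


def build_prompt(template_name, prompt):
    """Wrap a user message in the named prompt template."""
    return TEMPLATES[template_name].format(prompt=prompt)


message = "Summarize this paragraph in one sentence."
for name in TEMPLATES:
    print(f"--- {name} ---")
    print(build_prompt(name, message))
```

Feeding a model the wrong one of these usually still produces text, just noticeably worse text, which is why a format mix-up is easy to mistake for a bad model.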

[–] Cless_Aurion@alien.top 1 points 2 years ago

yeah, no lol

No 7B model is going to beat a 34B model anytime soon.

[–] VertexMachine@alien.top 1 points 2 years ago (1 children)

It's a source. But synthetic benchmarks rarely give you the whole picture. Plus those test sets are public, so there is some incentive for some people to game the system (and even without that, those data sets are most likely already in the training data).

[–] TobyWonKenobi@alien.top 1 points 2 years ago

I’ve had the same experience. Are you using GGUF? I do, and I’ve heard that Yi may suffer from GGUF. So EXL2 might be better… I need to try it and see.

[–] idnc_streams@alien.top 1 points 2 years ago

Reliable? It never was. Informative to a certain extent, yes.

[–] nixudos@alien.top 1 points 2 years ago

Does anyone have a setup that works with a Yi34b model on 12GB vram and 32GB ram? I have tried using GGUF but I always end up with the answer. I'm on Oobabooga and any specific settings to overcome this hurdle with GGUF would be greatly appreciated!
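As a back-of-the-envelope check on whether a 34B GGUF can be split across 12GB of VRAM and 32GB of RAM, the arithmetic can be sketched like this. Every number in the example is an assumption to replace with real values: roughly 4.85 bits per weight for q4_K_M, 60 transformer layers, and a 10GB VRAM budget that leaves headroom for context. The estimate covers weights only, not the KV cache or runtime overhead.

```python
def estimate_weights_gb(n_params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def layers_that_fit(weights_gb, n_layers, vram_budget_gb):
    """How many equally sized layers fit in the VRAM budget (rough estimate)."""
    per_layer_gb = weights_gb / n_layers
    return min(n_layers, int(vram_budget_gb // per_layer_gb))


# Assumed values, not measurements: 34B params, ~4.85 bits/weight, 60 layers.
weights = estimate_weights_gb(34, 4.85)
gpu_layers = layers_that_fit(weights, n_layers=60, vram_budget_gb=10)
print(f"weights ~{weights:.1f} GB, offload about {gpu_layers} layers to the GPU")
```

Under these assumptions roughly half the layers go to the GPU and the rest stay in system RAM, so it should load, just slowly.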

[–] tomccc@alien.top 1 points 2 years ago

I had very bad experiences using all the Yi models until recently. Going to chalk it up as user error on my part. LoneStriker_Capybara-Tess-Yi-34B-200K-DARE-Ties-4.0bpw-h6-exl2 is really good. I made sure to have all the right settings.

[–] FPham@alien.top 1 points 2 years ago

My private finetunes are about text rewriting: take an input paragraph and rewrite it in a certain style.

No finetuned 7b model can grasp the idea of the submitted text in its entirety; I tried maybe 100 different runs. It makes the common mistake of "someone" who just scans the text quickly while also watching youtube on a phone, failing to comprehend who is who or what the paragraph is about.

13b with the same finetuning does much better - it comprehends the relations. For example, if two people are speaking, it can keep track of who is who, even when the text doesn't say so explicitly.

33b gets even further - it sometimes surprises me with the way it understands the text. And so the rewritten text is a mirror image of the input, just in a different style.

7b models are impressive if you want a small local LLM to answer questions, but that's probably the limit. If you want an assistant that can also do other things, then it falls short, because your instructions are not necessarily understood fully.