this post was submitted on 16 Nov 2023

Why is Mistral-7b so capable? Any ideas re: dataset? (alien.top)

submitted 2 years ago by Fun_Tangerine_1086@alien.top to c/localllama@poweruser.forum

So Mistral-7b is a pretty impressive 7B param model ... but why is it so capable? Do we have any insights into its dataset? Was it trained very far beyond the scaling limit? Any attempts at open reproductions or merges to scale up # of params?

top 24 comments

[–] Nkingsy@alien.top 1 points 2 years ago

Trained on a larger number of tokens. All the Llama models appear to be undertrained, especially the 70B.
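To put rough numbers on "undertrained", here is a minimal sketch of tokens-per-parameter arithmetic. It rests on two assumptions that are not from this thread: that every Llama 2 size was trained on about 2 trillion tokens (Meta's reported figure), and that roughly 20 tokens per parameter is a compute-optimal rule of thumb (the Chinchilla heuristic). Parameter counts are the nominal ones. Mistral has not published its token count, so it is left out.

```python
# Rough tokens-per-parameter arithmetic (illustrative only).
# Assumption: ~2 trillion training tokens for each Llama 2 size (Meta's reported figure).
# Assumption: ~20 tokens per parameter as a compute-optimal rule of thumb (Chinchilla).
LLAMA2_TRAINING_TOKENS = 2.0e12
CHINCHILLA_TOKENS_PER_PARAM = 20

# Nominal parameter counts, not exact ones.
LLAMA2_PARAMS = {"7B": 7e9, "13B": 13e9, "70B": 70e9}


def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """Return how many training tokens the model saw per parameter."""
    return n_tokens / n_params


ratios = {
    name: tokens_per_param(LLAMA2_TRAINING_TOKENS, n_params)
    for name, n_params in LLAMA2_PARAMS.items()
}

for name, ratio in ratios.items():
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"Llama 2 {name}: {ratio:.0f} tokens/param ({multiple:.1f}x the rule of thumb)")
```

Under these assumptions the 70B model sits closest to the rule of thumb, while the 7B model is already far past it. That is consistent with the 70B being the least trained relative to its size, and it suggests small models can keep improving well beyond the compute-optimal point.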

[–] dipittydoop@alien.top 1 points 2 years ago

They didn't lobotomize it for safety.

[–] meetrais@alien.top 1 points 2 years ago (2 children)

I second this. Mistral-7B gave me good results. After fine-tuning, its results are even better.

[–] kaszebe@alien.top 1 points 2 years ago

Mistral-7B gave me good results

Can you expand upon that? Do you mean in terms of its ability to write at a college level without major grammatical errors?

[–] PwanaZana@alien.top 1 points 2 years ago (1 children)

Are there notable finetunes to your knowledge? I've started using LLMs today, starting with openorca mistral 7B and it seems pretty good.

[–] meetrais@alien.top 1 points 2 years ago

On HuggingFace you can find many fine-tuned/quantized models. Look for models from TheBloke on HuggingFace.

[–] involviert@alien.top 1 points 2 years ago

I assume the progress is based on well structured, high quality training data, combined with an incremental "learning schedule". At least that's where some reports of massive progress seem to be coming from and it's also very intuitive that this would help a lot.

[–] PookaMacPhellimen@alien.top 1 points 2 years ago

Lack of censorship is a key factor as it maximises the predictive abilities of the model.

[–] Flamenverfer@alien.top 1 points 2 years ago

It doesn’t seem too capable. Has anyone else tried running this locally or on runpod?

[–] Charuru@alien.top 1 points 2 years ago

The results are okay, but I'm hard-pressed to call it "very capable". My perspective on it is that other bigger models are making mistakes they shouldn't be making because they were "trained wrong".

[–] kindacognizant@alien.top 1 points 2 years ago (1 children)

I'm guessing GQA helped. Llama2 70b and 34b used Grouped Query Attention but it wasn't used for Llama2 7/13b.

https://preview.redd.it/je2q9vhllq0c1.png?width=871&format=png&auto=webp&s=d23b1cdd307dfa54fb4dd788a0f6ea90ee23fa94
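For readers unfamiliar with GQA, here is a small sketch of what it changes mechanically: several query heads share one key/value head, which shrinks the KV cache. The layer and head counts below are assumptions taken from the published model configs, not from this thread, and `AttnConfig` and both helper functions are hypothetical names used only for illustration. Note that this shows a memory saving at inference time; whether GQA contributes to output quality is the commenter's guess, not something the arithmetic demonstrates.

```python
# KV-cache size per token with and without grouped-query attention (GQA).
# The config values are assumptions from the published model configs.
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    name: str
    n_layers: int
    n_heads: int     # number of query heads
    n_kv_heads: int  # number of key/value heads (equals n_heads for plain multi-head attention)
    head_dim: int


def kv_cache_bytes_per_token(cfg: AttnConfig, bytes_per_value: int = 2) -> int:
    """Bytes of keys and values cached per token (2 bytes per value = fp16)."""
    # Factor of 2 accounts for storing both K and V.
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * bytes_per_value


def kv_head_for_query_head(cfg: AttnConfig, q_head: int) -> int:
    """In GQA, each group of consecutive query heads shares one K/V head."""
    group_size = cfg.n_heads // cfg.n_kv_heads
    return q_head // group_size


llama2_7b = AttnConfig("Llama 2 7B (MHA)", n_layers=32, n_heads=32, n_kv_heads=32, head_dim=128)
mistral_7b = AttnConfig("Mistral 7B (GQA)", n_layers=32, n_heads=32, n_kv_heads=8, head_dim=128)

for cfg in (llama2_7b, mistral_7b):
    kib = kv_cache_bytes_per_token(cfg) / 1024
    print(f"{cfg.name}: {kib:.0f} KiB of KV cache per token")
```

With these assumed configs, the GQA model caches a quarter as many key/value bytes per token, because four query heads share each K/V head.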

[–] Monkey_1505@alien.top 1 points 2 years ago

Knowledge is a strange goal for any model when we have the internet. IMO. Just connect your model to a web search.

[–] Dorialexandre@alien.top 1 points 2 years ago (1 children)

My current hunch is that they use a lot of online resources that are not easily accessible (including a specific archive owned by someone named Anna).

[–] Hulksulk666@alien.top 1 points 2 years ago

oh, anna !

[–] Technical_Spirit_622@alien.top 1 points 2 years ago

Is there any version of Mistral or Llama 2 with RLHF applied that can handle text summarisation without the censorship? Sometimes the output is totally different from what one would expect given the input sentences, even if I state in the prompt to avoid applying censorship and focus on the input.

[–] qubedView@alien.top 1 points 2 years ago

Do people find that it holds up in use? Or are we mostly going on benchmarks? I’m skeptical of benchmarks, and a highly performant 7B model would be of great use.

[–] Commercial_Jicama561@alien.top 1 points 2 years ago

French qualité. Yes, this is a thing now. Get used to it. HuggingFace is French too.

[–] obeymypropaganda@alien.top 1 points 2 years ago (1 children)

They matched parameters and tokens when training.

The "No Priors" podcast on Spotify has an episode with the CEO of Mistral, who discusses this.

[–] selflessGene@alien.top 1 points 2 years ago

I don’t know what this means but will listen to the podcast to find out

[–] cleverestx@alien.top 1 points 2 years ago

Why can't we get a 20-34B version of this very capable Mistral?

[–] Monkey_1505@alien.top 1 points 2 years ago

Having used it a lot, I can say for sure that without much prompting it readily produces junk web text, URLs, etc., so it is not a fully filtered or fully synthetic dataset.

My guess would be that it's just 'a bit better filtered than llama-2', and maybe slightly more trained on that set. Slightly better quality set, slightly more trained on that set.

My intuition, based on this, is that per parameter size EVERYTHING open source could be optimized considerably more.

[–] Feztopia@alien.top 1 points 2 years ago

As far as I know (I might be wrong) it's partly the team that made llama1 (and maybe made the first steps for llama2?). So they already knew what they were doing. How llama could be improved* and so on.

*The dataset

[–] FPham@alien.top 1 points 2 years ago

It's simply the time bonus - coming after all the big models.

- better filtering - kill outright junk

- you already have big models (OpenAI and Llama) that you can use for data tuning and filtering

- use available synthetic data
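As a rough illustration of the "kill outright junk" step in the list above, here is a minimal heuristic filter. Everything in it is hypothetical: the function names, the regular expression, and the thresholds are arbitrary choices for the sketch. Mistral's actual data pipeline has not been published, so this is not a reconstruction of it.

```python
# A toy junk filter for web text (hypothetical heuristics and thresholds).
import re

# Matches whitespace-delimited tokens that start like a URL.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def looks_like_junk(
    text: str,
    min_words: int = 5,
    max_url_ratio: float = 0.2,
    min_alpha_ratio: float = 0.6,
) -> bool:
    """Flag text that is very short, URL-heavy, or mostly non-alphabetic."""
    words = text.split()
    if len(words) < min_words:
        return True
    # Share of tokens that look like URLs.
    url_ratio = sum(1 for w in words if URL_RE.match(w)) / len(words)
    # Share of characters that are letters or whitespace.
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)
    return url_ratio > max_url_ratio or alpha_ratio < min_alpha_ratio


def filter_corpus(docs: list[str]) -> list[str]:
    """Keep only the documents that do not look like junk."""
    return [doc for doc in docs if not looks_like_junk(doc)]


docs = [
    "Click here http://a.example/x http://b.example/y http://c.example/z now",
    "Grouped query attention lets several query heads share one key and value head.",
    "buy now",
]
kept = filter_corpus(docs)
print(kept)
```

In practice the bigger models mentioned above could replace or supplement hand-written rules like these, for example by scoring documents for quality, but that part is not sketched here.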

[–] synaesthesisx@alien.top 1 points 2 years ago

We’re only in the first inning too. Buckle up