dogesator

joined 10 months ago
[–] dogesator@alien.top 1 points 9 months ago

Predicting the loss is very different from predicting real-world abilities; they're able to do the former, not the latter.

Predicting the future loss once you're already 10% into training is fairly trivial. Predicting the actual abilities, though, is not.
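As a toy illustration of that first point, here's a minimal sketch (synthetic numbers and a hypothetical fit; real scaling-law work is much more careful than this) of fitting a power law to the first 10% of a loss curve and extrapolating to the end:

```python
# Toy sketch: fit a power law to the first 10% of a (synthetic) loss curve,
# then extrapolate it to the end of training. Constants and noise are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(step, a, b, c):
    return a * step ** (-b) + c          # loss ~ a * t^-b + c

steps = np.arange(1, 10_001)
loss = power_law(steps, 5.0, 0.3, 1.8) + np.random.normal(0, 0.01, steps.size)

# Fit on only the first 10% of training...
params, _ = curve_fit(power_law, steps[:1000], loss[:1000], p0=(1.0, 0.5, 1.0))

# ...and predict the loss at the final step.
print("predicted final loss:", power_law(steps[-1], *params))
print("actual final loss:   ", loss[-1])
```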

[–] dogesator@alien.top 1 points 9 months ago (1 children)

Yea, I'm saying that ChatGPT outputs are contained in internet posts from the year 2023, so simply training on 2023 internet data would end up training on ChatGPT data as a side effect.

[–] dogesator@alien.top 1 points 10 months ago (1 children)

The people obsessed with making money? Do you know what a CTO is? Greg Brockman is an engineer and was leading all technical operations at Stripe. He himself has been involved in multiple big papers, and the lead researcher of GPT-4 just quit OpenAI to join Greg and Sam… Ilya is not even listed as a main contributor on the GPT-4 paper, and Greg is…

[–] dogesator@alien.top 1 points 10 months ago

He's been programming since he was 8 and was accepted into Stanford as a teen for computer science; he was quite literally a computer scientist for years before starting his first company.

[–] dogesator@alien.top 1 points 10 months ago

I don't think you realize Sam is literally the CEO of this new company. It's not just some small Microsoft department; it's effectively a new company owned by Microsoft, just like DeepMind has its own CEO but is also technically owned by Google.

The GPT-4 lead has already quit OpenAI to join Sam and Greg, and Greg himself is listed as the lead of the infrastructure team on the GPT-4 paper; in fact, Greg Brockman has more significant main contributions on the official GPT-4 paper than Ilya does.

[–] dogesator@alien.top 1 points 10 months ago

Even if it can get just halfway between GPT-3.5 and GPT-4… that would be big in my opinion.

[–] dogesator@alien.top 1 points 10 months ago
  • MoE

You gloss over “MoE just helps with FLOPS issues” as if that’s not a hugely important factor.

So many people have a 16GB or 24GB GPU, or even a 64GB+ MacBook, that isn't being fully utilized.

Sure, people can load a 30B Q5 model into their 24GB GPU or a 70B Q5 model into the 48GB+ of memory in a MacBook, but the main reason we don't is that it's so much slower, because it takes so many more FLOPS…

People are definitely willing to sacrifice VRAM for speed, and that's what MoE allows you to do.

You can have a 16-sub-network MoE with 100B parameters loaded comfortably into a MacBook Pro with 96GB of memory at Q5, with the most useful 4 sub-networks (25B params) activated for any given token.

Done right, this would benchmark significantly higher than current 33B dense models and act much smarter than a 33B model, while also running at around the same speed as a 33B model.

It's all-around more smarts for the same speed, and the only downside is that it uses extra VRAM that you probably weren't using before anyway.
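To make the tradeoff concrete, here's a rough sketch of a sparsely activated MoE layer in PyTorch (the sizes, router, and expert shapes are illustrative, not any particular model's): every expert's weights have to sit in memory, but only the top-k experts actually run for each token, so compute scales with the ~25B active parameters rather than the full 100B.

```python
# Illustrative sparse MoE layer: 16 expert FFNs are held in memory,
# but each token is routed through only the top 4 of them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # only k expert FFNs run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

# 8 tokens, each routed through 4 of the 16 experts.
print(SparseMoE()(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```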

[–] dogesator@alien.top 1 points 10 months ago

Mistral 7B fine-tunes are already reaching parity with GPT-3.5 on most benchmarks.

I'd be very surprised if Llama-3 70B fine-tunes don't significantly outperform GPT-3.5 on nearly every metric.

[–] dogesator@alien.top 1 points 10 months ago (3 children)

Its referring to itself as a GPT could just come from the pre-training internet data, if it was trained on internet data from 2023.

[–] dogesator@alien.top 1 points 10 months ago

So far I've only benchmarked HellaSwag and ARC Challenge, but it's significantly beating both WizardLM-13B and GPT4-X-Vicuna-13B on both benchmarks! These aren't the latest SOTA models of course, but it's amazing to see this 3B model surpassing the best 13B models of just 6 months ago.

I'll see if we can have it benchmarked officially on the HF leaderboard this week so people can see how it compares with the latest models.

[–] dogesator@alien.top 1 points 10 months ago (3 children)

I can almost guarantee you that Capybara 3B and Obsidian 3B will perform significantly better than Orca Mini. The base model I'm using for the 3B training is the much newer StableLM 3B, trained on 4 trillion tokens, while Orca Mini's base model is OpenLLaMA 3B, which was only trained on around 1-2 trillion tokens and performs significantly worse.

 

Happy to announce my release of Nous-Capybara 7B and 3B V1.9!

The 7B V1.9 version is now trained on Mistral, unlike V1, which was trained on Llama. There are also some significant dataset improvements under the hood.

As for the 3B size, it's the first sub-7B model released under Nous Research and leverages the same dataset as 7B V1.9; it's efficient enough to run briskly on even a non-pro iPhone! It's also currently being used as the foundation of the world's first 3B-parameter multi-modal model, called Obsidian (which should be released by the time of this posting).

Capybara uses a new method called Amplify-Instruct for data creation. This uses popular existing single-turn datasets like Airoboros, EverythingLM, and Know_logic as the seeds from which synthetic long-context, back-and-forth conversational examples are synthesized. (Paper releasing soon with more details.)

The dataset process also includes thousands of top posts scraped from the website LessWrong on certain subjects, discussing deep, complex, long-form concepts around the nature of reality, reasoning, futurism, and philosophy; the Amplify-Instruct technique is then applied to this data to turn it into advanced long-context multi-turn examples. The model is also trained on tasks of summarizing these multi-thousand-token posts, papers, and articles on such topics, and then having back-and-forth conversations discussing variations of those summaries.
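For anyone curious what the seed amplification looks like in practice, here's a rough illustrative sketch of the general idea only, not the actual Amplify-Instruct pipeline (that's reserved for the paper); the prompt wording, the choice of GPT-4 as the synthesizer, and the `openai` client usage are all placeholders:

```python
# Rough sketch of seed amplification: take a single-turn (instruction, response)
# seed and ask an LLM to extend it into a multi-turn conversation.
# NOT the actual Amplify-Instruct pipeline; prompts and model choice are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def amplify(seed_instruction: str, seed_response: str, turns: int = 3) -> list[dict]:
    conversation = [
        {"role": "user", "content": seed_instruction},
        {"role": "assistant", "content": seed_response},
    ]
    for _ in range(turns):
        # Have a strong model play the curious user and extend the dialogue...
        follow_up = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "system", "content": "Write the user's natural follow-up question to this conversation."},
                      *conversation],
        ).choices[0].message.content
        conversation.append({"role": "user", "content": follow_up})

        # ...then generate the assistant's next reply to that follow-up.
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=conversation,
        ).choices[0].message.content
        conversation.append({"role": "assistant", "content": reply})
    return conversation
```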

Part of the dataset development was aimed at unbiased, natural, casual prose and great conversational abilities, while also having strong logical and analytical prowess and robustness in back-and-forth conversation. V1.9 further improves on this by putting more emphasis on realistic prose, identifying and removing dataset examples that were shown to hurt certain reasoning capabilities, and identifying biases that hurt problem-solving abilities as well.

There were also found to be instances of the model being biased towards a more robotic identity through the training data, and even certain biases regarding self-identity, like preconceived notions a model could have about being physical versus metaphysical, or about what knowledge was held within Capybara's sense of self, etc. Identifying and fixing these biases within the distribution for V1.9 seemed to give significant improvements overall in how well the model works with little to no instruction and no system prompt, and it also seems to significantly improve the model's steerability and how well it can follow more complex and difficult system prompts.

Although I didn't intend to optimize this model for roleplay specifically, I was very surprised to see people messaging me about how Capybara V1 was one of their favorite models for roleplay, and based on some early testers it seems that Capybara V1.9 is a further significant jump not just in logical and analytical capabilities, but also in coherency and casual, steerable prose for roleplay, with several telling me it's now their new favorite model for such use cases.

I'm excited to finally have this released, and I hope to get feedback from any of you who might be interested in trying it out! Here is the quantized version of 7B V1.9 by TheBloke: https://huggingface.co/TheBloke/Nous-Capybara-7B-v1.9-GGUF

And here is the quantized version of 3B: https://huggingface.co/TheBloke/Nous-Capybara-3B-v1.9-GPTQ
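If you want to try the GGUF quant locally, a minimal usage sketch with llama-cpp-python might look like the following (the file path and the prompt format shown here are placeholders; check the model card for the recommended prompt format):

```python
# Minimal sketch: load a downloaded GGUF with llama-cpp-python and run one prompt.
# The model path and prompt format below are placeholders, not official recommendations.
from llama_cpp import Llama

llm = Llama(model_path="./nous-capybara-7b-v1.9.Q5_K_M.gguf", n_ctx=4096)

output = llm(
    "USER: Summarize the idea behind mixture-of-experts models.\nASSISTANT:",
    max_tokens=256,
    stop=["USER:"],
)
print(output["choices"][0]["text"])
```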

 

Hey everyone, happy to say I'm officially announcing Obsidian V0.5 as part of my work at Nous Research, building upon the Capybara V1.9 dataset I created.

This model is blazing fast and is likely the first multi-modal model efficient enough to fit within the RAM constraints of even a non-pro iPhone, at practical speeds as well!

In its current state, this model is largely a multi-modal version of Nous-Capybara-3B, which I also only recently released. I've designed the dataset with novel synthesis methods (paper currently in progress); it's made to be robust in conversational abilities and even includes multi-turn data synthesized as continuations of single-turn examples from datasets like Airoboros, Know_logic, EverythingLM, and more.

It's built using LLaVA 1.5 techniques, but instead of a 7B Llama as the base, we chose to use the new StableLM 3B model trained on 4 trillion tokens. (We likely plan to train on Mistral as well.)
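For anyone unfamiliar with the LLaVA 1.5 recipe, here's a rough sketch of the wiring (module names, dimensions, and the HF-style inputs_embeds call are illustrative, not Obsidian's actual code): a vision encoder produces patch features, a small MLP projects them into the language model's embedding space, and those projected "visual tokens" are fed to the LM alongside the text embeddings.

```python
# Illustrative LLaVA-1.5-style wiring: vision encoder -> 2-layer MLP projector -> LM.
# Not the actual Obsidian code; 2560 is StableLM-3B's hidden size.
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, lm_dim=2560):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a CLIP ViT, typically kept frozen
        self.projector = nn.Sequential(           # LLaVA 1.5 uses an MLP rather than a single linear layer
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.language_model = language_model      # HF-style causal LM that accepts inputs_embeds

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)   # (B, n_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)       # (B, n_patches, lm_dim)
        # Prepend the projected image patches as "visual tokens" ahead of the text embeddings.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```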

Any questions or feedback are much appreciated!

Download here: https://huggingface.co/NousResearch/Obsidian-3B-V0.5

Or download the quantized version here, courtesy of Nisten: https://huggingface.co/nisten/obsidian-3b-multimodal-q6-gguf

[–] dogesator@alien.top 1 points 10 months ago (1 children)

Important to keep in mind that Ilya Sutskever and Andrej Karpathy were nowhere near as popular when they first joined OpenAI as they are now. There are a lot of hidden skills and talent within the team at Imbue whom we might not end up considering to be among the "greats" until years from now.
