vatsadev


Found out about air_llm (https://github.com/lyogavin/Anima/tree/main/air_llm), which loads one layer at a time, so each layer comes to about 1.6GB for a 70B model with 80 layers. There's about 30MB for the KV cache, and I'm not sure where the rest goes.

Apparently it works with HF out of the box too. The weaknesses appear to be context length and speed (it's going to be slow), but anyway, anyone want to try Goliath 120B unquantized?
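For anyone curious what "loads one layer at a time" means in practice, here's a rough PyTorch-flavored sketch of the idea; this is not the actual air_llm API, and the per-layer checkpoint files are made up for illustration:

```python
import torch

# Hypothetical layout: one checkpoint per decoder layer (layer_00.pt ... layer_79.pt),
# each holding a single ~1.6GB transformer block from the 70B model.
NUM_LAYERS = 80
hidden = torch.randn(1, 128, 8192, device="cuda")  # toy hidden states

with torch.no_grad():
    for i in range(NUM_LAYERS):
        layer = torch.load(f"layer_{i:02d}.pt", map_location="cuda")  # only this layer is in VRAM
        hidden = layer(hidden)
        del layer
        torch.cuda.empty_cache()  # release the layer before loading the next one
```

The trade-off follows directly from this loop: VRAM stays near one layer's size, but you pay a disk-to-GPU transfer for every layer on every pass, which is why it's slow.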

[–] vatsadev@alien.top 1 points 11 months ago (2 children)

No, it's Victorian-era Frankenstein, obvs.

[–] vatsadev@alien.top 1 points 11 months ago (1 children)

Hmm, I'll have to check this with the people on the RWKV Discord server.

V5 is stable in how it uses context, and V6 is trying to get better at actually using the context, so we might see improvement on this.

[–] vatsadev@alien.top 1 points 11 months ago (1 children)

Um, the dataset is open source; it's all public HF datasets.

[–] vatsadev@alien.top 1 points 11 months ago

That's the point of RWKV: you could have a 10M context length and it would cost the same per token as a 100-token context.
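Rough intuition in code (nothing like the real RWKV kernels): a recurrent model carries a fixed-size state, so the cost per token doesn't depend on how long the context is, while a transformer's KV cache keeps growing.

```python
import torch

d = 1024
state = torch.zeros(d)       # RWKV-style: fixed-size state, same at token 100 or token 10M
kv_cache = []                # transformer-style: one entry per token, grows forever

for t in range(1000):        # imagine this running for 10M tokens
    x = torch.randn(d)
    state = 0.9 * state + x  # toy recurrence; memory use stays constant
    kv_cache.append(x)       # this is the part that blows up for a vanilla transformer
```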

[–] vatsadev@alien.top 1 points 11 months ago (1 children)

It's trained on 100+ languages; the focus is multilingual.

 

So RWKV 7B v5 is 60% trained now. I saw that the multilingual parts are better than Mistral now, and the English capabilities are close to Mistral, except for HellaSwag and ARC, where it's a little behind. All the benchmarks are on the RWKV Discord, and you can Google the pros/cons of RWKV, though most of them are about v4.

Thoughts?

[–] vatsadev@alien.top 1 points 11 months ago (1 children)

Also, AWQ has entire engines built around efficiency; look into the Aphrodite engine, supposedly the fastest for AWQ.

[–] vatsadev@alien.top 1 points 11 months ago

"Do I need to learn llama.cpp or C++ to deploy models using llama-cpp-python library?" No its pure python

[–] vatsadev@alien.top 1 points 11 months ago (6 children)

OpenHermes 2.5 is amazing from what I've seen. It can call functions, summarize text, and is extremely competitive; the whole works.

[–] vatsadev@alien.top 1 points 11 months ago

RWKV v5 7B. It's only half trained right now, but the model already surpasses Mistral on all multilingual benchmarks, because it's meant to be multilingual.

[–] vatsadev@alien.top 1 points 11 months ago (1 children)

OpenHermes 2.5 is the latest version, but the OpenHermes series has a history of being good, and I used it for some function calling; it's really good.

 

So there's Detect Pretrain Data (https://swj0419.github.io/detect-pretrain.github.io/), where you can test whether a model was pretrained on a given text. Why don't we just test all the models going onto the leaderboard and reject the ones detected as having trained on the benchmark data? It would end the "train on test" issue.
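If I'm reading that project right, the core test is a Min-K% Prob style score: take the average log-probability of a passage's least-likely tokens, and a suspiciously high score suggests the text was in the training data. A rough sketch with HF transformers; the model name, k, and any threshold are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; use the model you actually want to audit
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def min_k_prob(text: str, k: float = 0.2) -> float:
    """Average log-prob of the k% least-likely tokens under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)             # predictions for tokens 1..n
    token_lp = logprobs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n = max(1, int(len(token_lp) * k))
    return torch.topk(token_lp, n, largest=False).values.mean().item()

# Benchmark rows scoring above some calibrated threshold would get flagged.
print(min_k_prob("The quick brown fox jumps over the lazy dog."))
```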

 

So I was looking at some of the things people are asking for in Llama 3, kind of judging whether they make sense or are feasible.

Mixture of Experts - Why? This is practically useless to us. MoE helps with FLOPs constraints, but it takes up more VRAM than a dense model. OpenAI makes it work; it isn't naturally superior or better by default.
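To make the FLOPs-vs-VRAM point concrete, here's a toy top-k MoE feed-forward layer (a generic sketch, not any particular lab's implementation): all the experts' weights sit in memory, but each token only runs through k of them.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy mixture-of-experts FFN: n_experts sets of weights in VRAM,
    but only k experts' worth of compute per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                               # x: [tokens, d_model]
        weights = self.router(x).softmax(dim=-1)        # routing scores per expert
        topw, topi = weights.topk(self.k, dim=-1)       # each token picks k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```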

Synthetic Data - That's useful, though it's going to be mixed with real data for model robustness. The real issue I see here is collecting that many tokens: if they ripped anything near 10T tokens from OpenAI, they'd be found out pretty quickly. I could see them splitting the workload over multiple accounts, also using Claude, calling multiple models (GPT-4, gpt-4-turbo), ripping data off third-party services, plus all the other data they've managed to collect.

More smaller models - A 1B and a 3B would be nice. TinyLlama 1.1B is really capable for its size, and better models at the 1B and 3B scale would be really useful for web and mobile inference.

More multilingual data - This is totally necessary. I've seen RWKV World v5, and it's trained on a lot of multilingual data; its 7B model is only half trained and it already passes Mistral 7B on multilingual benchmarks. And they're just using regular datasets like SlimPajama; they haven't even prepped the next dataset with dedicated multilingual corpora like CulturaX and MADLAD.

Multimodality - This would be really useful, and probably a necessity if they want Llama 3 to "match GPT-4". The LLaVA work proved you can make image-to-text work on top of Llama, and the Fuyu architecture has simplified things further, since you can just stuff modality embeddings into a regular decoder and train it the same way. It would be nice to have multiple input modalities; Meta already has experience there with ImageBind and AnyMAL. It would be better than GPT-4 if it were multimodal in -> multimodal out.

GQA, sliding windows - Useful, the "+1%" kind of architecture changes; Meta might add them if they feel like it.

Massive ctx len - If they use RWKV, they can go to whatever context length they can scale to, but they might do it for a regular transformer too; look at Magic.dev's LTM-1 (not that messed-up MAGIC paper): https://magic.dev/blog/ltm-1. That model has a context length of 5,000,000.

Multi-epoch training, de Vries scaling laws - StableLM 3B 4E1T is still the best 3B base out there, and no other 3B base has caught up to it so far. Most people attribute that to the de Vries scaling law (exponential data and compute); Meta might have really powerful models if they followed the pattern.

Function calling / tool usage - If the models shipped with the ability to use some tools, and we instruction-tuned them so they can call any function through in-context learning, that could be really OP.
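A hand-wavy sketch of what that could look like from the user side: the prompt describes the available tools, the model replies with a structured call, and plain Python dispatches it. The tool schema, prompt format, and parser here are all invented for illustration.

```python
import json

TOOLS = {"get_weather": lambda city: f"Sunny in {city}"}  # toy tool registry

SYSTEM_PROMPT = (
    "You can call a tool by replying with JSON, e.g. "
    '{"tool": "get_weather", "args": {"city": "Paris"}}'
)

def dispatch(model_reply: str) -> str:
    """Parse the model's JSON tool call and run the matching Python function."""
    call = json.loads(model_reply)
    return TOOLS[call["tool"]](**call["args"])

# Pretend this string came back from the instruction-tuned model:
print(dispatch('{"tool": "get_weather", "args": {"city": "Paris"}}'))
```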

Different architecture - RWKV is a good one to try, but if Meta has something better, they may shift away from transformers to something else.

 

As I was asking above, I've been looking at the Fuyu 8B model, and I've been able to break it down to:

  • the model takes in text the regular way: text -> tokens -> embeddings
  • it also takes image -> embeddings
  • it has a vanilla decoder, so only text comes out; they add special tokens around images, so I'm assuming the decoder ignores output images

So, from what I know, nn.Linear takes in a tensor and makes embeddings of whatever size you choose. I'm not really sure about everything else, though.

  • Since the linear layer just makes embeddings, does something like this even need training for the image encoder?
  • nn.Linear takes tensors as input, and they split an image into patches, so I'm assuming those patches are made into tensors. How do you turn an image into a tensor? A code snippet of image -> embedding -> image would be nice if possible (see the sketch after this list).
  • While Fuyu does not output images, wouldn't the model's hidden states be making image or image-like embeddings? Could you generate images if you had an image decoder?
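On the image -> tensor -> embedding question, here's a rough sketch of the patch-embedding step as I understand Fuyu-style models; the patch size and hidden dim below are made up for illustration. Note the projection layer does have learnable weights, so it still gets trained jointly with the rest of the model, even though it's much simpler than a full vision encoder.

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision.transforms.functional import to_tensor

patch, d_model = 30, 4096   # assumed patch size and hidden size, illustrative only

img = to_tensor(Image.open("example.png").convert("RGB"))    # image -> [3, H, W] float tensor in [0, 1]
c, h, w = img.shape
img = img[:, : h - h % patch, : w - w % patch]               # crop so H and W divide evenly into patches

# [3, H, W] -> [num_patches, 3 * patch * patch]: each row is one flattened patch
patches = (
    img.unfold(1, patch, patch)     # cut along height
       .unfold(2, patch, patch)     # cut along width
       .permute(1, 2, 0, 3, 4)
       .reshape(-1, c * patch * patch)
)

proj = nn.Linear(c * patch * patch, d_model)   # the "image encoder" is just this learned projection
image_embeddings = proj(patches)               # [num_patches, d_model], interleaved with the text embeddings
print(image_embeddings.shape)
```

Going the other way (embedding -> image) would need a separate image decoder that Fuyu doesn't ship, which, as far as I can tell, is why the hidden states alone won't give you images back.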
 

I've been looking at Fuyu for the past couple of days now, and it's incredible. It's got OCR, can read graphs, and gives bounding boxes. How is no one using this? I get that it might not be in a UI yet, but it's available through all of HF's libraries, and it has a Gradio demo. While I haven't tested the last claim, it supposedly matches LLaMA while being 8B instead of 13B. Thoughts?
