I'm a complete noob to LLMs. What is the "b" in a 33b model? And what would be the best place to start learning about building my own local models?
It isn't practical for most people to train their own models from scratch; that requires industrial hardware. The "b" stands for billions of parameters, which indicates the model's size and rough capability. Right now, the Yi-34b models are the best at that size.
I recommend a Mistral 7b model as your introduction to LLMs. They are small but fairly smart for their size. Get your model from Hugging Face; something like a Dolphin fine-tune of Mistral should do fine.
I recommend KoboldCPP for running a model, as it is very simple to use. It uses the GGUF format, which lets you split a model across GPU, RAM, and CPU. Other formats are GPU-only, offering greater speed but less flexibility.
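If you'd rather script it than use the KoboldCPP UI, here's a rough sketch with huggingface_hub + llama-cpp-python. The repo id, filename, and layer count below are just examples, not a specific recommendation; grab whichever GGUF quant you actually want.

```python
# Rough sketch: download a GGUF quant from Hugging Face and run it with
# llama-cpp-python, splitting layers between GPU and CPU/RAM.
# The repo_id/filename are examples, not a specific recommendation.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="TheBloke/dolphin-2.1-mistral-7B-GGUF",   # example repo
    filename="dolphin-2.1-mistral-7b.Q4_K_M.gguf",    # example 4-bit quant
)

llm = Llama(
    model_path=model_path,
    n_gpu_layers=20,   # layers offloaded to the GPU; the rest stay in RAM on the CPU
    n_ctx=4096,        # context window
)

out = llm("Q: What does the 'b' in 33b mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```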
8-bit? 4-bit QLoRA? You can train 34B models on 24GB. You might need to set up DeepSpeed if you want to use both GPUs, or just train on one 24GB card. PSA if you are using axolotl: you have to disable sample packing to enable flash attention 2; otherwise flash attention simply will not be enabled. This can spare you some memory. I can train a Yi-34B QLoRA with rank 16 and ctx 1100 (and maybe a bit more) on a 24GB Ampere card.
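For anyone who'd rather see the idea as code than as an axolotl config, here's a rough sketch of the 4-bit base load with transformers + bitsandbytes and flash attention 2. The model name and flags are just what I'd try, not a tested recipe.

```python
# Rough sketch: load Yi-34B in 4-bit (QLoRA-style base) with flash attention 2.
# Needs a recent transformers plus bitsandbytes and flash-attn installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B",                          # example base model
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # needs flash-attn installed
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B")
```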
Start with LoRA rank=1, 4-bit, flash-attention-2, context 256, and batch size 1, then increase until you reach your maximum. QLoRA on a 33b definitely works on just 24GB; it worked a few months ago.
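Roughly what that "start small, then grow" setup looks like with peft on top of a 4-bit load like the one above. The values are hypothetical starting points, not a tuned recipe; bump them one at a time until you hit OOM.

```python
# Rough sketch of the incremental approach with peft.
# Assumes `model` is a 4-bit base model loaded as in the snippet above.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=1,                    # start at rank 1, raise it once everything fits
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Train with context 256 and batch size 1 first, then raise
# rank / context / batch one at a time until you hit your VRAM limit.
max_seq_length = 256
per_device_train_batch_size = 1
```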
I have some issues with flash attention, but with 48GB I can go up to rank 512 with batch size 1 and max len 768. My last run was max len 1024, batch size 2, gradient accumulation 32, rank 128, and it gives pretty nice results.
Thanks for sharing!
Should work on a single 24GB GPU as well, with either QLoRA or alpaca_lora_4bit. You won't get big batches or a big context, but it's good enough.
Thanks! I have some problems loading GPTQ models with the Transformers loader.
I think you may need to shard the optimizer state and gradients. I've been using DeepSpeed and have had some good success. Here is a writeup that compares the different DeepSpeed stages: [RWKV-infctx] DeepSpeed 2 / 3 comparisons | RWKV-InfCtx-Validation – Weights & Biases (wandb.ai). Look at the bottom of the article for an accessible overview. I'm not the author, and I haven't validated the findings. I think distributed tools are becoming more and more necessary. I suppose the alternative is quantization, but that may risk quality loss. Here is a discussion on that: https://www.reddit.com/r/LocalLLaMA/comments/153lfc2/quantization_how_much_quality_is_lost/
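For reference, sharding the optimizer state and gradients is ZeRO stage 2 (stage 3 also shards the parameters). A rough sketch of what that looks like as a DeepSpeed config handed to the Hugging Face Trainer; the values are placeholders, not a tuned recipe.

```python
# Rough sketch: ZeRO stage 2 config (optimizer state + gradients sharded
# across GPUs, optimizer optionally offloaded to CPU), passed to the HF Trainer.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,                                    # 2 = shard optimizer state + gradients
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    bf16=True,
    deepspeed=ds_config,    # the Trainer accepts a dict or a path to a JSON file
)
```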
Thank you! It looks quite deep to me; I will look into it.