this post was submitted on 18 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Currently I have 12+24GB of VRAM across two GPUs, and I get Out Of Memory errors all the time when I try to fine-tune 33B models. 13B is fine, but the outcome is not very good, so I would like to try 33B. I wonder if it's worth it to replace my 12GB GPU with a 24GB one. Thanks!

[–] kevdawg464@alien.top 1 points 10 months ago (1 children)

I'm a complete noob to LLMs. What is the "b" in a 33b model? And what would be the best place to start learning about building my own local models?

[–] Sabin_Stargem@alien.top 1 points 10 months ago

It isn't practical for most people to make their own models; that requires industrial hardware. The "b" stands for billions of parameters, which indicates the size and potential intelligence of the model. Right now, the Yi-34b models are the best at that size.

I recommend using a Mistral 7b as your introduction to LLMs. They are small but fairly smart for their size. Get your model from Hugging Face; something like Mistral Dolphin should do fine.

I recommend KoboldCPP for running a model, as it is very simple to use. It runs GGUF-format models, which lets you split the work across your GPU, RAM, and CPU. Other formats are GPU-only, offering greater speed but less flexibility.
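KoboldCPP itself is launched as an app, but the same GGUF layer-offload idea can be sketched in code with llama-cpp-python. This is only an illustrative sketch; the model file name and the number of offloaded layers are placeholders you would adjust to your own hardware:

```python
# pip install llama-cpp-python (built with CUDA support to enable GPU offload)
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=35,   # layers kept in VRAM; the rest stay in system RAM (-1 = all layers)
    n_ctx=4096,        # context window
)

out = llm("Explain what the 'b' in 7B means.", max_tokens=128)
print(out["choices"][0]["text"])
```

The split between `n_gpu_layers` and the layers left on the CPU is what gives GGUF its flexibility compared with GPU-only formats.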

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago

8-bit? 4-bit QLoRA? You can train 34B models on 24GB. You might need to set up DeepSpeed if you want to use both cards, or just train on the 24GB card. PSA if you are using axolotl: disabling sample packing is required to enable flash attention 2; otherwise flash attention will simply not be enabled. This can save you some memory. I can train a Yi-34B QLoRA with rank 16 and ctx 1100 (and maybe a bit more) on a 24GB Ampere card.
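As a rough sketch of what a 4-bit QLoRA setup like that looks like with Hugging Face transformers + peft + bitsandbytes (the model id, target modules, and rank here are assumptions for illustration, not the commenter's exact recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # QLoRA: base weights quantized to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B",                           # example ~34B base model
    quantization_config=bnb_config,
    device_map={"": 0},                       # keep everything on the single 24GB card
    attn_implementation="flash_attention_2",  # needs flash-attn and a recent transformers
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                     # rank 16, as in the comment above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()            # only the LoRA adapters are trainable
```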

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago

Start with LoRA rank=1, 4-bit, flash-attention-2, context 256, batch size=1, and increase from there until you reach your maximum. QLoRA on a 33B definitely works on just 24GB; it worked back a few months ago.
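A conservative set of training knobs to start from, in the same transformers style as the sketch above (all values are just an assumed starting point to raise one at a time until you hit OOM):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,    # start at 1
    gradient_accumulation_steps=16,   # simulates a larger batch without extra VRAM
    gradient_checkpointing=True,      # trades compute for a big memory saving
    bf16=True,
    learning_rate=2e-4,
    num_train_epochs=1,
    optim="paged_adamw_8bit",         # bitsandbytes paged optimizer, common with QLoRA
    logging_steps=10,
)
```

Raising the context length, LoRA rank, or batch size one at a time makes it easy to see exactly which knob triggers the OOM.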

[–] kpodkanowicz@alien.top 1 points 10 months ago (1 children)

I have some issues with flash attention, and with 48GB I can go up to rank 512 with batch size 1 and max len 768. My last run was max len 1024, batch 2, gradient accumulation 32, rank 128, and it gives pretty nice results.

[–] tgredditfc@alien.top 1 points 10 months ago

Thanks for sharing!

[–] a_beautiful_rhind@alien.top 1 points 10 months ago (1 children)

It should work on a single 24GB GPU as well, with either QLoRA or alpaca_lora_4bit. You won't get big batches or big context, but it's good enough.

[–] tgredditfc@alien.top 1 points 10 months ago

Thanks! I have some problems loading GPTQ models with the Transformers loader.

[–] Updittyupup@alien.top 1 points 10 months ago (1 children)

I think you may need to shard the optimizer state and gradients. I've been using DeepSpeed and have had some good success. Here is a write-up that compares the different DeepSpeed iterations: [RWKV-infctx] DeepSpeed 2 / 3 comparisons | RWKV-InfCtx-Validation – Weights & Biases (wandb.ai). Look at the bottom of the article for an accessible overview. I'm not the author, and I haven't validated the findings. I think distributed tools are getting more and more necessary. The other option is quantization, but that may risk quality loss. Here is a discussion on that: https://www.reddit.com/r/LocalLLaMA/comments/153lfc2/quantization_how_much_quality_is_lost/
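For the "shard optimizer state and gradients" part, a minimal sketch of a ZeRO stage 2 setup through the Hugging Face Trainer integration looks roughly like this (the config values are illustrative assumptions, not the linked write-up's exact settings):

```python
from transformers import TrainingArguments

# ZeRO stage 2 shards optimizer state and gradients across GPUs;
# stage 3 would additionally shard the model parameters themselves.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # optional: push optimizer state to system RAM
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)
```

The script then typically gets launched with the `deepspeed` or `accelerate` launcher so that both GPUs join the same process group.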

[–] tgredditfc@alien.top 1 points 10 months ago

Thank you! It looks very deep to me; I will look into it.