this post was submitted on 18 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Currently I have 12+24GB of VRAM across two GPUs, and I get Out Of Memory errors all the time when I try to fine-tune 33B models. 13B is fine, but the outcome is not very good, so I would like to try 33B. I wonder if it's worth it to replace my 12GB GPU with a 24GB one. Thanks!

[–] kevdawg464@alien.top 1 points 10 months ago (1 children)

I'm a complete noob to LLMs. What is the "b" in a 33b model? And what would be the best place to start learning about building my own local models?

[–] Sabin_Stargem@alien.top 1 points 10 months ago

It isn't practical for most people to make their own models; that requires industrial hardware. The "b" stands for billions of parameters, which indicates the size and potential intelligence of the model. Right now, the Yi-34b models are the best at that size.

I recommend using a Mistral 7b as your introduction to LLMs. They are small but fairly smart for their size. Get your model from Hugging Face; something like Mistral Dolphin should do fine.

I recommend KoboldCPP for running a model, as it is very simple to use. It runs GGUF-format models, which lets you split the work across your GPU, RAM, and CPU. Other formats are GPU-only, offering greater speed but less flexibility.
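KoboldCPP itself is launched as an app, but the same GGUF layer-offload idea can be sketched in code with llama-cpp-python. This is only an illustrative sketch; the model file name and the number of offloaded layers are placeholders you would adjust to your own hardware:

```python
# pip install llama-cpp-python (built with CUDA support to enable GPU offload)
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=35,   # layers kept in VRAM; the rest stay in system RAM (-1 = all layers)
    n_ctx=4096,        # context window
)

out = llm("Explain what the 'b' in 7B means.", max_tokens=128)
print(out["choices"][0]["text"])
```

The split between `n_gpu_layers` and the layers left on the CPU is what gives GGUF its flexibility compared with GPU-only formats.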

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago

8-bit? 4-bit QLoRA? You can train 34B models on 24GB. You might need to set up DeepSpeed if you want to use both cards, or just train on the 24GB card. PSA if you are using axolotl: disabling sample packing is required to enable flash attention 2; otherwise flash attention will simply not be enabled. This can save you some memory. I can train a Yi-34B QLoRA with rank 16 and ctx 1100 (and maybe a bit more) on a 24GB Ampere card.
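As a rough sketch of what a 4-bit QLoRA setup like that looks like with Hugging Face transformers + peft + bitsandbytes (the model id, target modules, and rank here are assumptions for illustration, not the commenter's exact recipe):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # QLoRA: base weights quantized to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B",                           # example ~34B base model
    quantization_config=bnb_config,
    device_map={"": 0},                       # keep everything on the single 24GB card
    attn_implementation="flash_attention_2",  # needs flash-attn and a recent transformers
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                     # rank 16, as in the comment above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()            # only the LoRA adapters are trainable
```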

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago

Start with LoRA rank=1, 4-bit, flash-attention-2, context 256, batch size=1, and increase from there until you reach your maximum. QLoRA on a 33B definitely works on just 24GB; it worked back a few months ago.
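A conservative set of training knobs to start from, in the same transformers style as the sketch above (all values are just an assumed starting point to raise one at a time until you hit OOM):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,    # start at 1
    gradient_accumulation_steps=16,   # simulates a larger batch without extra VRAM
    gradient_checkpointing=True,      # trades compute for a big memory saving
    bf16=True,
    learning_rate=2e-4,
    num_train_epochs=1,
    optim="paged_adamw_8bit",         # bitsandbytes paged optimizer, common with QLoRA
    logging_steps=10,
)
```

Raising the context length, LoRA rank, or batch size one at a time makes it easy to see exactly which knob triggers the OOM.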

[–] kpodkanowicz@alien.top 1 points 10 months ago (1 children)

I have some issues with flash attention, and with 48GB I can go up to rank 512 with batch size 1 and max len 768. My last run was max len 1024, batch 2, gradient accumulation 32, rank 128, and it gives pretty nice results.

[–] tgredditfc@alien.top 1 points 10 months ago

Thanks for sharing!

[–] a_beautiful_rhind@alien.top 1 points 10 months ago (1 children)

It should work on a single 24GB GPU as well, with either QLoRA or alpaca_lora_4bit. You won't get big batches or big context, but it's good enough.

[–] tgredditfc@alien.top 1 points 10 months ago

Thanks! I have some problems loading GPTQ models with the Transformers loader.

[–] Updittyupup@alien.top 1 points 10 months ago (1 children)

I think you may need to shard the optimizer state and gradients. I've been using DeepSpeed and have had some good success. Here is a write-up that compares the different DeepSpeed iterations: [RWKV-infctx] DeepSpeed 2 / 3 comparisons | RWKV-InfCtx-Validation – Weights & Biases (wandb.ai). Look at the bottom of the article for an accessible overview. I'm not the author, and I haven't validated the findings. I think distributed tools are getting more and more necessary. The other option is quantization, but that may risk quality loss. Here is a discussion on that: https://www.reddit.com/r/LocalLLaMA/comments/153lfc2/quantization_how_much_quality_is_lost/
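For the "shard optimizer state and gradients" part, a minimal sketch of a ZeRO stage 2 setup through the Hugging Face Trainer integration looks roughly like this (the config values are illustrative assumptions, not the linked write-up's exact settings):

```python
from transformers import TrainingArguments

# ZeRO stage 2 shards optimizer state and gradients across GPUs;
# stage 3 would additionally shard the model parameters themselves.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # optional: push optimizer state to system RAM
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)
```

The script then typically gets launched with the `deepspeed` or `accelerate` launcher so that both GPUs join the same process group.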

[–] tgredditfc@alien.top 1 points 10 months ago

Thank you! It looks very deep to me; I will look into it.