this post was submitted on 28 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I want to fine-tune some LLM models on my own dataset, which contains very long examples (a little over 2048 tokens). VRAM usage jumps up by several GB just from increasing the Cutoff Length from 512 to 1024.

Is there a way to feed those long examples into the models without increasing VRAM usage significantly?

top 2 comments
[–] andrewlapp@alien.top 1 points 11 months ago

VRAM usage scales quadratically with sequence length, and I'm not aware of any way around that. Even efficient long-context fine-tuning methods such as LongLoRA only improve speed and quality; they leave memory usage the same as plain LoRA.

I recommend making sure you're reducing memory in other ways (a rough sketch follows the list):

  1. Ensure you're using 4-bit QLoRA.

  2. Ensure the batch size is 1.

  3. Ensure you're using FlashAttention-2.

  4. Keep the optimizer state in CPU memory by using a paged optimizer.

  5. Use gradient checkpointing.
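
Not the original poster's exact setup, but roughly what those five points look like together with Hugging Face transformers, peft, and bitsandbytes. The model name, LoRA rank, and accumulation steps are placeholders, and it assumes a recent transformers release plus the flash-attn package:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; use whatever base model you're tuning

# (1) 4-bit QLoRA quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# (3) FlashAttention-2 kernels (needs the flash-attn package and a supported GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

# (5) gradient checkpointing, plus the usual prep for training a quantized model
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# (2) batch size 1, and (4) a paged 8-bit AdamW so optimizer state can be paged off the GPU
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # recover an effective batch size without the VRAM cost
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    bf16=True,
)
# Pass `args`, the model, and your tokenized dataset to Trainer (or trl's SFTTrainer) as usual.
```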

You could also do something more experimental, like employing Mistral with a sliding attention window of 1024 tokens so the model covers 2048 tokens of context while only paying the attention cost of 1024 tokens.
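
A minimal sketch of that idea using transformers' Mistral support; the model name is a placeholder and the quality impact of shrinking the window is untested:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder

config = AutoConfig.from_pretrained(model_id)
config.sliding_window = 1024  # each token attends only to the previous 1024 tokens (default 4096)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # sliding-window masking is supported on the FA2 path
)
```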

Or you could just summarize or prune your long examples.

[–] wind_dude@alien.top 1 points 11 months ago

You can try changing the attention implementation to something like FlashAttention.
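
For instance, with a recent Hugging Face transformers release and the flash-attn package installed (the model name is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder
    torch_dtype=torch.bfloat16,   # FlashAttention-2 requires fp16 or bf16
    attn_implementation="flash_attention_2",
)
```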