
I'm running a simple finetune of the llama-2-7b-hf model on the Guanaco dataset. A test run with a batch size of 2 and max_steps=10 using the Hugging Face TRL library (SFTTrainer) takes a little over 3 minutes on the Colab free tier, but the same script takes over 14 minutes on my RTX 4080 locally. I'm running it under WSL with full CUDA support, and the GPU sits at 100% utilization during training, so I don't think that's the problem.
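
To rule out a driver or passthrough issue, a quick check like this (just standard torch calls, nothing specific to my setup) confirms the card is actually visible from inside WSL:

```python
import torch

# Sanity-check that torch sees the GPU from inside WSL
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # should name the RTX 4080
print(torch.version.cuda)             # CUDA version torch was built against
```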

Is there anything I'm missing or overlooking? The script itself is pretty simple and straightforward, but for reference, I'm using this version. The code uses bitsandbytes with 4-bit loading, NF4 quantization, and float16 compute, all standard stuff.

https://github.com/Vasanthengineer4949/NLP-Projects-NHV/blob/main/LLMs%20Related/Finetune%20Llama2%20using%20QLoRA/Finetune_LLamA.ipynb
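
For anyone who doesn't want to open the notebook, the setup is roughly the following sketch. The dataset name, LoRA hyperparameters, and max_seq_length here are my approximations of what that style of QLoRA tutorial uses, not copied verbatim from the notebook:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# 4-bit NF4 quantization with float16 compute, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map={"": 0},  # put the whole model on GPU 0
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Guanaco dataset; this particular 1k subset is an assumption on my part
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

# LoRA adapter config; r/alpha/dropout values are placeholders
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    max_steps=10,
    fp16=True,
    logging_steps=1,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
)
trainer.train()
```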

Any help is appreciated.
