I'm running a simple finetune of the llama-2-7b-hf model on the guanaco dataset. A test run with a batch size of 2 and max_steps of 10 using the Hugging Face trl library (SFTTrainer) takes a little over 3 minutes on Colab Free. But the same script takes over 14 minutes on an RTX 4080 locally, running under WSL with full CUDA support. The GPU sits at 100% utilization during training, so I don't think the card is idling.
Is there anything I'm missing or overlooking? The script itself is pretty simple and straightforward; it uses bitsandbytes with 4-bit loading, NF4 quantization, and float16 compute, all standard stuff. For reference, this is the version I'm using:
https://github.com/Vasanthengineer4949/NLP-Projects-NHV/blob/main/LLMs%20Related/Finetune%20Llama2%20using%20QLoRA/Finetune_LLamA.ipynb
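To save a click, the core of the script looks roughly like this (paraphrased from the notebook; the exact LoRA hyperparameters and the guanaco dataset variant are whatever the link uses, so treat the specific values below as approximate):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"

# 4-bit NF4 quantization with float16 compute, via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},  # keep everything on the single GPU
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapter config (values approximate; see the linked notebook)
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# guanaco in llama2 chat format (I believe this is the dataset
# the notebook pulls, but I may be misremembering the exact name)
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    max_steps=10,
    fp16=True,
    logging_steps=1,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()
```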
Any help is appreciated.