I'm running a simple finetune of the llama-2-7b-hf model on the guanaco dataset. A test run with a batch size of 2 and max_steps of 10 using the Hugging Face trl library (SFTTrainer) takes a little over 3 minutes on Colab Free. But the same script takes over 14 minutes on an RTX 4080 locally, running under WSL with full CUDA support. The GPU sits at 100% utilization during training, so I don't think the card is idling.
Is there anything I'm missing or overlooking? The script itself is pretty simple and straightforward; it uses bitsandbytes with 4-bit loading, NF4 quantization, and float16 compute, all standard stuff. For reference, this is the version I'm using:
https://github.com/Vasanthengineer4949/NLP-Projects-NHV/blob/main/LLMs%20Related/Finetune%20Llama2%20using%20QLoRA/Finetune_LLamA.ipynb
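To save a click, the core of the script looks roughly like this (paraphrased from the notebook; the exact LoRA hyperparameters and the guanaco dataset variant are whatever the link uses, so treat the specific values below as approximate):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"

# 4-bit NF4 quantization with float16 compute, via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},  # keep everything on the single GPU
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapter config (values approximate; see the linked notebook)
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# guanaco in llama2 chat format (I believe this is the dataset
# the notebook pulls, but I may be misremembering the exact name)
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    max_steps=10,
    fp16=True,
    logging_steps=1,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()
```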
Any help is appreciated.