LocalLLaMA · submitted 09 Nov 2023
Hello everyone, I'm currently trying to set up a small Llama 2 7B chat model. The unquantized full-precision version runs, but only very slowly, in PyTorch with CUDA. I have an RTX 3060 laptop with 16 GB of RAM, and the model takes about 5-8 minutes to reply to the example prompt below:

I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?
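For reference, an unquantized setup like the one described looks roughly like this minimal transformers sketch (the model ID, dtype, and generation settings here are placeholders, not confirmed details):

```python
# Minimal sketch of an unquantized Llama 2 7B chat setup in PyTorch via
# transformers. Model ID and generation settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# fp16 weights are ~13 GB; a laptop RTX 3060 has 6 GB of VRAM, so
# device_map="auto" spills most layers to system RAM, which is slow.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = ('I liked "Breaking Bad" and "Band of Brothers". '
          "Do you have any recommendations of other shows I might like?")
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```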

Using kobold.cpp with llama-2-7b-chat.Q5_K_M.gguf, by contrast, the same prompt takes literally seconds. But I've found no way to load those quantized models in PyTorch on Windows, where AutoGPTQ doesn't work. Also: is PyTorch just a lot slower than kobold.cpp?
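To frame the question: one common way to drive the same GGUF file from Python is llama-cpp-python, which wraps llama.cpp, the backend kobold.cpp builds on. It isn't PyTorch, but it is scriptable; a minimal sketch, assuming the GGUF file sits in the working directory (path and settings below are placeholders):

```python
# Hedged sketch: loading the same GGUF from Python with llama-cpp-python
# (pip install llama-cpp-python). This wraps llama.cpp rather than PyTorch;
# the model path and settings are assumptions for illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q5_K_M.gguf",
    n_gpu_layers=-1,  # -1 offloads all layers; use fewer on 6 GB of VRAM
    n_ctx=2048,       # context window size
)

out = llm(
    '[INST] I liked "Breaking Bad" and "Band of Brothers". '
    "Do you have any recommendations of other shows I might like? [/INST]",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```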
