LocalLLaMA · submitted 09 Nov 2023
Hello everyone, I'm currently trying to set up a small Llama 2 7B chat model. The unquantized full-precision version runs, but only very slowly, in PyTorch with CUDA. I have an RTX 3060 laptop with 16 GB of RAM, and the model takes about 5-8 minutes to reply to the example prompt below:

I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?
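For reference, an unquantized setup like the one described looks roughly like this minimal transformers sketch (the model ID, dtype, and generation settings here are placeholders, not confirmed details):

```python
# Minimal sketch of an unquantized Llama 2 7B chat setup in PyTorch via
# transformers. Model ID and generation settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# fp16 weights are ~13 GB; a laptop RTX 3060 has 6 GB of VRAM, so
# device_map="auto" spills most layers to system RAM, which is slow.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = ('I liked "Breaking Bad" and "Band of Brothers". '
          "Do you have any recommendations of other shows I might like?")
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```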

Using kobold.cpp with llama-2-7b-chat.Q5_K_M.gguf, by contrast, the same prompt takes literally seconds. But I've found no way to load those quantized models in PyTorch on Windows, where AutoGPTQ doesn't work. Also: is PyTorch just a lot slower than kobold.cpp?
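To frame the question: one common way to drive the same GGUF file from Python is llama-cpp-python, which wraps llama.cpp, the backend kobold.cpp builds on. It isn't PyTorch, but it is scriptable; a minimal sketch, assuming the GGUF file sits in the working directory (path and settings below are placeholders):

```python
# Hedged sketch: loading the same GGUF from Python with llama-cpp-python
# (pip install llama-cpp-python). This wraps llama.cpp rather than PyTorch;
# the model path and settings are assumptions for illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q5_K_M.gguf",
    n_gpu_layers=-1,  # -1 offloads all layers; use fewer on 6 GB of VRAM
    n_ctx=2048,       # context window size
)

out = llm(
    '[INST] I liked "Breaking Bad" and "Band of Brothers". '
    "Do you have any recommendations of other shows I might like? [/INST]",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```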
