After the document/PDF is already indexed, generating a 256-token answer should take a few seconds (assuming you're using a 7-13B model).
Check that CUDA is actually being used (look at your video card's RAM usage to see whether the model is loaded into VRAM).
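If you're loading the model through llama-cpp-python, the quickest sanity check is to offload the layers explicitly and watch nvidia-smi while it loads. A minimal sketch (the GGUF path is a placeholder, and this assumes a CUDA-enabled llama-cpp-python build):

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads every layer to the GPU when the wheel was
# built with CUDA; with a CPU-only build the model stays in system RAM.
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    n_ctx=4096,
    verbose=True,  # the startup log reports how many layers were offloaded
)

# A 256-token completion; with a 7B model fully in VRAM this should finish in seconds.
out = llm("Q: Summarize the indexed document. A:", max_tokens=256)
print(out["choices"][0]["text"])
```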
I know that CUDA is being used, VRAM is full, and I get that message at startup. What is your hardware setup?
Do you also use llama_index together with langchain, or did you build it more or less from llama_cpp and langchain without llama_index?
I'm using langchain with qdrant as the vector store.
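Something along these lines (a rough sketch, assuming the classic langchain imports and an in-memory Qdrant collection; the PDF path and model names are placeholders):

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant

# Load and chunk the PDF before indexing.
docs = PyPDFLoader("report.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Small sentence-transformers model for the embeddings.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# In-memory Qdrant collection; point it at a running server via url= if you have one.
vectorstore = Qdrant.from_documents(
    chunks, embeddings, location=":memory:", collection_name="pdf_docs"
)

# Retrieve the top chunks and hand them to the LLM as context.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
print(retriever.get_relevant_documents("What does the document say about X?"))
```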
VRAM is full
How is a 7B model maxing out your VRAM? A 7B model at 4-bit and 4k context should not fill the 12 GB of VRAM on a 3060.
It's a 3060 laptop GPU, so only 6 GB, and the model plus the embedding model etc. is at around 5.8 GB.
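That roughly adds up. A back-of-envelope estimate (assuming a LLaMA-style 7B fully offloaded, ~4.5 bits per weight for a Q4 quant, an fp16 KV cache at 4k context, and a small embedding model; all numbers are approximations):

```python
# Rough VRAM estimate for a 7B llama-style model, 4-bit quant, full GPU offload.
params = 7e9
weights_gb = params * 4.5 / 8 / 1e9                  # ~3.9 GB of quantized weights

layers, hidden, ctx = 32, 4096, 4096                 # LLaMA-7B shape, 4k context
kv_cache_gb = layers * 2 * ctx * hidden * 2 / 1e9    # K and V in fp16 -> ~2.1 GB

embedding_model_gb = 0.1                             # e.g. a small sentence-transformers model

total = weights_gb + kv_cache_gb + embedding_model_gb
print(f"{total:.1f} GB")                             # ~6.2 GB, before CUDA/runtime overhead
```

That's in the same ballpark as the ~5.8 GB reported, which is why it's tight on a 6 GB laptop card but fits with plenty of headroom in the 12 GB of a desktop 3060.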