Once the document/PDF is indexed, generating a 256-token answer should take a few seconds (assuming you're using a 7–13B model).
Check that CUDA is actually being used (look at your video card's RAM usage to confirm the model is loaded into VRAM).
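If you'd rather check that from a script than eyeball the task manager, here's a minimal sketch (assumes an NVIDIA GPU with nvidia-smi on the PATH; the helper name is made up):

```python
# Quick check that the model actually landed in VRAM.
import subprocess

def vram_usage_mib() -> tuple[int, int]:
    """Return (used, total) GPU memory in MiB as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    used, total = out.strip().splitlines()[0].split(", ")
    return int(used), int(total)

used, total = vram_usage_mib()
print(f"VRAM: {used} / {total} MiB used")
# If this barely moves after the model loads, llama.cpp was probably built
# without CUDA support (or n_gpu_layers is 0) and inference is running on CPU.
```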
I know CUDA is being used: VRAM is full and I get the message at startup. What is your hardware setup?
Do you also use llama_index together with langchain, or did you build it more or less directly from llama_cpp and langchain, without llama_index?
I'm using langchain with qdrant as the vector store.
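For anyone following along, a rough sketch of what a langchain + Qdrant + llama.cpp pipeline like this can look like (the PDF path, GGUF path, embedding model and collection name are placeholders, not the actual config):

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA

# Load and chunk the PDF.
docs = PyPDFLoader("report.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Index the chunks in Qdrant (in-memory here; point `url` at a running instance instead).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Qdrant.from_documents(
    chunks,
    embeddings,
    location=":memory:",
    collection_name="pdf_chunks",
)

# Local GGUF model via llama-cpp-python, with layers offloaded to the GPU.
llm = LlamaCpp(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # -1 (or a large number) to offload all layers
    n_ctx=4096,
    max_tokens=256,
)

qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
print(qa.run("What is the document about?"))
```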
How is a 7B model maxing out your VRAM? A 7B model at 4-bit and 4k context should not use all 12GB of VRAM on a 3060.
It's a laptop 3060, so only 6GB, and the model plus embeddings etc. sits at around 5.8GB.
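That lines up with a back-of-envelope estimate (assuming a ~4-bit 7B GGUF, an fp16 KV cache for a Llama-style 7B, and a small sentence-transformer embedding model; all numbers are rough):

```python
# Rough VRAM budget for a 4-bit 7B model on a 6GB card (approximate figures).
weights_gb = 7e9 * 4.5 / 8 / 1e9            # Q4_K_M is ~4.5 bits/weight -> ~3.9 GB
# KV cache for a Llama-style 7B: 32 layers, 4096 hidden dim, fp16 keys + values.
n_ctx = 4096
kv_cache_gb = 2 * 32 * n_ctx * 4096 * 2 / 1e9   # ~2.1 GB at full 4k context
embedder_gb = 0.4                            # small sentence-transformer + overhead (guess)

total = weights_gb + kv_cache_gb + embedder_gb
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_cache_gb:.1f} GB, "
      f"embedder ~{embedder_gb:.1f} GB, total ~{total:.1f} GB")
# That lands in the ~6 GB ballpark, consistent with a 6GB laptop 3060 sitting
# near 5.8GB (in practice some layers or cache may stay in system RAM to fit).
```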