Using a 5800H and an RTX 3060 laptop, I built a RAG pipeline to do basically PDF chat with a local Llama 7B 4-bit quantized model in llama_index, using llama.cpp as the backend. I use an embedding model and a vector store through PostgreSQL, all under WSL.
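
For context, here is a minimal sketch of this kind of setup. The model path, DB credentials, table name, documents folder and embedding model are placeholders rather than my exact config, and the API names assume llama_index ~0.9:

```python
# Sketch of a PDF-chat RAG pipeline: llama.cpp LLM + HF embeddings + pgvector store.
from llama_index import (
    VectorStoreIndex, ServiceContext, StorageContext, SimpleDirectoryReader,
)
from llama_index.llms import LlamaCPP
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.vector_stores import PGVectorStore

llm = LlamaCPP(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # placeholder 4-bit GGUF
    context_window=4096,
    max_new_tokens=256,
    model_kwargs={"n_gpu_layers": -1},  # offload all layers to the GPU
)

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

vector_store = PGVectorStore.from_params(
    database="ragdb", host="localhost", port="5432",
    user="postgres", password="postgres",
    table_name="pdf_chunks", embed_dim=384,  # must match the embedding model
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./pdfs").load_data()
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context,
)
print(index.as_query_engine().query("What is this document about?"))
```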

With a 4k context and a 256-token output length, generating an answer takes about 2-6 minutes, which seems relatively long. I wanted to know if that is expected, or if I need to go hunting for what makes my code inefficient.

Also, what kind of speedup would other GPUs bring?

I'd be very happy to get some thoughts on the matter :)

[–] harrro@alien.top 1 points 10 months ago (3 children)

After the document/PDF is already indexed, generating a 256 token answer should take a few seconds (assuming you're using a 7-13B model).

Check that CUDA is being used (check your video card's RAM usage to see if the model is loaded into VRAM).
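
For example, with llama-cpp-python you can force full GPU offload and then watch nvidia-smi; the model path below is just a placeholder:

```python
# Quick check that layers are actually offloaded to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers; the startup log should report layers on the GPU
    verbose=True,
)
# While a prompt is being processed, `nvidia-smi` should show several GB of VRAM in use
# and non-zero GPU utilization; if it doesn't, generation is running on the CPU.
```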

[–] Noxusequal@alien.top 1 points 10 months ago (2 children)

I know that CUDA is used; VRAM is full and I get the message at startup. What is your hardware setup?

Do you also use llama_index together with LangChain, or did you build it more or less directly from llama_cpp and LangChain without llama_index?

[–] harrro@alien.top 1 points 10 months ago (1 children)

I'm using langchain with qdrant as the vector store.
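
Roughly this kind of setup, sketched with LangChain's Qdrant wrapper (the embedding model, collection name and documents below are placeholders):

```python
# Minimal LangChain + Qdrant retrieval sketch.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant
from langchain.schema import Document

docs = [Document(page_content="example chunk from a PDF")]
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = Qdrant.from_documents(
    docs, embeddings,
    location=":memory:",        # or url="http://localhost:6333" for a running Qdrant server
    collection_name="pdf_chat",
)
print(vectorstore.similarity_search("example query", k=3))
```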

"VRAM is full"

How is a 7B model maxing out your VRAM? A 7B model at 4-bit with 4k context should not use the 12 GB of VRAM on a 3060.

[–] Noxusequal@alien.top 1 points 10 months ago

It's a 3060 laptop GPU, so only 6 GB, and the model plus embeddings etc. sits at about 5.8 GB.
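
That roughly matches a back-of-the-envelope estimate (assuming Llama-2-7B: 32 layers, hidden size 4096, fp16 KV cache):

```python
# Rough VRAM estimate for a 4-bit 7B model at 4k context (approximate, ignores runtime overhead).
params = 7e9
weights_gb = params * 0.5 / 1e9                 # 4-bit weights ~= 3.5 GB
kv_cache_gb = 2 * 32 * 4096 * 4096 * 2 / 1e9    # (K+V) * layers * ctx * hidden * 2 bytes ~= 2.1 GB
print(weights_gb + kv_cache_gb)                 # ~5.6 GB, close to a 6 GB card's limit
```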