Noxusequal

joined 10 months ago
[–] Noxusequal@alien.top 1 points 9 months ago

Okay, thank you guys. So this only really makes sense if I want to run different models on the different GPUs, or if I have something so big that I need the 48GB of VRAM and can deal with the slower speeds :) Thanks for the feedback.

[–] Noxusequal@alien.top 1 points 9 months ago (1 children)

The question is: is that still faster than system memory or not?

 

As the title says, when combining a P40 and an RTX 3090, a few use cases come to mind, and I wanted to know if they can be done? I'd greatly appreciate your help:

First, could you run larger models where the computation happens on the 3090 and the P40 is just used for VRAM offloading, and would that be faster than system memory?

Second, could you compute on both of them in an asymmetric fashion, e.g. putting most layers on the RTX 3090 and fewer on the P40? (See the sketch below.)

Lastly, and this one probably works: you could run two different instances of LLMs, for example a bigger one on the 3090 and a smaller one on the P40, I assume.
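For the asymmetric split in particular, llama.cpp can already weight layers across GPUs. Below is a minimal sketch using llama-cpp-python, assuming both cards are visible to CUDA; the model path and the split ratio are placeholders, not tuned values.

```
from llama_cpp import Llama

# Sketch only: split the layers unevenly so device 0 (assumed to be the 3090)
# carries most of the model and the P40 carries the rest.
llm = Llama(
    model_path="some-large-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,            # offload every layer to the GPUs
    tensor_split=[0.75, 0.25],  # roughly 3/4 of the layers on GPU 0, the rest on GPU 1
)
```

The equivalent flag on the llama.cpp command line is --tensor-split.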

 

Hello, I am currently trying to set up a RAG pipeline, and I noticed that as my prompts get longer and filled with context, the tokens per second decrease drastically, from 10-20ish down to 2 or less. I am using llama.cpp, currently running a Q4 Llama 2 7B model on my 3060 laptop with 6GB of VRAM (using CUDA).

I don't understand why this is happening, and it makes the responses painfully slow. Of course I expect the time needed to process a longer prompt to increase, but why does the time needed per token increase?

I would love to hear whether this is normal, and if not, what I might do about it.

Here is an exact example:

llm.complete("tell me what is a cat using the following context,  here is a bit of context about cats: Female domestic cats can have kittens from spring to late autumn in temperate zones and throughout the year in equatorial regions, with litter sizes often ranging from two to five kittens. Domestic cats are bred and shown at events as registered pedigreed cats, a hobby known as cat fancy. Animal population control of cats may be achieved by spaying and neutering, but their proliferation and the abandonment of pets has resulted in large numbers of feral cats worldwide, contributing to the extinction of bird, mammal and reptile species.  ")

llama_print_timings: prompt eval time = 80447.41 ms / 153 tokens ( 525.80 ms per token, 1.90 tokens per second)

compared to

llm.complete("tell me what is a cat")

llama_print_timings: prompt eval time = 319.16 ms / 4 tokens ( 79.79 ms per token, 12.53 tokens per second)
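Not from the original post, just a hedged sketch of the llama-cpp-python settings that usually govern prompt (context) processing speed; the values are placeholders, not a verified fix for this laptop.

```
from llama_cpp import Llama

# Sketch only: prompt tokens are evaluated in batches of n_batch, and the KV cache
# is sized by n_ctx. With VRAM nearly full, a long prompt can push these buffers
# off the GPU, which would slow prompt eval sharply.
llm = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,        # keep the context window no larger than actually needed
    n_batch=512,       # batch size used while evaluating the prompt
    n_gpu_layers=24,   # lower this a little if VRAM is right at the 6GB limit
    verbose=True,      # prints the llama_print_timings lines quoted above
)
```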

 


[–] Noxusequal@alien.top 1 points 10 months ago

It's a 3060 laptop, so only 6GB, and the model plus embeddings etc. is at like 5.8GB.

[–] Noxusequal@alien.top 1 points 10 months ago (2 children)

I know that CUDA is used; VRAM is full and I get the message at the beginning. What is your hardware setup?

Do you also use llama_index and then LangChain, or did you build it more or less from llama_cpp and LangChain without llama_index?

 

Using a 5800H and an RTX 3060 laptop, I constructed a RAG pipeline to do basically PDF chat with a local Llama 7B 4-bit quantized model in llama_index, using llama.cpp as the backend. I use an embedding model and a vector store through PostgreSQL, under WSL.

With a context of 4k and a 256-token output length, generating an answer takes about 2-6 minutes, which seems relatively long. I wanted to know if that is expected or if I need to go hunting for what makes my code inefficient.

Also, what kind of speed-up would other GPUs bring?

I'd be very happy to get some thoughts on the matter :)
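To narrow down where the minutes go, it can help to time retrieval (embedding plus pgvector lookup) separately from the full query. A rough sketch; `retriever` and `query_engine` stand in for whatever objects the llama_index pipeline already builds, so the names are placeholders.

```
import time

question = "What does the PDF say about X?"  # placeholder question

t0 = time.perf_counter()
nodes = retriever.retrieve(question)      # embedding + vector-store lookup only
t1 = time.perf_counter()
response = query_engine.query(question)   # retrieval plus LLM generation
t2 = time.perf_counter()

print(f"retrieval alone: {t1 - t0:.1f} s")
print(f"full query (retrieval + generation): {t2 - t1:.1f} s")
```

If almost all of the time sits in the second measurement, the LLM generation step (context length times tokens generated) is the bottleneck rather than the vector store.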

[–] Noxusequal@alien.top 1 points 10 months ago

Okay, it's working now. I needed to install nvcc separately and change the CUDA_HOME environment variable. Also, to install nvcc I needed to get the symlinks working manually, but with 15 minutes of Google searching I got it to work :D Thank you all :)

[–] Noxusequal@alien.top 1 points 10 months ago

Yup, that was part of it :) It's working now, thank you.

[–] Noxusequal@alien.top 1 points 10 months ago

I'm looking to do something similar. Using RAG pipelines might be useful, as far as I understand, to give the model extra context about the pages you want to summarize.

https://agi-sphere.com/retrieval-augmented-generation-llama2/

Maybe you already know all this, but I am also new and just recently stumbled upon this :)

[–] Noxusequal@alien.top 1 points 10 months ago

I did this :) I should have specified: when reinstalling, I set both flags as environment variables again.

 

I have tried everything; at this point I think I am doing something wrong, or I have discovered some very strange bug. I was thinking of posting on their GitHub, but I am not sure whether I am simply making a very stupid error.

In a fresh conda environment set up with Python 3.12, I used

```
export LLAMA_CUBLAS=1
```

and then I copied this:

```
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
```

It runs without complaint, creating a working llama-cpp-python install, but without CUDA support. I know that I have CUDA working in WSL because nvidia-smi shows CUDA version 12.

I have set up multiple environments, I have tried removing and reinstalling, and I have also tried building with backends other than CUDA, which don't work either, so something seems to be off with the backend part, but I don't know what. My best guess is that I am doing something very basic wrong, like not setting the environment variable correctly or something.

When I reinstalled, I used this option:

```
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
```

Also, it simply does not create the llama_cpp_cuda folder, so "llama-cpp-python not using NVIDIA GPU CUDA" from Stack Overflow does not seem to be the problem.

Hardware:

Ryzen 5800H

RTX 3060

16GB of DDR4 RAM

WSL2 Ubuntu

To test it, I run the following code and look at the GPU memory usage, which stays at about 0:

```
from llama_cpp import Llama

llm = Llama(
    model_path="/mnt/d/Maschine learning/llm models/llama_2_7b/llama27bchat.Q4_K_M.gguf",
    n_gpu_layers=20,
    n_threads=6,
    n_ctx=3584,
    n_batch=521,
    verbose=True,
)

output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
```

So any help or idea about what could be going on here would be great, because I am out of ideas. Thank you very much :)
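Not an official llama-cpp-python API, just an assumption-laden sanity check: inspect the compiled shared library that pip built inside the package and see whether it links against the CUDA libraries at all (run this inside the same WSL environment).

```
import pathlib
import subprocess

import llama_cpp

# Sketch only: find the compiled shared library shipped inside the llama_cpp package
# and ask ldd whether it links against libcuda/libcublas. If it does not, the build
# fell back to CPU-only despite the CMAKE_ARGS flag.
pkg_dir = pathlib.Path(llama_cpp.__file__).parent
for lib in pkg_dir.rglob("*.so"):
    deps = subprocess.run(["ldd", str(lib)], capture_output=True, text=True).stdout
    linked = "cublas" in deps or "libcuda" in deps
    print(f"{lib.name}: {'CUDA libraries linked' if linked else 'no CUDA libraries found'}")
```

If it reports no CUDA libraries, the problem is the build itself rather than the runtime settings in the Python code.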

 

Hello everyone, I am currently trying to set up a small 7B Llama 2 chat model. The unquantized full version runs, but only very slowly, in PyTorch with CUDA. I have an RTX 3060 laptop with 16GB of RAM. The model takes about 5-8 minutes to reply to the given example prompt:

I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?

Using kobold.cpp running llama-2-7b-chat.Q5_K_M.gguf, it takes literally seconds. But I found no way to load those quantized models in PyTorch under Windows, where AutoGPTQ doesn't work. Also, is PyTorch just a lot slower than kobold.cpp?
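A hedged sketch of one alternative (untested on this exact setup): load the same GGUF file directly from Python with llama-cpp-python, which wraps the same llama.cpp backend that kobold.cpp builds on, so speeds should be in the same ballpark. The path and layer count below are placeholders.

```
from llama_cpp import Llama

# Sketch only: load the already-downloaded quantized chat model from Python.
llm = Llama(
    model_path="llama-2-7b-chat.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=32,   # offload as many layers as the laptop GPU's VRAM allows
    n_ctx=2048,
)

reply = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": 'I liked "Breaking Bad" and "Band of Brothers". '
                       "Do you have any recommendations of other shows I might like?",
        },
    ],
    max_tokens=256,
)
print(reply["choices"][0]["message"]["content"])
```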