this post was submitted on 29 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.


I’m using an A100 PCIe 80GB, CUDA 11.8 toolkit, driver 525.x.

But when I run inference on CodeLlama 13B with oobabooga (web UI),

it only makes about 5 tokens/s.

That is so slow.

Is there any config or something else I need for the A100?

top 8 comments
[–] opi098514@alien.top 1 points 9 months ago

Sounds like you might be using the standard Transformers loader. Try ExLlama or ExLlamaV2.
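For reference, here is a minimal sketch of what the ExLlamaV2 path looks like outside the web UI, loosely following the exllamav2 example scripts from around that time; the model directory is hypothetical and the exact module/method names are from memory, so treat it as a rough outline rather than the definitive API:

```python
# Rough sketch only: names follow the exllamav2 examples circa late 2023 and may
# differ in newer releases. The model directory below is hypothetical.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/CodeLlama-13B-exl2"   # hypothetical EXL2 quant directory
config.prepare()

model = ExLlamaV2(config)
model.load()                                      # load weights onto the GPU

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

print(generator.generate_simple("def quicksort(arr):", settings, 200))
```

In the web UI itself you don't need any code; you just pick the ExLlamav2 (or ExLlamav2_HF) loader in the Model tab.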

[–] uti24@alien.top 1 points 9 months ago

Sounds like you're running it on the CPU. If you're using oobabooga, you have to explicitly set how many layers to offload to the GPU; by default everything runs on the CPU (at least for GGUF models).
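To illustrate the setting being described, here is a minimal llama-cpp-python sketch (one of the GGUF backends the web UI wraps); the model path is hypothetical, and in the web UI the equivalent is the n-gpu-layers option on the llama.cpp loader:

```python
# Minimal sketch, assuming llama-cpp-python was built with CUDA support.
# n_gpu_layers defaults to 0, i.e. everything runs on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-13b.Q4_K_M.gguf",  # hypothetical GGUF file
    n_gpu_layers=-1,                         # offload all layers to the GPU
    n_ctx=4096,
)

out = llm("def quicksort(arr):", max_tokens=128)
print(out["choices"][0]["text"])
```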

[–] hudimudi@alien.top 1 points 9 months ago

Uhmmm, where did you buy that A100? Was it a good deal? lol. Just kidding, you probably set something up wrong or the drivers are messing up. Is the card working fine otherwise in benchmarks?

[–] a_beautiful_rhind@alien.top 1 points 9 months ago

Something is wrong with your environment. Even P40s give more than that.

The other option is that you aren't generating enough tokens to get a proper t/s reading. What was the total inference time?
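As a rough way to check that, you can time a longer generation yourself; a minimal sketch assuming a Hugging Face transformers model and tokenizer are already loaded as `model` and `tokenizer`:

```python
# Short generations give noisy tokens/s numbers; measure a few hundred new tokens.
import time
import torch

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```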

[–] SativaSawdust@alien.top 1 points 9 months ago

Have you tried: `import torch; print(torch.cuda.is_available())`
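A slightly fuller version of that sanity check, assuming a CUDA build of PyTorch:

```python
import torch

print(torch.cuda.is_available())       # should be True
print(torch.version.cuda)              # should match the installed toolkit, e.g. 11.8
print(torch.cuda.get_device_name(0))   # should report the A100
```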

[–] easyllaama@alien.top 1 points 9 months ago

Try using GGUF; this format likes a single GPU, especially since you have 80GB of VRAM. I think you can even run a 70B GGUF with all layers on the GPU.
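Back-of-the-envelope check (assuming roughly 4.5 bits per weight for a Q4_K_M-style quant; the exact figure depends on the quant and the KV cache):

```python
params = 70e9
bits_per_weight = 4.5                            # rough Q4_K_M-ish average
weights_gib = params * bits_per_weight / 8 / 1024**3
print(f"~{weights_gib:.0f} GiB of weights")      # ~37 GiB, comfortably under 80 GB
```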

[–] nuvalab@alien.top 1 points 9 months ago

That sounds like CPU speed. What do you see from `watch -d -n 0.1 nvidia-smi` while you're running inference?
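If you'd rather poll from Python, here is a small sketch using the pynvml bindings (assuming the nvidia-ml-py / pynvml package is installed); during GPU inference you'd expect high utilization and the model weights resident in VRAM, while near-zero utilization points at a CPU fallback:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # percent utilization
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # bytes used/total
print(f"GPU util: {util.gpu}% | VRAM used: {mem.used / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```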

[–] henk717@alien.top 1 points 9 months ago

Tried a 13B model with Koboldcpp on one of the RunPod A100s; its Q4 and FP16 speeds both clocked in at around 20 T/s at 4K context, topping out at 60 T/s for smaller generations.