this post was submitted on 29 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.


I’m using an A100 PCIe 80GB, CUDA 11.8 toolkit, driver 525.x.

But when I run inference on CodeLlama 13B with oobabooga (web UI),

it only makes about 5 tokens/s.

That is so slow.

Is there any config or something else I need for the A100?

top 8 comments
[–] opi098514@alien.top 1 points 9 months ago

Sounds like you might be using the standard Transformers loader. Try ExLlama or ExLlamaV2.
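For reference, here is a minimal sketch of what the ExLlamaV2 path looks like outside the web UI, loosely following the exllamav2 example scripts from around that time; the model directory is hypothetical and the exact module/method names are from memory, so treat it as a rough outline rather than the definitive API:

```python
# Rough sketch only: names follow the exllamav2 examples circa late 2023 and may
# differ in newer releases. The model directory below is hypothetical.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/CodeLlama-13B-exl2"   # hypothetical EXL2 quant directory
config.prepare()

model = ExLlamaV2(config)
model.load()                                      # load weights onto the GPU

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

print(generator.generate_simple("def quicksort(arr):", settings, 200))
```

In the web UI itself you don't need any code; you just pick the ExLlamav2 (or ExLlamav2_HF) loader in the Model tab.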

[–] uti24@alien.top 1 points 9 months ago

Sounds like you're running it on the CPU. If you're using oobabooga, you have to explicitly set how many layers to offload to the GPU; by default everything runs on the CPU (at least for GGUF models).
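To illustrate the setting being described, here is a minimal llama-cpp-python sketch (one of the GGUF backends the web UI wraps); the model path is hypothetical, and in the web UI the equivalent is the n-gpu-layers option on the llama.cpp loader:

```python
# Minimal sketch, assuming llama-cpp-python was built with CUDA support.
# n_gpu_layers defaults to 0, i.e. everything runs on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-13b.Q4_K_M.gguf",  # hypothetical GGUF file
    n_gpu_layers=-1,                         # offload all layers to the GPU
    n_ctx=4096,
)

out = llm("def quicksort(arr):", max_tokens=128)
print(out["choices"][0]["text"])
```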

[–] hudimudi@alien.top 1 points 9 months ago

Uhmmm, where did you buy that A100? Was it a good deal? lol. Just kidding, you probably set something up wrong or the drivers are messing up. Is the card working fine otherwise in benchmarks?

[–] a_beautiful_rhind@alien.top 1 points 9 months ago

Something is wrong with your environment. Even P40s give more than that.

The other option is that you aren't generating enough tokens to get a proper t/s reading. What was the total inference time?
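As a rough way to check that, you can time a longer generation yourself; a minimal sketch assuming a Hugging Face transformers model and tokenizer are already loaded as `model` and `tokenizer`:

```python
# Short generations give noisy tokens/s numbers; measure a few hundred new tokens.
import time
import torch

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```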

[–] SativaSawdust@alien.top 1 points 9 months ago

Have you tried: `import torch; print(torch.cuda.is_available())`
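A slightly fuller version of that sanity check, assuming a CUDA build of PyTorch:

```python
import torch

print(torch.cuda.is_available())       # should be True
print(torch.version.cuda)              # should match the installed toolkit, e.g. 11.8
print(torch.cuda.get_device_name(0))   # should report the A100
```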

[–] easyllaama@alien.top 1 points 9 months ago

Try using GGUF; this format likes a single GPU, especially since you have 80GB of VRAM. I think you can even run a 70B GGUF with all layers on the GPU.
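Back-of-the-envelope check (assuming roughly 4.5 bits per weight for a Q4_K_M-style quant; the exact figure depends on the quant and the KV cache):

```python
params = 70e9
bits_per_weight = 4.5                            # rough Q4_K_M-ish average
weights_gib = params * bits_per_weight / 8 / 1024**3
print(f"~{weights_gib:.0f} GiB of weights")      # ~37 GiB, comfortably under 80 GB
```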

[–] nuvalab@alien.top 1 points 9 months ago

That sounds like CPU speed. What do you see from `watch -d -n 0.1 nvidia-smi` while you're running inference?
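If you'd rather poll from Python, here is a small sketch using the pynvml bindings (assuming the nvidia-ml-py / pynvml package is installed); during GPU inference you'd expect high utilization and the model weights resident in VRAM, while near-zero utilization points at a CPU fallback:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # percent utilization
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # bytes used/total
print(f"GPU util: {util.gpu}% | VRAM used: {mem.used / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```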

[–] henk717@alien.top 1 points 9 months ago

Tried a 13B model with Koboldcpp on one of the RunPod A100s; its Q4 and FP16 speeds both clocked in at around 20 T/s at 4K context, topping out at 60 T/s for smaller generations.