LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

Why is a single A100 so slow? (alien.top)

submitted 2 years ago by Radiant-Practice-270@alien.top to c/localllama@poweruser.forum

I'm using an A100 PCIe 80GB with the CUDA 11.8 toolkit and the 525.x driver.

But when I run inference on CodeLlama 13B with oobabooga (web UI), it only produces 5 tokens/s.

That is very slow.

Is there any config or other setting needed for the A100?

uti24@alien.top 1 point 2 years ago

Sounds like you're running it on the CPU. If you're using oobabooga, you have to explicitly set how many layers to offload to the GPU; by default everything runs on the CPU (at least for GGUF models).
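For anyone loading a GGUF model from Python instead of the web UI, here is a rough sketch of the same idea using llama-cpp-python, where the `n_gpu_layers` argument controls the offload. This assumes llama-cpp-python was installed with CUDA support; the helper names and the model path are made up for illustration, and the web UI's llama.cpp loader has its own n-gpu-layers setting that plays the same role.

```python
def gpu_offload_kwargs(model_path: str, n_gpu_layers: int = -1) -> dict:
    """Build keyword arguments for llama_cpp.Llama with GPU offload enabled.

    n_gpu_layers=0 keeps every layer on the CPU (the library default);
    n_gpu_layers=-1 asks llama.cpp to offload all layers to the GPU.
    """
    if n_gpu_layers < -1:
        raise ValueError("n_gpu_layers must be -1 (all layers) or >= 0")
    return {"model_path": model_path, "n_gpu_layers": n_gpu_layers}


def load_model(model_path: str, n_gpu_layers: int = -1):
    # Imported lazily so the helper above works even without the library.
    # Assumption: llama-cpp-python is installed and built with CUDA enabled;
    # a CPU-only build will accept the argument but still run on the CPU.
    from llama_cpp import Llama

    return Llama(**gpu_offload_kwargs(model_path, n_gpu_layers))


if __name__ == "__main__":
    # Hypothetical file name; point this at your own GGUF file.
    print(gpu_offload_kwargs("codellama-13b.Q4_K_M.gguf"))
```

With 80 GB of VRAM a 13B model should fit entirely, so offloading all layers is the natural starting point; lower the number only if you run out of memory.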