this post was submitted on 29 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I’m using an A100 PCIe 80GB, with the CUDA 11.8 toolkit and driver 525.x.

But when I run inference on CodeLlama 13B with oobabooga (the web UI),

it only gets about 5 tokens/s.

That is very slow.

Is there any config or other setting I need for the A100?

easyllaama@alien.top · 1 point · 11 months ago

Try GGUF; that format does well on a single GPU, especially since you have 80GB of VRAM. I think you could even run a 70B GGUF with all layers on the GPU.
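
For reference, a minimal sketch of what that looks like with llama-cpp-python (the library behind oobabooga's llama.cpp loader). The model path, context size, and prompt here are placeholders, not taken from the thread:

```python
from llama_cpp import Llama

# Load a GGUF model with every layer offloaded to the GPU.
# n_gpu_layers=-1 means "offload all layers"; the path is a placeholder.
llm = Llama(
    model_path="./models/codellama-13b.Q5_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,  # context window; adjust as needed
)

output = llm("def fibonacci(n):", max_tokens=128)
print(output["choices"][0]["text"])
```

In the web UI itself, the equivalent should be selecting the llama.cpp loader and setting n-gpu-layers high enough to cover every layer of the model.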