I have a 4090 at work, and quantized 34B models barely fit in its 24GB of VRAM; I get around 20 tokens per second of output. My personal machine has a laptop 3080 Ti with 16GB of VRAM. That one can't handle more than 13B models, but I still get about 20 tokens per second from it. These numbers are with quantizations optimized for speed, though, so depending on which model you're running it could be slower.
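Rough back-of-envelope for what fits, in case it helps anyone: weight memory is roughly parameter count times bits-per-weight divided by 8, plus a couple of GB for KV cache and overhead. The function below is just my own sketch (the 2 GB overhead figure is a guess and varies with context length), but the numbers line up with the cards above:

```python
# Rough VRAM estimate for a quantized model: weights plus a ballpark overhead.
# The 2 GB overhead is a guess for KV cache/activations at modest context sizes.

def est_vram_gb(params_billions: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billions * bits_per_weight / 8  # e.g. 34B at 4.65 bpw ~ 19.8 GB
    return weights_gb + overhead_gb

print(est_vram_gb(34, 4.65))  # ~21.8 GB -> tight but workable on a 24 GB 4090
print(est_vram_gb(13, 5.0))   # ~10.1 GB -> comfortable on a 16 GB laptop 3080 Ti
```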
With a 3090 and sufficient system RAM, you can run 70b models but they'll be slow. About 1.5 tokens/second. Plus quite a bit of time for prompt ingestion. It's doable but not fun.
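If you want to measure your own tokens/second with partial offload, something like this works. It's a sketch assuming llama-cpp-python and a GGUF file; the path and n_gpu_layers value are placeholders to tune for your VRAM, and whatever doesn't fit stays in system RAM:

```python
# Quick tokens/sec check with partial GPU offload via llama-cpp-python.
# "model.gguf" and n_gpu_layers=40 are placeholders -- raise or lower the
# layer count until the model fits in VRAM; the rest runs from system RAM.
import time
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=40, n_ctx=4096)

start = time.perf_counter()
out = llm("Explain the difference between RAM and VRAM.", max_tokens=256)
elapsed = time.perf_counter() - start  # includes prompt ingestion time

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```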
one is not enough
For a 34b model you should be fine. I run 34b models on my dual 3060s and it’s very nice. Usually like 20ish tokens a second. If you want to run a 7b model you can get basically instant results. With Mistral 7B I’m getting almost 60 tokens a second. It’s crazy. But it really depends on what you are using it for and how much accuracy you need.
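For anyone wondering how a two-card setup is wired up, this is roughly what it looks like with llama-cpp-python. The filename and the even 50/50 tensor_split are just example values, not a recommendation:

```python
# Splitting one model across two 12 GB 3060s with llama-cpp-python.
# tensor_split sets the proportion of the model placed on each GPU;
# [0.5, 0.5] is just an even-split example, and the path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="34b-q4.gguf",   # placeholder path
    n_gpu_layers=-1,            # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],    # share the layers evenly across both cards
    n_ctx=4096,
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```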
With a dedicated 3090 (another card handles the OS), a 34B model at 5bpw just fits and runs very fast, like 10-20 t/s. The quality is good for my application, but I'm not coding.
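Quick sanity check on "just fits" (same kind of rough math as earlier in the thread, nothing exact):

```python
# 34B at 5 bpw: the weights alone take most of a 24 GB card,
# leaving only a few GB for KV cache and activations.
weights_gb = 34 * 5.0 / 8        # ~21.3 GB of weights
headroom = 24 - weights_gb       # ~2.7 GB left over
print(f"{weights_gb:.1f} GB weights, {headroom:.1f} GB headroom on a 24 GB card")
```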