this post was submitted on 12 Nov 2023

LocalLLaMA


Hey everyone,

Looking to get into some ML.

Can an RTX 4070 Ti with 12 GB VRAM alongside 32 GB RAM run 13B comfortably?

I seem to read conflicting opinions on this.

Thank you!

[–] donotdrugs@alien.top 1 points 10 months ago

I seem to read conflicting opinions on this.

This is probably because most people don't run the original version of Llama 13B and instead use quantized versions. The original 16-bit model needs roughly 26 GB for the weights alone, far more than 12 GB of VRAM, but the 4-bit quantized versions of Llama 13B fit in less than 10 GB of VRAM.
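To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a nominal 13 billion parameters and counts the weights only, so the KV cache, activations, and quantization metadata are all ignored and real usage will be somewhat higher. The function name is made up for illustration.

```python
# Rough weight-only memory estimate. Ignores KV cache, activations and
# per-block quantization metadata, so real usage is somewhat higher.

def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 13e9  # nominal parameter count of a "13B" model (assumption)

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_size_gb(N_PARAMS, bits):.1f} GB")
```

By this estimate the 16-bit weights come to about 26 GB, 8-bit to about 13 GB, and 4-bit to about 6.5 GB, which is why only the lower-bit versions sit comfortably inside a 12 GB card.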

Quantization works by storing each parameter as a lower-precision integer. So instead of 13 billion parameters at 16-bit precision, a quantized model has 13 billion parameters at just 8 or even 4 bits each. This drastically reduces model size while retaining most of the model's quality.
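A toy illustration of the idea, not the actual scheme llama.cpp uses: the sketch below does simple symmetric "absmax" quantization of one small block of weights to 4-bit signed integers plus a single float scale. The real formats differ in detail, and the function names and sample values here are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [v * scale for v in q]

# Made-up sample weights for one block.
block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(q)
print(f"scale={scale:.4f}, max error={max_err:.4f}")
```

Each weight now costs 4 bits plus a share of the block's scale, and the round-trip error is at most half a quantization step. That small error is the price paid for the size reduction.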

You can download the quantized models from Hugging Face. User TheBloke has uploaded quantized versions of pretty much every model in existence. You can find a link for Llama 2 13B here: https://huggingface.co/TheBloke/Llama-2-13B-GGML. The model page has a table of all the available versions as well as recommendations on which one to use. Note that recent llama.cpp builds load GGUF files rather than the older GGML format, so check for a GGUF upload of the same model.

To run these models you need llama.cpp, a program for running these kinds of quantized models. It can run them on the CPU and offload some or all of the layers to the GPU.

[–] zodireddit@alien.top 1 points 10 months ago

I'm no expert in this, but I'm using a 13B Llama 2 model that I'm happy with on just a 3060, and it runs fine. It's probably not the raw model, but it's pretty good. I also have 32 GB of RAM.

[–] FeedMeSoma@alien.top 1 points 10 months ago

Yeah, works great.

[–] Arcturus17@alien.top 1 points 10 months ago

I've got a 3060 Ti 8GB and 16 GB RAM and I can run 13B GGUFs with 30 layers offloaded to GPU and get 8-12 t/s no problem. I cannot run a 20B GGUF at all though.
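For picking a starting value for the number of offloaded layers (llama.cpp's `-ngl` / `--n-gpu-layers` option), a crude estimate like the one below can help. It assumes the model file is spread evenly across the layers and sets aside a fixed VRAM reserve for context and overhead. The 7.9 GB file size and the 1.5 GB reserve are illustrative guesses, not measured values, and the function name is made up. Llama 2 13B has 40 transformer layers.

```python
def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Crude estimate of how many layers fit in VRAM.

    Assumes every layer takes an equal share of the model file and that
    reserve_gb of VRAM is kept free for context and other overhead.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable_gb // per_layer_gb))

MODEL_GB = 7.9   # assumed size of a 4-bit 13B file (illustrative)
N_LAYERS = 40    # transformer layers in Llama 2 13B

for vram in (8, 12):
    n = layers_that_fit(MODEL_GB, N_LAYERS, vram)
    print(f"{vram} GB VRAM: about {n} of {N_LAYERS} layers")
```

Treat the result as a first guess and adjust from there: if you run out of VRAM, lower the layer count; if there is headroom, raise it.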

If you want to run GPU inference only though, you'll need 16+ (more likely 20+) GB of VRAM.

[–] letchhausen@alien.top 1 points 10 months ago

Since I have a 4070 Ti, could I add another to get better performance? Or would it be better to just can the 4070 and get a 4090?